Scrapy (http://scrapy.org/) is a fast, high-level web scraping framework for crawling websites and extracting structured data from their pages. It can be used for data mining, data monitoring, automated testing, and more. It is a pure-Python crawler framework built on Twisted.
First install the framework's dependencies:

```
apt-get install python-twisted python-libxml2
apt-get install python-pyopenssl python-simplejson
```

Then install Scrapy itself:

```
wget http://scrapy.org/releases/0.9/Scrapy-0.9.tar.gz
tar zxf Scrapy-0.9.tar.gz
cd Scrapy-0.9
python setup.py install
```

For other systems and other installation methods, see http://doc.scrapy.org/intro/install.html
For example, suppose we face page data like the following:

```html
<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0" style="margin-top:15px;" class="datatable">
<tr>
<th width=106><div align="center">名称</div></th>
<th width=87><div align="center" class='red'>价格</div></th>
<th width=208><div align="center">描述</div></th>
</tr>
<tr>
<td width=106><div align='center'>产品名称</div></td>
<td width=87 class='red'><div align='left'>产品价格</div></td>
<td width=208><div align='left'>产品描述</div></td>
</tr>
<tr>
<td width=106><div align='center'>产品名称</div></td>
<td width=87 class='red'><div align='left'>产品价格</div></td>
<td width=208><div align='left'>产品描述</div></td>
</tr>
....
</table>
```
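The extraction logic we will give the spider below (one record per `<tr>`, with the first/second/third `<td>` holding name/price/desc) can be sketched with nothing but the standard library. This is only an illustration, not Scrapy itself: the HTML here is a simplified, well-formed version of the sample table, parsed with `xml.etree.ElementTree`.

```python
# Minimal sketch of the row-by-row extraction, using only the stdlib.
# The markup is a cleaned-up (well-formed) version of the sample page.
import xml.etree.ElementTree as ET

HTML = """
<table class="datatable">
  <tr><th>名称</th><th>价格</th><th>描述</th></tr>
  <tr><td><div>产品名称</div></td><td><div>产品价格</div></td><td><div>产品描述</div></td></tr>
</table>
"""

root = ET.fromstring(HTML)
items = []
for row in root.findall('tr'):
    tds = row.findall('td')
    if not tds:              # the header row has <th> cells, so skip it
        continue
    items.append({
        'name': tds[0].findtext('div'),
        'price': tds[1].findtext('div'),
        'desc': tds[2].findtext('div'),
    })
print(items)
```

The spider's XPath expressions (`//table[@class="datatable"]/tr`, then `td[1]/div/text()` and so on) perform exactly this walk over the real page.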
Now create a new Scrapy project:

```
python scrapy-ctl.py startproject nuxnu
```
Edit /nuxnu/nuxnu/items.py to read:

```python
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html
from scrapy.item import Item, Field

class NuxnuItem(Item):
    # define the fields for your item here like:
    name = Field()
    price = Field()
    desc = Field()
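An `Item` behaves like a dictionary whose keys are restricted to the declared fields; assigning a key that was not declared raises an error. A rough stdlib analogue of that behavior (the `SimpleItem` class is illustrative only, not part of Scrapy, which implements this via metaclasses):

```python
# Rough analogue of a Scrapy Item: a dict restricted to declared fields.
# SimpleItem is illustrative only, not Scrapy API.
class SimpleItem(dict):
    fields = ('name', 'price', 'desc')

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%s is not a declared field' % key)
        dict.__setitem__(self, key, value)

item = SimpleItem()
item['name'] = 'rose'
item['price'] = '10'
try:
    item['color'] = 'red'    # undeclared field -> rejected
except KeyError:
    pass
print(dict(item))  # expected: {'name': 'rose', 'price': '10'}
```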
Create the spider program nuxnu_spider.py under the /nuxnu/nuxnu/spiders/ directory:

```python
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from nuxnu.items import NuxnuItem

class NuxnuSpider(BaseSpider):
    name = "xxx.xxx.xxx"
    start_urls = [
        "http://xxx.xxx.xxx/date.html"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//table[@class="datatable"]/tr')
        items = []
        for site in sites:
            item = NuxnuItem()
            item['name'] = site.select('td[1]/div/text()').extract()
            item['price'] = site.select('td[2]/div/text()').extract()
            item['desc'] = site.select('td[3]/div/text()').extract()
            items.append(item)
        return items

SPIDER = NuxnuSpider()
```

Note: xxx.xxx.xxx is the site we want to crawl, and the table shown earlier is the content of the page http://xxx.xxx.xxx/date.html.
Next, edit /nuxnu/nuxnu/pipelines.py:

```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/topics/item-pipeline.html
from os import path
from scrapy.core import signals
from scrapy.xlib.pydispatch import dispatcher

class StorePipeline(object):
    filename = 'flower.list'

    def __init__(self):
        self.f = None
        dispatcher.connect(self.open, signals.engine_started)
        dispatcher.connect(self.close, signals.engine_stopped)

    def process_item(self, domain, item):
        self.f.write(str(item['name']) + '\n')
        return item

    def open(self):
        # append to an existing list file, otherwise start a fresh one
        if path.exists(self.filename):
            self.f = open(self.filename, 'a')
        else:
            self.f = open(self.filename, 'w')

    def close(self):
        if self.f is not None:
            self.f.close()
```
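For context, Scrapy runs each scraped item through every configured pipeline in order, feeding each `process_item` the value returned by the previous stage. A minimal sketch of that chaining, with made-up stage names (the real Scrapy signature also receives the domain, as above):

```python
# Sketch of pipeline chaining: each stage receives the previous stage's output.
# StripWhitespace and AnnotateLength are illustrative, not part of Scrapy.
class StripWhitespace(object):
    def process_item(self, item):
        return dict((k, v.strip()) for k, v in item.items())

class AnnotateLength(object):
    def process_item(self, item):
        item['name_len'] = len(item['name'])
        return item

pipelines = [StripWhitespace(), AnnotateLength()]
item = {'name': '  rose  ', 'price': '10'}
for stage in pipelines:
    item = stage.process_item(item)
print(item)  # expected: {'name': 'rose', 'price': '10', 'name_len': 4}
```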
Edit /nuxnu/nuxnu/settings.py:

```python
# Scrapy settings for nuxnu project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/topics/settings.html
#
# Or you can copy and paste them from where they're defined in Scrapy:
#
# scrapy/conf/default_settings.py
#
BOT_NAME = 'nuxnu'
BOT_VERSION = '1.0'

SPIDER_MODULES = ['nuxnu.spiders']
NEWSPIDER_MODULE = 'nuxnu.spiders'
DEFAULT_ITEM_CLASS = 'nuxnu.items.NuxnuItem'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)
ITEM_PIPELINES = ['nuxnu.pipelines.StorePipeline']
```
All that's left is to run it. Go back to the /nuxnu/ directory:

```
python scrapy-ctl.py crawl xxx.xxx.xxx
```

Done.
For the full tutorial and documentation, see:
http://doc.scrapy.org/intro/tutorial.html
http://doc.scrapy.org/index.html
Reference: 用Scrapy訂製自己的爬蟲抓取數據 (Building your own data-scraping crawler with Scrapy)