Building Your Own Crawler with Scrapy to Scrape Data

Posted on 2010-9-26 00:52

Scrapy (http://scrapy.org/) is a fast, high-level web scraping framework for crawling websites and extracting structured data from their pages. It can be used for data mining, data monitoring, automated testing, and more. It is a pure-Python crawler framework built on Twisted.

First, install the environment the framework needs:

    apt-get install python-twisted python-libxml2
    apt-get install python-pyopenssl python-simplejson

Then install Scrapy itself:

    wget http://scrapy.org/releases/0.9/Scrapy-0.9.tar.gz
    tar zxf Scrapy-0.9.tar.gz
    cd Scrapy-0.9
    python setup.py install

For other systems and other installation methods, see http://doc.scrapy.org/intro/install.html

Suppose, for example, we are facing page data like this:
    <table width="100%" border="0" align="center" cellpadding="0" cellspacing="0" style="margin-top:15px;" class="datatable">
    <tr>
            <th width=106><div align="center">Name</div></th>
            <th width=87><div align="center" class='red'>Price</div></th>
            <th width=208><div align="center">Description</div></th>
    </tr>
    <tr>
            <td width=106><div align='center'>product name</div></td>
            <td width=87 class='red'><div align='left'>product price</div></td>
            <td width=208><div align='left'>product description</div></td>
    </tr>
    <tr>
            <td width=106><div align='center'>product name</div></td>
            <td width=87 class='red'><div align='left'>product price</div></td>
            <td width=208><div align='left'>product description</div></td>
    </tr>
    ....
    </table>
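
Before wiring anything into Scrapy, the XPath expressions we will use later can be checked offline. This is a minimal sketch, not part of the original post, assuming the lxml package is available (Scrapy's selectors are likewise backed by libxml2, so the results should agree):

    from lxml import html

    # a trimmed copy of the page fragment above
    SAMPLE = """
    <table class="datatable">
      <tr><th>Name</th><th>Price</th><th>Description</th></tr>
      <tr><td><div>product name</div></td>
          <td><div>product price</div></td>
          <td><div>product description</div></td></tr>
    </table>
    """

    doc = html.fromstring(SAMPLE)
    for row in doc.xpath('//table[@class="datatable"]/tr'):
        # the header row has no <td> cells, so these lists come back empty
        print row.xpath('td[1]/div/text()')
        print row.xpath('td[2]/div/text()')
        print row.xpath('td[3]/div/text()')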

Create a new Scrapy project:

    python scrapy-ctl.py startproject nuxnu
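
For orientation, the generated project should look roughly like this (a sketch reconstructed from the file paths used in the rest of this post):

    nuxnu/
        scrapy-ctl.py        # project control script used in the commands below
        nuxnu/
            __init__.py
            items.py         # item definitions (edited next)
            pipelines.py     # item pipelines (edited further down)
            settings.py      # project settings
            spiders/         # spider modules go here
                __init__.py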

Edit /nuxnu/nuxnu/items.py to read:

    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/topics/items.html

    from scrapy.item import Item, Field

    class NuxnuItem(Item):
        # define the fields for your item here like:
        name = Field()
        price = Field()
        desc = Field()
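
An Item behaves like a dict restricted to its declared fields. A quick sketch of what that means in practice (illustrative only, not from the original post):

    from nuxnu.items import NuxnuItem

    item = NuxnuItem()
    item['name'] = [u'product name']   # extract() returns lists, so fields hold lists here
    print item['name']
    item['bogus'] = 1                  # raises KeyError: the field was never declared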

Create the spider nuxnu_spider.py under the /nuxnu/nuxnu/spiders/ directory:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector

    from nuxnu.items import NuxnuItem

    class NuxnuSpider(BaseSpider):
        name = "xxx.xxx.xxx"
        start_urls = [
            "http://xxx.xxx.xxx/date.html"
        ]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # every <tr> of the data table, including the header row
            sites = hxs.select('//table[@class="datatable"]/tr')
            items = []
            for site in sites:
                item = NuxnuItem()
                item['name'] = site.select('td[1]/div/text()').extract()
                item['price'] = site.select('td[2]/div/text()').extract()
                item['desc'] = site.select('td[3]/div/text()').extract()
                items.append(item)
            return items

    SPIDER = NuxnuSpider()

Note: xxx.xxx.xxx stands for the site we want to scrape, and the table shown earlier is the content of the http://xxx.xxx.xxx/date.html page.
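
One detail worth noting: the '//table[@class="datatable"]/tr' selector also matches the header row, whose cells are <th> rather than <td>, so that row yields an item full of empty lists. A variation that keeps only real data rows would tighten the row selector; this is a sketch, not from the original post:

    # parse() variant: 'tr[td]' keeps only rows that contain at least one
    # <td> cell, so the <th> header row is skipped. Field extraction is
    # the same as above.
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//table[@class="datatable"]/tr[td]')
        items = []
        for site in sites:
            item = NuxnuItem()
            item['name'] = site.select('td[1]/div/text()').extract()
            item['price'] = site.select('td[2]/div/text()').extract()
            item['desc'] = site.select('td[3]/div/text()').extract()
            items.append(item)
        return items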

Next, edit /nuxnu/nuxnu/pipelines.py:

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/topics/item-pipeline.html

    from os import path
    from scrapy.core import signals
    from scrapy.xlib.pydispatch import dispatcher

    class StorePipeline(object):
        filename = 'flower.list'

        def __init__(self):
            self.f = None
            # open/close the output file together with the crawl engine
            dispatcher.connect(self.open, signals.engine_started)
            dispatcher.connect(self.close, signals.engine_stopped)

        def process_item(self, domain, item):
            self.f.write(str(item['name']) + '\n')
            return item

        def open(self):
            # append if the file already exists, otherwise create it
            if path.exists(self.filename):
                self.f = open(self.filename, 'a')
            else:
                self.f = open(self.filename, 'w')

        def close(self):
            if self.f is not None:
                self.f.close()
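
Because extract() returns a list, str(item['name']) writes a Python list repr such as [u'product name'] on each line. If plain text output is preferred, a variation could join and encode the strings first (a sketch, not from the original post; assumes UTF-8 is acceptable for the output file):

    # process_item() variant: write the joined cell text rather than the
    # list repr. UTF-8 output is an assumption.
    def process_item(self, domain, item):
        name = u''.join(item['name']).encode('utf-8')
        self.f.write(name + '\n')
        return item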

Then edit /nuxnu/nuxnu/settings.py:

    # Scrapy settings for nuxnu project
    #
    # For simplicity, this file contains only the most important settings by
    # default. All the other settings are documented here:
    #
    #     http://doc.scrapy.org/topics/settings.html
    #
    # Or you can copy and paste them from where they're defined in Scrapy:
    #
    #     scrapy/conf/default_settings.py
    #

    BOT_NAME = 'nuxnu'
    BOT_VERSION = '1.0'

    SPIDER_MODULES = ['nuxnu.spiders']
    NEWSPIDER_MODULE = 'nuxnu.spiders'
    DEFAULT_ITEM_CLASS = 'nuxnu.items.NuxnuItem'
    USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)
    ITEM_PIPELINES = ['nuxnu.pipelines.StorePipeline']
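
If the target site should be crawled gently, the settings file is also where throttling would go. A hedged example (DOWNLOAD_DELAY appears in the settings documentation linked above, but verify it against this Scrapy version):

    # seconds to wait between consecutive requests (assumed setting name;
    # check the settings reference for this Scrapy release)
    DOWNLOAD_DELAY = 2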

All that's left is to run it. Back in the /nuxnu/ directory:

    python scrapy-ctl.py crawl xxx.xxx.xxx

Done. The names scraped from each row end up in the flower.list file written by StorePipeline.

For the full tutorial and documentation, see:
http://doc.scrapy.org/intro/tutorial.html
http://doc.scrapy.org/index.html


Reference: 用Scrapy訂製自己的爬蟲抓取數據 (Building Your Own Crawler with Scrapy to Scrape Data)
雪龍
http://blog.bestlong.idv.tw
http://www.bestlong.idv.tw