bestlong 怕失憶論壇's Archiver

Posted by bestlong on 2010-9-26 00:52

Building a Custom Crawler with Scrapy to Scrape Data

Scrapy ([url]http://scrapy.org/[/url]) is a fast, high-level web scraping framework for crawling websites and extracting structured data from their pages. It can be used for data mining, data monitoring, automated testing, and more. It is a pure-Python crawler framework built on Twisted.

First, install the dependencies the framework needs:[code]
apt-get install python-twisted python-libxml2
apt-get install python-pyopenssl python-simplejson[/code]Then install Scrapy itself:[code]
wget http://scrapy.org/releases/0.9/Scrapy-0.9.tar.gz
tar zxf Scrapy-0.9.tar.gz
cd Scrapy-0.9
python setup.py install
[/code]For other systems and installation methods, see [url]http://doc.scrapy.org/intro/install.html[/url]

For example, suppose we are faced with page data like the following:[code]
<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0" style="margin-top:15px;" class="datatable">
<tr>
        <th width=106><div align="center">名称</div></th>
        <th width=87><div align="center" class='red'>价格</div></th>
        <th width=208><div align="center">描述</div></th>
</tr>
<tr>
        <td width=106><div align='center'>产品名称</div></td>
        <td width=87 class='red'><div align='left'>产品价格</div></td>
        <td width=208><div align='left'>产品描述</div></td>
</tr>
<tr>
        <td width=106><div align='center'>产品名称</div></td>
        <td width=87 class='red'><div align='left'>产品价格</div></td>
        <td width=208><div align='left'>产品描述</div></td>
</tr>
....
</table>
[/code]Create a new Scrapy project:[code]
python scrapy-ctl.py startproject nuxnu
[/code]Edit /nuxnu/nuxnu/items.py so it reads:[code]
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class NuxnuItem(Item):
    # define the fields for your item here like:
    name = Field()
    price = Field()
    desc = Field()[/code]Create the spider nuxnu_spider.py in the /nuxnu/nuxnu/spiders/ directory:[code]
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from nuxnu.items import NuxnuItem

class NuxnuSpider(BaseSpider):
    name = "xxx.xxx.xxx"
    start_urls = [
        "http://xxx.xxx.xxx/date.html"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # select only rows that contain <td> cells, which skips the <th> header row
        sites = hxs.select('//table[@class="datatable"]/tr[td]')
        items = []
        for site in sites:
            item = NuxnuItem()
            item['name'] = site.select('td[1]/div/text()').extract()
            item['price'] = site.select('td[2]/div/text()').extract()
            item['desc'] = site.select('td[3]/div/text()').extract()
            items.append(item)
        return items

SPIDER = NuxnuSpider()
[/code]Note: xxx.xxx.xxx is the site we want to scrape, and the data in the table above is the content of the [url]http://xxx.xxx.xxx/date.html[/url] page.
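The row-extraction logic in parse() above can be checked standalone, without running a crawl. The sketch below uses only the standard library's xml.etree.ElementTree against a simplified, well-formed copy of the sample table (the real spider uses Scrapy's HtmlXPathSelector, which tolerates messy HTML; the sample rows and values here are made up for illustration):

```python
# Standalone sketch of the spider's extraction logic using the stdlib.
import xml.etree.ElementTree as ET

# Simplified, well-formed version of the sample table (illustrative data).
SAMPLE = """
<table class="datatable">
  <tr><th><div>Name</div></th><th><div>Price</div></th><th><div>Description</div></th></tr>
  <tr><td><div>Rose</div></td><td><div>10</div></td><td><div>Red rose</div></td></tr>
  <tr><td><div>Lily</div></td><td><div>8</div></td><td><div>White lily</div></td></tr>
</table>
"""

def parse_table(xml_text):
    table = ET.fromstring(xml_text)
    items = []
    # Like the spider, walk each <tr>; the header row has <th> cells
    # rather than <td>, so it yields no <td> elements and is skipped.
    for tr in table.findall('tr'):
        tds = tr.findall('td')
        if not tds:
            continue
        items.append({
            'name': tds[0].find('div').text,
            'price': tds[1].find('div').text,
            'desc': tds[2].find('div').text,
        })
    return items

print(parse_table(SAMPLE))
```

Skipping rows without `<td>` cells mirrors the `tr[td]` idea: otherwise the header row would produce an item whose fields are all empty.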

Next, edit /nuxnu/nuxnu/pipelines.py:[code]
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/topics/item-pipeline.html

from os import path
from scrapy.core import signals
from scrapy.xlib.pydispatch import dispatcher

class StorePipeline(object):
    filename = 'flower.list'
    def __init__(self):
        self.f = None
        dispatcher.connect(self.open, signals.engine_started)
        dispatcher.connect(self.close, signals.engine_stopped)

    def process_item(self, domain, item):
        self.f.write(str(item['name'])+ '\n')
        return item

    def open(self):
        if path.exists(self.filename):
            self.f = open(self.filename, 'a')
        else:
            self.f = open(self.filename, 'w')

    def close(self):
        if self.f is not None:
            self.f.close()
[/code]Edit /nuxnu/nuxnu/settings.py:[code]
# Scrapy settings for nuxnu project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/topics/settings.html
#
# Or you can copy and paste them from where they're defined in Scrapy:
#
#     scrapy/conf/default_settings.py
#

BOT_NAME = 'nuxnu'
BOT_VERSION = '1.0'

SPIDER_MODULES = ['nuxnu.spiders']
NEWSPIDER_MODULE = 'nuxnu.spiders'
DEFAULT_ITEM_CLASS = 'nuxnu.items.NuxnuItem'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)
ITEM_PIPELINES = ['nuxnu.pipelines.StorePipeline']
[/code]All that remains is to run it. Back in the /nuxnu/ directory:[code]
python scrapy-ctl.py crawl xxx.xxx.xxx
[/code]Done.
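The file-writing behavior of StorePipeline can also be exercised in isolation. In this sketch the signal wiring (engine_started/engine_stopped) is replaced by explicit open()/close() calls, and the class name, file path, and items are illustrative, not part of the original project:

```python
# Minimal sketch of what StorePipeline does: append each item's
# name to a file, one entry per line.
import os
import tempfile

class StorePipelineSketch(object):
    def __init__(self, filename):
        self.filename = filename
        self.f = None

    def open(self):
        # Mode 'a' appends if the file exists and creates it otherwise,
        # so the os.path.exists() check in the original is not strictly needed.
        self.f = open(self.filename, 'a')

    def process_item(self, item):
        # Item fields extracted by the spider are lists, so str() writes
        # their list repr, e.g. ['Rose'] -- same as the original pipeline.
        self.f.write(str(item['name']) + '\n')
        return item

    def close(self):
        if self.f is not None:
            self.f.close()

path = os.path.join(tempfile.mkdtemp(), 'flower.list')
pipe = StorePipelineSketch(path)
pipe.open()
pipe.process_item({'name': ['Rose']})
pipe.process_item({'name': ['Lily']})
pipe.close()
print(open(path).read())
```

If you want plain names rather than list reprs in the output file, joining the extracted list (e.g. `''.join(item['name'])`) before writing would be one way to do it.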

For the full tutorial and documentation, see:
[url]http://doc.scrapy.org/intro/tutorial.html[/url]
[url]http://doc.scrapy.org/index.html[/url]


Reference: [url=http://nuxnu.com/2010/08/%E7%94%A8scrapy%E5%AE%9A%E5%88%B6%E8%87%AA%E5%B7%B1%E7%9A%84%E7%88%AC%E8%99%AB%E6%8A%93%E5%8F%96%E6%95%B0%E6%8D%AE/]Building a Custom Crawler with Scrapy to Scrape Data[/url]

Powered by Discuz! X1.5 Archiver   © 2001-2010 Comsenz Inc.