Scrapy-based proxy scraping - Python - bestlong 怕失憶論壇



bestlong


Posted on 2010-9-24 16:50

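Today I'd like to walk through how to implement proxy scraping on top of Scrapy.

Scrapy (http://scrapy.org/) is a crawling framework written in pure Python on top of Twisted's asynchronous processing model. You only need to customize a few modules to get a working crawler for scraping web page content and images, which is very convenient.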

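The overall processing flow:

In the architecture diagram, the green lines are the data flow. The user hands a starting URL (the feed) to the Scheduler. The Scheduler passes the request to the Downloader, which downloads the page and wraps it in a Response object for the Spider. The Spider parses the incoming page according to its scraping logic and produces two kinds of results: URLs that still need to be crawled, and the content we actually want. URLs are wrapped in Request objects and submitted back to the Scheduler; content is wrapped in Item objects and handed to the ItemPipeline for further processing, such as persistence.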
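The loop described above can be sketched with nothing but the standard library. This is a toy, synchronous stand-in and not Scrapy's actual implementation: the FAKE_PAGES table, the function names, and the example.com URLs are all made up for illustration, and the seen set is a simplification added here so the loop terminates.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (proxies on the page, links on the page).
FAKE_PAGES = {
    "http://example.com/page1": (["192.0.2.1:8080"], ["http://example.com/page2"]),
    "http://example.com/page2": (["192.0.2.2:3128"], ["http://example.com/page1"]),
}


def download(url):
    """Downloader stand-in: build a 'response' dict for a URL."""
    proxies, links = FAKE_PAGES[url]
    return {"url": url, "proxies": proxies, "links": links}


def spider_parse(response):
    """Spider stand-in: yield items (dicts) and follow-up requests (URL strings)."""
    for proxy in response["proxies"]:
        yield {"proxy": proxy}
    for link in response["links"]:
        yield link


def crawl(start_urls):
    scheduler = deque(start_urls)  # pending requests
    seen = set(start_urls)         # assumption: skip URLs already queued
    stored = []                    # what the item pipeline has "persisted"
    while scheduler:
        response = download(scheduler.popleft())
        for result in spider_parse(response):
            if isinstance(result, dict):
                stored.append(result["proxy"])  # item goes to the pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)        # request goes back to the scheduler
    return stored


print(crawl(["http://example.com/page1"]))
# ['192.0.2.1:8080', '192.0.2.2:3128']
```

The point of the sketch is only the routing: everything the spider yields is either an item (sent to the pipeline) or a new request (sent back to the scheduler).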

OK, let's start a Scrapy project. First it has to be installed:

$ easy_install scrapy

$ urpmi python-lxml

If you are more comfortable on Windows, see the installation instructions at:

http://doc.scrapy.org/intro/install.html#windows

Now let's get started for real:

1. Create a new project:

python scrapy-admin.py startproject proxybot

2. Let's look at the directory layout:

proxybot/
    scrapy-ctl.py
    proxybot/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

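The proxybot directory holds all the code. settings.py is the configuration file; the spiders subdirectory is where our spider modules go; items.py holds the item classes that Scrapy provides for wrapping the data we want to process; pipelines.py defines the logic for processing item objects. The spiders, items and pipelines are where we write code, which corresponds to just a small part in the lower-left (southwest) corner of the architecture diagram above.

scrapy-ctl.py: the control script, used to start the crawler.

3. Next we define the class that wraps the data we want to scrape. Here one object wraps one proxy.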

from scrapy.item import Item, Field

class ProxyItem(Item):
    ip = Field()    # proxy IP address
    port = Field()  # proxy port
    type = Field()  # proxy type as listed on the site

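Every proxy we scrape from a web page is wrapped in a ProxyItem.

4. Now it's time to write our own spider module. Today we'll practice on http://ipfree.cn/; the goal is to scrape every proxy listed on that site. Here is the code: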

from urlparse import urljoin

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

from proxybot.items import ProxyItem


class IpfreeSpider(BaseSpider):
    domain_name = "ipfree.cn"
    start_urls = (
        'http://www.ipfree.cn/index1-1.html',
        'http://www.ipfree.cn/index2-1.html',
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Flat list of cell texts: ip, port, ip, port, ...
        results = hxs.select('//td[@width]/div/a/font/text()').extract()
        # The two index pages lay out the type column differently.
        if response.url.startswith('http://www.ipfree.cn/index1'):
            types = hxs.select('//td[@width="64"]/text()').extract()[3:]
            types = [types[i] for i in range(0, len(types), 2)]
        elif response.url.startswith('http://www.ipfree.cn/index2'):
            types = hxs.select('//td[@width="57"]/text()').extract()[1:]
        else:
            types = None
        # Build one ProxyItem per (ip, port) pair.
        for i in range(0, len(results), 2):
            item = ProxyItem()
            item.ip = results[i]
            item.port = int(results[i+1])
            item.type = types[i // 2].lower() if types else None
            yield item
        # Follow pagination links back into parse().
        for link in hxs.select('//b/a/@href').extract():
            if link.startswith('index'):
                yield Request(urljoin(self.start_urls[0], link), callback=self.parse)


SPIDER = IpfreeSpider()

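Where:

domain_name is the name of the module and uniquely identifies this spider.

start_urls specifies which URLs the crawler starts from.

parse holds the actual parsing logic: it extracts the items and URLs from a response object. The part up to yield item scrapes the proxies on the page using XPath syntax. For example, the statement below means: find every td tag on the current page that has a width attribute, then take the text of the font tag inside the a link inside the div beneath it. Very convenient, isn't it!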

hxs.select('//td[@width]/div/a/font/text()').extract()
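To try the same XPath without Scrapy installed, here is a minimal sketch using the standard library's xml.etree.ElementTree, which supports the [@width] predicate. The SAMPLE markup is invented for illustration (it is not the real ipfree.cn page) and is kept well-formed because ElementTree is an XML parser, not an HTML one. It also shows how the flat result list is paired up into (ip, port) tuples, as the spider does.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one proxy table.
SAMPLE = """
<table>
  <tr>
    <td width="120"><div><a href="#"><font>192.0.2.10</font></a></div></td>
    <td width="60"><div><a href="#"><font>8080</font></a></div></td>
  </tr>
  <tr>
    <td width="120"><div><a href="#"><font>192.0.2.11</font></a></div></td>
    <td width="60"><div><a href="#"><font>3128</font></a></div></td>
  </tr>
  <tr><td><div><a href="#"><font>skipped: td has no width</font></a></div></td></tr>
</table>
""".strip()

root = ET.fromstring(SAMPLE)

# Same path as in the spider; .text plays the role of text() + extract().
results = [font.text for font in root.findall(".//td[@width]/div/a/font")]

# Pair consecutive entries: even index is the IP, odd index is the port.
proxies = [(results[i], int(results[i + 1])) for i in range(0, len(results), 2)]

print(results)  # ['192.0.2.10', '8080', '192.0.2.11', '3128']
print(proxies)  # [('192.0.2.10', 8080), ('192.0.2.11', 3128)]
```

The last row is skipped because its td has no width attribute, which is exactly what the [@width] predicate filters on.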


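Once a ProxyItem object has been built, it can be yielded to the pipeline for processing. The rest of the logic finds the links on the current page that we want to crawl further and wraps each one in a Request, which sends that URL back through the crawl process. The callback argument specifies the callback function, which we can change to suit our own logic.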
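The link-following step relies on urljoin to turn the relative hrefs found on a page into absolute URLs. A small standalone sketch of just that step follows; the follow_links helper and the sample hrefs are hypothetical, and only the base URL comes from the spider.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, as used in the spider

BASE = "http://www.ipfree.cn/index1-1.html"


def follow_links(hrefs, base=BASE):
    """Return absolute URLs for the links the spider would follow."""
    # Same filter as the spider: keep only hrefs that start with 'index'.
    return [urljoin(base, href) for href in hrefs if href.startswith("index")]


# Made-up hrefs, standing in for the output of the '//b/a/@href' query.
hrefs = ["index1-2.html", "index2-1.html", "about.html", "http://other.example/"]
print(follow_links(hrefs))
# ['http://www.ipfree.cn/index1-2.html', 'http://www.ipfree.cn/index2-1.html']
```

In the spider each of these URLs is wrapped in a Request with callback=self.parse, so the next page goes through the same parsing logic.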

5. At this point our spider module is finished. Next, let's look at the pipeline, which we will use to save the scraped proxies to a file: