A web crawler can roughly be divided into three parts: fetching pages, extracting information, and storing information. Python has plenty of crawler frameworks, the best known of which is Scrapy. It is also the only Python crawler framework I have used, and it takes a lot of worry out of the job. What frustrated me is that Scrapy is a pain to install on my Raspberry Pi Zero W, and the pages I scrape are simple enough that such a heavyweight framework feels unnecessary. So, in the spirit of learning, I set out to reinvent the wheel.

Before doing so I looked at a few lightweight frameworks, and Sukhoi is the one I like best. Its author, iogf, built it on top of his own asynchronous and networking libraries, which I find admirable. The framework I wrote is heavily inspired and influenced by Sukhoi.

The code below can be executed step by step in IPython and should work under both Python 2 and 3. It is also available on GitHub.

Basic Framework

import requests
import lxml.html as xhtml

requests is used to fetch pages, and lxml is used to extract information.
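For readers new to lxml, here is a tiny, self-contained illustration (the HTML snippet is made up) of the fromstring/xpath calls the framework relies on:

import lxml.html as xhtml

snippet = '<div class="quote"><span class="text">Hello</span></div>'
dom = xhtml.fromstring(snippet)                    # parse HTML into an HtmlElement
print(dom.xpath('//span[@class="text"]/text()'))   # ['Hello']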

HEADERS = {
    'User-Agent': "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}


class Miner(list):
    def __init__(self, url,
                 headers=HEADERS,
                 method='get',
                 payload={},
                 cookies={}):
        self.url = url
        self.headers = headers
        self.method = method
        self.payload = payload
        self.cookies = cookies
        super(Miner, self).__init__()

        self.fetch()

    def parse(self, dom):
        # Override in subclasses to extract data from the parsed page.
        pass

    def build(self, req):
        # Decode the response and parse it into an lxml.html.HtmlElement.
        data = req.content.decode(req.encoding, 'ignore')
        dom = xhtml.fromstring(data)
        self.parse(dom)

    def fetch(self):
        # Fetch the page with requests, using GET or POST as configured.
        if self.method == 'get':
            req = requests.get(url=self.url,
                               headers=self.headers,
                               cookies=self.cookies,
                               params=self.payload)
        else:
            req = requests.post(url=self.url,
                                headers=self.headers,
                                cookies=self.cookies,
                                data=self.payload)
        req.raise_for_status()
        self.build(req)

Miner is the core of the framework. It inherits from list, so it can act as a container for the extracted information. Its logic is simple: fetch retrieves the page, build turns the retrieved page into an lxml.html.HtmlElement instance dom, and parse extracts the information. Miner.parse is left unimplemented; the simplest way to use the framework is to subclass it and override Miner.parse.

This small framework covers the three most important parts of a crawler: fetching pages with requests, extracting information with lxml, and storing information in a list. Because Miner's logic is simple, it is also easy to extend. If the pages you scrape require authentication, override Miner.fetch (see the sketch below); if you want to extract information with something other than lxml, override Miner.build; and if you want to store the results in a database, handle that when overriding Miner.parse.
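As a sketch of the first extension point, here is one way a subclass might log in with a requests session before fetching; the login URL and form field names are hypothetical placeholders, not part of the framework:

import requests

class AuthMiner(Miner):
    LOGIN_URL = 'http://example.com/login'  # hypothetical login endpoint

    def __init__(self, url, username, password, **kwargs):
        self.username = username
        self.password = password
        super(AuthMiner, self).__init__(url, **kwargs)

    def fetch(self):
        # Log in first, then reuse the session (and its cookies) for the page.
        session = requests.Session()
        session.post(self.LOGIN_URL,
                     data={'username': self.username, 'password': self.password},
                     headers=self.headers)
        req = session.get(self.url, headers=self.headers, params=self.payload)
        req.raise_for_status()
        self.build(req)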

Testing

Now let's try using this simple framework to scrape some Quotes:

class QuoteMiner(Miner):
    def parse(self, dom):
        texts = dom.xpath('//div[@class="quote"]//span[@class="text"]/text()')
        authors = dom.xpath('//div[@class="quote"]//small/text()')
        for text, author in zip(texts, authors):
            self.append({
                'author': author,
                'text': text[1:-1],
            })

url = 'http://quotes.toscrape.com/'
quotes = QuoteMiner(url)
quotes

QuoteMiner scrapes the author and text of every quote on the Quotes page. The value of quotes is:

[{'author': 'Albert Einstein',
  'text': u'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.'},
...
 {'author': 'Steve Martin',
  'text': u'A day without sunshine is like, you know, night.'}]

Multithreading

requests is a blocking networking library, so crawling many pages is slow. The biggest drag on speed is the time spent waiting for each page to download. For an IO-bound task like this, multithreading is a natural fit. Let's use the Quotes site as an example again, but this time scrape all of the quotes:

urls = ['http://quotes.toscrape.com/page/{}/'.format(i) for i in range(1, 11)]

In this case, a convenient choice is the thread pool in the standard library's multiprocessing.dummy module, which mirrors the multiprocessing API but is backed by threads:

from multiprocessing.dummy import Pool
import time

def get_quotes(n=1):
    # Time how long it takes n threads to crawl every page in urls.
    pool = Pool(n)
    start = time.time()
    pool.map(QuoteMiner, urls)
    end = time.time()
    return end-start

for n in (1, 5, 10):
    print('{} thread(s): {}s'.format(n, get_quotes(n)))

Here are the results on an iMac (i5, 8 GB):

1 thread(s): 2.82291603088s
5 thread(s): 0.505815029144s
10 thread(s): 0.29133105278s

As you can see, 5 threads take roughly 1/5 of the time of a single thread, and 10 threads roughly 1/10. Since urls contains only 10 links, using more than 10 threads will not make the program any faster.

Running the same test on my Raspberry Pi Zero W gives:

1 thread(s): 3.24732995033s
5 thread(s): 1.0491271019s
10 thread(s): 0.768465995789s

The Raspberry Pi Zero W has modest hardware, so adding threads does not help as much as expected. In fact, once the number of threads passes a certain point (which depends on the machine and the number of tasks), the running time more or less levels off.
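If you want to see where that plateau sits on your own machine, a quick sweep with the get_quotes helper above does the job (the thread counts here are arbitrary, and the timings will of course vary):

for n in (1, 2, 4, 8, 16, 32):
    print('{} thread(s): {:.3f}s'.format(n, get_quotes(n)))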

BeautifulSoup

Next, let's see how to swap lxml out for BeautifulSoup.

from bs4 import BeautifulSoup

class SoupMiner(Miner):
    def build(self, req):
        data = req.content.decode(req.encoding, 'ignore')
        dom = BeautifulSoup(data, 'lxml')
        self.parse(dom)

SoupMiner overrides the Miner.build method so that it uses BeautifulSoup. Note that SoupMiner.build and Miner.build differ by only one line. Now let's scrape Quotes with SoupMiner:

class QuoteSoup(SoupMiner):
    def parse(self, dom):
        quotes = dom.find_all('div', class_='quote')
        for quote in quotes:
            self.append({
                'text': quote.find('span', class_='text').get_text()[1:-1],
                'author': quote.find('small').get_text() 
            })

quotes = QuoteSoup(url)
quotes

The output:

[{'author': 'Albert Einstein',
  'text': u'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.'},
...
 {'author': 'Steve Martin',
  'text': u'A day without sunshine is like, you know, night.'}]

Database

Commonly used ORM libraries include SQLAlchemy and peewee, but for small projects I prefer Pony. Using Pony as an example, I will show how to store the scraped information in a database.

from pony.orm import *

db = Database("sqlite", ":memory:", create_db=True)

class Quote(db.Entity):
    id = PrimaryKey(int, auto=True)
    author = Required(str)
    text = Required(str)

db.generate_mapping(create_tables=True)

@db_session
def save(data):
    Quote(**data)

The code above creates an in-memory database db and maps the Quote class to a table named Quote in that database. The table has three columns: id, author, and text, where id is an auto-generated primary key and author and text hold the scraped information. A small modification to the earlier QuoteMiner now gives us PonyQuote, which writes to the database:

class PonyQuote(Miner):
    def parse(self, dom):
        texts = dom.xpath('//div[@class="quote"]//span[@class="text"]/text()')
        authors = dom.xpath('//div[@class="quote"]//small/text()')
        for text, author in zip(texts, authors):
            save({
                'author': author,
                'text': text[1:-1],
            })

PonyQuote(url)
with db_session:
    quote_einstein = select(q.text for q in Quote if 'Einstein' in q.author)[:]

quote_einstein queries the database for all of Einstein's quotes; its value is:

[u'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.',
 u'There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.',
 u'Try not to become a man of success. Rather become a man of value.']
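To tie the pieces together, here is a small follow-up sketch: crawl the remaining pages into the same database (the slice skips page 1, which PonyQuote(url) above has already stored) and run a couple of queries. count and select come from pony.orm, which was star-imported earlier:

for page_url in urls[1:]:
    PonyQuote(page_url)

with db_session:
    total = count(q for q in Quote)                              # total number of stored quotes
    per_author = select((q.author, count(q)) for q in Quote)[:]  # quote count per author

print(total)
print(per_author)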