A web crawler can roughly be divided into three parts: fetching pages, extracting information, and storing information. Python has plenty of crawler frameworks, the best known of which is Scrapy. It is also the only Python crawler framework I have used, and it takes a lot of worry out of the job. What frustrated me is that Scrapy is a pain to install on my Raspberry Pi Zero W, and the pages I scrape are simple enough that such a heavyweight framework feels unnecessary. So, in the spirit of learning, I set out to reinvent the wheel.

Before doing so I looked at a few lightweight frameworks, and Sukhoi is the one I like best. Its author, iogf, built it on top of his own asynchronous and networking libraries, which I find admirable. The framework I wrote is heavily inspired and influenced by Sukhoi.

The code below can be executed step by step in IPython and should work under both Python 2 and 3. It is also available on GitHub.

Basic Framework

import requests
import lxml.html as xhtml

requests is used to fetch pages, and lxml is used to extract information.
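For readers new to lxml, here is a tiny, self-contained illustration (the HTML snippet is made up) of the fromstring/xpath calls the framework relies on:

import lxml.html as xhtml

snippet = '<div class="quote"><span class="text">Hello</span></div>'
dom = xhtml.fromstring(snippet)                    # parse HTML into an HtmlElement
print(dom.xpath('//span[@class="text"]/text()'))   # ['Hello']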

HEADERS = {
    'User-Agent': "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}


class Miner(list):
    def __init__(self, url,
                 headers=HEADERS,
                 method='get',
                 payload={},
                 cookies={}):
        self.url = url
        self.headers = headers
        self.method = method
        self.payload = payload
        self.cookies = cookies
        super(Miner, self).__init__()

        self.fetch()

    def parse(self, dom):
        # Override in subclasses to extract data from the parsed page.
        pass

    def build(self, req):
        # Decode the response and parse it into an lxml.html.HtmlElement.
        data = req.content.decode(req.encoding, 'ignore')
        dom = xhtml.fromstring(data)
        self.parse(dom)

    def fetch(self):
        # Fetch the page with requests, using GET or POST as configured.
        if self.method == 'get':
            req = requests.get(url=self.url,
                               headers=self.headers,
                               cookies=self.cookies,
                               params=self.payload)
        else:
            req = requests.post(url=self.url,
                                headers=self.headers,
                                cookies=self.cookies,
                                data=self.payload)
        req.raise_for_status()
        self.build(req)

Miner is the core of the framework. It inherits from list, so it can act as a container for the extracted information. Its logic is simple: fetch retrieves the page, build turns the retrieved page into an lxml.html.HtmlElement instance dom, and parse extracts the information. Miner.parse is left unimplemented; the simplest way to use the framework is to subclass it and override Miner.parse.

This small framework covers the three most important parts of a crawler: fetching pages with requests, extracting information with lxml, and storing information in a list. Because Miner's logic is simple, it is also easy to extend. If the pages you scrape require authentication, override Miner.fetch (see the sketch below); if you want to extract information with something other than lxml, override Miner.build; and if you want to store the results in a database, handle that when overriding Miner.parse.
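As a sketch of the first extension point, here is one way a subclass might log in with a requests session before fetching; the login URL and form field names are hypothetical placeholders, not part of the framework:

import requests

class AuthMiner(Miner):
    LOGIN_URL = 'http://example.com/login'  # hypothetical login endpoint

    def __init__(self, url, username, password, **kwargs):
        self.username = username
        self.password = password
        super(AuthMiner, self).__init__(url, **kwargs)

    def fetch(self):
        # Log in first, then reuse the session (and its cookies) for the page.
        session = requests.Session()
        session.post(self.LOGIN_URL,
                     data={'username': self.username, 'password': self.password},
                     headers=self.headers)
        req = session.get(self.url, headers=self.headers, params=self.payload)
        req.raise_for_status()
        self.build(req)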

Testing

Now let's try using this simple framework to scrape some Quotes:

class QuoteMiner(Miner):
    def parse(self, dom):
        texts = dom.xpath('//div[@class="quote"]//span[@class="text"]/text()')
        authors = dom.xpath('//div[@class="quote"]//small/text()')
        for text, author in zip(texts, authors):
            self.append({
                'author': author,
                'text': text[1:-1],
            })

url = 'http://quotes.toscrape.com/'
quotes = QuoteMiner(url)
quotes

QuoteMiner scrapes the author and text of every quote on the Quotes page. The value of quotes is:

[{'author': 'Albert Einstein',
  'text': u'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.'},
...
 {'author': 'Steve Martin',
  'text': u'A day without sunshine is like, you know, night.'}]

Multithreading

requests is a blocking networking library, so crawling many pages is slow. The biggest drag on speed is the time spent waiting for each page to download. For an IO-bound task like this, multithreading is a natural fit. Let's use the Quotes site as an example again, but this time scrape all of the quotes:

urls = ['http://quotes.toscrape.com/page/{}/'.format(i) for i in range(1, 11)]

In this case, a convenient choice is the thread pool in the standard library's multiprocessing.dummy module, which mirrors the multiprocessing API but is backed by threads:

from multiprocessing.dummy import Pool
import time

def get_quotes(n=1):
    # Time how long it takes n threads to crawl every page in urls.
    pool = Pool(n)
    start = time.time()
    pool.map(QuoteMiner, urls)
    end = time.time()
    return end-start

for n in (1, 5, 10):
    print('{} thread(s): {}s'.format(n, get_quotes(n)))

Here are the results on an iMac (i5, 8 GB):

1 thread(s): 2.82291603088s
5 thread(s): 0.505815029144s
10 thread(s): 0.29133105278s

As you can see, 5 threads take roughly 1/5 of the time of a single thread, and 10 threads roughly 1/10. Since urls contains only 10 links, using more than 10 threads will not make the program any faster.

Running the same test on my Raspberry Pi Zero W gives:

1 thread(s): 3.24732995033s
5 thread(s): 1.0491271019s
10 thread(s): 0.768465995789s

The Raspberry Pi Zero W has modest hardware, so adding threads does not help as much as expected. In fact, once the number of threads passes a certain point (which depends on the machine and the number of tasks), the running time more or less levels off.
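If you want to see where that plateau sits on your own machine, a quick sweep with the get_quotes helper above does the job (the thread counts here are arbitrary, and the timings will of course vary):

for n in (1, 2, 4, 8, 16, 32):
    print('{} thread(s): {:.3f}s'.format(n, get_quotes(n)))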

BeautifulSoup

Next, let's see how to swap lxml out for BeautifulSoup.

from bs4 import BeautifulSoup

class SoupMiner(Miner):
    def build(self, req):
        data = req.content.decode(req.encoding, 'ignore')
        dom = BeautifulSoup(data, 'lxml')
        self.parse(dom)

SoupMiner overrides the Miner.build method so that it uses BeautifulSoup. Note that SoupMiner.build and Miner.build differ by only one line. Now let's scrape Quotes with SoupMiner:

class QuoteSoup(SoupMiner):
    def parse(self, dom):
        quotes = dom.find_all('div', class_='quote')
        for quote in quotes:
            self.append({
                'text': quote.find('span', class_='text').get_text()[1:-1],
                'author': quote.find('small').get_text() 
            })

quotes = QuoteSoup(url)
quotes

The output:

[{'author': 'Albert Einstein',
  'text': u'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.'},
...
 {'author': 'Steve Martin',
  'text': u'A day without sunshine is like, you know, night.'}]

Database

Commonly used ORM libraries include SQLAlchemy and peewee, but for small projects I prefer Pony. Using Pony as an example, I will show how to store the scraped information in a database.

from pony.orm import *

db = Database("sqlite", ":memory:", create_db=True)

class Quote(db.Entity):
    id = PrimaryKey(int, auto=True)
    author = Required(str)
    text = Required(str)

db.generate_mapping(create_tables=True)

@db_session
def save(data):
    Quote(**data)

The code above creates an in-memory database db and maps the Quote class to a table named Quote in that database. The table has three columns: id, author, and text, where id is an auto-generated primary key and author and text hold the scraped information. A small modification to the earlier QuoteMiner now gives us PonyQuote, which writes to the database:

class PonyQuote(Miner):
    def parse(self, dom):
        texts = dom.xpath('//div[@class="quote"]//span[@class="text"]/text()')
        authors = dom.xpath('//div[@class="quote"]//small/text()')
        for text, author in zip(texts, authors):
            save({
                'author': author,
                'text': text[1:-1],
            })

PonyQuote(url)
with db_session:
    quote_einstein = select(q.text for q in Quote if 'Einstein' in q.author)[:]

quote_einstein queries the database for all of Einstein's quotes; its value is:

[u'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.',
 u'There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.',
 u'Try not to become a man of success. Rather become a man of value.']
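To tie the pieces together, here is a small follow-up sketch: crawl the remaining pages into the same database (the slice skips page 1, which PonyQuote(url) above has already stored) and run a couple of queries. count and select come from pony.orm, which was star-imported earlier:

for page_url in urls[1:]:
    PonyQuote(page_url)

with db_session:
    total = count(q for q in Quote)                              # total number of stored quotes
    per_author = select((q.author, count(q)) for q in Quote)[:]  # quote count per author

print(total)
print(per_author)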