Building Your Own Proxy Pool, Starting from Douban's Anti-Scraping Measures

鷹兔牛熊眼 2019-01-14

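Anyone who has scraped Douban has probably had this experience: the crawl starts out fine, but after a while it stops returning data. This happens because Douban limits requests by IP. If too many requests arrive from the same IP within a short period, Douban blocks that IP, and the crawler can no longer fetch any data.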

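Douban's Anti-Scraping Measures

Let's first see Douban's anti-scraping behavior in practice. Suppose we have a crawler that collects book data from several Douban tag pages (the book listings on a page such as https://book.douban.com/tag/SQL).

The crawler code is as follows (simplified, since the only goal here is to demonstrate Douban's blocking behavior):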

import requests
from lxml import etree


def get_books_by_page(tag, page_no):
    # Each tag page lists 20 books; start is the offset of the first one.
    start = page_no * 20
    url = 'https://book.douban.com/tag/{}?start={}&type=T'.format(tag, start)
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)'}
    try:
        #r = requests.get(url, headers=headers, verify=False, proxies={'https': proxy})
        r = requests.get(url, headers=headers, verify=False)
        content = r.content.decode('utf-8')
        root = etree.HTML(content)
        items = root.xpath('.//li[@class="subject-item"]')
        print(r.status_code, len(items))
        books = []
        for item in items:
            title_node = item.xpath('.//div[@class="info"]/h2/a')[0]
            name = title_node.attrib['title']
            url = title_node.attrib['href']
            books.append({'name': name, 'url': url})
        return books
    except Exception as e:
        # Any failure is treated as an empty page.
        msg = str(e)
        return []


if __name__ == '__main__':
    tags = ['SQL', '數據分析', '計算機']
    for tag in tags:
        page_no = 0
        while True:
            books = get_books_by_page(tag, page_no)
            # A page with fewer than 20 books is the last page of the tag.
            if len(books) < 20:
                break
            page_no += 1

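The crawler above fetches every book under the three tags SQL, 數據分析 (data analysis), and 計算機 (computers). For each page it prints the HTTP status code, r.status_code, and the number of books found, len(items).

Running the crawler from the command line produces output like this: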


200 20
200 20
200 20
200 20
200 20
200 20
200 20
200 20
200 20
...

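This output shows that every page returned HTTP 200 and that all 20 books on each page were parsed.

After running the program a few more times, however, the output changes to this: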


200 0
200 0
200 0

The response is still HTTP 200, but no book data can be extracted from the page any more: the crawler has been blocked.
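Because a blocked request still returns HTTP 200, the status code alone does not reveal the block, and a crawler that stops at the first short page will quietly mistake a block for the end of a tag. The following is a minimal sketch of one way to tell the cases apart. The function name classify_page and its labels are hypothetical, and the rule relies on an assumption that is not from the article: every crawled tag has at least one book, so an empty first page can only mean a block.

```python
def classify_page(status_code, item_count, page_no, page_size=20):
    """Label one crawl result as 'error', 'blocked', 'last_page' or 'ok'.

    Heuristic only: it assumes every crawled tag has at least one book,
    so an empty first page (page_no == 0) is treated as a block.
    """
    if status_code != 200:
        return 'error'
    if item_count == 0 and page_no == 0:
        return 'blocked'
    # Fewer items than a full page means there are no more pages.
    if item_count < page_size:
        return 'last_page'
    return 'ok'
```

An empty page later in a tag stays ambiguous under this rule (it could be a block or a tag whose size is an exact multiple of 20), so the sketch reports it as 'last_page'.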

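An intuitive way to solve the blocking problem is a proxy pool: each page request goes out from a different IP, so no single IP sends requests often enough to be blocked.

By degree of anonymity, proxies fall roughly into three categories:

Transparent proxy
Anonymous proxy
High-anonymity proxy

A transparent proxy puts your real IP in the HTTP headers, so the server can read your real IP from them.

An anonymous proxy hides your real IP, but the server can still tell that you are using a proxy.

A high-anonymity proxy hides your real IP and also prevents the server from detecting that a proxy is in use. This is the best choice for a self-built proxy pool, and it is the type used in the steps that follow.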
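The three categories can be illustrated from the server's point of view. The sketch below guesses the category from the request headers a server receives. It assumes that proxies reveal themselves only through the conventional X-Forwarded-For and Via headers, which is a simplification, since real proxies may use other headers. The function name classify_proxy and the returned labels are hypothetical.

```python
def classify_proxy(headers, client_ip):
    """Guess a proxy's anonymity level from the headers a server sees.

    headers:   dict mapping header names to values, as received by the server.
    client_ip: the real IP address of the client behind the proxy.
    """
    # Header names are case-insensitive, so normalise them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    forwarded = normalized.get('x-forwarded-for', '')
    via = normalized.get('via', '')

    # Transparent: the client's real IP is passed along in the headers.
    if client_ip in [ip.strip() for ip in forwarded.split(',')]:
        return 'transparent'
    # Anonymous: the real IP is hidden, but proxy headers are still present.
    if forwarded or via:
        return 'anonymous'
    # High-anonymity: nothing in these headers points to a proxy.
    return 'high-anonymity'
```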

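Building the Proxy Pool

Xici Proxy (西刺代理, https://www./) is a website that offers free proxies. Its pages list the proxies in a table.

We build our proxy pool by scraping the usable free high-anonymity proxies from Xici.

The code for scraping Xici's high-anonymity proxies is as follows: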


import requests
from lxml import etree


def get_xici_proxy(page_no):
    url = 'https://www./nn/{}'.format(page_no)
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)'}
    r = requests.get(url, verify=False, headers=headers)
    content = r.content.decode('utf-8')
    root = etree.HTML(content)
    # Skip the header row of the proxy table.
    tr_nodes = root.xpath('.//table[@id="ip_list"]/tr')[1:]
    result = []
    for tr_node in tr_nodes:
        td_nodes = tr_node.xpath('./td')
        ip = td_nodes[1].text
        port = td_nodes[2].text
        proxy_type = td_nodes[4].text
        proto = td_nodes[5].text
        proxy = '{}://{}:{}'.format(proto.lower(), ip, port)
        uptime = td_nodes[8].text  # read but not used for filtering
        # Keep only high-anonymity ('高匿') HTTPS proxies.
        if proxy_type == '高匿' and proto.lower() == 'https':
            result.append(proxy)
    return result

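The get_xici_proxy function above fetches one page of proxies per call. Because Douban book URLs all use HTTPS, we only care about HTTPS proxies, so the code keeps only proxies that are both high-anonymity and HTTPS.

With the free proxies scraped, the next step is to check whether they actually work. We test each proxy by using it to request a Douban page. The code is as follows: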

def test_proxy(proxy):
    https_url = 'https://book.douban.com/tag/SQL?start=20&type=T'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
    try:
        proxies = {'https': proxy}
        r = requests.get(https_url, headers=headers, verify=False, proxies=proxies, timeout=10)
        content = r.content.decode('utf-8')
        root = etree.HTML(content)
        items = root.xpath('.//li[@class="subject-item"]')
        print(r.status_code)
        # A working proxy returns HTTP 200 and a full page of 20 books.
        if r.status_code == 200 and len(items) == 20:
            return True
        return False
    except Exception as e:
        # Timeouts and connection errors mean the proxy is unusable.
        msg = str(e)
        return False

We obtained the following working proxies:

# contents of the proxy file
https://110.52.235.11:9999
https://119.101.114.44:9999
https://119.101.117.59:9999
https://112.85.129.162:9999
https://119.101.112.66:9999
https://119.101.117.72:9999
https://125.123.136.156:9999
https://119.101.112.210:9999
https://119.101.114.72:9999
https://119.101.112.202:9999
https://119.101.112.173:9999
https://119.101.112.251:9999
https://119.101.112.64:9999
https://119.101.114.103:9999
https://119.101.112.172:9999
https://119.177.210.163:9999

We save the proxies that passed the test to a file named proxy.
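The article does not show the step that writes this file, so here is a minimal sketch of how it might look. The helper name save_working_proxies is hypothetical. It takes the checker as a parameter so that it can be tried without network access, and it writes one proxy per line, each followed by a newline.

```python
def save_working_proxies(candidates, is_working, path):
    """Write the proxies that pass is_working to path, one per line.

    candidates: iterable of proxy URLs such as 'https://1.2.3.4:9999'.
    is_working: function taking a proxy URL and returning True or False.
    Returns the list of proxies that were written.
    """
    working = [proxy for proxy in candidates if is_working(proxy)]
    with open(path, 'w') as f:
        for proxy in working:
            f.write(proxy + '\n')
    return working
```

With the functions from this article, a call could look like save_working_proxies(get_xici_proxy(1), test_proxy, 'proxy'), where the page number 1 is only an example.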

Next, we implement a Proxy class that hands out proxies from this list:

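class Proxy(object):
    _instance = None

    def __new__(cls, proxyfile):
        # Singleton: the proxy file is read only when the first instance is created.
        if not isinstance(cls._instance, cls):
            cls._instance = super(Proxy, cls).__new__(cls)
            with open(proxyfile) as f:
                content = f.read()
                lines = content.split('\n')
                # Drop blank lines, such as the one after the final newline.
                cls._instance._proxies = [line for line in lines if line.strip()]
                cls._instance._curr = 0
        return cls._instance

    def get_proxy(self):
        # Round-robin: wrap back to the first proxy after the last one.
        idx = self._curr % len(self._proxies)
        proxy = self._proxies[idx]
        self._curr += 1
        return proxy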

The Proxy class above is a singleton. Its get_proxy method returns one proxy from the list per call; once every proxy has been used, it wraps around to the first one and starts over.
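The same round-robin behavior can also be expressed with itertools.cycle from the standard library. The sketch below is an alternative illustration, not the article's implementation, and the name make_rotator is hypothetical.

```python
import itertools


def make_rotator(proxies):
    """Return a zero-argument function that yields proxies round-robin."""
    if not proxies:
        raise ValueError('the proxy list must not be empty')
    # cycle() repeats the sequence endlessly, wrapping around after the last item.
    pool = itertools.cycle(list(proxies))
    return lambda: next(pool)
```

Each call to the returned function yields the next proxy, starting again from the first one after the list is exhausted. Unlike the singleton, every call to make_rotator creates an independent rotation.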

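At this point we have built our own proxy pool and created a Proxy class for drawing proxies from it.

Next, we modify the earlier Douban crawler so that it sends requests through proxies from the pool. The get_books_by_page function becomes: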

def get_books_by_page(tag, page_no):
    start = page_no * 20
    url = 'https://book.douban.com/tag/{}?start={}&type=T'.format(tag, start)
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)'}
    # Proxy is a singleton, so the file is read once and the rotation continues across calls.
    inst = Proxy('proxy')
    proxy = inst.get_proxy()
    try:
        r = requests.get(url, headers=headers, verify=False, proxies={'https': proxy}, timeout=10)
        content = r.content.decode('utf-8')
        root = etree.HTML(content)
        items = root.xpath('.//li[@class="subject-item"]')
        print(r.status_code, len(items))
        books = []
        for item in items:
            title_node = item.xpath('.//div[@class="info"]/h2/a')[0]
            name = title_node.attrib['title']
            url = title_node.attrib['href']
            books.append({'name': name, 'url': url})
        return books
    except Exception as e:
        # A dead or slow proxy ends up here and yields an empty page.
        msg = str(e)
        return []
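Free proxies fail often, and in the function above a single dead proxy produces an empty page, which the main loop then treats as the end of the tag. One possible refinement, which is not part of the original article, is to retry the same page with the next proxy. The sketch below assumes a hypothetical fetch function that takes a proxy and returns a list of books (or raises on failure), and a next_proxy function such as the get_proxy method of a Proxy instance.

```python
def fetch_with_retry(fetch, next_proxy, max_attempts=3):
    """Call fetch(proxy) with successive proxies until one returns results.

    fetch:      function taking a proxy URL and returning a list of books.
    next_proxy: zero-argument function returning the next proxy to try.
    Returns the first non-empty result, or [] if every attempt fails.
    """
    for _ in range(max_attempts):
        proxy = next_proxy()
        try:
            result = fetch(proxy)
        except Exception:
            # This proxy failed; move on to the next one.
            continue
        if result:
            return result
    return []
```

The cost of this approach is that a page which is genuinely empty is requested max_attempts times before the crawler gives up on it.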