Python Crawler: Setting a Cookie to Get Past Site Blocking and Scrape Mayi Short-Term Rentals (蚂蚁短租)

Published by 嗨学编程 on 2019-11-23 16:26:55

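Preface

The text and images in this article come from the internet and are for learning and discussion only, with no commercial purpose. Copyright belongs to the original author; if there is any problem, please contact us promptly so we can handle it.

Author: Eastmount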

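When writing Python crawlers, we sometimes run into anti-crawling measures such as a site refusing access. For example, when we try to scrape data from Mayi short-term rentals (蚂蚁短租), the site responds with the message "当前访问疑似黑客攻击，已被网站管理员设置为拦截" ("The current access looks like a hacker attack and has been blocked by the site administrator"), as shown in the figure below. In this situation we need to set a Cookie in order to crawl the site, and the rest of this article explains how. Many thanks to my student Chengfeng (承峰) for the idea; the younger generation is catching up fast!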

1. Site Analysis and Crawler Blocking

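When we open Mayi short-term rentals and search for Guiyang, the site returns the results shown in the figure below. The rental listings follow a regular layout, as the figure below shows, and this is the information we want to scrape. Inspecting the page elements in the browser shows that each rental listing sits under a <dd> node.

The rental name, as shown in the figure below, sits under a node whose class is "room-detail clearfloat".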
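To make that node layout concrete, here is a minimal sketch that runs the same selectors against a small inline HTML fragment instead of the live site. The fragment, the listing names in it, and the helper `extract_names` are all hypothetical; only the `dd` wrapper and the `room-detail clearfloat` class come from the page analysis above, and the real markup is likely richer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration. Only the <dd> wrapper and the
# "room-detail clearfloat" class are taken from the article.
SAMPLE_HTML = """
<dl>
  <dd>
    <div class="room-detail clearfloat">
      <p>Sample listing A</p>
    </div>
  </dd>
  <dd>
    <div class="room-detail clearfloat">
      <p>
        Sample listing B
      </p>
    </div>
  </dd>
</dl>
"""


def extract_names(html):
    """Return the text of the first <p> inside each room-detail node."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for tag in soup.find_all("dd"):
        # Passing the full string matches the exact value of the class attribute.
        for detail in tag.find_all(attrs={"class": "room-detail clearfloat"}):
            text = detail.find("p").get_text()
            names.append(text.replace("\n", "").strip())
    return names


if __name__ == "__main__":
    for listing_name in extract_names(SAMPLE_HTML):
        print("[Rental name]", listing_name)
```

Testing selectors on a saved or hand-written fragment like this keeps the parsing logic separate from the question of whether the site lets the request through.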

Next, we write a simple BeautifulSoup crawler.

# -*- coding: utf-8 -*-
import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.mayi.com/guiyang/?map=no'
# Plain request with no custom headers
response = urllib.request.urlopen(url)
contents = response.read()
soup = BeautifulSoup(contents, "html.parser")
print(soup.title)
print(soup)
# Rental names
for tag in soup.find_all('dd'):
    for name in tag.find_all(attrs={"class": "room-detail clearfloat"}):
        fname = name.find('p').get_text()
        print('[Rental name]', fname.replace('\n', '').strip())

Unfortunately, this raises an error, which shows that the site's anti-crawling measures are quite effective.
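The exact error is not reproduced here, so the following is only a sketch of how one might inspect such a failure, assuming the block surfaces as an HTTP error status. The helper `fetch` and its `opener` parameter are hypothetical names introduced for illustration; `urllib.error.HTTPError` is the standard exception that `urlopen` raises for error statuses in Python 3.

```python
import urllib.error
import urllib.request


def fetch(url, opener=urllib.request.urlopen):
    """Return (status_code, body_bytes) for url.

    `opener` is a hypothetical injection point so the function can be
    exercised without touching the network.
    """
    try:
        response = opener(url)
        return response.getcode(), response.read()
    except urllib.error.HTTPError as err:
        # HTTPError doubles as a response object, so the body of the
        # block page (if any) can still be read for inspection.
        return err.code, err.read()
```

Calling `fetch('http://www.mayi.com/guiyang/?map=no')` would then return the status code together with whatever page the site sent back, rather than stopping at a traceback.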

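2. A BeautifulSoup Crawler with a Cookie Set

The code that adds request headers is shown below. We give the code and its results first, and then explain how to obtain the Cookie.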

# -*- coding: utf-8 -*-
import urllib.request
import re
from bs4 import BeautifulSoup


# Crawler function
def gydzf(url):
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    headers = {"User-Agent": user_agent}
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    contents = response.read()
    soup = BeautifulSoup(contents, "html.parser")
    for tag in soup.find_all('dd'):
        # Rental name
        for name in tag.find_all(attrs={"class": "room-detail clearfloat"}):
            fname = name.find('p').get_text()
            print('[Rental name]', fname.replace('\n', '').strip())
        # Rental price
        for price in tag.find_all(attrs={"class": "moy-b"}):
            string = price.find('p').get_text()
            # Strip the currency symbol and keep the first five characters
            fprice = re.sub("[￥]+", "", string)
            fprice = fprice[0:5]
            print('[Rental price]', fprice.replace('\n', '').strip())
            # Rating, review count, and guest count.
            # `name` is the last match left over from the loop above.
            fscore = name.find('ul').get_text()
            print('[Rating / reviews / guests]', fscore.replace('\n', '').strip())
            # Link to the listing page
            url_dzf = tag.find(attrs={"target": "_blank"})
            urls = url_dzf.attrs['href']
            print('[Page link]', urls.replace('\n', '').strip())
            urlss = 'http://www.mayi.com' + urls
            print(urlss)

# Main
if __name__ == '__main__':
    i = 1
    while i
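
The listing above sends only a User-Agent header. As a rough sketch of how a Cookie could be attached in the same way, the block below builds a request carrying both headers using Python 3's `urllib.request`. The helper `build_request` is a hypothetical name, and the cookie string `name=value` is a placeholder, not a real value; an actual Cookie would have to be copied from a browser session on the site.

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
)


def build_request(url, cookie):
    """Build a Request that carries both a User-Agent and a Cookie header."""
    headers = {
        "User-Agent": USER_AGENT,
        # Assumption: the cookie is passed as one raw header string,
        # exactly as the browser would send it.
        "Cookie": cookie,
    }
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    # Placeholder cookie; no network request is made here.
    req = build_request("http://www.mayi.com/guiyang/?map=no", "name=value")
    print(req.header_items())
```

The resulting object can be passed to `urllib.request.urlopen` just like the request built inside `gydzf`. Whether the site accepts it depends on the Cookie being current, which this sketch does not check.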
