[Python] Multithreaded Crawling of Portrait Photos from a Website (1.62 GB Total)

Xavier Jiezou, published 2021-04-30 23:34:52

Contents

Introduction
Target Site
Dependencies
Crawling Approach
Full Code
Results
Single-Image Preview
Multi-Image Preview
References

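Introduction

This article walks through a Python crawler script that uses multiple threads to download all the portrait photos published on the Vmgirls (唯美女生) website.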

Target Site

Vmgirls (唯美女生): https://www.vmgirls.com/


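Dependencies

pip install requests
pip install BeautifulSoup4
pip install fake_useragent
pip install tqdm
pip install lxml

requests: sends HTTP requests to web pages and returns the responses.
BeautifulSoup4: locates and parses page elements.
fake_useragent: generates random, spoofed user-agent strings.
tqdm: prints a download progress bar.
lxml: the parser backend that the script passes to BeautifulSoup ('lxml'); it must be installed separately.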

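Crawling Approach

The goal is to download every photo on the site. The photos live inside individual articles, so the article links have to be found before any image can be fetched.

Well-maintained sites usually provide a sitemap that lists the title and link of every article published so far. Fortunately, this site has one.

The script therefore reads all article titles and links from the sitemap, uses each title as the name of the folder where that article's images are saved, and extracts the image URLs from each article page to save them locally.

As of April 28, 2021, the Vmgirls site had published 1,363 articles in total. To speed up the crawl, a thread pool processes the articles concurrently, with one task per article link and title.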
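The thread-pool pattern described above can be sketched as follows. This is a minimal illustration, not the script itself: `fetch_article` and `crawl_all` are hypothetical names, and `fetch_article` only sleeps briefly instead of making a request. The sketch also shows one way to surface exceptions raised in workers, which otherwise stay hidden inside the future objects when only `submit` is called.

```python
import concurrent.futures as cf
import time


def fetch_article(link, title):
    """Hypothetical stand-in for the per-article download; no network access."""
    if not link.startswith('https://'):
        raise ValueError(f'unsupported link: {link}')
    time.sleep(0.01)  # simulate I/O latency
    return title


def crawl_all(articles, max_workers=8):
    """Run fetch_article for every (link, title) pair in a thread pool."""
    done, failed = [], []
    with cf.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its title so failures can be reported.
        futures = {pool.submit(fetch_article, link, title): title
                   for link, title in articles}
        for future in cf.as_completed(futures):
            title = futures[future]
            try:
                done.append(future.result())  # re-raises any worker exception
            except Exception as exc:
                failed.append((title, str(exc)))
    return sorted(done), failed


if __name__ == '__main__':
    sample = [
        ('https://example.com/a.html', 'article-a'),
        ('https://example.com/b.html', 'article-b'),
        ('ftp://example.com/c.html', 'article-c'),
    ]
    done, failed = crawl_all(sample)
    print(done)
    print(failed)
```

Because the work is I/O-bound (waiting on HTTP responses), threads can overlap the waiting time even under Python's global interpreter lock.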

Vmgirls sitemap: https://www.vmgirls.com/sitemap.html
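The sitemap step can be illustrated offline. The sketch below is an assumption-laden stand-in: it uses the standard library's `HTMLParser` instead of BeautifulSoup so that it runs without extra packages, the sample markup is made up (the real sitemap may be structured differently), and `LinkCollector` and `list_articles` are hypothetical names. It collects article links and titles, then gives repeated titles a numeric suffix so each article gets its own folder name.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

SITE = 'https://www.vmgirls.com/'

# Made-up markup for illustration; the real sitemap structure may differ.
SAMPLE_SITEMAP = '''
<h3>Articles</h3>
<ul>
  <li><a href="1.html" title="Spring">Spring</a></li>
  <li><a href="/2.html" title="Summer">Summer</a></li>
  <li><a href="https://www.vmgirls.com/3.html" title="Spring">Spring</a></li>
</ul>
'''


class LinkCollector(HTMLParser):
    """Collect (href, title) from every <a> that sits inside a <ul>."""

    def __init__(self):
        super().__init__()
        self.ul_depth = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'ul':
            self.ul_depth += 1
        elif tag == 'a' and self.ul_depth > 0:
            attrs = dict(attrs)
            if attrs.get('href') and attrs.get('title'):
                self.links.append((attrs['href'], attrs['title']))

    def handle_endtag(self, tag):
        if tag == 'ul' and self.ul_depth > 0:
            self.ul_depth -= 1


def list_articles(html, site=SITE):
    """Return [absolute_link, folder_name] pairs with unique folder names."""
    parser = LinkCollector()
    parser.feed(html)
    seen = {}  # title -> number of times it has appeared so far
    articles = []
    for href, title in parser.links:
        seen[title] = seen.get(title, 0) + 1
        # A repeated title gets its occurrence count appended, e.g. "Spring2".
        folder = title if seen[title] == 1 else f'{title}{seen[title]}'
        articles.append([urljoin(site, href), folder])
    return articles


if __name__ == '__main__':
    for link, folder in list_articles(SAMPLE_SITEMAP):
        print(link, folder)
```

`urljoin` is used here so that relative, root-relative, and absolute `href` values all resolve to a valid absolute link.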


Full Code

GitHub: https://github.com/XavierJiezou/python-vmgirls-crawl

import os
import time
import requests
from tqdm import tqdm
from bs4 import BeautifulSoup
import concurrent.futures as cf
from urllib.parse import urljoin
from fake_useragent import UserAgent


class VmgirlsDownloader():
    def __init__(self):
        self.root = 'vmgs'  # root folder for all downloads
        os.makedirs(self.root, exist_ok=True)
        self.site = 'https://www.vmgirls.com/'
        self.sitemap = 'https://www.vmgirls.com/sitemap.html'  # crawl the article list from the sitemap
        self.headers = {'referer': self.site, 'user-agent': UserAgent().random}
        self.page()
        self.main()

    def page(self):
        resp = requests.get(self.sitemap, headers=self.headers)
        time.sleep(5)
        soup = BeautifulSoup(resp.content, 'lxml')
        temp = soup.select('h3 + ul li a')  # locate the article list
        articles = []
        temp_dict = {}  # how many times each title has been seen
        for item in temp:
            # urljoin handles relative, root-relative, and absolute hrefs
            href = urljoin(self.site, item.get('href'))
            title = item.get('title')
            if temp_dict.get(title) is None:
                temp_dict[title] = 1
            else:
                temp_dict[title] += 1
                title += str(temp_dict[title])  # naming scheme for duplicate folders
            os.makedirs(os.path.join(self.root, title), exist_ok=True)
            articles.append([href, title])
        self.articles = articles

    def save(self, img_link, img_path):
        resp = requests.get(img_link, headers=self.headers)
        time.sleep(3)
        with open(img_path, 'wb') as f:
            f.write(resp.content)

    def down(self, article_link, article_title):
        resp = requests.get(article_link, headers=self.headers)
        time.sleep(5)
        soup = BeautifulSoup(resp.content, 'lxml')
        imgs = soup.select('div.nc-light-gallery img')  # locate all images in the article
        # Number images by position so that a re-run skips existing files
        # and still downloads the ones that are missing.
        for name, item in enumerate(tqdm(imgs, desc=article_title), start=1):
            if 'https:' not in item.get('src'):
                img_link = 'https:'+item.get('src')  # src has no scheme: prepend it
            else:
                img_link = 'https:'+item.get('srcset').split(' ')[0]  # otherwise use the first srcset entry
            img_path = f'{self.root}/{article_title}/{name}.{img_link.split(".")[-1]}'
            if not os.path.exists(img_path):
                self.save(img_link, img_path)

    def main(self):
        # One task per article; the default pool size is chosen by the executor
        with cf.ThreadPoolExecutor() as tp:
            for article_link, article_title in self.articles:
                tp.submit(self.down, article_link, article_title)


if __name__ == '__main__':
    VmgirlsDownloader()
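
The script above builds image URLs by string concatenation and takes the file extension from the text after the last dot in the URL, which would misbehave if a URL carried a query string. A possible alternative is sketched below using `urllib.parse`. The helper names `resolve_image_url` and `image_extension` are hypothetical, the `example.com` URLs are placeholders, and the rule "prefer `src`, fall back to `srcset`" is an assumption rather than the original selection logic.

```python
import os
from urllib.parse import urljoin, urlparse


def resolve_image_url(src, srcset=None, base='https://www.vmgirls.com/'):
    """Pick an image URL from src/srcset and make it absolute."""
    candidate = src
    if not candidate and srcset:
        # srcset looks like "url1 300w, url2 600w"; take the first URL.
        parts = srcset.split(',')[0].split()
        candidate = parts[0] if parts else None
    if not candidate:
        return None
    # urljoin also expands scheme-less "//host/path" references.
    return urljoin(base, candidate)


def image_extension(url, default='jpg'):
    """Return the extension from the URL path, ignoring any query string."""
    ext = os.path.splitext(urlparse(url).path)[1].lstrip('.').lower()
    return ext or default


if __name__ == '__main__':
    url = resolve_image_url('//example.com/image/a.JPEG?size=large')
    print(url, image_extension(url))
```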

Results

Download the 1.62 GB image set: OneDrive | Baidu Netdisk (extraction code: 2233) | Tianyi Cloud

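Item: Description
Target site: https://www.vmgirls.com/ (唯美女生)
Crawl date: April 28, 2021
Total images: 17,601
Total size: 1,742,902,332 bytes (about 1.62 GB)
Image types: png, jpg, and jpeg

Single-Image Preview

[Image: single-image preview]

Multi-Image Preview

[Image: multi-image preview]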

References

https://github.com/psf/requests
https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
https://github.com/hellysmile/fake-useragent
https://github.com/tqdm/tqdm
