In this lesson, we will learn a new crawling template: CrawlSpider.
'''
Basic usage of the CrawlSpider class

Switch templates:
    scrapy genspider -t crawl <spider_name> <url_to_crawl>

LinkExtractor: extracts links
    Parameters:
        allow: URLs matching the regular expression are extracted
        restrict_xpaths: only links inside regions matching the XPath are extracted
Rule: binds a LinkExtractor to crawling behavior

Workflow: import the module (from scrapy.linkextractors import LinkExtractor)

CrawlSpider class source code:
    extract_links
'''
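The `allow` parameter filters candidate links by regular-expression search against each URL. A minimal stdlib sketch of that filtering step (the sample URLs here are made up for illustration; the real extraction is done by LinkExtractor against a response):

```python
import re

# allow-style pattern: keep only 163.com URLs with a two-digit path segment
pattern = re.compile(r'\.163\.com/\d{2}/')

candidate_urls = [
    "https://news.163.com/23/article.html",  # matches: .163.com/ + two digits
    "https://news.163.com/index.html",       # no two-digit path segment
    "https://example.com/23/other.html",     # wrong domain
]

# LinkExtractor keeps each URL for which the allow regex finds a match
matched = [url for url in candidate_urls if pattern.search(url)]
print(matched)  # → ['https://news.163.com/23/article.html']
```

Only the first URL survives; `restrict_xpaths` would additionally limit which page regions links are collected from before this regex filter is applied.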
"""
Case study: NetEase (163) News
scrapy startproject new
scrapy genspider -t crawl new_spider <domain>
"""
Next, let's work through a small example.
Spider code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class NewSpiderSpider(CrawlSpider):
    name = 'new_spider'
    # allowed_domains = ['ddd']
    start_urls = ['https://www.163.com/']
    rules = (
        Rule(LinkExtractor(allow='http.*?://.*?\.163\.com/\d{2}/\