Scrapy is a web scraping framework implemented in Python. This article gives an overview of Scrapy, walks through its installation, and uses a simple scrapy shell example that retrieves a page title to give a first feel for how Scrapy is used.
As a web scraping framework, Scrapy is well suited to crawling websites and extracting structured data. Compared with Apache Nutch, which targets general-purpose search, it is smaller and more flexible. An overview is given in the table below:
Item                      Description
Official website          https://scrapy.org/
Open/closed source        Open source
Source repository         https://github.com/scrapy/scrapy
Implementation language   Python
Current stable version    2.0.1 (2020/03/18)

Scrapy can be installed directly with pip; run the following command:
Command: pip install scrapy
This article uses an environment where Python 3 and Python 2 coexist, so pip3 is used for the installation. The installation log is shown below:
liumiaocn:scrapy liumiao$ pip3 install scrapy
Collecting scrapy
  Downloading ... (omitted)
Successfully built protego PyDispatcher zope.interface
Installing collected packages: six, pycparser, cffi, cryptography, pyasn1, pyasn1-modules, attrs, service-identity, protego, cssselect, pyOpenSSL, w3lib, PyDispatcher, incremental, constantly, Automat, PyHamcrest, zope.interface, idna, hyperlink, Twisted, lxml, parsel, queuelib, scrapy
Successfully installed Automat-20.2.0 PyDispatcher-2.0.5 PyHamcrest-2.0.2 Twisted-20.3.0 attrs-19.3.0 cffi-1.14.0 constantly-15.1.0 cryptography-2.8 cssselect-1.1.0 hyperlink-19.0.0 idna-2.9 incremental-17.5.0 lxml-4.5.0 parsel-1.5.2 protego-0.1.16 pyOpenSSL-19.1.0 pyasn1-0.4.8 pyasn1-modules-0.2.8 pycparser-2.20 queuelib-1.5.0 scrapy-2.0.1 service-identity-18.1.0 six-1.14.0 w3lib-1.21.0 zope.interface-5.0.1
liumiaocn:scrapy liumiao$
Version check

liumiaocn:scrapy liumiao$ scrapy -h
Scrapy 2.0.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
liumiaocn:scrapy liumiao$
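The installed version can also be confirmed from Python itself. A minimal check, run with the same python3 used for the pip3 install above:

import scrapy

# prints the installed Scrapy version, e.g. 2.0.1 in the environment above
print(scrapy.__version__)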
Getting the page title

A crawler is, in essence, a program that processes HTML. The simplest way to confirm that Scrapy works is through the scrapy shell, which provides an interactive console for scraping data and is also handy for debugging extraction logic.
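The same console can even be opened from inside a running spider for debugging, via scrapy.shell.inspect_response. A minimal sketch (the parse callback shown here is hypothetical):

from scrapy.shell import inspect_response

def parse(self, response):
    # drops into the interactive shell with this `response` preloaded,
    # just like running `scrapy shell <url>` by hand
    inspect_response(response, self)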
Example: retrieve the title of the Scrapy official homepage, https://scrapy.org/.
Step 1: Launch the scrapy shell by running the following command:
Command: scrapy shell https://scrapy.org/
liumiaocn:scrapy liumiao$ scrapy shell https://scrapy.org/
2020-03-28 05:38:09 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: scrapybot)
2020-03-28 05:38:09 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.5 (default, Nov 1 2019, 02:16:32) - [Clang 11.0.0 (clang-1100.0.33.8)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Darwin-19.2.0-x86_64-i386-64bit
2020-03-28 05:38:09 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-03-28 05:38:09 [scrapy.crawler] INFO: Overridden settings:
{'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'LOGSTATS_INTERVAL': 0}
2020-03-28 05:38:09 [scrapy.extensions.telnet] INFO: Telnet Password: 5e36afd357190e93
2020-03-28 05:38:09 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2020-03-28 05:38:09 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-03-28 05:38:09 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-03-28 05:38:09 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-03-28 05:38:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-03-28 05:38:09 [scrapy.core.engine] INFO: Spider opened
2020-03-28 05:38:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://scrapy.org/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
[s]   item       {}
[s]   request    <GET https://scrapy.org/>
[s]   response   <200 https://scrapy.org/>
[s]   settings   <scrapy.settings.Settings object at 0x...>
[s]   spider     <DefaultSpider 'default' at 0x...>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>>
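The banner lists the objects and shortcuts available in the session: response already holds the downloaded page, fetch() loads another URL into the same session, and view() opens the page as Scrapy received it in a browser. For example (the URL here is illustrative):

>>> fetch('https://docs.scrapy.org/')   # replaces `request` and `response` with the new page
>>> view(response)                      # open the downloaded page in a browser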
Step 2: Get the title with response.css

Type response.css('title') and press Enter; the output shows the selector that matched the title element:
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Scrapy | A Fast and Powerful Scra...'>]
>>>
Then extract the matched title element itself:
>>> response.css('title').extract_first()
'<title>Scrapy | A Fast and Powerful Scraping and Web Crawling Framework</title>'
>>>
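Note that extract_first() returns the whole <title> element. To get just the text, Scrapy selectors support the ::text pseudo-element, and current Scrapy versions recommend get() as the equivalent of extract_first(). Run in the same shell:

>>> response.css('title::text').get()
'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

Outside the shell, the same extraction can be packaged as a self-contained spider and run with the runspider command listed earlier. A minimal sketch (the file and spider names are arbitrary):

# title_spider.py -- run with: scrapy runspider title_spider.py
import scrapy

class TitleSpider(scrapy.Spider):
    name = "title"
    start_urls = ["https://scrapy.org/"]

    def parse(self, response):
        # same selector as in the shell session above
        yield {"title": response.css("title::text").get()}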