In the previous article we looked at how to run a spider under the Scrapy framework without creating a project. This article continues with the same example and walks through creating a spider project and running it.
Step 1: Create the project
Create a spider project with the following command:
Command: scrapy startproject <project name>
Here we create a project named webspider:
liumiaocn:~ liumiao$ scrapy startproject webspider
New Scrapy project 'webspider', using template directory '/usr/local/lib/python3.7/site-packages/scrapy/templates/project', created in:
    /Users/liumiao/webspider

You can start your first spider with:
    cd webspider
    scrapy genspider example example.com
liumiaocn:~ liumiao$
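As the startproject output above hints, scrapy genspider can generate a spider skeleton from a template instead of writing the file by hand. As an aside (in this article we create the file manually in step 2), generating a spider named myspider for the domain scrapy.org would look like this:

Command: scrapy genspider myspider scrapy.org

This should create a myspider.py under the spiders directory, prefilled with a Spider subclass.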
Project skeleton
The default project skeleton looks like this:
liumiaocn:~ liumiao$ cd webspider/
liumiaocn:webspider liumiao$ ls
scrapy.cfg	webspider
liumiaocn:webspider liumiao$ tree .
.
├── scrapy.cfg
└── webspider
    ├── __init__.py
    ├── __pycache__
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── __pycache__

4 directories, 7 files
liumiaocn:webspider liumiao$
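The generated files each have a role: items.py defines item classes, middlewares.py holds spider and downloader middlewares, pipelines.py holds item pipelines, and settings.py holds the project-wide configuration that shows up later in the crawl log as "Overridden settings". As a sketch of the parts relevant here (the generated file also contains many commented-out options), settings.py starts out roughly like this:

# webspider/settings.py (excerpt of the generated defaults)
BOT_NAME = 'webspider'

SPIDER_MODULES = ['webspider.spiders']
NEWSPIDER_MODULE = 'webspider.spiders'

# Scrapy fetches and obeys robots.txt by default; this is why the
# crawl log in step 4 starts with a request for robots.txt.
ROBOTSTXT_OBEY = True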
Step 2: Create the spider
Scrapy offers many features for building and working with spiders. Since the point of this article is to walk through the workflow from creating a spider application to running it, we keep using the simple spider from the previous article. Create the following spider file under the spiders directory:
liumiaocn:webspider liumiao$ ls
scrapy.cfg	webspider
liumiaocn:webspider liumiao$ cd webspider/spiders/
liumiaocn:spiders liumiao$ ls
__init__.py	__pycache__
liumiaocn:spiders liumiao$ vi myspider.py
liumiaocn:spiders liumiao$ cat myspider.py
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://scrapy.org/']

    def parse(self, response):
        for title in response.css('title'):
            yield {'title': title.get()}
liumiaocn:spiders liumiao$
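A note on the parse loop: since a page has a single <title>, an equivalent and slightly more direct variant (a sketch, assuming the same page) selects the title text with the ::text pseudo-selector and get():

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://scrapy.org/']

    def parse(self, response):
        # 'title::text' selects the text inside <title>; get() returns
        # the first match as a string (or None if nothing matches)
        yield {'title': response.css('title::text').get()}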
Step 3: Verify the spider
Spiders created under the spiders directory in the default way can be checked with the following command:
Command: scrapy list
liumiaocn:spiders liumiao$ scrapy list
myspider
liumiaocn:spiders liumiao$
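As an aside, scrapy list is not the only sanity check the framework offers: scrapy check runs the contract tests embedded in spider docstrings. Our spider defines no contracts, so the command should simply report that zero contracts were run:

Command: scrapy check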
Step 4: Run the spider
Run the spider with the following command:
Command: scrapy crawl <spider name>
An example run is shown below:
liumiaocn:spiders liumiao$ scrapy crawl myspider
2020-03-28 07:15:42 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: webspider)
2020-03-28 07:15:42 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.5 (default, Nov 1 2019, 02:16:32) - [Clang 11.0.0 (clang-1100.0.33.8)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Darwin-19.2.0-x86_64-i386-64bit
2020-03-28 07:15:42 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-03-28 07:15:42 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'webspider',
 'NEWSPIDER_MODULE': 'webspider.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['webspider.spiders']}
2020-03-28 07:15:42 [scrapy.extensions.telnet] INFO: Telnet Password: c84dc6ab9f452e91
2020-03-28 07:15:42 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-03-28 07:15:42 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-03-28 07:15:42 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-03-28 07:15:42 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-03-28 07:15:42 [scrapy.core.engine] INFO: Spider opened
2020-03-28 07:15:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-03-28 07:15:42 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-03-28 07:15:43 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://scrapy.org/robots.txt> (referer: None)
2020-03-28 07:15:43 [protego] DEBUG: Rule at line 6 without any user agent to enforce it on.
2020-03-28 07:15:43 [protego] DEBUG: Rule at line 7 without any user agent to enforce it on.
2020-03-28 07:15:43 [protego] DEBUG: Rule at line 8 without any user agent to enforce it on.
2020-03-28 07:15:43 [protego] DEBUG: Rule at line 9 without any user agent to enforce it on.
2020-03-28 07:15:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://scrapy.org/> (referer: None)
2020-03-28 07:15:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://scrapy.org/>
{'title': 'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'}
2020-03-28 07:15:43 [scrapy.core.engine] INFO: Closing spider (finished)
2020-03-28 07:15:43 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 430,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 15996,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/403': 1,
 'elapsed_time_seconds': 1.268314,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 3, 27, 23, 15, 43, 636638),
 'item_scraped_count': 1,
 'log_count/DEBUG': 7,
 'log_count/INFO': 10,
 'memusage/max': 50221056,
 'memusage/startup': 50221056,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/403': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 3, 27, 23, 15, 42, 368324)}
2020-03-28 07:15:43 [scrapy.core.engine] INFO: Spider closed (finished)
liumiaocn:spiders liumiao$
In this output we can find the following line:
{'title': 'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'}
which shows that fetching the page title this way succeeded. Note that the 403 in the log comes from the request for https://scrapy.org/robots.txt (see robotstxt/response_status_count/403 in the stats), not from the page itself, which returned 200.
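Finally, instead of picking items out of the log, Scrapy's feed exports can write them to a file directly. A minimal example (the file name titles.json is arbitrary; -o appends to an existing file, and the format is inferred from the extension):

Command: scrapy crawl myspider -o titles.json

After the run, titles.json should contain the scraped item as a JSON list.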