Simple Scraping of Sohu News Data

壹小俊 · Published: 2019-05-08 09:03:20

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SouhuSpiderSpider(CrawlSpider):
    name = 'souhu_spider'
    # allowed_domains takes bare domain names, not URLs with a scheme or path.
    allowed_domains = ['sohu.com']
    start_urls = ['http://www.sohu.com/']

    rules = (
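        # Raw strings keep regex escapes such as \. and \w intact.
        # Rule(LinkExtractor(allow=r'http://.*?\.sohu\.com/\?\w+'), follow=True),
        # Follow links with two path segments under www.sohu.com and parse each page.
        Rule(LinkExtractor(allow=r'http://www\.sohu\.com/\w+/\w+?\w+'), callback='parse_item', follow=True),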
    )

    def parse_item(self, response):
        """Extract the headline and publish time from an article page."""
        item = {}
        # extract_first() returns None when the XPath matches nothing.
        item['title'] = response.xpath('//div[@class="text-title"]/h1/text()').extract_first()
        item['time'] = response.xpath('//span[@class="time"]/text()').extract_first()
        # item['article'] = response.xpath('//article[@class="article"]/p/text()').extract()
        return item
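
The `allow` pattern in the rule decides which links the spider follows, so it is worth checking on its own before running a crawl. The sketch below uses only the standard `re` module and assumes that `LinkExtractor` tests `allow` patterns with a regex search against each URL. The sample URLs and the `is_followed` helper are made up for illustration and are not taken from Sohu or from Scrapy. Note that the `?` in the pattern is a lazy quantifier on `\w+`, not a literal question mark, so the last path segment simply needs at least two word characters.

```python
import re

# Pattern copied from the spider's Rule.
ALLOW = re.compile(r'http://www\.sohu\.com/\w+/\w+?\w+')

# Hypothetical URLs, invented for this check.
SAMPLES = [
    'http://www.sohu.com/a/123456_789',   # two path segments
    'http://www.sohu.com/',               # homepage only
    'http://news.sohu.com/a/123456_789',  # different subdomain
    'http://www.sohu.com/a/1',            # second segment too short
]


def is_followed(url):
    """Return True if the URL matches the allow pattern (assumed search semantics)."""
    return ALLOW.search(url) is not None


if __name__ == '__main__':
    for url in SAMPLES:
        print(url, is_followed(url))
```

Only the first sample matches. If the intent was to match a literal query string, the `?` would need to be escaped as `\?`, as in the commented-out rule.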

 
