5. Crawlers: Extraction

Author: 奇巧小软件, published 2022-08-24 15:16:44, 13 views

Once the HTML has been downloaded, how do we parse it and extract the elements inside?

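I. BeautifulSoup

BeautifulSoup is only a shell. It wraps several parsing engines (such as html.parser, lxml and html5lib) behind one simpler interface; the actual parsing is done by whichever engine you select.

When using it, just look things up in the documentation. The official bs4 documentation (Chinese edition):

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

My earlier long write-up on BeautifulSoup is also worth a look.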
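To make the "shell around a parsing engine" idea concrete, here is a minimal sketch. The HTML snippet and the `link-title` class are made up for illustration; the second argument to `BeautifulSoup` is where the engine is chosen.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = """
<html><body>
  <a class="link-title" href="/post/1">First post</a>
  <a class="link-title" href="/post/2">Second post</a>
</body></html>
"""

# The second argument names the parsing engine. "html.parser" ships with
# Python; "lxml" or "html5lib" could be swapped in if they are installed.
soup = BeautifulSoup(html, "html.parser")

# Collect the text and the href of every <a class="link-title">.
anchors = soup.find_all("a", class_="link-title")
titles = [a.get_text() for a in anchors]
links = [a["href"] for a in anchors]

print(titles)
print(links)
```

Switching the engine should not require touching the `find_all` calls, which is the point of the wrapper.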

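II. XPath (key topic)

The most commonly used parsing library for XPath is lxml.

Scrapy's Selector class is a wrapper around lxml (via the parsel library) and is more convenient to use.

You can think of Selector as a parsing library whose job is to extract data.

xpath is a method of the Selector object. It uses path expressions to navigate XML and HTML: give it a path and it locates the matching elements.

    from scrapy import Selector

    sel = Selector(text='text to parse')
    tag_list = sel.xpath('path').extract()  # extract() returns a list of strings
    if tag_list:
        text = tag_list[0]  # a path ending in /text() yields the text value directly

    # Other common patterns:
    title_list = sel.xpath('//a[contains(@class,"link-title")]')  # SelectorList, not yet extracted
    sel.xpath('//a[contains(@class,"link-title")]/text()').extract()  # get the text
    sel.xpath('//a[contains(@class,"link-title")]/@href').extract()  # get an attribute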

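Because an XPath expression is just a string, it lets us turn parsing into something configurable: the extraction rule can live in configuration instead of being hard-coded.

    name_xpath = '...'  # fetched from the database; this is the only thing to configure
    name = ''
    tag_list = sel.xpath(name_xpath).extract()
    if tag_list:
        name = tag_list[0]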

XPath provides many built-in functions, which can be looked up at this address:

https://developer.mozilla.org/en-US/docs/Web/XPath/Functions

(Figure not available. In the figure, article refers to the HTML tag of that name.)

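III. CSS selectors

    from scrapy import Selector

    sel = Selector(text='text to parse')
    info_tag = sel.css("#id")
    # Matches <a> elements whose href attribute contains "imooc".
    # ::text yields their link text; use ::attr(href) to get the URL itself.
    course_url = sel.css("a[href*='imooc']::text").extract()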

css is also a method of the Selector object, and just like xpath it can be turned into configurable parsing.

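Summary: xpath or css is the recommended choice; which of the two to use mostly comes down to preference and familiarity.

This article was published to multiple platforms via mdnice.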
