37 Web Scraping - The Four Object Types in BeautifulSoup4

杨林伟, published 2019-08-30 09:19:38

Beautiful Soup converts a complex HTML document into a tree structure. Every node in the tree is a Python object, and all of these objects fall into four types:

Tag
NavigableString
BeautifulSoup
Comment

1. Tag

Put simply, a Tag corresponds to a single tag in the HTML, for example:

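<head><title>The Dormouse's story</title></head>
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>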

Each of the HTML tags above (title, head, a, p, and so on), together with the content it encloses, is a Tag. Let's use Beautiful Soup to fetch some Tags:

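from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object; naming the parser explicitly avoids a warning
soup = BeautifulSoup(html, "html.parser")

print(soup.title)
# <title>The Dormouse's story</title>

print(soup.head)
# <head><title>The Dormouse's story</title></head>

print(soup.a)
# <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>

print(soup.p)
# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

print(type(soup.p))
# <class 'bs4.element.Tag'>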

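Writing soup followed by a tag name is an easy way to get at these tags, and the objects returned are of type bs4.element.Tag. Note, however, that this only finds the first matching tag in the whole document. Querying for all matching tags is covered later.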
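As a brief preview of the "first match only" behaviour, here is a minimal sketch. It assumes a shortened version of the sample document (the html string below is made up for illustration) and uses the standard find() and find_all() methods of bs4.

```python
from bs4 import BeautifulSoup

# A shortened, illustrative document with three <a> tags
html = """
<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

first_link = soup.a              # attribute access returns only the first <a>
all_links = soup.find_all("a")   # find_all returns every matching tag

print(first_link["id"])               # link1
print(first_link is soup.find("a"))   # True: soup.a is shorthand for soup.find("a")
print([a["id"] for a in all_links])   # ['link1', 'link2', 'link3']
```

So soup.a is convenient when only the first match is needed, while find_all is the tool for collecting every match.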

A Tag has two important attributes: name and attrs.

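print(soup.name)
# [document]  # the soup object itself is special: its name is [document]

print(soup.head.name)
# head  # for any other tag, the value is the name of the tag itself

print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}
# This prints all attributes of the p tag; the result is a dictionary.

print(soup.p['class'])  # or: soup.p.get('class')
# ['title']  # get() with the attribute name returns the same value when the attribute exists

soup.p['class'] = "newClass"
print(soup.p)  # attributes and content can be modified
# <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

del soup.p['class']  # an attribute can also be deleted
print(soup.p)
# <p name="dromouse"><b>The Dormouse's story</b></p>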
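The two ways of reading an attribute differ only when the attribute is missing. The sketch below uses a one-line document made up for illustration; it also shows why class comes back as a list while name is a plain string (class is a multi-valued attribute in HTML).

```python
from bs4 import BeautifulSoup

# Illustrative one-tag document; the extra class "big" is an assumption for the demo
soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', "html.parser")
p = soup.p

classes = p["class"]       # multi-valued attribute, returned as a list
name_attr = p["name"]      # ordinary attribute, returned as a string
missing = p.get("id")      # get() returns None for a missing attribute

print(classes)    # ['title', 'big']
print(name_attr)  # dromouse
print(missing)    # None

# Indexing a missing attribute raises KeyError, just like a dict
try:
    p["id"]
    raised = False
except KeyError:
    raised = True
print(raised)     # True
```

Using get() is usually the safer choice when scraping pages where an attribute may or may not be present.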

2. NavigableString

Now that we have the tag itself, how do we get the text inside it? Simple: use .string. For example:

print(soup.p.string)
# The Dormouse's story

print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>
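One caveat worth knowing: .string only works when a tag has a single child (possibly nested). The sketch below, on a small document made up for illustration, shows .string returning None for a tag with mixed content, and get_text() as an alternative that joins all the text.

```python
from bs4 import BeautifulSoup

# Illustrative document: <p> has a single nested child, <div> has mixed content
html = "<p><b>The Dormouse's story</b></p><div>one <i>two</i></div>"
soup = BeautifulSoup(html, "html.parser")

single = soup.p.string        # follows the single-child chain p -> b -> text
mixed = soup.div.string       # more than one child, so there is no single string
joined = soup.div.get_text()  # concatenates all text inside the tag

print(single)                   # The Dormouse's story
print(mixed)                    # None
print(joined)                   # one two
print(isinstance(single, str))  # True: NavigableString is a str subclass
```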

3. BeautifulSoup

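A BeautifulSoup object represents the content of a whole document. Most of the time you can treat it as a Tag object, since it is a special kind of Tag. To get a feel for it, we can look at the type of its name, the name itself, and its attributes: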

print(type(soup.name))
# <class 'str'>

print(soup.name)
# [document]

print(soup.attrs)  # the document itself has no attributes
# {}
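The claim that a BeautifulSoup object is a special Tag can be checked directly. This is a minimal sketch on a tiny document made up for illustration; it relies only on the Tag class exported by bs4.element.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative document
soup = BeautifulSoup("<html><head><title>t</title></head></html>", "html.parser")

is_tag = isinstance(soup, Tag)   # BeautifulSoup is a subclass of Tag
title = soup.find("title")       # so Tag methods such as find() work on it

print(is_tag)       # True
print(soup.name)    # [document]
print(soup.attrs)   # {}
print(title.name)   # title
```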

4. Comment

A Comment object is a special type of NavigableString. When printed, its content does not include the comment delimiters.

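print(soup.a)
# <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>

print(soup.a.string)
#  Elsie   # the <!-- and --> delimiters are not part of the output

print(type(soup.a.string))
# <class 'bs4.element.Comment'>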
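Because a comment prints like ordinary text, it is easy to mistake it for real content. One way to guard against that, sketched here on a two-link document made up for illustration, is to check the type of .string before using it. The results list and its tuple layout are just choices for this demo.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Illustrative document: the first link contains a comment, the second real text
html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

results = []
for a in soup.find_all("a"):
    text = a.string
    # Comment is a NavigableString subclass, so test for it explicitly
    kind = "comment" if isinstance(text, Comment) else "text"
    results.append((a["id"], kind, text.strip()))

for link_id, kind, value in results:
    print(link_id, kind, value)
# link1 comment Elsie
# link2 text Lacie
```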
