默默爬行的虫虫, published 2021-06-29 20:01:39, views: 4

Converting HTML to ipynb in Jupyter
from bs4 import BeautifulSoup
import json

# Set html_path to the exported HTML file and ipynb_path to the notebook to write.
html_path = "A.html"
ipynb_path = "B.ipynb"

# Read the local HTML file; the with-block closes it even if parsing fails.
with open(html_path, encoding="utf8") as f:
    text = f.read()

soup = BeautifulSoup(text, "lxml")
# Preview the first <div> to check that the file parsed.
print(soup.div)

notebook = {"nbformat": 4, "nbformat_minor": 1, "cells": [], "metadata": {}}
for d in soup.find_all("div"):
    # get("class", []) returns the div's CSS classes, or [] if it has none.
    for clas in d.get("class", []):
        if clas == "input_area":
            # Code cell: keep only the text of the input area.
            cell = {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": [d.get_text()],
            }
            notebook["cells"].append(cell)
        elif clas == "text_cell_render":
            # Markdown cell: keep the inner HTML of the rendered text.
            cell = {
                "cell_type": "markdown",
                "metadata": {},
                "source": [d.decode_contents()],
            }
            notebook["cells"].append(cell)

# Write the notebook as UTF-8 JSON.
with open(ipynb_path, "w", encoding="utf8") as f:
    json.dump(notebook, f)
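The script above stores each cell's text as a one-element list. The notebook format also accepts a list of lines, each keeping its trailing newline, which is the shape Jupyter itself writes. The following is a minimal sketch of building cells that way with only the standard library. `make_cell` and `make_notebook` are hypothetical helper names introduced here for illustration; they are not part of the original script or of any library.

```python
import json


def make_cell(cell_type, text):
    """Build an nbformat 4 cell dict (hypothetical helper, not a library API)."""
    cell = {
        "cell_type": cell_type,
        "metadata": {},
        # splitlines(keepends=True) keeps each line's trailing newline.
        "source": text.splitlines(keepends=True),
    }
    if cell_type == "code":
        # Only code cells carry an execution count and outputs.
        cell["execution_count"] = None
        cell["outputs"] = []
    return cell


def make_notebook(cells):
    """Wrap a list of cells in the same top-level dict the script uses."""
    return {"nbformat": 4, "nbformat_minor": 1, "metadata": {}, "cells": cells}


if __name__ == "__main__":
    nb = make_notebook([
        make_cell("markdown", "<h1>Title</h1>"),
        make_cell("code", "x = 1\nprint(x)\n"),
    ])
    print(json.dumps(nb, indent=1))
```

In the conversion loop, `make_cell("code", d.get_text())` and `make_cell("markdown", d.decode_contents())` could stand in for the hand-built dicts, assuming the same `d` tags as in the script.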

