To parse HTML with Python, use the Beautiful Soup library (package name: beautifulsoup4, import name: bs4).

Install

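# Any one of these is enough.
apt-get install python3-bs4      # Debian/Ubuntu system package (python-bs4 is the Python 2 build)
easy_install beautifulsoup4      # legacy installer; prefer pip
pip install beautifulsoup4
conda install beautifulsoup4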

Get started

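Pass a document string or an open file handle to the BeautifulSoup constructor to get a document object.
1. The document is converted to Unicode.
2. Beautiful Soup picks the most suitable parser available to parse the document; if you name a parser explicitly, it uses that one.
3. Beautiful Soup turns a complex HTML document into a tree of Python objects. Every node is one of four kinds: Tag, NavigableString, BeautifulSoup, and Comment.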
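The four object kinds can be seen in a small example. This is only a sketch: the markup in `html` is made up for illustration and does not come from any real page.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical sample markup, used only for illustration.
html = "<html><head><title>Demo</title></head><body><p>Hello<!-- note --></p></body></html>"

# Naming the parser explicitly avoids depending on whichever parser is installed.
soup = BeautifulSoup(html, "html.parser")

p = soup.p               # the first <p> element, a Tag
text = p.contents[0]     # the text "Hello", a NavigableString
comment = p.contents[1]  # the comment node, a Comment

for node in (soup, p, text, comment):
    print(type(node).__name__)
```

Comment is a subclass of NavigableString, so an `isinstance(node, NavigableString)` check matches both plain text and comments.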

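import requests
from bs4 import BeautifulSoup

target_url = "https://www.jianshu.com/"
headers = {"User-Agent": "Mozilla/5.0"}  # example value; the original does not define headers
response = requests.get(target_url, timeout=20, headers=headers)

# Build a bs4.BeautifulSoup object from the response bytes.
# from_encoding is optional; leave it out to let Beautiful Soup detect the encoding.
soup = BeautifulSoup(response.content, 'html.parser', from_encoding='UTF-8')
soup = BeautifulSoup(response.content, 'html.parser')

# Parse an HTML string with Python's built-in HTML parser.
html = "<html><head><title>The Dormouse's story</title></head></html>"  # sample markup
soup = BeautifulSoup(html, "html.parser")

# Pretty-print the tree as an indented string
print(soup.prettify())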

soup

soup.title

# Page title:
soup.title                # <title>The Dormouse's story</title>
soup.title.name           # 'title'
soup.title.string         # "The Dormouse's story"
soup.title.parent.name    # 'head'

soup.find()

Returns only the first matching element, or None if nothing matches.

# Look up a tag by its id attribute
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
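Because find() returns None when nothing matches, it is worth checking the result before using it. A minimal sketch follows; the markup is modeled on the output shown above, and the first two links are assumed for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup; only the link3 element appears in the notes above,
# the other two links are assumptions added for the example.
html = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")           # first <a> in document order
by_id = soup.find(id="link3")         # lookup by id attribute
missing = soup.find(id="no-such-id")  # no match, so this is None

print(first_link["id"])
print(by_id.get_text())
print(missing is None)
```

Calling an attribute or method on the None result raises AttributeError, which is a common failure when scraping pages whose layout changes.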

soup.find_all()

Returns a list of every matching tag (an empty list if nothing matches).

# list of <div> tags with the target class
soup.find_all("div", class_="toplist1-tr_1MWDu")

# list of all <a> tags
soup.find_all('a')

# list of <a> tags with the target class
soup.find_all('a', class_="opr-toplist1-subtitle_1uZgw")

# list of tags of any name with the target class
soup.find_all(class_="newlist")

# text of the first <span> inside li (a Tag) whose style attribute matches exactly
li.find('span', {'style':'float:right;'}).text
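The calls above can be tried end to end on a small document. This is a sketch under assumptions: the markup and the class name `item` are invented for the example, and only `newlist` and the `float:right;` style come from the notes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<ul class="newlist">
  <li><a class="item" href="/a">First</a><span style="float:right;">10</span></li>
  <li><a class="item" href="/b">Second</a><span style="float:right;">20</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns every match; class_ avoids the reserved word "class".
links = soup.find_all("a", class_="item")
hrefs = [a["href"] for a in links]

# find_all on the whole tree, then find() scoped to each <li>.
counts = []
for li in soup.find_all("li"):
    span = li.find("span", {"style": "float:right;"})
    if span is not None:  # guard against rows without the span
        counts.append(span.text)

print(hrefs)
print(counts)
print(len(soup.find_all(class_="newlist")))
```

Matching on a style string is exact, so `float:right;` does not match `float: right;`. A class or id is usually a sturdier selector when the page offers one.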

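soup.get_text()

Get the text content: returns all the text in the document (or under a tag) as a single string.

soup.get_text()
soup.text         # property with the same result; note that it is not callable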
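get_text() also accepts `separator` and `strip` arguments, which help when the raw text carries stray whitespace. A small sketch, with made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = "<div><h1> Title </h1><p>First <b>bold</b> line.</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Default: all text pieces concatenated exactly as they appear.
raw = soup.get_text()

# Strip each piece and join the pieces with a single space.
clean = soup.get_text(separator=" ", strip=True)

print(repr(raw))
print(repr(clean))
```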

element

Tag

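A Tag object corresponds to a tag in the original XML or HTML document. Tags have many methods and attributes, covered in detail under navigating the tree and searching the tree.
name: every tag has a name, available as .name. If you change a tag's name, the change shows up in any HTML generated by the current Beautiful Soup object.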

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)   # <class 'bs4.element.Tag'>
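Renaming a tag can be checked by serializing the document afterwards. A minimal sketch that reuses the one-element document from the notes:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

original_name = tag.name   # name as parsed
tag.name = "blockquote"    # rename the element in place

# The rename is visible in the markup generated from the soup object.
rendered = str(soup)
print(original_name)
print(rendered)
```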

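attributes: a tag can have any number of attributes.

# Example: <b class="boldest">
# has a "class" attribute whose value is "boldest". A tag's attributes are accessed the same way as dictionary entries:
tag['class']   # ['boldest'] (class is multi-valued, so Beautiful Soup returns a list)

# .attrs gives direct access to the attribute dictionary, which can be added to, deleted from, and modified.
tag.attrs      # {'class': ['boldest']}

# Attributes can be added, removed, or changed, again exactly as with a dictionary
tag['class'] = 'verybold'
tag['id'] = 1
# <b class="verybold" id="1">Extremely bold</b>

# Edit attributes
tag['id'] = 1   # add an attribute
del tag['id']   # remove an attribute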
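Indexing a missing attribute with `tag['x']` raises KeyError, as with a dictionary, so `tag.get()` is often the safer read. A short sketch with invented markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
soup = BeautifulSoup('<a class="btn primary" href="/home">Home</a>', "html.parser")
tag = soup.a

classes = tag["class"]      # multi-valued attribute, returned as a list
href = tag.get("href")      # single-valued attribute, returned as a string
missing = tag.get("title")  # absent attribute: get() returns None instead of raising

tag["title"] = "Go home"    # add an attribute
del tag["href"]             # remove an attribute

rendered = str(tag)
print(classes, href, missing)
print(rendered)
```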

Comment

Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but a few special objects remain. The one you are most likely to run into is the comment.
A Comment object is a special type of NavigableString:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>
comment
# 'Hey, buddy. Want to buy a used parser?'

When it appears as part of an HTML document, however, a Comment is output with special formatting:

print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>
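Since comments are string nodes, they can be collected with find_all() and a type check, then removed if they should not end up in the output. A sketch under assumptions: the markup is invented, and stripping comments is just one possible use.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup for illustration.
html = "<div><!-- build: 42 --><p>Visible text</p><!-- end --></div>"
soup = BeautifulSoup(html, "html.parser")

# Select every string node that is a Comment.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
texts = [c.strip() for c in comments]

# Detach each comment from the tree.
for c in comments:
    c.extract()

rendered = str(soup)
print(texts)
print(rendered)
```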

NavigableString

Strings usually sit inside tags. Beautiful Soup wraps the text inside a tag in the NavigableString class.

tag.string
# 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

A NavigableString behaves like a Python string (str in Python 3, unicode in Python 2) and additionally supports some of the features described under navigating and searching the tree. Call str() to convert a NavigableString to a plain string:

plain_string = str(tag.string)
plain_string
# 'Extremely bold'
type(plain_string)
# <class 'str'>

The string inside a tag cannot be edited in place, but it can be replaced with another string using replace_with().

tag.string.replace_with("No longer bold")
tag
# <b class="verybold">No longer bold</b>

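NavigableString supports most, but not all, of the features described under navigating and searching the tree. In particular, a string cannot contain anything (whereas a tag can contain strings or other tags), so strings do not support the .contents or .string attributes or the find() method. If you want to use a NavigableString outside Beautiful Soup, call str() on it to get an ordinary Python string. Otherwise the string keeps a reference to the whole Beautiful Soup parse tree even after you are done with Beautiful Soup, which wastes memory.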
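The difference between a NavigableString and its plain str copy can be checked directly. A minimal sketch; the variable names are chosen for the example.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
tag = soup.b

nav = tag.string   # NavigableString, still attached to the parse tree
plain = str(nav)   # plain Python str, detached from the tree

print(isinstance(nav, NavigableString))  # tree-aware string
print(nav.parent.name)                   # it knows which tag contains it
print(type(plain) is str)                # the copy is an ordinary str
print(hasattr(plain, "parent"))          # and has no link back to the tree

# The text inside a tag is replaced rather than edited in place.
tag.string.replace_with("No longer bold")
rendered = str(tag)
print(rendered)
```

Keeping only `plain` in long-lived data structures lets the parse tree be garbage collected once the soup object goes out of scope.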
