We start from the text of `demo.html`, obtained with the Requests library's `requests.get()` method. That text is stored in the variable `demo`, and we have `soup = BeautifulSoup(demo, "html.parser")`.
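For readers who want to follow along offline, here is a minimal sketch of that setup. It assumes the page content can be rebuilt from the outputs printed in this post, so the inline string stands in for the downloaded text and its whitespace may differ slightly from the real page; the names `demo` and `soup` follow the post.

```python
from bs4 import BeautifulSoup

# Assumption: an inline copy of demo.html, reconstructed from the outputs
# shown in this post. In the post itself, demo comes from requests.get().
demo = (
    '<html><head><title>This is a python demo page</title></head>\n'
    '<body>\n'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>\n'
    '<p class="course">Python is a wonderful general-purpose programming language. '
    'You can learn Python from novice to professional by tracking the following courses:\n'
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>\n'
    '</body></html>'
)

# Parse with the parser that ships with the Python standard library.
soup = BeautifulSoup(demo, "html.parser")

print(soup.title.string)       # text of the <title> tag
print(len(soup.find_all("a")))  # number of <a> tags in the page
```

With `soup` built this way, the calls in the rest of the post can be pasted into the same session.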
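The BeautifulSoup library provides a `find_all()` method that searches the `soup` variable for information. `find_all()` takes five parameters and returns a list holding the search results.
<>.find_all(name, attrs, recursive, string, **kwargs)
Below, each parameter of `find_all()` is explained in turn; the explanation of each call is given in its comment.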
name: a search string matched against tag names.
>>> soup.find_all('a')  # find <a> tags
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a', 'b'])  # find <a> and <b> tags
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(True)  # if name is True, every tag is returned
[<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>, <head><title>This is a python demo page</title></head>, <title>This is a python demo page</title>, <body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body>, <p class="title"><b>The demo python introduces several python courses.</b></p>, <b>The demo python introduces several python courses.</b>, <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> import re
>>> for tag in soup.find_all(re.compile('b')):  # find every tag whose name contains 'b' (use '^b' to require a leading 'b')
...     print(tag.name)
...
body
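b
attrs: a search string matched against a tag's class attribute value; a specific attribute can also be named explicitly.
>>> soup.find_all('p')  # attrs not used: find all <p> tags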
[<p class="title"><b>The demo python introduces several python courses.</b></p>, <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all('p', 'course')  # find <p> tags whose class attribute is 'course'
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> kv = {'class': 'course'}  # attrs can hold several attribute conditions to match
>>> soup('p', attrs=kv)  # soup(...) is shorthand for soup.find_all(...)
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
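The remark that `attrs` can hold several conditions can be sketched as follows. This is an illustrative example on a trimmed, hypothetical fragment of the demo page (the `href` attributes are omitted); it also uses the `class_` keyword, which the post does not mention but which Beautiful Soup accepts because `class` is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the <p class="course"> element of the demo page.
html = (
    '<p class="course">'
    '<a class="py1" id="link1">Basic Python</a> and '
    '<a class="py2" id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Every key in the attrs dict must match for a tag to be returned.
both = soup.find_all("a", attrs={"class": "py1", "id": "link1"})

# The conditions below are never true for the same tag, so nothing is returned.
mismatch = soup.find_all("a", attrs={"class": "py1", "id": "link2"})

# Keyword form for the class attribute.
by_class = soup.find_all("a", class_="py2")

print([t["id"] for t in both])
print(mismatch)
print([t["id"] for t in by_class])
```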
>>> soup.find_all(id='link1')  # find all tags whose id attribute is 'link1'
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id='link')  # find all tags whose id attribute is exactly 'link'; there are none
[]
>>> soup.find_all('a', id='link1')  # find all <a> tags whose id attribute is 'link1'
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id=re.compile('link'))  # find all tags whose id attribute contains 'link'
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
recursive: whether to search all descendants; defaults to True.
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
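>>> soup.find_all('a', recursive=False)  # the empty result shows that soup has no <a> among its direct children; the <a> tags are deeper descendants
[]
string: a search string matched against the string content between <>...</>.
>>> soup.find_all(string="Basic Python")  # search for the string "Basic Python"; the match must be exact and cover the entire content of <>...</>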
['Basic Python']
>>> soup.find_all(string=re.compile('Python'))  # return the complete content of every <>...</> string that contains "Python"
['Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n', 'Basic Python', 'Advanced Python']
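The difference between an exact string and a regular expression for `string` can be sketched on a smaller fragment. The HTML here is a shortened, hypothetical version of the demo paragraph. The sketch also shows that the results are text nodes (`NavigableString` objects) whose enclosing tag is reachable through `.parent`.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical, shortened version of the <p class="course"> paragraph.
html = (
    '<p class="course">Python is a wonderful language. '
    '<a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node.
exact = [str(s) for s in soup.find_all(string="Basic Python")]
partial = [str(s) for s in soup.find_all(string="Basic")]

# A compiled pattern is applied as a search, so it matches anywhere in the node.
nodes = soup.find_all(string=re.compile("Python"))
contains = [str(s) for s in nodes]

# Anchor the pattern to require a match at the start of the text node.
anchored = [str(s) for s in soup.find_all(string=re.compile("^Basic"))]

# Each result is a text node; .parent is the tag that encloses it.
parents = [s.parent.name for s in nodes]

print(exact, partial)
print(contains)
print(anchored)
print(parents)
```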
Because `find_all()` is used so often in bs4, the library provides a shorthand form.
`<tag>()` is equivalent to `<tag>.find_all()`
`soup()` is equivalent to `soup.find_all()`
>>> soup(string="Basic Python", id="link1")  # shorthand for the full find_all() form
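A quick way to convince yourself of this equivalence is to compare both spellings directly. The fragment below is a hypothetical, trimmed copy of the demo page, used only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed copy of the demo page.
html = (
    '<html><body><p class="course">'
    '<a class="py1" id="link1">Basic Python</a> and '
    '<a class="py2" id="link2">Advanced Python</a>.</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Calling the soup object is the same as calling soup.find_all().
same_on_soup = soup("a") == soup.find_all("a")

# The same holds for any tag object, here <body>.
body = soup.body
same_on_tag = body("a") == body.find_all("a")

# Arguments are passed through unchanged.
basic = soup("a", string="Basic Python")

print(same_on_soup, same_on_tag)
print([t["id"] for t in basic])
```

The shorthand is convenient in an interactive session; in longer scripts the explicit `find_all()` spelling is arguably easier to read.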
`find_all()` also has seven commonly used companion methods, all of which take the same parameters as `find_all()`.
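Method | Description |
---|---|
`<>.find()` | Searches like `find_all()` but returns only the first result as a single element (`None` if nothing matches), not a list. Parameters as for `find_all()`. |
`<>.find_parents()` | Searches the ancestor nodes and returns a list. Parameters as for `find_all()`. |
`<>.find_parent()` | Returns a single result from the ancestor nodes as a single element (`None` if nothing matches). Parameters as for `find_all()`. |
`<>.find_next_siblings()` | Searches the following sibling nodes and returns a list. Parameters as for `find_all()`. |
`<>.find_next_sibling()` | Returns a single result from the following sibling nodes as a single element (`None` if nothing matches). Parameters as for `find_all()`. |
`<>.find_previous_siblings()` | Searches the preceding sibling nodes and returns a list. Parameters as for `find_all()`. |
`<>.find_previous_sibling()` | Returns a single result from the preceding sibling nodes as a single element (`None` if nothing matches). Parameters as for `find_all()`. |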
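The methods in the table can be tried out on a small document. The sketch below uses a hypothetical, trimmed copy of the demo page and shows that the single-result variants return one element or `None` rather than a list.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, trimmed copy of the demo page.
html = (
    '<html><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">'
    '<a class="py1" id="link1">Basic Python</a> and '
    '<a class="py2" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")            # first matching tag
missing = soup.find("table")      # nothing matches, so None
second = soup.find(id="link2")

# Ancestors, nearest first.
parent = first.find_parent("p")
ancestors = [t.name for t in first.find_parents(["body", "p"])]

# Siblings after and before a node; text nodes are skipped when a tag name is given.
next_one = first.find_next_sibling("a")
next_all = first.find_next_siblings("a")
prev_one = second.find_previous_sibling("a")
no_prev = first.find_previous_sibling("a")

print(isinstance(first, Tag), missing)
print(parent["class"], ancestors)
print(next_one["id"], len(next_all), prev_one["id"], no_prev)
```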
Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit GuKaifeng's Blog when reposting.