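We have the text of demo.html, retrieved with the requests.get() method of the Requests library.

The text of demo.html is stored in the variable demo, and soup = BeautifulSoup(demo, "html.parser").

The BeautifulSoup library provides the find_all() method, which searches the soup object for matching content.

find_all() takes the five parameters shown below and returns a list-like object holding the search results.

<>.find_all(name, attrs, recursive, string, **kwargs)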
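As a quick check of the "returns a list" claim, here is a minimal sketch. The DEMO string is a shortened stand-in for demo.html, reconstructed from the output shown in this post, so treat its exact content as an assumption. In bs4 the result is a ResultSet, which subclasses list, so indexing, len() and comprehensions all work on it.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for demo.html (an assumption, not the original file).
DEMO = (
    '<html><head><title>This is a python demo page</title></head>\n'
    '<body>\n'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>\n'
    '<p class="course">Python is a wonderful general-purpose programming language.\n'
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>\n'
    '</body></html>'
)
soup = BeautifulSoup(DEMO, "html.parser")

links = soup.find_all("a")                 # every <a> tag, in document order
is_list = isinstance(links, list)          # ResultSet is a list subclass
hrefs = [a.get("href") for a in links]     # each element is a Tag
missing = soup.find_all("table")           # no match gives an empty result, not None

print(is_list, len(links), hrefs, list(missing))
```

Because a failed search yields an empty list rather than None, looping over the result is safe without an explicit check.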

Each parameter of find_all() is explained below; the comment on each call describes what it does.

  • name: a filter on tag names.

    >>> soup.find_all('a')  # find all <a> tags
    [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
    >>> soup.find_all(['a', 'b'])  # find all <a> and <b> tags
    [<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
    >>> soup.find_all(True)  # if name is True, every tag is returned
    [<html><head><title>This is a python demo page</title></head>
    <body>
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
    </body></html>, <head><title>This is a python demo page</title></head>, <title>This is a python demo page</title>, <body>
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
    </body>, <p class="title"><b>The demo python introduces several python courses.</b></p>, <b>The demo python introduces several python courses.</b>, <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
    >>> import re
    >>> for tag in soup.find_all(re.compile('b')):  # find all tags whose name contains 'b'
    ...     print(tag.name)
    ...
    body
    b
  • attrs: a filter on the value of a tag's class attribute; other attributes can be matched by name.

    >>> soup.find_all('p')  # attrs not used: find all <p> tags
    [<p class="title"><b>The demo python introduces several python courses.</b></p>, <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
    >>> soup.find_all('p', 'course')  # find <p> tags whose class attribute is 'course'
    [<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
    >>> kv = {'class': 'course'}  # the attrs dict can hold several attribute/value pairs to match
    >>> soup('p', attrs=kv)
    [<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
    >>> soup.find_all(id='link1')  # find all tags with id='link1'
    [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
    >>> soup.find_all(id='link')  # find all tags with id='link'; a plain string must match the whole value
    []
    >>> soup.find_all('a', id='link1')  # find all <a> tags with id='link1'
    [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
    >>> soup.find_all(id=re.compile('link'))  # find all tags whose id contains 'link'
    [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
  • recursive: whether to search all descendants; defaults to True.

    >>> soup.find_all('a')
    [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
    >>> soup.find_all('a', recursive=False)  # [] shows that soup has no <a> among its direct children; the <a> tags are deeper descendants
    []
  • string: a filter on the string content inside <>...</>.

    >>> soup.find_all(string="Basic Python")  # search for "Basic Python"; it must exactly equal the complete content inside <>...</>
    ['Basic Python']
    >>> soup.find_all(string=re.compile('Python'))  # return the complete content of every <>...</> string that contains "Python"
    ['Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n', 'Basic Python', 'Advanced Python']
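The four filters above can be exercised side by side. The following is a minimal sketch, assuming a shortened stand-in for demo.html reconstructed from the output in this post (the DEMO string and all variable names are illustrative, not part of the original). It also shows the class_ keyword, which bs4 accepts as another way to match the class attribute, and it makes visible that a compiled regex is matched anywhere in the name or value unless anchored with ^.

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for demo.html (an assumption, not the original file).
DEMO = (
    '<html><head><title>This is a python demo page</title></head>\n'
    '<body>\n'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>\n'
    '<p class="course">Python is a wonderful general-purpose programming language.\n'
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>\n'
    '</body></html>'
)
soup = BeautifulSoup(DEMO, "html.parser")

# name: a regex matches anywhere in the tag name; anchor with ^ for "starts with".
contains_t = [t.name for t in soup.find_all(re.compile("t"))]
starts_with_b = [t.name for t in soup.find_all(re.compile("^b"))]

# attrs: positional class string, attrs dict and class_ keyword select the same tag.
by_position = soup.find_all("p", "course")
by_dict = soup.find_all("p", attrs={"class": "course"})
by_keyword = soup.find_all("p", class_="course")
link_ids = [t["id"] for t in soup.find_all(id=re.compile("link"))]

# recursive=False inspects direct children only.
top_level_links = soup.find_all("a", recursive=False)
direct_links = by_position[0].find_all("a", recursive=False)

# string: a plain str must equal the whole text node; a regex may match part of it.
exact = soup.find_all(string="Basic Python")
no_match = soup.find_all(string="Python")
partial_parents = [s.parent.name for s in soup.find_all(string=re.compile("Python"))]

print(contains_t, starts_with_b, link_ids, len(direct_links), partial_parents)
```

Searching with string returns the text nodes themselves rather than tags, so .parent is the way back to the enclosing tag.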

Because find_all() is used so often, bs4 provides a shorthand for it.

<tag>() is equivalent to <tag>.find_all()
soup() is equivalent to soup.find_all()

>>> soup(string="Basic Python", id="link1")  # shorthand for find_all()
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(string="Basic Python", id="link1")  # full form
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
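To confirm that calling an object directly really is the same as calling find_all(), here is a small sketch. It again uses a shortened stand-in for demo.html, which is an assumption reconstructed from the output in this post.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for demo.html (an assumption, not the original file).
DEMO = (
    '<html><head><title>This is a python demo page</title></head>\n'
    '<body>\n'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>\n'
    '<p class="course">Python is a wonderful general-purpose programming language.\n'
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>\n'
    '</body></html>'
)
soup = BeautifulSoup(DEMO, "html.parser")

# Calling the soup object is the same as calling soup.find_all().
same_on_soup = soup("a") == soup.find_all("a")

# The shorthand also works on any tag, searching only inside that tag.
same_on_tag = soup.body("p") == soup.body.find_all("p")

# Keyword filters pass through unchanged.
short_ids = [t["id"] for t in soup(string="Basic Python", id="link1")]

print(same_on_soup, same_on_tag, short_ids)
```

The shorthand is convenient in a REPL; in longer scripts the explicit find_all() name is arguably easier to read.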

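find_all() also has seven commonly used related methods. They accept the same kinds of filters as find_all(). The single-result variants return one matching element (a tag or string object, not a plain string), or None when nothing matches.

Method                        Description
<>.find()                     Searches like find_all() but returns only the first result.
<>.find_parents()             Searches the ancestor nodes; returns a list.
<>.find_parent()              Returns the first matching ancestor node.
<>.find_next_siblings()       Searches the following sibling nodes; returns a list.
<>.find_next_sibling()        Returns the first matching following sibling node.
<>.find_previous_siblings()   Searches the preceding sibling nodes; returns a list.
<>.find_previous_sibling()    Returns the first matching preceding sibling node.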
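The related methods can be tried on the same page. This is a minimal sketch, assuming a shortened stand-in for demo.html reconstructed from the output in this post; the variable names are illustrative. It shows that the single-result methods hand back a Tag object (or None), while the plural ones hand back lists.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for demo.html (an assumption, not the original file).
DEMO = (
    '<html><head><title>This is a python demo page</title></head>\n'
    '<body>\n'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>\n'
    '<p class="course">Python is a wonderful general-purpose programming language.\n'
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>\n'
    '</body></html>'
)
soup = BeautifulSoup(DEMO, "html.parser")

first = soup.find("a")                      # first matching Tag
missing = soup.find("table")                # None when nothing matches

# Ancestors are searched from the nearest outward.
ancestors = [t.name for t in first.find_parents(["p", "body"])]
parent_class = first.find_parent("p")["class"]   # class is stored as a list

# Siblings share the same parent; text nodes between them are skipped by the name filter.
second = first.find_next_sibling("a")
later = [t["id"] for t in first.find_next_siblings("a")]
earlier = [t["id"] for t in second.find_previous_siblings("a")]
nothing_before = first.find_previous_sibling("a")

print(first["id"], missing, ancestors, parent_class, later, earlier, nothing_before)
```

Since find() and its single-result relatives can return None, it is worth checking the result before indexing into it.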