信息提取的一般方法

信息指从标记后的信息中提取所关注的内容。

信息标记有三种形式，XML、JSON和YAML，无论哪种形式，在信息标记中包含标签和信息两个部分。

我们关心的是我们想要提取出的信息内容，该怎么做呢？

方法有非常多种，这里从一般意义上给出几种方法。

序号	方法	信息文本	解析器	优点	缺点
1	完整解析信息的标记形式，再提取关键信息。	XML/JSON/YAML	需要标记解析器。例如：bs4库的标签树遍历。	信息解析准确。	提取过程繁琐，速度慢。
2	无视标记形式，直接搜索关键信息。	任何文本	不需要	提取过程简洁，速度较快。	提取结果准确定与信息内容直接相关。
3	融合方法1和方法2	XML/JSON/YAML/任何文本	需要标记解析器以及文本查找函数。

我们以BeautifulSoup库为例，解释如何实现信息提取。

实例：提取HTML中的所有URL链接。

思路：
1. 搜索到所有的<a>标签；
2. 解析<a>标签格式，提取属性href中的链接内容。

>>> import requests
>>> from bs4 import BeautifulSoup
>>> url = "https://www.crummy.com/software/BeautifulSoup/"
>>> r = requests.get(url)
>>> r.status_code
200
>>> r.encoding
'UTF-8'
>>> soup = BeautifulSoup(r.text, "html.parser")
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
... 
#Download
bs4/doc/
#HallOfFame
enterprise.html
https://code.launchpad.net/beautifulsoup
https://bazaar.launchpad.net/%7Eleonardr/beautifulsoup/bs4/view/head:/CHANGELOG
https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
zine/
bs4/download/
http://lxml.de/
http://code.google.com/p/html5lib/
bs4/doc/
https://tidelift.com/subscription/pkg/pypi-beautifulsoup4?utm_source=pypi-beautifulsoup4&utm_medium=referral&utm_campaign=enterprise
https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
https://bugs.launchpad.net/beautifulsoup/
https://tidelift.com/security
https://tidelift.com/subscription/pkg/pypi-beautifulsoup4?utm_source=pypi-beautifulsoup4&utm_medium=referral&utm_campaign=website
zine/
None
bs4/download/
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
download/3.x/BeautifulSoup-3.2.2.tar.gz
https://tidelift.com/subscription/pkg/pypi-beautifulsoup?utm_source=pypi-beautifulsoup&utm_medium=referral&utm_campaign=website
None
http://www.nytimes.com/2007/10/25/arts/design/25vide.html
https://github.com/reddit/reddit/blob/85f9cff3e2ab9bb8f19b96acd8da4ebacc079f04/r2/r2/lib/media.py
http://www.harrowell.org.uk/viktormap.html
http://svn.python.org/view/tracker/importer/
http://www2.ljworld.com/
http://www.b-list.org/weblog/2010/nov/02/news-done-broke/
http://esrl.noaa.gov/gsd/fab/
http://laps.noaa.gov/topograbber/
http://groups.google.com/group/beautifulsoup/
https://launchpad.net/beautifulsoup
https://code.launchpad.net/beautifulsoup/
https://bugs.launchpad.net/beautifulsoup/
/source/software/BeautifulSoup/index.bhtml
/self/
/self/contact.html
http://creativecommons.org/licenses/by-sa/2.0/
http://creativecommons.org/licenses/by-sa/2.0/
http://www.crummy.com/
http://www.crummy.com/software/
http://www.crummy.com/software/BeautifulSoup/