Python Web Scraping 2: Data Parsing

Python web scraping: data parsing

Python web scraping

Published: 2022-06-03

Updated: 2022-06-03


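I. XPath (recommended)

1. Concepts

What is XPath?
XPath (XML Path Language) is a language for locating information in XML and HTML documents. It can be used to traverse the elements and attributes of such documents.

XPath development tools

Chrome extension: XPath Helper.

Firefox extension: Try XPath.

3. XPath syntax

3.1 Selecting nodes:

XPath uses path expressions to select a node or a set of nodes in an XML document. These path expressions look very much like the paths used in an ordinary computer file system.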

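Expression	Description	Example	Result
nodename	Selects the child elements of the current node that have this name	bookstore	Selects the bookstore child elements of the current node
/	At the start of a path, selects from the root node; elsewhere, selects a direct child of the preceding node	/bookstore	Selects the root element bookstore
//	Selects matching nodes anywhere in the document, regardless of position	//book	Selects every book element in the document
@	Selects an attribute	//book[@price]	Selects every book element that has a price attribute
.	The current node	./a	Selects the a elements that are children of the current node
..	The parent of the current node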
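
The expressions in the table can be tried out with lxml, which is introduced later in this article. The following is a minimal sketch; the bookstore document and all variable names are made up for illustration.

```python
from lxml import etree

# Hypothetical sample document, made up for this illustration.
XML = """
<bookstore>
  <book price="10"><title>Python Basics</title></book>
  <book><title>Web Scraping</title></book>
  <magazine><title>Weekly</title></magazine>
</bookstore>
"""

root = etree.fromstring(XML.strip())  # root is the <bookstore> element

root_matches = root.xpath("/bookstore")         # absolute path: the root element
child_books = root.xpath("book")                # nodename: <book> children of the current node
all_books = root.xpath("//book")                # every <book>, anywhere in the document
priced_books = root.xpath("//book[@price]")     # <book> elements that have a price attribute
prices = root.xpath("//book/@price")            # the attribute values themselves
first_title = all_books[0].xpath("./title")[0]  # . means "relative to the current node"
parent = first_title.xpath("..")[0]             # .. steps back up to the parent

print(root_matches[0].tag, len(child_books), len(all_books), len(priced_books))
print(prices, first_title.text, parent.tag)
```

Calling xpath() on an element makes that element the current node, which is why the relative expressions (book, ./title, ..) depend on where they are evaluated while the absolute ones do not.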

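3.2 Predicates:

A predicate is used to find a specific node or a node that contains a given value. It is written inside square brackets.
The table below lists some path expressions with predicates, together with their results:

Path expression	Description
/bookstore/book[1]	Selects the first book child element of bookstore (XPath positions start at 1, not 0).
/bookstore/book[last()]	Selects the last book child element of bookstore.
/bookstore/book[position()<3]	Selects the first two book child elements of bookstore.
//book[@price]	Selects the book elements that have a price attribute.
//book[@price=10]	Selects the book elements whose price attribute equals 10.
/bookstore/book[price>35.00]	Selects the book child elements of bookstore whose price child element has a value greater than 35.00.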
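
A short sketch of the predicates above, again using lxml. The three-book document and the titles() helper are assumptions made up for this example.

```python
from lxml import etree

# Hypothetical sample document, made up for this illustration.
XML = """
<bookstore>
  <book price="10"><title>A</title><price>29.99</price></book>
  <book price="40"><title>B</title><price>39.95</price></book>
  <book><title>C</title><price>49.99</price></book>
</bookstore>
"""
root = etree.fromstring(XML.strip())


def titles(expr):
    """Evaluate expr and return the <title> text of each matched <book>."""
    return [book.findtext("title") for book in root.xpath(expr)]


first = titles("/bookstore/book[1]")                  # positions start at 1
last = titles("/bookstore/book[last()]")
first_two = titles("/bookstore/book[position()<3]")
has_price_attr = titles("//book[@price]")             # tests the attribute
price_attr_10 = titles("//book[@price=10]")
expensive = titles("/bookstore/book[price>35.00]")    # tests the <price> child element

print(first, last, first_two)
print(has_price_attr, price_attr_10, expensive)
```

Note the difference between @price (the attribute) and price (the child element): book C has no price attribute but does have a price element, so it appears only in the last result.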

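3.3 Wildcards

* is the wildcard.

Wildcard	Description	Example	Result
*	Matches any element node	/bookstore/*	Selects all child elements of bookstore.
@*	Matches any attribute of a node	//book[@*]	Selects the book elements that have at least one attribute.
@*	Matches any attribute of a node	//title[@*]	Selects the title elements that have at least one attribute.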

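3.4 Selecting several paths:

The "|" operator in a path expression selects the union of several paths.
For example:

//bookstore/book | //book/title
# Selects all book elements, plus all title elements under book elements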
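
Wildcards and the union operator can be checked the same way. This is a sketch on a made-up two-book document; none of the names come from the original article.

```python
from lxml import etree

# Hypothetical sample document, made up for this illustration.
XML = """
<bookstore>
  <book id="b1"><title lang="en">A</title><price>29.99</price></book>
  <book><title>B</title><price>39.95</price></book>
</bookstore>
"""
root = etree.fromstring(XML.strip())

children = [el.tag for el in root.xpath("/bookstore/*")]   # any child element
books_with_attr = root.xpath("//book[@*]")                 # books with at least one attribute
titles_with_attr = root.xpath("//title[@*]/text()")        # titles with at least one attribute
union = [el.tag for el in root.xpath("//book/title | //book/price")]

print(children, len(books_with_attr), titles_with_attr)
print(union)  # nodes of a union come back in document order
```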

3.5 Operators:
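
XPath 1.0 provides comparison operators (=, !=, <, <=, >, >=), arithmetic operators (+, -, *, div, mod) and the boolean operators and / or. The sketch below exercises a few of them with lxml on a made-up document; the document and variable names are assumptions for illustration only.

```python
from lxml import etree

# Hypothetical sample document, made up for this illustration.
XML = """
<bookstore>
  <book price="10"><title>A</title><price>29.99</price></book>
  <book price="40"><title>B</title><price>39.95</price></book>
  <book><title>C</title><price>49.99</price></book>
</bookstore>
"""
root = etree.fromstring(XML.strip())

mid_range = root.xpath("//book[price>30 and price<45]/title/text()")
extremes = root.xpath("//book[price<30 or price>45]/title/text()")
not_ten = root.xpath("//book[@price!=10]/title/text()")  # books without the attribute never match
book_count = root.xpath("count(//book)")                 # numeric results come back as float
quotient = root.xpath("10 div 4")                        # division is "div", not "/"
remainder = root.xpath("10 mod 4")
is_three = root.xpath("count(//book) = 3")               # boolean result

print(mid_range, extremes, not_ten)
print(book_count, quotient, remainder, is_three)
```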

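3.6 Functions

XPath functions make fuzzy matching easier.

Function	Usage	Meaning
starts-with	xpath('//div[starts-with(@id,"ma")]')	Selects div nodes whose id starts with ma
contains	xpath('//div[contains(@id,"ma")]')	Selects div nodes whose id contains ma
and	xpath('//div[contains(@id,"ma") and contains(@id,"in")]')	Selects div nodes whose id contains both ma and in
text()	xpath('//div[contains(text(),"ma")]')	Selects div nodes whose text contains ma

More examples:
//input[not(@id='123')]	Finds input elements whose id is not 123
//span[substring(@name,3,5)='xxxxx']	The 5 characters of the name attribute starting at the 3rd character are xxxxx
//span[substring-before(@class,"-")="spanclass1"]	The part of the class attribute before the - character is spanclass1
//span[substring-after(@class,"-")="spanclass1"]	The part of the class attribute after the - character is spanclass1
//div[div[@id='xxx']]	Locates a div through one of its child nodes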
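
The functions above can be sketched with lxml as follows. The HTML fragment, ids and class names are invented for this example and are not taken from any real page.

```python
from lxml import etree

# Hypothetical HTML fragment, made up for this illustration.
HTML = """
<div id="main"><span class="spanclass1-big" name="xxabcde">one</span></div>
<div id="market">ma is in this text</div>
<div id="footer"><input id="123"/><input id="456"/></div>
"""
tree = etree.HTML(HTML)

starts = tree.xpath('//div[starts-with(@id,"ma")]/@id')
both = tree.xpath('//div[contains(@id,"ma") and contains(@id,"in")]/@id')
by_text = tree.xpath('//div[contains(text(),"ma")]/@id')
not_123 = tree.xpath('//input[not(@id="123")]/@id')
sub = tree.xpath('//span[substring(@name,3,5)="abcde"]/text()')   # 5 chars from position 3
before = tree.xpath('//span[substring-before(@class,"-")="spanclass1"]/text()')
after = tree.xpath('//span[substring-after(@class,"-")="big"]/text()')
by_child = tree.xpath('//div[input[@id="123"]]/@id')              # locate a div via its child

print(starts, both, by_text, not_123)
print(sub, before, after, by_child)
```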

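4. The lxml library

lxml is an HTML/XML parser whose main job is to parse HTML/XML and extract data from it.
Like Python's regular expression engine, lxml is implemented in C, which makes it a high-performance HTML/XML parser for Python. With the XPath syntax covered above, we can quickly locate specific elements and node information.
lxml official documentation: http://lxml.de/index.html
lxml depends on the C libraries libxml2 and libxslt. The packages on PyPI normally bundle them, so it can be installed with pip: pip install lxml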

4.1 Basic usage:

We can use lxml to parse HTML. If the HTML is not well-formed, lxml repairs it automatically while parsing. For example:

# Use the etree module from lxml
from lxml import etree

# Note: the last <li> below is deliberately left without its closing </li> tag
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1">first item</a></li>
         <li class="item-1"><a href="link2">second item</a></li>
         <li class="item-inactive"><a href="link3">third item</a></li>
         <li class="item-1"><a href="link4">fourth item</a></li>
         <li class="item-0"><a href="link5">fifth item</a>
     </ul>
 </div>
'''
# Parse the string into an HTML document with etree.HTML
html = etree.HTML(text)
# Serialize the document; tostring() returns bytes, so decode before printing
result = etree.tostring(html).decode('utf-8')
print(result)

The output is:

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1">first item</a></li>
         <li class="item-1"><a href="link2">second item</a></li>
         <li class="item-inactive"><a href="link3">third item</a></li>
         <li class="item-1"><a href="link4">fourth item</a></li>
         <li class="item-0"><a href="link5">fifth item</a></li>
</ul>
</div>
</body></html>

As you can see, lxml repairs the HTML automatically. In this example it not only closed the li tag but also added the body and html tags.

4.2 Reading HTML from a file:

Besides parsing a string directly, lxml can also read its input from a file. Create a file named hello.html:

<!-- hello.html -->
<div>
    <ul>
         <li class="item-0"><a href="link1">first item</a></li>
         <li class="item-1"><a href="link2">second item</a></li>
         <li class="item-inactive"><a href="link3"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4">fourth item</a></li>
         <li class="item-0"><a href="link5">fifth item</a></li>
     </ul>
 </div>

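Then read the file with etree.parse(). By default etree.parse() uses the XML parser, which raises an error on HTML that is not well-formed, so pass an HTMLParser explicitly. For example:

from lxml import etree
# Read the external file hello.html with the HTML parser
parser = etree.HTMLParser()
html = etree.parse('hello.html', parser)
# tostring() returns bytes, so decode before printing
result = etree.tostring(html, pretty_print=True).decode('utf-8')
print(result)

The output has the same structure as before.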

4.3 Using XPath with lxml:

Get all li tags:

from lxml import etree
html = etree.parse('hello.html')
print(type(html))  # show the type returned by etree.parse()
result = html.xpath('//li')
print(result)  # print the list of <li> elements

Get the class attribute values of all li elements:

from lxml import etree
html = etree.parse('hello.html')
result = html.xpath('//li/@class')
print(result)

Get the a tag under an li whose href is link1:

from lxml import etree
html = etree.parse('hello.html')
result = html.xpath('//li/a[@href="link1"]')
print(result)

Get all span tags under li tags:

from lxml import etree
html = etree.parse('hello.html')
# result = html.xpath('//li/span')
# The line above would find nothing: / selects direct children only, and <span>
# is not a direct child of <li> (it is nested inside <a>), so use a double slash
result = html.xpath('//li//span')
print(result)

Get every class attribute found inside the a tags under li tags:

from lxml import etree
html = etree.parse('hello.html')
result = html.xpath('//li/a//@class')
print(result)

Get the href value of the a tag in the last li:

from lxml import etree
html = etree.parse('hello.html')
result = html.xpath('//li[last()]/a/@href')
# The predicate [last()] selects the last element
print(result)

Get the content of the a tag in the second-to-last li element:

from lxml import etree
html = etree.parse('hello.html')
result = html.xpath('//li[last()-1]/a')
# The .text attribute holds the element's text content
print(result[0].text)

A second way to get the content of the a tag in the second-to-last li element:

from lxml import etree
html = etree.parse('hello.html')
result = html.xpath('//li[last()-1]/a/text()')
print(result)
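
The snippets above all depend on a hello.html file on disk. As a self-contained sketch, the same queries can be run against the markup held in a string (parsed with etree.HTML, as in section 4.1). The variable names here are chosen for illustration only.

```python
from lxml import etree

# Same markup as hello.html, kept in a string so the example is self-contained.
HTML = """
<div>
    <ul>
         <li class="item-0"><a href="link1">first item</a></li>
         <li class="item-1"><a href="link2">second item</a></li>
         <li class="item-inactive"><a href="link3"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4">fourth item</a></li>
         <li class="item-0"><a href="link5">fifth item</a></li>
     </ul>
</div>
"""
tree = etree.HTML(HTML)

classes = tree.xpath('//li/@class')                     # attribute values
link1 = tree.xpath('//li/a[@href="link1"]/text()')      # filter on an attribute
direct_spans = tree.xpath('//li/span')                  # empty: span is not a direct child
nested_spans = tree.xpath('//li//span/text()')          # // also searches descendants
inner_classes = tree.xpath('//li/a//@class')            # class attributes inside <a>
last_href = tree.xpath('//li[last()]/a/@href')
second_last = tree.xpath('//li[last()-1]/a/text()')

print(classes)
print(link1, direct_spans, nested_spans, inner_classes, last_href, second_last)
```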

4.4 Scraping dytt8.net (Movie Heaven) with requests and XPath

Example code (the XPath expressions reflect the site's layout at the time of writing):

import requests
from urllib.parse import urljoin
from lxml import etree

BASE_DOMAIN = 'https://www.dytt8.net/'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Referer': 'http://www.dytt8.net/html/gndy/dyzz/list_23_2.html'
}


def spider():
    url = 'http://www.dytt8.net/html/gndy/dyzz/list_23_1.html'
    resp = requests.get(url, headers=HEADERS)
    # resp.content: the raw response body as bytes
    # resp.text: the body decoded to str with the encoding that requests guessed
    # The site is served as GBK, so decode the bytes explicitly
    text = resp.content.decode('gbk', errors='ignore')
    # tree: the document parsed by lxml; use its xpath() method
    # to extract the data we want
    tree = etree.HTML(text)
    all_a = tree.xpath("//div[@class='co_content8']//a")
    for a in all_a:
        titles = a.xpath("text()")
        hrefs = a.xpath("@href")
        if not titles or not hrefs:
            continue  # skip links without text or without an href
        href = hrefs[0]
        if href.startswith('/'):
            # urljoin avoids the double slash that plain concatenation would produce
            detail_url = urljoin(BASE_DOMAIN, href)
            crawl_detail(detail_url)
            break  # this demo crawls only the first detail page


def crawl_detail(url):
    resp = requests.get(url, headers=HEADERS)
    text = resp.content.decode('gbk', errors='ignore')
    tree = etree.HTML(text)
    create_time = tree.xpath("//div[@class='co_content8']/ul/text()")[0].strip()
    imgs = tree.xpath("//div[@id='Zoom']//img/@src")
    # Movie poster
    cover = imgs[0]
    # Movie screenshot
    screenshot = imgs[1]
    # All text nodes under the div whose id is Zoom
    infos = tree.xpath("//div[@id='Zoom']//text()")
    for index, info in enumerate(infos):
        if info.startswith("◎年　　代"):
            year = info.replace("◎年　　代", "").strip()
        if info.startswith("◎豆瓣评分"):
            douban_rating = info.replace("◎豆瓣评分", '').strip()
            print(douban_rating)
        if info.startswith("◎主　　演"):
            # The cast spans several text nodes: keep reading
            # until the next field, which starts with "◎"
            actors = [info.replace("◎主　　演", "").strip()]
            for x in range(index + 1, len(infos)):
                actor = infos[x].strip()
                if actor.startswith("◎"):
                    break
                actors.append(actor)
            print(",".join(actors))


if __name__ == '__main__':
    spider()

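5. Chrome issues:

Chrome 62 (the latest version at the time of writing) has a bug: when a page is redirected with a 302, the Form Data of the request is not recorded. For details see: https://stackoverflow.com/questions/34015735/http-post-payload-not-visible-in-chrome-debugger
The problem is already fixed in the Canary build, which can be downloaded and used instead: https://www.google.com/chrome/browser/canary.html

6. Homework:

Use requests and XPath to scrape the Tencent recruitment site. The goal is to collect the detail information for every job posting.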

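II. The BeautifulSoup4 library

Like lxml, Beautiful Soup is an HTML/XML parsing library whose main job is to parse HTML/XML and extract data from it. Beautiful Soup is written in Python and builds a tree of Python objects for the whole document on top of a pluggable parser, so its time and memory overhead is considerably higher and its performance is lower than lxml's.
Beautiful Soup makes parsing HTML simple and has a very friendly API. It supports CSS selectors and can use the HTML parser from the Python standard library as well as lxml's HTML and XML parsers.
Beautiful Soup 3 is no longer being developed; new projects should use Beautiful Soup 4.
See also my blog post: https://www.cnblogs.com/XJT2018/p/10315533.html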

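1. Installation and documentation:

Install: pip install beautifulsoup4 (the package is imported as bs4). The examples below also use the lxml parser, so lxml must be installed as well.
Documentation (Chinese): https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

2. Comparison of parsing tools: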

Parsing tool	Parsing speed	Difficulty
BeautifulSoup	Slowest	Easiest
lxml	Fast	Easy
Regular expressions	Fastest	Hardest

3. Basic usage:

from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create a BeautifulSoup object,
# using lxml as the parser
soup = BeautifulSoup(html,"lxml")
print(soup.prettify())

4. The four main object types:

Beautiful Soup converts a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four types:

Tag
NavigableString
BeautifulSoup
Comment

4.1 Tag:

A Tag is, simply put, one of the tags in the HTML. For example:

from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create a BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
# <title>The Dormouse's story</title>
print(soup.head)
# <head><title>The Dormouse's story</title></head>
print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
print(soup.p)
# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
print(type(soup.p))
# <class 'bs4.element.Tag'>

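Writing soup followed by a tag name is an easy way to get at these tags. The resulting objects are of type bs4.element.Tag. Note, however, that this returns only the first matching tag in the whole document. Finding all matching tags is covered later.
A Tag has two important attributes, name and attrs. For example:

print(soup.name)
# [document]  # the soup object itself is special: its name is [document]
print(soup.head.name)
# head  # for any other tag, the value is the name of the tag itself
print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}
# This prints all the attributes of the p tag; the result is a dictionary.
print(soup.p['class'])  # same as soup.p.get('class')
# ['title']  # indexing and the get method are equivalent when the attribute exists
soup.p['class'] = "newClass"
print(soup.p)  # attributes and content can be modified
# <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>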

4.2 NavigableString:

Once you have a tag, you can get the text inside it with tag.string. For example:

print(soup.p.string)
# The Dormouse's story
print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>

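4.3 BeautifulSoup:

The BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a Tag object: it supports most of the methods for navigating and searching the tree.
Because the BeautifulSoup object does not correspond to a real HTML or XML tag, it has no name and no attributes. It is sometimes handy to look at its .name, though, so the BeautifulSoup object has a special .name whose value is "[document]":

soup.name
# '[document]'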

4.4 Comment:

Tag, NavigableString and BeautifulSoup cover almost everything in an HTML or XML document, but a few special objects are left over. The one you are most likely to run into is the comment:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'lxml')
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>

The Comment object is a special kind of NavigableString:

comment
# 'Hey, buddy. Want to buy a used parser?'
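
Because tag.string returns a Comment when the only thing inside a tag is a comment, scraped "text" can turn out to be comment content. The sketch below shows one possible way to filter that out. The visible_text helper is hypothetical, and the standard-library parser html.parser is used here (instead of lxml) only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")


def visible_text(tag):
    """Return tag.string as str, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


raw = soup.a.string  # the first <a> contains only a comment
results = {a["id"]: visible_text(a) for a in soup.find_all("a")}

print(type(raw).__name__, repr(raw))
print(results)
```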

5. Navigating the tree:

5.1 contents and children:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'lxml')
head_tag = soup.head
# .contents returns a list of all child nodes
print(head_tag.contents)
# .children returns an iterator over all child nodes
for child in head_tag.children:
    print(child)

5.2 strings and stripped_strings

If a tag contains more than one string, use .strings to loop over them:

for string in soup.strings:
    print(repr(string))
    # "The Dormouse's story"
    # '\n\n'
    # "The Dormouse's story"
    # '\n\n'
    # 'Once upon a time there were three little sisters; and their names were\n'
    # 'Elsie'
    # ',\n'
    # 'Lacie'
    # ' and\n'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '\n\n'
    # '...'
    # '\n'

The strings may contain a lot of extra spaces and blank lines. Use .stripped_strings to remove the surplus whitespace:

for string in soup.stripped_strings:
    print(repr(string))
    # "The Dormouse's story"
    # "The Dormouse's story"
    # 'Once upon a time there were three little sisters; and their names were'
    # 'Elsie'
    # ','
    # 'Lacie'
    # 'and'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '...'

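6. Searching the tree:

6.1 The find and find_all methods:

Two methods do most of the work when searching the tree: find and find_all. find returns as soon as it finds the first tag that meets the conditions, so it returns a single element (or None if nothing matches). find_all collects every tag that meets the conditions and returns them all. The most common usage is to pass a tag name and an attrs dictionary to pick out the tags you want.

soup.find_all("a", attrs={"id": "link2"})

Alternatively, pass the attribute name directly as a keyword argument:

soup.find_all("a", id='link2')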
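
The difference between the two methods can be sketched as follows. The markup is a shortened version of the "three sisters" document, and html.parser is used only to keep the example dependency-free; class_ and limit are standard find_all arguments.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                          # first match only: a Tag
all_links = soup.find_all("a")                  # every match: a list of Tags
by_class = soup.find_all("a", class_="sister")  # class is a Python keyword, hence class_
by_id = soup.find_all("a", attrs={"id": "link2"})
limited = soup.find_all("a", limit=2)           # stop after two matches
missing = soup.find("table")                    # no match: find returns None

hrefs = [a["href"] for a in all_links]
print(first.get_text(), len(all_links), len(by_class), len(limited), missing)
print(by_id[0].get_text(), hrefs)
```

Since find returns None when nothing matches, it is worth checking the result before accessing attributes on it.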

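6.2 The select method:

The methods above make it easy to find elements, but CSS selectors are sometimes more convenient still. To use CSS selector syntax, call the select method. Some common selector forms follow:

(1) By tag name:

print(soup.select('a'))

(2) By class name:

To select by class, put a . in front of the class name. For example, to find the tags with class=sister:

print(soup.select('.sister'))

(3) By id:

To select by id, put a # in front of the id. For example:

print(soup.select("#link1"))

(4) Combined selectors:

Combined selectors follow the same rules as in a CSS file, where tag names, class names and ids are combined. For example, to find the element whose id is link1 inside a p tag, separate the two with a space:

print(soup.select("p #link1"))

To select direct children only, separate the two with >:

print(soup.select("head > title"))

(5) By attribute:

A selector can also test an attribute. The attribute goes in square brackets. Because the attribute and the tag belong to the same node, there must be no space between them, otherwise nothing will match. For example:

print(soup.select('a[href="http://example.com/elsie"]'))

(6) Getting the content

All of the select calls above return a list. You can iterate over it and call get_text() on each item to get its content.

soup = BeautifulSoup(html, 'lxml')
print(type(soup.select('title')))
print(soup.select('title')[0].get_text())
for title in soup.select('title'):
    print(title.get_text())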
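
The selector forms above can be checked together in one self-contained sketch. The markup is a shortened version of the sample document, and html.parser is used only to avoid an extra dependency. select_one, which returns the first match or None, is a standard Beautiful Soup method that the text above does not cover.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("a")
by_class = soup.select(".sister")
by_id = soup.select("#link1")
descendant = soup.select("p #link1")      # link1 somewhere inside a <p>
child = soup.select("head > title")       # direct child only
by_attr = soup.select('a[href="http://example.com/elsie"]')
one = soup.select_one("#link2")           # first match, or None
none = soup.select_one("#no-such-id")

names = [a.get_text() for a in by_tag]
print(names, child[0].get_text(), one.get_text(), none)
```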

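III. Regular expressions (re):

What is a regular expression?
In plain terms: a rule for matching the data you want out of a string. That rule is the regular expression.
The textbook definition: https://baike.baidu.com/item/正则表达式/1700215?fr=aladdin
A joke:
There are two kinds of people in the world: those who understand regular expressions and those who do not.
See also my blog posts: https://www.cnblogs.com/XJT2018/p/10872274.html
https://www.cnblogs.com/XJT2018/p/10312830.html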

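1. Common regular expression patterns:

Pattern	Description
\w	Matches a letter, digit, or underscore
\W	Matches any character that is not a letter, digit, or underscore
\s	Matches any whitespace character; equivalent to [ \t\n\r\f\v]
\S	Matches any non-whitespace character
\d	Matches any digit; equivalent to [0-9] for ASCII text
\D	Matches any non-digit
\A	Matches the start of the string
\Z	Matches the end of the string (in Perl-style engines it also matches just before a trailing newline; in Python's re it matches only at the very end)
\z	Matches the very end of the string (Perl-style; in Python's re, use \Z)
\G	Matches the position where the previous match ended (Perl-style; not supported by Python's re)
\n	Matches a newline
\t	Matches a tab
^	Matches the start of the string
$	Matches the end of the string
.	Matches any character except a newline; with the re.DOTALL flag it matches any character, including a newline
[...]	A set of characters: [amk] matches 'a', 'm', or 'k'
[^...]	Any character not in the set: [^abc] matches any character other than a, b, or c
*	Matches 0 or more repetitions of the preceding expression
+	Matches 1 or more repetitions of the preceding expression
?	Matches 0 or 1 occurrence of the preceding expression; placed after another quantifier (*?, +?, ??), it makes that quantifier non-greedy
{n}	Matches exactly n repetitions of the preceding expression
{n,m}	Matches n to m repetitions of the preceding expression, greedily
a|b	Matches a or b
( )	Matches the expression inside the parentheses and also defines a group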

1.1 Matching a literal string:

import re  # all the snippets in this part assume this import

text = 'hello'
ret = re.match('he', text)
print(ret.group())
>> he

This matches he at the start of hello.

1.2 The dot (.) matches any character:

text = "ab"
ret = re.match('.', text)
print(ret.group())
>> a

The dot (.) cannot match a newline, however. For example:

text = "\n"
ret = re.match('.', text)
print(ret.group())
>> AttributeError: 'NoneType' object has no attribute 'group'
1.3 \d matches any digit:

text = "123"
ret = re.match(r'\d', text)
print(ret.group())
>> 1

1.4 \D matches any non-digit:

text = "a"
ret = re.match(r'\D', text)
print(ret.group())
>> a

If text is a digit, the match fails. For example:

text = "1"
ret = re.match(r'\D', text)
print(ret.group())
>> AttributeError: 'NoneType' object has no attribute 'group'

1.5 \s matches a whitespace character (including \n, \t, \r and the space):

text = "\t"
ret = re.match(r'\s', text)
print(ret.group())
>> (a tab character, which prints as blank space)

1.6 \S matches any non-whitespace character

1.7 \w matches `a-z`, `A-Z`, digits and the underscore:

text = "_"
ret = re.match(r'\w', text)
print(ret.group())
>> _

Any other character does not match. For example:

text = "+"
ret = re.match(r'\w', text)
print(ret.group())
>> AttributeError: 'NoneType' object has no attribute 'group'

1.8 \W matches the opposite of \w:

text = "+"
ret = re.match(r'\W', text)
print(ret.group())
>> +

If text is an underscore or a letter, there is no match. For example:

text = "_"
ret = re.match(r'\W', text)
print(ret.group())
>> AttributeError: 'NoneType' object has no attribute 'group'
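
The AttributeError in these examples arises because re.match returns None when there is no match. A minimal sketch of guarding against that follows; the first_match helper is a hypothetical name introduced for this example.

```python
import re


def first_match(pattern, text):
    """Return the text matched at the start of `text`, or None if nothing matches."""
    m = re.match(pattern, text)
    return m.group() if m else None


results = {
    "digit": first_match(r"\d", "1a"),
    "non_digit": first_match(r"\D", "1a"),   # no match: None instead of an exception
    "space": first_match(r"\s", "\tx"),
    "non_space": first_match(r"\S", "x y"),
    "word": first_match(r"\w", "_"),
    "non_word": first_match(r"\W", "_"),     # no match
}
print(results)
```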

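1.9 Square brackets: a match succeeds if the character is any one of those listed inside the brackets:

text = "0731-88888888"
ret = re.match(r'[\d\-]+', text)
print(ret.group())
>> 0731-88888888

The shorthand classes covered above can also be written with square brackets (for ASCII text):

\d: [0-9]
\D: [^0-9]
\w: [0-9a-zA-Z_]
\W: [^0-9a-zA-Z_]
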
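1.10 * + ?:

*: matches 0 or more of the preceding character. For example:

text = "0731"
ret = re.match(r'\d*', text)
print(ret.group())
>> 0731

The pattern \d requires a digit, and the trailing asterisk allows it to repeat, so all four characters of 0731 are matched.

+: matches 1 or more of the preceding character; at least one is required. For example:

text = "abc"
ret = re.match(r'\w+', text)
print(ret.group())
>> abc

The pattern \w requires a word character (letter, digit, or underscore), and the trailing plus sign means at least one such character must be present. If text is empty or starts with a character that \w does not match, there is no match and calling group() raises an error. For example:
```
text = ""
ret = re.match(r'\w+', text)
print(ret.group())
>> AttributeError: 'NoneType' object has no attribute 'group'
```
?: the preceding character may appear once or not at all (0 or 1 times). For example:
```
text = "123"
ret = re.match(r'\d?', text)
print(ret.group())
>> 1
```

{m}: matches exactly m of the preceding character. For example:

text = "123"
ret = re.match(r'\d{2}', text)
print(ret.group())
>> 12

{m,n}: matches between m and n of the preceding character; any count in that range matches. For example:
```
text = "123"
ret = re.match(r'\d{1,2}', text)
print(ret.group())
>> 12
```
If text has only one character, it still matches. For example:
```
text = "1"
ret = re.match(r'\d{1,2}', text)
print(ret.group())
>> 1
```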
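2. Case studies:
Validating a mobile phone number: the number starts with 1, the second digit is one of 3, 4, 5, 8 or 7, and the remaining 9 digits can be anything. For example:
```
text = "18570631587"
ret = re.match(r'1[34587]\d{9}', text)
print(ret.group())
>> 18570631587
```

A phone number that does not meet these conditions does not match. For example:

text = "1857063158"
ret = re.match(r'1[34587]\d{9}', text)
print(ret.group())
>> AttributeError: 'NoneType' object has no attribute 'group'

Validating an email address: the mailbox name consists of letters, digits and underscores, followed by the @ sign and then the domain name. For example:
```
text = "hynever@163.com"
ret = re.match(r'\w+@\w+\.[a-zA-Z\.]+', text)
print(ret.group())
```
Validating a URL: the URL starts with http, https or ftp, followed by a colon and two slashes, and after that any non-whitespace characters. For example:
```
text = "http://www.baidu.com/"
ret = re.match(r'(http|https|ftp)://[^\s]+', text)
print(ret.group())
```
Validating an ID card number: the number has 18 characters in total. The first 17 are digits; the last one can be a digit, a lowercase x, or an uppercase X. For example:
```
text = "31131118908123230X"
ret = re.match(r'\d{17}[\dxX]', text)
print(ret.group())
```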
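
One caveat for validation: re.match only anchors the pattern at the start of the string, so an over-long input still "matches" on its prefix. The sketch below uses re.fullmatch (available since Python 3.4), which requires the whole string to match. The is_phone and is_id_card helper names are hypothetical, and the sample numbers are made up.

```python
import re

PHONE = re.compile(r"1[34587]\d{9}")
ID_CARD = re.compile(r"\d{17}[\dxX]")


def is_phone(text):
    """True only if the whole string is an 11-digit number of the expected shape."""
    return PHONE.fullmatch(text) is not None


def is_id_card(text):
    """True only if the whole string is 17 digits followed by a digit, x or X."""
    return ID_CARD.fullmatch(text) is not None


too_long = "185706315879999"
prefix_only = re.match(r"1[34587]\d{9}", too_long)  # succeeds on the first 11 characters

print(prefix_only.group(), is_phone(too_long), is_phone("18570631587"))
print(is_id_card("31131118908123230X"), is_id_card("3113111890812323X"))
```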
1.11 ^: "starts with":
```
text = "hello"
ret = re.match(r'^h', text)
print(ret.group())
>> h
```
Inside square brackets, ^ means negation instead.

1.12 $: "ends with":
```
# Match a 163.com email address
text = "xxx@163.com"
ret = re.search(r'\w+@163\.com$', text)
print(ret.group())
>> xxx@163.com
```
1.13 |: matches one of several expressions or strings:
```
text = "say world"
ret = re.search(r'hello|world', text)
print(ret.group())
>> world
```
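1.14 Greedy and non-greedy matching:
Greedy: the regular expression matches as many characters as possible. This is the default.
Non-greedy: the regular expression matches as few characters as possible.
For example:
```
text = "0123456"
ret = re.match(r'\d+', text)
print(ret.group())
# Matching is greedy by default, so this prints 0123456
>> 0123456
```
Switching to non-greedy matching makes it match only 0. For example:
```
text = "0123456"
ret = re.match(r'\d+?', text)
print(ret.group())
>> 0
```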
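
Greedy versus non-greedy matching matters most when pulling tags out of HTML. A small sketch on a made-up one-line fragment:

```python
import re

html = "<h1>title</h1>"

# Greedy: .+ runs to the end of the string and backs off only as far as the last ">"
greedy = re.match(r"<.+>", html).group()

# Non-greedy: .+? stops at the first ">" that lets the pattern succeed
lazy = re.match(r"<.+?>", html).group()

# Non-greedy quantifiers are also the usual way to pull out every tag separately
tags = re.findall(r"<.+?>", html)

print(greedy)
print(lazy)
print(tags)
```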
3. Case study: matching a number between 0 and 100:
```
text = '99'
ret = re.match(r'[1-9]?\d$|100$', text)
print(ret.group())
>> 99
```
If text is 101, there is no match and calling group() raises an exception. For example:
```
text = '101'
ret = re.match(r'[1-9]?\d$|100$', text)
print(ret.group())
>> AttributeError: 'NoneType' object has no attribute 'group'
```
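1.15 Escape characters and raw strings:
Some characters have a special meaning in a regular expression, so to match them literally you must escape them with a backslash. For example, $ means "ends with"; to match a literal $, write \$. For example:
```
text = "apple price is $99,orange price is $88"
ret = re.search(r'\$(\d+)', text)
print(ret.group())
>> $99
```
Raw strings:
In a regular expression, \ is the escape character, and in a Python string literal \ is the escape character too. To match a literal \ using an ordinary string, the pattern therefore needs four backslashes: Python turns them into two, and the regex engine reads those two as one literal backslash. For example:
```
text = "apple \\c"  # the string itself contains a single backslash: apple \c
ret = re.search('\\\\c', text)
print(ret.group())
>> \c
```
A raw string solves this problem:
```
text = r"apple \c"
ret = re.search(r'\\c', text)
print(ret.group())
>> \c
```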

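4. Common functions in the re module:

match:

Matches from the start of the string. If there is no match at the start, it fails immediately. For example:

text = 'hello'
ret = re.match('h', text)
print(ret.group())
>> h

If the first letter is not h, the match fails. For example:

text = 'ahello'
ret = re.match('h', text)
print(ret.group())
>> AttributeError: 'NoneType' object has no attribute 'group'

To let the dot match across line breaks, pass the flag re.DOTALL so that it also matches newlines. For example:

text = "abc\nabc"
ret = re.match('abc.*abc', text, re.DOTALL)
print(ret.group())

search:

Scans the string for a place where the pattern matches and returns as soon as it finds one. In other words, it finds only the first match.

text = 'apple price $99 orange price $88'
ret = re.search(r'\d+', text)
print(ret.group())
>> 99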

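Groups:

Parts of the matched string can be captured as groups. A group is written with parentheses.

group(): equivalent to group(0); returns the whole matched string.
groups(): returns a tuple of all the subgroups.
group(1): returns the first subgroup (subgroup indexes start at 1); several indexes can be passed at once.
For example:

text = "apple price is $99,orange price is $10"
ret = re.search(r".*(\$\d+).*(\$\d+)", text)
print(ret.group())   # the whole match
print(ret.group(0))  # same as group()
print(ret.group(1))  # $99
print(ret.group(2))  # $10
print(ret.groups())  # ('$99', '$10')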
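
Two further points about groups, sketched on the same sample string: group() accepts several indexes at once and returns a tuple, and findall returns the groups (as tuples) rather than the whole match when the pattern contains more than one group. The patterns below are variations written for this example.

```python
import re

text = "apple price is $99,orange price is $10"

m = re.search(r"(\$\d+).*(\$\d+)", text)
whole = m.group()      # everything from the first price to the last one
pair = m.group(1, 2)   # several indexes at once: a tuple
both = m.groups()      # all subgroups, also a tuple

# With two groups in the pattern, findall yields one tuple per match
items = re.findall(r"(\w+) price is \$(\d+)", text)

print(whole)
print(pair, both)
print(items)
```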

findall:

Finds every match and returns them as a list.

text = 'apple price $99 orange price $88'
ret = re.findall(r'\d+', text)
print(ret)
>> ['99', '88']

sub:

Replaces text: every part of the string that matches the pattern is replaced with another string.

text = 'apple price $99 orange price $88'
ret = re.sub(r'\d+', '0', text)
print(ret)
>> apple price $0 orange price $0

The source code and its docstring:

def sub(pattern, repl, string, count=0, flags=0):
	    """Return the string obtained by replacing the leftmost
	    non-overlapping occurrences of the pattern in string by the
	    replacement repl.  repl can be either a string or a callable;
	    if a string, backslash escapes in it are processed.  If it is
	    a callable, it's passed the match object and must return
	    a replacement string to be used."""
	    return _compile(pattern, flags).sub(repl, string, count)

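As the code above shows, re.sub() takes five parameters, of which the first three are required:
(1) pattern: the regular expression pattern;
(2) repl: the replacement string (whatever matches pattern is replaced with repl); it can also be a function;
(3) string: the original string to be processed (searched and replaced);
(4) count: optional; the maximum number of replacements, which must be a non-negative integer. The default is 0, meaning every match is replaced;
(5) flags: optional; the matching flags used when compiling the pattern (ignore case, multiline mode, and so on), given as a number. The default is 0.
Reference: https://blog.csdn.net/jackandsnow/article/details/103885422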
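
A brief sketch of the less obvious parameters: repl as a function, count, and a backreference inside a replacement string. The double helper is a hypothetical name used only in this example.

```python
import re

text = "apple price $99 orange price $88"


def double(match):
    """repl as a function: receives the match object, returns the replacement."""
    return str(int(match.group()) * 2)


doubled = re.sub(r"\d+", double, text)
first_only = re.sub(r"\d+", "0", text, count=1)    # replace at most one match
reordered = re.sub(r"\$(\d+)", r"\1 USD", text)    # \1 refers to the first group

print(doubled)
print(first_only)
print(reordered)
```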
A case study for sub: cleaning up data taken from Lagou (a job site) by stripping the HTML tags:

html = """
<div>
<p>基本要求：</p>
<p>1、精通HTML5、CSS3、 JavaScript等Web前端开发技术，对html5页面适配充分了解，熟悉不同浏览器间的差异，熟练写出兼容各种浏览器的代码；</p>
<p>2、熟悉运用常见JS开发框架，如JQuery、vue、angular，能快速高效实现各种交互效果；</p>
<p>3、熟悉编写能够自动适应HTML5界面，能让网页格式自动适应各款各大小的手机；</p>
<p>4、利用HTML5相关技术开发移动平台、PC终端的前端页面，实现HTML5模板化；</p>
<p>5、熟悉手机端和PC端web实现的差异，有移动平台web前端开发经验，了解移动互联网产品和行业，有在Android,iOS等平台下HTML5+CSS+JavaScript（或移动JS框架）开发经验者优先考虑；6、良好的沟通能力和团队协作精神，对移动互联网行业有浓厚兴趣，有较强的研究能力和学习能力；</p>
<p>7、能够承担公司前端培训工作，对公司各业务线的前端（HTML5\CSS3）工作进行支撑和指导。</p>
<p><br></p>
<p>岗位职责：</p>
<p>1、利用html5及相关技术开发移动平台、微信、APP等前端页面，各类交互的实现；</p>
<p>2、持续的优化前端体验和页面响应速度，并保证兼容性和执行效率；</p>
<p>3、根据产品需求，分析并给出最优的页面前端结构解决方案；</p>
<p>4、协助后台及客户端开发人员完成功能开发和调试；</p>
<p>5、移动端主流浏览器的适配、移动端界面自适应研发。</p>
</div>
"""
ret = re.sub('</?[a-zA-Z0-9]+>',"",html)
print(ret)

split:

Splits a string using a regular expression.

text = "hello world ni hao"
ret = re.split(r'\W', text)
print(ret)
>> ['hello', 'world', 'ni', 'hao']

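compile:

A regular expression that is used often can be compiled once with compile and then reused, which avoids looking the pattern up again on every call. compile also accepts the re.VERBOSE flag, which lets you spread the pattern over several lines and comment it. For example:

text = "the number is 20.50"
r = re.compile(r"""
                \d+ # digits before the decimal point
                \.? # the decimal point
                \d* # digits after the decimal point
                """, re.VERBOSE)
ret = re.search(r, text)
print(ret.group())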
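
A compiled pattern also has its own search, findall, sub and other methods, so it can be reused directly. The sketch below is a slight variation on the pattern above that makes the fractional part an optional non-capturing group; the NUMBER name and the sample sentence are made up for illustration.

```python
import re

NUMBER = re.compile(r"""
    \d+          # digits before the decimal point
    (?:\.\d+)?   # optionally, a decimal point followed by more digits
""", re.VERBOSE)

first = NUMBER.search("the number is 20.50").group()
every = NUMBER.findall("the numbers are 20.50, 3 and 0.75")
masked = NUMBER.sub("#", "call 12 or 34.5")

print(first)
print(every)
print(masked)
```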

CoderXiong

http://bfd2018.github.io/2022/06/03/python-pa-chong/python-pa-chong-2-shu-ju-jie-xi/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit CoderXiong when reposting!
