Python 正则表达式

原文： http://zetcode.com/python/regularexpressions/

Python 正则表达式教程展示了如何在 Python 中使用正则表达式。对于 Python 中的正则表达式，我们使用re模块。

正则表达式用于文本搜索和更高级的文本操作。正则表达式是内置工具，如grep，sed，文本编辑器（如 vi，emacs），编程语言（如 Tcl，Perl 和 Python）。

Python `re`模块

在 Python 中，re模块提供了正则表达式匹配操作。

模式是一个正则表达式，用于定义我们正在搜索或操纵的文本。它由文本文字和元字符组成。用compile()函数编译该模式。由于正则表达式通常包含特殊字符，因此建议使用原始字符串。（原始字符串以r字符开头。）这样，在将字符编译为模式之前，不会对这些字符进行解释。

编译模式后，可以使用其中一个函数将模式应用于文本字符串。函数包括match()，search()，find()和finditer()。

下表显示了一些正则表达式：

正则表达式	含义
`.`	匹配任何单个字符。
`?`	一次匹配或根本不匹配前面的元素。
`+`	与前面的元素匹配一次或多次。
`*`	与前面的元素匹配零次或多次。
`^`	匹配字符串中的起始位置。
`$`	匹配字符串中的结束位置。
`\|`	备用运算符。
`[abc]`	匹配`a`或`b`或`c`。
`[a-c]`	范围; 匹配`a`或`b`或`c`。
`[^abc]`	否定，匹配除`a`或`b`或`c`之外的所有内容。
`\s`	匹配空白字符。
`\w`	匹配单词字符；等同于`[a-zA-Z_0-9]`

匹配函数

以下是一个代码示例，演示了 Python 中简单正则表达式的用法。

match_fun.py

#!/usr/bin/python3

import re

words = ('book', 'bookworm', 'Bible', 
    'bookish','cookbook', 'bookstore', 'pocketbook')

pattern = re.compile(r'book')

for word in words:
    if re.match(pattern, word):
        print('The {} matches '.format(word))

在示例中，我们有一个单词元组。编译后的模式将在每个单词中寻找一个"book"字符串。

pattern = re.compile(r'book')

使用compile()函数，我们可以创建模式串。正则表达式是一个原始字符串，由四个普通字符组成。

for word in words:
    if re.match(pattern, word):
        print('The {} matches '.format(word))

我们遍历元组并调用match()函数。它将模式应用于单词。如果字符串开头有匹配项，则match()函数将返回匹配对象。

$ ./match_fun.py 
The book matches 
The bookworm matches 
The bookish matches 
The bookstore matches

元组中的四个单词与模式匹配。请注意，以"book"一词开头的单词不匹配。为了包括这些词，我们使用search()函数。

搜索函数

search()函数查找正则表达式模式产生匹配项的第一个位置。

search_fun.py

#!/usr/bin/python3

import re

words = ('book', 'bookworm', 'Bible', 
    'bookish','cookbook', 'bookstore', 'pocketbook')

pattern = re.compile(r'book')

for word in words:
    if re.search(pattern, word):
        print('The {} matches '.format(word))

在示例中，我们使用search()函数查找"book"一词。

$ ./search_fun.py 
The book matches 
The bookworm matches 
The bookish matches 
The cookbook matches 
The bookstore matches 
The pocketbook matches

这次还包括菜谱和袖珍书中的单词。

点元字符

点（。）元字符代表文本中的任何单个字符。

dot_meta.py

#!/usr/bin/python3

import re

words = ('seven', 'even', 'prevent', 'revenge', 'maven', 
    'eleven', 'amen', 'event')

pattern = re.compile(r'.even')

for word in words:
    if re.match(pattern, word):
        print('The {} matches '.format(word))

在示例中，我们有一个带有八个单词的元组。我们在每个单词上应用一个包含点元字符的模式。

pattern = re.compile(r'.even')

点代表文本中的任何单个字符。字符必须存在。

$ ./dot_meta.py 
The seven matches 
The revenge matches

两个字匹配模式：七个和复仇。

问号元字符

问号（？）元字符是与上一个元素零或一次匹配的量词。

question_mark_meta.py

#!/usr/bin/python3

import re

words = ('seven', 'even','prevent', 'revenge', 'maven', 
    'eleven', 'amen', 'event')

pattern = re.compile(r'.?even')

for word in words:
    if re.match(pattern, word):
        print('The {} matches '.format(word))

在示例中，我们在点字符后添加问号。这意味着在模式中我们可以有一个任意字符，也可以在那里没有任何字符。

$ ./question_mark_meta.py 
The seven matches 
The even matches 
The revenge matches 
The event matches

这次，除了七个和复仇外，偶数和事件词也匹配。

锚点

锚点匹配给定文本内字符的位置。当使用^锚时，匹配必须发生在字符串的开头，而当使用$锚时，匹配必须发生在字符串的结尾。

anchors.py

#!/usr/bin/python3

import re

sentences = ('I am looking for Jane.',
    'Jane was walking along the river.',
    'Kate and Jane are close friends.')

pattern = re.compile(r'^Jane')

for sentence in sentences:
    if re.search(pattern, sentence):
        print(sentence)

在示例中，我们有三个句子。搜索模式为^Jane。该模式检查"Jane"字符串是否位于文本的开头。 Jane\.将在句子结尾处查找"Jane"。

`fullmatch`

可以使用fullmatch()函数或通过将术语放在锚点之间来进行精确匹配：^和$。

exact_match.py

#!/usr/bin/python3

import re

words = ('book', 'bookworm', 'Bible', 
    'bookish','cookbook', 'bookstore', 'pocketbook')

pattern = re.compile(r'^book$')

for word in words:
    if re.search(pattern, word):
        print('The {} matches'.format(word))

在示例中，我们寻找与"book"一词完全匹配的内容。

$ ./exact_match.py 
The book matches

这是输出。

字符类

字符类定义了一组字符，任何字符都可以出现在输入字符串中以使匹配成功。

character_class.py

#!/usr/bin/python3

import re

words = ('a gray bird', 'grey hair', 'great look')

pattern = re.compile(r'gr[ea]y')

for word in words:
    if re.search(pattern, word):
        print('{} matches'.format(word))

在该示例中，我们使用字符类同时包含灰色和灰色单词。

pattern = re.compile(r'gr[ea]y')

[ea]类允许在模式中使用'e'或'a'字符。

命名字符类

有一些预定义的字符类。 \s与空白字符[\t\n\t\f\v]匹配，\d与数字[0-9]匹配，\w与单词字符[a-zA-Z0-9_]匹配。

named_character_class.py

#!/usr/bin/python3

import re

text = 'We met in 2013\. She must be now about 27 years old.'

pattern = re.compile(r'\d+')

found = re.findall(pattern, text)

if found:
    print('There are {} numbers'.format(len(found)))

在示例中，我们计算文本中的数字。

pattern = re.compile(r'\d+')

\d+模式在文本中查找任意数量的数字集。

found = re.findall(pattern, text)

使用findall()方法，我们可以查找文本中的所有数字。

$ ./named_character_classes.py 
There are 2 numbers

这是输出。

不区分大小写的匹配

默认情况下，模式匹配区分大小写。通过将re.IGNORECASE传递给compile()函数，我们可以使其不区分大小写。

case_insensitive.py

#!/usr/bin/python3

import re

words = ('dog', 'Dog', 'DOG', 'Doggy')

pattern = re.compile(r'dog', re.IGNORECASE)

for word in words:
    if re.match(pattern, word):
        print('{} matches'.format(word))

在示例中，无论大小写如何，我们都将模式应用于单词。

$ ./case_insensitive.py 
dog matches
Dog matches
DOG matches
Doggy matches

所有四个单词都与模式匹配。

交替

交替运算符|创建具有多种选择的正则表达式。

alternations.py

#!/usr/bin/python3

import re

words = ("Jane", "Thomas", "Robert",
    "Lucy", "Beky", "John", "Peter", "Andy")

pattern = re.compile(r'Jane|Beky|Robert')

for word in words:
    if re.match(pattern, word):
        print(word)

列表中有八个名称。

pattern = re.compile(r'Jane|Beky|Robert')

此正则表达式查找Jane，"Beky"或"Robert"字符串。

查找方法

finditer()方法返回一个迭代器，该迭代器在字符串中的模式的所有不重叠匹配上产生匹配对象。

find_iter.py

#!/usr/bin/python3

import re

text = ('I saw a fox in the wood. The fox had red fur.')

pattern = re.compile(r'fox')

found = re.finditer(pattern, text)

for item in found:

    s = item.start()
    e = item.end()
    print('Found {} at {}:{}'.format(text[s:e], s, e))

在示例中，我们在文本中搜索"fox"一词。我们遍历找到的匹配项的迭代器，并使用它们的索引进行打印。

s = item.start()
e = item.end()

start()和end()方法分别返回起始索引和结束索引。

$ ./find_iter.py 
Found fox at 8:11
Found fox at 29:32

这是输出。

捕获组

捕获组是一种将多个字符视为一个单元的方法。通过将字符放置在一组圆括号内来创建它们。例如，(book)是包含'b', 'o', 'o', 'k'字符的单个组。

捕获组技术使我们能够找出字符串中与常规模式匹配的那些部分。

capturing_groups.py

#!/usr/bin/python3

import re

content = '''<p>The <code>Pattern</code> is a compiled
representation of a regular expression.</p>'''

pattern = re.compile(r'(</?[a-z]*>)')

found = re.findall(pattern, content)

for tag in found:
    print(tag)

该代码示例通过捕获一组字符来打印提供的字符串中的所有 HTML 标签。

found = re.findall(pattern, content)

为了找到所有标签，我们使用findall()方法。

$ ./capturing_groups.py 
<p>
<code>
</code>
</p>

我们找到了四个 HTML 标签。