syntax

package syntax

import "regexp/syntax"

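Package syntax parses regular expressions into parse trees and compiles parse trees into programs. Most clients of regular expressions will use the facilities of package regexp (such as regexp.Compile and regexp.Match) instead of this package.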
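As a rough illustration of what a parse tree looks like, the sketch below parses a pattern and prints one operator per node. The helpers outline, treeDepth, and parseDepth are hypothetical names introduced for this example only; they are not part of the package.

```go
package main

import (
	"fmt"
	"regexp/syntax"
	"strings"
)

// outline writes one line per node of the parse tree, indented by depth.
func outline(b *strings.Builder, re *syntax.Regexp, depth int) {
	fmt.Fprintf(b, "%s%v\n", strings.Repeat("  ", depth), re.Op)
	for _, sub := range re.Sub {
		outline(b, sub, depth+1)
	}
}

// treeDepth returns the number of nodes on the longest root-to-leaf path.
func treeDepth(re *syntax.Regexp) int {
	deepest := 0
	for _, sub := range re.Sub {
		if d := treeDepth(sub); d > deepest {
			deepest = d
		}
	}
	return deepest + 1
}

// parseDepth parses pattern with the Perl flags and reports the tree depth,
// or -1 if the pattern does not parse.
func parseDepth(pattern string) int {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return -1
	}
	return treeDepth(re)
}

func main() {
	re, err := syntax.Parse(`(a+)b`, syntax.Perl)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	var b strings.Builder
	outline(&b, re, 0)
	fmt.Print(b.String())
	fmt.Println("depth:", treeDepth(re))
}
```

For `(a+)b` the tree is a concatenation whose first child is a capture group wrapping a plus node, which is why the depth is larger than for a plain literal.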

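Syntax

The regular expression syntax understood by this package when parsing with the Perl flag is as follows. Parts of the syntax can be disabled by passing alternate flags to Parse.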

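Single characters:

.              any character, possibly including newline (flag s=true)
[xyz]          character class
[^xyz]         negated character class
\d             Perl character class
\D             negated Perl character class
[[:alpha:]]    ASCII character class
[[:^alpha:]]   negated ASCII character class
\pN            Unicode character class (one-letter name)
\p{Greek}      Unicode character class
\PN            negated Unicode character class (one-letter name)
\P{Greek}      negated Unicode character class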

Composites:

xy             x followed by y
x|y            x or y (prefer x)

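Repetitions:

x*             zero or more x, prefer more
x+             one or more x, prefer more
x?             zero or one x, prefer one
x{n,m}         n or n+1 or ... or m x, prefer more
x{n,}          n or more x, prefer more
x{n}           exactly n x
x*?            zero or more x, prefer fewer
x+?            one or more x, prefer fewer
x??            zero or one x, prefer zero
x{n,m}?        n or n+1 or ... or m x, prefer fewer
x{n,}?         n or more x, prefer fewer
x{n}?          exactly n x

Implementation restriction: The counting forms x{n,m}, x{n,}, and x{n} reject forms that create a minimum or maximum repetition count above 1000. Unlimited repetitions are not subject to this restriction.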
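The repetition limit can be observed directly through the error that Parse returns. The following is a minimal sketch; repeatTooLarge is a hypothetical helper written for this example.

```go
package main

import (
	"errors"
	"fmt"
	"regexp/syntax"
)

// repeatTooLarge reports whether pattern is rejected with ErrInvalidRepeatSize.
func repeatTooLarge(pattern string) bool {
	_, err := syntax.Parse(pattern, syntax.Perl)
	var perr *syntax.Error
	return errors.As(err, &perr) && perr.Code == syntax.ErrInvalidRepeatSize
}

func main() {
	for _, p := range []string{`a{1000}`, `a{1001}`, `a{2,1001}`, `a*`} {
		fmt.Printf("%-12s rejected=%v\n", p, repeatTooLarge(p))
	}
}
```

Counts up to 1000 are accepted, and the unbounded form a* is not affected by the limit.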

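Grouping:

(re)           numbered capturing group (submatch)
(?P<name>re)   named & numbered capturing group (submatch)
(?<name>re)    named & numbered capturing group (submatch)
(?:re)         non-capturing group
(?flags)       set flags within current group; non-capturing
(?flags:re)    set flags during re; non-capturing

Flag syntax is xyz (set) or -xyz (clear) or xy-z (set xy, clear z). The flags are:

i              case-insensitive (default false)
m              multi-line mode: ^ and $ match begin/end line in addition to begin/end text (default false)
s              let . match \n (default false)
U              ungreedy: swap meaning of x* and x*?, x+ and x+?, etc (default false)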
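One way to see the effect of the s and m flags is to look at the operator the parser chooses for a lone dot or caret. This is a sketch under the assumption that the pattern is parsed with the Perl flags; rootOp, dotMatchesNewline, and caretIsLineAnchor are hypothetical helpers.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// rootOp parses pattern with the Perl flags and returns the operator at the
// root of the parse tree. The second result is false if parsing fails.
func rootOp(pattern string) (syntax.Op, bool) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return 0, false
	}
	return re.Op, true
}

// dotMatchesNewline reports whether a pattern consisting of a single dot
// (plus optional flag settings) was parsed as "any character".
func dotMatchesNewline(pattern string) bool {
	op, ok := rootOp(pattern)
	return ok && op == syntax.OpAnyChar
}

// caretIsLineAnchor reports whether a pattern consisting of a single caret
// (plus optional flag settings) was parsed as a beginning-of-line assertion.
func caretIsLineAnchor(pattern string) bool {
	op, ok := rootOp(pattern)
	return ok && op == syntax.OpBeginLine
}

func main() {
	fmt.Println(dotMatchesNewline(`.`), dotMatchesNewline(`(?s).`))
	fmt.Println(caretIsLineAnchor(`^`), caretIsLineAnchor(`(?m)^`))
}
```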

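Empty strings:

^              at beginning of text or line (flag m=true)
$              at end of text (like \z not \Z) or line (flag m=true)
\A             at beginning of text
\b             at ASCII word boundary (\w on one side and \W, \A, or \z on the other)
\B             not at ASCII word boundary
\z             at end of text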

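Escape sequences:

\a             bell (== \007)
\f             form feed (== \014)
\t             horizontal tab (== \011)
\n             newline (== \012)
\r             carriage return (== \015)
\v             vertical tab character (== \013)
\*             literal *, for any punctuation character *
\123           octal character code (up to three digits)
\x7F           hex character code (exactly two digits)
\x{10FFFF}     hex character code
\Q...\E        literal text ... even if ... has punctuation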

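Character class elements:

x              single character
A-Z            character range (inclusive)
\d             Perl character class
[:foo:]        ASCII character class foo
\p{Foo}        Unicode character class Foo
\pF            Unicode character class F (one-letter name)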

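Named character classes as character class elements:

[\d]           digits (== \d)
[^\d]          not digits (== \D)
[\D]           not digits (== \D)
[^\D]          not not digits (== \d)
[[:name:]]     named ASCII class inside character class (== [:name:])
[^[:name:]]    named ASCII class inside negated character class (== [:^name:])
[\p{Name}]     named Unicode property inside character class (== \p{Name})
[^\p{Name}]    named Unicode property inside negated character class (== \P{Name})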

Perl character classes (all ASCII-only):

\d             digits (== [0-9])
\D             not digits (== [^0-9])
\s             whitespace (== [\t\n\f\r ])
\S             not whitespace (== [^\t\n\f\r ])
\w             word characters (== [0-9A-Za-z_])
\W             not word characters (== [^0-9A-Za-z_])
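The ASCII-only nature of the Perl classes shows up in the parse tree: an OpCharClass node stores its ranges in Rune as lo/hi pairs. The sketch below prints those pairs; classRanges is a hypothetical helper and assumes the whole pattern is a single class.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// classRanges parses pattern, which is assumed to be a single character
// class, and returns its ranges as a string of lo/hi rune pairs.
// It returns "" if the pattern does not parse to a character class.
func classRanges(pattern string) string {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil || re.Op != syntax.OpCharClass {
		return ""
	}
	return string(re.Rune)
}

func main() {
	for _, p := range []string{`\d`, `\w`, `\s`} {
		fmt.Printf("%s ranges: %q\n", p, classRanges(p))
	}
}
```

Each consecutive pair of runes in the result is one inclusive range, so `\d` yields the single pair '0', '9'.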

ASCII character classes:

[[:alnum:]]    alphanumeric (== [0-9A-Za-z])
[[:alpha:]]    alphabetic (== [A-Za-z])
[[:ascii:]]    ASCII (== [\x00-\x7F])
[[:blank:]]    blank (== [\t ])
[[:cntrl:]]    control (== [\x00-\x1F\x7F])
[[:digit:]]    digits (== [0-9])
[[:graph:]]    graphical (== [!-~] == [A-Za-z0-9!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~])
[[:lower:]]    lower case (== [a-z])
[[:print:]]    printable (== [ -~] == [ [:graph:]])
[[:punct:]]    punctuation (== [!-/:-@[-`{-~])
[[:space:]]    whitespace (== [\t\n\v\f\r ])
[[:upper:]]    upper case (== [A-Z])
[[:word:]]     word characters (== [0-9A-Za-z_])
[[:xdigit:]]   hex digit (== [0-9A-Fa-f])

Unicode character classes are those in unicode.Categories, unicode.CategoryAliases, and unicode.Scripts.
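A Unicode class can be inspected the same way as any other character class, by scanning the range pairs of the parsed node. This is an illustrative sketch; inClass is a hypothetical helper and assumes the pattern is a single class parsed with the Perl flags (which include UnicodeGroups).

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// inClass parses pattern, which is assumed to be a single character class,
// and reports whether r falls inside one of its ranges.
func inClass(pattern string, r rune) bool {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil || re.Op != syntax.OpCharClass {
		return false
	}
	// Rune holds inclusive lo/hi pairs.
	for i := 0; i+1 < len(re.Rune); i += 2 {
		if re.Rune[i] <= r && r <= re.Rune[i+1] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(inClass(`\p{Greek}`, '\u03b1')) // Greek small letter alpha
	fmt.Println(inClass(`\p{Greek}`, 'a'))
	fmt.Println(inClass(`\pL`, 'a'))
}
```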

Index

func IsWordChar(r rune) bool
type EmptyOp
- func EmptyOpContext(r1, r2 rune) EmptyOp
type Error
- func (e *Error) Error() string
type ErrorCode
- func (e ErrorCode) String() string
type Flags
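type Inst
- func (i *Inst) MatchEmptyWidth(before rune, after rune) bool
- func (i *Inst) MatchRune(r rune) bool
- func (i *Inst) MatchRunePos(r rune) int
- func (i *Inst) String() string
type InstOp
- func (i InstOp) String() string
type Op
- func (i Op) String() string
type Prog
- func Compile(re *Regexp) (*Prog, error)
- func (p *Prog) Prefix() (prefix string, complete bool)
- func (p *Prog) StartCond() EmptyOp
- func (p *Prog) String() string
type Regexp
- func Parse(s string, flags Flags) (*Regexp, error)
- func (re *Regexp) CapNames() []string
- func (x *Regexp) Equal(y *Regexp) bool
- func (re *Regexp) MaxCap() int
- func (re *Regexp) Simplify() *Regexp
- func (re *Regexp) String() string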

Functions

func IsWordChar

func IsWordChar(r rune) bool

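IsWordChar reports whether r is considered a "word character" during the evaluation of the \b and \B zero-width assertions. These assertions are ASCII-only: the word characters are [A-Za-z0-9_].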
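A short sketch of the ASCII-only behavior: the hypothetical helper wordChars keeps only the runes that IsWordChar accepts, so accented letters are dropped along with punctuation.

```go
package main

import (
	"fmt"
	"regexp/syntax"
	"strings"
)

// wordChars returns the runes of s for which syntax.IsWordChar is true.
func wordChars(s string) string {
	var b strings.Builder
	for _, r := range s {
		if syntax.IsWordChar(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(wordChars("h\u00e9llo_w0rld!"))
}
```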

Types

type EmptyOp

type EmptyOp uint8

An EmptyOp specifies a kind or mixture of zero-width assertions.

const (
	EmptyBeginLine EmptyOp = 1 << iota
	EmptyEndLine
	EmptyBeginText
	EmptyEndText
	EmptyWordBoundary
	EmptyNoWordBoundary
)

func EmptyOpContext

func EmptyOpContext(r1, r2 rune) EmptyOp

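EmptyOpContext returns the zero-width assertions satisfied at the position between the runes r1 and r2. Passing r1 == -1 indicates that the position is at the beginning of the text. Passing r2 == -1 indicates that the position is at the end of the text.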
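The result is a bit set, so individual assertions are tested with a mask. The sketch below wraps two such tests in hypothetical helpers, wordBoundaryBetween and atTextEnd.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// wordBoundaryBetween reports whether \b would hold between r1 and r2.
func wordBoundaryBetween(r1, r2 rune) bool {
	return syntax.EmptyOpContext(r1, r2)&syntax.EmptyWordBoundary != 0
}

// atTextEnd reports whether the position between r1 and r2 is the end of text.
func atTextEnd(r1, r2 rune) bool {
	return syntax.EmptyOpContext(r1, r2)&syntax.EmptyEndText != 0
}

func main() {
	fmt.Println(wordBoundaryBetween(-1, 'a'))  // start of text, then a word character
	fmt.Println(wordBoundaryBetween('a', 'b')) // inside a word
	fmt.Println(atTextEnd('a', -1))            // after the last rune
}
```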

type Error

type Error struct {
	Code ErrorCode
	Expr string
}

An Error describes a failure to parse a regular expression and gives the offending expression.

func (*Error) Error

func (e *Error) Error() string

type ErrorCode

type ErrorCode string

An ErrorCode describes a failure to parse a regular expression.

const (
	// Unexpected error
	ErrInternalError ErrorCode = "regexp/syntax: internal error"

	// Parse errors
	ErrInvalidCharClass      ErrorCode = "invalid character class"
	ErrInvalidCharRange      ErrorCode = "invalid character class range"
	ErrInvalidEscape         ErrorCode = "invalid escape sequence"
	ErrInvalidNamedCapture   ErrorCode = "invalid named capture"
	ErrInvalidPerlOp         ErrorCode = "invalid or unsupported Perl syntax"
	ErrInvalidRepeatOp       ErrorCode = "invalid nested repetition operator"
	ErrInvalidRepeatSize     ErrorCode = "invalid repeat count"
	ErrInvalidUTF8           ErrorCode = "invalid UTF-8"
	ErrMissingBracket        ErrorCode = "missing closing ]"
	ErrMissingParen          ErrorCode = "missing closing )"
	ErrMissingRepeatArgument ErrorCode = "missing argument to repetition operator"
	ErrTrailingBackslash     ErrorCode = "trailing backslash at end of expression"
	ErrUnexpectedParen       ErrorCode = "unexpected )"
	ErrNestingDepth          ErrorCode = "expression nests too deeply"
	ErrLarge                 ErrorCode = "expression too large"
)
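Callers that need to distinguish failure kinds can unwrap the *Error and compare its Code against these constants. A minimal sketch, with parseErrorCode as a hypothetical helper:

```go
package main

import (
	"errors"
	"fmt"
	"regexp/syntax"
)

// parseErrorCode parses pattern with the Perl flags and returns the error
// code as a string, or "" if the pattern parses successfully.
func parseErrorCode(pattern string) string {
	_, err := syntax.Parse(pattern, syntax.Perl)
	var perr *syntax.Error
	if errors.As(err, &perr) {
		return string(perr.Code)
	}
	return ""
}

func main() {
	for _, p := range []string{`a(b`, `[a`, `a\`, `ab`} {
		fmt.Printf("%-4q -> %q\n", p, parseErrorCode(p))
	}
}
```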

func (ErrorCode) String

func (e ErrorCode) String() string

type Flags

type Flags uint16

Flags control the behavior of the parser and record information about regexp context.

const (
	FoldCase      Flags = 1 << iota // case-insensitive match
	Literal                         // treat pattern as literal string
	ClassNL                         // allow character classes like [^a-z] and [[:space:]] to match newline
	DotNL                           // allow . to match newline
	OneLine                         // treat ^ and $ as only matching at beginning and end of text
	NonGreedy                       // make repetition operators default to non-greedy
	PerlX                           // allow Perl extensions
	UnicodeGroups                   // allow \p{Han}, \P{Han} for Unicode group and negation
	WasDollar                       // regexp OpEndText was $, not \z
	Simple                          // regexp contains no counted repetition

	MatchNL = ClassNL | DotNL

	Perl        = ClassNL | OneLine | PerlX | UnicodeGroups // as close to Perl as possible
	POSIX Flags = 0                                         // POSIX syntax
)
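The choice of flags changes which patterns are accepted and how they are read. The sketch below contrasts Perl, POSIX, and Literal; parsesAsPerl, parsesAsPOSIX, and literalText are hypothetical helpers.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// parsesAsPerl reports whether pattern is accepted under the Perl flags.
func parsesAsPerl(pattern string) bool {
	_, err := syntax.Parse(pattern, syntax.Perl)
	return err == nil
}

// parsesAsPOSIX reports whether pattern is accepted under the POSIX flags.
func parsesAsPOSIX(pattern string) bool {
	_, err := syntax.Parse(pattern, syntax.POSIX)
	return err == nil
}

// literalText parses pattern with the Literal flag and returns the runes of
// the resulting literal node, or "" if the result is not a literal.
func literalText(pattern string) string {
	re, err := syntax.Parse(pattern, syntax.Literal)
	if err != nil || re.Op != syntax.OpLiteral {
		return ""
	}
	return string(re.Rune)
}

func main() {
	fmt.Println(parsesAsPerl(`\d`), parsesAsPOSIX(`\d`)) // \d is a Perl extension
	fmt.Println(parsesAsPOSIX(`[[:digit:]]`))
	fmt.Println(literalText(`a.b*`)) // metacharacters are taken literally
}
```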

type Inst

type Inst struct {
	Op   InstOp
	Out  uint32 // all but InstMatch, InstFail
	Arg  uint32 // InstAlt, InstAltMatch, InstCapture, InstEmptyWidth
	Rune []rune
}

An Inst is a single instruction in a regular expression program.

func (*Inst) MatchEmptyWidth

func (i *Inst) MatchEmptyWidth(before rune, after rune) bool

MatchEmptyWidth reports whether the instruction matches an empty string between the runes before and after. It should only be called when i.Op == InstEmptyWidth.
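To try MatchEmptyWidth, one first needs an InstEmptyWidth instruction. The sketch below compiles a pattern and uses the first such instruction it finds; compile and emptyWidthMatches are hypothetical helpers, and scanning Prog.Inst this way is only an illustration.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// compile parses pattern with the Perl flags, simplifies it, and compiles it.
func compile(pattern string) (*syntax.Prog, error) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return nil, err
	}
	return syntax.Compile(re.Simplify())
}

// emptyWidthMatches compiles pattern, locates its first InstEmptyWidth
// instruction, and asks whether it matches between before and after.
// It returns false if the program has no such instruction.
func emptyWidthMatches(pattern string, before, after rune) bool {
	prog, err := compile(pattern)
	if err != nil {
		return false
	}
	for i := range prog.Inst {
		inst := &prog.Inst[i]
		if inst.Op == syntax.InstEmptyWidth {
			return inst.MatchEmptyWidth(before, after)
		}
	}
	return false
}

func main() {
	fmt.Println(emptyWidthMatches(`\b`, ' ', 'a'))
	fmt.Println(emptyWidthMatches(`\b`, 'a', 'b'))
	fmt.Println(emptyWidthMatches(`(?m)^`, '\n', 'a'))
}
```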

func (*Inst) MatchRune

func (i *Inst) MatchRune(r rune) bool

MatchRune reports whether the instruction matches (and consumes) r. It should only be called when i.Op == InstRune.

func (*Inst) MatchRunePos

func (i *Inst) MatchRunePos(r rune) int

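MatchRunePos checks whether the instruction matches (and consumes) r. If so, MatchRunePos returns the index of the matching rune pair (or, when len(i.Rune) == 1, rune singleton). If not, MatchRunePos returns -1. MatchRunePos should only be called when i.Op == InstRune.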
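For a class with two ranges, the returned index identifies which range matched. The sketch below compiles `[a-cx-z]` and queries its InstRune instruction; runePos is a hypothetical helper that returns -2 when the program has no InstRune instruction, a value chosen here only to keep it distinct from the documented -1.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// runePos compiles pattern, finds its first InstRune instruction, and
// returns MatchRunePos(r) for it. It returns -2 if pattern does not compile
// or contains no InstRune instruction.
func runePos(pattern string, r rune) int {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return -2
	}
	prog, err := syntax.Compile(re.Simplify())
	if err != nil {
		return -2
	}
	for i := range prog.Inst {
		inst := &prog.Inst[i]
		if inst.Op == syntax.InstRune {
			return inst.MatchRunePos(r)
		}
	}
	return -2
}

func main() {
	for _, r := range []rune{'b', 'y', 'm'} {
		fmt.Printf("%q -> %d\n", r, runePos(`[a-cx-z]`, r))
	}
}
```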

func (*Inst) String

func (i *Inst) String() string

type InstOp

type InstOp uint8

An InstOp is an instruction opcode.

const (
	InstAlt InstOp = iota
	InstAltMatch
	InstCapture
	InstEmptyWidth
	InstMatch
	InstFail
	InstNop
	InstRune
	InstRune1
	InstRuneAny
	InstRuneAnyNotNL
)

func (InstOp) String

func (i InstOp) String() string

type Op

type Op uint8

An Op is a single regular expression operator.

const (
	OpNoMatch        Op = 1 + iota // matches no strings
	OpEmptyMatch                   // matches empty string
	OpLiteral                      // matches Runes sequence
	OpCharClass                    // matches Runes interpreted as range pair list
	OpAnyCharNotNL                 // matches any character except newline
	OpAnyChar                      // matches any character
	OpBeginLine                    // matches empty string at beginning of line
	OpEndLine                      // matches empty string at end of line
	OpBeginText                    // matches empty string at beginning of text
	OpEndText                      // matches empty string at end of text
	OpWordBoundary                 // matches word boundary `\b`
	OpNoWordBoundary               // matches word non-boundary `\B`
	OpCapture                      // capturing subexpression with index Cap, optional name Name
	OpStar                         // matches Sub[0] zero or more times
	OpPlus                         // matches Sub[0] one or more times
	OpQuest                        // matches Sub[0] zero or one times
	OpRepeat                       // matches Sub[0] at least Min times, at most Max (Max == -1 is no limit)
	OpConcat                       // matches concatenation of Subs
	OpAlternate                    // matches alternation of Subs
)

func (Op) String

func (i Op) String() string

type Prog

type Prog struct {
	Inst   []Inst
	Start  int // index of start instruction
	NumCap int // number of InstCapture insts in re
}

A Prog is a compiled regular expression program.

func Compile

func Compile(re *Regexp) (*Prog, error)

Compile compiles the regexp into a program to be executed. The regexp should have been simplified already (returned from re.Simplify).
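The usual sequence is Parse, then Simplify, then Compile. The sketch below runs that sequence and prints the resulting program; programSize is a hypothetical helper that returns -1 on any error.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// programSize parses, simplifies, and compiles pattern, then returns the
// number of instructions in the program, or -1 on error.
func programSize(pattern string) int {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return -1
	}
	prog, err := syntax.Compile(re.Simplify())
	if err != nil {
		return -1
	}
	return len(prog.Inst)
}

func main() {
	re, err := syntax.Parse(`a{3}`, syntax.Perl)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	prog, err := syntax.Compile(re.Simplify())
	if err != nil {
		fmt.Println("compile error:", err)
		return
	}
	fmt.Print(prog.String()) // one line per instruction
	fmt.Println("instructions:", programSize(`a{3}`))
}
```

Because counted repetitions are expanded by Simplify, a pattern such as a{3} produces more instructions than a single a.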

func (*Prog) Prefix

func (p *Prog) Prefix() (prefix string, complete bool)

Prefix returns a literal string that all matches for the regexp must start with. The complete result is true if the prefix is the entire match.
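A small sketch of how the two results behave; literalPrefix is a hypothetical helper that compiles a pattern with the Perl flags before calling Prefix.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// literalPrefix compiles pattern and returns the results of Prog.Prefix.
// It returns ("", false) if the pattern cannot be compiled.
func literalPrefix(pattern string) (string, bool) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return "", false
	}
	prog, err := syntax.Compile(re.Simplify())
	if err != nil {
		return "", false
	}
	return prog.Prefix()
}

func main() {
	for _, p := range []string{`abc`, `abc.*`, `[ab]c`} {
		prefix, complete := literalPrefix(p)
		fmt.Printf("%-6s prefix=%q complete=%v\n", p, prefix, complete)
	}
}
```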

func (*Prog) StartCond

func (p *Prog) StartCond() EmptyOp

StartCond returns the leading empty-width conditions that must be true in any match. It returns ^EmptyOp(0) if no matches are possible.
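One possible use of StartCond is to detect patterns that can only match at the beginning of the text. The sketch below does this with a hypothetical helper, requiresTextStart, which treats the "no matches possible" value as false.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// requiresTextStart reports whether every match of pattern must begin at
// the start of the text, according to Prog.StartCond.
func requiresTextStart(pattern string) bool {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return false
	}
	prog, err := syntax.Compile(re.Simplify())
	if err != nil {
		return false
	}
	cond := prog.StartCond()
	if cond == ^syntax.EmptyOp(0) {
		return false // no match is possible at all
	}
	return cond&syntax.EmptyBeginText != 0
}

func main() {
	for _, p := range []string{`^abc`, `abc`, `(?m)^abc`} {
		fmt.Printf("%-10s anchored=%v\n", p, requiresTextStart(p))
	}
}
```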

func (*Prog) String

func (p *Prog) String() string

type Regexp

type Regexp struct {
	Op       Op // operator
	Flags    Flags
	Sub      []*Regexp  // subexpressions, if any
	Sub0     [1]*Regexp // storage for short Sub
	Rune     []rune     // matched runes, for OpLiteral, OpCharClass
	Rune0    [2]rune    // storage for short Rune
	Min, Max int        // min, max for OpRepeat
	Cap      int        // capturing index, for OpCapture
	Name     string     // capturing name, for OpCapture
}

A Regexp is a node in a regular expression syntax tree.

func Parse

func Parse(s string, flags Flags) (*Regexp, error)

Parse parses a regular expression string s, controlled by the specified Flags, and returns a regular expression parse tree. The syntax is described in the top-level comment.

func (*Regexp) CapNames

func (re *Regexp) CapNames() []string

CapNames walks the regexp to find the names of capturing groups.

func (*Regexp) Equal

func (x *Regexp) Equal(y *Regexp) bool

Equal reports whether x and y have identical structure.
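Structural equality is about the parse tree, not the pattern text. In the sketch below, sameStructure is a hypothetical helper that parses two patterns with the Perl flags and compares the trees.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// sameStructure parses both patterns and reports whether the resulting
// trees are Equal. It returns false if either pattern fails to parse.
func sameStructure(a, b string) bool {
	x, err := syntax.Parse(a, syntax.Perl)
	if err != nil {
		return false
	}
	y, err := syntax.Parse(b, syntax.Perl)
	if err != nil {
		return false
	}
	return x.Equal(y)
}

func main() {
	fmt.Println(sameStructure(`a+`, `(?:a+)`)) // a non-capturing group adds no node
	fmt.Println(sameStructure(`a+`, `(a+)`))   // a capturing group does
	fmt.Println(sameStructure(`a+`, `a+?`))    // greediness is part of the structure
}
```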

func (*Regexp) MaxCap

func (re *Regexp) MaxCap() int

MaxCap walks the regexp to find the maximum capture index.
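CapNames and MaxCap are easiest to understand side by side: the name slice is indexed by capture number, with index 0 standing for the whole match. A minimal sketch, using a hypothetical helper captureInfo:

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// captureInfo parses pattern and returns its capture names and maximum
// capture index. It returns (nil, -1) if the pattern does not parse.
func captureInfo(pattern string) ([]string, int) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return nil, -1
	}
	return re.CapNames(), re.MaxCap()
}

func main() {
	names, maxCap := captureInfo(`(?P<year>\d+)-(\d+)`)
	fmt.Printf("names=%q max=%d\n", names, maxCap)
}
```

Unnamed groups and the implicit group 0 appear as empty strings in the slice.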

func (*Regexp) Simplify

func (re *Regexp) Simplify() *Regexp

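Simplify returns a regexp equivalent to re but without counted repetitions and with various other simplifications, such as rewriting /(?:a+)+/ to /a+/. The resulting regexp will execute correctly but its string representation will not produce the same parse tree, because capturing parentheses may have been duplicated or removed. For example, the simplified form for /(x){1,2}/ is /(x)(x)?/ but both parentheses capture as $1. The returned regexp may share structure with or be the original.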
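The removal of counted repetitions can be checked by walking the tree before and after simplification. In this sketch, hasRepeat and repeatBeforeAfter are hypothetical helpers.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// hasRepeat reports whether the tree rooted at re contains an OpRepeat node.
func hasRepeat(re *syntax.Regexp) bool {
	if re.Op == syntax.OpRepeat {
		return true
	}
	for _, sub := range re.Sub {
		if hasRepeat(sub) {
			return true
		}
	}
	return false
}

// repeatBeforeAfter parses pattern and reports whether a counted repetition
// is present before and after Simplify. Both results are false on error.
func repeatBeforeAfter(pattern string) (before, after bool) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return false, false
	}
	before = hasRepeat(re)
	after = hasRepeat(re.Simplify())
	return before, after
}

func main() {
	for _, p := range []string{`a{2,3}`, `(x){1,2}`, `a+`} {
		before, after := repeatBeforeAfter(p)
		fmt.Printf("%-9s before=%v after=%v\n", p, before, after)
	}
}
```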

func (*Regexp) String

func (re *Regexp) String() string