doc/pages/regex.asciidoc


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193

= Regex

== Regex syntax

Kakoune's regex syntax is inspired by ECMAScript, as defined by the
ECMA-262 standard (see <<regex#compatibility,:doc regex compatibility>>).

Kakoune's regex always runs on Unicode codepoint sequences, not on bytes.

== Literals

Every character except the syntax characters `\^$.*+?[]{}|().` match
themselves. Syntax characters can be escaped with a backslash so that
`\$` will match a literal `$`, and `\\` will match a literal `\`.

Some literals are available as escape sequences:

* `\f` matches the form feed character.
* `\n` matches the newline character.
* `\r` matches the carriage return character.
* `\t` matches the tabulation character.
* `\v` matches the vertical tabulation character.
* `\0` matches the null character.
* `\cX` matches the control-`X` character (`X` can be in `[A-Za-z]`).
* `\xXX` matches the character whose codepoint is `XX` (in hexadecimal).
* `\uXXXXXX` matches the character whose codepoint is `XXXXXX` (in hexadecimal).

== Character classes

The `[` character introduces a character class, matching one character
from a set of characters.

A character class contains a list of literals, character ranges,
and character class escapes surrounded by `[` and `]`.

If the first character inside a character class is `^`, then the character
class is negated, meaning that it matches every character not specified
in the character class.

Literals match themselves, including syntax characters, so `^`
does not need to be escaped in a character class. `[\*+]` matches both
the `\*` character and the `+` character. Literal escape sequences are
supported, so `[\n\r]` matches both the newline and carriage return
characters.

The `]` character needs to be escaped for it to match a literal `]`
instead of closing the character class.

Character ranges are written as `<start character>-<end character>`, so
`[A-Z]` matches all uppercase basic letters. `[A-Z0-9]` will match all
uppercase basic letters and all basic digits.

The `-` characters in a character class that are not specifying a
range are treated as literal `-`, so `[A-Z-+]` matches all upper case
characters, the `-` character, and the `+` character.

Supported character class escapes are:

* `\d` which matches digits 0-9.
* `\w` which matches word characters A-Z, a-z, 0-9 and underscore
    (ignoring the `extra_word_chars` option).
* `\s` which matches all Unicode whitespace characters.
* `\h` which matches whitespace except Vertical Tab and line-breaks.

Using an upper case letter instead of a lower case one will negate
the character class. For example, `\D` will match every non-digit
character.

Character class escapes can be used outside of a character class, `\d`
is equivalent to `[\d]`.

== Any character

`.` matches any character, including newlines, by default.
(see <<regex#modifiers,:doc regex modifiers>> on how to change it)

== Groups

Regex atoms can be grouped using `(` and `)` or `(?:` and `)`. If `(` is
used, the group will be a capturing group, which means the positions from
the subject strings that matched between `(` and `)` will be recorded.

Capture groups are numbered starting at 1. They are numbered in the
order of appearance of their `(` in the regex. A special capture group
0 is for the whole sequence that matched.

* `(?:` introduces a non capturing group, which will not record the
matching positions.

* `(?<name>` introduces a named capturing group, which, in addition to
being referred by number, can be, in certain contexts, referred by the
given name.

== Alternations

The `|` character introduces an alternation, which will either match
its left-hand side, or its right-hand side (preferring the left-hand side)

For example, `foo|bar` matches either `foo` or `bar`, `foo(bar|baz|qux)`
matches `foo` followed by either `bar`, `baz` or `qux`.

== Quantifier

Literals, character classes, any characters, and groups can be followed
by a quantifier, which specifies the number of times they can match.

* `?` matches zero, or one time.
* `*` matches zero or more times.
* `+` matches one or more times.
* `{n}` matches exactly `n` times.
* `{n,}` matches `n` or more times.
* `{n,m}` matches `n` to `m` times.
* `{,m}` matches zero to `m` times.

By default, quantifiers are *greedy*, which means they will prefer to
match more characters if possible. Suffixing a quantifier with `?` will
make it non-greedy, meaning it will prefer to match as few characters
as possible.

== Zero width assertions

Assertions do not consume any character, but they will prevent the regex
from matching if not fulfilled.

* `^` matches at the start of a line; that is, just after a newline
      character, or at the subject's beginning (unless it is specified
      that the subject's beginning is not a start of line).
* `$` matches at the end of a line; that is, just before a newline, or
      at the subject end (unless it is specified that the subject's end
      is not an end of line).
* `\b` matches at a word boundary; which is to say that between the
       previous character and the current character, one is a word
       character, and the other is not.
* `\B` matches at a non-word boundary; meaning, when both the previous
       character and the current character are word characters, or both
       are not.
* `\A` matches at the subject string's beginning.
* `\z` matches at the subject string's end.
* `\K` matches anything, and resets the start position of capture group
       0 to the current position.

More complex assertions can be expressed with lookarounds:

* `(?=...)` is a lookahead; it will match if its content matches the
            text following the current position.
* `(?!...)` is a negative lookahead; it will match if its content does
            not match the text following the current position.
* `(?<=...)` is a lookbehind; it will match if its content matches
             the text preceding the current position.
* `(?<!...)` is a negative lookbehind; it will match if its content does
             not match the text preceding the current position.

For performance reasons, lookaround contents must be a sequence of
literals, character classes, or any character (`.`); quantifiers are not
supported.

For example, `(?<!bar)(?=foo).` will match any character which is not
preceded by `bar` and where `foo` matches from the current position
(which means the character has to be an `f`).

== Modifiers

Some modifiers can control the matching behavior of the atoms following
them:

* `(?i)` starts case-insensitive matching.
* `(?I)` starts case-sensitive matching (default).
* `(?s)` allows `.` to match newlines (default).
* `(?S)` prevents `.` from matching newlines.

== Quoting

`\Q` will start a quoted sequence, where every character is treated as
a literal. That quoted sequence will continue until either the end of
the regex, or the appearance of `\E`.

For example, `.\Q.^$\E$` will match any character followed by the
literal string `.^$`, followed by an end of line.

== Compatibility

Kakoune's syntax tries to follow the ECMAScript regex syntax, as defined
by <https://www.ecma-international.org/ecma-262/8.0/>; some divergence
exists for ease of use, or performance reasons:

* Lookarounds are not arbitrary, but lookbehind is supported.
* `\K`, `\Q..\E`, `\A`, `\h` and `\z` are added.
* Stricter handling of escaping, as we introduce additional escapes;
  identity escapes like `\X` with `X` being a non-special character
  are not accepted, to avoid confusions between `\h` meaning literal
  `h` in ECMAScript, and horizontal blank in Kakoune.
* `\uXXXXXX` uses 6 digits to cover all of Unicode, instead of relying
  on ECMAScript UTF-16 surrogate pairs with 4 digits.