| author | Kylie McClain <kylie@somas.is> | 2021-05-10 02:40:01 -0400 |
|---|---|---|
| committer | Kylie McClain <kylie@somas.is> | 2021-05-16 09:16:14 -0400 |
| commit | 4ae2102cd8ea34cd207c176c3d1ab7840c32d61c | |
| tree | 8afe1a4b32ef8996a6793dc26c5f13304d185305 | |
| parent | ead12e11bdfc861c0f1decb9ff7e91582196fcfe | |
regex.asciidoc: rephrasing, style, consistency
* Polish some grammar in places.
* Correct some capitalization nitpicks.
* Use "newline" rather than "line feed", which tends to be more common
  in Kakoune's documentation thus far.
I rephrased some sections, as some of them read a little odd.
* Zero width assertions
  * Consistently use "subject's beginning" instead of "subject begin",
    it reads better.
  * Improve the flow of the word boundary descriptions.
* Modifiers
  * Improve phrasing to emphasize the linear nature of their usage and
    remove a double negative.
  * Use `.` instead of "dot", since that aids in searching through the
    page for things talking about the dot character.
* Compatibility
  * Use asciidoc syntax for the link to the ECMA-262 standard.
  * Use better punctuation on the point about escapes.
| -rw-r--r-- | doc/pages/regex.asciidoc | 124 |
1 files changed, 64 insertions, 60 deletions
diff --git a/doc/pages/regex.asciidoc b/doc/pages/regex.asciidoc
index b7fd5391..9c1f4859 100644
--- a/doc/pages/regex.asciidoc
+++ b/doc/pages/regex.asciidoc
@@ -1,29 +1,29 @@
 = Regex
 
-== Regex Syntax
+== Regex syntax
 
-Kakoune regex syntax is based on the ECMAScript syntax, as defined by the
-ECMA-262 standard (see <<Compatibility>>).
+Kakoune regex syntax is based on ECMAScript syntax, as defined by the
+ECMA-262 standard (see <<regex#compatibility,:doc regex compatibility>>).
 
-Kakoune's regex always run on Unicode codepoint sequences, not on bytes.
+Kakoune's regex always runs on Unicode codepoint sequences, not on bytes.
 
 == Literals
 
 Every character except the syntax characters `\^$.*+?[]{}|().` match
-themselves. Syntax characters can be escaped with a backslash so `\$`
-will match a literal `$` and `\\` will match a literal `\`.
+themselves. Syntax characters can be escaped with a backslash so that
+`\$` will match a literal `$`, and `\\` will match a literal `\`.
 
 Some literals are available as escape sequences:
 
 * `\f` matches the form feed character.
-* `\n` matches the line feed character.
+* `\n` matches the newline character.
 * `\r` matches the carriage return character.
 * `\t` matches the tabulation character.
 * `\v` matches the vertical tabulation character.
 * `\0` matches the null character.
-* `\cX` matches the control-X character (X can be in `[A-Za-z]`).
-* `\xXX` matches the character whose codepoint is XX (in hexadecimal).
-* `\uXXXXXX` matches the character whose codepoint is XXXXXX (in hexadecimal).
+* `\cX` matches the control-`X` character (`X` can be in `[A-Za-z]`).
+* `\xXX` matches the character whose codepoint is `XX` (in hexadecimal).
+* `\uXXXXXX` matches the character whose codepoint is `XXXXXX` (in hexadecimal).
 
 == Character classes
 
@@ -40,15 +40,15 @@ in the character class.
 Literals match themselves, including syntax characters, so `^`
 does not need to be escaped in a character class. `[\*+]` matches both
 the `\*` character and the `+` character. Literal escape sequences are
-supported, so `[\n\r]` matches both the line feed and carriage return
+supported, so `[\n\r]` matches both the newline and carriage return
 characters.
 
 The `]` character needs to be escaped for it to match a literal `]`
 instead of closing the character class.
 
 Character ranges are written as `<start character>-<end character>`, so
-`[A-Z]` matches all upper case basic letters. `[A-Z0-9]` will match all
-upper cases basic letters and all basic digits.
+`[A-Z]` matches all uppercase basic letters. `[A-Z0-9]` will match all
+uppercase basic letters and all basic digits.
 
 The `-` characters in a character class that are not specifying a
 range are treated as literal `-`, so `[A-Z-+]` matches all upper case
@@ -62,15 +62,16 @@ Supported character class escapes are:
 * `\h` which matches all horizontal whitespace characters.
 
 Using an upper case letter instead of a lower case one will negate
-the character class, meaning for example that `\D` will match every
-non-digit character.
+the character class. For example, `\D` will match every non-digit
+character.
 
 Character class escapes can be used outside of a character class, `\d`
 is equivalent to `[\d]`.
 
 == Any character
 
-`.` matches any character, including new lines.
+`.` matches any character, including newlines, by default.
+(see <<regex#modifiers,:doc regex modifiers>> on how to change it)
 
 == Groups
 
@@ -99,16 +100,16 @@ matches `foo` followed by either `bar`, `baz` or `qux`.
 
 == Quantifier
 
-Literals, Character classes, Any characters and groups can be followed
+Literals, character classes, any characters, and groups can be followed
 by a quantifier, which specifies the number of times they can match.
 
-* `?` matches zero or one times.
+* `?` matches zero, or one time.
 * `*` matches zero or more times.
 * `+` matches one or more times.
-* `{n}` matches exactly n times.
-* `{n,}` matches n or more times.
-* `{n,m}` matches n to m times.
-* `{,m}` matches zero to m times.
+* `{n}` matches exactly `n` times.
+* `{n,}` matches `n` or more times.
+* `{n,m}` matches `n` to `m` times.
+* `{,m}` matches zero to `m` times.
 
 By default, quantifiers are *greedy*, which means they will prefer to
 match more characters if possible. Suffixing a quantifier with `?` will
@@ -117,37 +118,40 @@ as possible.
 
 == Zero width assertions
 
-Assertions do not consume any character, but will prevent the regex
-from matching if they are not fulfilled.
+Assertions do not consume any character, but they will prevent the regex
+from matching if not fulfilled.
 
-* `^` matches at the start of a line, that is just after a new line
-  character, or at the subject begin (except if specified that the
-  subject begin is not a start of line).
-* `$` matches at the end of a line, that is just before a new line, or
-  at the subject end (except if specified that the subject's end
+* `^` matches at the start of a line; that is, just after a newline
+  character, or at the subject's beginning (unless it is specified
+  that the subject's beginning is not a start of line).
+* `$` matches at the end of a line; that is, just before a newline, or
+  at the subject end (unless it is specified that the subject's end
   is not an end of line).
-* `\b` matches at a word boundary, when one of the previous character
-  and current character is a word character, and the other is not.
-* `\B` matches at a non word boundary, when both the previous character
-  and the current character are word, or are not.
-* `\A` matches at the subject string begin.
-* `\z` matches at the subject string end.
-* `\K` matches anything, and resets the start position of the capture
-  group 0 to the current position.
+* `\b` matches at a word boundary; which is to say that between the
+  previous character and the current character, one is a word
+  character, and the other is not.
+* `\B` matches at a non-word boundary; meaning, when both the previous
+  character and the current character are word characters, or both
+  are not.
+* `\A` matches at the subject string's beginning.
+* `\z` matches at the subject string's end.
+* `\K` matches anything, and resets the start position of capture group
+  0 to the current position.
 
 More complex assertions can be expressed with lookarounds:
 
-* `(?=...)` is a lookahead, it will match if its content matches the text
-  following the current position
-* `(?!...)` is a negative lookahead, it will match if its content does
-  not match the text following the current position
-* `(?<=...)` is a lookbehind, it will match if its content matches
-  the text preceding the current position
-* `(?<!...)` is a negative lookbehind, it will match if its content does
-  not match the text preceding the current position
+* `(?=...)` is a lookahead; it will match if its content matches the
+  text following the current position.
+* `(?!...)` is a negative lookahead; it will match if its content does
+  not match the text following the current position.
+* `(?<=...)` is a lookbehind; it will match if its content matches
+  the text preceding the current position.
+* `(?<!...)` is a negative lookbehind; it will match if its content does
+  not match the text preceding the current position.
 
-For performance reasons lookaround contents must be sequence of literals,
-character classes or any-character (`.`); Quantifiers are not supported.
+For performance reasons, lookaround contents must be a sequence of
+literals, character classes, or any character (`.`); quantifiers are not
+supported.
 
 For example, `(?<!bar)(?=foo).` will match any character which is not
 preceded by `bar` and where `foo` matches from the current position
@@ -158,10 +162,10 @@ preceded by `bar` and where `foo` matches from the current position
 Some modifiers can control the matching behavior of the atoms following
 them:
 
-* `(?i)` enables case-insensitive matching
-* `(?I)` disables case-insensitive matching (default)
-* `(?s)` enables dot-matches-newline (default)
-* `(?S)` disables dot-matches-newline
+* `(?i)` starts case-insensitive matching.
+* `(?I)` starts case-sensitive matching (default).
+* `(?s)` allows `.` to match newlines (default).
+* `(?S)` prevents `.` from matching newlines.
 
 == Quoting
 
@@ -169,20 +173,20 @@ them:
 a literal. That quoted sequence will continue until either the end of
 the regex, or the appearance of `\E`.
 
-For example `.\Q.^$\E$` will match any character followed by the literal
-string `.^$` followed by an end of line.
+For example, `.\Q.^$\E$` will match any character followed by the
+literal string `.^$`, followed by an end of line.
 
 == Compatibility
 
-The syntax tries to follow the ECMAScript regex syntax as defined by
-https://www.ecma-international.org/ecma-262/8.0/ some divergences
-exists for ease of use or performance reasons:
+Kakoune's syntax tries to follow the ECMAScript regex syntax, as defined
+by <https://www.ecma-international.org/ecma-262/8.0/>; some divergence
+exists for ease of use, or performance reasons:
 
-* lookarounds are not arbitrary, but lookbehind is supported.
+* Lookarounds are not arbitrary, but lookbehind is supported.
 * `\K`, `\Q..\E`, `\A`, `\h` and `\z` are added.
-* Stricter handling of escaping, as we introduce additional
-  escapes, identity escapes like `\X` with X a non-special character
+* Stricter handling of escaping, as we introduce additional escapes;
+  identity escapes like `\X` with `X` being a non-special character
   are not accepted, to avoid confusions between `\h` meaning literal
   `h` in ECMAScript, and horizontal blank in Kakoune.
-* `\uXXXXXX` uses 6 digits to cover all of unicode, instead of relying
+* `\uXXXXXX` uses 6 digits to cover all of Unicode, instead of relying
   on ECMAScript UTF-16 surrogate pairs with 4 digits.
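
The quantifier and lookaround wording touched by this patch can be tried out with a short sketch. It uses Python's `re` module purely as a stand-in, which is an assumption: Python's engine is not Kakoune's, and Kakoune-specific additions such as `\K`, `\Q..\E`, `\h`, `(?I)` and `(?S)` are not covered. Only constructs that Python also supports are exercised, and the sample strings are made up for illustration.

```python
import re

# Hypothetical sample text; not taken from the Kakoune documentation.
text = "<a><b>"

# Greedy quantifier: `+` prefers to match as many characters as possible.
greedy = re.search(r"<.+>", text).group(0)

# Lazy quantifier: suffixing with `?` prefers as few characters as possible.
lazy = re.search(r"<.+?>", text).group(0)

# The lookaround example quoted in the patched documentation:
# any character not preceded by `bar` and where `foo` matches from
# the current position.
subject = "barfoo foo"
positions = [m.start() for m in re.finditer(r"(?<!bar)(?=foo).", subject)]

print(greedy)     # <a><b>
print(lazy)       # <a>
print(positions)  # [7]
```

Only the second `foo` is reported, because the first one is preceded by `bar`; the single consumed character is the `f` at index 7. Whether Kakoune reports the same positions should be checked in Kakoune itself.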
