summaryrefslogtreecommitdiff
path: root/src/regex_impl.cc
AgeCommit message (Collapse)Author
2025-07-08Replace std::unique_ptr with a custom implementationMaxime Coste
<memory> is a costly header we can avoid by just implementing UniquePtr ourselves, which is a pretty straightforward in modern C++, this saves around 10% of the compilation time here.
2025-07-07Add a CharRange regex op to optimize the common simple range caseMaxime Coste
Instead of jumping into the general CharClass code, detect simple [a-z] style ranges and use a specific op. Also detect when a range can be converted to ignore case
2025-04-02Reduce include creepMaxime Coste
2025-02-04Revert "Use uint64_t for regex step"Maxime Coste
This got pushed by accident This reverts commit d92496449d0c9655253ad16363685bb8446dc582.
2025-01-22Use uint64_t for regex stepMaxime Coste
Its unclear that maintaining a small instruction size outweigh the cost of handling wrapping of the current_step every 64K codepoints, this makes the code simpler.
2024-12-01Add specific start desc optimization for single possible start byteMaxime Coste
Use tighter codegen for that pretty common use case.
2024-11-04Fix backward regex search ending in DOTALLJohannes Altmanninger
I noticed that reverse searches ending in "." stopped working in version 2024.05.08: kak -n -e "exec %{%cfoobar<ret><esc>gj<a-/>foo.<ret>}' Bisects ca7471c25 (Compute StartDesc with an offset to effective start, 2024-03-18) which updated the find_next_start() logic for the forward case but not for backward case. Add a symmetrical change and test case, that seems to fix it. Not 100% sure if this is correct but feels so.
2024-08-16include headers cleanupAdriĆ  Arrufat
2024-08-12Reduce headers dependency graphMaxime Coste
Move more code into the implementation files to reduce the amount of code pulled by headers.
2024-08-12Extract format implementation to its own fileMaxime Coste
Split it to avoid pulling all string_utils dependencies for just format.
2024-08-04Move most code in regex_impl inside the anonymous namespaceMaxime Coste
This is all internal code that should not be exposed.
2024-07-15Improve error message for alternations in lookarounds and add testsMaxime Coste
Fixes #5203
2024-06-26Fix order of evaluation issue when compiling regex alternationsMaxime Coste
The previous implementation was relying on the left hand side of the assignement's side effects to be sequenced before the right hand side evaluation (so --split_pos was expected to have taken place) This is actually counter to C++17 which guarantees evaluation of the right hand side of the assignement first. For some reason this is not the case with GCC on linux.
2024-06-15Store instruction pointers directly in ThreadedRegexVM::ThreadMaxime Coste
The previous tradeoff of having a very small Thread struct is not necessary anymore as we do not memcpy Threads on swap_next since d708b77186c1685dcbd2298246ada7d204acec2f. This requires offsets to be used instead of indices for jump/split ops.
2024-03-22Match Op declaration order in switchesMaxime Coste
2024-03-21Compute StartDesc with an offset to effective startMaxime Coste
This means `.{2,4}foo` will now consider 4 or less before f as a start candidate instead of every characters
2024-03-13Simplify and accelerate start desc mapMaxime Coste
Store values for all possible bytes and fill utf8 multi byte start values when necessary.
2024-03-13Fix quantifier parsing bugMaxime Coste
2024-03-12Simplify Quantifier logic in regex parsingMaxime Coste
Remove redundant type enum
2023-11-05Remove ignored packed attribute and static_assert on Node sizeMaxime Coste
This static_assert is not necessary for the code to work and is not valid on every platform.
2023-11-03Add support for 0-padding in format and replace uses of sprintfMaxime Coste
2023-06-27Unbreak build on ppcSergey Fedorov
Fixes: https://github.com/mawww/kakoune/issues/4937
2023-02-19Optimize Regex CharacterClass matchingMaxime Coste
Take advantage of ranges sorting to early out, make the logic inline.
2023-02-14Fix broken corner cases in DualThreadStack::grow_ifnMaxime Coste
We only grow when the ring buffer is full, which allows for a nice simplification of the code. Tell grow_ifn if we pushed in current or next so that we can distinguish between filled by next or filled by current when m_current == m_next_begin
2023-02-13Remove scheduled optimization from ThreadedRegexVMMaxime Coste
This does not seem to actually speed up execution as threads will be dropped on next step anyway
2023-01-23Fix incorrect use of subject end/begin in regex executionMaxime Coste
This could lead to reading past subject string end in certain conditions Fixes #4794
2022-08-05Reuse existing character classes when possible in regexMaxime Coste
2022-02-02Fix regex alternation execution priorityMaxime Coste
The ThreadedRegexVM implementation does not execute split opcodes as expected: on split the pending thread is pushed on top of the thread stack, which means that when multiple splits are executed in a row (such as with a disjunction with 3 or more branches) the last split target gets on top of the thread stack and gets executed next (when the thread from the first split target would be the expected one) Fixing this in the ThreadedRegexVM would have a performance impact as we would not be able to use a plain stack for current threads, so the best solution at the moment is to reverse the order of splits generated by a disjunction. Fixes #4519
2021-12-11Fix parsing nul bytes in regexMaxime Coste
Fixes #4460
2021-11-25small regex impl code style tweakMaxime Coste
2021-11-21Micro-optimize regex character class/type matchingMaxime Coste
Also force-inline step_thread as function call overhead has a mesurable impact.
2021-11-21Reduce the amount of Regex VM Instruction codeMaxime Coste
Merge all lookarounds into the same instruction, merge splits, merge literal ignore case with literal... Besides reducing the amount of almost duplicated code, this improves performance by reducing pressure on the (often failing) branch target prediction for instruction dispatching by moving branches into the instruction code themselves where they are more likely to be well predicted.
2021-08-21Use the [[gnu::packed]] C++ attribute.Peter Pentchev
Suggested by: Maxime Coste <mawww@kakoune.org>
2021-08-20Do not break non-GCC/g++ compilers.Peter Pentchev
2021-08-20Make sure the ParsedRegex structure has the right size.Peter Pentchev
Some versions of GCC/g++ will not necessarily pad the structure to a 32-bit boundary, so make the alignment and the filler explicit. Detected on: Debian/m68k; https://buildd.debian.org/status/fetch.php?pkg=kakoune&arch=m68k&ver=2020.09.01-1&stamp=1629387444&raw=0
2021-07-31Code style tweak for Regex implementation TestVMMaxime Coste
2021-01-03Add missing limits includesMaxime Coste
Fixes #4003
2020-02-07Fix regex start desc computation for case insensitive rangesMaxime Coste
Fixes #3345
2019-12-05Restore regex optimization pass by introducing basic block analysisMaxime Coste
Run the peephole optimizer on each basic block, avoiding the previous issue that some instructions could move across their boundaries.
2019-11-09Add static or const where usefulJason Felice
2019-11-06Support \x and \u escapes in regex character classesMaxime Coste
Change \u to use 6 digits to cover the full unicode range. Fixes #3172
2019-07-06Fix build on FreeBSDTobias Kortkamp
file.cc:390:21: error: use of undeclared identifier 'rename'; did you mean 'devname'? if (replace and rename(temp_filename, zfilename) != 0) ^~~~~~ devname /usr/include/stdlib.h:277:7: note: 'devname' declared here char *devname(__dev_t, __mode_t); ^ file.cc:390:28: error: cannot initialize a parameter of type '__dev_t' (aka 'unsigned long') with an lvalue of type 'char [1024]' if (replace and rename(temp_filename, zfilename) != 0) ^~~~~~~~~~~~~ /usr/include/stdlib.h:277:22: note: passing argument to parameter here char *devname(__dev_t, __mode_t); ^ 2 errors generated. --- highlighters.cc:1110:13: error: use of undeclared identifier 'snprintf'; did you mean 'vswprintf'? snprintf(buffer, 16, format, std::abs(line_to_format)); ^~~~~~~~ vswprintf /usr/include/wchar.h:139:5: note: 'vswprintf' declared here int vswprintf(wchar_t * __restrict, size_t n, const wchar_t * __restrict, ^ highlighters.cc:1110:22: error: cannot initialize a parameter of type 'wchar_t *' with an lvalue of type 'char [16]' snprintf(buffer, 16, format, std::abs(line_to_format)); ^~~~~~ /usr/include/wchar.h:139:35: note: passing argument to parameter here int vswprintf(wchar_t * __restrict, size_t n, const wchar_t * __restrict, ^ 2 errors generated. --- json_ui.cc:60:13: error: use of undeclared identifier 'sprintf'; did you mean 'swprintf'? sprintf(buf, "\\u%04x", *next); ^~~~~~~ swprintf /usr/include/wchar.h:133:5: note: 'swprintf' declared here int swprintf(wchar_t * __restrict, size_t n, const wchar_t * __restrict, ^ json_ui.cc:60:21: error: cannot initialize a parameter of type 'wchar_t *' with an lvalue of type 'char [7]' sprintf(buf, "\\u%04x", *next); ^~~ /usr/include/wchar.h:133:34: note: passing argument to parameter here int swprintf(wchar_t * __restrict, size_t n, const wchar_t * __restrict, ^ json_ui.cc:74:9: error: use of undeclared identifier 'sprintf' sprintf(buffer, R"("#%02x%02x%02x")", color.r, color.g, color.b); ^ 3 errors generated. --- regex_impl.cc:1039:9: error: use of undeclared identifier 'sprintf'; did you mean 'swprintf'? sprintf(buf, " %03d ", count++); ^~~~~~~ swprintf /usr/include/wchar.h:133:5: note: 'swprintf' declared here int swprintf(wchar_t * __restrict, size_t n, const wchar_t * __restrict, ^ regex_impl.cc:1039:17: error: cannot initialize a parameter of type 'wchar_t *' with an lvalue of type 'char [20]' sprintf(buf, " %03d ", count++); ^~~ /usr/include/wchar.h:133:34: note: passing argument to parameter here int swprintf(wchar_t * __restrict, size_t n, const wchar_t * __restrict, ^ regex_impl.cc:1197:17: error: use of undeclared identifier 'puts' { if (dump) puts(dump_regex(*this).c_str()); } ^ regex_impl.cc:1208:18: note: in instantiation of member function 'Kakoune::(anonymous namespace)::TestVM<Kakoune::RegexMode::Forward>::TestVM' requested here TestVM<> vm{R"(a*b)"}; ^ regex_impl.cc:1197:17: error: use of undeclared identifier 'puts' { if (dump) puts(dump_regex(*this).c_str()); } ^ regex_impl.cc:1283:56: note: in instantiation of member function 'Kakoune::(anonymous namespace)::TestVM<5>::TestVM' requested here TestVM<RegexMode::Forward | RegexMode::Search> vm{R"(f.*a(.*o))"}; ^ regex_impl.cc:1197:17: error: use of undeclared identifier 'puts' { if (dump) puts(dump_regex(*this).c_str()); } ^ regex_impl.cc:1423:57: note: in instantiation of member function 'Kakoune::(anonymous namespace)::TestVM<6>::TestVM' requested here TestVM<RegexMode::Backward | RegexMode::Search> vm{R"(fo{1,})"}; ^ 5 errors generated. --- remote.cc:829:9: error: use of undeclared identifier 'rename'; did you mean 'devname'? if (rename(old_socket_file.c_str(), new_socket_file.c_str()) != 0) ^~~~~~ devname /usr/include/stdlib.h:277:7: note: 'devname' declared here char *devname(__dev_t, __mode_t); ^ remote.cc:829:16: error: cannot initialize a parameter of type '__dev_t' (aka 'unsigned long') with an rvalue of type 'const char *' if (rename(old_socket_file.c_str(), new_socket_file.c_str()) != 0) ^~~~~~~~~~~~~~~~~~~~~~~ /usr/include/stdlib.h:277:22: note: passing argument to parameter here char *devname(__dev_t, __mode_t); ^ 2 errors generated. --- string_utils.cc:126:20: error: use of undeclared identifier 'sprintf'; did you mean 'swprintf'? res.m_length = sprintf(res.m_data, "%i", val); ^~~~~~~ swprintf /usr/include/wchar.h:133:5: note: 'swprintf' declared here int swprintf(wchar_t * __restrict, size_t n, const wchar_t * __restrict, ^ string_utils.cc:126:28: error: cannot initialize a parameter of type 'wchar_t *' with an lvalue of type 'char [15]' res.m_length = sprintf(res.m_data, "%i", val); ^~~~~~~~~~ /usr/include/wchar.h:133:34: note: passing argument to parameter here int swprintf(wchar_t * __restrict, size_t n, const wchar_t * __restrict, ^ string_utils.cc:133:20: error: use of undeclared identifier 'sprintf'; did you mean 'swprintf'? res.m_length = sprintf(res.m_data, "%u", val); ^~~~~~~ swprintf [...]
2019-02-27Fixed all reorder warningsJustin Frank
2019-02-04Remove peephole regex optimization passMaxime Coste
The current implementation is wrong as it crosses basic blocks boundaries. Doing basic block decomposition of regex is probably a tad too complex for this single optimization. Fixes #2711
2019-02-04Fix regex not always selecting the leftmost longest matchMaxime Coste
(Actually the rightmost longest match when searching backwards) Fixes #2710
2019-01-20Add a peephole optimization pass to the regex compilerMaxime Coste
2019-01-20Refactor regex find next start not to be an instruction anymoreMaxime Coste
The same logic can be hard coded, avoiding one thread and 3 instructions, improving the regex matching speed.
2019-01-20Split compile time regex flags from runtime onesMaxime Coste
2019-01-20Refactor parsed regex children iteration to use regular range-for loopsMaxime Coste
2019-01-03Add support for named captures to the regex impl and regex highlighterMaxime Coste
ECMAScript is adding support for it, and it is a pretty isolated change to do. Fixes #2293