Well, by strings I meant 8-bit/16-bit chars (std::string/wstring). One "problem" with these string patterns is that they let the engine do more than mere comparison -- they let the engine actually enumerate all the characters in between. If I remember correctly, I've seen linear-ish-time regex engines (i.e.: I'm not talking about the backtracking ones you see in most languages, although they might sometimes do this too) take advantage of this by making lookup tables for the characters in an interval, and this clearly doesn't scale when you go to 32-bit or 64-bit "characters". They really have to implement efficient set intersection for large sets, which is possible, but I'm not sure I've seen them do it.
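For a rough sense of the scale gap (the arithmetic below is mine, not taken from any particular engine):

    #include <bitset>

    // A byte-alphabet character class: one bit per possible input byte.
    // 256 bits = 32 bytes per class, so eagerly building a lookup table is cheap.
    struct ByteClass {
        std::bitset<256> bits;
        bool contains(unsigned char b) const { return bits.test(b); }
    };

    // The same trick for 32-bit "characters" would need 2^32 bits, i.e. 512 MiB
    // *per class* -- which is why large alphabets push you toward interval/range
    // representations and genuine set-intersection algorithms instead.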
If you're looking for specific examples of "large alphabets", there's a very easy one in my mind, which is UTF-32. If you're looking for data that isn't a plain sequence of integers in any shape or form... I have to think back, as I don't remember what the use case I ran into was, sadly. I'll post here if I remember.
You'll also find large alphabets quite difficult to deal with in an efficient DFA implementation. Naively, a large alphabet will either cause the size of each DFA node to be very large (a dense representation) or make accesses into the DFA transitions slower (a sparse representation).
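To make the trade-off concrete, here's roughly what I mean, with made-up types rather than any real engine's layout:

    #include <algorithm>
    #include <cstdint>
    #include <utility>
    #include <vector>

    using State = std::uint32_t;

    // Dense node: O(1) transition lookup, but one entry per alphabet symbol.
    // Fine for a 256-symbol byte alphabet; hopeless for 2^21+ codepoints.
    struct DenseNode {
        std::vector<State> next;  // next.size() == alphabet size
        State step(std::uint32_t sym) const { return next[sym]; }
    };

    // Sparse node: stores only the transitions that exist, found by binary
    // search -- small, but every step now costs O(log k).
    struct SparseNode {
        std::vector<std::pair<std::uint32_t, State>> edges;  // sorted by symbol
        State step(std::uint32_t sym, State dead) const {
            auto it = std::lower_bound(
                edges.begin(), edges.end(), sym,
                [](const std::pair<std::uint32_t, State>& e, std::uint32_t s) {
                    return e.first < s;
                });
            return (it != edges.end() && it->first == sym) ? it->second : dead;
        }
    };

I believe most practical engines split the difference with equivalence classes or multi-level tables, but the underlying tension is the one above.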
Most DFA engines I've seen take advantage of "strings" themselves and define the alphabet to be the set of bytes. If you're searching UTF-8, then this also implies that your DFA must include UTF-8 decoding (which is generally the right choice for performance reasons anyway).
Regex engines with modern Unicode support do generally need efficient set operations on large sets though, in order to handle the various character class syntaxes (which include negation, intersection, symmetric difference and, of course, union). Usually you want some kind of interval set structure, and its implementation complexity, in my experience, generally correlates with the amount of additional memory you're willing to use. :-)
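For what it's worth, the bare-bones version of the interval-set idea looks something like this (a sketch only; it ignores case folding, Unicode property lookups, and all the memory/speed tuning I alluded to):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // A character class as sorted, non-overlapping, inclusive codepoint ranges.
    using Range    = std::pair<std::uint32_t, std::uint32_t>;
    using ClassSet = std::vector<Range>;

    // Intersection: walk both sorted range lists in lockstep.
    ClassSet intersect(const ClassSet& a, const ClassSet& b) {
        ClassSet out;
        std::size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            std::uint32_t lo = std::max(a[i].first, b[j].first);
            std::uint32_t hi = std::min(a[i].second, b[j].second);
            if (lo <= hi) out.push_back({lo, hi});
            if (a[i].second < b[j].second) ++i; else ++j;  // advance the range that ends first
        }
        return out;
    }

    // Negation relative to the full Unicode scalar range [0, 0x10FFFF].
    ClassSet negate(const ClassSet& a) {
        ClassSet out;
        std::uint32_t next = 0;
        for (const Range& r : a) {
            if (r.first > next) out.push_back({next, r.first - 1});
            next = r.second + 1;
        }
        if (next <= 0x10FFFF) out.push_back({next, 0x10FFFF});
        return out;
    }

Union and symmetric difference fall out of the same merge pattern (or, via De Morgan, out of combinations of the two functions above).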
...yeah, exactly that, thanks for mentioning the character class issues. :-) I think it was partly for this reason (but also others -- like my hunch that it would help SAT or SMT solving) that I've had this idea of making a "universal" int_set class that adaptively adjusts its internal data structure (bitvectors, intervals, "shift-intervals", sparse sets, BDDs, etc.) based on its contents' sparsity patterns... which, for n different data structures, would require O(n^2) or O(n^3) methods to handle set (and perhaps arithmetic) operations between them (n^2 if the output type matches the input types, n^3 if the output data structure is allowed to differ from the inputs'), which is kind of painful but I think doable and likely to be useful. That's yet another ambitious endeavor I may never get around to, or at least not before someone else does...
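To illustrate the pairwise-dispatch blowup I mean (a toy sketch with only two of the n representations; the names IntSet/set_union are just placeholders):

    #include <algorithm>
    #include <bitset>
    #include <cstdint>
    #include <utility>
    #include <variant>
    #include <vector>

    // Two of the candidate internal representations.
    using Bits      = std::bitset<1024>;  // dense bitvector over a small universe
    using Intervals = std::vector<std::pair<std::uint32_t, std::uint32_t>>;  // sparse ranges

    struct IntSet {
        std::variant<Bits, Intervals> rep;
    };

    // Every pair of representations needs its own implementation of every
    // operation -- this is where the O(n^2) method count comes from (O(n^3)
    // if the output representation may differ from both inputs').
    Bits union_(const Bits& a, const Bits& b) { return a | b; }

    Bits union_(const Bits& a, const Intervals& b) {
        Bits out = a;
        for (const auto& r : b)
            for (std::uint32_t x = r.first; x <= r.second && x < out.size(); ++x)
                out.set(x);
        return out;
    }

    Bits union_(const Intervals& a, const Bits& b) { return union_(b, a); }

    Intervals union_(const Intervals& a, const Intervals& b) {
        Intervals all = a;
        all.insert(all.end(), b.begin(), b.end());
        std::sort(all.begin(), all.end());
        Intervals out;
        for (const auto& r : all) {
            if (!out.empty() && r.first <= out.back().second + 1)
                out.back().second = std::max(out.back().second, r.second);
            else
                out.push_back(r);
        }
        return out;
    }

    IntSet set_union(const IntSet& x, const IntSet& y) {
        // std::visit picks the right pairwise implementation; a genuinely
        // adaptive container would also re-choose the *output* representation
        // based on the result's density.
        return std::visit(
            [](const auto& a, const auto& b) { return IntSet{union_(a, b)}; },
            x.rep, y.rep);
    }

Even with just two representations you already need four union_ overloads; with the full menu of bitvectors, intervals, sparse sets, BDDs, etc., the quadratic growth is the painful part.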
Given a byte-based encoding like UTF-8, extended 8-bit ASCII variants, or Shift-JIS, a regular expression over Unicode characters can be translated into a more complicated, but not necessarily slower-to-match, regular expression over bytes.
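For a concrete illustration (my own example, not any particular engine's): a codepoint class splits at the UTF-8 length boundaries, and each piece becomes a short sequence of byte ranges. The encoder below shows where those boundaries come from.

    #include <array>
    #include <cstdint>

    // Encode one Unicode scalar value as UTF-8; returns the number of bytes
    // written. A regex-over-bytes compiler applies the same length boundaries
    // (0x80, 0x800, 0x10000) to whole ranges; e.g. the class [\u0000-\u07FF]
    // becomes, in byte terms,
    //     [\x00-\x7F] | [\xC2-\xDF][\x80-\xBF]
    // and wider classes split further into three- and four-byte alternatives.
    int utf8_encode(std::uint32_t cp, std::array<std::uint8_t, 4>& out) {
        if (cp < 0x80) {
            out[0] = static_cast<std::uint8_t>(cp);
            return 1;
        } else if (cp < 0x800) {
            out[0] = static_cast<std::uint8_t>(0xC0 | (cp >> 6));
            out[1] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {
            out[0] = static_cast<std::uint8_t>(0xE0 | (cp >> 12));
            out[1] = static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F));
            out[2] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
            return 3;
        } else {
            out[0] = static_cast<std::uint8_t>(0xF0 | (cp >> 18));
            out[1] = static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F));
            out[2] = static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F));
            out[3] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
            return 4;
        }
    }

The resulting byte regex is bigger, but the per-byte automaton stays simple, which is exactly why it doesn't have to be slow.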
Indeed. That's what I meant with: "Most DFA engines I've seen take advantage of "strings" themselves and define the alphabet to be the set of bytes. If you're searching UTF-8, then this also implies that your DFA must include UTF-8 decoding (which is generally the right choice for performance reasons anyway)." :-)
UTF-8 decoding usually means translating UTF-8 text into a more convenient character representation like UTF-32 (or, if you are naughty, UTF-16) before processing, not adapting the application to work directly on UTF-8 data.
I think the principle of charity applies here. What other interpretation of "UTF-8 decoding" is there for what I wrote? I mean, the automaton literally contains a UTF-8 decoding automaton right inside it, and semantically it is matching in terms of codepoints.
Would you mind giving an example of when this would be useful?