
Why do identical Python and PCRE regexes give different outputs for the same input?


I am trying to implement the minbpe library in Zig, using a wrapper over the PCRE library.

The pattern in Python is r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

When I use the pattern with a UTF-8 encoded text like abcdeparallel १२४, I get the following output:

>>> import regex as re
>>> p = re.compile(r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
>>> p
regex.Regex("'(?:[sdmt]|ll|ve|re)| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+", flags=regex.V0)
>>> p.findall("abcdeparallel १२४")
['abcdeparallel', ' १२४']

It looks like the pattern is more or less the same in the PCRE flavor as well; I only had to add the /g modifier at the end so that all matches are returned.

However, when I try the pattern with PCRE via the pcre2test tool on macOS, I get a very different output:

$ pcre2test -8
PCRE2 version 10.42 2022-12-11
  re> /'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/g
data> abcdeparallel १२४
 0: abcdeparallel
 0:  \xe0
 0: \xa5\xa7
 0: \xe0
 0: \xa5\xa8
 0: \xe0
 0: \xa5
 0: \xaa

Somehow it looks like the code points for the Hindi numerals (1, 2, 4) are interpreted differently, and the output is matched as a totally different set of characters:

>>> "\xe0\xa5\xa7\xe0\xa5\xa8"
'à¥§à¥¨'

Is there a flag or something I am missing that must be passed to get the same behaviour as the regex package from Python? When UTF-8 code points are encoded into bytes, wouldn't the library know how to put them back together into the same code points?

Solution:

The Hindi code points are actually matched, but pcre2test renders them on screen as hex escapes of their UTF-8 bytes:

>>> "१२४".encode("utf-8")
b'\xe0\xa5\xa7\xe0\xa5\xa8\xe0\xa5\xaa'
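
The same bytes may also explain why the first pcre2test run in the question reports eight matches rather than two. The sketch below is only a simulation in Python, not PCRE itself: it assumes that without UTF mode each byte is treated as one character with the properties of the code point of the same value (so 0xE0 counts as the letter à and 0xA5 as the symbol ¥). It also swaps in a standard-library approximation of the pattern (`[^\W\d_]` for `\p{L}`, `\d` for `\p{N}`), which is close enough for this input but not equivalent in general.

```python
import re

# Stdlib approximation of the question's pattern (assumption: \p{L} ~ [^\W\d_]
# and \p{N} ~ \d; fine for this input, not equivalent in general).
PATTERN = re.compile(r"'(?:[sdmt]|ll|ve|re)| ?[^\W\d_]+| ?\d+| ?[^\s\w]+|\s+(?!\S)|\s+")

text = "abcdeparallel १२४"

# Code-point-wise matching: the subject is a sequence of Unicode characters.
by_code_point = PATTERN.findall(text)

# Byte-wise matching, simulated: a Latin-1 decode turns every UTF-8 byte
# into one character with the same numeric value.
one_char_per_byte = text.encode("utf-8").decode("latin-1")
by_byte = [m.encode("latin-1") for m in PATTERN.findall(one_char_per_byte)]

print(by_code_point)
print(by_byte)
```

Under these assumptions the first list has two entries, as in the Python session from the question, and the second splits into the same eight byte groups that pcre2test printed, starting with `b'abcdeparallel'`, `b' \xe0'`, `b'\xa5\xa7'`.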

According to the pcre2test documentation:

When pcre2test is outputting text in the compiled version of a pattern, bytes other than 32-126 are always treated as non-printing characters and are therefore shown as hex escapes.

When pcre2test is outputting text that is a matched part of a subject string, it behaves in the same way, unless a different locale has been set for the pattern (using the locale modifier). In this case, the isprint() function is used to distinguish printing and non-printing characters.

The documentation doesn't mention which locales can be used. The example (fr_FR) suggests a two-letter language code followed by a two-letter country code, but it's unclear to me whether Hindi is supported.

With (*UTF) at the start of the pattern you do get two matches, and the Hindi numerals are then rendered as Unicode code point escapes:

re> /(*UTF)'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/g
data> abcdeparallel १२४
 0: abcdeparallel
 0:  \x{967}\x{968}\x{96a}
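
To check that both pcre2test outputs describe the same text as the Python result, the escapes can be turned back into characters. This is a minimal sketch with two hypothetical helpers, `decode_utf_mode` and `decode_byte_mode`, that are not part of any library. It assumes the matched text contains no literal backslashes, since those would be ambiguous with the escapes.

```python
import re

def decode_utf_mode(escaped: str) -> str:
    """Turn \\x{HEX} code point escapes back into characters."""
    return re.sub(r"\\x\{([0-9a-fA-F]+)\}",
                  lambda m: chr(int(m.group(1), 16)), escaped)

def decode_byte_mode(pieces: list[str]) -> str:
    """Join byte-mode matches written with \\xNN escapes and decode as UTF-8."""
    joined = "".join(pieces).encode("ascii")
    raw = re.sub(rb"\\x([0-9a-fA-F]{2})",
                 lambda m: bytes([int(m.group(1), 16)]), joined)
    return raw.decode("utf-8")

# Second match from the (*UTF) run.
utf_mode = decode_utf_mode(r" \x{967}\x{968}\x{96a}")

# Matches two to eight from the run without UTF mode.
byte_mode = decode_byte_mode(
    [r" \xe0", r"\xa5\xa7", r"\xe0", r"\xa5\xa8", r"\xe0", r"\xa5", r"\xaa"]
)

print(utf_mode == byte_mode == " १२४")
```

Both decode to the same string, so the difference between the two runs lies in where the match boundaries fall, not in which text was consumed.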