Why do identical Python and PCRE regexes give different outputs for the same input?

2024-03-12 07:00:06
I am trying to implement the minbpe library in Zig, using a wrapper over the PCRE library.

The pattern in Python is r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

When I use the pattern with a UTF-8 encoded text like abcdeparallel १२४, I get the following output:

>>> import regex as re
>>> p = re.compile(r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
>>> p
regex.Regex("'(?:[sdmt]|ll|ve|re)| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+", flags=regex.V0)
>>> p.findall("abcdeparallel १२४")
['abcdeparallel', ' १२४']

It looks like the pattern is more or less the same in PCRE-flavored regex as well; I assumed I only had to add a /g flag at the end for UTF-8 matching.

However, when I try to use the pattern with PCRE via the pcre2test tool on macOS, I get a very different output:

$ pcre2test -8
PCRE2 version 10.42 2022-12-11
  re> /'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/g
data> abcdeparallel १२४
 0: abcdeparallel
 0:  \xe0
 0: \xa5\xa7
 0: \xe0
 0: \xa5\xa8
 0: \xe0
 0: \xa5
 0: \xaa

Somehow it looks like the code points for the Hindi numerals (1, 2, 4) are interpreted differently, and the output is matched as a totally different set of characters:

>>> b"\xe0\xa5\xa7\xe0\xa5\xa8".decode("utf-8")
'१२'

Is there a flag or something that I am missing that must be passed to get the same behaviour as the regex package from Python? When UTF-8 code points are encoded into bytes, wouldn't the library know how to put them back together into the same code points?

Solution:

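The bytes of the Hindi digits are matched, but pcre2test prints them as hex escapes of the individual UTF-8 bytes. Without UTF mode, PCRE2 also treats each byte as a separate character, which is why the digits are split across several matches. They are the same bytes Python produces: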

>>> "१२४".encode("utf-8")
b'\xe0\xa5\xa7\xe0\xa5\xa8\xe0\xa5\xaa'
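The split in the pcre2test output can be reproduced in Python. This is only a sketch, and it assumes that PCRE2 in non-UTF 8-bit mode looks up Unicode properties for each byte as if it were the code point with the same value (below U+0100). The helper name `latin1_categories` is made up for this illustration.

```python
import unicodedata

def latin1_categories(data: bytes) -> list:
    """Pair each byte with the general category of the code point of the same value."""
    return [(f"\\x{b:02x}", unicodedata.category(chr(b))) for b in data]

subject = "१२४".encode("utf-8")
for escaped, category in latin1_categories(subject):
    # Categories starting with "L" satisfy \p{L}; symbols and punctuation
    # ("S...", "P...") fall into the [^\s\p{L}\p{N}] alternative instead.
    print(escaped, category)
```

Read this way, \xe0 and \xaa count as letters while \xa5, \xa7 and \xa8 do not, which lines up with the match boundaries in the question's output.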

According to the pcre2test documentation:

When pcre2test is outputting text in the compiled version of a pattern, bytes other than 32-126 are always treated as non-printing characters and are therefore shown as hex escapes.

When pcre2test is outputting text that is a matched part of a subject string, it behaves in the same way, unless a different locale has been set for the pattern (using the locale modifier). In this case, the isprint() function is used to distinguish printing and non-printing characters.

The documentation doesn't mention which locales can be used. The example (fr_FR) suggests a two-letter language code followed by a two-letter country code, but it's unclear to me whether Hindi is supported.

With the (*UTF) flag you do get two matches, and the Hindi numerals are then rendered as Unicode code point escapes:

  re> /(*UTF)'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/g
data> abcdeparallel १२४
 0: abcdeparallel
 0:  \x{967}\x{968}\x{96a}
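
To confirm that the `\x{...}` escapes are the same characters Python matched, they can be turned back into text. The helper below is a hypothetical convenience for this check, not part of pcre2test or of any library; it only handles the `\x{hex}` form shown above.

```python
import re

def decode_pcre2test_escapes(text: str) -> str:
    """Replace each \\x{hex} escape with the character at that code point."""
    return re.sub(
        r"\\x\{([0-9a-fA-F]+)\}",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

# The second match from the pcre2test session, including its leading space
print(repr(decode_pcre2test_escapes(r" \x{967}\x{968}\x{96a}")))  # ' १२४'
```

The result equals the second element returned by `findall` in the question, so both engines agree once UTF mode is enabled.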