I am trying to implement the minbpe library in Zig, using a wrapper over the PCRE library.
The pattern in Python is r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
When I use the pattern in Python with UTF-8 encoded text like abcdeparallel १२४, I get the following output:
>>> import regex as re
>>> p = re.compile(r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
>>> p
regex.Regex("'(?:[sdmt]|ll|ve|re)| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+", flags=regex.V0)
>>> p.findall("abcdeparallel १२४")
['abcdeparallel', ' १२४']
It looks like the pattern is more or less the same in PCRE-flavored regex as well, with me just having to add a /g flag at the end, which I assumed would take care of UTF-8 matching.
However, when I try the pattern with PCRE via the pcre2test tool on macOS, I get a very different output:
$ pcre2test -8
PCRE2 version 10.42 2022-12-11
re> /'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/g
data> abcdeparallel १२४
0: abcdeparallel
0: \xe0
0: \xa5\xa7
0: \xe0
0: \xa5\xa8
0: \xe0
0: \xa5
0: \xaa
Somehow it looks like the code points for the Hindi numerals (1, 2, 4) are interpreted differently, and the output is matched as a totally different set of characters, even though the bytes themselves are the same:
>>> b"\xe0\xa5\xa7\xe0\xa5\xa8".decode("utf-8")
'१२'
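As a sanity check, here is a minimal Python sketch of how the three digits break down into UTF-8 bytes (the variable names are just for illustration). Each digit is a three-byte sequence, and the fragments pcre2test printed are pieces of these sequences.

```python
# The same sample text used in the examples.
text = "abcdeparallel १२४"

# Take the digits after the space and encode each one separately.
digits = text.split(" ")[1]
encoded = [ch.encode("utf-8") for ch in digits]

print(encoded)
# [b'\xe0\xa5\xa7', b'\xe0\xa5\xa8', b'\xe0\xa5\xaa']
```

So the bytes in the pcre2test output are exactly the UTF-8 encoding of the digits, just split in places that do not line up with character boundaries.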
Is there a flag or something that I am missing that must be passed to get the same behaviour as the regex package/module from Python?
When UTF-8 code points are encoded into bytes, wouldn't the library know how to put them back together into the same code points?