strpos searching for unicode in PHP (and handling inline UTF-8)
30-09-2019
Question
I am having a problem dealing with a simple search for a two-character Unicode string (the needle) inside another string (the haystack) that may or may not be UTF-8.
Part of the problem is I don't know how to specify the code for use in strpos, and I don't know if PHP has to be compiled with any special support for the code, or if I have to use mb_strpos, which I am trying to avoid since it also might not be available.
For example, the needle is U+56DE U+590D (without the space).
With preg_match it might be preg_match("@\x{56DE}\x{590D}@",$haystack), but that actually requires @u, which might not be available, and I get a "Compilation failed: character value in \x{...} sequence is too large" error anyway.
I don't want to use preg_match anyway as it might be significantly slower than strpos (there are other sequences that have to be searched).
Can I convert U+56DE U+590D into its single byte sequence (possibly 5-6 characters) and then search for it via strpos? I can't figure out how to convert it to bytes if so.
How do you specify Unicode inline in PHP anyway? I mean outside of PCRE?
$blah="\u56DE\u590D";
doesn't work?
Thanks for any ideas!
Solution
First, your question is poorly structured. It asks several questions at several points. You would probably get more answers if you used a clearer structure: 1) describe the task you're trying to accomplish, 2) the limitations/requirements, 3) the strategy you considered, 4) the difficulties you found with that strategy, and whether there is a better one.
That said, I'll start from the end:
$blah="\u56DE\u590D";
doesn't work?
No, not in that form. In PHP, strings are byte arrays, and the language itself knows almost nothing about Unicode. Therefore, how you express Unicode code points in a PHP script depends on the encoding you want to use. For UTF-8, it would be "\xE5\x9B\x9E\xE5\xA4\x8D"; for UTF-16 big endian, it would be "\x56\xDE\x59\x0D"; and so on. (Since PHP 7.0, double-quoted strings also accept the braced form "\u{56DE}\u{590D}", which always produces UTF-8 bytes.)
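To make that concrete, here is a minimal sketch of the byte-literal approach with plain strpos. The haystack is made up for illustration and is assumed to already be UTF-8; if it is in another encoding, the needle bytes have to be written in that encoding instead.

```php
<?php
// U+56DE U+590D written as raw UTF-8 bytes; this works on any PHP version.
$needle = "\xE5\x9B\x9E\xE5\xA4\x8D";

// Hypothetical haystack, assumed to be UTF-8 encoded.
$haystack = "Re: " . $needle . " test";

// strpos() compares raw bytes, so no Unicode support is required.
$pos = strpos($haystack, $needle);

// Compare with !== false, because a match at offset 0 is also "falsy".
$found = ($pos !== false);

var_dump($pos);   // int(4): the four ASCII bytes of "Re: " come first
```

Because UTF-8 is self-synchronizing, a valid UTF-8 needle can only match at a character boundary of a valid UTF-8 haystack, so a byte-wise search does not produce false hits in the middle of a character.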
Can I convert U+56DE U+590D into its single byte sequence (possibly 5-6 characters) and then search for it via strpos? I can't figure out how to convert it to bytes if so.
For the first part, i.e., converting U+56DE U+590D into bytes, the answer is yes, but a clarification is needed. Are these UTF-16 code units or Unicode code points? For instance, how is 𪛖 represented: as U+D869 U+DED6 or as U+2A6D6? If they are UTF-16 code units, it's trivial to encode them into UTF-16. For UTF-16 big endian, it's just "\x56\xDE\x59\x0D". Otherwise, it's still trivial to encode them as UTF-32, but it takes a little more work to do the same in UTF-16 (or UTF-8).
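The "little more work" for UTF-8 is only a few bit shifts. The sketch below assumes the input is a valid code point given as an integer; both function names are hypothetical helpers, not built-ins, and neither one validates its input (for example, lone surrogates are not rejected).

```php
<?php
// Hypothetical helper: encode one Unicode code point (an integer) as UTF-8 bytes.
function codepoint_to_utf8($cp)
{
    if ($cp < 0x80) {            // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
function surrogates_to_codepoint($high, $low)
{
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

$needle = codepoint_to_utf8(0x56DE) . codepoint_to_utf8(0x590D);
echo bin2hex($needle), "\n";                    // e59b9ee5a48d

$cp = surrogates_to_codepoint(0xD869, 0xDED6);
echo dechex($cp), "\n";                         // 2a6d6
echo bin2hex(codepoint_to_utf8($cp)), "\n";     // f0aa9b96
```

The resulting $needle is an ordinary byte string, so it can be passed straight to strpos.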
For the second part, keep reading.
Part of the problem is I don't know how to specify the code for use in strpos, and I don't know if PHP has to be compiled with any special support for the code, or if I have to use mb_strpos, which I am trying to avoid since it also might not be available.
What are you trying to do? Why do you need to find a position in a string? strpos will give you a byte offset for a given string (again, interpreted in binary form). Are you trying to clip a string? strpos (or even mb_strpos) can mean trouble with Unicode: a glyph can consist of several code units (and, with combining characters, several code points), so you risk clipping part of a glyph. I can't advise you further unless you say what you're trying to do.
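To illustrate the difference between a byte offset and a character offset, here is a small sketch that assumes a valid UTF-8 haystack; the sample text is made up. It counts code points without mbstring by skipping UTF-8 continuation bytes.

```php
<?php
$needle   = "\xE5\x9B\x9E\xE5\xA4\x8D";             // U+56DE U+590D in UTF-8
$haystack = "\xE4\xBD\xA0\xE5\xA5\xBD" . $needle;   // U+4F60 U+597D, then the needle

// Byte offset: two 3-byte characters precede the match.
$bytePos = strpos($haystack, $needle);              // 6

// Code point offset: every byte that is not a UTF-8 continuation
// byte (10xxxxxx) starts a new code point.
$charPos = 0;
for ($i = 0; $i < $bytePos; $i++) {
    if ((ord($haystack[$i]) & 0xC0) !== 0x80) {
        $charPos++;
    }
}

// Cutting at an arbitrary byte offset can split a character:
// this keeps all of U+4F60 but only the first byte of U+597D.
$broken = substr($haystack, 0, 4);

echo $bytePos, " ", $charPos, "\n";                 // 6 2
```

If all you need is a yes/no answer, the byte offset is enough. The character count only matters once you start slicing or reporting positions, and even then it counts code points, not glyphs.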
OTHER TIPS
You wrote 'might not be available'. I suggest you try mb_strpos.
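If availability is the worry, it can be checked at run time. The following is only a sketch: find_needle is a hypothetical wrapper, the haystack is assumed to be UTF-8, and the result reports which unit the offset is in, because mb_strpos counts characters while strpos counts bytes.

```php
<?php
// Hypothetical wrapper: use mb_strpos() when the mbstring extension is
// loaded, otherwise fall back to the byte-wise strpos().
function find_needle($haystack, $needle)
{
    if (function_exists('mb_strpos')) {
        return array(
            'offset' => mb_strpos($haystack, $needle, 0, 'UTF-8'),
            'unit'   => 'characters',
        );
    }
    return array(
        'offset' => strpos($haystack, $needle),
        'unit'   => 'bytes',
    );
}

$needle   = "\xE5\x9B\x9E\xE5\xA4\x8D";             // U+56DE U+590D in UTF-8
$haystack = "\xE4\xBD\xA0\xE5\xA5\xBD" . $needle;   // two characters, then the needle

$result = find_needle($haystack, $needle);
// 'offset' is 2 with mbstring, 6 without; false in either case if not found.
echo $result['offset'], " ", $result['unit'], "\n";
```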