strpos searching for unicode in PHP (and handling inline UTF-8)

https://stackoverflow.com/questions/3545807

30-09-2019

Question

I am having a problem with a simple search for a two-character Unicode string (the needle) inside another string (the haystack) that may or may not be UTF-8.

Part of the problem is I don't know how to specify the code for use in strpos, and I don't know if PHP has to be compiled with any special support for the code, or if I have to use mb_strpos which I am trying to avoid since it also might not be available.

For example, the needle is U+56DE U+590D (without the space).

With preg_match it might be preg_match("@\x{56DE}\x{590D}@",$haystack), but that actually requires the u modifier (@...@u), which might not be available, and I get "Compilation failed: character value in \x{...} sequence is too large" anyway.

I don't want to use preg_match anyway as it might be significantly slower than strpos (there are other sequences that have to be searched).

Can I convert U+56DE U+590D into its single byte sequence (possibly 5-6 characters) and then search for it via strpos? I can't figure out how to convert it to bytes if so.

How do you specify Unicode inline in PHP anyway? I mean outside of PCRE?

$blah="\u56DE\u590D"; doesn't work?

Thanks for any ideas!

Solution

First, your question is poorly structured. It asks several questions at several points. You would probably get more answers if you used a clearer structure: 1) describe the task you're trying to accomplish, 2) the limitations/requirements, 3) the strategy you considered, 4) the difficulties you found with that strategy and whether there is a better one.

That said, I'll start from the end:

$blah="\u56DE\u590D"; doesn't work?

No. The language doesn't know anything about Unicode. In PHP, strings are byte arrays. Therefore, how you express Unicode code points in a PHP script depends on the encoding you want to use. For UTF-8, it would be "\xE5\x9B\x9E\xE5\xA4\x8D", for UTF-16 big endian it would be "\x56\xDE\x59\x0D", and so on.
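As a minimal sketch of the byte-escape approach, assuming the haystack is UTF-8 (the variable names and the sample haystack are made up for illustration):

```php
<?php
// U+56DE U+590D written out as raw UTF-8 bytes.
$needle = "\xE5\x9B\x9E\xE5\xA4\x8D";

// Sample haystack (an assumption for this example):
// "abc", then the same two characters in UTF-8, then "def".
$haystack = "abc" . "\xE5\x9B\x9E\xE5\xA4\x8D" . "def";

// strpos compares bytes, so it needs no Unicode support.
$pos = strpos($haystack, $needle);

var_dump($pos);            // int(3): a byte offset
var_dump(strlen($needle)); // int(6): two characters, three bytes each
```

On PHP 7.0 and later, the braced escape "\u{56DE}\u{590D}" in a double-quoted string yields the same UTF-8 bytes; the unbraced form from the question still does not work.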

Can I convert U+56DE U+590D into its single byte sequence (possibly 5-6 characters) and then search for it via strpos? I can't figure out how to convert it to bytes if so.

For the first part, yes, but converting U+56DE U+590D into bytes needs a clarification first. Are these UTF-16 code units or Unicode code points? For instance, how is 𪛖 represented: as U+D869 U+DED6 or as U+2A6D6? If they are UTF-16 code units, it's trivial to encode them into UTF-16; for UTF-16 big endian, it's just "\x56\xDE\x59\x0D". If they are code points, it's still trivial to encode them in UTF-32, but it takes a little more work to do the same in UTF-16 (or UTF-8).
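The "little more work" for UTF-8 could look roughly like the sketch below. Both function names are hypothetical helpers, not built-ins, and the code does no validation (it assumes a code point no larger than U+10FFFF and a well-formed surrogate pair):

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// No range checking; assumes 0 <= $cp <= 0x10FFFF.
function codepoint_to_utf8($cp)
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
// Assumes $high is in D800..DBFF and $low is in DC00..DFFF.
function surrogates_to_codepoint($high, $low)
{
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

// The needle from the question, treated as two code points.
$needle = codepoint_to_utf8(0x56DE) . codepoint_to_utf8(0x590D);
echo bin2hex($needle), "\n"; // e59b9ee5a48d

// The surrogate pair from the example above.
$cp = surrogates_to_codepoint(0xD869, 0xDED6);
printf("U+%X\n", $cp);                      // U+2A6D6
echo bin2hex(codepoint_to_utf8($cp)), "\n"; // f0aa9b96
```

The resulting $needle is a plain byte string, so it can be passed straight to strpos.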

For the second part, keep reading.

Part of the problem is I don't know how to specify the code for use in strpos, and I don't know if PHP has to be compiled with any special support for the code, or if I have to use mb_strpos which I am trying to avoid since it also might not be available.

What are you trying to do? Why do you need to find a position in a string? strpos will give you a byte offset for a given string (again, interpreted in binary form). Are you trying to clip a string? strpos (or even mb_strpos) means trouble with Unicode: a glyph can be made up of several code units (or, with mb_strpos, several code points), so you risk clipping part of a glyph. I can't advise you further unless you tell us what you're trying to do.
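To illustrate the byte-offset point, here is a small sketch with a made-up haystack, assuming valid UTF-8 input. It shows the offset strpos reports, how that differs from the character count, and what clipping at an arbitrary byte can do:

```php
<?php
// Hypothetical sample: two 3-byte characters followed by ASCII text.
$haystack = "\xE5\x9B\x9E\xE5\xA4\x8D" . "ok";

$bytePos = strpos($haystack, "ok");
var_dump($bytePos); // int(6), although only two characters precede "ok"

// Count the characters before the match by skipping UTF-8 continuation
// bytes (bit pattern 10xxxxxx). This assumes the haystack is valid UTF-8.
$charPos = 0;
for ($i = 0; $i < $bytePos; $i++) {
    if ((ord($haystack[$i]) & 0xC0) !== 0x80) {
        $charPos++;
    }
}
var_dump($charPos); // int(2)

// Cutting at an arbitrary byte offset can split a character.
$clipped = substr($haystack, 0, 4);
echo bin2hex($clipped), "\n"; // e59b9ee5: the trailing e5 is an incomplete sequence
```

If the goal is only to test whether the needle occurs, the byte offset never needs to be interpreted and none of this matters.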

OTHER TIPS

You wrote 'might not be available'. I suggest you try mb_strpos.
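One way to try it without depending on it is to check for the function at run time. The wrapper below is a hypothetical sketch, assuming both strings are UTF-8. It returns only a boolean, because mb_strpos counts characters while strpos counts bytes, so their offsets are not interchangeable:

```php
<?php
// Hypothetical wrapper: report whether $needle occurs in $haystack,
// using mb_strpos when the mbstring extension is loaded.
function utf8_contains($haystack, $needle)
{
    if (function_exists('mb_strpos')) {
        return mb_strpos($haystack, $needle, 0, 'UTF-8') !== false;
    }
    // Byte-wise fallback; assumes both strings are UTF-8.
    return strpos($haystack, $needle) !== false;
}

$needle = "\xE5\x9B\x9E\xE5\xA4\x8D"; // U+56DE U+590D as UTF-8 bytes

var_dump(utf8_contains("Re: " . $needle, $needle)); // bool(true)
var_dump(utf8_contains("Re: hello", $needle));      // bool(false)
```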

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow