Determining and removing invisible characters from a string in PHP (%E2%80%8E)

https://stackoverflow.com/questions/23130740

05-07-2023
Question

I have strings in PHP which I read from a database. The strings are URLs and at first glance they look good, but there seems to be some weird character at the end. In the address bar of the browser, the string '%E2%80%8E' gets appended to the URL, which breaks the URL.

I found this post on stripping the left-to-right-mark from a string in PHP and it seems related to my problem, but the solution does not work for me because my characters seem to be something else.

So how can I determine which character I have so I can remove it from the strings?

(I would post one of the URLs here as an example, but the Stack Overflow form strips the character at the end as soon as I paste it in.)

I know that I could only allow certain chars in the string and discard all others. But I would still like to know what char it is -- and how it gets into the database.

EDIT: The question has been answered and the code given in the accepted answer works for me:

$str = preg_replace('/\p{C}+/u', "", $str);

Solution

If the input is UTF-8 encoded, you can use a Unicode-aware regex to match and strip invisible control and format characters such as e2808e (the left-to-right mark, U+200E). Use the u (PCRE_UTF8) modifier together with \p{C} or \p{Other}.

Strip out all invisibles:

$str = preg_replace('/\p{C}+/u', "", $str);

Here is a list of \p{Other}
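Since \p{C} also covers ordinary control characters such as newlines and tabs, a narrower pattern may be preferable when the string is more than a single-line URL. The following is a minimal sketch, assuming that only format characters (category Cf, which includes U+200E, U+200F and U+200B) should be removed. The function name strip_format_chars is hypothetical, and falling back to the original string on a regex error is an assumption, not part of the answer above.

```php
<?php
// Hypothetical helper: removes only Unicode format characters (category Cf),
// leaving newlines, tabs and other control characters untouched.
function strip_format_chars(string $str): string
{
    $clean = preg_replace('/\p{Cf}+/u', '', $str);

    // With the u modifier, preg_replace() returns null if the input is not
    // valid UTF-8. Assumption: in that case we hand back the input unchanged.
    return $clean ?? $str;
}

// Example: a URL with a trailing left-to-right mark (bytes E2 80 8E).
$url = "http://example.com/page\xE2\x80\x8E";
echo strip_format_chars($url), "\n";
```

Whether to strip all of \p{C} or only \p{Cf} depends on the data. For URLs read from a database, either should remove the trailing mark.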

Detect/identify invisibles:

$str = ".\xE2\x80\x8E.\xE2\x80\x8B.\xE2\x80\x8F";

// get invisibles + byte offset of each match
if (preg_match_all('/\p{C}/u', $str, $out, PREG_OFFSET_CAPTURE))
{
  echo "<pre>\n";
  // each match is [matched bytes, byte offset]
  foreach ($out[0] as $v) {
    echo "detected ".bin2hex($v[0])." @ offset ".$v[1]."\n";
  }
  echo "</pre>";
}

outputs:

detected e2808e @ offset 1
detected e2808b @ offset 5
detected e2808f @ offset 9

To identify a character, search for its hex bytes on Google, restricted to a site such as fileformat.info:

@google: site:fileformat.info e2808e
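A code point in U+XXXX form is often easier to look up than the raw bytes. Here is a rough sketch that reports each \p{C} match that way. It assumes valid UTF-8 input and the mbstring extension (mb_ord, available since PHP 7.2). The function name describe_invisibles is made up for illustration.

```php
<?php
// Hypothetical helper: lists each \p{C} match as "U+XXXX @ offset N",
// where N is the byte offset of the match within the string.
function describe_invisibles(string $str): array
{
    $found = [];
    if (preg_match_all('/\p{C}/u', $str, $out, PREG_OFFSET_CAPTURE)) {
        // Each entry in $out[0] is [matched bytes, byte offset].
        foreach ($out[0] as [$char, $offset]) {
            $found[] = sprintf('U+%04X @ offset %d', mb_ord($char, 'UTF-8'), $offset);
        }
    }
    return $found;
}

print_r(describe_invisibles("http://example.com/\xE2\x80\x8E"));
```

For the bytes e2808e this reports U+200E, the left-to-right mark. The same three bytes are what the browser shows percent-encoded as %E2%80%8E, which is also what rawurlencode("\xE2\x80\x8E") returns.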

Licensed under: CC-BY-SA with attribution

Not affiliated with Stack Overflow