list of garbage characters like â€™

https://stackoverflow.com/questions/12024097

27-06-2021
|

Question

I am using librets to retrieve data form my RETS Server. Somehow librets Encoding method is not working and I am receiving some weird characters in my output. I noticed characters like '’' is replaced with â€™. I am unable to find a fix for librets so i decided to replace such garbage characeters with actual values after downloading data. What I need is a list of such garbage string and their equivalent characters. I googled for this but not found any resource. Can anyone point me to the list of such garbage letters and their actual values or a piece of code which can generate such letter.

thanx

Solution

Search for the term "UTF-8", because that's what you're seeing.

UTF-8 is a way of representing Unicode characters as a sequence of bytes. ("Unicode characters" are the full range of letters and symbols used all in human languages.) Typically, one Unicode character becomes 1, 2, or 3 bytes in UTF-8. When those bytes (numbers from 0 to 255) are displayed using the character set normally used by Windows, they appear as "garbage" -- in this case, 3 "garbage letters" which are really the 3 bytes of a UTF-8 encoding.

In your example, you started with the smart quote character ’. Its representation in Unicode is the number 8217, or U+2019 (2019 is the hexadecimal for 8217). (Search for "Unicode" for a complete list of Unicode characters and their numbers.) The UTF-8 representation of the number 8217 is the three byte sequence 226, 128, 153. And when you display those three bytes as characters, using the Windows "CP-1252" character encoding (the ordinary way of displaying text on Windows in the USA), they appear as â€™. (Search for "CP-1252" to see a table of bytes and characters.)

I don't have any list for you. But you could make one if you wrote a program in a language that has built-in support for Unicode and UTF-8. All I can do is explain what you are seeing.

If there is a way to tell librets to use UTF-8 when downloading, that might automatically solve your problem. I don't know anything about librets, but now that you know the term "UTF-8" you might be able to make progress.

OTHER TIPS

Question reminder:

"...I noticed characters like '’' is replaced with â€™... i decided to replace such garbage characeters with actual values after downloading data. What I need is a list of such garbage string and their equivalent characters."

Strictly dealing with this part:

"What I need is a list of such garbage string and their equivalent characters."

Using php, you can generate these characters and their equivalence. Working with all 1,111,998 Unicode points or 109,449 Utf8 symbols is impractical. You may use the ASCII range in the following loop between &#128 and &#258 or another range that is more relevant to your context.

<?php
  for ($i=128; $i<258; $i++)
    $tmp1 .= "<tr><td>".htmlentities("&#$i;")."</td><td>".html_entity_decode("&#".$i.";",ENT_NOQUOTES,"utf-8")."</td><td>&#".$i.";</td></tr>";

  echo "<table border=1>
    <tr><td>&#</td><td>&quot;Garbage&quot;</td><td>symbol</td></tr>";
    echo $tmp1;
  echo "</table>";
?>

From experience, in an ASCII context, most "garbage" symbols originate in the range &#128 to &#257 + (seldom) &#8129 to &#8246.

In order for the "garbage" symbols to display, the html page charset must be set to iso-1 or whichever other charset that caused the problem in the first place. They will not show if the charset is set to utf-8.

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

"i decided to replace such garbage characeters with actual values after downloading data"

You CANNOT undo the "garbage" with php utf8_decode(), which would actually create more "garbage" on already "garbage". But, you may use the simple and fast search and replace php str_replace() function.

First, generate 2 arrays for each set of "garbage" symbols you wish to replace. The first array is the Search term:

<?php
  //ISO 8859-1 (Latin-1) special chars are found in the range 128 to 257
  $tmp1 = "\$SearchArr = array(";
  for ($i=128; $i<258; $i++)
    $tmp1 .= "\"".html_entity_decode("&#".$i.";",ENT_NOQUOTES,"utf-8")."\", ";
  $tmp1 = substr($tmp1,0,strlen($tmp1)-2);//erases last comma
  $tmp1 .= ");";
  $tmp1 = htmlentities($tmp1,ENT_NOQUOTES,"utf-8");
?>

The second array is the replace term:

<?php
  //Adapt for your relevant range.
  $tmp2 = "\$ReplaceArr = array(\n";
  for ($i=128; $i<258; $i++)
    $tmp2 .= "\"&#".$i.";\", ";
  $tmp2 = substr($tmp2,0,strlen($tmp2)-2);//erases last comma
  $tmp2 .= ");";

  echo $tmp1."\n<br><br>\n";
  echo $tmp2."\n";
?>

Now, you've got 2 arrays that you can copy and paste to use and reuse to clean any of your infected strings like this:

$InfectedString = str_replace($SearchArr,$ReplaceArr,$InfectedString);

Note: utf8_decode() is of no help for cleaning up "garbage" symbols. But, it can be used to prevent further contamination. Alternatively a mb_ function can be useful.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow