Question

I am building a table, with content pulled from other elements in the page (page scraping).

I am using innerText or textContent to pull the text, then a regular expression to trim it:

string.replace(/^\s+|\s+$/g,"");

This works fine in IE 9 and Chrome, but in IE 8 I am getting a garbage character that I cannot identify. I was able to reproduce the behavior with alerts in jsfiddle:

http://jsfiddle.net/Te4FQ/

What is this extra character, and how can I get rid of it?

Update: thanks for the helpful replies! It seems that the character in question is U+200E (left-to-right mark). So the second part of my question remains: how can I remove such characters with a regular expression and keep only the regular text?

Solution

Both the "At Risk" and "Complete" <th> tags in your jsFiddle snippet have a U+200E (Left-to-Right Mark, aka LRM) code point at the end of their content. That is a format character, not a whitespace character, so \s does not match it and your trim regex leaves it in place.

One way to get rid of this character is to use the XRegExp library, so that you can replace all matches of \p{C} with the empty string (i.e., delete them). \p{C} matches any code point in Unicode's "Other" category, which includes control, format, private use, surrogate, and unassigned code points. U+200E, specifically, is within the \p{Cf} "Other, Format" subcategory.
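If pulling in a library is not an option, a narrower approach is to list the invisible format characters explicitly. The following is only a minimal sketch, not a replacement for \p{C}: the function name stripInvisible is hypothetical, and the character class is an assumed selection of common zero-width and bidirectional marks (U+200B to U+200F, U+202A to U+202E, U+2060, U+FEFF), not the full "Other" category.

```javascript
// Hypothetical helper: removes a hand-picked set of invisible format
// characters, then trims leading and trailing whitespace.
// The character class is an assumption and covers only:
//   U+200B-U+200F  zero-width space/joiners, LRM, RLM
//   U+202A-U+202E  bidirectional embedding and override controls
//   U+2060         word joiner
//   U+FEFF         byte order mark / zero-width no-break space
function stripInvisible(text) {
  return text
    .replace(/[\u200B-\u200F\u202A-\u202E\u2060\uFEFF]/g, "")
    .replace(/^\s+|\s+$/g, "");
}

// Example with the header text from the question, written with an
// explicit \u200E escape so the hidden character is visible in source.
var cleaned = stripInvisible("  At Risk\u200E ");
```

Because the class uses only \uXXXX escapes and the same trim regex as the question, it should also run in older engines such as IE 8. Note that removing U+200C and U+200D everywhere can change how text renders in some scripts, so trim the list if the scraped content needs them.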

OTHER TIPS

Try printing to the page the result of

escape(string.replace(/^\s+|\s+$/g,""));

Your garbage character should show up as an escape code (a U+200E character, for example, appears as %u200E).
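Since escape() is generally treated as a legacy function, the same diagnostic can be sketched with charCodeAt. This is an illustrative helper only: the name revealHidden and the choice to escape everything outside printable ASCII are assumptions, not part of the original answer.

```javascript
// Hypothetical helper: returns the text with every character outside
// printable ASCII (0x20-0x7E) replaced by a visible \uXXXX escape.
function revealHidden(text) {
  var out = "";
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    if (code < 0x20 || code > 0x7E) {
      var hex = code.toString(16).toUpperCase();
      // Pad to four hex digits by hand so this also works in old engines.
      while (hex.length < 4) {
        hex = "0" + hex;
      }
      out += "\\u" + hex;
    } else {
      out += text.charAt(i);
    }
  }
  return out;
}

// The trimmed header text from the question still ends in an LRM.
var trimmed = "At Risk\u200E".replace(/^\s+|\s+$/g, "");
var revealed = revealHidden(trimmed); // "At Risk" followed by a literal \u200E
```

Printing revealed to the page or an alert makes the trailing character readable, which is enough to look it up by code point.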

Licensed under: CC-BY-SA with attribution