Why am I getting "�" characters?

https://stackoverflow.com/questions/17540480

02-06-2022
|

Question

I've written a quick-and-dirty utility to parse a text file, but in some cases it's writing out a "�" character. My utility reads from a .txt file which contains "records" in this format:

Biography
Title:George F. Kennan: An American Life 
Author:John Lewis Gaddis
Kindle: B0054TVO1G
Hardcover: B007R93I1U
Paperback: 0143122150
Image link: <a href="https://rads.stackoverflow.com/amzn/click/com/B0054TVO1G" rel="nofollow noreferrer"><img src="http://images.amazon.com/images/P/B0054TVO1G.01.MZZZZZZZ.jpg" alt="Book Cover" /></a>

...and writes out lines from that to a CSV file such as:

Biography,"George F. Kennan: An American Life","John Lewis Gaddis",B0054TVO1G,B007R93I1U,0143122150,<a href="https://rads.stackoverflow.com/amzn/click/com/B0054TVO1G" rel="nofollow noreferrer"><img src="http://images.amazon.com/images/P/B0054TVO1G.01.MZZZZZZZ.jpg" alt="Book Cover" /></a>

...but in several cases, as mentioned, that weird character is appending itself to an author's name. In most cases where this is happening, it's what appears to be a space character in the .txt file. I'm trimming the author's name prior to writing it out to the CSV file, so it's obviously not being seen as a space, though.

When I save the text file with these characters, I get the message about non-unicode characters, etc.

What could be the cause of that? And better yet, how can I delete them with a search and replace operation? In Notepad, they are not found, so I have to delete them one-by-one.

Prior to being in the .txt file, this data was in an Open Office/.odt file, if that means anything to anyone.

BTW, I have no idea how that "stackoverflow" got into the href above; it's not in the original text I pasted in...

UPDATE

I am curious how that character got in my files. I sure didn't put it there (deliberately), any more than I added the "stackoverflow" to the URL above. Could it be that a call to Environment.Newline would add that?

Here was my process:

1) Copy and paste info from the interwebs into an Open Office/.odt file
2) Copy and past that into a text (Notepad) file
3) Open that text file programmatically and loop through it, writing to a new "csv"/.txt file.

UPDATE 2

Silly me - all I had to do was save the file (which wouldn't save those weird characters), then open it again. IOW, when I opened it today (at home, after work) those were gone.

UPDATE 3

I wrote too soon - it replaced the weird character with a question mark (a "normal" one, not a stylized one).

Solution

They are almost certainly non-breaking spaces, U+00A0 (although there are other fixed-width space characters which are also possible.) These won't be trimmed as spaces, but will be rendered as spaces if the encoding of the file matches the encoding of the output device.

My guess is that your text file is in CP-1252 (i.e., Windows default one-byte coding) but your output is being rendered as though it were UTF-8.

Normally you would type these characters as AltGr+Space. You might try that with Notepad, but no guarantees.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow