Question

I'm trying to use Python to generate a collection of HTML tables with values pulled from a CSV. The script works fine; however, it's adding odd "¬†" characters wherever a value is pulled in.

This is the code I used to grab the CSV data:

import csv
import fileinput
import re

out=open("audiencestats.csv","rU")
data=csv.reader(out)
values =[row for row in data]
metrics = values.pop(0) 
out.close()

This creates a function to make the HTML tables:

def maketable(leftmetric, rightmetric, leftvalue, rightvalue):
  template = '''
  <table width="99%%" border="1"> 
   <tbody>
    <tr>
    <td align="center" valign="middle">
    <h3>%s</h3>
    </td>
    <td align="center" valign="middle">
    <h3>%s</h3>
    </td>
    </tr>
    <tr>
    <td align="center" valign="middle"> %s</td>
    <td align="center" valign="middle"> %s</td>
    </tr>
    </tbody>
  </table>
  '''
  file.write(template % (leftmetric, rightmetric, leftvalue, rightvalue))

Then this writes the tables to text files:

for i in values:
  filename = "%s.txt" % i[0]
  file = open(filename , 'w')
  file.write(header)
  maketable(metrics[1],metrics[2],i[1],i[2])
  maketable(metrics[3],metrics[4],i[3],i[4])
  maketable(metrics[5],metrics[6],i[5],i[6])
  maketable(metrics[7],metrics[8],i[7],i[8])
  maketable(metrics[9],metrics[10],i[9],i[10])
  maketable(metrics[11],metrics[12],i[11],i[12])
  file.write(header2)
  print makesocial(i[13],i[14],i[15])
  file.close()

I tried adding the re.sub below to the for loop, but the crosses remain.

for line in fileinput.input(inplace=1):
    line = re.sub('¬†', '', line.rstrip())
    print(line)

Am I missing something? Has my computer turned religious?

An example of the output is copied below as well:

<h1>Audience</h1>
  <table width="99%" border="1"> 
   <tbody>
    <tr>
    <td align="center" valign="middle">
    <h3>UVs (000)</h3>
    </td>
    <td align="center" valign="middle">
    <h3>PVs (000)</h3>
    </td>
    </tr>
    <tr>
    <td align="center" valign="middle"> ¬†580.705</td>
    <td align="center" valign="middle"> ¬†1003</td>
    </tr>
    </tbody>
  </table>

Solution

There's nothing wrong with your data—it's pure ASCII. The problem is in your source code.

Clicking the Edit button to see your actual source, rather than your formatted source, shows that it has non-breaking space (U+00A0) characters in the middle of the template string literal.

Assuming your editor and the browser you copied from and pasted to are doing things right, that means that your actual UTF-8 source has '\xc2\xa0' sequences.

Since you're putting non-ASCII characters into a str/bytes literal (which, as I explained in the other answer, is always a bad idea), this means your strings end up with '\xc2\xa0' sequences.

Somewhere between there and your screen, there's an additional coding problem, and this is getting garbled into '\xc2\xac\xe2\x80\xa0' sequences—which, when interpreted as UTF-8, show up as u'¬†'.

We could try to track down where that additional problem is coming from, but it doesn't matter too much.
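That said, one plausible culprit (my guess, not confirmed by the question) is Mac Roman: in that encoding, the byte 0xC2 is '¬' and 0xA0 is '†', so reading the UTF-8 bytes of a non-breaking space as Mac Roman and re-encoding as UTF-8 reproduces the exact garbage above. A minimal sketch:

```python
# -*- coding: utf-8 -*-
# Hypothetical reproduction of the mojibake chain: the UTF-8 bytes of a
# non-breaking space, misread as Mac Roman, then re-encoded as UTF-8.
nbsp_utf8 = b'\xc2\xa0'                  # U+00A0 encoded as UTF-8
garbled = nbsp_utf8.decode('mac_roman')  # u'\u00ac\u2020', i.e. '¬†'
print(repr(garbled.encode('utf-8')))     # b'\xc2\xac\xe2\x80\xa0'
```

If that assumption holds, something in the chain (a Mac text editor or clipboard, perhaps) treated the file as Mac Roman at some point.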

The immediate fix is to replace all the non-breaking spaces in your source with plain ASCII spaces.

Going beyond that, you need to figure out what you were using that generated these non-breaking spaces. Often, this is a sign of editing source code in word processors rather than text editors; if so, stop doing that.

If you don't actually have any intentionally-non-ASCII source code, using # coding=ascii instead of # coding=utf-8 at the top of your file is a great way to catch bugs like this. (You can still process UTF-8 values; all the coding declaration says is that the source code itself is in UTF-8.)
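As a rough sketch of why the # coding=ascii declaration catches this (simulated here with compile(), since the error normally fires when the module is imported; the file name is made up):

```python
# A source file declaring coding=ascii but containing raw non-ASCII bytes
# (here, a UTF-8 non-breaking space inside a string literal) is rejected
# by the compiler instead of silently producing a garbage string.
src = b"# coding=ascii\ntemplate = '\xc2\xa0'\n"
try:
    compile(src, '<fake_module>', 'exec')
except SyntaxError as e:
    print('caught:', e)
```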

Other hints

You still haven't answered the questions I asked to clarify this, so I'm going to take a guess here.

First, the reason your re.sub doesn't work is that your pattern is a UTF-8 ¬† ('\xc2\xac\xe2\x80\xa0'), but you're trying to match a cp1252 ¬† ('\xac\x86'). Obviously, those don't match.
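To make the mismatch concrete, here's the same two-character pair encoded both ways (a sketch; 'utf-8' and 'cp1252' are standard Python codec names):

```python
# The same '¬†' pair, encoded two different ways, shares no byte sequence
# in common, so a UTF-8 pattern can never match cp1252 data.
pair = u'\u00ac\u2020'               # '¬†'
print(repr(pair.encode('utf-8')))    # b'\xc2\xac\xe2\x80\xa0'
print(repr(pair.encode('cp1252')))   # b'\xac\x86'
```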

Second, the reason you're getting that garbage in the first place is that your CSV file is being processed by something that's not using UTF-8, even though you think it is. Maybe it's your spreadsheet program, or a text editor, or a command-line tool.

Most likely, you've just mixed up one 8-bit encoding with another at some step on the chain—written out some text as cp1252, then tried to edit it as UTF-8, or vice-versa.

But that † is pretty interesting. That's U+2020. If you have some UTF-16-LE text, edit it as if it were UTF-8 (or ASCII or cp1252), and add in a pair of spaces, you're actually adding in a single U+2020 †. Normally, you'd think it would be hard to mix up UTF-16 and UTF-8. But clearly you're just eyeballing the text instead of actually looking at the bytes, and if all of your data fits within Latin-1, UTF-16 will look perfectly fine to your eyeball: sure, there's an invisible NUL character after each real character, but you can't see invisible things.
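You can see the space/dagger confusion directly: the bytes 0x20 0x20 are two ASCII spaces, but read as UTF-16-LE they are one dagger.

```python
# Bytes 0x20 0x20: two spaces in ASCII, a single U+2020 dagger in UTF-16-LE.
two_spaces = b'\x20\x20'
print(repr(two_spaces.decode('utf-16-le')))  # u'\u2020', i.e. '†'
```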

Anyway, it doesn't matter what the exact details are. The only way to fix this is to look at the actual bytes in the file generated at each step on the chain, find out where you're doing it wrong, and fix it appropriately. If you don't know how to do any part of that, you need to give other people enough information to do it for you.

However, if you just want a quick workaround: take the file that you're feeding into your Python script and view it in a hex editor. Find the two garbage characters and record what bytes they are. If they're, say, ac 86, just change your code to do s = s.replace('\xac\x86', '').
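If you don't have a hex editor handy, a few lines of Python will do the same job. This sketch writes a hypothetical sample file standing in for your CSV (the '\xac\x86' contents are my assumption about what your data looks like) and dumps its bytes as hex:

```python
import binascii

# Hypothetical sample standing in for audiencestats.csv: a cp1252-encoded
# '¬†' (bytes ac 86) sitting in front of a value.
sample = b'UVs (000),\xac\x86580.705\n'
with open('sample.csv', 'wb') as f:
    f.write(sample)

# Dump the raw bytes so you can see exactly what's there, rather than
# eyeballing the decoded text.
with open('sample.csv', 'rb') as f:
    chunk = f.read(64)
print(binascii.hexlify(chunk))  # look for ac86 in the output
```

Run the same dump on your real file and search the hex for the bytes at the position of the garbage characters.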

Try this:

line = re.sub(u'(?u)¬†', '', line.rstrip())

Then the regex treats your string as Unicode.

License: CC-BY-SA with attribution
Not affiliated with Stack Overflow