Avoid a non-printable character in an HTML file written by Python

https://stackoverflow.com/questions/16538883

29-05-2022

Question

I'm trying to convert SPSS syntax files to readable HTML. It's working almost perfectly, except that a single non-printable character is inserted into the HTML file. It doesn't seem to have an ASCII code and looks like a tiny dot, and it's causing trouble.

It occurs only in the second line of the HTML file, which always corresponds to the first line of the original file. That probably hints at which line(s) of Python cause the problem (please see the comments in the code).

The code which seems to cause this is:

    rfil = open(fil,"r") #rfil =  Read File, original syntax
    wfil = open(txtFil,"w") #wfil =  Write File, HTML output
    #Line below causes problem??
    wfil.write("<ol class='code'>\n<li>") 
    cnt = 0
    for line in rfil:
        if cnt == 0:
            #Line below causes problem??
            wfil.write(line.rstrip("\n").replace("'",'&#39;').replace('"','&#34;')) 
        elif len(line) > 1:
            wfil.write("</li>\n<li>" + line.strip("\n").replace("'",'&#39;').replace('"','&#34;'))
        else:
            wfil.write("<br /><br />")
        cnt += 1
    wfil.write("</li>\n</ol>")
    wfil.close()
    rfil.close()

Screenshot of the result (image not included).

Solution

The input file seems to begin with a byte order mark (BOM), to indicate UTF-8 encoding. You can decode the file to Unicode strings by opening it with

    import codecs
    rfil = codecs.open(fil, "r", "utf_8_sig")

The utf_8_sig encoding skips the BOM at the beginning of the file, if one is present.
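As a small illustration, here is a sketch in Python 3, where the built-in open() accepts an encoding argument (the answer itself uses codecs.open, as was usual in Python 2). It writes a throwaway file that starts with a BOM and reads the first line back with both codecs. The file suffix and the sample SPSS line are made up for the demonstration.

```python
import codecs
import os
import tempfile

# Hypothetical sample: one line of syntax preceded by a UTF-8 BOM.
sample = codecs.BOM_UTF8 + b"FREQUENCIES VARIABLES=age.\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".sps") as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Plain utf-8 keeps the BOM as the character U+FEFF.
    with open(path, "r", encoding="utf-8") as f:
        with_bom = f.readline()
    # utf-8-sig drops a leading BOM if one is present.
    with open(path, "r", encoding="utf-8-sig") as f:
        without_bom = f.readline()
finally:
    os.remove(path)

print(repr(with_bom))
print(repr(without_bom))
```

The first line read with utf-8 still starts with U+FEFF, which is the "tiny dot" that ends up in the HTML; the line read with utf-8-sig does not.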

Some programs recognize the BOM, some don't. To write the file out without BOM, use

    wfil = codecs.open(txtFil, "w", "utf_8")
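Putting both calls together, the conversion loop from the question could look roughly like the sketch below. It assumes Python 3 and UTF-8 input. The function name syntax_to_html and the demo file names are hypothetical. It also swaps the chained replace() calls for html.escape, which escapes &, < and > in addition to the quotes, so the entity spellings differ slightly from the original.

```python
import html
import os
import tempfile


def syntax_to_html(src_path, dst_path):
    """Convert a syntax file to an ordered HTML list, one <li> per line."""
    # utf-8-sig strips a leading BOM on read; utf-8 writes none.
    with open(src_path, "r", encoding="utf-8-sig") as rfil, \
            open(dst_path, "w", encoding="utf-8") as wfil:
        wfil.write("<ol class='code'>\n<li>")
        for cnt, line in enumerate(rfil):
            text = html.escape(line.rstrip("\n"), quote=True)
            if cnt == 0:
                wfil.write(text)                  # first item is already open
            elif text:
                wfil.write("</li>\n<li>" + text)  # start a new list item
            else:
                wfil.write("<br /><br />")        # blank line in the source
        wfil.write("</li>\n</ol>")


# Demo with a throwaway file that starts with a BOM (made-up content).
with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "example.sps")
    dst = os.path.join(workdir, "example.html")
    with open(src, "wb") as f:
        f.write(b"\xef\xbb\xbfCOMPUTE x = 1.\n\nEXECUTE.\n")
    syntax_to_html(src, dst)
    with open(dst, "rb") as f:
        output = f.read()

print(output.decode("utf-8"))
```

Reading the result back in binary mode makes it easy to check that the bytes \xef\xbb\xbf no longer appear anywhere in the output.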

Other suggestions

What you see is a byte order mark (BOM). The way it shows up, as the bytes \xef\xbb\xbf, indicates that the strings you are working with are actually UTF-8 encoded byte strings; you can decode them into proper Unicode (line.decode('utf-8')) to make manipulation easier.

Then you can augment the logic for the first line so that it safely removes the BOM:

    for raw_line in rfil:
        line = raw_line.decode('utf-8')  # Python 2: byte string -> unicode
        # The u prefix is needed in Python 2 so that \ufeff is read as one character
        if cnt == 0 and line.startswith(u'\ufeff'):
            line = line[1:]  # cut the first character, which is the BOM
        ...
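The snippet above relies on Python 2 string semantics. As a rough Python 3 counterpart, the BOM can also be removed at the byte level before decoding, for example when the file is opened in binary mode. This is only a sketch: the helper name strip_utf8_bom and the sample line are made up, while codecs.BOM_UTF8 is the standard-library constant for the three BOM bytes.

```python
import codecs


def strip_utf8_bom(raw):
    """Return raw bytes without a leading UTF-8 BOM, if there is one."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# Hypothetical first line of a syntax file, as read in binary mode.
raw_first_line = b"\xef\xbb\xbfGET FILE='data.sav'.\n"
first_line = strip_utf8_bom(raw_first_line).decode("utf-8")
print(repr(first_line))
```

Input without a BOM passes through unchanged, so the helper can be applied to the first line unconditionally.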

Licensed under: CC-BY-SA with attribution
