python, xmlrpc, tidy & unicode issues [closed]

https://stackoverflow.com/questions/10710915

10-06-2021

Question

I've been trying to work around an issue I'm facing for two days now.

The final goal is to migrate the content of an Apple wiki server to Foswiki/TWiki markup.

I found an XSLT stylesheet that does most of the work, and does it reasonably well and fast. All I need to do to use it is feed it well-formed (X)HTML, which is where tidy comes in: the "content" string of the Apple wiki data structure contains lots of HTML tags, but the markup is incomplete.

With XML-RPC introspection and a few hints strewn about the Apple forums, the undocumented Apple API becomes almost usable.
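For readers unfamiliar with XML-RPC introspection, here is a minimal sketch of the idea. It assumes the server implements the conventional system.listMethods call; whether the Apple wiki server exposes exactly this is not stated above. The local server, the helper name list_remote_methods and the dummy groupsForSession handler are hypothetical stand-ins so the example runs on its own. The module names are the Python 3 ones; on Python 2 they are xmlrpclib and SimpleXMLRPCServer.

```python
import threading
from xmlrpc.client import ServerProxy          # xmlrpclib.ServerProxy on Python 2
from xmlrpc.server import SimpleXMLRPCServer   # SimpleXMLRPCServer module on Python 2


def list_remote_methods(url):
    """Return the method names the server advertises via system.listMethods."""
    proxy = ServerProxy(url)
    return proxy.system.listMethods()


# Hypothetical local stand-in server, so the sketch does not need a wiki server.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_introspection_functions()
# Dummy handler borrowing a method name from the snippet in this question.
server.register_function(lambda session_id: ["/groups/demo"], "groupsForSession")
threading.Thread(target=server.serve_forever, daemon=True).start()

host, port = server.server_address
methods = list_remote_methods("http://%s:%d/" % (host, port))
print(methods)

server.shutdown()
server.server_close()
```

Against a real server, only list_remote_methods and the server's URL would be needed; system.methodHelp can then be tried on each name, if the server supports it.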

Trying to use tidy now gives me:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 121: ordinal not in range(128)

Obviously I searched for this error message and found several articles, including some here on Stack Overflow, but they seem to suggest that it's an encoding problem with the terminal I'm using. However, LANG=en_US.UTF-8 is set here, so that can't be the cause of my problem.

I found an article that suggested stripping the BOM, but doing so produced a new error message that made just as little sense to me:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

Here's the relevant code-snippet:

pages = {}

paths = s.groupsForSession(session_id) # paths is a list of groups that user can read on that server
for aPath in paths:
  entries = s.wiki.getEntries(session_id, aPath)
  # entries = s.search.getEntries(session_id, aPath)
  pprint.pprint(entries)

  for uid in entries:
    try:
      entry = s.wiki.getEntryWithUID(session_id, uid['uid'])
    except Exception, e:
      print e.faultString
      raise  # re-raise the original error instead of a bare Exception
    pages[uid['uid']] = entry
    pprint.pprint(  pages[uid['uid']]['content'])
    print(
      tidy.parseString(
        str(
          unicode(
              pages[uid['uid']]['content'].strip(codecs.BOM_UTF8), 'utf-8'
          )
        ),
        **options
        )
      )

Solution

As suggested by @oefe:

A few more experiments later I'm getting what I want. It looks like the messages about encoding issues had me barking up the wrong tree; the solution to the problem was quite simple: encode the unicode content explicitly as UTF-8 before handing it to tidy.

tidy.parseString( str( pages[uid['uid']]['content'].encode('utf-8') ), **options )
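To illustrate why the explicit encode helps, here is a minimal sketch of what most likely happened in the original snippet. It assumes the 'content' value arrives as a unicode string; the sample text is a hypothetical stand-in. On Python 2, str() on a unicode object encodes with the ASCII codec, and calling strip(codecs.BOM_UTF8) on a unicode object first decodes the BOM bytes as ASCII. The sketch spells out both implicit conversions explicitly, so it behaves the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import codecs

# Hypothetical stand-in for the wiki 'content' value (a unicode string
# containing an en dash, U+2013, like the one in the first error message).
content = u"2010 \u2013 2012"

# What Python 2's str(content) does implicitly: encode with the ASCII codec.
try:
    content.encode("ascii")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

# What Python 2 does implicitly when a unicode string is combined with the
# byte string codecs.BOM_UTF8: decode those bytes with the ASCII codec.
try:
    codecs.BOM_UTF8.decode("ascii")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# Encoding explicitly as UTF-8 succeeds and yields a byte string.
utf8_bytes = content.encode("utf-8")

print(encode_error)
print(decode_error)
print(repr(utf8_bytes))
```

Since encode('utf-8') already returns a byte string on Python 2, the surrounding str() call in the one-liner above should be a no-op there.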

Licensed under: CC-BY-SA with attribution
