Here's the deal: I've written a program that finds all of the anagram classes in a dictionary file. However, I'm having trouble with accented characters. Currently my code reads them in and treats them as if they were invisible, but still prints some sort of escape code of the form '\xc3\???' at the end. I'd like to discard every word that contains an accent, but I don't know how to detect them.
Things I've tried:
- checking if the type is unicode
- using a regex to check for words containing '\xc3'
- decoding/encoding (I don't understand unicode completely but whatever I tried didn't work).
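To make the regex attempt concrete, here is a minimal sketch of what I think is happening. It uses explicit bytes literals so that it also runs on Python 3. The assumption (mine, not verified) is that the file is UTF-8 and that 'sé' arrives as the three bytes s, 0xC3, 0xA9. The names accented, literal, pattern and byte_pattern are only for illustration.

```python
import re

# Assumption: "sé" read from a UTF-8 file is the three bytes s, 0xC3, 0xA9.
accented = b's\xc3\xa9'
# The dictionary line that literally contains a backslash, 'x', 'c', '3', ...
literal = b'\\xc3\\xa9'

# Same pattern as in the program: the three plain characters x, c, 3.
pattern = re.compile(br'xc3')

print(pattern.findall(accented))  # the accented word contains no 'x' at all
print(pattern.findall(literal))   # only the typed-out text matches

# Matching the byte itself needs the escape to be part of the pattern.
byte_pattern = re.compile(b'\xc3')
print(byte_pattern.findall(accented))
```

If this is right, it would explain why 'Found' only shows up for the dictionary line where the escape is typed out literally.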
QUESTION/PROBLEM: I need a way to detect accents. My program prints the accented characters to the command line as weird '\xc3\???' sequences, but that is not how it treats them internally: I haven't been able to find any word containing '\xc3\???', even though that is exactly what gets printed.
Example: sé -> s\xc3\xa9, and sé and s are considered anagrams by my program.
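A minimal sketch of that collision, assuming the word really is the UTF-8 byte string s\xc3\xa9 from the example. It is written with bytes literals so it also runs on Python 3, where bytes.isalnum() accepts only ASCII letters and digits; I am assuming my Python 2 strings behave the same way, which would match the output I get. anagram_key is a hypothetical helper, not part of my program.

```python
def anagram_key(word):
    """Sort the ASCII-alphanumeric bytes of a word, like the key in my loop."""
    # bytes.isalnum() only accepts ASCII letters and digits, so the two
    # bytes of the encoded 'é' (0xC3 and 0xA9) are silently dropped here.
    return bytes(sorted(b for b in word if bytes([b]).isalnum()))

# Both words end up with the same key, so they land in the same group.
print(anagram_key(b's\xc3\xa9') == anagram_key(b's'))
# Ordinary anagrams still share a key as expected.
print(anagram_key(b'stop') == anagram_key(b'pots'))
```

So the accent bytes never make it into the key, but the original word (accent bytes included) is still what gets stored and printed.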
Test dictionary:
stop
tops
pots
hello
world
pit
tip
\xc3\xa9
sé
s
se
Output of Code:
Found
\xc3\xa9
['pit', 'tip']
['world']
['s\xc3\xa9', 's']
['\\xc3\\xa9']
['stop', 'tops', 'pots']
['se']
['hello']
Program itself:
import re
anadict = {};
for line in open('fakedic.txt'):#/usr/share/dict/words'):
    word = line.strip().lower().replace("'", "")
    line = ''.join(sorted(ch for ch in word if word if ch.isalnum()))
    if isinstance(word, unicode):
        print word
        print "UNICODE!"
    pattern = re.compile(r'xc3')
    if pattern.findall(word):
        print 'Found'
        print word
    if anadict.has_key(line):
        if not (word in anadict[line]):
            anadict[line].append(word)
    else:
        anadict[line] = [word]
for key in anadict:
    if (len(anadict[key]) >= 1):
        print anadict[key]
Help?