Detecting accents in words (Python)

https://stackoverflow.com/questions/21843971

13-10-2022

Question

Here's the dealio: I've written a program that finds all of the anagram classes in the dictionary. However, I'm having a problem dealing with accented characters. Currently my code reads them in and treats them like they're invisible, but still prints out some sort of replacement code at the end in the form of '\xc3\???'. I'd like to discard all of the words with accents, but I don't know how to detect them.

Things I've tried:

checking if the type is unicode
using a regex to check for words containing '\xc3'
decoding/encoding (I don't understand unicode completely but whatever I tried didn't work).

QUESTION/PROBLEM: I need to find out how to detect accents. My program prints them to the command line as weird '\xc3\???' sequences, but that is not how the program treats them internally: I haven't been able to find any words containing '\xc3\???', despite that being what gets printed.

Example: sé -> s\xc3\xa9, and sé and s are considered anagrams by my program.
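
For what it's worth, the '\xc3\xa9' in the output looks like the two-byte UTF-8 encoding of é, shown escaped because printing a list prints the repr of each item. The following is a minimal sketch of that idea, assuming Python 3 and a UTF-8 encoded dictionary file (the program in the question is Python 2, where the file is read as raw bytes); the variable names are illustrative only.

```python
# Assumes Python 3. The word is the one from the example above,
# written with an escape so the source file encoding does not matter.
word = "s\u00e9"  # 'sé'

# Encoding to UTF-8 gives the bytes that show up in the printed output.
encoded = word.encode("utf-8")
print(encoded)        # b's\xc3\xa9'
print(len(word))      # 2 characters
print(len(encoded))   # 3 bytes: 's' plus two bytes for 'é'

# A Python 2 byte string holds the raw bytes, so a per-character test such
# as isalnum() presumably sees '\xc3' and '\xa9' separately and rejects both.
# Dropping every non-ASCII byte leaves only 's', which would explain why
# 'sé' and 's' land in the same group.
kept = bytes(b for b in encoded if b < 128).decode("ascii")
print(kept)           # s
```

This would also explain why searching for the literal text 'xc3' finds nothing in real words: the word contains the single byte 0xC3, not the characters 'x', 'c', '3'.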

Test dictionary:

stop
tops
pots
hello
world
pit
tip
\xc3\xa9
sé
s
se

Output of Code:

Found
\xc3\xa9
['pit', 'tip']
['world']
['s\xc3\xa9', 's']
['\\xc3\\xa9']
['stop', 'tops', 'pots']
['se']
['hello']

Program itself:

import re

anadict = {};

for line in open('fakedic.txt'):#/usr/share/dict/words'):
        word = line.strip().lower().replace("'", "")
        line = ''.join(sorted(ch for ch in word if word if ch.isalnum()))
        if isinstance(word, unicode):
                print word
                print "UNICODE!"
        pattern = re.compile(r'xc3')
        if pattern.findall(word):
               print 'Found'
               print word
        if anadict.has_key(line):
                if not (word in anadict[line]):
                        anadict[line].append(word)
        else:
                anadict[line] = [word]

for key in anadict:
        if (len(anadict[key]) >= 1):
                print anadict[key]

Help?

Solution 2

I ended up using regular expressions (basically to check for everything which wasn't an alphabetic character) with:

if re.match('^[a-zA-Z_]+$', word):

This helped me strip out any word containing a backslash, a digit, or any other funky symbol. Note that the pattern also lets underscores through. Not a perfect solution, but it worked.
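
To show how that check might slot into the anagram grouping, here is a rough sketch, assuming Python 3. The function name group_anagrams and the constant PLAIN_WORD are made up for illustration; the pattern is the one from the answer above, and the word list is the test dictionary from the question.

```python
import re

# Same pattern as in the answer above: ASCII letters and underscore only.
PLAIN_WORD = re.compile(r'^[a-zA-Z_]+$')

def group_anagrams(words):
    """Group words by their sorted letters, skipping words the pattern rejects."""
    groups = {}
    for word in words:
        word = word.strip().lower().replace("'", "")
        if not PLAIN_WORD.match(word):
            continue  # accented, numeric, or otherwise non-matching word
        key = ''.join(sorted(word))
        if word not in groups.setdefault(key, []):
            groups[key].append(word)
    return groups

# Words from the test dictionary in the question. The eighth entry is the
# literal text backslash-x-c-3..., the ninth is 'sé'.
words = ["stop", "tops", "pots", "hello", "world", "pit", "tip",
         "\\xc3\\xa9", "s\u00e9", "s", "se"]
for group in group_anagrams(words).values():
    print(group)
```

With this filter in place, 'sé' and the literal '\xc3\xa9' line are skipped, so 's' ends up in a group of its own.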

Other tips

So basically scratch my answer... Just look here:

How to check if a string in Python is in ASCII?

The gist is that you can check every character to see whether its ord is less than 128; any character at 128 or above is non-ASCII, which covers accented characters. Or you can wrap an encode/decode in try/except and look for the Unicode errors that get raised when a word contains accented characters. (The latter seems to be the more efficient answer.)
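
Both approaches can be sketched roughly as follows, assuming Python 3 strings. The function names are invented for this example and are not taken from the linked question.

```python
def is_ascii_by_ord(word):
    """True if every character has a code point below 128."""
    return all(ord(ch) < 128 for ch in word)

def is_ascii_by_encoding(word):
    """True if the word can be encoded as ASCII without raising an error."""
    try:
        word.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True

# 'se' is plain ASCII; 's\u00e9' is 'sé'.
for word in ["se", "s\u00e9"]:
    print(word, is_ascii_by_ord(word), is_ascii_by_encoding(word))
```

Note that both checks flag any non-ASCII character, not accents specifically, which is fine if the goal is simply to discard such words.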

This was definitely a learning experience for me as well :) Sorry for taking so long

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow