Why does this regex return unreadable characters?

https://stackoverflow.com/questions/23538246

17-07-2023

Question

I have a list of words. I look up each word in WordNet and select its first synset. Converted to a string, this first synset displays correctly on my terminal (for example: Synset('prior.n.01')). I then run a replacement regex on that string. The expected output is 'prior.n.01', but what I get instead is square boxes with numbers in them. Since my terminal displays the string correctly before the replacement, I don't think the terminal is the problem. So, is there something wrong with this regex? Or is it because I'm using it on a string that was originally a list element?

Here's the code I'm using:

import re
import nltk
from nltk.corpus import wordnet as wn

word_list = ['prior','indication','link','linked','administered','foobar']

for word in word_list:
    synset_list = wn.synsets(word)  #returns a list of all synsets for a word

    if synset_list == []:   #break if word in list isn't in dictionary (empty list)
        break

    else:
        first_synset = str(synset_list[0])  #returns Synset('prior.n.01') as string
        print first_synset

        clean_synset = re.sub(r'Synset\((.+)\)',r'\1',first_synset) #expected output: 'prior.n.01'
        print clean_synset

Answer

There is actually a Synset.name() function to extract the synset name:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('dog')[0].name()
u'dog.n.01'

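Also there's a Synset.unicode_repr() which is useful to avoid any encoding/byte-string problems. Going back to the regex: the replacement string has to reach the regex engine as a backslash followed by 1, so write it either as a raw string (r'\1') or with the backslash escaped ('\\1'). In a plain string literal, '\1' is an octal escape for the control character U+0001, which is what a terminal shows as an unreadable box. The quotes around the name can be dropped by matching them outside the capture group: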

>>> import re
>>> x = wn.synsets('dog')[0].unicode_repr()
>>> re.sub(r'Synset\((.+)\)','\1',x)
u'\x01'
>>> re.sub(r'Synset\((.+)\)','1',x)
u'1'
>>> re.sub(r'Synset\((.+)\)','\\1',x)
u"'dog.n.01'"
>>> re.sub(r"Synset\(\'(.+)\'\)",'\\1',x)
u'dog.n.01'
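
The same comparison can be reproduced without NLTK installed. The following is only a minimal sketch in Python 3 syntax (the session above is Python 2); the hard-coded string is an assumption standing in for str(synset_list[0]), and the variable names are made up for illustration.

```python
import re

# Hard-coded stand-in for str(wn.synsets('prior')[0]); an assumption so that
# the example runs without NLTK or the WordNet corpus.
first_synset = "Synset('prior.n.01')"

# The quotes sit outside the capture group, so group 1 is the bare name.
pattern = r"Synset\('(.+)'\)"

# Plain string literal: '\1' is an octal escape for the character U+0001,
# so the whole match is replaced by that single control character.
broken = re.sub(pattern, '\1', first_synset)

# Raw string: re.sub receives a backslash followed by 1, a reference to group 1.
fixed = re.sub(pattern, r'\1', first_synset)

# Escaped backslash: the same replacement string as the raw form.
also_fixed = re.sub(pattern, '\\1', first_synset)

print(repr(broken))
print(repr(fixed))
print(repr(also_fixed))
```

Printing with repr() makes the control character visible as an escape sequence instead of a box, which is a quick way to check what a replacement actually produced.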

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow