Check if a string is a possible abbrevation for a name

https://stackoverflow.com/questions/7331462

27-10-2019
|

문제

I'm trying to develop a python algorithm to check if a string could be an abbrevation for another word. For example

fck is a match for fc kopenhavn because it matches the first characters of the word. fhk would not match.
fco should not match fc kopenhavn because no one irl would abbrevate FC Kopenhavn as FCO.
irl is a match for in real life.
ifk is a match for ifk goteborg.
aik is a match for allmanna idrottskluben.
aid is a match for allmanna idrottsklubben. This is not a real team name abbrevation, but I guess it is hard to exclude it unless you apply domain specific knowledge on how Swedish abbrevations are formed.
manu is a match for manchester united.

It is hard to describe the exact rules of the algorithm, but I hope my examples show what I'm after.

Update I made a mistake in showing the strings with the matching letters uppercased. In the real scenario, all letters are lowercase so it is not as easy as just checking which letters are uppercased.

해결책

This passes all the tests, including a few extra I created. It uses recursion. Here are the rules that I used:

The first letter of the abbreviation must match the first letter of the text
The rest of the abbreviation (the abbrev minus the first letter) must be an abbreviation for:
- the remaining words, or
- the remaining text starting from any position in the first word.

tests=(
    ('fck','fc kopenhavn',True),
    ('fco','fc kopenhavn',False),
    ('irl','in real life',True),
    ('irnl','in real life',False),    
    ('ifk','ifk gotebork',True),   
    ('ifko','ifk gotebork',False),    
    ('aik','allmanna idrottskluben',True),
    ('aid','allmanna idrottskluben',True),
    ('manu','manchester united',True), 
    ('fz','faz zoo',True), 
    ('fzz','faz zoo',True),
    ('fzzz','faz zoo',False),    
    )

def is_abbrev(abbrev, text):
    abbrev=abbrev.lower()
    text=text.lower()
    words=text.split()
    if not abbrev:
        return True
    if abbrev and not text:
        return False
    if abbrev[0]!=text[0]:
        return False
    else:
        return (is_abbrev(abbrev[1:],' '.join(words[1:])) or
                any(is_abbrev(abbrev[1:],text[i+1:])
                    for i in range(len(words[0]))))

for abbrev,text,answer in tests:
    result=is_abbrev(abbrev,text)
    print(abbrev,text,result,answer)
    assert result==answer

다른 팁

Here's a way to accomplish what you seem to want to do

import re    
def is_abbrev(abbrev, text):
    pattern = ".*".join(abbrev.lower())
    return re.match("^" + pattern, text.lower()) is not None

The caret makes sure that the first character of the abbreviation matches the first character of the word, it should be true for most abbreviations.

Edit: Your new update changed the rules a bit. By using "(|.*\s)" instead of ".*" the characters in the abbreviation will only match if they are next to each other, or if the next character appears at the start of a new word.

This will correctly match fck with FC Kopenhavn, but fco will not. However, matching aik with allmanna idrottskluben will not work, as that requires knowledge of the swedish language and is not as trivial to do.

Here's the new code with the minor modification

import re    
def is_abbrev(abbrev, text):
    pattern = "(|.*\s)".join(abbrev.lower())
    return re.match("^" + pattern, text.lower()) is not None

@Ocaso Protal said in comment how should you decide that aik is valid, but aid is not valid? and he is right.

The algo which came in my mind is to work with word threshold (number of words separated by space).

words = string.strip().split()
if len(words) > 2:
   #take first letter of every word
elif len(words) == 2:
   #take two letters from first word and one letter from other
else:
   #we have single word, take first three letter or as you like

you have to define your logic, you can't find abbreviation blindly.

Your algorithm seems simple - the abbreviation is the Concatenation of all upper case letters. so:

upper_case_letters = "QWERTYUIOPASDFGHJKLZXCVBNM"
abbrevation = ""
for letter in word_i_want_to_check:
    if letter in letters:
        abbrevation += letter
for abb in _list_of_abbrevations:
    if abb=abbrevation:
        great_success()

This might be good enough.

def is_abbrevation(abbrevation, word):
    lowword = word.lower()
    lowabbr = abbrevation.lower()

    for c in lowabbr:
        if c not in lowword:
            return False

    return True

print is_abbrevation('fck', 'FC Kopenhavn')

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow