Check if a string is a possible abbrevation for a name
-
27-10-2019 - |
문제
I'm trying to develop a python algorithm to check if a string could be an abbrevation for another word. For example
fck
is a match forfc kopenhavn
because it matches the first characters of the word.fhk
would not match.fco
should not matchfc kopenhavn
because no one irl would abbrevate FC Kopenhavn as FCO.irl
is a match forin real life
.ifk
is a match forifk goteborg
.aik
is a match forallmanna idrottskluben
.aid
is a match forallmanna idrottsklubben
. This is not a real team name abbrevation, but I guess it is hard to exclude it unless you apply domain specific knowledge on how Swedish abbrevations are formed.manu
is a match formanchester united
.
It is hard to describe the exact rules of the algorithm, but I hope my examples show what I'm after.
Update I made a mistake in showing the strings with the matching letters uppercased. In the real scenario, all letters are lowercase so it is not as easy as just checking which letters are uppercased.
해결책
This passes all the tests, including a few extra I created. It uses recursion. Here are the rules that I used:
- The first letter of the abbreviation must match the first letter of the text
The rest of the abbreviation (the abbrev minus the first letter) must be an abbreviation for:
- the remaining words, or
- the remaining text starting from any position in the first word.
tests=(
('fck','fc kopenhavn',True),
('fco','fc kopenhavn',False),
('irl','in real life',True),
('irnl','in real life',False),
('ifk','ifk gotebork',True),
('ifko','ifk gotebork',False),
('aik','allmanna idrottskluben',True),
('aid','allmanna idrottskluben',True),
('manu','manchester united',True),
('fz','faz zoo',True),
('fzz','faz zoo',True),
('fzzz','faz zoo',False),
)
def is_abbrev(abbrev, text):
abbrev=abbrev.lower()
text=text.lower()
words=text.split()
if not abbrev:
return True
if abbrev and not text:
return False
if abbrev[0]!=text[0]:
return False
else:
return (is_abbrev(abbrev[1:],' '.join(words[1:])) or
any(is_abbrev(abbrev[1:],text[i+1:])
for i in range(len(words[0]))))
for abbrev,text,answer in tests:
result=is_abbrev(abbrev,text)
print(abbrev,text,result,answer)
assert result==answer
다른 팁
Here's a way to accomplish what you seem to want to do
import re
def is_abbrev(abbrev, text):
pattern = ".*".join(abbrev.lower())
return re.match("^" + pattern, text.lower()) is not None
The caret makes sure that the first character of the abbreviation matches the first character of the word, it should be true for most abbreviations.
Edit:
Your new update changed the rules a bit. By using "(|.*\s)"
instead of ".*"
the characters in the abbreviation will only match if they are next to each other, or if the next character appears at the start of a new word.
This will correctly match fck
with FC Kopenhavn
, but fco
will not.
However, matching aik
with allmanna idrottskluben
will not work, as that requires knowledge of the swedish language and is not as trivial to do.
Here's the new code with the minor modification
import re
def is_abbrev(abbrev, text):
pattern = "(|.*\s)".join(abbrev.lower())
return re.match("^" + pattern, text.lower()) is not None
@Ocaso Protal
said in comment how should you decide that aik is valid, but aid is not valid?
and he is right.
The algo which came in my mind is to work with word threshold
(number of words separated by space).
words = string.strip().split()
if len(words) > 2:
#take first letter of every word
elif len(words) == 2:
#take two letters from first word and one letter from other
else:
#we have single word, take first three letter or as you like
you have to define your logic, you can't find abbreviation blindly.
Your algorithm seems simple - the abbreviation is the Concatenation of all upper case letters. so:
upper_case_letters = "QWERTYUIOPASDFGHJKLZXCVBNM"
abbrevation = ""
for letter in word_i_want_to_check:
if letter in letters:
abbrevation += letter
for abb in _list_of_abbrevations:
if abb=abbrevation:
great_success()
This might be good enough.
def is_abbrevation(abbrevation, word):
lowword = word.lower()
lowabbr = abbrevation.lower()
for c in lowabbr:
if c not in lowword:
return False
return True
print is_abbrevation('fck', 'FC Kopenhavn')