Question

I have data about soccer teams from three different sources. However, the team name for the same team differs in style from source to source. For example:

[Source1]             [Source2]  [Source3]
Arsenal               ARS        Arsenal
Manchester United     MNU        ManUtd
West Bromwich Albion  WBA        WestBrom

Very often I have to compare these team names (from different sources or the same one) to check whether they refer to the same team. For example:

Arsenal == ARS  : True
MNU == WBA      : False
WBA == WestBrom : True

I wanted to know if there is a neat pythonic way of achieving this.

My idea is the following: create a class Team that holds a list of tuples, each tuple containing the three matching names for one team. Instantiate an object of Team for each team name. Then override the __eq__ method for the class, where I'll do a reduce over the list of tuples to find out whether the two team names in question belong to the same tuple, which would indicate equality.

Some pseudocode:

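from functools import reduce

class Team:
  def __init__(self, teamname):
    # Each tuple holds the names the three sources use for one team.
    self.teams = [("Arsenal", "ARS", "Arsenal"),
                  ("Manchester United", "MNU", "ManUtd"),
                  ("West Bromwich Albion", "WBA", "WestBrom")]
    self.teamname = teamname

  def __eq__(self, other):
    # Fold over the tuples: True once both names appear in the same tuple.
    return reduce(
        lambda found, names: found or (self.teamname in names and other.teamname in names),
        self.teams, False)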

Thoughts?

P.S.: Please suggest a better Title for this question as I don't think I've done a good job with the same.

Edit: Expanded my suggested solution


Solution 2

You could have some kind of equivalence mapping:

equivalents = {"Arsenal": ["ARS",], 
               "Manchester United": ["MNU", "ManUtd"], ...}

And use this to process your data:

>>> name = "ManUtd"
>>> for main, equivs in equivalents.items():
...     if name == main or name in equivs:
...         name = main
...         break
...
>>> name
'Manchester United'

This allows you to easily see what you consider to be the "canonical name" for the team (i.e. the key) and other names that are considered to be the same team (i.e. the list value).
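
If the lookup runs often, the loop can be replaced by a reverse index built once from the mapping. A minimal sketch, assuming a small, complete equivalents dict for illustration; build_index and canonical_name are hypothetical helper names, not part of the answer above:

```python
# Illustrative data only; extend with your own teams.
equivalents = {
    "Arsenal": ["ARS"],
    "Manchester United": ["MNU", "ManUtd"],
    "West Bromwich Albion": ["WBA", "WestBrom"],
}


def build_index(equivalents):
    """Map every known name (canonical or alias) to its canonical name."""
    index = {}
    for main, aliases in equivalents.items():
        index[main] = main
        for alias in aliases:
            index[alias] = main
    return index


INDEX = build_index(equivalents)


def canonical_name(name):
    """Return the canonical name; raises KeyError for unknown names."""
    return INDEX[name]
```

Two names then refer to the same team when canonical_name returns the same value for both.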


If you do go down the class route, you should make the list of team tuples a class attribute:

class Team:

    TEAMS = [("Arsenal", "ARS"), ("Manchester United", "MNU", "ManUtd"), ...]

    def __init__(self, name):
        if not any(name in names for names in self.TEAMS):
            raise ValueError("Not a valid team name.")
        self.name = name

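    def __eq__(self, other):
        # Leave comparisons with non-Team objects to Python's default handling.
        if not isinstance(other, Team):
            return NotImplemented
        # Two teams are equal when both names appear in the same tuple.
        return any(self.name in names and other.name in names
                   for names in self.TEAMS)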

The output from this:

>>> mnu1 = Team("ManUtd")
>>> mnu2 = Team("MNU")
>>> mnu1 == mnu2
True

>>> ars = Team("ARS")
>>> ars == mnu1
False

>>> fail = Team("Not a name")
Traceback (most recent call last):
  File "<pyshell#49>", line 1, in <module>
    fail = Team("Not a name")
  File "<pyshell#43>", line 7, in __init__
    raise ValueError("Not a valid team name.")
ValueError: Not a valid team name.
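
One caveat with the class route: in Python 3, a class that defines __eq__ without __hash__ becomes unhashable, so its instances cannot be put in sets or used as dict keys. A possible sketch that hashes on the matching tuple follows; HashableTeam and _group are hypothetical names, and the team data is illustrative:

```python
class HashableTeam:
    # Illustrative data; each tuple lists the names used for one team.
    TEAMS = [("Arsenal", "ARS"), ("Manchester United", "MNU", "ManUtd")]

    def __init__(self, name):
        # Remember the tuple this name belongs to, or reject the name.
        for names in self.TEAMS:
            if name in names:
                self.name = name
                self._group = names
                return
        raise ValueError("Not a valid team name.")

    def __eq__(self, other):
        if not isinstance(other, HashableTeam):
            return NotImplemented
        return self._group == other._group

    def __hash__(self):
        # Equal teams share a tuple, so they also share a hash.
        return hash(self._group)
```

With this in place, a set built from "MNU", "ManUtd" and "ARS" instances collapses to two entries.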

Alternatively, just a simple function would do the same job if your Team won't have other attributes:

def equivalent(team1, team2):
    teams = [("Arsenal", "ARS"), ("Manchester United", "MNU", "ManUtd"), ...]
    for names in teams:
        if team1 in names and team2 in names:
            return True
    return False

Output from this:

>>> equivalent("MNU", "ManUtd")
True
>>> equivalent("MNU", "Arsenal")
False
>>> equivalent("MNU", "Not a name")
False

Other tips

For simplicity, you can just put everything in a flat canonical lookup:

canonical = {'Arsenal':'ARS',
             'ARS':'ARS',
             'Manchester United':'MNU',
             'MNU':'MNU',
             'ManUtd':'MNU',
             ...}

Then equivalence testing is easy:

if canonical[x] == canonical[y]:
    pass  # they're the same team

There are a lot of good alternative answers here, so broad picture: this approach is good if you never expect your canonical lookup to change. You can generate it once then forget about it. If it does frequently change, this is going to be miserable to maintain, so you should look elsewhere.
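
Note that canonical[x] raises KeyError for a name that is missing from the lookup. A small sketch of a wrapper that treats unknown names as matching nothing; same_team is a hypothetical helper, and the dict is a shortened illustrative copy of the one above:

```python
canonical = {
    "Arsenal": "ARS",
    "ARS": "ARS",
    "Manchester United": "MNU",
    "MNU": "MNU",
    "ManUtd": "MNU",
}


def same_team(x, y):
    """True only if both names are known and share a canonical code."""
    code_x = canonical.get(x)
    code_y = canonical.get(y)
    return code_x is not None and code_x == code_y
```

Whether an unknown name should return False or raise an error depends on how clean your sources are; raising may be safer if a typo should never pass silently.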

roippi's code can be made more maintainable if you define a function that inverts the dictionary:

def invertdict(d):
    inverted = {}
    for key, values in d.items():
        for part in values:
            # Append key to the tuple already stored for part, if any.
            inverted[part] = inverted.get(part, ()) + (key,)
    return inverted

If you do it this way, the values of canonical have to be defined as tuples:

canonical = {'Arsenal':('ARS',),
             'ARS':('ARS',),
             'Manchester United':('MNU',),
             'MNU':('MNU',),
             'ManUtd':('MNU',)}

but then you can simply:

print(invertdict(canonical))
# {'ARS': ('Arsenal', 'ARS'), 'MNU': ('Manchester United', 'MNU', 'ManUtd')}
print(invertdict(invertdict(canonical)))
# {'Arsenal': ('ARS',), 'ARS': ('ARS',), 'Manchester United': ('MNU',), 'MNU': ('MNU',), 'ManUtd': ('MNU',)}
# this is canonical again

You may then want to define the inverted form (one entry per team) at the start and use invertdict to derive canonical from it, which you can then use to compare your teams.
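
That workflow might look like the following sketch, where groups is a hypothetical name for the hand-maintained inverted mapping and invertdict is repeated so the example runs on its own:

```python
def invertdict(d):
    """Swap keys and values of a dict whose values are tuples."""
    inverted = {}
    for key, values in d.items():
        for part in values:
            inverted[part] = inverted.get(part, ()) + (key,)
    return inverted


# Maintain one entry per team (illustrative data) ...
groups = {
    "ARS": ("Arsenal", "ARS"),
    "MNU": ("Manchester United", "MNU", "ManUtd"),
}

# ... and derive the per-name lookup from it.
canonical = invertdict(groups)


def is_same_team(x, y):
    # Each name maps to a one-element tuple holding its team code.
    return canonical[x] == canonical[y]
```

Adding a new spelling then means touching a single tuple in groups.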

hope it helps

What I would do:

class Team:
    def __init__(self, name, all_names):
        self.name = name  # use name as its "proper" name
        self.all_names = all_names  # list of all acceptable names and abbreviations

man = Team('Manchester United',['Manchester United', 'MNU', 'ManUtd'])

You could then use if 'MNU' in man.all_names
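
With several such objects, a lookup across them could be sketched as follows. The Team class is repeated so the example is self-contained; find_team and the sample data are assumptions for illustration:

```python
class Team:
    def __init__(self, name, all_names):
        self.name = name            # the "proper" name
        self.all_names = all_names  # every acceptable name and abbreviation


# Illustrative data only.
teams = [
    Team("Arsenal", ["Arsenal", "ARS"]),
    Team("Manchester United", ["Manchester United", "MNU", "ManUtd"]),
]


def find_team(alias, teams):
    """Return the first Team that lists alias, or None if there is none."""
    for team in teams:
        if alias in team.all_names:
            return team
    return None
```

Two aliases then name the same team when find_team returns the same object for both.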

I think the best way to do it is close to what you have, using a list of tuples of all the correlated names.

def __eq__(self, other):
    # Find the tuple that contains this team's name.
    for names in self.teams:
        if self.teamname in names:
            # Equal only if the other name is in that same tuple.
            return other.teamname in names
    # No tuple contains this name, so it cannot match anything.
    return False

The advantage over a dict or a list of abbreviations is that it doesn't matter which version of the name is used as the "key".


You could attempt to automate the matching with something like this:

def __eq__(self, other):
    a, b = self.teamname.lower(), other.teamname.lower()
    if len(a) == len(b):
        return a == b
    # Every letter of the shorter name must occur somewhere in the longer one.
    shorter, longer = sorted((a, b), key=len)
    return all(letter in longer for letter in shorter)

Note that this method won't be perfect, since it requires all the letters of the abbreviation to appear in the full version (which may not always be the case). You could build a more sophisticated matching scheme than this one to get more reliable results.
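
One slightly stricter heuristic, offered only as a sketch: require the letters of the shorter name to appear in the longer name in the same order (a subsequence test), ignoring case and spaces. looks_like_same_team is a hypothetical helper; it can still produce false matches, so it is better suited to suggesting candidates than to deciding equality.

```python
def looks_like_same_team(name_a, name_b):
    """Heuristic: is the shorter name an ordered subsequence of the longer?"""
    a = name_a.lower().replace(" ", "")
    b = name_b.lower().replace(" ", "")
    shorter, longer = sorted((a, b), key=len)
    # Membership tests on an iterator consume it, which enforces the order.
    remaining = iter(longer)
    return all(letter in remaining for letter in shorter)
```

For instance, "MNU" and "WestBrom" pass against their full names, while "ARS" fails against "Manchester United" because no "s" follows the "r".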

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow