Question

Problem

We are receiving strings that may represent either a company name or a person's name. We need a heuristic to determine which.

Initial thoughts

  • Use an XML doc with either a <Commercial>String</Commercial> or a <Personal>String</Personal> node, and score matching strings +1 (sorry, I don't know how to format XML on SO)

  • Can't just check for proper nouns. E.g. "Bob's Company" is a company, whereas "Bob Compton" is a name

  • Need to return a confidence level in some format. I can't think of how to express it as a percentage; all I can think of is to increment an integer each time a match is found

  • Possible commercial tokens (all converted to lower case): co, co., inc, inc., etc. (plus the verbose version of each)

  • I can get an English name list online
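
A minimal sketch of the +1 scoring idea from the list above, in Python. The word lists and the names `COMMERCIAL_TOKENS`, `GIVEN_NAMES`, and `score` are hypothetical placeholders, not part of any library; real lists would be far larger.

```python
import re

# Hypothetical, deliberately tiny word lists (assumptions for illustration).
COMMERCIAL_TOKENS = {"co", "company", "inc", "incorporated", "ltd", "limited"}
GIVEN_NAMES = {"bob", "alice", "fred", "mary"}

def tokenize(text):
    """Lower-case the input and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def score(text):
    """Return (commercial_score, personal_score), adding 1 per matching token."""
    tokens = tokenize(text)
    commercial = sum(1 for t in tokens if t in COMMERCIAL_TOKENS)
    personal = sum(1 for t in tokens if t in GIVEN_NAMES)
    return commercial, personal

print(score("Bob's Company"))  # one commercial hit and one personal hit
print(score("Bob Compton"))    # personal hit only
```

Note that "Bob's Company" scores on both sides, which is exactly the ambiguity described above, so the two scores probably need to be compared rather than read in isolation.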

Question

Has anyone run into this kind of domain problem before? What methods did you use? Any flashy way of solving this?

Thank You.

Solution

I haven't done this before, but some other thoughts:

Check for non-proper nouns (e.g. "and", "the", "piping"). In fact, if you have an English dictionary and a names list, any word that is not a name could be a good pointer to a company name.
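
A rough sketch of that check, assuming you already have both word lists loaded as sets. `DICTIONARY_WORDS` and `KNOWN_NAMES` here are tiny hypothetical samples, and `non_name_words` is an illustrative helper name.

```python
import re

# Hypothetical stand-ins for a full English dictionary and a names list.
DICTIONARY_WORDS = {"and", "the", "piping", "supply", "martin"}
KNOWN_NAMES = {"bob", "compton", "fred", "meyer", "martin"}

def non_name_words(text):
    """Return the tokens that are ordinary dictionary words but not names."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t in DICTIONARY_WORDS and t not in KNOWN_NAMES]

print(non_name_words("Bob and Sons Piping"))  # the ordinary words found
print(non_name_words("Fred Meyer"))           # nothing found
```

Any non-empty result is a pointer towards a company. Words that are both a name and a dictionary word ("martin" in this sample) are excluded, so they give no signal either way.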

A big problem is that some companies are simply named after one or more people. "Fred Meyer", "J.C. Penney", and "Lockheed Martin" are examples of companies that look just like human names. There's likely no really good way around this (probably nothing easy, anyway). If you can categorize first and last names, a double last name or a last name on its own might be a good reason to lower the certainty that the string is a person.
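
One way the first/last name idea might look, as a sketch only. The sets `FIRST_NAMES` and `LAST_NAMES` and the function `surname_only_penalty` are hypothetical, and the sample data is far too small to be useful as is.

```python
import re

# Hypothetical categorized name lists (assumptions for illustration).
FIRST_NAMES = {"fred", "bob", "james"}
LAST_NAMES = {"meyer", "compton", "lockheed", "martin", "penney"}

def surname_only_penalty(text):
    """Return 1 if the string contains last names but no first name, else 0."""
    tokens = re.findall(r"[a-z]+", text.lower())
    firsts = sum(1 for t in tokens if t in FIRST_NAMES)
    lasts = sum(1 for t in tokens if t in LAST_NAMES)
    return 1 if firsts == 0 and lasts >= 1 else 0

print(surname_only_penalty("Lockheed Martin"))  # two last names, no first name
print(surname_only_penalty("Fred Meyer"))       # looks like a normal full name
```

The penalty would be subtracted from the "person" score. "Fred Meyer" still passes as a person here, which illustrates why this can only lower certainty rather than settle the question.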

I would agree with your integer idea. Unless you can do some very broad and very thorough testing, your percentages would probably be meaningless. I would probably run all the tests (each returning name, company, or unknown) and compare the results, adding up an integer based on how consistently they agree.
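
A small sketch of combining the test results into an integer, under the assumption that each test returns the string "name", "company", or "unknown". The function name `combine` and the choice of "margin between the two labels" as the confidence are my own illustration, not an established method.

```python
from collections import Counter

def combine(votes):
    """Combine per-test verdicts into (label, integer confidence).

    Confidence is the margin by which the winning label beats the other one;
    ties and all-unknown inputs yield ("unknown", 0).
    """
    counts = Counter(v for v in votes if v != "unknown")
    if not counts:
        return "unknown", 0
    ranked = counts.most_common()
    label, top = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    if top == runner_up:
        return "unknown", 0
    return label, top - runner_up

print(combine(["company", "company", "unknown"]))
print(combine(["name", "company"]))
```

The integer only ranks results relative to each other; it is not a probability, in line with the caution about percentages above.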

OTHER TIPS

Can you compare to a database of known company names?

E.g. in the UK: http://wck2.companieshouse.gov.uk

Of course, this doesn't help when the string really is someone's name but a company with the same name also exists.
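
If you can obtain such a list as a local file or table, the lookup itself could be as simple as the sketch below. `KNOWN_COMPANIES` is a hypothetical in-memory sample, and `normalize` and `is_known_company` are illustrative names; nothing here calls the site linked above.

```python
import re

def normalize(text):
    """Lower-case and drop punctuation so 'J.C. Penney' matches 'j c penney'."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

# Hypothetical sample; a real list would be loaded from your data source.
KNOWN_COMPANIES = {normalize(n) for n in ["Fred Meyer", "J.C. Penney", "Lockheed Martin"]}

def is_known_company(text):
    """Return True if the normalized string appears in the known-company set."""
    return normalize(text) in KNOWN_COMPANIES

print(is_known_company("j.c. penney"))
print(is_known_company("Bob Compton"))
```

A hit is best treated as one more vote for "company" rather than a final answer, for the reason given above.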

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow