Question

I have an array of correctly formatted phone numbers:

string[] phoneNumbers = {"US +1 866 XXX XXXX",
                         "UK +44 (0)XXX XXX XXXX",
                         "Singapore +65 XXXX XXXX"
                        };

The phone numbers that I am getting as input each correspond to one of the items in this list, but they are formatted slightly differently. The input can take one of these 3 forms. Note that the country names at the beginning are NOT included.

  • (866) XXX-XXXX
  • +44 (0) XXX XXXXXX
  • +65 XXXXXXXX

As you can see, my input is formatted slightly differently from the entries in the array.

My question is: what is a good way to pull the correctly formatted version of the number out of the array when I have an input that is formatted differently?

I am not asking anyone to write this for me, as I can handle the code fine. It is the logic that is tripping me up for some reason right now.

What I have thought about doing is keeping a parallel array of all the incorrectly formatted inputs, finding the index of the input in that array, and then taking the item at the same index in the correct array. Does this seem logical? Is there a better, faster way?


EDIT:

Currently I am getting the job done with this:

                // Requires: using System.Linq;
                // Reduce the user's input to its digits once, outside the loop.
                string tDialInNumber = new string(input.Where(char.IsDigit).ToArray());

                for (int i = 0; i < phoneNumbers.Length; i++)
                {
                    // Reduce the correctly formatted number to its digits as well.
                    string tDigitPhoneNumber = new string(phoneNumbers[i].Where(char.IsDigit).ToArray());

                    // The stored number may carry a country code that the input lacks
                    // (e.g. the US "1"), so test for containment rather than equality.
                    if (tDigitPhoneNumber.Contains(tDialInNumber))
                    {
                        dialInNumber = phoneNumbers[i];
                        break; // stop at the first match
                    }
                }

Solution

The canonical way to do that is:

  1. Transform your data into canonical form.
  2. Do a dumb comparison of the canonical forms.
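
The two steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the canonical form is taken to be "digits only", and an input written without a '+' is assumed to be a US national number that needs the country code 1 prepended (which fits the three input shapes in the question, but not phone numbers in general). The names `PhoneLookup`, `Canonicalize` and `FindFormatted` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class PhoneLookup
{
    // Step 1: reduce a number to its canonical form (digits only).
    // Assumption: a number written without '+' is a US national number,
    // so the country code "1" is prepended to it.
    public static string Canonicalize(string number)
    {
        string digits = new string(number.Where(char.IsDigit).ToArray());
        return number.Contains("+") ? digits : "1" + digits;
    }

    // Step 2: a plain equality lookup on the canonical forms.
    // Returns null when no stored number has the same canonical form.
    public static string FindFormatted(string[] phoneNumbers, string input)
    {
        Dictionary<string, string> byCanonical =
            phoneNumbers.ToDictionary(p => Canonicalize(p), p => p);

        return byCanonical.TryGetValue(Canonicalize(input), out var formatted)
            ? formatted
            : null;
    }
}
```

With made-up sample numbers, `PhoneLookup.FindFormatted(new[] { "US +1 866 555 0100", "Singapore +65 6555 0100" }, "(866) 555-0100")` returns the US entry. If the list is fixed, the dictionary can be built once and reused instead of being rebuilt on every call.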

OTHER TIPS

I would try Levenshtein distance first: http://en.wikipedia.org/wiki/Levenshtein_distance

Depending on the error rate, I would tune the algorithm by pre-classifying the strings into groups (you can use regexes to generate the classes of strings) and then comparing within each class using Levenshtein distance.
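
A minimal sketch of the Levenshtein idea, without the pre-classification step. It assumes that comparing only the digits of each string is acceptable (that choice is mine, not something stated above), and the names `FuzzyPhoneMatch`, `Levenshtein` and `Closest` are hypothetical.

```csharp
using System;
using System.Linq;

static class FuzzyPhoneMatch
{
    // Standard dynamic-programming edit distance, keeping only two rows.
    public static int Levenshtein(string a, string b)
    {
        int[] prev = new int[b.Length + 1];
        int[] curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                    prev[j - 1] + cost);                    // substitution or match
            }
            int[] tmp = prev; prev = curr; curr = tmp;      // reuse the two rows
        }
        return prev[b.Length];
    }

    static string Digits(string s) => new string(s.Where(char.IsDigit).ToArray());

    // Returns the stored number whose digits are closest to the input's digits.
    public static string Closest(string[] phoneNumbers, string input)
    {
        string target = Digits(input);
        return phoneNumbers.OrderBy(p => Levenshtein(Digits(p), target)).First();
    }
}
```

Note that `Closest` always returns some entry, even for an input that resembles none of them, so in practice you would probably also reject matches whose distance exceeds a threshold you choose.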

Another way would be to create a Bloom filter based on string patterns and then use it to match against the strings you want. I am not sure, though, whether it would work better in your case.

It seems that if you ignore '+', parentheses, spaces, and a leading 1, the leading 2 or 3 digits will match an entry in the set of country codes. So you can strip '+', parentheses, spaces, and a leading '1', see which country code the leading digits match, and then check that the number of trailing digits is what you expect for that country (otherwise the country is 'unknown'). Note that if a country code itself starts with '1', there are two possible matches for the leading digits. Also, if the digit count matches the US digit count and no country code matches, it is a US number.

Once you know the country, you can put the digits of the phone number into a standard template for that country, put the name of the country at the front if you want, and you are done.
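
Here is one way that country-code approach might look in code. It is a sketch under several assumptions: the table only covers the three countries from the question, the digit counts are read off the question's patterns (for the UK this includes the "(0)"), '#' is a placeholder character I chose for the templates, and the names `CountryTemplate`, `Format` and `Fill` are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text;

static class CountryTemplate
{
    // Assumed table: country code, digits expected after the code, output template.
    static readonly (string Code, int NationalDigits, string Template)[] Countries =
    {
        ("44", 11, "UK +44 (#)### ### ####"),
        ("65", 8,  "Singapore +65 #### ####"),
        ("1",  10, "US +1 ### ### ####"),
    };

    public static string Format(string input)
    {
        // Dropping everything but digits removes '+', parentheses, spaces and dashes.
        string digits = new string(input.Where(char.IsDigit).ToArray());

        foreach (var c in Countries)
        {
            // The leading digits must be the country code and the
            // remaining digit count must be what that country expects.
            if (digits.StartsWith(c.Code, StringComparison.Ordinal) &&
                digits.Length == c.Code.Length + c.NationalDigits)
            {
                return Fill(c.Template, digits.Substring(c.Code.Length));
            }
        }

        // No country code matched: a bare 10-digit number is treated as US.
        if (digits.Length == 10)
            return Fill("US +1 ### ### ####", digits);

        return null; // unknown country
    }

    // Copies the template, replacing each '#' with the next digit.
    static string Fill(string template, string digits)
    {
        var result = new StringBuilder(template.Length);
        int next = 0;
        foreach (char ch in template)
            result.Append(ch == '#' ? digits[next++] : ch);
        return result.ToString();
    }
}
```

One caveat with ignoring the '+': a 10-digit US number whose area code starts with 65 has the same digit pattern as a Singapore number with its country code, so this sketch would label it as Singapore. Checking whether the input actually began with '+' before trying country codes would avoid that.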

Licensed under: CC-BY-SA with attribution