Question

I have a table which is full of arbitrarily formatted phone numbers, like this

027 123 5644
021 393-5593
(07) 123 456
042123456

I need to search for a phone number in a similarly arbitrary format (e.g. 07123456 should find the entry (07) 123 456).

The way I'd do this in a normal programming language is to strip all the non-digit characters out of the 'needle', then go through each number in the haystack, strip all non-digit characters out of it, and compare it against the needle, e.g. in Ruby:

digits_only = lambda { |n| n.gsub(/\D/, '') }

needle = digits_only[input_phone_number]
haystack.map(&digits_only).include?(needle)

The catch is, I need to do this in MySQL. It has a host of string functions, none of which really seem to do what I want.

Currently I can think of 2 'solutions'

  • Hack together a franken-query of CONCAT and SUBSTR
  • Insert a % between every character of the needle ( so it's like this: %0%7%1%2%3%4%5%6% )

However, neither of these seems like a particularly elegant solution.
Hopefully someone can help, or I might be forced to use the %%%%%% solution.

Update: This is operating over a relatively fixed set of data, with maybe a few hundred rows. I just didn't want to do something ridiculously bad that future programmers would cry over.

If the dataset grows I'll take the 'phoneStripped' approach. Thanks for all the feedback!


could you use a "replace" function to strip out any instances of "(", "-" and " ",

I'm not concerned about the result being numeric. The main characters I need to consider are +, -, (, ) and space. So would that solution look like this?

SELECT * FROM people
WHERE
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(phonenumber, '(', ''), ')', ''), '-', ''), ' ', ''), '+', '')
LIKE '123456'

Wouldn't that be terribly slow?


Solution

This looks like a problem from the start. Any kind of searching you do will require a table scan and we all know that's bad.

How about adding a column with a hash of the current phone numbers after stripping out all formatting characters? Then you can at least index the hash values and avoid a full-blown table scan.

Or is the amount of data small and not expected to grow much? Then you could just pull all the numbers into the client and run the search there.

OTHER TIPS

I know this is ancient history, but I found it while looking for a similar solution.

A simple REGEXP may work:

select * from phone_table where phone1 REGEXP "07[^0-9]*123[^0-9]*456"

This would match the phone1 column whether or not separating characters appear between the 07, 123 and 456 groups.
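If separators can fall between any two digits (for example 027 12 3456), the pattern needs a non-digit class between every digit, not only between groups. Here is a minimal Ruby sketch of building such a pattern from the search input. The helper name phone_regex_source is hypothetical, and the resulting string is assumed to be handed to MySQL's REGEXP as a bound parameter; it is checked here with Ruby's own regex engine only.

```ruby
# Build a separator-tolerant pattern from arbitrary user input.
# phone_regex_source is a hypothetical helper name, not part of any library.
def phone_regex_source(input)
  digits = input.gsub(/\D/, '')            # keep only the digits of the needle
  raise ArgumentError, 'no digits in input' if digits.empty?
  # Allow any run of non-digits (including none) between consecutive digits.
  digits.chars.join('[^0-9]*')
end

pattern = phone_regex_source('07123456')

# Verified against the sample data with Ruby's regex engine; in MySQL the
# string would be used as: ... WHERE phone1 REGEXP ?
haystack = ['027 123 5644', '021 393-5593', '(07) 123 456', '042123456']
matches = haystack.grep(Regexp.new(pattern))
```

The pattern is unanchored, so a short needle such as 123456 would also match inside a longer number like 042123456. Whether that is desirable depends on the search form.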

An out-of-the-box idea, but could you use a "replace" function to strip out any instances of "(", "-" and " ", and then use an "isnumeric" function to test whether the resulting string is a number?

Then you could do the same to the phone number string you're searching for and compare them as integers.

Of course, this won't work for numbers like 1800-MATT-ROCKS. :)

My solution would be something along the lines of what John Dyer said. I'd add a second column (e.g. phoneStripped) that gets stripped on insert and update. Index this column and search on it (after stripping your search term, of course).

You could also add a trigger to automatically update the column, although I've not worked with triggers. But like you said, it's really difficult to write the MySQL code to strip the strings, so it's probably easier to just do it in your client code.
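As a rough illustration of the stripped-column idea in client code, the Ruby sketch below keeps a digits-only key alongside each original number and looks entries up by that key. PhoneBook and its method names are hypothetical, and the in-memory Hash merely stands in for an indexed phoneStripped column.

```ruby
# Keep a digits-only copy of each number and look it up through an index.
# PhoneBook is a hypothetical class; the Hash plays the role of an
# indexed phoneStripped column.
class PhoneBook
  def initialize
    @rows = {}                                    # id => original phone string
    @by_stripped = Hash.new { |h, k| h[k] = [] }  # stripped digits => ids
  end

  def self.strip(phone)
    phone.gsub(/\D/, '')
  end

  # Called on insert and on update, so the stripped copy never goes stale.
  def save(id, phone)
    old = @rows[id]
    @by_stripped[self.class.strip(old)].delete(id) if old
    @rows[id] = phone
    @by_stripped[self.class.strip(phone)] << id
  end

  # Strip the search term the same way, then do an exact lookup.
  def find(needle)
    @by_stripped.fetch(self.class.strip(needle), []).map { |id| @rows[id] }
  end
end

book = PhoneBook.new
book.save(1, '027 123 5644')
book.save(2, '(07) 123 456')
found = book.find('07123456')
```

The point of the design is that the expensive stripping happens once per write, so each search becomes an exact comparison that an index can serve.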

(I know this is late, but I just started looking around here :)

I suggest using PHP functions rather than MySQL patterns, so you would have code like this:

$tmp_phone = '';
for ($i=0; $i < strlen($phone); $i++)
   if (is_numeric($phone[$i]))
       $tmp_phone .= '%'.$phone[$i];
$tmp_phone .= '%';
$search_condition .= " and phone LIKE '" . $tmp_phone . "' ";

This is a problem with MySQL - the regex function can match, but it can't replace. See this post for a possible solution.

Is it possible to run a query to reformat the data to match a desired format and then just run a simple query? That way, even if the initial reformatting is slow, it doesn't really matter.

See

http://www.mfs-erp.org/community/blog/find-phone-number-in-database-format-independent

It is not really an issue that the regular expression would become visually appalling, since only MySQL "sees" it. Note that instead of '+' (cf. the OP's post with [\D]) you should use '*' in the regular expression, so that adjacent digits with no separator between them still match.

Some users are concerned about performance (non-indexed search), but in a table with 100000 customers, this query, when issued from a user interface returns immediately, without noticeable delay.

MySQL can search based on regular expressions.

Sure, but given the arbitrary formatting, if my haystack contained "(027) 123 456" (bear in mind the position of spaces can change; it could just as easily be 027 12 3456) and I wanted to match it with 027123456, would my regex therefore need to be this?

"^[\D]+0[\D]+2[\D]+7[\D]+1[\D]+2[\D]+3[\D]+4[\D]+5[\D]+6$"

(actually it'd be worse as the mysql manual doesn't seem to indicate it supports \D)

If that is the case, isn't it more or less the same as my %%%%% idea?

Just an idea, but couldn't you use Regex to quickly strip out the characters and then compare against that like @Matt Hamilton suggested?

Maybe even set up a view (not sure of mysql on views) that would hold all phone numbers stripped by regex to a plain phone number?

Woe is me. I ended up doing this:

mre = mobile_number && ('%' + mobile_number.gsub(/\D/, '').scan(/./m).join('%'))

find(:first, :conditions => ['trim(mobile_phone) like ?', mre])

If this is something that is going to happen on a regular basis, then modifying the data to be all one format and setting up the search form to strip out any non-alphanumeric characters (if you allow numbers like 310-BELL) would be a good idea. Having data in an easily searched format is half the battle.
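A small Ruby sketch of that normalization, applied identically to stored values and to the search input. Upcasing letters is an assumption added here so that 310-bell and 310-BELL compare equal.

```ruby
# Normalize stored values and search input the same way.
# Upcasing letters is an assumption so that "310-bell" equals "310-BELL".
normalize = lambda { |s| s.gsub(/[^0-9A-Za-z]/, '').upcase }

stored = ['310-BELL', '(07) 123 456'].map(&normalize)
hit = stored.include?(normalize['310 bell'])
```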

A possible solution can be found at http://udf-regexp.php-baustelle.de/trac/

An additional package needs to be installed; then you can use REGEXP_REPLACE.

Create a user-defined function that dynamically builds the regex.

DELIMITER //

CREATE FUNCTION udfn_GetPhoneRegex
(   
    var_Input VARCHAR(25)
)
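RETURNS VARCHAR(200)
DETERMINISTIC -- same input always gives the same output; a characteristic like this is needed when binary logging is enabled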

BEGIN
    DECLARE iterator INT          DEFAULT 1;
    DECLARE phoneregex VARCHAR(200)          DEFAULT '';

    DECLARE output   VARCHAR(25) DEFAULT '';


   WHILE iterator < (LENGTH(var_Input) + 1) DO
      IF SUBSTRING(var_Input, iterator, 1) IN ( '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' ) THEN
         SET output = CONCAT(output, SUBSTRING(var_Input, iterator, 1));
      END IF;
      SET iterator = iterator + 1;
   END WHILE;
    SET output = RIGHT(output,10);
    SET iterator = 1;
    WHILE iterator < (LENGTH(output) + 1) DO
         SET phoneregex = CONCAT(phoneregex,'[^0-9]*',SUBSTRING(output, iterator, 1));
         SET iterator = iterator + 1;
    END WHILE;
    SET phoneregex = CONCAT(phoneregex,'$');
   RETURN phoneregex;
END//
DELIMITER ;

Call that User Defined Function in your stored procedure.

DECLARE var_PhoneNumberRegex        VARCHAR(200);
SET var_PhoneNumberRegex = udfn_GetPhoneRegex('+ 123 555 7890');
SELECT * FROM Customer WHERE phonenumber REGEXP var_PhoneNumberRegex;

I would use Google's libphonenumber to format each number in E.164 format. I would add a second column called "e164_number" to store the E.164-formatted number and add an index on it.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow