find duplicate addresses in database, stop users entering them early?

https://stackoverflow.com/questions/37568

09-06-2019
|

Question

How do I find duplicate addresses in a database, or better stop people already when filling in the form ? I guess the earlier the better?

Is there any good way of abstracting street, postal code etc so that typos and simple attempts to get 2 registrations can be detected? like:

Quellenstrasse 66/11 
Quellenstr. 66a-11

I'm talking German addresses... Thanks!

Solution

Johannes:

@PConroy: This was my initial thougt also. the interesting part on this is to find good transformation rules for the different parts of the address! Any good suggestions?

When we were working on this type of project before, our approach was to take our existing corpus of addresses (150k or so), then apply the most common transformations for our domain (Ireland, so "Dr"->"Drive", "Rd"->"Road", etc). I'm afraid there was no comprehensive online resource for such things at the time, so we ended up basically coming up with a list ourselves, checking things like the phone book (pressed for space there, addresses are abbreviated in all manner of ways!). As I mentioned earlier, you'd be amazed how many "duplicates" you'll detect with the addition of only a few common rules!

I've recently stumbled across a page with a fairly comprehensive list of address abbreviations, although it's american english, so I'm not sure how useful it'd be in Germany! A quick google turned up a couple of sites, but they seemed like spammy newsletter sign-up traps. Although that was me googling in english, so you may have more look with "german address abbreviations" in german :)

OTHER TIPS

You could use the Google GeoCode API

Wich in fact gives results for both of your examples, just tried it. That way you get structured results that you can save in your database. If the lookup fails, ask the user to write the address in another way.

The earlier you can stop people, the easier it'll be in the long run!

Not being too familiar with your db schema or data entry form, I'd suggest a route something like the following:

have distinct fields in your db for each address "part", e.g. street, city, postal code, Länder, etc.
have your data entry form broken down similarly, e.g. street, city, etc

The reasoning behind the above is that each part will likely have it's own particular "rules" for checking slightly-changed addressed, ("Quellenstrasse"->"Quellenstr.", "66/11"->"66a-11" above) so your validation code can check if the values as presented for each field exist in their respective db field. If not, you can have a class that applies the transformation rules for each given field (e.g. "strasse" stemmed to "str") and checks again for duplicates.

Obviously the above method has it's drawbacks:

it can be slow, depending on your data set, leaving the user waiting
users may try to get around it by putting address "Parts" in the wrong fields (appending post code to city, etc). but from experience we've found that introducing even simple checking like the above will prevent a large percentage of users from entering pre-existing addresses.

Once you've the basic checking in place, you can look at optimising the db accesses required, refining the rules, etc to meet your particular schema. You might also take a look at MySQL's match() function for working out similar text.

Before you start searching for duplicate addresses in your database, you should first make sure you store the addresses in a standard format.

Most countries have a standard way of formatting addresses, in the US it's the USPS CASS system: http://www.usps.com/ncsc/addressservices/certprograms/cass.htm

But most other countries have a similar service/standard. Try this site for more international formats: http://bitboost.com/ref/international-address-formats.html

This not only helps in finding duplicates, but also saves you money when mailing you customers (the postal service charges less if the address is in a standard format).

Depending on your application, in some cases you might want to store a "vanity" address record as well as the standard address record. This keeps your VIP customers happy. A "vanity" address might be something like:

62 West Ninety First Street
Apartment 4D
Manhattan, New York, NY 10001

While the standard address might look like this:

62 W 91ST ST APT 4D
NEW YORK NY 10024-1414

One thing you might want to look at are Soundex searches, which are quite useful for misspellings and contractions.

This however is not an in-database validation so it may or may not be what you're looking for.

Another possible solution (assuming you actually need reliable address data and you're not just using addresses as a way to prevent duplicate accounts) is to use a third-party web service to standardize the addresses provided by your users.

It works this way -- your system accepts a user's address via an online form. Your form hands off the user's address to the third-party address standardization web service. The web service gives you back the same address but now with the data standardized into discrete address fields, and with the standard abbreviations and formats applied. Your application displays this standardized address to your user for their confirmation before attempting to save the data in your DB.

If all the user addresses go through a standardization step and only standardized addresses are saved to your DB, then finding duplicate records should be greatly simplified since you are now comparing apples to apples.

One such third-party service is Global Address's Interactive Service which includes Germany in the list of supported countries, and also has an online demo that demonstrates how their service works (demo link can be found on that web page).

There's a cost disadvantage to this approach, obviously. However, on the plus side:

you would not need to create and maintain your own address standardization metadata
you won't need to continuously enhance your address standardization routines, and
you're free to focus your software development energy on the parts of the application that are unique to your requirements

Disclaimer: I don't work for Global Address and have not tried using their service. I'm merely mentioning them as an example since they have an online demo that you can actually play with.

To add an answer to my own question:

A different way of doing it is ask users for their mobile phone number, send them a text msg for verification. This stops most people messing with duplicate addresses.

I'm talking from personal experience. (thanks pigsback !) They introduced confirmation through mobile phone. That stopped me having 2 accounts! :-)

I realize that the original post is specif to German addresses, but this is a good questions for addresses in general.

In the United States, there is a part of an address called a delivery point barcode. It's a unique 12-digit number that identifies a single point of delivery and can serve as the unique identifier of an address. To get this value you'll want to use an address verification or address standardization web service API, which can cost about $20/mo depending upon the volume of requests you make to it.

In the interest of full disclosure, I'm the founder of SmartyStreets. We offer just such an address validation web service API called LiveAddress. You're more than welcome to contact me personally with any questions you have.

Machine learning and AI has algorithms to find string similarities and duplicate measures.

Record linkage or the task of matching equivalent records that differ syntactically—was first explored in the late 1950s and 1960s.

You can represent every pair of records using a vector of features that describe the similarity between individual record fields.

For example, Adaptive Duplicate Detection Using Learnable String Similarity Measures. for example, read this doc

You can use generic or manually tuned distance metrics for estimating the similarity of potential duplicates.
You can use adaptive name matching algorithms, like Jaro metric, which is based on the number and order of common characters between two strings.
Token-based and hybrid distance. In such cases, we can convert the strings s and t to token multisets (where each token is a word) and consider similarity metrics on these multisets.

Often you use constraints in a database to ensure data to be "unique" in the data-based sense.

Regarding "isomorphisms" I think you are on your own, ie writing the code your self. If in the database you could use a trigger.

I'm looking for an answer addressing United States addresses

The issue in question is prevent users from entering duplicates like

Quellenstrasse 66/11 and Quellenstr. 66a-11

This happens when you let your user enter the complete address in input box.

There are some methods you can use to prevent this.

1. Uniform formatting using RegEx

You can prompt users to enter the details in a uniform format.
That is very efficient while querying too
test the user entered value against some regular expressions and if failed, ask user to correct it.

2.Use a map api like google maps and ask the user to select details from it.

If you choose google maps, you can achieve it using Reverse Geocoding.

From Google Developer's guide,

The term geocoding generally refers to translating a human-readable address into a location on a map. The process of doing the opposite, translating a location on the map into a human-readable address, is known as reverse geocoding.

3. Allow heterogeneous data as shown in the question and compare it with different formatting.

In the question, the OP allow address in different format.
In such case, you can change it to different forms and check it with database to get a solution.
This may take more time and the time is completely depends on the number of test cases.

4. Split the address into different parts and store it in db and provide such a form to user.

That is provide different fields to store Street, city, state etc in database.
Also provide the different input fields to user to enter street, city, state, etc in top down format.
When user enter state, narrow the query to find dupes to that state only.
When user enter city, narrow it to that city only.
When user enter the street, narrow it to that street.

And finally

When user enter the address, change it to different formats and test it against Data Base.

This is efficient even the number of test cases may high, the number of entries you test against will be very less and so it will consume very less amount of time.

In the USA, you can use USPS Address Standardization Web Tool. It verifies and normalizes addresses for you. This way, you can normalize the address before checking if it already exists in the database. If all the addresses in the database are already normalized, you'll be able to spot duplicates easily.

Sample URL:

https://production.shippingapis.com/ShippingAPI.dll?API=Verify&XML=insert_request_XML_here

Sample request:

<AddressValidateRequest USERID="XXXXX">
  <IncludeOptionalElements>true</IncludeOptionalElements>
  <ReturnCarrierRoute>true</ReturnCarrierRoute>
  <Address ID="0">  
    <FirmName />   
    <Address1 />   
    <Address2>205 bagwell ave</Address2>   
    <City>nutter fort</City>   
    <State>wv</State>   
    <Zip5></Zip5>   
    <Zip4></Zip4> 
  </Address>      
</AddressValidateRequest>

Sample response:

<AddressValidateResponse>
  <Address ID="0">
    <Address2>205 BAGWELL AVE</Address2>
    <City>NUTTER FORT</City>
    <State>WV</State>
    <Zip5>26301</Zip5>
    <Zip4>4322</Zip4>
    <DeliveryPoint>05</DeliveryPoint>
    <CarrierRoute>C025</CarrierRoute>
  </Address>
</AddressValidateResponse>

Other countries might have their own APIs. Other people mentioned 3rd party APIs that support multiple countries that might be useful in some cases.

As google fetch suggesions for search you can search database address fields

First, let’s create an index.htm(l) file:

    <!DOCTYPE html>
    <html lang="en">

    <head>
        <meta http-equiv="Content-Language" content="en-us">
        <title>Address Autocomplete</title>
        <meta charset="utf-8">
        <link href="//maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" rel="stylesheet">
        <script src="//code.jquery.com/jquery-2.1.4.min.js"></script>
        <script src="//maxcdn.bootstrapcdn.com/bootstrap/3.3.5/js/bootstrap.min.js"></script>
        <script src="//netsh.pp.ua/upwork-demo/1/js/typeahead.js"></script>
        <style>
            h1 {
                font-size: 20px;
                color: #111;
            }

            .content {
                width: 80%;
                margin: 0 auto;
                margin-top: 50px;
            }

            .tt-hint,
            .city {
                border: 2px solid #CCCCCC;
                border-radius: 8px 8px 8px 8px;
                font-size: 24px;
                height: 45px;
                line-height: 30px;
                outline: medium none;
                padding: 8px 12px;
                width: 400px;
            }

            .tt-dropdown-menu {
                width: 400px;
                margin-top: 5px;
                padding: 8px 12px;
                background-color: #fff;
                border: 1px solid #ccc;
                border: 1px solid rgba(0, 0, 0, 0.2);
                border-radius: 8px 8px 8px 8px;
                font-size: 18px;
                color: #111;
                background-color: #F1F1F1;
            }
        </style>
        <script>
            $(document).ready(function() {

                $('input.city').typeahead({
                    name: 'city',
                    remote: 'city.php?query=%QUERY'

                });

            })
        </script>

    <script>
            function register_address()
            {
                $.ajax({
                    type: "POST",
                    data: {
                        City: $('#city').val(),
                    },
                    url: "addressexists.php",
                    success: function(data)
                    {
                        if(data === 'ADDRESS_EXISTS')
                        {
                            $('#address')
                                .css('color', 'red')
                                .html("This address already exists!");
                        }

                    }
                })              
            }
        </script>
    </head>

    <body>
        <div class="content">

            <form>
                <h1>Try it yourself</h1>
                <input type="text" name="city" size="30" id="city" class="city" placeholder="Please Enter City or ZIP code">
<span id="address"></span>
            </form>
        </div>
    </body>
</html>

Now we will create a city.php file which will aggregate our query to MySQL DB and give response as JSON. Here is the code:

<?php

//CREDENTIALS FOR DB
define ('DBSERVER', 'localhost');
define ('DBUSER', 'user');
define ('DBPASS','password');
define ('DBNAME','dbname');

//LET'S INITIATE CONNECT TO DB
$connection = mysqli_connect(DBSERVER, DBUSER, DBPASS,"DBNAME") or die("Can't connect to server. Please check credentials and try again");


//CREATE QUERY TO DB AND PUT RECEIVED DATA INTO ASSOCIATIVE ARRAY
if (isset($_REQUEST['query'])) {
    $query = $_REQUEST['query'];
    $sql = mysqli_query ($connection ,"SELECT zip, city FROM zips WHERE city LIKE '%{$query}%' OR zip LIKE '%{$query}%'");
    $array = array();
    while ($row = mysqli_fetch_array($sql,MYSQLI_NUM)) {
        $array[] = array (
            'label' => $row['city'].', '.$row['zip'],
            'value' => $row['city'],
        );
    }
    //RETURN JSON ARRAY
    echo json_encode ($array);
}

?>

and then prevent saving them into database if found duplicate in table column

And for your addressexists.php code:

<?php//CREDENTIALS FOR DB
    define ('DBSERVER', 'localhost');
    define ('DBUSER', 'user');
    define ('DBPASS','password');
    define ('DBNAME','dbname');

    //LET'S INITIATE CONNECT TO DB
    $connection = mysqli_connect(DBSERVER, DBUSER, DBPASS,"DBNAME") or die("Can't connect to server. Please check credentials and try again");


    $city= mysqli_real_escape_string($_POST['city']); // $_POST is an array (not a function)
    // mysqli_real_escape_string is to prevent sql injection

    $sql = "SELECT username FROM ".TABLENAME." WHERE city='".$city."'"; // City must enclosed in two quotations

    $query = mysqli_query($connection,$sql);

    if(mysqli_num_rows($query) != 0)

    {
        echo('ADDRESS_EXISTS');
    }
?>

Match address to addresses provided by DET BundesPost to detect duplicates.

DET probably sells a CD like USA does. The problem then becomes matching to the Bundespost addresses. Just a long process of replacing abbreviations with the post approved abbreviations and such.

Same way in USA. Match to USPostOffice addresses (Sorry these cost money so its not entirely open CDs are available from the US post office) to find duplicates.

This is an old question, but another approach is to calculate the Levenshtein distance to the addresses and this way you can find already existing ones that are very similar. You can see more here. Finding Duplicate Addresses Using the Levenshtein Distance Metric in SQL.

In my opinion, assuming that you already had a lot of dirty data in your DB,

You have to do build your "handmade" dirty filter which may detect a maximum of german abreviation ...

But If you treat a lot of data, you will take the risk to find some false-positive and true-negative sample...

Finally a semi automated job (machine with human assist when probability of a case of false-positive or true-negative is too high) will be the best solution.

More you treat "exception" (because human raise exception when filling data), more your "handmade" filter will fit your requierement.

In the other hand, you may also use a germany address verification service on user side, and store only the verified one...

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow