Question

The closest existing question I have found is this or this

I would like to write a function or class that accepts a string and, based on whatever criteria can be programmed into it, returns the probability that the string is a real human name. For now I would expect it to be heavily biased toward English or European names, or English transliterations of other names. For example, "bob", "bob smith", and "smith" should all return 1.0, and "sfgoisxdzzg" should return something like .001 or even .0000001.

Does anyone know whether this has already been done or is being done, even in another language? My first thought was that I would need some sort of machine learning script. My problem with that is my complete ignorance of machine learning theory.

So, the second part of my question is this: Is machine learning a viable option for tackling this problem? If so, what resources should I start with to learn how to do it? If not, can you point me in the right direction?

Solution

This is a scoring approach, in the spirit of Bayesian spam filtering, that I use with quite a bit of success on a contact submission form and a request-for-quote form. Each form scores its fields and handles requests from all over the world in various languages. Only if a submission fails 3 or 4 tests across various fields do I mark it as a spam attempt. Obviously, a value like '123456' in the phone number field throws up a red flag instantly. BBCode in the comments is also a dead giveaway.

<?php
// Returns a penalty score for a submitted name: 0 means no test failed,
// and a higher score means the value looks less like a real name.
function nameCheck($var) {
    $nameScore = 0;
    $chars_count = strlen($var);
    // Y is counted as a vowel, so it is left out of the consonant class.
    $consonants = preg_replace('![^BCDFGHJKLMNPQRSTVWXZ]!i', '', $var);
    $consonant_count = strlen($consonants);
    $vowels = preg_replace('![^AEIOUY]!i', '', $var);
    $vowel_count = strlen($vowels);

    // We're expecting first and last name.
    // If name < 4 characters, score + 3.
    if ($chars_count < 4) {
        $nameScore += 3;
    }

    // If name > 4 characters and no spaces, score + 4.
    if (($chars_count > 4) && (!preg_match('![ ]!', $var))) {
        $nameScore += 4;
    }

    // If name > 4 characters and has no consonants or no vowels, score + 5.
    if (($chars_count > 4) && (($consonant_count == 0) || ($vowel_count == 0))) {
        $nameScore += 5;
    }

    // If name > 4 characters and vowel-to-consonant ratio < 1/8, score + 5.
    if (($consonant_count > 0) && ($vowel_count > 0) && ($chars_count > 4) && ($vowel_count / $consonant_count < 1 / 8)) {
        $nameScore += 5;
    }

    // Needs at least 1 letter, otherwise score + 10.
    if (!preg_match('![A-Za-z]!', $var)) {
        $nameScore += 10;
    }

    return $nameScore;
}

// Added for testing: reads the value from the query string,
// defaulting to an empty string when the parameter is missing.
$var = $_GET['email'] ?? '';
echo nameCheck($var);
?>
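
The question asks for a value between 0 and 1, while nameCheck() returns a penalty score where higher means more suspicious. As a rough sketch, the score could be squashed into that range. The helper name scoreToLikelihood and the 1 / (1 + score) mapping are assumptions made for illustration. They are not part of the filter above, and the result is not a calibrated probability.

```php
<?php
// Hypothetical helper (not part of the original filter): maps a penalty
// score (0 = no test failed) to a value in (0, 1].
// The formula is an arbitrary assumption, not a real probability.
function scoreToLikelihood(int $score): float {
    // Negative scores are not expected; clamp them to zero.
    $score = max(0, $score);
    return 1.0 / (1.0 + $score);
}

echo scoreToLikelihood(0), "\n";  // no test failed
echo scoreToLikelihood(10), "\n"; // several tests failed
```

A mapping like this only preserves the ordering of the scores. Getting values that behave like real probabilities would need labeled examples to calibrate against.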

Even when a submission fails, I have the form copy me on the attempt so I can adjust my scoring. There are a few false positives, usually in Chinese or Korean, but for the most part anyone who completes the form in English will pass. Names like "Wu Xi" do exist.
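
The multi-field decision described above (mark a submission as spam only after several tests fail) might look roughly like the sketch below. The field names, the individual tests, the list of BBCode tags, and the threshold of 3 are all assumptions chosen to mirror the description. This is not the actual form code.

```php
<?php
// Assumed threshold: the answer says "3 or 4" failed tests.
const SPAM_THRESHOLD = 3;

// Hypothetical test: flags a phone number whose digits form a sequential
// run such as '123456', or that contains no digits at all.
function phoneLooksFake(string $phone): bool {
    $digits = preg_replace('/\D/', '', $phone);
    return $digits === '' || strpos('01234567890', $digits) !== false;
}

// Hypothetical test: flags a few common BBCode tags in free text.
function hasBBCode(string $text): bool {
    return preg_match('/\[(url|img|b|i)\b[^\]]*\]/i', $text) === 1;
}

// Counts how many tests a submission fails across its fields.
function countFailedTests(array $form): int {
    $name = $form['name'] ?? '';
    $failed = 0;
    if (!preg_match('/[A-Za-z]/', $name)) { $failed++; }                 // no letter in name
    if (strlen($name) > 4 && strpos($name, ' ') === false) { $failed++; } // long name, no space
    if (phoneLooksFake($form['phone'] ?? '')) { $failed++; }
    if (hasBBCode($form['comment'] ?? '')) { $failed++; }
    return $failed;
}

// A submission is treated as spam only when enough tests fail.
function isSpamAttempt(array $form): bool {
    return countFailedTests($form) >= SPAM_THRESHOLD;
}

$spam = ['name' => 'sfgoisxdzzg', 'phone' => '123456',
         'comment' => '[url=http://example.com]cheap[/url]'];
$legit = ['name' => 'Bob Smith', 'phone' => '555-867-5309',
          'comment' => 'Please send a quote.'];

var_dump(isSpamAttempt($spam));  // bool(true)
var_dump(isSpamAttempt($legit)); // bool(false)
```

Requiring several failures before rejecting is what keeps a single unusual field, such as a short but genuine name, from blocking a legitimate submission.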

Licensed under: CC-BY-SA with attribution