Programmatically extract keywords from domain names

https://stackoverflow.com/questions/1315373

19-09-2019

Question

Let's say I have a list of domain names that I would like to analyze. Unless the domain name is hyphenated, I don't see a particularly easy way to "extract" the keywords used in the domain. Yet I see it done on sites such as DomainTools.com, Estibot.com, etc. For example:

ilikecheese.com becomes "i like cheese"
sanfranciscohotels.com becomes "san francisco hotels"
...

Any suggestions for accomplishing this efficiently and effectively?

Edit: I'd like to write this in PHP.

Solution

Ok, I ran the script I wrote for this SO question, with a couple of minor changes -- using log probabilities to avoid underflow, and modifying it to read multiple files as the corpus.

For my corpus I downloaded a bunch of files from Project Gutenberg -- no real method to this, just grabbed all English-language files from etext00, etext01, and etext02.

Below are the results; I saved the top three segmentations for each domain, with the log-probability score after each one (closer to zero is more likely).

expertsexchange: 97 possibilities
 -  experts exchange -23.71
 -  expert sex change -31.46
 -  experts ex change -33.86

penisland: 11 possibilities
 -  pen island -20.54
 -  penis land -22.64
 -  pen is land -25.06

choosespain: 28 possibilities
 -  choose spain -21.17
 -  chooses pain -23.06
 -  choose spa in -29.41

kidsexpress: 15 possibilities
 -  kids express -23.56
 -  kid sex press -32.65
 -  kids ex press -34.98

childrenswear: 34 possibilities
 -  children swear -19.85
 -  childrens wear -25.26
 -  child ren swear -32.70

dicksonweb: 8 possibilities
 -  dickson web -27.09
 -  dick son web -30.51
 -  dicks on web -33.63
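
The scoring idea behind these results can be sketched in PHP, the language the question asks for. This is not the script that produced the numbers above (that script is not reproduced here). It is a minimal illustration that assumes a plain unigram model: every possible split into known words is enumerated, and each split is scored by the sum of the log probabilities of its words. The function names (`buildCounts`, `segmentations`, `rankSegmentations`) and the tiny inline corpus are hypothetical; a real run would need a large corpus such as the Project Gutenberg files.

```php
<?php
// Count word frequencies in a corpus string (lowercased, letters only).
function buildCounts(string $corpus): array {
    preg_match_all('/[a-z]+/', strtolower($corpus), $m);
    return array_count_values($m[0]);
}

// Enumerate every way to split $s into words that occur in $counts.
function segmentations(string $s, array $counts): array {
    if ($s === '') {
        return [[]];               // exactly one way to split the empty string
    }
    $results = [];
    for ($len = 1; $len <= strlen($s); $len++) {
        $head = substr($s, 0, $len);
        if (!isset($counts[$head])) {
            continue;              // prefix is not a known word
        }
        foreach (segmentations(substr($s, $len), $counts) as $rest) {
            $results[] = array_merge([$head], $rest);
        }
    }
    return $results;
}

// Score each split by the sum of log probabilities and sort best first.
function rankSegmentations(string $s, array $counts): array {
    $total = array_sum($counts);
    $scored = [];
    foreach (segmentations($s, $counts) as $words) {
        $score = 0.0;
        foreach ($words as $w) {
            // Adding logs replaces multiplying tiny probabilities (no underflow).
            $score += log($counts[$w] / $total);
        }
        $scored[implode(' ', $words)] = $score;
    }
    arsort($scored);               // highest (least negative) score first
    return $scored;
}

// Toy corpus, for illustration only.
$corpus = 'choose spain choose spain choose a spa in spain chooses pain spain';
$ranked = rankSegmentations('choosespain', buildCounts($corpus));
print_r(array_slice($ranked, 0, 3, true));
```

With this toy corpus the three splits come out in the same order as in the results above ("choose spain", "chooses pain", "choose spa in"), although the scores themselves depend entirely on the corpus. Enumerating every split is exponential in the worst case, so for long strings memoizing `segmentations` by suffix would probably be worthwhile. A domain containing a word that never appears in the corpus yields no splits at all.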

OTHER TIPS

Might want to check out this SO question.

You need to develop a heuristic that will get likely matches out of the domain. The way I would do it is first find a large corpus of text. For example, you could download Wikipedia.

Next take your corpus, and combine every two adjacent words. For example, if your sentence is:

quick brown fox jumps over the lazy dog

You'll create a list:

quickbrown
brownfox
foxjumps
jumpsover
overthe
thelazy
lazydog

Each of these would have a count of one. As you parse your corpus, you'll keep track of the frequency of every pair of adjacent words. Additionally, for each pair, you'll need to store what the original two words were.

Sort this list by frequency, and then attempt to find matches in your domain based on these words.

Lastly, do a domain check for the top two word phrases which aren't registered!
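
The pair-counting steps above might look roughly like this in PHP. It is only a sketch under a few assumptions: the corpus fits in a single string, words are runs of the letters a-z, and the helper names `countAdjacentPairs` and `matchPairs` are made up for the example. Each key keeps the two original words separated by a space, so the concatenated form can be derived when matching.

```php
<?php
// Count every pair of adjacent words; the key keeps the original two words.
function countAdjacentPairs(string $corpus): array {
    preg_match_all('/[a-z]+/', strtolower($corpus), $m);
    $words = $m[0];
    $pairs = [];
    for ($i = 0; $i + 1 < count($words); $i++) {
        $key = $words[$i] . ' ' . $words[$i + 1];
        $pairs[$key] = ($pairs[$key] ?? 0) + 1;
    }
    arsort($pairs);                       // most frequent pairs first
    return $pairs;
}

// Return the pairs whose concatenation ("lazy dog" -> "lazydog") occurs in $label.
function matchPairs(string $label, array $pairs): array {
    $matches = [];
    foreach ($pairs as $pair => $count) {
        if (strpos($label, str_replace(' ', '', $pair)) !== false) {
            $matches[$pair] = $count;
        }
    }
    return $matches;
}

$pairs = countAdjacentPairs('quick brown fox jumps over the lazy dog');
print_r(matchPairs('thelazydog', $pairs));
```

Storing the pair as "word word" rather than as the concatenated string avoids losing information when two different pairs concatenate to the same letters. Scanning the whole pair list for every domain is linear in the size of the list, which may be too slow for a corpus the size of Wikipedia without some further indexing.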

I think sites like DomainTools take a list of the highest-ranking words and try to parse those words out first. Depending on the purpose, you may want to consider using MTurk to do the job. Different people will parse the same words differently, and might not do so in proportion to how common the words are.

choosespain.com
kidsexpress.com
childrenswear.com
dicksonweb.com

Have fun (and a good lawyer) if you are going to try to parse the URL with a dictionary.

You might do better if you can find the same characters but separated by white space on their web site.
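
One way to try this, sketched in PHP: build a regular expression from the domain label that allows optional whitespace between any two characters, then search the page text for a match that actually contains whitespace. The function name `findSpacedForm` is hypothetical, and the page text is assumed to have been fetched and stripped of markup already (that step is not shown).

```php
<?php
// Look for the label's characters, in order, with whitespace allowed between
// them, and return the first occurrence that actually contains whitespace.
function findSpacedForm(string $label, string $pageText): ?string {
    $chars = array_map(
        function ($c) { return preg_quote($c, '/'); },
        str_split($label)
    );
    $pattern = '/\b' . implode('\s*', $chars) . '\b/i';
    if (preg_match_all($pattern, $pageText, $m)) {
        foreach ($m[0] as $hit) {
            if (preg_match('/\s/', $hit)) {
                // Normalize runs of whitespace and case before returning.
                return strtolower(preg_replace('/\s+/', ' ', $hit));
            }
        }
    }
    return null;                          // no spaced form found on the page
}

$text = 'Welcome to KidsExpress.com! Kids Express delivers toys overnight.';
echo findSpacedForm('kidsexpress', $text), "\n";   // prints "kids express"
```

This only helps when the site spells its own name out somewhere in its text, so it is probably best treated as one signal to combine with a dictionary or corpus-based split.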

Other possibilities: extract data from the SSL certificate; query the top-level domain name server; access the domain name server (TLD); or use one of the "whois" tools or services (just google "whois").

If you have a list of valid words, you can loop through your domain string and try to cut off a valid word each time, using a backtracking algorithm. If you manage to use up the whole string, you are finished. Be aware that the time complexity of this is not optimal :)
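
A minimal PHP sketch of that backtracking idea follows. It assumes the word list is held as an array keyed by word, and it tries longer words first (a choice made for this example, not something the description above prescribes). The function name `splitWords` and the small dictionary are hypothetical.

```php
<?php
// Try to split $s completely into words from $dictionary (a set: word => true).
// Returns the list of words, or null if no complete split exists.
function splitWords(string $s, array $dictionary): ?array {
    if ($s === '') {
        return [];                       // whole string consumed: success
    }
    // Prefer longer words first, then backtrack to shorter ones.
    for ($len = strlen($s); $len >= 1; $len--) {
        $head = substr($s, 0, $len);
        if (isset($dictionary[$head])) {
            $rest = splitWords(substr($s, $len), $dictionary);
            if ($rest !== null) {
                return array_merge([$head], $rest);
            }
            // Dead end: fall through and try a shorter first word.
        }
    }
    return null;
}

$dictionary = array_fill_keys(
    ['i', 'like', 'cheese', 'choose', 'chooses', 'spain'],
    true
);
echo implode(' ', splitWords('ilikecheese', $dictionary)), "\n";   // i like cheese
echo implode(' ', splitWords('choosespain', $dictionary)), "\n";   // choose spain
```

In the second call the function first cuts off "chooses", finds that "pain" is not in this small dictionary, and backtracks to "choose". It returns only the first complete split it finds; ranking several candidate splits would need some kind of frequency information on top.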

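function getwords( $string, $minLength = 4 ) {
    // Skip internationalized (punycode) domain names.
    if( strpos( $string, 'xn--' ) !== false ) {
        return false;
    }
    // Remove hyphens so the scan sees one continuous string.
    $string = trim( str_replace( '-', '', $string ) );
    $pspell = pspell_new( 'en' );
    if( $pspell === false ) {
        return false; // the dictionary could not be loaded
    }
    $length = strlen( $string );
    $words = array();
    // Check every substring of at least $minLength characters.
    for( $start = 0; $start <= $length - $minLength; $start++ ) {
        for( $size = $minLength; $size <= $length - $start; $size++ ) {
            $candidate = substr( $string, $start, $size );
            if( pspell_check( $pspell, $candidate ) ) {
                $words[] = $candidate;
            }
        }
    }
    // Remove duplicates and re-index the keys.
    $words = array_values( array_unique( $words ) );
    return count( $words ) > 0 ? $words : false;
}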

print_r( getwords( 'ilikecheesehotels' ) );

Array
(
    [0] => like
    [1] => cheese
    [2] => hotel
    [3] => hotels
)

This is a simple start using pspell. You might want to compare the results, check whether you got both a word and its stem without the "s" at the end (for example "hotels" and "hotel"), and merge them.
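
The merge step could be sketched like this. It assumes that "merging" means keeping the longer form and dropping the stem, which is one reasonable reading rather than something stated above. The function name `mergePlurals` is hypothetical.

```php
<?php
// Drop a word when the same list also contains it with a trailing "s",
// keeping only the longer form ("hotel" is dropped in favour of "hotels").
function mergePlurals(array $words): array {
    $set = array_flip($words);            // word => index, for fast lookup
    $kept = [];
    foreach ($words as $word) {
        if (!isset($set[$word . 's'])) {
            $kept[] = $word;
        }
    }
    return $kept;
}

print_r(mergePlurals(['like', 'cheese', 'hotel', 'hotels']));
```

Applied to the output shown above, this leaves "like", "cheese" and "hotels". It only handles a plain trailing "s"; plurals such as "cities" would need a proper stemmer.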

You would have to run a dictionary engine against the domain entry to find valid words, and then run that dictionary engine against the result to ensure that it consists only of valid words.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow