Question

For a web application I'm building I need to analyze a website, retrieve and rank its most important keywords and display those.

Getting all the words and their density and displaying those is relatively simple, but it gives very skewed results (e.g. stopwords ranking very high).

Basically, my question is: How can I create a keyword analysis tool in PHP which results in a list correctly ordered by word importance?


Solution

Recently, I've been working on this myself, and I'll try to explain what I did as best as possible.

Steps

  1. Filter text
  2. Split into words
  3. Remove 2 character words and stopwords
  4. Determine word frequency + density
  5. Determine word prominence
  6. Determine word containers
    1. Title
    2. Meta description
    3. URL
    4. Headings
    5. Meta keywords
  7. Calculate keyword value

1. Filter text

The first thing you need to do is make sure the encoding is correct, so convert it to UTF-8:

$file = iconv ($encoding, "utf-8", $file); // where $encoding is the current encoding

After that, you need to strip all HTML tags, punctuation, symbols and numbers. You can find functions for each of these with a quick Google search.
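For illustration, here's a rough sketch of that stripping step, assuming $file already holds the UTF-8 text from the iconv call above (the regex and function choices are mine, not a definitive solution):

$text = strip_tags($file);                              // drop HTML tags (note: keeps <script>/<style> contents)
$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8'); // decode entities such as &amp;
$text = preg_replace('/[^\p{L}\s]+/u', ' ', $text);     // drop punctuation, symbols and numbers
$text = mb_strtolower($text, 'UTF-8');                  // normalize case before counting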

2. Split into words

$words = mb_split( ' +', $text );

3. Remove 2 character words and stopwords

Any word consisting of either 1 or 2 characters won't be of any significance, so we remove all of them.

To remove stopwords, we first need to detect the language. There are a couple of ways we can do this:

  - Checking the Content-Language HTTP header
  - Checking the lang="" or xml:lang="" attribute
  - Checking the Language and Content-Language metadata tags

If none of those are set, you can use an external API like the AlchemyAPI.

You will need a list of stopwords per language, which can be easily found on the web. I've been using this one: http://www.ranks.nl/resources/stopwords.html
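A minimal sketch of this filtering step, assuming $words comes from step 2 and $stopwords is an array loaded from one of those lists (both names are just illustrative):

$stopwords = array('the', 'and', 'for', 'that', 'with'); // ... full list per detected language

$keywords = array_filter($words, function ($word) use ($stopwords) {
    return mb_strlen($word, 'UTF-8') > 2 && !in_array($word, $stopwords, true);
});
$keywords = array_values($keywords); // reindex after filtering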

4. Determine word frequency + density

To count the number of occurrences per word, use this:

$uniqueWords = array_unique ($keywords); // $keywords is the $words array after being filtered as mentioned in step 3
$uniqueWordCounts = array_count_values ( $words );

Now loop through the $uniqueWords array and calculate the density of each word like this:

$density = $frequency / count ($words) * 100;

5. Determine word prominence

The word prominence is defined by the position of the words within the text. For example, the second word in the first sentence is probably more important than the sixth word in the 83rd sentence.

To calculate it, add this code within the same loop from the previous step:

$keys = array_keys ($words, $word); // $word is the word we're currently at in the loop
$positionSum = array_sum ($keys) + count ($keys);
$prominence = (count ($words) - (($positionSum - 1) / count ($keys))) * (100 / count ($words));
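To make steps 4 and 5 concrete, here is one way the loop could look when both snippets are combined (a sketch; $keywords and $words follow the naming used above):

$uniqueWords = array_unique($keywords);
$uniqueWordCounts = array_count_values($words);
$totalWords = count($words);

$results = array();
foreach ($uniqueWords as $word) {
    $frequency = $uniqueWordCounts[$word];
    $density = $frequency / $totalWords * 100;

    $keys = array_keys($words, $word);              // every position of the word in the text
    $positionSum = array_sum($keys) + count($keys); // convert 0-based keys to 1-based positions
    $prominence = ($totalWords - (($positionSum - 1) / count($keys))) * (100 / $totalWords);

    $results[$word] = array(
        'frequency'  => $frequency,
        'density'    => $density,
        'prominence' => $prominence,
    );
}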

6. Determine word containers

A very important part is to determine where a word resides - in the title, description and more.

First, you need to grab the title, all metadata tags and all headings using something like DOMDocument or PHPQuery (don't try to use regex!). Then you need to check, within the same loop, whether these contain the words.
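As a sketch of this step (variable names are mine), the extraction can be done once with DOMDocument and the per-word check added to the loop from steps 4-5:

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid
$doc->loadHTML($file);            // raw HTML from step 1
libxml_clear_errors();

$title = $description = $metaKeywords = $headings = '';
$titleNodes = $doc->getElementsByTagName('title');
if ($titleNodes->length > 0) {
    $title = mb_strtolower($titleNodes->item(0)->textContent, 'UTF-8');
}
foreach ($doc->getElementsByTagName('meta') as $meta) {
    $name = mb_strtolower($meta->getAttribute('name'), 'UTF-8');
    if ($name === 'description') { $description  = mb_strtolower($meta->getAttribute('content'), 'UTF-8'); }
    if ($name === 'keywords')    { $metaKeywords = mb_strtolower($meta->getAttribute('content'), 'UTF-8'); }
}
foreach (array('h1', 'h2', 'h3') as $tag) {
    foreach ($doc->getElementsByTagName($tag) as $h) {
        $headings .= ' ' . mb_strtolower($h->textContent, 'UTF-8');
    }
}

// inside the per-word loop from steps 4-5 ($url is the page URL, lower-cased):
$containers = array();
if (mb_strpos($title, $word) !== false)        { $containers[] = 'title'; }
if (mb_strpos($description, $word) !== false)  { $containers[] = 'description'; }
if (mb_strpos($metaKeywords, $word) !== false) { $containers[] = 'keywords'; }
if (mb_strpos($headings, $word) !== false)     { $containers[] = 'headings'; }
if (mb_strpos($url, $word) !== false)          { $containers[] = 'url'; }
$results[$word]['containers'] = $containers;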

7. Calculate keyword value

The last step is to calculate each keyword's value. To do this, you need to weigh each factor - density, prominence and containers. For example:

$value = (double) ((1 + $density) * ($prominence / 10)) * (1 + (0.5 * count ($containers)));

This calculation is far from perfect, but it should give you decent results.
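Putting it all together, a final ranking pass could look like this (a sketch built on the $results array from the earlier snippets):

$values = array();
foreach ($results as $word => $data) {
    $containers = isset($data['containers']) ? $data['containers'] : array();
    $values[$word] = (1 + $data['density']) * ($data['prominence'] / 10)
                   * (1 + 0.5 * count($containers));
}
arsort($values);                            // highest keyword value first
print_r(array_slice($values, 0, 20, true)); // top 20 keywords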

Conclusion

I haven't mentioned every single detail of what I used in my tool, but I hope it offers a good view into keyword analysis.

N.B. Yes, this was inspired by today's blog post about answering your own questions!

OTHER TIPS

One thing which is missing from your algorithm is document-oriented analysis (unless you omitted it intentionally for some reason).

Every site is built on a set of documents. Counting word frequencies across each and every document provides information about word coverage. Words which occur in most documents are stop words. Words specific to a limited number of documents can form a cluster of documents on a specific topic. The number of documents pertaining to a specific topic can increase the overall importance of that topic's words, or at least provide an additional factor to be counted in your formulae.
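As a rough sketch of that idea (assuming $documents is a hypothetical array of already-filtered word lists, one per page), document frequency - and from it an inverse-document-frequency factor - can be computed like this:

$documentFrequency = array();
foreach ($documents as $docWords) {
    foreach (array_unique($docWords) as $word) {
        $documentFrequency[$word] = isset($documentFrequency[$word]) ? $documentFrequency[$word] + 1 : 1;
    }
}

$totalDocs = count($documents);
$idf = array();
foreach ($documentFrequency as $word => $df) {
    // words occurring in nearly every document get an IDF near zero (stop-word-like);
    // rarer, topic-specific words score higher
    $idf[$word] = log($totalDocs / $df);
}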

Perhaps you could benefit from a preconfigured classifier which contains categories/topics and keywords for each of them (this task can be partially automated by indexing existing public hierarchies of categories, up to Wikipedia, but this is not a trivial task in itself). Then you can involve categories in the analysis.

Also, you can improve the statistics with sentence-level analysis. That is, given frequencies of how often words occur in the same sentence or phrase, you can discover clichés and duplicates and eliminate them from the statistics. But I'm afraid this is not easily implemented in pure PHP.

This is probably a small contribution, but I'll mention it nonetheless.

Context scoring

To a certain extent you're already looking at the context of a word by using the position in which it's placed. You could add another factor to this by ranking words that appear in a heading (H1, H2, etc.) higher than words inside a paragraph, which in turn rank higher than, say, words in a bulleted list, etc.

Frequency sanitization

Detecting stop words based on language might work, but perhaps you could consider using a bell curve to determine which word frequencies / densities are too extreme (e.g. strip everything below the 5th and above the 95th percentile), and then apply the scoring to the remaining words. Not only does it prevent stop words, it also limits keyword abuse, at least in theory :)
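A minimal sketch of that trimming idea, assuming $uniqueWordCounts is the word => frequency map from step 4 (the exact percentiles are of course tunable):

$counts = array_values($uniqueWordCounts);
sort($counts);
$n = count($counts);
$low  = $counts[(int) floor($n * 0.05)]; // 5th percentile frequency
$high = $counts[(int) floor($n * 0.95)]; // 95th percentile frequency

$trimmed = array_filter($uniqueWordCounts, function ($frequency) use ($low, $high) {
    return $frequency >= $low && $frequency <= $high;
});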

@ refining 'Steps'

In regards to performing these many steps, I would go with a slightly 'enhanced' solution, stitching some of your steps together.

I'm not sure whether a full lexer is better, though, if you design it perfectly to fit your needs, e.g. look only for text within hX etc. But you would have to mean serious business, since it can be a headache to implement. That said, I will make my point and say that a Flex / Bison solution in another language (PHP offers poor support, as it is such a high-level language) would be an 'insane' speed boost.

However, luckily libxml provides magnificent features, and as the following should show, you will end up having multiple steps in one. Before the point where you analyse the contents, set up the language (stopwords), minify the NodeList set and work from there.

  1. load full page in
  2. detect language
  3. extract only <body> into a separate field
  4. release a bit of memory from <head> and the like, e.g. unset($fullpage);
  5. fire your algorithm (if pcntl - Linux host - is available, forking and releasing the browser is a nice feature)

When using DOM parsers, be aware that settings may introduce further validation for attributes such as href and src, depending on the library (parse_url and the like).

Another way of getting around the timeout / memory consumption issues is to call php-cli (also works on a Windows host), 'get on with business' and start the next document. See this question for more info.

If you scroll down a bit, look at the proposed schema - the initial crawl would put only the body in the database (plus lang in your case), and a cron script would then fill in the ft_index columns using the following function:

    function analyse() {
        $doc = new DOMDocument();
        ob_start(); // don't care about warnings, clean ob contents after parse
        $doc->loadHTML("<html><head><meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\"/></head><body><pre>" . $this->html_entity_decode("UTF-8") . "</pre></body>");
        ob_end_clean();
        $weighted_ft = array('0' => "", '5' => "", '15' => "");

        // relevance weight 0
        $includes = $doc->getElementsByTagName('h1');
        foreach ($includes as $h) {
            $text = $h->textContent;
            // check/filter stopwords and uniqueness
            // do so with other weights as well, basically narrow it down before counting
            $weighted_ft['0'] .= " " . $text;
        }
        // relevance weight 5
        $includes = $doc->getElementsByTagName('h2');
        foreach ($includes as $h) {
            $weighted_ft['5'] .= " " . $h->textContent;
        }
        // relevance weight 15
        $includes = $doc->getElementsByTagName('p');
        foreach ($includes as $p) {
            $weighted_ft['15'] .= " " . $p->textContent;
        }
        // pseudo: start counting frequencies and stuff
        // foreach weighted_ft sz do
        //   foreach word in sz do
        //      frequency / prominence
    }

    function html_entity_decode($toEncoding) {
        $encoding = mb_detect_encoding($this->body, "ASCII,JIS,UTF-8,ISO-8859-1,ISO-8859-15,EUC-JP,SJIS");
        $body = mb_convert_encoding($this->body, $toEncoding, ($encoding != "" ? $encoding : "auto"));
        return html_entity_decode($body, ENT_QUOTES, $toEncoding);
    }

The above is part of a class resembling your database record, which has the page 'body' field loaded beforehand.

Again, as far as database handling goes, I ended up inserting the parsed result above into a full-text flagged table column so that future lookups would go seamlessly. This is a huge advantage for DB engines.

Note on full-text indexing:

When dealing with a small number of documents it is possible for the full-text search engine to directly scan the contents of the documents with each query, a strategy called serial scanning. This is what some rudimentary tools, such as grep, do when searching.

Your indexing algorithm filters out some words, OK. But those words are ranked by how much weight they carry, and there is a strategy to think out here, since a full-text string does not carry the weights with it. That is why the example uses a basic strategy of splitting the text into three different strings.

Once put into the database, the columns should resemble this, so a schema could look like the one below, where we maintain the weights and still offer a super-fast query method:

CREATE TABLE IF NOT EXISTS `oo_pages` (
  `id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
  `body` mediumtext COLLATE utf8_danish_ci NOT NULL COMMENT 'PageBody entity encoded html',
  `title` varchar(31) COLLATE utf8_danish_ci NOT NULL,
  `ft_index5` mediumtext COLLATE utf8_danish_ci NOT NULL COMMENT 'Regenerated cron-wise, weighted highest',
  `ft_index10` mediumtext COLLATE utf8_danish_ci NOT NULL COMMENT 'Regenerated cron-wise, weighted medium',
  `ft_index15` mediumtext COLLATE utf8_danish_ci NOT NULL COMMENT 'Regenerated cron-wise, weighted lesser',
  `ft_lastmodified` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' COMMENT 'last cron run',
  PRIMARY KEY (`id`),
  FULLTEXT KEY `ft_index5` (`ft_index5`),
  FULLTEXT KEY `ft_index10` (`ft_index10`),
  FULLTEXT KEY `ft_index15` (`ft_index15`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_danish_ci;

One may add an index like so:

ALTER TABLE `oo_pages` ADD FULLTEXT (
`named_column`
)
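To illustrate how the weights could be recombined at query time (a sketch; the multipliers and connection details are assumptions, not part of the schema above):

$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'pass');

$sql = "SELECT id, title,
               (MATCH (ft_index5)  AGAINST (?) * 5)
             + (MATCH (ft_index10) AGAINST (?) * 3)
             + (MATCH (ft_index15) AGAINST (?))     AS score
        FROM oo_pages
        WHERE MATCH (ft_index5)  AGAINST (?)
           OR MATCH (ft_index10) AGAINST (?)
           OR MATCH (ft_index15) AGAINST (?)
        ORDER BY score DESC
        LIMIT 20";

$stmt = $pdo->prepare($sql);
$stmt->execute(array_fill(0, 6, 'search terms here'));

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    echo $row['title'] . ' => ' . $row['score'] . PHP_EOL;
}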

The idea of detecting the language and then selecting your stopword database accordingly is a feature I left out myself, but it's nifty - and by the book! Kudos for your efforts and this answer :)

Also, keep in mind there's not only the title tag, but also anchor / img title attributes. If for some reason your analytics goes into a spider-like state, I would suggest combining the reference link (<a>) title and textContent with the target page's <title>.

I'd recommend that instead of re-inventing the wheel, you use Apache Solr for search and analysis. It has almost everything you might need, including stop-word detection for 30+ languages [as far as I can remember, might be even more], and can do tons of stuff with the data stored in it.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow