Number of occurrences of every word on a page

https://stackoverflow.com/questions/22913934

29-06-2023

Question

I am attempting to count the number of occurrences of every unique word on a page (think SEO 'word count' that you see on woorank etc. - but not for that purpose!)

I am really struggling with how to set this up:

At the moment I am thinking of reading each word and checking whether it is already in an array: if it is new, add it to the array with occurrences => 1; if I find the same word again later, just add 1.

However, this seems really cumbersome and slow for large blocks of text (especially as I will have to strip commas, convert everything to lower case, etc.). Is there a better way? Has someone got a code snippet or library for this task?
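For reference, the manual approach described above can be sketched roughly as follows. This is only an illustration: the function name countWordsManually and the splitting pattern are assumptions, not part of the original question. PHP arrays are hash maps, so the "have I seen this word?" check is a constant-time key lookup rather than a scan of the whole array.

```php
<?php
// Hypothetical helper: counts words by hand, as the question describes.
function countWordsManually(string $text): array
{
    // Lowercase the text, then split on any run of characters that is
    // not a letter or an apostrophe; PREG_SPLIT_NO_EMPTY drops empty pieces.
    $words = preg_split("/[^a-z']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $counts = [];
    foreach ($words as $word) {
        if (isset($counts[$word])) {
            $counts[$word]++;    // word seen before: add 1
        } else {
            $counts[$word] = 1;  // first occurrence
        }
    }
    return $counts;
}

$text = 'The Cat ran away with the hat. The spoon had already run away with another cat, far far away.';
$counts = countWordsManually($text);
arsort($counts); // highest counts first, keys preserved
print_r($counts);
```

Note that this pattern only recognises the ASCII letters a-z, so accented or non-Latin words would need a different pattern.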

For clarity

The Cat ran away with the hat. The spoon had already run away with another cat, far far away.

Would yield:

the => 3, away => 3, cat => 2, with => 2, far => 2, spoon => 1, hat => 1, ran => 1, run => 1, had => 1, another => 1, already => 1

Thanks in advance - if there is no better way then that is fine!

ASIDE

I contemplated doing a replace($word, "") on all words once found and counted, but this seems just as cumbersome.

Solution

Use array_count_values() in conjunction with str_word_count():

$wordCounts = array_count_values(str_word_count(strtolower($sentence), 1));
arsort($wordCounts);

Output:

Array
(
    [the] => 3
    [away] => 3
    [cat] => 2
    [far] => 2
    [with] => 2
    [run] => 1
    [another] => 1
    [already] => 1
    [hat] => 1
    [ran] => 1
    [spoon] => 1
    [had] => 1
)
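As a possible way to reuse the one-liner above, it could be wrapped in a small function. The names wordFrequencies and $extraChars are hypothetical. The sketch relies on the optional third argument of str_word_count(), which lists additional characters to treat as part of a word; by default digits are not word characters, so numbers are dropped unless you pass them in.

```php
<?php
// Hypothetical wrapper around array_count_values() + str_word_count().
// $extraChars is forwarded to str_word_count() as extra "word" characters.
function wordFrequencies(string $text, string $extraChars = ''): array
{
    // Format 1 makes str_word_count() return the list of words found.
    $words = str_word_count(strtolower($text), 1, $extraChars);
    $counts = array_count_values($words);
    arsort($counts); // sort by count, descending, keeping the word keys
    return $counts;
}

$sentence = 'The Cat ran away with the hat. The spoon had already run away with another cat, far far away.';
$wordCounts = wordFrequencies($sentence);

// Show only the three most frequent words.
print_r(array_slice($wordCounts, 0, 3, true));

// Treat digits as word characters so that numbers are counted too.
$withDigits = wordFrequencies('top 10 of 10', '0123456789');
print_r($withDigits);
```

Since str_word_count() is locale dependent and byte oriented, this sketch is best suited to plain ASCII text; multibyte text would likely need a preg_split()-based approach instead.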


Other tips

Split the text into words (you could use a tokenizer like the ones used in Solr to "clean" them), put them in an array, sort it, and count the unique values. The details depend on the language, but using the language's native functions will generally be faster than iterating over the text yourself.

In php:

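// PREG_SPLIT_NO_EMPTY avoids counting an empty string when the text ends with punctuation
$array = preg_split('/[\s,.]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);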
$unique = array_count_values($array);
print_r($unique);

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow