str_word_count does not properly handle non-latin characters

https://stackoverflow.com/questions/22751870

24-06-2023

Question

I'm using PHP 5.3 and I want to count the words in some text for validation. My problem is that the JavaScript function I use to validate the text returns a different word count than the PHP code does.

Here is the PHP code:

//trim it
$text = strip_tags(html_entity_decode($text,ENT_QUOTES));
// replace numbers with X
$text = preg_replace('/\d/', 'X', $text);
// remove ./,/-/&
$text = str_replace(array('.',',','-','&'), '', $text);
// number of words
$count = str_word_count($text);

I noticed that PHP 5.5 gives me the right word count, but PHP 5.3 does not. I searched for the cause and found this link (http://grokbase.com/t/php/php-bugs/12c14e0y6q/php-bug-bug-63663-new-str-word-count-does-not-properly-handle-non-latin-characters), which describes a bug in how PHP 5.3 handles non-Latin characters. I tried to work around it with this code:

// remove every byte outside the ASCII range \x20-\x7F
$text = preg_replace('/[^(\x20-\x7F)]*/','', $text);

But I still didn't get the right result. The word count was usually very close to the expected value and sometimes exact, but it was often off.
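
One possible reason the filter still miscounts, offered as a minimal sketch and not as a confirmed diagnosis: the character class keeps only bytes in \x20-\x7F, so it also deletes line breaks and tabs (which glues neighbouring words together) and deletes every byte of a multibyte UTF-8 letter. The helper name strip_outside_printable_ascii and the sample strings are made up for illustration; the pattern itself is the one from the code above.

```php
<?php
// Hypothetical wrapper around the filter from the question (pattern unchanged).
function strip_outside_printable_ascii($text)
{
    return preg_replace('/[^(\x20-\x7F)]*/', '', $text);
}

// "\n" is byte \x0A, which is below \x20, so it is deleted and
// "line" and "second" are merged into a single word.
$merged = strip_outside_printable_ascii("first line\nsecond line");

// "\xD0\xBC\xD0\xB8\xD1\x80" is the UTF-8 encoding of a three-letter Cyrillic word.
// All of its bytes are above \x7F, so both words disappear and only the space remains.
$emptied = strip_outside_printable_ascii("\xD0\xBC\xD0\xB8\xD1\x80 \xD0\xBC\xD0\xB8\xD1\x80");

var_dump($merged);                  // string "first linesecond line"
var_dump(str_word_count($merged));  // 3 words, although the input had 4
var_dump($emptied);                 // string " "
```

If this is what happens with the real input, the count would be close for mostly ASCII single-line text and drift further off as line breaks or non-Latin words appear.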

I decided to write another PHP routine to work around the bug. Here is the PHP code:

//trim it
$text = strip_tags(html_entity_decode($text,ENT_QUOTES));
// replace multiple (one or more) line breaks with a single space
$text = preg_replace("/[\n]+/", " ", $text);
// replace multiple (one or more) spaces with a separator string (@SEPARATOR@)
$text = preg_replace("/[\s]+/", "@SEPARATOR@", $text);
// explode the separator string (@SEPARATOR@) and get the array
$text_array = explode('@SEPARATOR@', $text);
// get the numbers of the array/words
$count = count($text_array);
// check if the last key of the array is empty and decrease the count by one 
$last_key = end($text_array);
if (empty($last_key)) {
    $count--;
}

This last version works fine for me, and I would like to ask two questions:

What could I do about the str_word_count function in the first approach?
Is my second solution accurate, or could I do something to improve it?

Solution

Assuming you are asking how to keep using str_word_count: you could try preg_replace('/[^a-zA-Z0-9\s]/','',$string) after you have replaced the punctuation. I did not have a test string that is known to fail, so I could not try this myself, but it is something you can test on your side.
One improvement would be to actually trim the text. The first comment mentions trimming, but that first line only removes HTML tags. Add a trim($string), and then you can remove the last part:

Change the first two lines to:

//trim it & remove tags
$text = trim(strip_tags(html_entity_decode($text,ENT_QUOTES)));

Remove:

// check if the last key of the array is empty and decrease the count by one 
$last_key = end($text_array);
if (empty($last_key)) {
    $count--;
}
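
Put together, the revised version might look like the sketch below. It is one reading of the suggestion, not the answerer's exact code. The function name count_words_by_whitespace is hypothetical. The 'UTF-8' argument is an assumption about the input encoding, passed explicitly because the default charset of html_entity_decode changed from ISO-8859-1 to UTF-8 in PHP 5.4. preg_split with PREG_SPLIT_NO_EMPTY is swapped in for the @SEPARATOR@ and explode steps.

```php
<?php
// Hypothetical helper; a sketch of the trimmed whitespace-split approach.
function count_words_by_whitespace($text)
{
    // Decode entities, strip HTML tags, then trim leading and trailing whitespace.
    // 'UTF-8' is an assumption about the input encoding.
    $text = trim(strip_tags(html_entity_decode($text, ENT_QUOTES, 'UTF-8')));

    // Split on runs of whitespace. PREG_SPLIT_NO_EMPTY drops empty pieces,
    // so an empty or whitespace-only input yields an empty array.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    return count($words);
}

var_dump(count_words_by_whitespace("  one two\n\nthree  "));  // int(3)
var_dump(count_words_by_whitespace("<b>one</b> two"));        // int(2)
var_dump(count_words_by_whitespace(""));                      // int(0)
```

The empty-input case is worth checking either way: with trim added and the final check removed, explode on an empty string still returns an array holding one empty string, so the original explode-based code would report one word for empty text.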

OTHER TIPS

Have you considered using a regex split to count the words, with your own definition of what a word is? I would suggest /[^\s]+/ as a 'word', which means splitting on runs of whitespace (/\s+/) and counting the non-empty entries of the resulting array.

PHP: Let $input = 'your input here' then count(preg_split('/\s+/', $input, -1, PREG_SPLIT_NO_EMPTY))

JS: Let var input = 'your input here' then input.split(/\s+/).filter(Boolean).length

You can also use regex character ranges to capture the set of characters you want to allow as valid word contents. More on regex here: http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt
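
As a rough illustration of the character-range idea, the sketch below counts runs of letters, digits and apostrophes using Unicode property classes. It assumes the text is valid UTF-8 and that PCRE was built with Unicode property support. The function name count_words_unicode and the choice to allow an apostrophe inside a word are illustrative assumptions, not part of the original tip.

```php
<?php
// Hypothetical helper: counts "words" defined as runs of letters, digits or apostrophes.
function count_words_unicode($text)
{
    // \p{L} matches any letter and \p{N} any number, in any script.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $matched = preg_match_all('/[\p{L}\p{N}\']+/u', $text, $matches);

    // preg_match_all returns false on error, for example on invalid UTF-8 input.
    return $matched === false ? 0 : $matched;
}

var_dump(count_words_unicode("caf\xC3\xA9 au lait"));  // int(3), "\xC3\xA9" is an accented e
var_dump(count_words_unicode("it's 42, ok?"));         // int(3)
```

Whatever definition is chosen, the JavaScript side needs the same one for the two counts to agree.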

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow