Question

I need to count words in a string using PHP or Javascript (preferably PHP). The problem is that the counting needs to be the same as it works in Microsoft Word, because that is where the people assemble their original texts in so that is their reference frame. PHP has a word counting function (http://php.net/manual/en/function.str-word-count.php) but that is not 100% the same as far as I know.

Any pointers?

Solution

The real problem here is that you're trying to develop a solution without really understanding the exact requirements. This isn't a coding problem so much as a problem with the specs.

The crux of the issue is that your word-counting algorithm is different to Word's word-counting algorithm - potentially for good reason, since there are various edge-cases to consider with no obvious answers. Thus your question should really be "What algorithm does Word use to calculate word count?" And if you think about this for a bit, you already know the answer - it's closed-source, proprietary software so no-one can know for sure. And even if you do work it out, this isn't a public interface so it can easily be changed in the next version.

Basically, I think it's fundamentally a bad idea to design your software so that it functions identically to something that you cannot fully understand. Personally, I would concentrate on just developing a sane word-count of your own, documenting the algorithm behind it and justifying why it's a reasonable method of counting words (pointing out that there is no One True Way).
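As a minimal sketch of what a documented, home-grown count could look like, assume JavaScript and one explicit rule: a word is a maximal run of non-whitespace characters that contains at least one letter or digit. The function name `countWordsByRule` and the rule itself are illustrative assumptions, not Word's algorithm.

```javascript
// Rule (an assumption for illustration, not Word's algorithm):
// a word is a maximal run of non-whitespace characters that
// contains at least one Unicode letter or digit.
function countWordsByRule(text) {
  // Runs of non-whitespace characters; match() returns null when there are none.
  const tokens = text.match(/\S+/g) || [];
  // Keep only tokens with a letter or digit, so a lone "-" or "..." is not a word.
  return tokens.filter((token) => /[\p{L}\p{N}]/u.test(token)).length;
}

console.log(countWordsByRule("Hello, world - it's 100% fine.")); // 5
```

Writing the rule down next to the code is the point: anyone who disputes a count can check it against the stated rule rather than against a black box.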

If you must conform to Word's attempt for some short-sighted business reason, then the number one task is to work out what methodology they use to the point where you can write down an algorithm on paper. But this won't be easy, will be very hard to verify completely and is liable to change without notice... :-/

OTHER TIPS

This is a bit of a minefield, as MS Word's counts are considered wrong and unreliable by professionals who depend on word counts: journalists, translators, and lawyers, who are often involved in legal procedures where motions and submissions must be under a specific number of words.

Having said that, this article (http://dotnetperls.com/word-count) describes a pretty good regex algorithm implemented in C#, which should be fairly easy to translate into PHP.

I think the article's small inaccuracies come from two factors. First, MS Word misses words not contained in "regular paragraphs", so words in footnotes, text boxes and tables may or may not be counted. Second, I think the EVIL smart quotes feature messing with hyphens may affect the results, so it may be worth changing all the en dash and em dash characters back to the normal hyphen-minus.
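A rough sketch of that normalisation step in JavaScript, assuming the only characters of interest are the en dash (U+2013), the em dash (U+2014) and the curly quotes that Word's autoformat inserts. The function name `normalizePunctuation` is hypothetical, and whether this step brings a count closer to Word's is untested.

```javascript
// Map Word-style "smart" punctuation back to plain ASCII before counting.
// The character list is an assumption; extend it if your texts contain others.
function normalizePunctuation(text) {
  return text
    .replace(/[\u2013\u2014]/g, "-")   // en dash, em dash -> hyphen-minus
    .replace(/[\u2018\u2019]/g, "'")   // curly single quotes -> apostrophe
    .replace(/[\u201C\u201D]/g, '"');  // curly double quotes -> straight quote
}

console.log(normalizePunctuation("\u201Cwell\u2014known\u201D")); // "well-known"
```

Run this before whichever counting rule you choose, so that the same text gives the same count regardless of whether autoformat touched it.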

The following JS code gives a word count of 67. OpenOffice gives the same number.

str = "I need to count words in a string using PHP or Javascript (preferably PHP). The problem is that the counting needs to be the same as it works in Microsoft Word, because that is where the people assemble their original texts in so that is their reference frame. PHP has a word counting function (http://php.net/manual/en/function.str-word-count.php) but that is not 100% the same as far as I know.";

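var wordCount = str.trim().split(/\s+/).length;

function countWords( $text )
{
    // Drop everything except letters, digits and whitespace (Unicode-aware).
    $text = preg_replace('![^\pL\pN\s]+!u', '', $text);

    // Split on runs of whitespace; PREG_SPLIT_NO_EMPTY makes an empty string count as 0 words.
    $words = preg_split('!\s+!u', $text, -1, PREG_SPLIT_NO_EMPTY);

    return count( $words );
}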
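To see how the two approaches above can disagree, here is a hedged JavaScript sketch that applies both rules to the same input. `countByWhitespace` mirrors the split on whitespace, and `countAfterStripping` is an approximate port of the PHP function (it assumes a JavaScript engine with Unicode property escapes, i.e. ES2018 or later). Both names are made up for this example.

```javascript
// Rule 1: every whitespace-separated token is a word.
function countByWhitespace(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Rule 2: strip everything except letters, digits and whitespace, then split.
function countAfterStripping(text) {
  const stripped = text.replace(/[^\p{L}\p{N}\s]+/gu, "");
  return countByWhitespace(stripped);
}

const sample = "Word counts - well, they differ.";
console.log(countByWhitespace(sample));   // 6
console.log(countAfterStripping(sample)); // 5
```

The free-standing hyphen is a word under the first rule and vanishes under the second, which is exactly the kind of edge case where Word may make yet another choice.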

You can use this code for a live count (note that it counts characters, not words):

<head>
<title>Untitled Document</title>
<script type="text/javascript" src="mootools.svn.js"></script>
<script type="text/javascript">
    window.addEvent('domready', function()
    {   
        $('myInput').addEvent('keyup', function() 
        {
            var max_chars       = 20; // matches the maxlength attribute of the input
            var current_value   = $('myInput').value;
            var current_length  = current_value.length;
            var remaining_chars = max_chars - current_length;
            $('counter_number').innerHTML = remaining_chars;
            if(remaining_chars<=5)
            {
                $('counter_number').setStyle('color', '#990000');
            } else {
                $('counter_number').setStyle('color', '#666666');       
            }   
        }); 
    }); 
</script>

<style type="text/css"> 
    body{
        font-family:"Lucida Grande", "Lucida Sans Unicode", Verdana, Arial, Helvetica, sans-serif; 
        font-size:12px;
        color:#000000; 
    }
    a:link, a:visited{color:#0066CC;}
    label{display:block;}
    .counter{
        font-family:Georgia, "Times New Roman", Times, serif;
        font-size:16px; 
        font-weight:bold;
        color:#666666
    } 
</style> 
</head>
<body> 
    <label for="myInput">Write something here:</label> 
    <input type="text" id="myInput" maxlength="20" />  
    <span id="counter_number" class="counter">20</span> 
    Remaining chars
</body>

You also need to download the MooTools library, which the page loads as mootools.svn.js.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow