Regular Expression for accurate word-count using JavaScript

https://stackoverflow.com/questions/4593565

15-10-2019

Question

I'm trying to put together a regular expression for a JavaScript command that accurately counts the number of words in a textarea.

One solution I had found is as follows:

document.querySelector("#wordcount").innerHTML = document.querySelector("#editor").value.split(/\b\w+\b/).length -1;

But this doesn't count any non-Latin characters (e.g. Cyrillic, Hangul, etc.); it skips over them completely.
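To make the problem concrete, here is a minimal sketch of that first attempt as a plain function. The name `countAsciiWords` is made up for illustration and the DOM lookups are left out; the regular expression is the one from the question.

```javascript
// Hypothetical helper wrapping the first attempt from the question.
function countAsciiWords(text) {
  // Splitting on every \b\w+\b match leaves one more piece than there are matches,
  // hence the "- 1".
  return text.split(/\b\w+\b/).length - 1;
}

console.log(countAsciiWords("hello world")); // 2
// \w only covers A-Z, a-z, 0-9 and _, so nothing matches here:
console.log(countAsciiWords("Привет мир"));  // 0
```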

Another one I put together:

document.querySelector("#wordcount").innerHTML = document.querySelector("#editor").value.split(/\s+/g).length -1;

But this only counts accurately when the document ends in a whitespace character. If a space is appended to the value before counting, an empty document is reported as 1 word. Furthermore, if the document begins with a space, one extra word is counted.
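The cases described above can be reproduced with a small sketch. The function name `countBySplit` is hypothetical; the expression inside it is the second attempt from the question, minus the DOM lookups.

```javascript
// Hypothetical helper wrapping the second attempt from the question.
function countBySplit(text) {
  return text.split(/\s+/g).length - 1;
}

console.log(countBySplit("one two "));  // 2 (correct only because of the trailing space)
console.log(countBySplit("one two"));   // 1 (one word short)
console.log(countBySplit(" "));         // 1 (an "empty" document counted as one word)
console.log(countBySplit(" one two ")); // 3 (leading space adds an extra word)
```

The miscounts come from the empty strings that split() returns when the text starts or ends with a separator.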

Is there a regular expression I can put into this command that counts the words accurately, regardless of input method?

Solution

This should do what you're after:

value.match(/\S+/g).length;

Rather than splitting the string, you're matching on any sequence of non-whitespace characters.

There's the added bonus of being easily able to extract each word if needed ;)
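One thing to watch for: match() with the g flag returns null rather than an empty array when nothing matches, so an empty textarea would throw on .length. A minimal sketch of a guarded version follows; the names `listWords` and `countWords` are made up for illustration.

```javascript
// Hypothetical helpers; only the /\S+/g pattern comes from the answer above.
function listWords(text) {
  // Fall back to an empty array because match() returns null when nothing matches.
  return text.match(/\S+/g) || [];
}

function countWords(text) {
  return listWords(text).length;
}

console.log(countWords(""));                // 0
console.log(countWords("  Привет   мир ")); // 2
console.log(listWords("one  two"));         // the two words "one" and "two"
```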

OTHER TIPS

Try counting runs of non-whitespace characters that are delimited by word boundaries:

value.split(/\b\S+\b/g).length

You could also try to use Unicode ranges, but I am not sure if the following one is complete:

value.split(/[\u0080-\uFFFF\w]+/g).length
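An alternative to hand-picking a \u range, not part of the answer above, is to use Unicode property escapes, which need the u flag and an ES2018-capable engine. The sketch below assumes a "word" is a run of letters, combining marks, digits, or underscores; `UNICODE_WORD` and `countUnicodeWords` are hypothetical names.

```javascript
// Assumption: a word is a run of letters (\p{L}), combining marks (\p{M}),
// digits (\p{N}) or underscores. Requires the u flag (ES2018+).
const UNICODE_WORD = /[\p{L}\p{M}\p{N}_]+/gu;

function countUnicodeWords(text) {
  // match() returns null when there is no match, so fall back to [].
  return (text.match(UNICODE_WORD) || []).length;
}

console.log(countUnicodeWords("Привет, мир!")); // 2
console.log(countUnicodeWords("한국어 단어"));    // 2
console.log(countUnicodeWords(""));             // 0
```

Note that punctuation inside a word still splits it under this definition, so "don't" would count as two words.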

For me this gave the best results:

value.split(/\b\W+\b/).length

with

var words = value.split(/\b\W+\b/)

you get all words.

Explanation:

\b is a word boundary
\W is a NON-word character, capital usually means the negation
'+' means 1 or more characters of the preceding character class
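As a rough illustration of how those three pieces combine, here is a sketch that runs the pattern on a sample sentence. The function name `splitOnNonWordRuns` is made up; the pattern is the one from this answer.

```javascript
// Hypothetical wrapper around the /\b\W+\b/ split from the answer above.
function splitOnNonWordRuns(text) {
  return text.split(/\b\W+\b/);
}

console.log(splitOnNonWordRuns("Hello, world! How are you?"));
// Five pieces: "Hello", "world", "How", "are", "you?"
// The final "?" stays attached because there is no word boundary at the end of the string.

// \b depends on \w, which is ASCII-only, so text without ASCII word characters is never split:
console.log(splitOnNonWordRuns("Привет мир").length); // 1
```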

I recommend learning regular expressions. It's a great skill to have because they are so powerful. ;-)

The correct regexp would be /\s+/ in order to discard non-words:

'Lorem ipsum dolor , sit amet'.split(/\S+/g).length
7
'Lorem ipsum dolor , sit amet'.split(/\s+/g).length
6

You could extend or change your methods like this, if you want to match things like email addresses as well:

document.querySelector("#wordcount").innerHTML = document.querySelector("#editor").value.split(/\b\(.*?)\b/).length -1;

and

document.querySelector("#wordcount").innerHTML = document.querySelector("#editor").value.trim().split(/\s+/).filter(Boolean).length;

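Also try splitting on \s rather than matching \w: \s treats Unicode whitespace as a separator, whereas \w only matches ASCII letters, digits, and the underscore.

Source: http://www.regular-expressions.info/charclass.html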

Try

    value.match(/\w+/g).length;

This will match a string of characters that can be in a word. Whereas something like:

    value.match(/\S+/g).length;

will result in an incorrect count if the user adds commas or other punctuation that is not followed by a space - or adds a comma with a space either side of it.
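To see where the two patterns disagree, here is a small comparison sketch. The helper names are hypothetical, and both fall back to an empty array because match() returns null when nothing matches.

```javascript
// Hypothetical helpers comparing the two patterns discussed above.
const countByWordChars = (text) => (text.match(/\w+/g) || []).length;
const countByNonSpace = (text) => (text.match(/\S+/g) || []).length;

// Punctuation without a following space: \S+ sees a single token.
console.log(countByWordChars("one,two"), countByNonSpace("one,two"));       // 2 1
// A comma with a space on either side: \S+ counts the comma as a word.
console.log(countByWordChars("one , two"), countByNonSpace("one , two"));   // 2 3
// The trade-off: \w+ splits on apostrophes and ignores non-Latin text.
console.log(countByWordChars("don't stop"), countByNonSpace("don't stop")); // 3 2
console.log(countByWordChars("Привет мир"), countByNonSpace("Привет мир")); // 0 2
```

Neither pattern is right for every input, so which one fits depends on the kind of text you expect.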

My simple JavaScript library, FuncJS, has a function called "count()" which does exactly what its name says: it counts words.

For example, if you have a string full of words, you can pass it to the function like this:

count("How many words are in this string?");

The call returns the number of words. The function is also designed to ignore any amount of whitespace, so the result stays accurate.

To learn more about this function, please read the documentation at http://docs.funcjs.webege.com/count().html and the download link for FuncJS is also on the page.

Hope this helps anyone wanting to do this! :)

Licensed under: CC-BY-SA with attribution
