Getting all words and punctuation from English text

https://stackoverflow.com/questions/20918543

24-09-2022

Question

What I want to do:

A user uploads a text. I analyse it and extract all the words and punctuation from it. I can then easily render the text for other users, with a quick translation of each word or additional information about the analysed words.

I'm currently trying the treat gem (NLP for Ruby), but I've run into many problems with it.

For example, in the sentence

"The world ain't all sunshine and rainbows."

it splits ain't into two tokens, "ai" and "n't".

Can anybody suggest a library or gem, perhaps one I can use with JRuby, that simply separates text into words and punctuation without these problems?

Or maybe my approach is wrong and there are other ways to do this?

Solution

Why not start with a simple scan, using a regular expression to pull all the words out of the text? See String#scan: http://ruby-doc.org/core-2.1.0/String.html#method-i-scan

For English, the pattern can be built from \w (word characters) plus a few special characters that occur inside words, such as the apostrophe ' you mention.
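As a rough sketch of that idea (the constant name TOKEN_PATTERN and the helper tokenize are made up for illustration, and the pattern is only one possible choice), a single String#scan call can return words, including apostrophe contractions, and punctuation marks as separate tokens:

```ruby
# Hypothetical pattern, not taken from any library:
#   \w+(?:'\w+)*  a run of word characters, optionally followed by
#                 apostrophe + word characters (ain't, rock'n'roll)
#   [^\w\s]       any single character that is neither a word character
#                 nor whitespace, i.e. a punctuation mark
TOKEN_PATTERN = /\w+(?:'\w+)*|[^\w\s]/

# Hypothetical helper: returns the tokens in the order they appear.
# Whitespace is skipped because neither alternative matches it.
def tokenize(text)
  text.scan(TOKEN_PATTERN)
end

tokens = tokenize("The world ain't all sunshine and rainbows.")
# => ["The", "world", "ain't", "all", "sunshine", "and", "rainbows", "."]
```

This only handles the straight apostrophe ' and treats every other punctuation character as its own token, so typographic quotes, hyphenated words, and abbreviations such as "e.g." would need extra alternatives in the pattern.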

OTHER TIPS

Have you tried the open-nlp gem from the same author?

An example there suggests it does what you want:

require 'open-nlp'  # assumed require path for the open-nlp gem
OpenNLP.load

text      = "The death of the poet was kept from his poems."
tokenizer = OpenNLP::SimpleTokenizer.new
tokens    = tokenizer.tokenize(text).to_a
# => %w[The death of the poet was kept from his poems .]

Unfortunately, I don't have JRuby on my machine right now, so I couldn't confirm that it works as expected for words with apostrophes.

Licensed under: CC-BY-SA with attribution