Question

Can someone point me in the right direction or suggest some resources? Here is the task:

  1. A user leaves feedback text of at least 50 characters.
  2. I need to check whether it is a normal human sentence or word combination, or just a bag of random words and characters.

For example (1 = normal, 0 = not normal):

"I wrote question.hope for answer" - 1(class)

"Bla bla goog goog goog gooo" - 0(class)

Is there a dataset available, or a suitable approach? Thanks in advance!


Solution

What you need is simply a language model, i.e. a model that assigns a probability to a sequence of words or characters. This is a very common task, so you should be able to find code and data easily. This question gives some pointers for Python (be careful, the accepted answer is incorrect according to the two other answers).

Applying the language model to a sentence gives you a probability (or a perplexity score, which works the opposite way: lower perplexity means the text looks more like the training language). You then have to define a threshold in order to classify the text as real language or not.
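To make the idea concrete, here is a minimal sketch of the perplexity-plus-threshold approach, using a character bigram model with add-one smoothing. This is not the method from the linked question; it only illustrates the principle. The names `TOY_CORPUS`, `train_char_bigram`, `perplexity` and `is_natural` are made up for this example, and the four training sentences are a toy stand-in for the large corpus (or pretrained model) you would use in practice.

```python
import math
from collections import Counter

# Hypothetical toy corpus; a real model needs far more text in the target language.
TOY_CORPUS = [
    "the service was quick and the staff were friendly",
    "i had a problem with my order but it was fixed",
    "thank you for the fast delivery",
    "the product works well and i like it",
]

def train_char_bigram(corpus):
    """Count character bigrams and how often each character occurs as left context."""
    bigrams = Counter()
    contexts = Counter()
    vocab = set()
    for line in corpus:
        text = "^" + line.lower() + "$"  # "^" and "$" mark start and end of text
        vocab.update(text)
        for a, b in zip(text, text[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts, vocab

def perplexity(text, model):
    """Per-character perplexity of `text` under the add-one-smoothed bigram model."""
    bigrams, contexts, vocab = model
    v = len(vocab) + 1  # one extra slot so unseen characters get non-zero probability
    text = "^" + text.lower() + "$"
    log_prob = 0.0
    n = 0
    for a, b in zip(text, text[1:]):
        p = (bigrams[(a, b)] + 1) / (contexts[a] + v)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

def is_natural(text, model, threshold):
    """Classify as real language (True) when perplexity is below the threshold."""
    return perplexity(text, model) < threshold

model = train_char_bigram(TOY_CORPUS)
for sample in ["thank you for the fast delivery", "zzzz gggg jjjj zzzz"]:
    print(sample, "->", round(perplexity(sample, model), 2))
```

The threshold itself is not something the model provides. One reasonable way to pick it, assuming you have a handful of feedback texts labeled 1 and 0, is to compute the perplexity of each and choose the cut-off that separates the two groups best. With a corpus this small the scores are only illustrative; a word-level or pretrained language model should separate real feedback from gibberish much more reliably.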

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange