Question

I have a kind of Q&A site (very approximately) where users enter questions to be answered by our staff. I am quite concerned about users posting non-questions, which are an annoyance. The best I have come up with so far is to detect whether the text is in Italian (our users' language) and, if it is, to check it against a list of common copypastas.

So, long story short: users will input some text, I have to make sure it's a proper question in Italian and not random characters.


Solution

I'm not sure which programming language you'll be using, so here are pointers for two common ones:

Java: http://www.easywayserver.com/blog/java-string-contains-example/

PHP: see the question "How do I check if a string contains a specific word in PHP?"

One way to go about it is to check whether the input string (the question) contains any forbidden word.

Pseudocode

ListOfForbiddenWords
if Language == Italian
    if Input does not contain any of ListOfForbiddenWords
        // It's fine
    else
        // Don't spam
else
    // Not Italian
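
The pseudocode above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the entries in `FORBIDDEN_PHRASES` are made-up placeholders, and `is_italian` is a hypothetical callback standing in for whatever language detector you end up using.

```python
import re
from typing import Callable, Iterable

# Hypothetical blocklist: replace with your own copypastas or forbidden phrases.
FORBIDDEN_PHRASES = ["compra subito", "clicca qui"]


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace so matching is less brittle."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def contains_forbidden(text: str, phrases: Iterable[str] = FORBIDDEN_PHRASES) -> bool:
    """Return True if any forbidden phrase occurs as a substring of the text."""
    cleaned = normalize(text)
    return any(normalize(phrase) in cleaned for phrase in phrases)


def classify(text: str, is_italian: Callable[[str], bool]) -> str:
    """Mirror the pseudocode: language check first, then the blocklist."""
    if not is_italian(text):
        return "not_italian"
    if contains_forbidden(text):
        return "spam"
    return "ok"
```

Plain substring matching will also hit phrases embedded in longer words, so a real blocklist may need word-boundary or fuzzy matching instead.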

I'm not sure what the best way is to check whether a string is written in a specific language.
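
One cheap heuristic, offered here as a rough sketch rather than a recommended solution, is to measure how many tokens are very common Italian function words. The word list and the `0.2` threshold below are assumptions picked for illustration; short inputs and closely related languages (Spanish, for example) can easily fool it.

```python
import re

# A small, hand-picked sample of common Italian function words (an assumption,
# not an exhaustive list). Words that are also frequent in English are left out.
ITALIAN_STOPWORDS = {
    "il", "lo", "la", "gli", "le", "un", "una", "di", "da", "con", "su",
    "per", "che", "non", "come", "è", "ma", "se", "mi", "si", "del",
    "della", "sono", "ho", "posso", "perché", "cosa", "quando",
}


def italian_stopword_ratio(text: str) -> float:
    """Fraction of alphabetic tokens that are common Italian function words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in ITALIAN_STOPWORDS)
    return hits / len(tokens)


def looks_italian(text: str, threshold: float = 0.2) -> bool:
    """Return True if enough tokens are Italian function words."""
    return italian_stopword_ratio(text) >= threshold
```

Random keyboard mashing contains no function words at all, so it falls below any positive threshold.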

OTHER TIPS

You can use Rosoka's language detection if you want a commercial option. You can try it out at Rosoka Cloud for about $1/hour with all of the features. The language ID is also available as a standalone library, so you can feed it example inputs that you are concerned about to see whether it gives back what you want.

Random text like "jgujqkwfjpihoujlkfa" will be flagged as ROMANIZATION or, if it is non-ASCII, with a tag based on the underlying code blocks that were used. In other words, input that is not a language will not be tagged as a language.

There are many free language detection libraries. One popular example is libexttextcat from LibreOffice. There are many clones and ports and variants if you don't want a C library; see e.g. http://odur.let.rug.nl/vannoord/TextCat/competitors.html for an (incomplete, slightly dated) list of pointers.
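
Libraries in the TextCat family are generally based on ranked character n-gram profiles. The Python sketch below illustrates that general idea only; it is not the API or the exact algorithm of libexttextcat, and the function names and the tiny `SAMPLES` texts are made up for the example. Real profiles are built from much larger corpora.

```python
from collections import Counter
from typing import Dict, List

# Toy training texts (assumptions for illustration only).
SAMPLES: Dict[str, str] = {
    "it": "come posso cambiare la password del mio account e che cosa devo "
          "fare se non ricevo il messaggio di conferma",
    "en": "how can i change the password of my account and what should i do "
          "if i do not receive the confirmation message",
}


def ngram_profile(text: str, n_max: int = 3, top_k: int = 300) -> List[str]:
    """Return the most frequent character n-grams (1..n_max), most frequent first."""
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts: Counter = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically, so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile: List[str], lang_profile: List[str]) -> int:
    """Sum of rank differences; n-grams missing from the language get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(doc_profile)
    )


def closest_language(text: str, samples: Dict[str, str]) -> str:
    """Return the key of the sample whose profile is nearest to the text's profile."""
    doc = ngram_profile(text)
    return min(samples, key=lambda lang: out_of_place(doc, ngram_profile(samples[lang])))
```

Note that `closest_language` always returns some language, even for gibberish, so rejecting random characters would additionally need a cutoff on the distance itself.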

A similar question was asked here a while ago, and the answers listed a number of language detection API solutions. One of the answers points to detectlanguage.com, which offers a limited free language detection service.

Licensed under: CC-BY-SA with attribution