Distinguish randomly generated texts from human-readable texts [closed]

https://datascience.stackexchange.com/questions/81824

14-12-2020

Question

I have short text strings of two types: '23jd2032n0d2mn', 'fn830n30rn83', 'fhui29n4ok', 'qn4foml', ... and 'sweetie23', 'king3prussia', 'maryjesus', 'lovedog4and_kitties', ...

Is there a way to distinguish one type from another?

I've tried vectorizing the texts with word2vec and classifying those vectors with XGBoost, but I did not manage to get a good F1 score.

Solution

You could train a character-level language model, e.g. an LSTM, on the real short texts, and use the perplexity as the signal to know whether a piece of text is real or not.
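To make the perplexity idea concrete, here is a minimal sketch that swaps the LSTM for a much simpler add-one-smoothed character bigram model. That is not the model suggested above, just the smallest thing that yields a perplexity. The boundary markers `^` and `$` and the four training strings (taken from the question) are assumptions for illustration; a real model would need far more data.

```python
import math
from collections import Counter

BOS, EOS = "^", "$"  # assumed boundary markers; must not occur in the data

def train_char_bigram(texts):
    """Count character bigrams and their left contexts over the real texts."""
    bigrams, contexts, vocab = Counter(), Counter(), {EOS}
    for text in texts:
        chars = [BOS] + list(text) + [EOS]
        vocab.update(text)
        for prev, cur in zip(chars, chars[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab

def perplexity(text, model):
    """Per-character perplexity of `text` under the add-one-smoothed bigram model."""
    bigrams, contexts, vocab = model
    chars = [BOS] + list(text) + [EOS]
    v = len(vocab) + 1  # +1 reserves probability mass for unseen characters
    log_prob = 0.0
    for prev, cur in zip(chars, chars[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(chars) - 1))

# Toy training set: the "real" examples from the question.
real_texts = ["sweetie23", "king3prussia", "maryjesus", "lovedog4and_kitties"]
model = train_char_bigram(real_texts)

for s in ["maryjesus", "qn4foml", "23jd2032n0d2mn"]:
    print(s, round(perplexity(s, model), 2))
```

Lower perplexity means the string looks more like the training texts. With only four training strings the numbers mostly reflect memorization, so treat this as a demonstration of the mechanics only.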

In order to find an appropriate perplexity threshold, you can have a look at the distribution of perplexities over a validation holdout dataset.
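One possible way to turn that distribution into a threshold is to take a high percentile of the perplexities of real held-out texts. The sketch below assumes you accept misclassifying roughly 5% of real texts; `pick_threshold`, `keep_fraction`, and the sample values are hypothetical names and numbers, not something from the answer.

```python
import statistics

def pick_threshold(real_val_perplexities, keep_fraction=0.95):
    """Perplexity below which roughly `keep_fraction` of real validation texts fall."""
    # quantiles(..., n=100) returns the 99 percentile cut points.
    cuts = statistics.quantiles(real_val_perplexities, n=100, method="inclusive")
    k = min(99, max(1, round(keep_fraction * 100)))
    return cuts[k - 1]

# Made-up perplexities, purely for illustration; use your model's scores
# on held-out real texts instead.
real_val_perplexities = [6.1, 7.4, 5.9, 8.2, 6.7, 9.5, 7.0, 6.3, 12.8, 7.7]
threshold = pick_threshold(real_val_perplexities)

def looks_real(ppl, threshold=threshold):
    """Classify a text as real when its perplexity does not exceed the threshold."""
    return ppl <= threshold

print(threshold, looks_real(6.0), looks_real(50.0))
```

If you also have labeled random strings in the validation set, you could instead scan candidate thresholds and keep the one with the best F1 score.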

UPDATE: There are multiple implementations of language models. For a "classical" n-gram option, you can use KenLM; if you have GPUs to train the model, you can use fairseq. Just remember to prepare your text as character-level tokens (normally you just need to put a space between every letter).
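The character-level preparation could look like the following sketch. The helper name `to_char_tokens` and the `▁` placeholder for literal spaces are assumptions; the placeholder only matters if your texts contain spaces.

```python
def to_char_tokens(text, space_symbol="▁"):
    """Put a space between every character so each character becomes one token."""
    # A literal space would vanish as a token separator, so swap in a placeholder.
    return " ".join(space_symbol if ch == " " else ch for ch in text)

print(to_char_tokens("king3prussia"))  # k i n g 3 p r u s s i a
```

Write one converted string per line to the training file and apply the same conversion to every string you score later.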

Other tips

Assuming that the "human readable" texts are more likely to contain actual words, you could count the number of dictionary words that occur in each.

You could use WordNet, for example.

The number or proportion of word hits, and their lengths, could serve as features for a model, or a simple cutoff rule might be enough.

You might want to restrict the word list to the most frequent words in the language you're working with.
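A rough sketch of the word-hit features, using substring matching against a word list. The tiny `WORDS` set here is a hypothetical stand-in; in practice you would load a real lexicon (WordNet lemmas, or a frequency-ranked word list). The `min_len` cutoff is also an assumption, meant to keep very short words from matching by accident.

```python
# Hypothetical mini word list, for illustration only.
WORDS = {"sweet", "king", "prussia", "mary", "jesus", "love", "dog", "and", "kitties"}

def word_hit_features(text, words=WORDS, min_len=3):
    """Count dictionary words found as substrings and how much of the text they cover."""
    lowered = text.lower()
    covered = [False] * len(lowered)
    hits = []
    for word in words:
        if len(word) < min_len:
            continue
        start = lowered.find(word)
        if start != -1:
            hits.append(word)
        # Mark every occurrence of the word as covered.
        while start != -1:
            covered[start:start + len(word)] = [True] * len(word)
            start = lowered.find(word, start + 1)
    return {
        "n_hits": len(hits),
        "longest_hit": max((len(w) for w in hits), default=0),
        "coverage": sum(covered) / len(lowered) if lowered else 0.0,
    }

print(word_hit_features("maryjesus"))
print(word_hit_features("qn4foml"))
```

A cutoff rule could then be as simple as "coverage above some fraction", with the fraction chosen on held-out data.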

Try building features such as vowel_count, consonant_count, digit_count, and vowel_density (vowel_count / total_length_of_words).
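These counts are cheap to compute. In this sketch the vowel set is assumed to be `aeiou` (so `y` counts as a consonant), and the density is taken over the full string length, digits and underscores included.

```python
VOWELS = set("aeiou")  # assumption: 'y' is treated as a consonant

def char_features(text):
    """Simple character-class counts for one string."""
    lowered = text.lower()
    vowel_count = sum(ch in VOWELS for ch in lowered)
    consonant_count = sum(ch.isalpha() and ch not in VOWELS for ch in lowered)
    digit_count = sum(ch.isdigit() for ch in lowered)
    return {
        "vowel_count": vowel_count,
        "consonant_count": consonant_count,
        "digit_count": digit_count,
        "vowel_density": vowel_count / len(lowered) if lowered else 0.0,
    }

print(char_features("sweetie23"))
print(char_features("qn4foml"))
```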

Another wild idea: split the strings on digits and _ using a regex and check whether the resulting pieces are English words, using a resource such as spaCy's English model or NLTK's words corpus (nltk.corpus.words). Then add a column holding the count of English words found, if any.
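A sketch of the split-and-lookup step. The `WORDS` set is a hypothetical placeholder for whatever lexicon you pick, and the helper names are made up for illustration.

```python
import re

# Hypothetical mini word list; replace with a real English lexicon.
WORDS = {"king", "prussia", "and", "kitties", "love", "dog", "sweetie"}

def split_pieces(text):
    """Split on runs of digits and underscores, dropping empty pieces."""
    return [piece for piece in re.split(r"[\d_]+", text.lower()) if piece]

def english_word_count(text, words=WORDS):
    """Number of pieces that are exact matches in the word list."""
    return sum(piece in words for piece in split_pieces(text))

print(split_pieces("lovedog4and_kitties"))  # ['lovedog', 'and', 'kitties']
print(english_word_count("king3prussia"))
print(english_word_count("23jd2032n0d2mn"))
```

Note that "lovedog" is not matched because it glues two words together, so exact lookup of the pieces will undercount for strings without separators.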

Edit:

Also compute each vowel's index relative to the previous vowel's index (after removing non-letter characters), using 0-based indices, e.g.:
    A) maryjesus -> a(1, because it is at index 1), e(index 5 - index 1 of a = 4), u(index 7 - index 5 of e = 2)

    B) qn4foml -> qnfoml -> o(3)

The reasoning is that in English words, vowels tie the consonants together to form words, so the relative distance between consecutive vowels in an English word should be fairly small.
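The relative vowel distances can be computed along these lines. This sketch assumes 0-based indexing, drops every non-letter character before indexing (so digits and underscores do not count toward positions), and treats only `aeiou` as vowels.

```python
VOWELS = set("aeiou")  # assumption: 'y' is not counted as a vowel

def vowel_gaps(text):
    """Index of the first vowel, then the distance from each vowel to the previous one."""
    letters = [ch for ch in text.lower() if ch.isalpha()]  # drop digits, '_', etc.
    gaps, prev = [], 0
    for i, ch in enumerate(letters):
        if ch in VOWELS:
            gaps.append(i - prev)
            prev = i
    return gaps

print(vowel_gaps("maryjesus"))  # [1, 4, 2]
```

Summary statistics of the list (maximum gap, mean gap, or an empty list for strings with no vowels at all) could then go into the model as features.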

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange