Are there libraries or techniques for 'noisifying' text data?

https://datascience.stackexchange.com/questions/14805

16-10-2019

Question

Data augmentation techniques for image data and audio data (e.g. speech recognition) have proven successful and are now common.

Are there libraries or techniques for augmenting text data?

For example:

in: 'How are you?'
out: ['how are you?', 'HOW ARE YOU?', 'hwo are y ou?', 'How're you?', 'how r u', ...]

Solution

If you want a dataset similar to Google's spell-checking data, I suggest you look into the WikEd Error Corpus. The corpus consists of more than 12 million sentences with a total of 14 million edits of various types. These edits include spelling error corrections, grammatical error corrections, and stylistic changes, all taken from the Wikipedia correction history. The authors of the dataset describe the data mining process in this paper. Also check this question on Quora; it contains links to various datasets with spelling errors. Finally, this page can also be useful.

OTHER TIPS

You can code certain simple rules like the ones you have mentioned in the question. Additionally, you can use knowledge bases like Freebase and WordNet to enrich your language model. Note that this will not necessarily "noisify" your data, but for downstream tasks it can have an effect similar to that of data augmentation on, say, images.
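
As a rough illustration of the "simple rules" idea, here is a minimal sketch in plain Python that produces variants like the ones in the question. The function names, the TEXTSPEAK table, and the seed parameter are all made up for this example and do not come from any library; treat it as a starting point rather than a complete noise model.

```python
import random

# Hypothetical substitution table for informal "text speak"; extend as needed.
TEXTSPEAK = {"are": "r", "you": "u", "to": "2", "for": "4"}


def swap_adjacent(text, rng):
    """Swap one randomly chosen pair of adjacent characters (a typo)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def insert_space(text, rng):
    """Insert a stray space at a random interior position."""
    if len(text) < 2:
        return text
    i = rng.randrange(1, len(text))
    return text[:i] + " " + text[i:]


def textspeak(text):
    """Lowercase, strip punctuation around each word, and abbreviate."""
    words = [w.strip("?!.,") for w in text.lower().split()]
    return " ".join(TEXTSPEAK.get(w, w) for w in words)


def noisify(text, seed=0):
    """Return a list of noisy variants of `text` (one per rule)."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    return [
        text.lower(),
        text.upper(),
        swap_adjacent(text, rng),
        insert_space(text, rng),
        textspeak(text),
    ]


if __name__ == "__main__":
    for variant in noisify("How are you?"):
        print(repr(variant))
```

The case and text-speak variants are deterministic; the swap and stray-space variants depend on the seed, so calling noisify with several seeds gives several different typos for the same sentence.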

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange