Dataset for common words to construct basic sentences

https://stackoverflow.com//questions/10704858

13-12-2019

Question

I am making a "fridge magnet" interactive and am trying to find a good dataset of words for the user to drag around.

I am currently using this data set, but it is not that great:

http://en.wikipedia.org/wiki/Most_common_words_in_English

Any ideas where to find a better set of words?

Solution

One way you could do this yourself is to download a corpus of text and run a script that counts how many times each word appears. Then pick some value N and divide every count by N (rounding down). For each word, make that many magnets. You should pick N based on how many magnets you want at the end.

This has the advantage of having the distribution of magnets match the distribution of words. For example, if "the" appears 1000 times, "man" 320 times, "walks" 150 times, and "skips" 2 times, and you pick N to be 100, then you will end up making 10 "the" magnets, 3 "man", 1 "walks", and 0 "skips".
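A minimal sketch of that counting-and-dividing step, in Python. The function names (`count_words`, `magnets_by_division`) and the simple regex tokenizer are illustrative assumptions, not part of the original answer; a real corpus may need more careful tokenization.

```python
import re
from collections import Counter


def count_words(text):
    """Lowercase the text and count each word (letters and apostrophes only)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def magnets_by_division(counts, n):
    """Map each word to floor(count / n), dropping words that round down to 0."""
    return {word: count // n for word, count in counts.items() if count // n > 0}


# The counts from the example above, with N = 100.
counts = Counter({"the": 1000, "man": 320, "walks": 150, "skips": 2})
magnets = magnets_by_division(counts, 100)
print(magnets)
```

With these counts, "the" gets 10 magnets, "man" 3, "walks" 1, and "skips" is dropped. In practice you would build `counts` by calling `count_words` on your corpus text.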

You might also want to take the logarithm of the counts to reduce the skew. Since word distributions are Zipfian, you might otherwise end up with thousands of "the" magnets for each "walks".
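One way the logarithm idea could look, as a rough sketch. The `scale` parameter and the function name `magnets_by_log` are assumptions introduced here for illustration; the natural logarithm is an arbitrary choice, and any base works with a different scale.

```python
import math


def magnets_by_log(counts, scale=1.0):
    """Map each word to floor(scale * ln(count)), dropping words that get 0."""
    magnets = {}
    for word, count in counts.items():
        if count <= 0:
            continue  # log is undefined for non-positive counts
        n = int(scale * math.log(count))
        if n > 0:
            magnets[word] = n
    return magnets


word_counts = {"the": 1000, "man": 320, "walks": 150, "skips": 2}
log_magnets = magnets_by_log(word_counts)
print(log_magnets)
```

Here "the" ends up with 6 magnets and "walks" with 5, instead of the 10-to-1 ratio that plain division gives, so common words no longer crowd out everything else. Raising `scale` produces more magnets overall.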

Finally, the nice thing about this approach is that you could run it on a particular domain to make a word magnet set for that domain. For example, if you want to make word magnets that sound like news stories, then run it on a corpus of news stories. If you want to make word magnets that sound like fairy tales, then run it on a corpus of fairy tales.

If you really want to get fancy, you could use something like TF-IDF to pick out the words that are most representative of that domain and then mix them with common function words.
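A small sketch of the TF-IDF idea, using only the standard library. It assumes each corpus is treated as a single document, with the domain corpus compared against a few background corpora; the function name `top_domain_words` and the toy texts are hypothetical. Words that occur in every corpus (such as "the") score zero, which is why they would be mixed back in separately as function words.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def top_domain_words(domain_text, background_texts, k=5):
    """Return the k words in domain_text with the highest TF-IDF score."""
    docs = [tokenize(domain_text)] + [tokenize(t) for t in background_texts]
    n_docs = len(docs)

    # Document frequency: in how many corpora does each word appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    # Term frequency in the domain corpus, weighted by inverse document frequency.
    tf = Counter(docs[0])
    total = len(docs[0])
    scores = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Sort by score (highest first), breaking ties alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


fairy_tale = "the princess met the dragon and the dragon slept"
background = ["the market fell and the bank closed", "the team won the match"]
print(top_domain_words(fairy_tale, background, k=3))
```

For a real project, a library implementation of TF-IDF over many documents would likely be more robust than this single-document-per-corpus simplification.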

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow