Question

I am writing an Android application that uses an English dictionary. It is an educational app that presents English-language tests to the user.

I have the following problem:

In order to assess the difficulty of the tests the application produces, I need an approximation of how commonly each English word is used.

I only need a high-level approximation; any reasonable source would be acceptable.

The difficulty is that I have to do it for every word in my dictionary (an SQLite database), which contains 95,000 words.

Interesting problem, isn't it?

Any suggestions are more than welcome!

EDIT

I was thinking about running Google queries from code and using the result counts as an approximation. The catch is that I doubt Google would allow my code to make 95,000 automated queries...

Solution

Use a frequency list of English (PDF). Words that have a low frequency in that list, or that do not appear in it at all, are uncommon.
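As a rough sketch of how such a list could be used, the Java below assumes the list has already been converted to plain text with one "<word> <count>" pair per line (that format is an assumption, not something the list guarantees). The class name FrequencyList and the two thresholds are hypothetical and would need tuning against real data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class FrequencyList {
    // Parses lines of the assumed form "<word> <count>" into a lookup map.
    static Map<String, Long> parse(BufferedReader reader) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                continue; // skip blank or malformed lines
            }
            try {
                counts.put(parts[0].toLowerCase(Locale.ROOT), Long.parseLong(parts[1]));
            } catch (NumberFormatException e) {
                // skip lines whose second field is not a number (e.g. a header)
            }
        }
        return counts;
    }

    // Maps a word to a difficulty from 1 (common) to 3 (rare or not listed).
    // commonMin and mediumMin are hypothetical cut-offs chosen by the caller.
    static int difficulty(Map<String, Long> counts, String word,
                          long commonMin, long mediumMin) {
        Long count = counts.get(word.toLowerCase(Locale.ROOT));
        if (count == null) {
            return 3; // not in the list: treat as uncommon
        }
        if (count >= commonMin) {
            return 1;
        }
        return count >= mediumMin ? 2 : 3;
    }
}
```

Since the dictionary is fixed, the difficulty could be computed once for all 95,000 words and stored in an extra column of the SQLite table, rather than being looked up at run time.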

Other suggestions

It is a very interesting problem. One option is to query http://books.google.com/ngrams/graph and gather statistics that you can then build on. You could establish a baseline from some very common words and then compare each tested word's frequency against that baseline, or compute some statistical average, and so on.

Of course, this reflects written rather than spoken English, but if you limit the date range to, say, the last 50 years, it should give you a good approximation.
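The baseline comparison could look something like the sketch below. It assumes the relative frequencies have already been collected somehow (how to fetch them is not shown), and the class and method names are made up for illustration. The score is the number of orders of magnitude by which a word is rarer than the average of the baseline words.

```java
import java.util.List;

class RelativeFrequency {
    // Average frequency of a few very common words, used as the baseline.
    static double baseline(List<Double> commonWordFrequencies) {
        if (commonWordFrequencies.isEmpty()) {
            throw new IllegalArgumentException("need at least one baseline word");
        }
        double sum = 0.0;
        for (double f : commonWordFrequencies) {
            sum += f;
        }
        return sum / commonWordFrequencies.size();
    }

    // Returns log10(baseline / wordFrequency): 0 means as common as the
    // baseline, 3 means roughly a thousand times rarer.
    static double rarity(double wordFrequency, double baseline) {
        if (wordFrequency <= 0.0) {
            return Double.POSITIVE_INFINITY; // never seen: maximally rare
        }
        return Math.log10(baseline / wordFrequency);
    }
}
```

A logarithmic score is used here because word frequencies span many orders of magnitude; a plain ratio would work too, but is harder to bucket into difficulty levels.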

For the current test, create a HashMap<String,Integer> for the counts and an ArrayList<String> for the words in your test, then do something like this:

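// Requires java.util.HashMap and java.util.Map; words is the test's word list.
Map<String, Integer> word_frequency = new HashMap<String, Integer>();
for (String word : words) {
    Integer count = word_frequency.get(word);
    // put() overwrites the old value, so no remove() or cast is needed.
    word_frequency.put(word, count == null ? 1 : count + 1);
}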

This gives you a HashMap that contains every word in the test and how many times it appears.

Note that this is just sample code; there may be a faster way, and you may also want to handle case sensitivity and other details that I can't think of right now.
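One possible way to handle case sensitivity is to normalise each word before counting it. This is only a sketch: the class name WordCounter is hypothetical, and lower-casing with Locale.ROOT is an assumption that suits plain English words.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class WordCounter {
    // Counts occurrences, treating "Apple" and "apple" as the same word.
    static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            // Locale.ROOT avoids locale-specific surprises when lower-casing.
            String key = word.trim().toLowerCase(Locale.ROOT);
            if (key.isEmpty()) {
                continue; // ignore empty entries
            }
            Integer old = counts.get(key);
            counts.put(key, old == null ? 1 : old + 1);
        }
        return counts;
    }
}
```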

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow