Question

Several times over the years I have wanted to work with frequency lists (character, word, n-gram, etc.) of varying quality, but I never figured out how to use them together.

At the time I intuited that lists with just rank and no other data should still be useful. Since then I have learned of Zipf's law and power laws, though I'm not great at maths, so I don't fully understand everything.

I've found some questions on Stack Overflow and Cross Validated that seem like they could be related, but I either don't understand them at the right level or they lack useful answers.

What I want is a way to normalize a list with full frequency data and a list with only rank data so that I can use the two lists together.

For instance a word list with frequency data:

word  per million
的    50155.13
我    50147.83
你    39629.27
是    28253.52
了    28210.53
不    20543.44
在    12811.05
他    11853.78
我们  11080.02
...
...
...   00000.01

And a word list with only rank data:

word  rank
的    1
一    2
是    3
有    4
在    5
人    6
不    7
大    8
中    9
...
...
...   100,000

How can I normalize both the frequency data and the rank data into the same kind of value, so that it can then be used in comparisons, etc.?

(The example lists in this question are just examples. Assume much longer lists obtained from external sources over which the programmer has no control.)

Solution

It should be obvious that you can determine a rank when you have a complete list with frequencies (order the list by frequency in descending order and assign ranks incrementally), but not the other way round: how would you know how often a word occurs, given only the information that it is ranked in 3rd position? You can only deduce that it occurs with equal or lower frequency than the word in 2nd position, and with equal or higher frequency than the word in 4th position.
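
As a minimal sketch of that first direction (this is my illustration, not part of the original answer), here is how a frequency list could be turned into ranks; the variable names and the three example entries are hypothetical:

```python
# Derive ranks from a frequency list: sort by frequency (descending)
# and assign 1-based ranks. `freq_list` stands in for the (word, per-million)
# pairs loaded from the frequency file.
freq_list = [("的", 50155.13), ("你", 39629.27), ("我", 50147.83)]  # illustrative data

ranked = sorted(freq_list, key=lambda item: item[1], reverse=True)
rank_of = {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

print(rank_of["你"])  # -> 3
```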

Applying Zipf's law, you could map the rank back to a rough frequency estimate. But I'm not sure how well this generalizes across different languages.
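
A rough sketch of that mapping (my own illustration, under the classic Zipf assumption): the relative frequency of the word at a given rank is approximately (1/rank) divided by the N-th harmonic number, so the estimates over the whole list sum to 1:

```python
# Rough Zipf-based frequency estimate from rank alone:
# f(rank) ~ (1 / rank) / H_N, where H_N is the N-th harmonic number,
# so that the estimates over the whole list sum to 1.
def zipf_estimate(rank: int, n_words: int) -> float:
    harmonic_n = sum(1.0 / k for k in range(1, n_words + 1))
    return (1.0 / rank) / harmonic_n

# e.g. estimated relative frequency of the 3rd-ranked word in a 100,000-word list
print(zipf_estimate(3, 100_000))
```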

[edit] You really caught my attention now :) I came across this application of Zipf's Law on Wolfram MathWorld. I'll run some small experiments with an English term corpus which I created a while ago and come back with the results; just a little patience.

[edit2] I have now taken a frequency list from Word Frequencies in Written and Spoken English: based on the British National Corpus (this one, to be exact; it only contains the top 5,000 words or so, but that should be enough for this quick look) and applied a simple 1/rank to estimate the frequencies. I did the experiment as a KNIME workflow (using the JFreeChart nodes for the chart and the Palladian nodes [disclaimer: I'm the author of the Palladian nodes] for the RMSE calculation), which looks as follows:
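
For readers without KNIME, here is a small plain-Python reconstruction of the same idea (mine, not the original workflow, and with made-up numbers): scale the actual frequencies relative to the top-ranked word, estimate each word's relative frequency as 1/rank, and measure the gap with RMSE:

```python
import math

# Plain-Python reconstruction of the experiment (illustrative only; the
# original used a BNC-derived frequency list inside a KNIME workflow).
# Per Zipf's law, frequency relative to the top-ranked word ~ 1/rank.
per_million = [61847.0, 29448.0, 26158.0, 21261.0, 18013.0]  # hypothetical values

actual = [f / per_million[0] for f in per_million]                   # relative to rank 1
estimated = [1.0 / rank for rank in range(1, len(per_million) + 1)]  # simple 1/rank estimate

rmse = math.sqrt(
    sum((a - e) ** 2 for a, e in zip(actual, estimated)) / len(per_million)
)
print(f"RMSE: {rmse:.4f}")
```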

[Image: KNIME workflow]

The graph below shows the actual frequencies and the estimates derived from rank (the rank axis is log scaled; sorry for not providing proper axis captions; the blue line is the estimate, the red line is the actual value from the dataset):

[Image: Frequency estimation]

So, while there are some outliers at the top ranks (e.g. 2, 3, 4), the frequency estimation should still be perfectly decent when used in conjunction with TF-IDF or something like that. (The RMSE is ~0.004 in this case, which is of course largely due to the minimal deviation in the long tail.)

Here's a snippet with some actual values:

[Image: Frequency estimation list]

By the way, also have a look at this section of the Wikipedia article on Zipf's law, which shows similar results.
