Question

I've read a bunch of tutorials about the proper way to generate a logarithmic distribution of tagcloud weights. Most of them group the tags into steps. This seems somewhat silly to me, so I developed my own algorithm based on what I've read so that it dynamically distributes the tag's count along the logarthmic curve between the threshold and the maximum. Here's the essence of it in python:

from math import log
count = [1, 3, 5, 4, 7, 5, 10, 6]
def logdist(count, threshold=0, maxsize=1.75, minsize=.75):
    countdist = []
    # mincount is either the threshold or the minimum if it's over the threshold
    mincount = threshold<min(count) and min(count) or threshold
    maxcount = max(count)
    spread = maxcount - mincount
    # the slope of the line (rise over run) between (mincount, minsize) and ( maxcount, maxsize)
    delta = (maxsize - minsize) / float(spread)
    for c in count:
        logcount = log(c - (mincount - 1)) * (spread + 1) / log(spread + 1)
        size = delta * logcount - (delta - minsize)
        countdist.append({'count': c, 'size': round(size, 3)})
    return countdist

Basically, without the logarithmic calculation of the individual count, it would generate a straight line between the points, (mincount, minsize) and (maxcount, maxsize).

The algorithm does a good approximation of the curve between the two points, but suffers from one drawback. The mincount is a special case, and the logarithm of it produces zero. This means the size of the mincount would be less than minsize. I've tried cooking up numbers to try to solve this special case, but can't seem to get it right. Currently I just treat the mincount as a special case and add "or 1" to the logcount line.

Is there a more correct algorithm to draw a curve between the two points?

Update Mar 3: If I'm not mistaken, I am taking the log of the count and then plugging it into a linear equation. To put the description of the special case in other words, in y=lnx at x=1, y=0. This is what happens at the mincount. But the mincount can't be zero, the tag has not been used 0 times.

Try the code and plug in your own numbers to test. Treating the mincount as a special case is fine by me, I have a feeling it would be easier than whatever the actual solution to this problem is. I just feel like there must be a solution to this and that someone has probably come up with a solution.

UPDATE Apr 6: A simple google search turns up a many of the tutorials I've read, but this is probably the most complete example of stepped tag clouds.

UPDATE Apr 28: In response to antti.huima's solution: When graphed, the curve that your algorithm creates lies below the line between the two points. I've been trying to juggle the numbers around but still can't seem to come up with a way to flip that curve to the other side of the line. I'm guessing that if the function was changed to some form of logarithm instead of an exponent it would do exactly what I'd need. Is that correct? If so, can anyone explain how to achieve this?

Was it helpful?

Solution

Thanks to antti.huima's help, I re-thought out what I was trying to do.

Taking his method of solving the problem, I want an equation where the logarithm of the mincount is equal to the linear equation between the two points.

weight(MIN) = ln(MIN-(MIN-1)) + min_weight
min_weight = ln(1) + min_weight

While this gives me a good starting point, I need to make it pass through the point (MAX, max_weight). It's going to need a constant:

weight(x) = ln(x-(MIN-1))/K + min_weight

Solving for K we get:

K = ln(MAX-(MIN-1))/(max_weight - min_weight)

So, to put this all back into some python code:

from math import log
count = [1, 3, 5, 4, 7, 5, 10, 6]
def logdist(count, threshold=0, maxsize=1.75, minsize=.75):
    countdist = []
    # mincount is either the threshold or the minimum if it's over the threshold
    mincount = threshold<min(count) and min(count) or threshold
    maxcount = max(count)
    constant = log(maxcount - (mincount - 1)) / (maxsize - minsize)
    for c in count:
        size = log(c - (mincount - 1)) / constant + minsize
        countdist.append({'count': c, 'size': round(size, 3)})
    return countdist

OTHER TIPS

Let's begin with your mapping from the logged count to the size. That's the linear mapping you mentioned:

   size
    |
max |_____
    |   /
    |  /|
    | / |
min |/  |
    |   |
   /|   |
0 /_|___|____
    0   a

where min and max are the min and max sizes, and a=log(maxcount)-b. The line is of y=mx+c where x=log(count)-b

From the graph, we can see that the gradient, m, is (maxsize-minsize)/a.

We need x=0 at y=minsize, so log(mincount)-b=0 -> b=log(mincount)

This leaves us with the following python:

mincount = min(count)
maxcount = max(count)
xoffset = log(mincount)
gradient = (maxsize-minsize)/(log(maxcount)-log(mincount))
for c in count:
    x = log(c)-xoffset
    size = gradient * x + minsize

If you want to make sure that the minimum count is always at least 1, replace the first line with:

mincount = min(count+[1])

which appends 1 to the count list before doing the min. The same goes for making sure the maxcount is always at least 1. Thus your final code per above is:

from math import log
count = [1, 3, 5, 4, 7, 5, 10, 6]
def logdist(count, maxsize=1.75, minsize=.75):
    countdist = []
    mincount = min(count+[1])
    maxcount = max(count+[1])
    xoffset = log(mincount)
    gradient = (maxsize-minsize)/(log(maxcount)-log(mincount))
    for c in count:
        x = log(c)-xoffset
        size = gradient * x + minsize
        countdist.append({'count': c, 'size': round(size, 3)})
    return countdist

what you have is that you have tags whose counts are from MIN to MAX; the threshold issue can be ignored here because it amounts to setting every count below threshold to the threshold value and taking the minimum and maximum only afterwards.

You want to map the tag counts to "weights" but in a "logarithmic fashion", which basically means (as I understand it) the following. First, the tags with count MAX get max_weight weight (in your example, 1.75):

weight(MAX) = max_weight

Secondly, the tags with the count MIN get min_weight weight (in your example, 0.75):

weight(MIN) = min_weight

Finally, it holds that when your count decreases by 1, the weight is multiplied with a constant K < 1, which indicates the steepness of the curve:

weight(x) = weight(x + 1) * K

Solving this, we get:

weight(x) = weight_max * (K ^ (MAX - x))

Note that with x = MAX, the exponent is zero and the multiplicand on the right becomes 1.

Now we have the extra requirement that weight(MIN) = min_weight, and we can solve:

weight_min = weight_max * (K ^ (MAX - MIN))

from which we get

K ^ (MAX - MIN) = weight_min / weight_max

and taking logarithm on both sides

(MAX - MIN) ln K = ln weight_min - ln weight_max

i.e.

ln K = (ln weight_min - ln weight_max) / (MAX - MIN)

The right hand side is negative as desired, because K < 1. Then

K = exp((ln weight_min - ln weight_max) / (MAX - MIN))

So now you have the formula to calculate K. After this you just apply for any count x between MIN and MAX:

weight(x) = max_weight * (K ^ (MAX - x))

And you are done.

On a log scale, you just plot the log of the numbers linearly (in other words, pretend you're plotting linearly, but take the log of the numbers to be plotted first).

The zero problem can't be solved analytically--you have to pick a minimum order of magnitude for your scale, and no matter what you can't ever reach zero. If you want to plot something at zero, your choices are to arbitrarily give it the minimum order of magnitude of the scale, or to omit it.

I don't have the exact answer, but i think you want to look up Linearizing Exponential Data. Start by calculate the equation of the line passing through the points and take the log of both sides of that equation.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top