Question

I'm developing a 'TreeDict' class in Python. This is a basically a dict that allows you to retrieve its key-value pairs in sorted order, just like the Treemap collection class in Java.

I've implemented some functionality based on the way unique indexes in relational databases can be used, e.g. functions to let you retrieve values corresponding to a range of keys, keys greater than, less than or equal to a particular value in sorted order, strings or tuples that have a specific prefix in sorted order, etc.

Unfortunately, I can't think of any real life problem that will require a class like this. I suspect that the reason we don't have sorted dicts in Python is that in practice they aren't required often enough to be worth it, but I want to be proved wrong.

Can you think of any specific applications of a 'TreeDict'? Any real life problem that would be best solved by this data structure? I just want to know for sure whether this is worth it.

Was it helpful?

Solution

It's useful when you need to go through a Dictionary in order of the keys; which comes up on occasion. I've actually found its infinitely more common in certain programming contests then anything else (think ACM, etc).

The most useful feature of a TreeMap is when you want to quickly find the min or max key; using a sorted dictionary this is often a single method call; and algorithmically can be done in O(log(n)) time, as opposed to iterating over each key looking for a min/max if the collection is unsorted. Basically, a much friendlier interface.

One of the more common times I run into it is when objects are identified by a specific name, and you want to print out the objects ordered according to the name; say a mapping from directory name to number of files in a directory.

One other place I've used it is in an excel spreadsheet wrapper; mapping from row number to row object. This lets you quickly find the last row index, without looping through each row.

Also, it's useful when you can easily define a comparison relation on keys, but not necessarily a hashing function, as needed for HashMaps. The best (though weak) example I can think of is case insensitive string keys.

OTHER TIPS

I've seen several answers pointing to the "walk in ordered sequence" feature, which is indeed important, but none highlighting the other big feature, which is "find first entry with a key >= this". This has many uses even when there's no real need to "walk" from there.

For example (this came up in a recent SO answer), say you want to generate pseudo-random values with given relative frequencies -- i.e, you're given, say, a dict d:

{'wolf': 42, 'sheep': 15, 'dog': 23, 'goat': 15, 'cat': 5}

and need a way to generate 'wolf' with a probability of 42 out of 100 (since 100 is the total of the relative frequencies given), 'sheep' 15 out of 100, and so on; and the number of distinct values can be quite large, as can the relative frequencies.

Then, store the given values (in whatever order) as the values in a tree map, with the corresponding keys being the "total cumulative frequency" up to that point. I.e.:

def preprocess(d):
    tot = 0
    for v in d:
        tot += d[v]
        treemap.insert(key=tot, value=v)
    return tot, treemap

Now, generating a value can be pretty fast (O(log(len(d)))), as follows:

def generate(tot, treemap, r=random):
    n = r.randrange(tot)
    return treemap.firstGTkey(n).value

where firstGTKey is a method that returns the first entry (with .key and .value attributes, in this hypothetical example) with a key > the given argument. I've used this approach with large files stored as B-Trees, for example (using e.g. bsddb.bt_open and the set_location method).

The reason for keeping the elements in sorted order is for faster retrieval. Say I wanted all of the values in the dictionary in a sorted range. This is much faster with a TreeDict then with the regular hashmap. It basically allows you to keep everything in the dictionary in sorted order. I know in the application I'm currently working on uses a class like this to basically query the data structure.

I often use Dict<DateTime, someClassOrValue> when working with industrial process data-- Valve open/close, machinery start/stop, etc.

Having the keys sorted is especially useful when I need to compare time intervals between start/stop or open/close events in a decent amount of time.

However, since I've been able to use linq in C# I've found that it's often easier to just work with IEnumerables and use the IQueryable extension methods to get the information I need.

Almost all "GROUP BY" reporting requires a sorted dictionary.

summary = sortedDefaultDict()
for row in somePileOfData:
    summary[row.group_by] += row.balance
for k in sorted(summary.keys()):
    print k, summary[k]

This is done so often in data warehousing applications, that it's difficult to express how central this is.

If the sorted function call does no work, it saves a ton of time in the long run.

They can make various algorithms easier to implement.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top