Grouping strings lexicographically (python)

https://stackoverflow.com/questions/9246720

28-04-2021
|

Question

I have N strings that I want to divide lexicographic into M even-sized buckets (+/- 1 string). Also, N>>M.

The direct way would be to sort all the strings and split the resulting list into the M buckets.

I would like to instead approximate this by routing each string as it is created to a bucket, before the full list is available.

Is there a fast and pythonic way to assign strings to buckets? I'm essentially looking for a string-equivalent of the integer modulo operator. Perhaps a hash that preserves lexicographic order? Is that even possible?

Solution

You can sort by first two chars of a string, or something of this sort.

Let's say that M=100, so you should divide the characters into sqrt(M) regions, and each should point to another sqrt(M) regions, then for each string you get, you can compare the first char to decide which region to direct the string to and again for the second char, something like a tree with buckets as leaves and comparisons as nodes.

OTHER TIPS

A hash by definition doesn't preserve any order.

And I don't think there is any pythonic way to do this.

You could just create dictionaries (which are basically hashing functions) and keep adding a string to each round-robin style, but it wouldn't preserve any order.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow