How to group strings with similar beginnings in a sorted list?

Question 1

Nice question. How about this small solution:

def commonPrefix(a, b):
  i = 0
  while i < len(a) and i < len(b) and a[i] == b[i]:
    i += 1
  return i

def eachWithPrefix(v):
  p = ''
  for x in v:
    yield commonPrefix(p, x), x
    p = x

Now you can choose what you want:

list(eachWithPrefix(v))

will return a list of pairs, each value together with the number of leading characters it shares with the previous line, so

print('\n'.join(' ' * p + x[p:] for p, x in eachWithPrefix(v)))

will print the second solution you proposed.

print('\n'.join('\t' * p + '\\'.join(x[p:]) for p, x in eachWithPrefix(x.split('\\') for x in v)))

on the other hand, will do the same per component, splitting on the delimiter \ and replacing each omitted component with a tab stop. This is not quite the format you proposed in your first output example, but I guess you get the point.
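For readers on Python 3, here is a minimal self-contained sketch of the two helpers above, run on a made-up sample list (the `sample` values are an assumption, not data from the question). It shows what the pairs look like and how the character-wise blanking behaves:

```python
def commonPrefix(a, b):
    # Number of leading items that a and b share (works on strings and lists).
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return i

def eachWithPrefix(v):
    # Pair every item with the length of the prefix it shares with its predecessor.
    p = ''
    for x in v:
        yield commonPrefix(p, x), x
        p = x

# Hypothetical sample input, already sorted.
sample = ['2014-01-01', '2014-01-02', '2014-02-01']

pairs = list(eachWithPrefix(sample))
# pairs == [(0, '2014-01-01'), (9, '2014-01-02'), (6, '2014-02-01')]

# Replace the shared leading characters with spaces.
blanked = [' ' * p + x[p:] for p, x in pairs]
print('\n'.join(blanked))
```

The first item always gets a prefix length of 0 because it is compared against the empty string.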

Try:

print('\n'.join('\\'.join([s if i >= p else ' ' * len(s) for i, s in enumerate(x)]) for p, x in eachWithPrefix(x.split('\\') for x in v)))

This replaces each repeated component with a run of spaces of the same length. The output still contains the delimiters, but maybe that's even nicer:

2014\2014-01 Jan\2014-01-01
    \           \2014-01-02
    \           \2014-01-03
    \           \2014-01-04
    \           \2014-01-05
...
    \           \2014-01-31
    \2014-02 Feb\2014-02-01
    \           \2014-02-02
    \           \2014-02-03
...

To blank out those delimiters as well, you can use this approach:

print('\n'.join(' ' * len('\\'.join(x[:p])) + '\\'.join(x)[len('\\'.join(x[:p])):] for p, x in eachWithPrefix(x.split('\\') for x in v)))

But this now contains some duplicated code, so maybe an explicit loop would be nicer here:

for p, x in eachWithPrefix(x.split('\\') for x in v):
  s = '\\'.join(x)
  c = '\\'.join(x[:p])
  print(' ' * len(c) + s[len(c):])

Or as an easy-to-use generator:

def heirarchy(data, separator=","):
  for p, x in eachWithPrefix(x.split(separator) if separator else list(x) for x in data):
    s = separator.join(x)
    c = separator.join(x[:p])
    yield ' '*len(c) + s[len(c):]

So now heirarchy(data, separator='\\') yields exactly the lines of your expected output.
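As a rough usage sketch, assuming Python 3 and a few made-up paths (the `paths` list is hypothetical), the generator can be driven like this. The helpers are repeated so the snippet runs on its own:

```python
def commonPrefix(a, b):
    # Number of leading items that a and b share.
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return i

def eachWithPrefix(v):
    # Yield (shared prefix length with the previous item, item).
    p = ''
    for x in v:
        yield commonPrefix(p, x), x
        p = x

def heirarchy(data, separator=","):
    # Split each line into components, then blank the components that repeat.
    for p, x in eachWithPrefix(x.split(separator) if separator else list(x) for x in data):
        s = separator.join(x)
        c = separator.join(x[:p])
        yield ' ' * len(c) + s[len(c):]

# Made-up sample paths, already sorted.
paths = [
    '2014\\2014-01 Jan\\2014-01-01',
    '2014\\2014-01 Jan\\2014-01-02',
    '2014\\2014-02 Feb\\2014-02-01',
]
lines = list(heirarchy(paths, separator='\\'))
print('\n'.join(lines))
```

With this input the second line comes out as 16 spaces followed by \2014-01-02, and the third as 4 spaces followed by \2014-02 Feb\2014-02-01. Passing separator='' falls back to the character-by-character behaviour.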

Question 2

Seems like you want to reinvent a radix tree (http://en.wikipedia.org/wiki/Radix_tree).

Anyhow, here's a simple generator:

def grouped(iterable):
    prefix = None
    for i in iterable:
        # 16 matches the length of the fixed-width leading part (e.g. "2014\2014-01 Jan").
        pre, suf = i[:16], i[16:]
        if pre != prefix:
            prefix = pre
            yield pre + suf
        else:
            yield " " * 16 + suf
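The generator above hard-codes a prefix width of 16. A small variation, sketched here under the assumption that all lines share one fixed-width leading part, takes the width as an argument (the name `grouped_by_width` and the `width` parameter are made up for this example):

```python
def grouped_by_width(iterable, width=16):
    # `width` is a hypothetical parameter replacing the hard-coded 16.
    prefix = None
    for item in iterable:
        pre, suf = item[:width], item[width:]
        if pre != prefix:
            # New group: remember its prefix and show the line in full.
            prefix = pre
            yield item
        else:
            # Same group as the previous line: blank the repeated prefix.
            yield ' ' * len(pre) + suf

# Made-up sample data; the first 8 characters ("2014-01-") form the group key.
dates = ['2014-01-01', '2014-01-02', '2014-02-01']
result = list(grouped_by_width(dates, width=8))
print('\n'.join(result))
```

Note that this is all-or-nothing: the third line is shown in full even though it shares "2014-0" with the line before it, because the whole 8-character key has to match.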

Question 3

from difflib import SequenceMatcher

def remove_redundant_prefixes(it):
    """
    remove_redundant_prefixes(it) -> iterable (generator)

        Iterate through a list of strings, removing successive common prefixes.
    """
    prev_line = ''
    for line in sorted(it):
        sm = SequenceMatcher(a=prev_line, b=line)
        prev_line = line

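        # Each block is a (a, b, size) tuple. difflib can report a block from
        # elsewhere in the strings first, so only trust a block that starts at
        # index 0 in both strings as a common prefix.
        first = sm.get_matching_blocks()[0]
        match_size = first.size if first.a == 0 and first.b == 0 else 0

        # No match == no prefix, don't prune the string.
        if match_size == 0:
            yield line
        else:
            # Prune per the match
            yield ' ' * match_size + line[match_size:]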
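difflib looks for matching blocks anywhere in the two strings, not only at the start. If an exact character-wise prefix is all that is needed, `os.path.commonprefix` from the standard library is a simpler fit. This is a swapped-in technique, not the method above, and the function name `blank_common_prefixes` is made up for the sketch:

```python
import os.path

def blank_common_prefixes(lines):
    # Sort the input, then blank whatever each line shares with its predecessor.
    prev = ''
    for line in sorted(lines):
        # commonprefix compares character by character, starting at index 0.
        n = len(os.path.commonprefix([prev, line]))
        prev = line
        yield ' ' * n + line[n:]

# Made-up input; "ababc" sorts before "abc" and the two share only "ab".
out = list(blank_common_prefixes(['abc', 'ababc']))
print('\n'.join(out))
```

Here only the two shared leading characters of the second line are blanked, even though "abc" also occurs later inside "ababc".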

Question 4

OK, inspired by the commonprefix answers from this question, I played with it for a bit, and inspiration came when I realized I could send a list with just two elements each time!

Here's my code. It handles only the character-by-character case, and I'm not sure how good it is (I suspect not very, as a lot of unnecessary copying occurs). But I was able to reproduce the third output from my question. This still leaves the other part unresolved.

import os
from pprint import pprint

# Note: seperator is accepted but not used yet.
def printheirarchy(data,seperator=","):
    if len(data) < 2:
        pprint(data)
        return
    newdata = []
    newdata.append(data[0])
    for i in range(1,len(data)):
        prefix = os.path.commonprefix(data[i-1:i+1])
        newdata.append(data[i].replace(prefix," "*len(prefix),1))
    pprint(newdata)
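One possible way to avoid building the intermediate list and to cover the separator case is a generator that snaps the shared prefix back to a component boundary. This is only a sketch under my own assumptions: the names `iter_heirarchy` and `_on_boundary` are hypothetical, and the parameter is spelled `separator` here.

```python
import os.path

def _on_boundary(s, n, separator):
    # True if position n in s is the end of s or the start of a separator.
    return n == len(s) or s.startswith(separator, n)

def iter_heirarchy(data, separator=None):
    # Yields lines lazily instead of collecting them in a new list.
    prev = ''
    for line in data:
        n = len(os.path.commonprefix([prev, line]))
        if separator and not (_on_boundary(prev, n, separator)
                              and _on_boundary(line, n, separator)):
            # The shared prefix ends inside a component: fall back to the
            # last separator within it (or to 0 if there is none).
            n = max(line.rfind(separator, 0, n), 0)
        yield ' ' * n + line[n:]
        prev = line

# Made-up sample paths, already sorted.
paths = [
    '2014\\2014-01 Jan\\2014-01-01',
    '2014\\2014-01 Jan\\2014-01-02',
    '2014\\2014-02 Feb\\2014-02-01',
]
by_component = list(iter_heirarchy(paths, separator='\\'))
by_character = list(iter_heirarchy(['abc', 'abd']))
print('\n'.join(by_component))
```

Without a separator it behaves like the character-by-character version; with one, only whole components are blanked.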