The simple solution would be to sort a file by the key column e.g., sort tab-separated input by the second column:
#!/bin/bash
printf "a\tz\nb\ty\nc\tx" | sort -k 2 -t $'\t'
And then solve a simpler problem of retrieving 25% of random lines for each unique key where all lines with equal keys are adjacent with a constraint that at least one line for each unique key should be preserved:
#!/usr/bin/env python
import random
import sys
from itertools import chain, groupby
def choose_random(iterator, fraction, random=random.random):
"""Lazy analog of:
L = list(iterator)
k = int(len(L) * fraction + .5) or 1 # get at least one
result = random.sample(L, k)
Note: this function doesn't randomize the order of elements
that would require to keep selected elements in memory
and number of output elements is not exactly k
"""
# always yield at least one item if input is not empty
item = next(iterator)
it = (x for x in chain([item], iterator) if random() < fraction)
for x in chain([next(it, item)], it):
yield x
def getkey(line):
return line.split("\t")[1] # 2nd column
for key, group in groupby(sys.stdin, key=getkey):
sys.stdout.writelines(choose_random(group, fraction=0.25))
Note: the last line in the input file should contain a newline otherwise the output is corrupted if the last line is chosen.
The script accepts sorted (by the key column) input on stdin and prints reduced output to stdout. It requires to store only one line in memory at a time. It is a single-pass algorithm (O(n)).