Extracting a random line from a file without loading the file into RAM in Python

Question 1

You can use heapq to select SIZE records, ranking each line by a random number, e.g.:

import heapq
import random

SIZE = 10
with open('yourfile') as fin:
    sample = heapq.nlargest(SIZE, fin, key=lambda L: random.random())

This is efficient: it needs no pre-scan of the data, and the heap stays a fixed size because elements are swapped out as others are chosen in their place, so you never hold more than roughly SIZE lines in memory at once.
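To make the fixed-size behaviour concrete, here is a rough sketch of the same idea written out by hand. It is only an illustration, not a description of how heapq.nlargest is implemented internally; the function name sample_lines and its rng parameter are made up for this example.

```python
import heapq
import random


def sample_lines(lines, size, rng=random):
    """Return up to `size` items from `lines`, chosen by largest random key.

    Hypothetical helper: reads `lines` once and never stores more than
    `size` of them.
    """
    if size <= 0:
        return []
    heap = []  # min-heap of (key, index, line); heap[0] holds the smallest key
    for index, line in enumerate(lines):
        key = rng.random()
        if len(heap) < size:
            # Still filling up: keep everything seen so far.
            heapq.heappush(heap, (key, index, line))
        elif key > heap[0][0]:
            # New key beats the weakest kept key: swap that element out.
            heapq.heapreplace(heap, (key, index, line))
    return [line for _key, _index, line in heap]
```

It would be called the same way as the one-liner above, for example sample_lines(fin, SIZE) inside the with block. The index in each tuple only breaks ties between equal keys so that the lines themselves are never compared.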

Question 2

One option is to do a random seek into the file then look backwards for a newline (or the start of the file) before reading a line. Here's a program that prints a random line of each of the Python programs it finds in the current directory.

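import glob
import os
import random

for name in glob.glob("*.py"):
    size = os.path.getsize(name)
    if size == 0:
        continue  # an empty file has no line to pick
    # Binary mode, because seeking to an arbitrary byte offset is only
    # well defined for binary files.
    with open(name, "rb") as inf:
        # Pick a random byte, then step back to the start of its line.
        location = random.randrange(size)
        while location > 0:
            inf.seek(location - 1)
            if inf.read(1) == b"\n":
                break  # the previous byte ends the preceding line
            location -= 1
        inf.seek(location)
        line = inf.readline()
    print(name, ":", line.decode(errors="replace").rstrip("\r\n"))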

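As long as the lines aren't huge, this shouldn't be unduly burdensome. Note that the choice is made per byte rather than per line, so longer lines are more likely to be picked than shorter ones.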

Question 3

You could scan the file once, counting the number of lines. Once you know that, you can generate the random line number, re-read the file and emit that line when you see it.
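A minimal sketch of that two-pass approach might look like the following. The function name random_line_two_pass and the rng parameter are invented for illustration; the file is read in binary mode, so the result is a bytes object.

```python
import random


def random_line_two_pass(path, rng=random):
    """Return one uniformly chosen line from `path`, or None if it is empty."""
    # First pass: count the lines without keeping any of them.
    with open(path, "rb") as f:
        count = sum(1 for _ in f)
    if count == 0:
        return None
    # Choose a zero-based line number, then re-read until we reach it.
    target = rng.randrange(count)
    with open(path, "rb") as f:
        for number, line in enumerate(f):
            if number == target:
                return line
    return None  # only reached if the file shrank between the two passes
```

Unlike the random-seek approach, every line is equally likely here, at the cost of reading the file up to twice.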

Actually, since you're interested in multiple lines, you should look at "Efficiently selecting a set of random elements from a linked list".
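One well-known single-pass technique for picking several items from a sequence of unknown length is reservoir sampling. Whether that is what the referenced question recommends is an assumption on my part; the sketch below is just one way it could be done, and reservoir_sample, k and rng are names chosen for this example.

```python
import random


def reservoir_sample(lines, k, rng=random):
    """Return up to `k` items chosen uniformly from `lines` in one pass."""
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Item i replaces a kept item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint includes both endpoints
            if j < k:
                reservoir[j] = line
    return reservoir
```

Passing an open file as lines, for example reservoir_sample(fin, 10), keeps only k lines in memory and needs neither a line count nor a second read.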