Question

I have big svmlight files that I'm using for machine learning purposes. I'm trying to see whether subsampling those files would lead to good enough results.

I want to extract random lines from my files to feed them into my models, but I want to load as little as possible into RAM.

I saw here (Read a number of random lines from a file in Python) that I could use linecache, but all the solutions end up loading everything into memory.

Could someone give me some hints? Thank you.

EDIT: I forgot to say that I know the number of lines in my files beforehand.


Solution

You can use heapq to select n records based on a random key, e.g.:

import heapq
import random

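SIZE = 10  # number of lines to sample
with open('yourfile') as fin:
    # Every line gets a random key; the SIZE lines with the largest keys are kept.
    sample = heapq.nlargest(SIZE, fin, key=lambda L: random.random())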

This is efficient because the heap stays at a fixed size and no pre-scan of the data is needed. Lines get swapped out as other lines are chosen instead, so only about SIZE lines are held in memory at any time. The file is still read once from start to end.
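To make this easier to reuse and to test, here is a minimal sketch that wraps the same heapq.nlargest call in a function. The name sample_lines and the rng parameter are my own additions for illustration, not part of the answer above; rng only exists so a seeded random.Random can make runs repeatable.

```python
import heapq
import random


def sample_lines(lines, k, rng=None):
    """Return up to k lines drawn at random from an iterable of lines.

    Hypothetical helper: each line is given a random key and the k lines
    with the largest keys are kept, so only about k lines are held at once.
    """
    if rng is None:
        rng = random.Random()
    return heapq.nlargest(k, lines, key=lambda _line: rng.random())


if __name__ == "__main__":
    # Stand-in for a real file: any iterable of lines works here,
    # including an open file object.
    fake_file = (f"row {i}\n" for i in range(100_000))
    print(sample_lines(fake_file, 5, rng=random.Random(0)))
```

Note that the result comes back ordered by the random keys, not in file order. If the input has fewer than k lines, all of them are returned.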

Other tips

One option is to do a random seek into the file then look backwards for a newline (or the start of the file) before reading a line. Here's a program that prints a random line of each of the Python programs it finds in the current directory.

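import glob
import os
import random

for name in glob.glob("*.py"):
    size = os.path.getsize(name)
    if size == 0:
        continue  # nothing to sample from an empty file
    # Binary mode, because seeking to an arbitrary byte offset is only
    # well defined for binary files.
    with open(name, "rb") as inf:
        location = random.randrange(size)
        # Walk backwards until the previous byte is a newline
        # (or until the start of the file is reached).
        while location > 0:
            inf.seek(location - 1)
            if inf.read(1) == b"\n":
                break
            location -= 1
        inf.seek(location)
        line = inf.readline()
    print(name, ":", line.decode("utf-8", errors="replace").rstrip("\r\n"))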

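As long as the lines aren't huge this shouldn't be unduly burdensome. Keep in mind that a random byte offset is more likely to land in a long line, so lines are picked roughly in proportion to their length rather than uniformly.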

You could scan the file once, counting the number of lines. Once you know that, you can generate the random line number, re-read the file and emit that line when you see it.
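Since the question says the line count is already known, the counting pass can be skipped. A rough sketch of the idea, extended to several lines at once: pick the line numbers first, then read the file once and emit only those lines. The function name sample_by_line_number and its parameters are made up for this example.

```python
import random


def sample_by_line_number(lines, total_lines, k, rng=None):
    """Yield k lines from `lines`, chosen uniformly, in file order.

    Hypothetical helper: `total_lines` must be the real number of lines,
    and k must not exceed it (random.sample raises ValueError otherwise).
    """
    if rng is None:
        rng = random.Random()
    # Pick k distinct 0-based line numbers up front.
    wanted = set(rng.sample(range(total_lines), k))
    for number, line in enumerate(lines):
        if number in wanted:
            yield line
            wanted.discard(number)
            if not wanted:
                break  # every chosen line was emitted, stop reading


if __name__ == "__main__":
    fake_file = (f"row {i}\n" for i in range(1000))
    for line in sample_by_line_number(fake_file, 1000, 3):
        print(line, end="")
```

Only the set of k line numbers and the current line are kept in memory, and reading stops as soon as the last chosen line has been seen.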

Actually, since you're interested in multiple lines, you should look at "Efficiently selecting a set of random elements from a linked list".
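A technique commonly suggested for picking several random elements from a sequence you can only walk through once is reservoir sampling. I am assuming that is the kind of approach the linked question points to; the sketch below is a plain version of it (often called Algorithm R), and the function name reservoir_sample is my own.

```python
import random


def reservoir_sample(items, k, rng=None):
    """Return up to k items chosen uniformly from a single pass over `items`."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for index, item in enumerate(items):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1),
            # overwriting a randomly chosen slot.
            j = rng.randint(0, index)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    fake_file = (f"row {i}\n" for i in range(100_000))
    print(reservoir_sample(fake_file, 5))
```

This does not need the line count in advance and holds at most k lines at a time, but it always reads the input to the end.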

Licensed under: CC-BY-SA with attribution