Question

I have a huge sorted text file (10 million lines). I want to split it into smaller files of about 10,000 lines each, with one constraint: no file's last line may share its first field with the first line of any other file. In other words, each split point should fall near a multiple of 10,000 lines, but only where the line before the split has a different first field from the line after it.

I thought of a tedious approach: in another file, somehow print every possible split location, i.e. every N where line N and line N+1 have different first fields, then write a program to select only those split points near multiples of 10,000. But then I can't use the split command, since it only supports a fixed number of lines per output file.

How can this be solved?


Solution

Something like this, untested, should do it:

awk '
nr == 10000            { got10k = 1 }   # current file has reached the target size
got10k && ($1 != prev) { nr = 0 }       # first field changed: safe to split here
++nr == 1              { fileName = "outfile" ++numFiles; got10k = 0 }   # open next chunk
{ print > fileName; prev = $1 }         # write the line, remember its first field
' file
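To sanity-check the result, a short shell loop can confirm that no file ends with the same first field that the next file starts with. This sketch assumes the `outfile1`, `outfile2`, … names produced by the script above; the 25-group test input is made up for illustration:

```shell
# Build a small sorted test input: 25 groups of 1000 lines each.
awk 'BEGIN { for (g = 1; g <= 25; g++) for (i = 1; i <= 1000; i++)
             printf "key%03d %d\n", g, i }' > file

# Split with the answer's script (10k target per file).
awk '
nr == 10000            { got10k = 1 }
got10k && ($1 != prev) { nr = 0 }
++nr == 1              { fileName = "outfile" ++numFiles; got10k = 0 }
{ print > fileName; prev = $1 }
' file

# Verify: the last first-field of each file must differ from the
# first first-field of the file after it.
ok=1
n=$(ls outfile* | wc -l)
for i in $(seq 1 $((n - 1))); do
  last=$(tail -n 1 "outfile$i" | awk '{print $1}')
  first=$(head -n 1 "outfile$((i + 1))" | awk '{print $1}')
  [ "$last" = "$first" ] && ok=0
done
echo "boundaries ok: $ok"
```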

Other tips

You can use awk to write a small script that routes lines to different output files.

Keep a count of the lines written to the current file. Once it passes 10,000, compare each line's first field with the previous line's: if they match, keep writing to the same file; as soon as they differ, start a new file.
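That idea can be sketched as a variant of the accepted answer. The `part0001`-style naming is hypothetical, and the explicit `close()` is an extra precaution so a very large split never runs out of file descriptors:

```shell
# Toy sorted input for illustration: 12 groups ("a001".."a012") of 3000 lines each.
awk 'BEGIN { for (g = 1; g <= 12; g++) for (i = 1; i <= 3000; i++)
             printf "a%03d %d\n", g, i }' > file

awk '
# Once 10000 lines are in the current chunk, split at the first
# line whose first field differs from the previous line.
count >= 10000 && $1 != prev {
    close(out)                 # release the finished chunk
    count = 0
}
count == 0 { out = "part" sprintf("%04d", ++nfiles) }   # e.g. part0001
{ print > out; prev = $1; count++ }
' file
```

Because each group here spans 3000 lines, the chunks land on group boundaries (12,000 lines each) rather than exactly 10,000, which is the intended behavior.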

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow