Question
I have a huge sorted text file (10 million lines) that I want to split into smaller files of about 10,000 lines each. However, no file's last line may have the same first field as the first line of any other file. In other words, each split point should be near the 10k'th line, but placed so that the line before the split has a different first field than the line after it.
I thought of a tedious approach: in another file, somehow print all the possible split locations where line N and line N+1 have different first fields, then write a program to select only those split points that are near multiples of 10k. But then I can't use the split command, since it only allows a fixed number of lines per output file.
How to solve the problem?
Solution
Something like this, untested, should do it:
awk '
nr == 10000 { got10k = 1 }              # reached the 10k line threshold
got10k && ($1 != prev) { nr = 0 }       # first field changed: split here
++nr == 1 {                             # first line of a new chunk
    if (fileName) close(fileName)       # ~1,000 output files: close each one to avoid running out of file descriptors
    fileName = "outfile" ++numFiles
    got10k = 0
}
{ print > fileName; prev = $1 }
' file
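To sanity-check the logic without waiting on a 10-million-line input, the same script can be run on a tiny generated sample with the threshold lowered (here 5 instead of 10000; the sample input and the outfile names are just for this demo):

```shell
#!/bin/sh
# Generate a small sorted sample: 25 ids, 3 lines each (75 lines total).
awk 'BEGIN { for (i = 1; i <= 25; i++) for (j = 1; j <= 3; j++) print i, "data" }' > file

# Same logic as above, with the 10000-line threshold lowered to 5.
awk '
nr == 5 { got5 = 1 }
got5 && ($1 != prev) { nr = 0 }
++nr == 1 { fileName = "outfile" ++numFiles; got5 = 0 }
{ print > fileName; prev = $1 }
' file

# Verify: the first field of the last line of outfileN must differ
# from the first field of the first line of outfileN+1.
n=1
while [ -f "outfile$((n + 1))" ]; do
  last=$(tail -n 1 "outfile$n" | cut -d" " -f1)
  first=$(head -n 1 "outfile$((n + 1))" | cut -d" " -f1)
  [ "$last" != "$first" ] || { echo "bad split between $n and $((n + 1))"; exit 1; }
  n=$((n + 1))
done
echo "OK: $n files, all boundaries clean"
```

With 3 lines per id and a threshold of 5, each chunk ends up holding two full ids (6 lines), so every boundary falls on an id change.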
Other tips
You can use the awk command to write a small script that writes the data to different files. After reading/writing 10k lines, keep comparing the current line's first field with the previous line's: while they match, continue writing to the same file; as soon as they differ, start writing to a new file.
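A minimal sketch of that idea follows. The part_ file names and the generated sample input are hypothetical, and close() is added so the number of simultaneously open files stays bounded, which matters when a 10M-line input yields around a thousand pieces:

```shell
#!/bin/sh
# Hypothetical sorted input: 10,000 ids, 3 lines each (30,000 lines).
awk 'BEGIN { for (i = 1; i <= 10000; i++) for (j = 1; j <= 3; j++) print i, "x" }' > file

awk '
count >= 10000 && $1 != prev {       # past the threshold and the field just changed
    close(out)                       # release the finished file
    count = 0
}
count++ == 0 { out = "part_" ++n }   # first line of a new piece
{ print > out; prev = $1 }
' file

wc -l part_*   # each piece runs slightly past 10k, ending on a field change
```

On this sample the split lands three lines past the threshold each time (ids span 3 lines), producing three pieces of 10002, 10002, and 9996 lines.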