Question
I have a huge sorted text file (10 million lines) that I want to split into smaller files of about 10,000 lines each. However, no file's last line may have the same first field as the first line of any other file. In other words, each split point should be near the 10k'th line, but placed so that the line before the split has a different first field than the line after it.
I thought of a tedious approach: in another file, somehow print all the possible split locations where line N and line N+1 have different first fields, then write a program to select only those split points that are near multiples of 10k. But then I can't use the split command, since it only allows a fixed number of lines per output file.
How to solve the problem?
Solution
Something like this, untested, should do it:
awk '
nr == 10000 { got10k = 1 }              # reached the 10k line threshold
got10k && ($1 != prev) { nr = 0 }       # first field changed: split here
++nr == 1 {                             # first line of a new chunk
    if (fileName) close(fileName)       # ~1,000 output files: close each one to avoid running out of file descriptors
    fileName = "outfile" ++numFiles
    got10k = 0
}
{ print > fileName; prev = $1 }
' file
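To sanity-check the logic without waiting on a 10-million-line input, the same script can be run on a tiny generated sample with the threshold lowered (here 5 instead of 10000; the sample input and the outfile names are just for this demo):

```shell
#!/bin/sh
# Generate a small sorted sample: 25 ids, 3 lines each (75 lines total).
awk 'BEGIN { for (i = 1; i <= 25; i++) for (j = 1; j <= 3; j++) print i, "data" }' > file

# Same logic as above, with the 10000-line threshold lowered to 5.
awk '
nr == 5 { got5 = 1 }
got5 && ($1 != prev) { nr = 0 }
++nr == 1 { fileName = "outfile" ++numFiles; got5 = 0 }
{ print > fileName; prev = $1 }
' file

# Verify: the first field of the last line of outfileN must differ
# from the first field of the first line of outfileN+1.
n=1
while [ -f "outfile$((n + 1))" ]; do
  last=$(tail -n 1 "outfile$n" | cut -d" " -f1)
  first=$(head -n 1 "outfile$((n + 1))" | cut -d" " -f1)
  [ "$last" != "$first" ] || { echo "bad split between $n and $((n + 1))"; exit 1; }
  n=$((n + 1))
done
echo "OK: $n files, all boundaries clean"
```

With 3 lines per id and a threshold of 5, each chunk ends up holding two full ids (6 lines), so every boundary falls on an id change.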
Other tips
You can use the awk command to write a small script that writes the data to different files. After reading/writing 10k lines, keep comparing the current line's first field with the previous line's: while they match, continue writing to the same file; as soon as they differ, start writing to a new file.
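A minimal sketch of that idea follows. The part_ file names and the generated sample input are hypothetical, and close() is added so the number of simultaneously open files stays bounded, which matters when a 10M-line input yields around a thousand pieces:

```shell
#!/bin/sh
# Hypothetical sorted input: 10,000 ids, 3 lines each (30,000 lines).
awk 'BEGIN { for (i = 1; i <= 10000; i++) for (j = 1; j <= 3; j++) print i, "x" }' > file

awk '
count >= 10000 && $1 != prev {       # past the threshold and the field just changed
    close(out)                       # release the finished file
    count = 0
}
count++ == 0 { out = "part_" ++n }   # first line of a new piece
{ print > out; prev = $1 }
' file

wc -l part_*   # each piece runs slightly past 10k, ending on a field change
```

On this sample the split lands three lines past the threshold each time (ids span 3 lines), producing three pieces of 10002, 10002, and 9996 lines.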