Question

I am trying to remove duplicate lines from a file with 1.8 million records and write the result to a new file, using the following command:

sort tmp1.csv | uniq -c | sort -nr > tmp2.csv

Running the script creates a new file sort.exe.stackdump with the following information:

"Exception: STATUS_ACCESS_VIOLATION at rip=00180144805
..
..
program=C:\cygwin64\bin\sort.exe, pid 6136, thread main
cs=0033 ds=002B es=002B fs=0053 gs=002B ss=002B"

The script works for a small file with 10 lines. It seems that sort.exe cannot handle this many records. How do I work with such a large file with more than 1.8 million records? We do not have any database other than Access, and I was trying to do this manually in Access.
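
As an aside on the pipeline itself: uniq -c prefixes each output line with its occurrence count, so tmp2.csv is not a plain deduplicated copy of the input. If the counts are not needed and the goal is only to drop duplicates, GNU sort's -u flag does the whole job in one step:

# -u emits a single copy of each distinct line, with no counts prepended
sort -u tmp1.csv > tmp2.csv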

Solution 2

The following awk command turned out to be a much faster way to get rid of the duplicate values:

awk '!v[$0]++' "$FILE2" > tmp.csv

where $FILE2 is the name of the file containing the duplicate values.
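
The one-liner works because awk's associative array v is keyed by the entire input line ($0), and the post-increment means the pattern is true (so the line is printed) only the first time that line is seen. A spelled-out equivalent, using the same names as above:

awk '{
    if (v[$0] == 0) print   # first time this exact line appears: print it
    v[$0]++                 # record that the line has been seen
}' "$FILE2" > tmp.csv

Unlike the sort-based pipeline, this makes a single pass and preserves the original line order, at the cost of keeping one array entry per distinct line in memory.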

Other tips

It sounds like your sort command is broken. Since the path says cygwin, I'm assuming this is GNU sort, which should generally have no problem with this task given sufficient memory and disk space. Try playing with the flags that control where and how much it uses the disk: http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html
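
For example, GNU sort's -S option sets the in-memory buffer size and -T points its temporary files at a directory of your choosing; the size and path below are only placeholders to adapt:

# Give sort a larger buffer and a temp directory on a drive with free
# space (placeholder values), then dedupe and count as before.
sort -S 512M -T /cygdrive/c/tmp tmp1.csv | uniq -c | sort -nr > tmp2.csv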
