Question

My task is the following: I have two CSV files:

File 1 (~9,000,000 records):
type(text),number,status(text),serial(number),data1,data2,data3

File 2 (~6,000 records):
serial_range_start(number),serial_range_end(number),info1(text),info2(text)

The goal is to add to each entry in File 1 the corresponding info1 and info2 from File 2: type(text),number,status(text),serial(number),data1,data2,data3,info1(text),info2(text)

I use the following script:

#!/bin/bash
USER="file1.csv"
RANGE="file2.csv"

for SN in `cat $USER | awk  -F , '{print $4}'`
    do
#echo \n "$SN"
    for LINE in `cat $RANGE` 
            do
    i=`grep $LINE $RANGE| awk -F, '{print $1}'`
    #echo \n "i= " "$i"
    j=`grep $LINE $RANGE| awk -F, '{print $2}'`
    #echo \n "j= " "$j"
    k=`echo $SN`
    #echo \n "k= " "$k"
    if [ $k -ge $i -a $k -le $j ] 
        then
        echo `grep $SN $USER`,`grep $i $RANGE| cut -d',' -f3-4` >> result.csv
        break
        #else
        #echo `grep $SN $USER`,`echo 'N/A','N/A'` >> result.csv
    fi
    done
done

The script works rather well on small files, but I'm sure there is a way to optimize it, because I am running it on an i5 laptop with 4 GB of RAM. I am a newbie at shell scripting; I came up with this script after hours and hours of research, trial and error, but now I am out of ideas.

note: not all the info in file 1 can be found in file 2.

Thank you!
Adrian.

FILE EXAMPLES and additional info:

File 1 example:

prep,28620026059,Active,123452010988759,No,No,No
post,28619823474,Active,123453458466109,Yes,No,No
post,28619823474,Inactive,123453395270941,Yes,Yes,Yes

File 2 example:

123452010988750,123452010988759,promo32,1.11
123453458466100,123453458466199,promo64,2.22
123450000000000,123450000000010,standard128,3.333

Result example (currently):

prep,28620026059,Active,123452010988759,No,No,No,promo32,1.11
post,28619823474,Active,123453458466109,Yes,No,No,promo64,2.22

Result example (nice to have):

prep,28620026059,Active,123452010988759,No,No,No,promo32,1.11
post,28619823474,Active,123453458466109,Yes,No,No,promo64,2.22
post,28619823474,Inactive,123453395270941,Yes,Yes,Yes,NA,NA

File 1 is sorted by the 4th column; File 2 is sorted by the first column.

File 2 does not have ranges that overlap

Not all the info in file 1 can be found in a range in file 2

Thanks again!

Later edit: The script provided by Jonathan seems to have an issue on some records, as follows:

file 2:

123456780737000,123456780737012,ONE 32,1.11
123456780016000,123456780025999,ONE 64,2.22

file 1:

Postpaid,24987326427,Active,123456780737009,Yes,Yes,Yes
Postpaid,54234564719,Active,123456780017674,Yes,Yes,Yes

The output is the following:

Postpaid,24987326427,Active,123456780737009,Yes,Yes,Yes,ONE 32,1.11
Postpaid,54234564719,Active,123456780017674,Yes,Yes,Yes,ONE 32,1.11

and it should be:

Postpaid,24987326427,Active,123456780737009,Yes,Yes,Yes,ONE 32,1.11
Postpaid,54234564719,Active,123456780017674,Yes,Yes,Yes,ONE 64,2.22

It seems that it returns 0 and writes the info from the first record of file 2...


Solution

I think this will work reasonably well:

awk -F, 'BEGIN   { n = 0; OFS = ","; }
         NR==FNR { lo[n] = $1; hi[n] = $2; i1[n] = $3; i2[n] = $4; n++ }
         NR!=FNR { for (i = 0; i < n; i++)
                   {
                      if ($4 >= lo[i] && $4 <= hi[i])
                      {
                          print $1, $2, $3, $4, $5, $6, $7, i1[i], i2[i];
                          break;
                      }
                   }
                 }' file2 file1

Given file2 containing:

1,10,xyz,pqr
11,20,abc,def
21,30,ambidextrous,warthog

and file1 containing:

A,123,X2,1,data01_1,data01_2,data01_3
A,123,X2,2,data02_1,data02_2,data02_3
A,123,X2,3,data03_1,data03_2,data03_3
A,123,X2,4,data04_1,data04_2,data04_3
A,123,X2,5,data05_1,data05_2,data05_3
A,123,X2,6,data06_1,data06_2,data06_3
A,123,X2,7,data07_1,data07_2,data07_3
A,123,X2,8,data08_1,data08_2,data08_3
A,123,X2,9,data09_1,data09_2,data09_3
A,123,X2,10,data10_1,data10_2,data10_3
A,123,X2,11,data11_1,data11_2,data11_3
A,123,X2,12,data12_1,data12_2,data12_3
A,123,X2,13,data13_1,data13_2,data13_3
A,123,X2,14,data14_1,data14_2,data14_3
A,123,X2,15,data15_1,data15_2,data15_3
A,123,X2,16,data16_1,data16_2,data16_3
A,123,X2,17,data17_1,data17_2,data17_3
A,123,X2,18,data18_1,data18_2,data18_3
A,123,X2,19,data19_1,data19_2,data19_3
A,223,X2,20,data20_1,data20_2,data20_3
A,223,X2,21,data21_1,data21_2,data21_3
A,223,X2,22,data22_1,data22_2,data22_3
A,223,X2,23,data23_1,data23_2,data23_3
A,223,X2,24,data24_1,data24_2,data24_3
A,223,X2,25,data25_1,data25_2,data25_3
A,223,X2,26,data26_1,data26_2,data26_3
A,223,X2,27,data27_1,data27_2,data27_3
A,223,X2,28,data28_1,data28_2,data28_3
A,223,X2,29,data29_1,data29_2,data29_3

the output of the command is:

A,123,X2,1,data01_1,data01_2,data01_3,xyz,pqr
A,123,X2,2,data02_1,data02_2,data02_3,xyz,pqr
A,123,X2,3,data03_1,data03_2,data03_3,xyz,pqr
A,123,X2,4,data04_1,data04_2,data04_3,xyz,pqr
A,123,X2,5,data05_1,data05_2,data05_3,xyz,pqr
A,123,X2,6,data06_1,data06_2,data06_3,xyz,pqr
A,123,X2,7,data07_1,data07_2,data07_3,xyz,pqr
A,123,X2,8,data08_1,data08_2,data08_3,xyz,pqr
A,123,X2,9,data09_1,data09_2,data09_3,xyz,pqr
A,123,X2,10,data10_1,data10_2,data10_3,xyz,pqr
A,123,X2,11,data11_1,data11_2,data11_3,abc,def
A,123,X2,12,data12_1,data12_2,data12_3,abc,def
A,123,X2,13,data13_1,data13_2,data13_3,abc,def
A,123,X2,14,data14_1,data14_2,data14_3,abc,def
A,123,X2,15,data15_1,data15_2,data15_3,abc,def
A,123,X2,16,data16_1,data16_2,data16_3,abc,def
A,123,X2,17,data17_1,data17_2,data17_3,abc,def
A,123,X2,18,data18_1,data18_2,data18_3,abc,def
A,123,X2,19,data19_1,data19_2,data19_3,abc,def
A,223,X2,20,data20_1,data20_2,data20_3,abc,def
A,223,X2,21,data21_1,data21_2,data21_3,ambidextrous,warthog
A,223,X2,22,data22_1,data22_2,data22_3,ambidextrous,warthog
A,223,X2,23,data23_1,data23_2,data23_3,ambidextrous,warthog
A,223,X2,24,data24_1,data24_2,data24_3,ambidextrous,warthog
A,223,X2,25,data25_1,data25_2,data25_3,ambidextrous,warthog
A,223,X2,26,data26_1,data26_2,data26_3,ambidextrous,warthog
A,223,X2,27,data27_1,data27_2,data27_3,ambidextrous,warthog
A,223,X2,28,data28_1,data28_2,data28_3,ambidextrous,warthog
A,223,X2,29,data29_1,data29_2,data29_3,ambidextrous,warthog

This uses a linear search on the list of ranges; you can write functions in awk, and a binary search for the correct range would perform better on 6,000 entries. That part, though, is an optimization, left as an exercise for the reader. Remember that the first rule of optimization is: don't. The second rule of optimization (for experts only) is: don't do it yet; demonstrate that it is a problem. This code shouldn't take all that much longer than the time it takes to copy the 9,000,000-record file (somewhat longer, but not disastrously so). Note, though, that if the file1 data is sorted, the tail of the processing will take longer than the start because of the linear search. If the serial numbers are in a random order, then it will all take about the same time on average.

If your CSV data has commas embedded in the text fields, then awk is no longer suitable; you need a tool with explicit support for the CSV format. Perl and Python both have suitable modules.
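As an illustrative sketch (file names and sample data are made up), the same range lookup can be done with Python's `csv` and `bisect` modules, driven from the shell via a here-doc; the quoted field in the sample file1 row is exactly the case that breaks `awk -F,`:

```shell
# Sample data: a file1 row whose second field contains a comma inside quotes.
printf '%s\n' '1,10,xyz,pqr' '11,20,abc,def' > file2.csv
printf '%s\n' 'A,"x, y",X2,5,d1,d2,d3' > file1.csv

python3 - <<'EOF'
import bisect
import csv

# Load the ranges and sort them by range start for the binary search.
with open('file2.csv', newline='') as f:
    ranges = sorted((int(lo), int(hi), i1, i2) for lo, hi, i1, i2 in csv.reader(f))
starts = [r[0] for r in ranges]

with open('file1.csv', newline='') as f, \
     open('result.csv', 'w', newline='') as out:
    w = csv.writer(out, lineterminator='\n')
    for row in csv.reader(f):
        sn = int(row[3])
        i = bisect.bisect_right(starts, sn) - 1   # candidate range index
        if i >= 0 and sn <= ranges[i][1]:
            w.writerow(row + [ranges[i][2], ranges[i][3]])
        else:
            w.writerow(row + ['NA', 'NA'])        # serial not in any range
EOF
cat result.csv
```

The `csv` module parses `"x, y"` as a single field and re-quotes it on output, where `-F,` in awk would have split it in two.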


Answer to Exercise for the Reader

awk -F, 'BEGIN   { n = 0; OFS = ","; }
         NR==FNR { lo[n] = $1; hi[n] = $2; i1[n] = $3; i2[n] = $4; n++ }
         NR!=FNR { i = search($4)
                   print $1, $2, $3, $4, $5, $6, $7, i1[i], i2[i];
                 }
         function search(i,    l, h, m)
         {
             l = 0; h = n - 1;
             while (l <= h)
             {
                 m = int((l + h)/2);
                 if (i >= lo[m] && i <= hi[m])
                     return m;
                 else if (i < lo[m])
                     h = m - 1;
                 else
                     l = m + 1;
             }
             return 0;   # Should not get here
         }' file2 file1

Not all that hard to write the binary search. This gives the same result as the original script on the sample data. It has not been exhaustively tested, but appears to work.

Note that the code does not really handle missing ranges in file2; it assumes that the ranges are contiguous but non-overlapping and in sorted order and cover all the values that can appear in the serial column of file1. If those assumptions are not valid, you get erratic behaviour until you fix either the code or the data.
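One way to relax the "ranges cover every serial" assumption is to have the search report failure and emit `NA,NA` for unmatched serials, which also produces the "nice to have" output from the question. A sketch with tiny illustrative data (it still assumes file2 is sorted numerically by range start, e.g. via `sort -t, -k1,1n`; `search()` returns -1 instead of 0 when no range matches):

```shell
# Illustrative stand-ins for the real files; file2.csv must be sorted
# numerically by range start for the binary search to work.
printf '%s\n' '1,10,xyz,pqr' '11,20,abc,def' > file2.csv
printf '%s\n' 'A,123,X2,5,d1,d2,d3' \
              'A,123,X2,15,d1,d2,d3' \
              'A,123,X2,99,d1,d2,d3' > file1.csv

awk -F, 'BEGIN   { n = 0; OFS = "," }
         NR==FNR { lo[n] = $1; hi[n] = $2; i1[n] = $3; i2[n] = $4; n++; next }
                 { i = search($4)
                   if (i >= 0) print $0, i1[i], i2[i]
                   else        print $0, "NA", "NA"   # serial not in any range
                 }
         function search(v,    l, h, m)
         {
             l = 0; h = n - 1
             while (l <= h)
             {
                 m = int((l + h) / 2)
                 if (v >= lo[m] && v <= hi[m]) return m
                 else if (v < lo[m])           h = m - 1
                 else                          l = m + 1
             }
             return -1   # no matching range
         }' file2.csv file1.csv > result.csv
cat result.csv
```

Here serial 99 falls outside both ranges, so its row ends in `NA,NA` instead of silently picking up the info from the first record.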

Other hints

On Unix you can use the join command (type `man join` for more information), which can be configured to work much like a join operation in a database. That may help you add the information from File 2 to File 1.
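Note that `join` matches exact keys, not ranges, so it applies directly only if you first build a lookup file keyed by the serial itself. A sketch of the mechanics with a hypothetical `lookup.csv` and made-up data (both inputs must be sorted on the join field, lexically by default; `-a 1 -e NA` emits NA for unmatched rows):

```shell
# Hypothetical exact-serial lookup file (serial,info1,info2) and a small
# file1 sample; join requires both inputs sorted on the join field.
printf '%s\n' '15,abc,def' '5,xyz,pqr' | sort -t, -k1,1 > lookup.csv
printf '%s\n' 'A,123,X2,15,d1,d2,d3' \
              'A,123,X2,5,d1,d2,d3' \
              'A,123,X2,99,d1,d2,d3' | sort -t, -k4,4 > file1.sorted.csv

# Join on field 4 of file1 and field 1 of the lookup; keep unmatched
# file1 rows (-a 1) with NA substituted for the missing fields (-e NA).
join -t, -1 4 -2 1 -a 1 -e NA \
     -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,2.2,2.3 \
     file1.sorted.csv lookup.csv > joined.csv
cat joined.csv
```

For the actual range lookup in this question, though, the awk solution above remains the better fit.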

License: CC-BY-SA with attribution
Not affiliated with Stack Overflow