replace 4th column from the last and also pick unique value from 3rd column at the same time

StackOverflow https://stackoverflow.com/questions/22768817

24-06-2023

Question

I have two files, both of them pipe-delimited.

First file: it has around 10 columns, but I am only interested in the first two, which are used to update a column value in the second file.

First file detail:

1|alpha|s3.3|4|6|7|8|9
2|beta|s3.3|4|6|7|8|9
20|charlie|s3.3|4|6|7|8|9
6|romeo|s3.3|4|6|7|8|9

Second file detail:

a1|a2|**bob**|a3|a4|a5|a6|a7|a8|**1**|a10|a11|a12
a1|a2|**ray**|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|**kate**|a3|a4|a5|a6|a7|a8|**20**|a10|a11|a12
a1|a2|**bob**|a3|a4|a5|a6|a7|a8|**6**|a10|a11|a12
a1|a2|**bob**|a3|a4|a5|a6|a7|a8|**45**|a10|a11|a12

My requirement is to find the unique values in the 3rd column and, at the same time, replace the 4th column from the last. That 4th-from-last column may or may not contain a number. When it does, the number also appears in the first field of the first file, and I need to replace it (in the second file) with the corresponding value from the second column of the first file.

expected output:

unique string : ray kate bob

a1|a2|bob|a3|a4|a5|a6|a7|a8|**alpha**|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|**charlie**|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|**romeo**|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12

I am able to pick the unique strings using the command below:

awk -F'|' '{a[$3]++}END{for(i in a){print i}}' filename
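For example, run against the sample second file (called file2 here for illustration), it prints the three names, in whatever order awk's internal hashing happens to produce:

$ awk -F'|' '{a[$3]++}END{for(i in a){print i}}' file2
kate
bob
ray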

I don't want to read the second file twice, once to pick the unique strings and again to replace the 4th column from the last, because the files are huge: around 500 MB each, and there are many such files.

Currently I am using the Perl Text::CSV module to read the first file (which is small) and load its first two columns into a hash, with the first column as key and the second as value. I then read the second file and replace the 4th-from-last column with the hash value. But this seems time-consuming, as Text::CSV parsing appears to be slow.

Any awk/perl solution keeping speed in mind would be really helpful :)

Note: Ignore the ** asterisks around the text; they are just highlighting and are not part of the data.

UPDATE: Code

#!/usr/bin/perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);   # the module is Scalar::Util, not Scalar::Utils
use Text::CSV;

my %hash;
my $csv = Text::CSV->new({ sep_char => '|' });

my $file = $ARGV[0] or die "Need to get CSV file on the command line\n";

# Load the mapping file: first column is the key, second the value.
open(my $data, '<', $file) or die "Could not open '$file' $!\n";
while (my $line = <$data>) {
    chomp $line;

    if ($csv->parse($line)) {
        my @fields = $csv->fields();
        $hash{$fields[0]} = $fields[1];    # @fields, not @field
    } else {
        warn "Line could not be parsed: $line\n";
    }
}
close($data);

# Reconfigure the parser for the data file (no second "my", which
# would mask the earlier declaration).
$csv = Text::CSV->new({ sep_char => '|', blank_is_undef => 1, eol => "\n" });
my $file2 = $ARGV[1] or die "Need to get CSV file on the command line\n";

open(my $fh, '>', '/tmp/outputfile') or die "Could not open file $!\n";
open(my $data2, '<', $file2) or die "Could not open '$file2' $!\n";
while (my $line = <$data2>) {
    chomp $line;

    if ($csv->parse($line)) {
        my @fields = $csv->fields();
        # Replace the 4th-from-last field only when it maps to something;
        # otherwise (e.g. 45) leave it untouched.
        if (defined $fields[-4] && looks_like_number($fields[-4]) && exists $hash{$fields[-4]}) {
            $fields[-4] = $hash{$fields[-4]};
        }
        $csv->print($fh, \@fields);
    } else {
        warn "Line could not be parsed: $line\n";
    }
}
close($data2);
close($fh);

Solution 2

Use getline instead of parse; it is much faster. The following is a more idiomatic way of performing this task. Note that you can reuse the same Text::CSV object for multiple files.

#!/usr/bin/perl

use strict;
use warnings;
use 5.010;

use Text::CSV;

my $csv = Text::CSV->new({
    auto_diag      => 1,
    binary         => 1,
    blank_is_undef => 1,
    eol            => $/,
    sep_char       => '|'
}) or die "Can't use CSV: " . Text::CSV->error_diag;

open my $map_fh, '<', 'map.csv' or die "map.csv: $!";

my %mapping;
while (my $row = $csv->getline($map_fh)) {
    $mapping{ $row->[0] } = $row->[1];
}

close $map_fh;

open my $in_fh, '<', 'input.csv' or die "input.csv: $!";
open my $out_fh, '>', 'output.csv' or die "output.csv: $!";

my %seen;
while (my $row = $csv->getline($in_fh)) {
    $seen{ $row->[2] } = 1;

    my $key = $row->[-4];
    $row->[-4] = $mapping{$key} if defined $key and exists $mapping{$key};
    $csv->print($out_fh, $row);
}

close $in_fh;
close $out_fh;

say join ',', keys %seen;
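If you want to verify the parse-versus-getline speed claim on your own data, Perl's core Benchmark module makes a quick comparison easy. Here is a minimal, self-contained sketch; the sample row is made up, and cmpthese(-2, ...) runs each sub for at least two CPU seconds:

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Text::CSV;

my $csv = Text::CSV->new({ sep_char => '|', binary => 1 });

my $row_str = 'a1|a2|bob|a3|a4|a5|a6|a7|a8|1|a10|a11|a12';
my $fh_data = $row_str . "\n";

# In-memory filehandle so getline has something to read from.
open my $fh, '<', \$fh_data or die $!;

cmpthese(-2, {
    parse   => sub { $csv->parse($row_str); my @f = $csv->fields },
    getline => sub { seek $fh, 0, 0; my $row = $csv->getline($fh) },
});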

map.csv

1|alpha|s3.3|4|6|7|8|9
2|beta|s3.3|4|6|7|8|9
20|charlie|s3.3|4|6|7|8|9
6|romeo|s3.3|4|6|7|8|9

input.csv

a1|a2|bob|a3|a4|a5|a6|a7|a8|1|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|20|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|6|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12

output.csv

a1|a2|bob|a3|a4|a5|a6|a7|a8|alpha|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|charlie|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|romeo|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12

STDOUT

kate,bob,ray
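Note that keys %seen come back in arbitrary hash order (hence kate,bob,ray here). To reproduce the stable "unique string : ray kate bob" line from the expected output, sort the keys before printing:

say 'unique string : ', join ' ', reverse sort keys %seen;

(reverse sort happens to match the order shown in the question; a plain sort would print bob kate ray.)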

OTHER TIPS

Here's an option that doesn't use Text::CSV:

use strict;
use warnings;

@ARGV == 3 or die 'Usage: perl firstFile secondFile outFile';

my ( %hash, %seen );
local $" = '|';    # interpolate arrays with '|' between elements

# <> starts with the first file on @ARGV; "last if eof" stops this
# loop at the end of that file, leaving the second file for the
# loop below.
while (<>) {
    my ( $key, $val ) = split /\|/, $_, 3;
    $hash{$key} = $val;
    last if eof;
}

# pop removes the output file name from @ARGV before <> reads on.
open my $outFH, '>', pop or die $!;

while (<>) {
    my @F = split /\|/;
    $seen{ $F[2] } = undef;
    $F[-4] = $hash{ $F[-4] } if exists $hash{ $F[-4] };
    print $outFH "@F";    # "@F" joins the fields with $" = '|'
}

close $outFH;

print 'unique string : ', join( ' ', reverse sort keys %seen ), "\n";

Command-line usage: perl firstFile secondFile outFile

Contents of outFile from your datasets (asterisks removed):

a1|a2|bob|a3|a4|a5|a6|a7|a8|alpha|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|charlie|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|romeo|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12

STDOUT:

unique string : ray kate bob

Hope this helps!

This awk should work.

$ awk '
BEGIN { FS = OFS = "|" }
NR==FNR { a[$1] = $2; next }
{ unique[$3]++ }
{ $(NF-3) = (a[$(NF-3)]) ? a[$(NF-3)] : $(NF-3) }1
END {
    for(n in unique) print n > "unique.txt"
}' file1 file2 > output.txt

Explanation:

  •  We set the input and output field separators to |.
  •  We iterate through the first file, creating an array that stores column one as the key and column two as the value.
  •  Once the first file is loaded in memory, we build another array while reading the second file. This array stores the unique values from column three of the second file.
  •  While reading the second file, we check whether the fourth value from the last is present in the array built from the first file. If it is, we replace it with the mapped value; if not, we leave the existing value as is. (Testing with ($(NF-3) in a) would be slightly more robust, since the truthiness test misses a mapped value that is 0 or an empty string.)
  •  In the END block we iterate through the unique array and print it to a file called unique.txt, which holds all the unique entries seen in column three of the second file.
  •  The entire output of the second file is redirected to output.txt, which now has the modified fourth column from the last.

$ cat output.txt
a1|a2|bob|a3|a4|a5|a6|a7|a8|alpha|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|charlie|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|romeo|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12

$ cat unique.txt
kate
bob
ray
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow