Question

I'm processing a huge file with (GNU) awk, (other available tools are: Linux shell tools, some old (>5.0) version of Perl, but can't install modules).

My problem: if some field1, field2, field3 contain X, Y, Z I must search for a file in another directory which contains field4, and field5 on one line, and insert some data from the found file to the current output.

E.g.:

Actual file line:

f1 f2 f3 f4 f5
X  Y  Z  A  B

Now I need to search for another file (in another directory), which contains e.g.

f1 f2 f3 f4
A  U  B  W

And write to STDOUT $0 from the original file, and f2 and f3 from the found file, then process the next line of the original file.

Is it possible to do it with awk?

Was it helpful?

Solution

Let me start out by saying that your problem description isn't really that helpful. Next time, please just be more specific: You might be missing out on much better solutions.

So from your description, I understand you have two files which contain whitespace-separated data. In the first file, you want to match the first three columns against some search pattern. If found, you want to find all lines in another file which contain the fourth and and fifth column of the matching line in the first file. From those lines, you need to extract the second and third column and then print the first column of the first file and the second and third from the second file. Okay, here goes:

#!/usr/bin/env perl -nwa
use strict;
use File::Find 'find';
my @search = qw(X Y Z);

# if you know in advance that the otherfile isn't
# huge, you can cache it in memory as an optimization.

# with any more columns, you want a loop here:
if ($F[0] eq $search[0]
    and $F[1] eq $search[1]
    and $F[2] eq $search[2])
{
  my @files;
  find(sub {
      return if not -f $_;
      # verbatim search for the columns in the file name.
      # I'm still not sure what your file-search criteria are, though.
      push @files, $File::Find::name if /\Q$F[3]\E/ and /\Q$F[4]\E/;
      # alternatively search for the combination:
      #push @files, $File::Find::name if /\Q$F[3]\E.*\Q$F[4]\E/;
      # or search *all* files in the search path?
      #push @files, $File::Find::name;
    }, '/search/path'
  )
  foreach my $file (@files) {
    open my $fh, '<', $file or die "Can't open file '$file': $!";
    while (defined($_ = <$fh>)) {
      chomp;
      # order of fields doesn't matter per your requirement.
      my @cols = split ' ', $_;
      my %seen = map {($_=>1)} @cols;
      if ($seen{$F[3]} and $seen{$F[4]}) {
        print join(' ', $F[0], @cols[1,2]), "\n";
      }
    }
    close $fh;
  }
} # end if matching line

Unlike another poster's solution which contains lots of system calls, this doesn't fall back to the shell at all and thus should be plenty fast.

OTHER TIPS

This is the type of work that got me to move from awk to perl in the first place. If you are going to accomplish this, you may actually find it easier to create a shell script that creates awk script(s) to query and then update in separate steps.

(I've written such a beast for reading/updating windows-ini-style files - it's ugly. I wish I could have used perl.)

I often see the restriction "I can't use any Perl modules", and when it's not a homework question, it's often just due to a lack of information. Yes, even you can use CPAN contains the instructions on how to install CPAN modules locally without having root privileges. Another alternative is just to take the source code of a CPAN module and paste it into your program.

None of this helps if there are other, unstated, restrictions, like lack of disk space that prevent installation of (too many) additional files.

This seems to work for some test files I set up matching your examples. Involving perl in this manner (interposed with grep) is probably going to hurt the performance a great deal, though...

## perl code to do some dirty work

for my $line (`grep 'X Y Z' myhugefile`) {
    chomp $line;
    my ($a, $b, $c, $d, $e) = split(/ /,$line);
    my $cmd = 'grep -P "' . $d . ' .+? ' . $e .'" otherfile';
    for my $from_otherfile (`$cmd`) {
        chomp $from_otherfile;
        my ($oa, $ob, $oc, $od) = split(/ /,$from_otherfile);
        print "$a $ob $oc\n";
    }
}

EDIT: Use tsee's solution (above), it's much more well-thought-out.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top