Question

I have a Perl script that compares two arrays and returns the entries found in both. I believe the problem arises in a regular expression, where it encounters a hyphen ( - ) inside brackets [].

I am getting the following error:

Invalid [] range "5-3" in regex; marked by <-- HERE in m/>gi|403163623|ref|XP_003323683.2| leucyl-tRNA synthetase [Puccinia graminis f. sp. tritici CRL 75-3 <-- HERE 6-700-3]
MAQSTPSSIQELMDKKQKEATLDMGGNFTKRDDLIRYEKEAQEKWANSNIFQTDSPYIENPELKDLSGEE
LREKYPKFFGTFPYPYMNGSLHLGHAFTISKIEFAVGFERMRGRRALFPVGWHATGMPIKSASDKIIREL
EQFGQDLSKFDSQSNPMIETNEDKSATEPTTASESQDKSKAKKGKIQAKSTGLQYQFQIMESIGVSRTDI
PKFADPQYWLQYFPPIAKNDLNAFGARVDWRRSFITTDINPYYDAFVRWQMNRLKEKGYVKFGERYTIYS
PKDGQPCMDHDRSSGERLGSQEYTCLKMKVLEWGPQAGDLAAKLGGKDVFFV at comparer line 21, <NUC> chunk 168.

I thought the error could be solved by just adding \Q..\E in the regex so as to bypass the [] but this has not worked. Here is my code, and thanks in advance for any and all help that you may offer.

@cyt = <CYT>;
@nuc = <NUC>;

$cyt = join ('',@cyt);
$cyt =~ /\[([^\]]+)\]/g;

@shared = '';

foreach $nuc (@nuc) {
    if ($cyt =~ $nuc) {
        push @shared, $nuc;     
    }
}

print @shared;

What I am trying to achieve with this code is to compare two different lists loaded into the arrays @cyt and @nuc. I compare the name between the [] in each element of one list to the name in [] in the other. All matches are then pushed into @shared. Hope that clarifies it a bit.


Solution

Your question describes a set intersection, which is covered in the Perl FAQ.

How do I compute the difference of two arrays? How do I compute the intersection of two arrays?

Use a hash. Here's code to do both and more. It assumes that each element is unique in a given array:

my (@union, @intersection, @difference);
my %count = ();
foreach my $element (@array1, @array2) { $count{$element}++ }
foreach my $element (keys %count) {
  push @union, $element;
  push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element;
}

Note that this is the symmetric difference, that is, all elements in either A or in B but not in both. Think of it as an xor operation.
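A small runnable illustration of the FAQ snippet may help here. The two arrays are made-up sample data, and the keys are sorted only so that the output order is predictable:

```perl
use strict;
use warnings;

# Hypothetical sample data; each element is unique within its own array.
my @array1 = qw(alpha beta gamma);
my @array2 = qw(beta gamma delta);

my (@union, @intersection, @difference);
my %count;
$count{$_}++ for @array1, @array2;

for my $element (sort keys %count) {
    push @union, $element;
    # Counted twice: present in both arrays. Counted once: present in only one.
    push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element;
}

print "union:        @union\n";         # alpha beta delta gamma
print "intersection: @intersection\n";  # beta gamma
print "difference:   @difference\n";    # alpha delta
```

If an element could appear twice in the same array, the count would no longer distinguish "in both" from "twice in one", which is why the FAQ states the uniqueness assumption.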

Applying it to your problem gives the code below.

Factor out the common code to find the names in the data files. This sub assumes that

  • every [name] will be entirely contained within a given line rather than crossing a newline boundary
  • each line of input will contain at most one [name]

If these assumptions are invalid, please provide more representative samples of your inputs.

Note the use of the /x regex switch that tells the regex parser to ignore most whitespace in patterns. In the code below, this permits visual separation between the brackets that are delimiters and the brackets surrounding the character class that captures names.

sub extract_names {
  my($fh) = @_;

  my %name;
  while (<$fh>) {
    ++$name{$1} if /\[   ([^\]]+)   \]/x;
  }

  %name;
}
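To try the sub without any data files, one option is an in-memory filehandle opened on a string. The sketch below repeats the sub so that it runs on its own; the header lines are invented, modeled on the one shown in the error message, and satisfy both assumptions listed above:

```perl
use strict;
use warnings;

sub extract_names {
    my ($fh) = @_;
    my %name;
    while (<$fh>) {
        # Outer \[ and \] are literal delimiters; [^\]]+ captures the name.
        ++$name{$1} if /\[   ([^\]]+)   \]/x;
    }
    return %name;
}

# Hypothetical input: two header lines, each followed by a sequence line.
my $sample = <<'END';
>gi|1|ref|XP_1| leucyl-tRNA synthetase [Puccinia graminis f. sp. tritici CRL 75-36-700-3]
MAQSTPSSIQELMDKKQKEATLDMGG
>gi|2|ref|XP_2| hypothetical protein [Example organism]
PKDGQPCMDHDRSSGERLGSQ
END

# Opening a reference to a scalar yields a filehandle that reads the string.
open my $fh, '<', \$sample or die "$0: open: $!";
my %names = extract_names($fh);
close $fh;

print "$_\n" for sort keys %names;
```

Because the name is only ever captured and used as a hash key, the hyphens inside it are never interpreted as part of a pattern.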

Your question uses old-fashioned typeglob filehandles. Note that the parameter extract_names expects is a filehandle. Convenient parameter passing is one of many benefits of indirect filehandles, such as those created below.

open my $cyt, "<", "cyt.dat" or die "$0: open: $!";
open my $nuc, "<", "nuc.dat" or die "$0: open: $!";

my %cyt = extract_names $cyt;
my %nuc = extract_names $nuc;

With the names from cyt.dat in the hash %cyt and likewise for nuc.dat and %nuc, the code here iterates over the keys of both hashes and increments the corresponding keys in %shared.

my %shared;
for (keys %cyt, keys %nuc) {
  ++$shared{$_};
}

At this point, %shared represents a set union of the names in cyt.dat and nuc.dat. That is, %shared contains all keys from either %cyt or %nuc. To compute the set intersection, we observe that the value in %shared for a key present in both inputs must be greater than one.

The final pass below iterates over the keys in sorted order (because hash keys are stored internally in an undefined order). It prints the truly shared keys (i.e., those whose values are greater than one) and deletes the rest from %shared.

for (sort keys %shared) {
  if ($shared{$_} > 1) {
    print $_, "\n";
  }
  else {
    delete $shared{$_};
  }
}
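
As for the error message itself: in $cyt =~ $nuc, the contents of $nuc are compiled as a pattern, so the bracketed name is parsed as a character class, and 5-3 inside it is a descending (invalid) range. The question does not show how \Q..\E was applied, so the following is only a sketch, with made-up data, of two ways to match such text literally: \Q...\E around the interpolated variable, or index, which involves no pattern at all.

```perl
use strict;
use warnings;

# Hypothetical data modeled on the header line in the error message.
my $needle   = 'synthetase [Puccinia graminis f. sp. tritici CRL 75-36-700-3]';
my $haystack = ">gi|1| leucyl-tRNA $needle\nMAQSTPSSIQ\n";

# Compiled as a pattern, the bracketed text becomes a character class and
# compilation dies on the range "5-3"; eval traps the fatal error.
my $ok = eval { my $re = qr/$needle/; 1 };
print "unquoted pattern failed: $@" unless $ok;

# \Q...\E escapes regex metacharacters, so the text is matched literally.
my $quoted_hit = $haystack =~ /\Q$needle\E/ ? 1 : 0;

# index does a plain substring search with no regex involved.
my $index_hit = index($haystack, $needle) >= 0 ? 1 : 0;

print "quoted: $quoted_hit, index: $index_hit\n";
```

Either approach avoids the error, but both still test whole strings against each other; the hash-based code above compares only the extracted names.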
Licensed under: CC-BY-SA with attribution