Question

I am trying to modify a file which is set up like this:

chr start   ref alt 
chr1    18884   C   CAAAA
chr1    135419  TATACA  T
chr1    332045  T   TTG
chr1    453838  T   TAC
chr1    567652  T   TG
chr1    602541  TTTA    T
chr1    614937  C   CTCTCTG
chr1    654889  C   CA
chr1    736800  AC  A

I want to modify it such that: if column "ref" is a string >1 (i.e line 2) then I generate 2 new columns where:

first new column = start coordinate-1 second new column = start coordinate+(length of string in ref)+1

therefore, for line 2 output would look like:

chr1 135419 TATACA T 135418 135426

or: if length of string in "ref" = 1 and column "alt"=string of length>1 (i.e. line 1) then

first new column = start coordinate second new column = start coordinate+2

so, output for line 1 would be:

chr1 18884 C CAAAA 18884 18886

I have tried to this in awk but without success My perl is non-existent but would this be the best way? Or maybe in R?

Was it helpful?

Solution

Perl solution. Note that your specification does not mention what to do if both strings are length 1.

#!/usr/bin/perl
use warnings;
use strict;
use feature qw(say);

#use Data::Dumper;
<DATA>; # Skip the header;
while (<DATA>) {
    my ($chr, $start, $ref, $alt) = split;
    my @cols;
    if (1 < length $ref) {
          @cols = ( $start - 1, $start + 1 + length $ref);
    } elsif (1 < length $alt) {
        @cols = ($start, $start + 2);
    } else {
        warn "Don't know what to do at $.\n";
    }
    say join "\t", $chr, $start, $ref, $alt, @cols;
}


__DATA__
chr start   ref alt
chr1    18884   C   CAAAA
chr1    135419  TATACA  T
chr1    332045  T   TTG
chr1    453838  T   TAC
chr1    567652  T   TG
chr1    602541  TTTA    T
chr1    614937  C   CTCTCTG
chr1    654889  C   CA
chr1    736800  AC  A

OTHER TIPS

Here's one way using awk. Run like:

awk -f script.awk file | column -t

Contents of script.awk:

NR==1 {
    next
}

length($3)>1 && length($4)==1 {
    print $0, $2-1, $2+length($3)+1
    next
}

length($3)==1 && length($4)>1 {
    print $0, $2, $2+2
    next
}1

Results:

chr1  18884   C       CAAAA    18884   18886
chr1  135419  TATACA  T        135418  135426
chr1  332045  T       TTG      332045  332047
chr1  453838  T       TAC      453838  453840
chr1  567652  T       TG       567652  567654
chr1  602541  TTTA    T        602540  602546
chr1  614937  C       CTCTCTG  614937  614939
chr1  654889  C       CA       654889  654891
chr1  736800  AC      A        736799  736803

Alternatively, here's the one-liner:

awk 'NR==1 { next } length($3)>1 && length($4)==1 { print $0, $2-1, $2+length($3)+1; next } length($3)==1 && length($4)>1 { print $0, $2, $2+2; next }1' filem | column -t

The code should be pretty self-explanatory. The 1 on the end of the script simply enables default printing (i.e. '1' returns true) of each line. HTH.

Doing it in perl is trivial (but so is in awk):

#!/usr/bin/perl
while (<>) {
  chmop;
  my ($chr,$start,$ref,$alt)=split(/\s+/,$_);
  if (len($ref) > 1) {
print STDOUT
  "$chr\t$start\t$ref\t$alt\t",
    $start+len($ref)+1,"\n";
  } elsif (len($ref)==1) {
print STDOUT
  "$chr\t$start\t$ref\t$alt\t",
    $start+2,"\n";
  } else {
print STDERR "ERROR: ???\n"; #actually impossible
  }
}

Stick it in a file morecols.pl , chmod +x morecols.pl, run more morecols.pl . (Beware, lots of assumptions in this code/instructions). I have a feeling your actual problem is more with programming/text processing then tools or languages. If so, this code is just a stopgap solution....

Cheers.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top