awk cartesian product

https://stackoverflow.com/questions/10180990

01-06-2021
|

Question

I have large tab-separated two-column text file, like this:

...
"001R_FRG3G"    "81941549; 47060116; 49237298"
"002L_FRG3G"    "49237299; 47060117; 81941548"
"002R_IIV3" "106073503; 123808694; 109287880"
...

As you see second column doesn't contain atomic values. That's why i want to "normalise" this file to have something like:

...
"001R_FRG3G"    "81941549"
"001R_FRG3G"    "47060116"
"001R_FRG3G"    "49237298"
"002L_FRG3G"    "49237299"
"002L_FRG3G"    "47060117"
"002L_FRG3G"    "81941548"
"002R_IIV3" "106073503"
"002R_IIV3" "123808694"
"002R_IIV3" "109287880"
...

Anyone knows how to do it effectively?

Solution

Perl:

perl -lne '
s/[";]//g;
($a, @b) = split;
print qq("$a" "$_") for @b;
' FILE

OTHER TIPS

awk '{for (i=2; i<=NF; i++) {gsub(/[";]/, "", $i); printf "%s%s\"%s\"", $1, OFS, $i; printf "%s", "\n"}}' inputfile

For each field after $1, strip quotation marks and semicolons, then print $1 followed by the contents of the field surrounded by quotes. Do this for each line in the input file.

This might work for you (GNU awk):

awk '{while(/;/) $0=gensub(/^((.*[ \t]").*);[ \t]*/,"\\1\"\n\\2",1)};1' file
"001R_FRG3G"    "81941549"
"001R_FRG3G"    "47060116"
"001R_FRG3G"    "49237298"
"002L_FRG3G"    "49237299"
"002L_FRG3G"    "47060117"
"002L_FRG3G"    "81941548"
"002R_IIV3" "106073503"
"002R_IIV3" "123808694"
"002R_IIV3" "109287880"

or, it's not awk but it elegantly solves the problem.

sed -i ':a;s/\(\(.*\s"\).*\);\s*/\1"\n\2/;ta' file
"001R_FRG3G"    "81941549"
"001R_FRG3G"    "47060116"
"001R_FRG3G"    "49237298"
"002L_FRG3G"    "49237299"
"002L_FRG3G"    "47060117"
"002L_FRG3G"    "81941548"
"002R_IIV3" "106073503"
"002R_IIV3" "123808694"
"002R_IIV3" "109287880"

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow