Question
I have a large tab-separated two-column text file, like this:
...
"001R_FRG3G" "81941549; 47060116; 49237298"
"002L_FRG3G" "49237299; 47060117; 81941548"
"002R_IIV3" "106073503; 123808694; 109287880"
...
As you can see, the second column doesn't contain atomic values. That's why I want to "normalise" this file to get something like:
...
"001R_FRG3G" "81941549"
"001R_FRG3G" "47060116"
"001R_FRG3G" "49237298"
"002L_FRG3G" "49237299"
"002L_FRG3G" "47060117"
"002L_FRG3G" "81941548"
"002R_IIV3" "106073503"
"002R_IIV3" "123808694"
"002R_IIV3" "109287880"
...
Does anyone know how to do this effectively?
Solution
Perl:
perl -lne '
s/[";]//g;
($a, @b) = split;
print qq("$a" "$_") for @b;
' FILE
Other tips
awk '{for (i=2; i<=NF; i++) {gsub(/[";]/, "", $i); printf "%s%s\"%s\"", $1, OFS, $i; printf "%s", "\n"}}' inputfile
For each field after $1, strip quotation marks and semicolons, then print $1 followed by the contents of the field surrounded by quotes. Do this for each line in the input file.
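The same field-by-field idea can be sketched in Python for readers who prefer a script over a one-liner. This is a minimal sketch, not part of the original answers: the function names `normalise_line` and `normalise` and the `sep` parameter are hypothetical, and it assumes the keys and values never contain whitespace, quotes, or semicolons, as in the sample data.

```python
def normalise_line(line, sep="\t"):
    """Return one '"key"<sep>"value"' row per value found on the line."""
    # Strip quotation marks and semicolons, then split on any whitespace.
    fields = line.replace('"', "").replace(";", "").split()
    if not fields:
        return []  # blank line: nothing to emit
    key, values = fields[0], fields[1:]
    return ['"{}"{}"{}"'.format(key, sep, value) for value in values]


def normalise(lines, sep="\t"):
    """Lazily yield normalised rows from any iterable of lines."""
    for line in lines:
        yield from normalise_line(line, sep)


# Sample records taken from the question (tab between the two columns).
sample = [
    '"001R_FRG3G"\t"81941549; 47060116; 49237298"',
    '"002R_IIV3"\t"106073503; 123808694; 109287880"',
]
for row in normalise(sample):
    print(row)
```

For a real file, passing the open file object (for example `normalise(open("FILE"))`) should process it line by line, so a large input would not need to fit in memory.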
This might work for you (GNU awk):
awk '{while(/;/) $0=gensub(/^((.*[ \t]").*);[ \t]*/,"\\1\"\n\\2",1)};1' file
"001R_FRG3G" "81941549"
"001R_FRG3G" "47060116"
"001R_FRG3G" "49237298"
"002L_FRG3G" "49237299"
"002L_FRG3G" "47060117"
"002L_FRG3G" "81941548"
"002R_IIV3" "106073503"
"002R_IIV3" "123808694"
"002R_IIV3" "109287880"
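Or use GNU sed: it's not awk, but it solves the problem elegantly. Note that -i edits the file in place, so the listing below shows the resulting file contents: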
sed -i ':a;s/\(\(.*\s"\).*\);\s*/\1"\n\2/;ta' file
"001R_FRG3G" "81941549"
"001R_FRG3G" "47060116"
"001R_FRG3G" "49237298"
"002L_FRG3G" "49237299"
"002L_FRG3G" "47060117"
"002L_FRG3G" "81941548"
"002R_IIV3" "106073503"
"002R_IIV3" "123808694"
"002R_IIV3" "109287880"
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow