Question

I have a file that contains sequence data, where each new paragraph (separated by two blank lines) contains a new sequence:

#example

ASDHJDJJDMFFMF
AKAKJSJSJSL---
SMSM-....SKSKK
....SK


SKJHDDSNLDJSCC
AK..SJSJSL--HG
AHSM---..SKSKK
-.-GHH

and I want to end up with a file looking like:

ASDHJDJJDMFFMFAKAKJSJSJSL---SMSM-....SKSKK....SK
SKJHDDSNLDJSCCAK..SJSJSL--HGAHSM---..SKSKK-.-GHH

each sequence is the same length (if that helps).

I would also like to do this over multiple files stored in different directories.

I have just tried

sed -e '/./{H;$!d;}' -e 'x;/regex/!d' ./text.txt

however this just deleted the entire file :S

Any help would be appreciated. It doesn't have to be in sed; if you know how to do it in perl or something else then that's also great.

Thanks.


Solution 2

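awk '
    # A blank (or whitespace-only) line ends the current sequence: print it and reset.
    /^[[:space:]]*$/ {if (line) print line; line=""; next}
    # Any other line is appended to the sequence being built.
    {line=line $0}
    # Print the last sequence if the file does not end with a blank line.
    END {if (line) print line}
' file

Or, with Perl in paragraph mode (-00):

perl -00 -pe 's/\n//g; $_.="\n"' file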

For multiple files:

# adjust your glob pattern to suit, 
# don't be shy to ask for assistance
for file in */*.txt; do
    newfile="/some/directory/$(basename "$file")"
    perl -00 -pe 's/\n//g; $_.="\n"' "$file" > "$newfile"
done
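
If the files sit at different depths, a glob like */*.txt will miss some of them. Below is a minimal sketch that walks a whole tree with find and mirrors the directory layout in the output, so files with the same name in different directories do not overwrite each other. The function name join_tree and its two arguments are made up for illustration, and it assumes the files end in .txt, that the output directory is not inside the source directory, and that the source directory is given without a trailing slash.

```shell
# join_tree SRC_DIR OUT_DIR  (hypothetical helper, not from the answer above)
# Joins the lines of every blank-line-separated paragraph in each *.txt file
# under SRC_DIR and writes the result to the same relative path under OUT_DIR.
join_tree() {
    src_dir=$1
    out_dir=$2
    # Note: file names containing newlines are not handled by this read loop.
    find "$src_dir" -type f -name '*.txt' | while IFS= read -r file; do
        rel=${file#"$src_dir"/}                 # path relative to SRC_DIR
        mkdir -p "$out_dir/$(dirname "$rel")"   # recreate the subdirectory
        # Paragraph mode (RS=) with an empty OFS glues the lines together.
        awk '{$1=$1}1' RS= OFS= "$file" > "$out_dir/$rel"
    done
}
```

Usage would look like: join_tree ./data ./joined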

OTHER TIPS

All you're asking to do is convert a file of blank-line-separated records (RS) where the fields are separated by newlines into a file of newline-separated records where the fields are separated by nothing (OFS). Just set the appropriate awk variables and make awk rebuild the record (assigning $1 to itself forces $0 to be reassembled with the new OFS):

$ awk '{$1=$1}1' RS= OFS= file
ASDHJDJJDMFFMFAKAKJSJSJSL---SMSM-....SKSKK....SK
SKJHDDSNLDJSCCAK..SJSJSL--HGAHSM---..SKSKK-.-GHH
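
Since the question says every sequence has the same length, the joined output can be sanity-checked cheaply. The following is only a sketch: check_equal_lengths is a made-up name, it reads the joined lines on standard input, treats the first line as the reference length, and returns a non-zero status if any later line differs.

```shell
# check_equal_lengths  (hypothetical helper)
# Reads lines on stdin and reports any line whose length differs from line 1.
check_equal_lengths() {
    awk '
        NR == 1 { want = length($0) }           # first line sets the reference
        length($0) != want {
            printf "line %d: length %d, expected %d\n", NR, length($0), want
            bad = 1
        }
        END { exit bad }                        # 0 if all lengths matched
    '
}
```

It could be chained after the command above: awk '{$1=$1}1' RS= OFS= file | check_equal_lengths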

A Perl one-liner, if you prefer:

perl -nle 'BEGIN{$/=""};s/\n//g;print $_' file

The $/ variable is the equivalent of awk's RS variable. When set to the empty string ("") it causes two or more consecutive empty lines to be treated as one empty line, so each blank-line-separated paragraph is read as a single record. This is the so-called "paragraph mode" of reading. For each record read, all newline characters are removed. The -l switch adds a newline to the end of each output string, thus giving the desired result.

Another approach: first find the blank lines, i.e. the runs of consecutive line breaks (\n, or \r\n if the file has Windows line endings), and replace them with a special marker such as :$: that cannot occur in the data. Then replace every remaining line break with an empty string, which puts the whole file on one line. Finally, replace each marker with a single line break :)
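
The marker idea can be sketched with sed and tr. This is an illustration under assumptions, not a tested recipe from the answer: it uses a single @ as the marker (tr works on single characters, so :$: is swapped for @), assumes @ never appears in the sequence data, and the function name join_with_placeholder is made up.

```shell
# join_with_placeholder FILE  (hypothetical helper)
join_with_placeholder() {
    {
        # Step 1: turn every blank (or whitespace-only) line into the marker.
        # Step 2: delete all line breaks (\n and \r), leaving one long line.
        sed 's/^[[:space:]]*$/@/' "$1" | tr -d '\n\r'
        # Add a final marker so the last sequence also ends with a newline.
        printf '@'
    } |
    # Step 3: turn each run of markers back into a single line break.
    tr -s '@' '\n'
}
```

Leading blank lines in the input would produce an empty first line in the output; the awk and perl paragraph-mode commands do not have that quirk.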

Licensed under: CC-BY-SA with attribution