Question

Say I have a file with multiple paragraphs similar to

Lorem ipsum dolor sit amet. Velit et ornare feugiat ve fringilla adipiscing, non
augue risus, eleifend. Laoreet a, taciti porttitor mus. Erat leo metus
venenatis. Natoque eni, nunc quis elit est. Nec enim dui. Sem parturient lectus,
sed, egestas. Amet nascetur quisque, nonummy amet ut odio proin hymenaeos sit,
consequat proin hymenaeos vestibulum. Duis ad penatibus natoque, fames nec amet
eni inceptos. Ligula orci scelerisque laoreet, massa leo dictumst feugiat
praesent varius netus suspendisse. Et et quis volutpat quam, aenean sit, magnis
integer ad luctus hendrerit per. Lectus adipiscing nascetur quisque consectetuer
feugiat etiam eros. Natoque massa. Semper ut nam tortor. Odio ut nullam mus,
sociis at, luctus aliquet at odio habitant fames.

Penatibus ipsum lacus blandit ad dis ante dolor. Cursus porta penatibus
facilisi. Nisl erat rutrum primis dis elit dolor penatibus pretium duis
sollicitudin ut. Sed urna leo massa cubilia eget, elementum mus. Ve metus ac
vitae at litora tincidunt id, ac hac. Dis justo nullam. Fames sollicitudin,
augue ve at. Tristique. Primis convallis praesent, eget. Nullam, penatibus ut,
proin non mus id nascetur dis, lorem arcu. Magna urna nascetur ornare, nunc
proin quisque cum, pharetra. Quisque, litora eu lobortis diam eros. Vel mi
hymenaeos ipsum in. Ligula curabitur ve, magnis hymenaeos euismod.

The file was generated by processing a Markdown file, which, as you can see, has lines broken at around 80 characters. Using Perl, sed, or awk (I'm running Linux so I could use any solution, but I'm not much of a Python or Ruby user), how can I undo the line breaks within paragraphs?

I know how to strip \n from an entire file, but that would run the two paragraphs shown into a single unbroken line. I don't want that. I just want to operate a paragraph at a time, so any solution should skip lines where \n is the only content.

The file I have uses Unix/Linux line endings, i.e. line feeds, so only \n is present. I do need to preserve the blank lines between paragraphs.


Solution

Line breaks within each paragraph are replaced with a space character:

perl -00 -lpe 's|\r?\n| |g' file

Here is a brief explanation of the switches, along with the deparsed source:


perl -MO=Deparse -00 -lpe 's|\r?\n| |g' file
BEGIN { $/ = ""; $\ = "\n\n"; }      # see below
LINE: while (defined($_ = <ARGV>)) { # -p switch
    chomp $_;                        # also -l switch
    s/\r?\n/ /g;
}
continue {
    print $_;                        # -p switch
}
  • -00 => $/ = ""; # input record separator set to paragraph mode
  • -l => $\ = "\n\n"; # output record separator set to $/
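
For readers who want to see the paragraph-mode logic spelled out, here is a rough Python sketch of what the one-liner does: split the input on runs of blank lines, replace the remaining line breaks in each record with a space, and write each record followed by a blank line. The function name `unwrap_paragraphs` is made up for illustration, and this only approximates Perl's behaviour rather than reproducing it exactly.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join the lines of each paragraph, keeping one blank line between paragraphs."""
    # Like paragraph mode ($/ = ""), treat any run of two or more newlines
    # as a single record separator and ignore leading/trailing blank lines.
    records = [p for p in re.split(r"\n{2,}", text.strip("\n")) if p]
    # Same substitution as s|\r?\n| |g applied to each record.
    joined = [re.sub(r"\r?\n", " ", record) for record in records]
    # Like $\ = "\n\n", every record is followed by a blank line.
    return "".join(record + "\n\n" for record in joined)


if __name__ == "__main__":
    sample = "Lorem ipsum dolor\nsit amet.\n\nPenatibus ipsum\nlacus blandit.\n"
    print(unwrap_paragraphs(sample), end="")
```

Unlike the Perl version, this sketch takes the whole text as one string, so it is meant to show the logic and is not a replacement for the one-liner.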

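Other tips

Try chomping the trailing newline from every line that contains a non-whitespace character. Note that chomp() joins lines with nothing in between, so this assumes each wrapped line still ends with a space:

perl -pe 'chomp if m/\S/' infile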

EDIT: To keep a blank line between paragraphs and a final newline character, try the following:

perl -pe 'm/\S/ ? chomp() : print "\n"; END { print "\n" }' infile
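
The same line-by-line idea can be sketched in Python, assuming you want a space inserted between joined lines instead of relying on trailing spaces in the input. The generator name `unwrap_lines` is hypothetical; it buffers only the current paragraph, and it treats whitespace-only lines as blank, just as the m/\S/ test does.

```python
from typing import Iterable, Iterator


def unwrap_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield one joined line per paragraph, plus an empty string per blank input line."""
    pending = []  # lines of the paragraph currently being collected
    for line in lines:
        text = line.rstrip()  # drop the newline and any trailing whitespace
        if text:
            pending.append(text)
        else:
            # A blank line closes the current paragraph, if there is one.
            if pending:
                yield " ".join(pending)
                pending = []
            yield ""
    if pending:  # the last paragraph may not be followed by a blank line
        yield " ".join(pending)


if __name__ == "__main__":
    import sys

    for out in unwrap_lines(sys.stdin):
        print(out)
```

Saved under a name of your choosing, it can be run as a filter that reads the file on standard input and writes the unwrapped text to standard output.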

Without having to read the whole file into memory:

$ cat file
Lorem ipsum dolor sit amet. Velit et ornare feugiat ve fringilla adipiscing, non
augue risus, eleifend. Laoreet a, taciti porttitor mus. Erat leo metus
venenatis. Natoque eni, nunc quis elit est.

Penatibus ipsum lacus blandit ad dis ante dolor. Cursus porta penatibus
facilisi. Nisl erat rutrum primis dis elit dolor penatibus pretium duis
sollicitudin ut. Sed urna leo massa cubilia eget, elementum mus. Ve metus ac
vitae at litora tincidunt id, ac hac. Dis justo nullam.

$ awk -v RS= -v ORS='\n\n' -F'\n' '{$1=$1}1' file
Lorem ipsum dolor sit amet. Velit et ornare feugiat ve fringilla adipiscing, non augue risus, eleifend. Laoreet a, taciti porttitor mus. Erat leo metus venenatis. Natoque eni, nunc quis elit est.

Penatibus ipsum lacus blandit ad dis ante dolor. Cursus porta penatibus facilisi. Nisl erat rutrum primis dis elit dolor penatibus pretium duis sollicitudin ut. Sed urna leo massa cubilia eget, elementum mus. Ve metus ac vitae at litora tincidunt id, ac hac. Dis justo nullam.

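"Lines where \n is the only content" means at least two consecutive newline characters.

You can do this with a regex that matches a newline only when it has a non-newline character on each side. The neighbouring characters have to be matched with lookarounds, otherwise the substitution would consume them as well: (?<=[^\r\n])\n(?=[^\r\n])

A sample Python script:

import re

mystring = """sjdfkj

adlfklk 
dlkfl """

# Replace only the single newlines inside a paragraph; blank lines are left alone.
print(re.sub(r"(?<=[^\r\n])\n(?=[^\r\n])", " ", mystring))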
Licensed under: CC-BY-SA with attribution