How do I delete newlines ('\n', 0x0A) from non-empty lines using tr(1)?

https://stackoverflow.com/questions/8178250

03-03-2021

Question

I have a file named file1 with the following content:

The answer t
o your question 

A conclusive a
nswer isn’t al
ways possible.

When in doubt, ask pe
ople to cite their so
urces, or to explain

Even if we don’t agre
e with you, or tell y
ou.

I would like to convert file1 into file2. The latter should look like this:

The answer to your question

A conclusive answer isn’t always possible.

When in doubt, ask people to cite their sources, or to explain

Even if we don’t agree with you, or tell you.

If I simply execute cat file1 | tr -d "\n" > file2, all the newline characters will be deleted. How do I delete only those newline characters that are on the non-empty lines, using the tr(1) utility?

Solution

perl -00 -lpe 'tr/\n//d'

-00 is Perl's "paragraph" mode, which reads the input in records delimited by one or more blank lines. -l chomps each record on input and adds a record separator back on output (a blank line, when paragraph mode is in effect), so it's safe to delete all newlines in the input.

Other tips

tr can't do this, but sed easily can

sed -ne '$!H;/^$/{x;s/\n//g;G;p;d;}' file1 > file2

This appends every line except the last to the hold space. On an empty line, it swaps the held text into the pattern space, removes its newlines, and prints the result followed by a newline. The pattern space is then deleted and the process repeats.

EDIT:

Per @potong's comment, here's a version which doesn't require an extra blank line at the end of the file.

sed -ne 'H;/^$/{x;s/\n//g;G;p;};${x;s/\n//g;x;g;p;}' file1 > file2
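The second sed command is dense, so here is a sketch of the same script laid out one command per line with comments. The wrapper name join_with_sed is hypothetical; the sed commands themselves are unchanged from the line above.

```shell
#!/bin/sh
# join_with_sed: hypothetical wrapper; same commands as the one-liner above.
join_with_sed() {
    sed -n '
        # Append every input line to the hold space.
        H
        # On a blank line: fetch the held paragraph, strip its newlines,
        # append a newline (G adds "\n" plus the now-empty hold space), print.
        /^$/ {
            x
            s/\n//g
            G
            p
        }
        # On the last line: flush whatever is still held.
        $ {
            x
            s/\n//g
            x
            g
            p
        }
    '
}

printf 'ab\ncd\n\nef\ngh\n' | join_with_sed
```

The sample input deliberately has no blank line at the end, which is the case this version was written to handle.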

If there's a character that you know doesn't appear in your input, you could do something like this:

# Assume that the input doesn't contain the '|' character at all
tr '\n' '|' < file1 | sed 's/\([^|]\)|\([^|]\)/\1\2/g' | tr '|' '\n' > file2

This replaces every newline with the placeholder character |; sed then deletes each | that has some other character on both sides; finally, the remaining | characters are turned back into newlines.
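A small sketch of that pipeline on a shortened sample, again assuming "|" never occurs in the input. The function name join_with_placeholder is made up for illustration.

```shell
#!/bin/sh
# join_with_placeholder: hypothetical wrapper for the pipeline above.
# Assumes the input never contains the "|" character.
join_with_placeholder() {
    tr '\n' '|' |                            # newlines become placeholders
        sed 's/\([^|]\)|\([^|]\)/\1\2/g' |   # drop a lone "|" between two other chars
        tr '|' '\n'                          # remaining placeholders become newlines
}

printf 'The answer t\no your question\n\nA conclusive a\nnswer\n' | join_with_placeholder
```

One caveat worth checking on your own data: each match consumes the character on both sides of the |, so a one-character line in the middle of a paragraph can leave a newline behind (for example, the three lines a, b, c come out as ab and c).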

This may work for you:

$ sed '1{h;d};H;${x;s/\([^\n]\)\n\([^\n]\)/\1\2/g;p};d' file

The answer to your question 

A conclusive answer isn't always possible.

When in doubt, ask people to cite their sources, or to explain

Even if we don't agree with you, or tell you.

The newlines in file1 fall into four classes:

newline followed by another newline
newline preceded by newline
newline at the end of file
sandwiched newline

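Deleting the first class by reading the input in multi-line records (the -000 option, which Perl treats as paragraph mode just like -00; -0777 is the conventional way to slurp the whole file) and substituting one newline everywhere we see a pair of them (s/\n\n/\n/g) gets us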

$ perl -000 -pe 's/\n\n/\n/g' file1 
The answer t
o your question 
A conclusive a
nswer isn’t al
ways possible.
When in doubt, ask pe
ople to cite their so
urces, or to explain
Even if we don’t agre
e with you, or tell y
ou.

That's not what we want because the first class of newlines should terminate lines in file2.

We may try to be clever and use positive look-behind to delete newlines preceded by other newlines (the second class), but the output is indistinguishable from the previous case. That makes sense, because this time we're deleting the latter rather than the former in each adjoined pair of newlines.

$ perl -000 -pe 's/(?<=\n)\n//g' file1 
The answer t
o your question 
A conclusive a
nswer isn’t al
ways possible.
When in doubt, ask pe
ople to cite their so
urces, or to explain
Even if we don’t agre
e with you, or tell y
ou.

Even so, this still isn't what we want because newlines preceded by other newlines become the blank lines in file2.

It's obvious that we want to hang on to the newline at the end of file1.

What we want then is a program that deletes the fourth class only: each newline that is not preceded by another newline and that is followed by neither another newline nor logical end-of-input.

Using Perl's look-around assertions, the specification is straightforward, although perhaps a bit intimidating in appearance. "Not preceded by newline" is the negative look-behind (?<!\n). The negative look-ahead (?!\n|$) says that what follows must be neither another newline nor (|) the end of the input ($).

Putting it all together we get

$ perl -000 -pe 's/(?<!\n)\n(?!\n|$)//g' file1 
The answer to your question

A conclusive answer isn’t always possible.

When in doubt, ask people to cite their sources, or to explain

Even if we don’t agree with you, or tell you.

Finally, to create file2, redirect the standard output.

perl -000 -pe 's/(?<!\n)\n(?!\n|$)//g' file1 > file2

You can't get that with tr by itself. tr is very handy, but is strictly a char-by-char filter, no look-ahead or look-behind.
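To make the char-by-char point concrete, here is a small sketch (the function name and sample text are made up for illustration). tr -d removes every newline it sees, because it has no way to tell the newline that ends a paragraph from one in the middle of a sentence.

```shell
#!/bin/sh
# strip_all_newlines: hypothetical helper showing what tr alone does.
# tr looks at one character at a time, so every newline is treated alike.
strip_all_newlines() {
    tr -d '\n'
}

# The paragraph break is lost along with the mid-sentence breaks.
printf 'The answer t\no your question\n\nA conclusive a\nnswer\n' | strip_all_newlines
```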

You might be able to get your example output with sed, but it would really be painful (I think!). Edit: sed master @Sorpigal proves me wrong!

Here's a solution with awk

/home/shellter:>cat <<-EOS \
| awk 'BEGIN{RS="\n\n"}; { gsub("\n", "", $0) ;printf("%s%s", $0, "\n\n") }'
The answer t
o your question 

A conclusive a
nswer isn’t al
ways possible.

When in doubt, ask pe
ople to cite their so
urces, or to explain

Even if we don’t agre
e with you, or tell y
ou.
EOS


# output
The answer to your question

A conclusive answer isn’t always possible.

When in doubt, ask people to cite their sources, or to explain

Even if we don’t agree with you, or tell you.

Oddly, the output displays as triple-spaced here, but it is really double-spaced.
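A more portable variant may be worth considering: POSIX awk treats an empty RS as "paragraph mode", where blank lines separate records, so no multi-character RS is needed. This is a sketch rather than the answer's own code, and the function name join_with_awk is made up for illustration.

```shell
#!/bin/sh
# join_with_awk: hypothetical wrapper using POSIX paragraph mode (RS="").
join_with_awk() {
    awk '
        BEGIN { RS = "" }      # an empty RS means: blank lines separate records
        { gsub(/\n/, "")       # join the lines of the paragraph held in $0
          print $0 "\n" }      # print it, followed by one blank line
    '
}

printf 'The answer t\no your question\n\nA conclusive a\nnswer\n' | join_with_awk
```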

Awk has predefined variables that it populates for each file and for each record of text that it reads, e.g.

RS = RecordSeparator -- normally a newline, so each record is one line of data, but it is
                     configurable; when set to "\n\n" (a multi-character RS is an extension
                     supported by awks such as gawk and mawk) it means a blank line, the
                     typical separation between paragraphs. In POSIX awk, RS="" selects
                     paragraph mode.

$0 = complete record of text (as defined by the internal variable RS (RecordSeparator)).
                             In this problem it is each paragraph of data, treated
                             as a single record.

$1 = first field in the record (as defined by the internal variable FS (FieldSeparator),
                           which defaults to runs of space and/or tab chars, so
                           a line such as "a  b<TAB>c" has 3 fields).

NF = Number(of)Fields in the current record (again, fields are defined by the value of FS as
                                                described above).

(There are many others besides $0, $n, NF, FS, and RS.)

You can programmatically step through fields such as $1, $2, $3 by using a variable, as in $i, where i holds a number between 1 and NF. The leading '$' says "give me the value of field i" (i.e. $1, $2, $3 ...).
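As a small sketch of NF and $i together (the helper name list_fields and the sample line are made up for illustration), this prints every field of each input line next to its index:

```shell
#!/bin/sh
# list_fields: hypothetical helper that prints each field with its index.
list_fields() {
    # i runs from 1 to NF; $i is the value of field number i
    awk '{ for (i = 1; i <= NF; i++) printf "%d:%s\n", i, $i }'
}

# Two adjacent spaces and one tab: three fields with the default FS.
printf 'a  b\tc\n' | list_fields
```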

I hope this helps.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow