Removing trailing / starting newlines with sed, awk, tr, and friends

https://stackoverflow.com/questions/7359527

28-10-2019
|

문제

I would like to remove all of the empty lines from a file, but only when they are at the end/start of a file (that is, if there are no non-empty lines before them, at the start; and if there are no non-empty lines after them, at the end.)

Is this possible outside of a fully-featured scripting language like Perl or Ruby? I’d prefer to do this with sed or awk if possible. Basically, any light-weight and widely available UNIX-y tool would be fine, especially one I can learn more about quickly (Perl, thus, not included.)

해결책

From Useful one-line scripts for sed:

# Delete all leading blank lines at top of file (only).
sed '/./,$!d' file

# Delete all trailing blank lines at end of file (only).
sed -e :a -e '/^\n*$/{$d;N;};/\n$/ba' file

Therefore, to remove both leading and trailing blank lines from a file, you can combine the above commands into:

sed -e :a -e '/./,$!d;/^\n*$/{$d;N;};/\n$/ba' file

다른 팁

So I'm going to borrow part of @dogbane's answer for this, since that sed line for removing the leading blank lines is so short...

tac is part of coreutils, and reverses a file. So do it twice:

tac file | sed -e '/./,$!d' | tac | sed -e '/./,$!d'

It's certainly not the most efficient, but unless you need efficiency, I find it more readable than everything else so far.

here's a one-pass solution in awk: it does not start printing until it sees a non-empty line and when it sees an empty line, it remembers it until the next non-empty line

awk '
    /[[:graph:]]/ {
        # a non-empty line
        # set the flag to begin printing lines
        p=1      
        # print the accumulated "interior" empty lines 
        for (i=1; i<=n; i++) print ""
        n=0
        # then print this line
        print
    }
    p && /^[[:space:]]*$/ {
        # a potentially "interior" empty line. remember it.
        n++
    }
' filename

Note, due to the mechanism I'm using to consider empty/non-empty lines (with [[:graph:]] and /^[[:space:]]*$/), interior lines with only whitespace will be truncated to become truly empty.

using awk:

awk '{a[NR]=$0;if($0 && !s)s=NR;}
    END{e=NR;
        for(i=NR;i>1;i--) 
            if(a[i]){ e=i; break; } 
        for(i=s;i<=e;i++)
            print a[i];}' yourFile

As mentioned in another answer, tac is part of coreutils, and reverses a file. Combining the idea of doing it twice with the fact that command substitution will strip trailing new lines, we get

echo "$(echo "$(tac "$filename")" | tac)"

which doesn't depend on sed. You can use echo -n to strip the remaining trailing newline off.

Here's an adapted sed version, which also considers "empty" those lines with just spaces and tabs on it.

sed -e :a -e '/[^[:blank:]]/,$!d; /^[[:space:]]*$/{ $d; N; ba' -e '}'

It's basically the accepted answer version (considering BryanH comment), but the dot . in the first command was changed to [^[:blank:]] (anything not blank) and the \n inside the second command address was changed to [[:space:]] to allow newlines, spaces an tabs.

An alternative version, without using the POSIX classes, but your sed must support inserting \t and \n inside […]. GNU sed does, BSD sed doesn't.

sed -e :a -e '/[^\t ]/,$!d; /^[\n\t ]*$/{ $d; N; ba' -e '}'

Testing:

prompt$ printf '\n \t \n\nfoo\n\nfoo\n\n \t \n\n' 



foo

foo



prompt$ printf '\n \t \n\nfoo\n\nfoo\n\n \t \n\n' | sed -n l
$
 \t $
$
foo$
$
foo$
$
 \t $
$
prompt$ printf '\n \t \n\nfoo\n\nfoo\n\n \t \n\n' | sed -e :a -e '/[^[:blank:]]/,$!d; /^[[:space:]]*$/{ $d; N; ba' -e '}'
foo

foo
prompt$

Using bash

$ filecontent=$(<file)
$ echo "${filecontent/$'\n'}"

In bash, using cat, wc, grep, sed, tail and head:

# number of first line that contains non-empty character
i=`grep -n "^[^\B*]" <your_file> | sed -e 's/:.*//' | head -1`
# number of hte last one
j=`grep -n "^[^\B*]" <your_file> | sed -e 's/:.*//' | tail -1`
# overall number of lines:
k=`cat <your_file> | wc -l`
# how much empty lines at the end of file we have?
m=$(($k-$j))
# let strip last m lines!
cat <your_file> | head -n-$m
# now we have to strip first i lines and we are done 8-)
cat <your_file> | tail -n+$i

Man, it's definitely worth to learn "real" programming language to avoid that ugliness!

For an efficient non-recursive version of the trailing newlines strip (including "white" characters) I've developed this sed script.

sed -n '/^[[:space:]]*$/ !{x;/\n/{s/^\n//;p;s/.*//;};x;p;}; /^[[:space:]]*$/H'

It uses the hold buffer to store all blank lines and prints them only after it finds a non-blank line. Should someone want only the newlines, it's enough to get rid of the two [[:space:]]* parts:

sed -n '/^$/ !{x;/\n/{s/^\n//;p;s/.*//;};x;p;}; /^$/H'

I've tried a simple performance comparison with the well-known recursive script

sed -e :a -e '/^\n*$/{$d;N;};/\n$/ba'

on a 3MB file with 1MB of random blank lines around a random base64 text.

shuf -re 1 2 3 | tr -d "\n" | tr 123 " \t\n" | dd bs=1 count=1M > bigfile
base64 </dev/urandom | dd bs=1 count=1M >> bigfile
shuf -re 1 2 3 | tr -d "\n" | tr 123 " \t\n" | dd bs=1 count=1M >> bigfile

The streaming script took roughly 0.5 second to complete, the recursive didn't end after 15 minutes. Win :)

For completeness sake of the answer, the leading lines stripping sed script is already streaming fine. Use the most suitable for you.

sed '/[^[:blank:]]/,$!d'
sed '/./,$!d'

A bash solution.

Note: Only useful if the file is small enough to be read into memory at once.

[[ $(<file) =~ ^$'\n'*(.*)$ ]] && echo "${BASH_REMATCH[1]}"

$(<file) reads the entire file and trims trailing newlines, because command substitution ($(....)) implicitly does that.
=~ is bash's regular-expression matching operator, and =~ ^$'\n'*(.*)$ optionally matches any leading newlines (greedily), and captures whatever comes after. Note the potentially confusing $'\n', which inserts a literal newline using ANSI C quoting, because escape sequence \n is not supported.
Note that this particular regex always matches, so the command after && is always executed.
Special array variable BASH_REMATCH rematch contains the results of the most recent regex match, and array element [1] contains what the (first and only) parenthesized subexpression (capture group) captured, which is the input string with any leading newlines stripped. The net effect is that ${BASH_REMATCH[1]} contains the input file content with both leading and trailing newlines stripped.
Note that printing with echo adds a single trailing newline. If you want to avoid that, use echo -n instead (or use the more portable printf '%s').

I'd like to introduce another variant for gawk v4.1+

result=($(gawk '
    BEGIN {
        lines_count         = 0;
        empty_lines_in_head = 0;
        empty_lines_in_tail = 0;
    }
    /[^[:space:]]/ {
        found_not_empty_line = 1;
        empty_lines_in_tail  = 0;
    }
    /^[[:space:]]*?$/ {
        if ( found_not_empty_line ) {
            empty_lines_in_tail ++;
        } else {
            empty_lines_in_head ++;
        }
    }
    {
        lines_count ++;
    }
    END {
        print (empty_lines_in_head " " empty_lines_in_tail " " lines_count);
    }
' "$file"))

empty_lines_in_head=${result[0]}
empty_lines_in_tail=${result[1]}
lines_count=${result[2]}

if [ $empty_lines_in_head -gt 0 ] || [ $empty_lines_in_tail -gt 0 ]; then
    echo "Removing whitespace from \"$file\""
    eval "gawk -i inplace '
        {
            if ( NR > $empty_lines_in_head && NR <= $(($lines_count - $empty_lines_in_tail)) ) {
                print
            }
        }
    ' \"$file\""
fi

@dogbane has a nice simple answer for removing leading empty lines. Here's a simple awk command which removes just the trailing lines. Use this with @dogbane's sed command to remove both leading and trailing blanks.

awk '{ LINES=LINES $0 "\n"; } /./ { printf "%s", LINES; LINES=""; }'

This is pretty simple in operation.

Add every line to a buffer as we read it.
For every line which contains a character, print the contents of the buffer and then clear it.

So the only things that get buffered and never displayed are any trailing blanks.

I used printf instead of print to avoid the automatic addition of a newline, since I'm using newlines to separate the lines in the buffer already.

This AWK script will do the trick:

BEGIN {
    ne=0;
}

/^[[:space:]]*$/ {
    ne++;
}

/[^[:space:]]+/ {
    for(i=0; i < ne; i++)
        print "";
    ne=0;
    print
}

The idea is simple: empty lines do not get echoed immediately. Instead, we wait till we get a non-empty line, and only then we first echo out as much empty lines as seen before it, and only then echo out the new non-empty line.

perl -0pe 's/^\n+|\n+(\n)$/\1/gs'

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow