Strange problem with cut, colrm, awk and sed: they fail to cut characters from a piped stream

https://stackoverflow.com/questions/4789385

24-10-2019

Question

I have created a script to enumerate all files in a directory and below it. I wanted to add some progress feedback using pv, because I usually run the script from the root directory.

The problem is that find always includes fractional seconds in its time output (%TT), and I don't want to record that much detail.

If I write the script to do everything in one pass, I get the right output. But if I use intermediate files so that pv can show an estimate during a "second" pass, the result changes and I do not see why.

This version gives the right result:

#!/bin/bash

find -printf "%11s %TY-%Tm-%Td %TT %p\n" 2> /dev/null |
# - Remove the fractional seconds from the time
# before:       4096 2011-01-19 22:43:51.0000000000 .
# after :       4096 2011-01-19 22:43:51 .
colrm 32 42 |
pv -ltrbN "Enumerating files..." |
# - Sort everything by filename
sort -k 4

But the sort can take a long time, so I tried something like this to get a little more feedback:

#!/bin/bash

TMPFILE1=$(mktemp)
TMPFILE2=$(mktemp)

# Erase temporary files before quitting
trap "rm $TMPFILE1 $TMPFILE2" EXIT

find -printf "%11s %TY-%Tm-%Td %TT %p\n" 2> /dev/null |
pv -ltrbN "Enumerating files..." > $TMPFILE1
LINE_COUNT="$(wc -l $TMPFILE1)"

#cat $TMPFILE1 | colrm 32 42 |                   #1
#cat $TMPFILE1 | cut -c1-31,43- |                #2
#cut -c1-31,43- $TMPFILE1 |                      #3
#sed s/.0000000000// $TMPFILE1 |                 #4
awk -F".0000000000" '{print $1 $2}' $TMPFILE1 |  #5
pv -lN "Removing fractional seconds..." -s $LINE_COUNT > $TMPFILE2

echo "Sorting list by filenames..." >&2
cat $TMPFILE2 |
sort -k 4

None of the 5 "solutions" works. The ".0000000000" part is left in the output.

Can someone explain why?

My final solution is to combine the cutting operation with the find and use only one temporary file. Only the sort is done separately.

Solution

You can truncate the seconds within the argument to -printf using a field precision specifier (at least using GNU find 4.4.2):

find -printf "%11s %TY-%Tm-%Td %.8TT %p\n"

which leaves the eight characters in "HH:MM:SS".
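To see the effect of the precision specifier in isolation, the sketch below prints the modification time of one file with and without it. It assumes GNU find; the scratch directory and the name sample.txt are made up for the demonstration.

```shell
#!/bin/bash
# Assumes GNU find. The directory and file name are hypothetical demo values.
demo_dir=$(mktemp -d)
touch "$demo_dir/sample.txt"

# %TT prints HH:MM:SS followed by fractional seconds.
full=$(find "$demo_dir" -name sample.txt -printf '%TT\n')
# %.8TT keeps only the first eight characters, i.e. HH:MM:SS.
short=$(find "$demo_dir" -name sample.txt -printf '%.8TT\n')

echo "full : $full"
echo "short: $short"

rm -rf "$demo_dir"
```

If the second line shows only HH:MM:SS, your find honours the precision and no post-processing step is needed.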

The rest of my answer is possibly moot:

The reason none of your #1-5 works is that the output of wc -l $TMPFILE1 includes the filename, separated from the count by a space. Because $LINE_COUNT is expanded unquoted, that space splits it into two words, and pv treats the second one (the filename) as an input file. A file named on the command line takes precedence over stdin, so pv ignores the pipeline and copies $TMPFILE1 through unchanged. The output looks like unprocessed input because that is exactly what it is.
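The word splitting can be observed without pv. In the sketch below, count_args and last_arg are hypothetical helpers that stand in for pv and only report what arguments they received; the sample file is made up.

```shell
#!/bin/bash
# count_args and last_arg are hypothetical stand-ins for pv.
count_args() { echo "$#"; }
last_arg()   { shift $(( $# - 1 )); echo "$1"; }

sample=$(mktemp)
printf 'one\ntwo\nthree\n' > "$sample"

# wc prints the count, a space, and then the file name.
with_name=$(wc -l "$sample")
echo "wc output: $with_name"

# Deliberately unquoted, as in the question: the value splits into two words.
n_args=$(count_args -s $with_name)
extra=$(last_arg -s $with_name)
echo "arguments received: $n_args"
echo "last argument     : $extra"

rm -f "$sample"
```

The helper receives three arguments instead of two, and the last one is the path of the temporary file, which is what pv would then open as its input.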

To capture only the count without the filename:

LINE_COUNT=$(wc -l < "$TMPFILE1")
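As a rough sketch of how the second pass might look with this change, the script below uses two hand-written sample lines in place of the real find output. The progress function is a hypothetical wrapper that falls back to cat when pv is not installed, and the file ./notes.txt is invented for the example.

```shell
#!/bin/bash
# Sketch only: the sample lines stand in for the find output in $TMPFILE1.
TMPFILE1=$(mktemp)
TMPFILE2=$(mktemp)

printf '%s\n' \
    '       4096 2011-01-19 22:43:51.0000000000 .' \
    '        120 2011-01-19 22:40:02.0000000000 ./notes.txt' > "$TMPFILE1"

# Redirecting into wc keeps the file name out of its output.
LINE_COUNT=$(wc -l < "$TMPFILE1")

# Hypothetical wrapper: use pv when available, otherwise pass data through.
if command -v pv > /dev/null 2>&1; then
    progress() { pv -l -N "Removing fractional seconds..." -s "$LINE_COUNT"; }
else
    progress() { cat; }
fi

# Drop columns 32-42, which hold the fractional seconds.
cut -c1-31,43- "$TMPFILE1" | progress > "$TMPFILE2"

cat "$TMPFILE2"
first_line=$(head -n 1 "$TMPFILE2")

rm -f "$TMPFILE1" "$TMPFILE2"
```

Quoting "$LINE_COUNT" as well means that even a value containing a space would reach pv as a single argument and produce an error rather than a silently wrong result.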

Here are some minor improvements:

< $TMPFILE1 colrm 32 42 |                   #1 No need for cat

colrm 32 42 < $TMPFILE1 |                   #1

< $TMPFILE1 cut -c1-31,43- |                #2

cut -c1-31,43- < $TMPFILE1 |                #2

sed 's/\.0000000000//' $TMPFILE1 |          #4 The dot should be escaped, and the expression quoted so the shell keeps the backslash
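The escaped dot only reaches sed if the expression is quoted; unquoted, the shell removes the backslash first. The sketch below uses an invented line for a file of exactly 10000000000 bytes (the name ./big.img is made up) to show how an unescaped dot can match in the size column instead.

```shell
#!/bin/bash
# Hypothetical find output line for a file of 10000000000 bytes.
line='10000000000 2011-01-19 22:43:51.0000000000 ./big.img'

# Unquoted: the shell strips the backslash, so sed sees s/.0000000000//
# and the dot matches any character, here the leading "1" of the size.
unquoted=$(printf '%s\n' "$line" | sed s/\.0000000000//)

# Quoted: sed sees s/\.0000000000// and only a literal dot matches.
quoted=$(printf '%s\n' "$line" | sed 's/\.0000000000//')

echo "unquoted: $unquoted"
echo "quoted  : $quoted"
```

With the unquoted form the size field is deleted and the fractional seconds stay; with the quoted form only the fractional seconds are removed.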

OTHER TIPS

If this is an actual working tool, and not just a toy, then I'd just drop the "progress feedback" altogether... maybe come back to it when it doesn't complicate your life. In the meantime you've probably spent more time trying to figure out how to give feedback than you will ever spend waiting for your script to return.

If you absolutely MUST give some sort of feedback, then just echo "Sorting $(wc -l < $TMPFILE) lines ..."

You'll get a feeling for how long it'll take to sort so-many lines from experience.

Kiss it my son, kiss it.

Licensed under: CC-BY-SA with attribution
