Question

I've been using a shell script for ~3 years that starts a tail -F -n0 /path/to/LOGFILE, parses and formats the output, and dumps it into a pipe-delimited file. However, we've gone from several thousand log lines a day to several million, and the script has started to eat huge amounts of memory and CPU.

I recently went and replaced all of the data parsing logic with awk, which in testing seems to be many, MANY orders of magnitude faster at parsing the data. To test the new awk code I tried pumping an entire day's worth of logs (~6 million lines) through both the awk code and the shell code, and unsurprisingly the awk code pulled the ~92K relevant lines in ~10 seconds while the shell code took over 15 minutes to do the exact same thing. However, if I use the exact same code but, instead of cat /path/to/file|awk '...', I do tail -F -n0 /path/to/file|awk '...', there is a HUGE delay before the text is written to the output file: up to ~2-3 minutes, as opposed to ~0.5-1.0 seconds when it goes through the shell code.

Shell code (yes, I know how ugly shell code is):

outDir="/opt/redacted/logs/allerrors/"
# Follow the log and split each new line into whitespace-separated fields.
tail -F -n0 /opt/redacted/logs/manager.log | while read -a logLine; do
  # Field 3 flags error lines.
  if [[ "${logLine[2]}" == "E" ]]; then
    # Field 8 may start with "@"; keep it if so, otherwise write NULL and shift the message left.
    if [[ "${logLine[7]:0:1}" == "@" ]]; then
      echo "${logLine[0]}|${logLine[1]}|${logLine[6]}|${logLine[7]}|${logLine[@]:8:${#logLine[@]}}" >> ${outDir}allerrors.${logLine[0]//\//.}.log
    else
      echo "${logLine[0]}|${logLine[1]}|${logLine[6]}|NULL|${logLine[@]:7:${#logLine[@]}}" >> ${outDir}allerrors.${logLine[0]//\//.}.log
    fi
  fi
done

awk code:

outDir="/opt/redacted/logs/allerrors/"
tail -F -n0 /opt/redacted/logs/manager.log | awk -v dir="$outDir" 'BEGIN{OFS="|"}
{
  if ($3 == "E")
  {
    # Build the per-day output file name from field 1, replacing "/" with ".".
    file="allerrors."$1".log"
    gsub("/",".",file)
    if ($8 ~ /@/)
      print $1,$2,$7,$8,substr($0, index($0,$9)) >> dir file
    else
      print $1,$2,$7,"NULL",substr($0, index($0,$8)) >> dir file
  }
}'

To be clear, both sets of code work and create identical output if I use cat instead of tailing the file, but with the awk code I don't see results in my output file until ~2-3 minutes after the line appears in the log, while the shell version only takes a few seconds.


Solution

awk buffers its output by default when it isn't writing to a terminal, while the shell loop's echo writes each line out immediately. Buffering increases awk's throughput, but it also increases latency: lines sit in the buffer until it fills up or the program exits.
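You can see the same effect at the command line with a minimal sketch like the following (assuming bash and a stdio-buffering awk such as gawk or mawk; the trailing cat is only there so awk's stdout is a pipe rather than a terminal):

# Through the shell loop, each line shows up about once per second:
for i in 1 2 3; do echo "line $i"; sleep 1; done | while read l; do echo "$l"; done | cat

# Through awk, nothing appears until awk hits EOF and its buffered stdout is flushed:
for i in 1 2 3; do echo "line $i"; sleep 1; done | awk '{print}' | cat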

Just add fflush(); to your awk code to force the buffers to flush:

outDir="/opt/redacted/logs/allerrors/"
tail -F -n0 /opt/redacted/logs/manager.log | awk -v dir="$outDir" 'BEGIN{OFS="|"}
{
  if ($3 == "E")
  {
    file="allerrors."$1".log"
    gsub("/",".",file)
    if ($8 ~ /@/)
      print $1,$2,$7,$8,substr($0, index($0,$9)) >> dir file
    else
      print $1,$2,$7,"NULL",substr($0, index($0,$8)) >> dir file
    # Push buffered output out to the file right away instead of waiting for the buffer to fill.
    fflush();
  }
}'
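One caveat: the behavior of fflush() with no argument has varied between awk implementations (older gawk releases, for example, flushed only standard output). If you are on gawk, you can instead pass the file name and flush just the per-day file you appended to, for example:

print $1,$2,$7,"NULL",substr($0, index($0,$8)) >> dir file
# gawk: flush only this output file rather than every open file and pipe.
fflush(dir file)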

OTHER TIPS

tail -f -n0 filename 

blocks until new data is appended to the file, so the delay is, as already suggested, an issue related to buffering of output inside the awk process rather than a problem on the input side. However, if there is zero change to the log file for 2 minutes, awk's I/O buffering alone does not fully account for the time delay.

Try this to show the effect of line buffering:

while true
do
  echo -n 'this is a string'
  sleep 5
  echo ' add a newline'
done | awk '{print $0}'
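If you run this, each combined "this is a string add a newline" line should only appear about every 5 seconds, once the second echo supplies the trailing newline: awk does not see (or print) a record until the line is complete.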
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow