How can I remove responses from LiveHTTPHeaders output using awk, perl or sed?

https://stackoverflow.com/questions/1812940

06-07-2019

Question

Let's say I have something like this (this is only an example, and actual requests will differ; I loaded StackOverflow with LiveHTTPHeaders enabled to get some samples to work on):

http://stackoverflow.com/

GET / HTTP/1.1
Host: stackoverflow.com
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.2) Gecko/20070220 Firefox/2.0.0.2
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

HTTP/1.x 200 OK
Cache-Control: private
Content-Type: text/html; charset=utf-8
Content-Encoding: gzip
Expires: Sat, 28 Nov 2009 16:04:24 GMT
Vary: Accept-Encoding
Server: Microsoft-IIS/7.0
Date: Sat, 28 Nov 2009 16:04:23 GMT
Content-Length: 19015
----------------------------------------------------------
...

The full log of requests and responses is available on pastebin.

I want to remove all responses (for example, HTTP/1.x 200 OK and everything that belongs to that response) and all the one-line entries that show the page address. Only the requests should remain in the text file that holds the saved LiveHTTPHeaders output.

So, the output would be:

GET / HTTP/1.1
Host: stackoverflow.com
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.2) Gecko/20070220 Firefox/2.0.0.2
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

GET /so/all.css?v=5290 HTTP/1.1
Host: sstatic.net
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.2) Gecko/20070220 Firefox/2.0.0.2
Accept: text/css,*/*;q=0.1
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: http://stackoverflow.com/

...

Again, the full text of what I want to keep is available on pastebin.

If I save a captured LiveHTTPHeaders session to a text file, how do I get a result like the second code block in this question? Maybe with awk, sed or perl? Or something else? I'm on Linux.

Edit: I'm trying to run Sinan's script. The script is:

#!/usr/bin/perl
local $/ = "\n\n";
while (<>) {
    print if /^GET|POST/; # Add more request types as needed
}

I tried running it this way:

./cleanup-headers.pl livehttp.txt > filtered.txt

And this way:

perl cleanup-headers.pl < livehttp.txt > filtered.txt

... the file filtered.txt was created, but it is completely empty.

Has anyone tried it on the FULL headers I pasted into pastebin? Did it work?

Full headers

Solution

Looks like you're having trailing whitespace issues.

$ sed -e 's/^\s*$//' livehttp.txt | \
  perl -e '$/ = ""; while (<>) { print if /^(GET|POST)/ }'

This works by putting Perl's readline operator into paragraph mode (via $/ = ""), which grabs records a chunk at a time, separated by two or more consecutive newlines.

It's nice when it works, but it's a bit brittle. Blank but not empty lines will gum up the works, but sed can clean those up.

Equivalent and more concise command:

$ sed -e 's/^\s*$//' livehttp.txt | perl -000 -ne 'print if /^(GET|POST)/'
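To make "blank but not empty" concrete, here is a minimal shell sketch. It assumes a POSIX sed and grep; the names normalize_blank_lines and sample_input and the sample text are made up for illustration. It uses the POSIX class [[:space:]] instead of \s, which is a GNU sed extension, and counts the truly empty lines before and after the cleanup. Only truly empty lines act as paragraph separators in Perl's paragraph mode.

```shell
# Hypothetical helper: turn whitespace-only lines into truly empty ones.
normalize_blank_lines() {
    sed 's/^[[:space:]]*$//'
}

# Made-up sample: the "blank" line between a and b holds a space and a tab.
sample_input() {
    printf 'a\n \t\nb\n'
}

# grep -c exits non-zero when the count is 0, hence the "|| true".
before=$(sample_input | grep -c '^$' || true)
after=$(sample_input | normalize_blank_lines | grep -c '^$' || true)

echo "empty lines before cleanup: $before"
echo "empty lines after cleanup:  $after"
```

On this sample the first count is 0 and the second is 1: the whitespace-only line becomes a real separator only after the sed step.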

OTHER TIPS

In Perl:

local $/ = "\n\n";
while (<>) {
    print if /^(?:GET|POST)/; # Add more request types as needed
}
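The grouping in (?:GET|POST) is doing real work. Without it, ^GET|POST is read as "^GET, or POST anywhere", so a record that merely contains the word POST would be kept. Below is a small sketch of the difference. It uses grep -E as a stand-in, on the assumption that its alternation binds the same way as Perl's in this case; the sample line is made up.

```shell
# Made-up header line that merely contains the word POST.
line='Allow: GET, POST'

# Ungrouped: parsed as (^GET)|(POST), so POST matches anywhere in the line.
# grep -c exits non-zero when the count is 0, hence the "|| true".
ungrouped=$(printf '%s\n' "$line" | grep -Ec '^GET|POST' || true)

# Grouped: the ^ anchor applies to both alternatives, so nothing matches.
grouped=$(printf '%s\n' "$line" | grep -Ec '^(GET|POST)' || true)

echo "ungrouped pattern matched $ungrouped line(s)"
echo "grouped pattern matched $grouped line(s)"
```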

Notes: Looking at the output generated by LiveHTTPHeaders, entries are quite clearly separated by two newlines, so I think setting $/ = "\n\n" is more appropriate than setting $/ = ''. I believe your problems were due to the fact that the lines in your input file were actually indented.

I did originally download the file from pastebin and use the full file to test my script. I do not believe the file you were using to test on your computer was identical to the one you put on pastebin.

If you want to robustly deal with possibly indented lines while remaining consistent with the format of the output of LiveHTTPHeaders, you should use something like the following:

#!/usr/bin/perl

use strict; use warnings;

local $/ = "\n\n";
while (<>) {
    next unless /^\s*(?:GET|POST)/;
    s!^[ \t]+!!gm;   # [ \t] rather than \s, so the blank separator line is kept
    print;
}

I consider using sed and perl in the same pipeline to be a little bit of an abomination.

Just one gawk command:

awk -vRS= '/^(GET|POST)/' ORS="\n\n" file
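For anyone who wants to try the awk approach on a small input first, here is a sketch. The names keep_requests and sample_log and the shortened sample are invented for illustration, and the -v RS= -v ORS='\n\n' spelling is only a variation on the command above that should also be accepted by awks other than gawk. An empty RS puts awk into paragraph mode, so each block separated by empty lines is one record, and ^ anchors at the start of that record.

```shell
# Hypothetical wrapper: print only the paragraphs that start with GET or POST.
keep_requests() {
    awk -v RS= -v ORS='\n\n' '/^(GET|POST)/'
}

# Made-up, shortened sample in the LiveHTTPHeaders layout.
sample_log() {
    cat <<'EOF'
http://stackoverflow.com/

GET / HTTP/1.1
Host: stackoverflow.com

HTTP/1.x 200 OK
Content-Type: text/html; charset=utf-8
----------------------------------------------------------
EOF
}

# Prints the request paragraph followed by a blank line.
sample_log | keep_requests
```

As with the Perl versions, this relies on truly empty separator lines, so a file with whitespace-only lines may need cleaning first.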

You can use the bash shell:

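flag=0    # start outside a request so the test below never sees an unset flag
while read -r line
do
    case "$line" in
        GET*|POST*) flag=1;;    # a request starts here
        "") flag=0;;            # a blank line ends the current block
    esac
    [ "$flag" -eq 1 ] && echo "$line"
done < "file"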

Run Sinan's code as:

perl test.pl < infile.txt > outfile.txt

Licensed under: CC-BY-SA with attribution
