SED or AWK to make url querystring readable

https://stackoverflow.com/questions/13185155

26-07-2021

Question

I need to split a query string into an arbitrary number of name/value pairs for debugging purposes.

The output comes from tshark, and the purpose is to live-debug Google Analytics events. The output from tshark looks like this:

82.387501       hampus -> domain.net 1261 GET /__utm.gif?utmwv=5.3.7&utms=22&utmn=1234&utmhn=domain.com&utmt=event&utme=5(x*y*z%2Fstart%2Fklipp%2F166_SS%20example)(10)&utmcs=UTF-8~ HTTP/1.1

What I want is a more human-readable version:

utmhn:  domain.com
utmt:   event
utme:   5(x*y*z/start/klipp/166_SS/example)(10)
utmcs:  UTF-8

or even better:

utmhn:  domain.com
utmt:   event
utme:   5(
          x
          y
          z/start/klipp/166_SS/example
         )(10)
utmcs:  UTF-8

But I can't get my head around sed (or awk) for this purpose.

Solution

Another way, using Perl:

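#!/usr/bin/perl -l
use strict; use warnings;

while (<>) {
    my @arr;
    # Grab the query string: everything between "?" and the next whitespace.
    # Lines without a GET query are skipped.
    my ($qs) = m/.*?GET.*?\?(\S+)\s/ or next;
    my @pairs = split(/[&~]/, $qs);
    foreach my $pair (@pairs) {
        # Limit the split to 2 fields so a value may itself contain "="
        my ($name, $value) = split(/=/, $pair, 2);
        next unless defined $value;
        if ($name eq 'utme') {
            $value =~ s!(%2F|%20)!/!g;     # both escapes become "/", as in the question
            $value =~ s!\*!\n\t\t!g;       # one item per line
            $value =~ s!\(!(\n\t\t!;       # line break after the first "("
            $value =~ s/\)\(/\n\t)(/;      # line break before ")("
        }
        # let's URI unescape whatever is left
        $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
        if ($name eq 'utmhn') {
            print "$name: $value";         # host name goes out first
        }
        else {
            push @arr, "$name: $value";
        }
    }

    print join "\n", @arr;
    print "\n";
}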

OUTPUT

utmhn: domain.com
utmwv: 5.3.7
utms: 22
utmn: 1234
utmt: event
utme: 5(
                x
                y
                z/start/klipp/166_SS/example
        )(10)
utmcs: UTF-8

USAGE

tshark ... | ./script.pl

ADVANTAGES

I take care to display utmhn: domain.com on the first line
I run a URI unescape on the values
It's not limited to "utmhn", "utmt", "utme" and "utmcs"; every parameter is printed
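
The URI-unescape step does not strictly need Perl. Here is a minimal sketch of the same idea in POSIX awk, wrapped in a shell function. The name urldecode is made up for this example, and the sketch assumes the escapes are plain ASCII; bytes above 0x7F (multi-byte UTF-8) may not round-trip in every awk implementation.

```shell
# urldecode: hypothetical helper name. Reads lines on stdin and writes them
# back with every %XX escape replaced by the character it encodes.
urldecode() {
    awk '
    # Convert one hex digit to its numeric value (0 to 15).
    function hexval(c) { return index("0123456789ABCDEF", toupper(c)) - 1 }
    {
        rest = $0
        out = ""
        # Consume the line escape by escape, copying the text in between.
        while (match(rest, /%[0-9A-Fa-f][0-9A-Fa-f]/)) {
            code = hexval(substr(rest, RSTART + 1, 1)) * 16 + hexval(substr(rest, RSTART + 2, 1))
            out = out substr(rest, 1, RSTART - 1) sprintf("%c", code)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }'
}

# Usage: decode the escaped part of the sample utme value.
printf '%s\n' 'z%2Fstart%2Fklipp%2F166_SS%20example' | urldecode
# prints: z/start/klipp/166_SS example
```

Note that a faithful decode turns %20 into a space, whereas the Perl script above deliberately maps it to "/" inside utme to match the output requested in the question.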

OTHER TIPS

file

82.387501       hampus -> domain.net 1261 GET /__utm.gif?utmwv=5.3.7&utms=22&utmn=1234&utmhn=domain.com&utmt=event&utme=5(x*y*z%2Fstart%2Fklipp%2F166_SS%20example)(10)&utmcs=UTF-8~ HTTP/1.1

command

 sed 's/.*utmhn=/utmhn:    /
     s/&utmt=/\nutmt:     /
     s/&utme=/\nutme:     /
     s/&utmcs=/\nutmcs:    /
     s:[%]2F:/:g
     s:[%]20: :g
     s:[\(]:(\n\t    :
     s:\*:\n\t    :g
     s:[\)]:\n\t  ):
     s/[~].*$//' samp1.txt

output

utmhn:    domain.com
utmt:     event
utme:     5(
            x
            y
            z/start/klipp/166_SS example
          )(10)
utmcs:    UTF-8

I'm not sure what to make of the %20 in your sample data versus the '/' character in your expected result: %20 decodes to a space, not a slash. Did you type some of this in manually?

Here's one way using GNU awk. Run like:

awk -f script.awk file.txt

Contents of script.awk:

BEGIN {
    FS="[ \t=&~]+"
    OFS="\t"
}

{
    for (i=1; i<=NF; i++) {
        if ($i ~ /^utmhn$|^utmt$|^utme$|^utmcs$/) {

             if ($i == "utme") {
                 sub(/\(/,"(\n\t  ", $(i+1))
                 gsub(/\*/,"\n\t  ", $(i+1))
                 sub(/\)/,"\n\t )", $(i+1))
             }

             print $i":", $(i+1)
        }
    }
}

Results:

utmhn:  domain.com
utmt:   event
utme:   5(
          x
          y
          z%2Fstart%2Fklipp%2F166_SS%20example
         )(10)
utmcs:  UTF-8

Alternatively, here's the one-liner:

awk 'BEGIN { FS="[ \t=&~]+"; OFS="\t" } { for (i=1; i<=NF; i++) { if ($i ~ /^utmhn$|^utmt$|^utme$|^utmcs$/) { if ($i == "utme") { sub(/\(/,"(\n\t  ", $(i+1)); gsub(/\*/,"\n\t  ", $(i+1)); sub(/\)/,"\n\t )", $(i+1)) } print $i":", $(i+1) } } }' file.txt

Assuming your data is in a file called "tst":

awk -F "&" '{ for ( i=2;i<=NF;i++ ){sub(/=/,":\t",$i);sub(/[~].*$/,"",$i);gsub(/%2F/,"/",$i);gsub(/%20/," ",$i);print $i} }' tst

produces output:

utms:   22
utmn:   1234
utmhn:  domain.com
utmt:   event
utme:   5(x*y*z/start/klipp/166_SS example)(10)
utmcs:  UTF-8

It's a bit dirty, but it works. Note that the first parameter (utmwv) is skipped, because the loop starts at field 2.
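
A slightly more defensive take on the same "print every parameter" idea is sketched below in POSIX awk. It is an illustration rather than a drop-in replacement: the function name qs_split is made up, and it assumes the query string starts at the first "?" on the line and ends at the next space or "~", as in the sample tshark output.

```shell
# qs_split: hypothetical helper name. Reads request lines on stdin and prints
# one "name:<TAB>value" line per query-string parameter.
qs_split() {
    awk '
    {
        # Keep only the text between the first "?" and the next space or "~".
        q = index($0, "?")
        if (q == 0) next
        qs = substr($0, q + 1)
        sub(/[ ~].*$/, "", qs)

        n = split(qs, pairs, "&")
        for (i = 1; i <= n; i++) {
            eq = index(pairs[i], "=")
            if (eq == 0) { print pairs[i] ":"; continue }
            # Split at the first "=" only, so a value may itself contain "=".
            printf "%s:\t%s\n", substr(pairs[i], 1, eq - 1), substr(pairs[i], eq + 1)
        }
    }'
}

# Usage with a shortened version of the sample line:
printf '%s\n' 'GET /__utm.gif?utmhn=domain.com&utmt=event&utmcs=UTF-8~ HTTP/1.1' | qs_split
```

Unlike the one-liner, this keeps the first parameter and does not decode percent-escapes; decoding can be added as a separate step.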

$ cat tst.awk
BEGIN { FS="[&=~]"; OFS=":\t" }
{
   for (i=1;i<=NF;i++) {
      map[$i]=$(i+1)
   }

   sub(/\(/,"&\n\t  ", map["utme"])
   gsub(/\*/,"\n\t  ", map["utme"])
   gsub(/%2./,"/",     map["utme"])
   sub(/\)/,"\n\t&",   map["utme"])

   print "utmhn", map["utmhn"]
   print "utmt",  map["utmt"]
   print "utme",  map["utme"]
   print "utmcs", map["utmcs"]
}
$
$ awk -f tst.awk file
utmhn:  domain.com
utmt:   event
utme:   5(
          x
          y
          z/start/klipp/166_SS/example
        )(10)
utmcs:  UTF-8

This might work for you (GNU sed):

sed 's/.*\(utmhn.*=\S*\).*/\1/;s/&/\n/g;s/=/:\t/g;s/(/&\n\t/;s/*/\n\t/g;s/%2F/\//g;s/%20/ /g;s/)/\n\t&/' file

Licensed under: CC-BY-SA with attribution
