Question

Say you have a webserver log (apache, nginx, whatever). From it you extract a large list of URLs:

/article/1/view
/article/2/view
/article/1/view
/article/1323/view
/article/1/edit
/help
/article/1/view
/contact
/contact/thank-you
/article/8/edit
...

or

/blog/2012/06/01/how-i-will-spend-my-summer-vacation
/blog/2012/08/30/how-i-wasted-my-summer-vacation
...

You explode these URLs into their pieces such that you have ['article', '1323', 'view'] or ['blog', '2012', '08', '30', 'how-i-wasted-my-summer-vacation'].
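For reference, a minimal way to do that split in Python (the helper name explode is just for illustration):

def explode(path):
    # Split a URL path into its non-empty pieces.
    return [piece for piece in path.split('/') if piece]

explode('/blog/2012/08/30/how-i-wasted-my-summer-vacation')
# -> ['blog', '2012', '08', '30', 'how-i-wasted-my-summer-vacation']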

How would one go about analyzing and comparing these URLs to detect and call out "variables" in the URL path? That is to say, you would want to recognize things like /article/XXX/view, /article/XXX/edit, and /blog/XXX/XXX/XXX/XXX so that you can summarize information about those lines in the logs.

I assume there will need to be some statistical threshold for the number of differences that distinguishes a mutable piece from a similar-looking but different template. I am also unsure what data structure would make this analysis quick and easy.

I would like the script to output what it thinks are all the URL templates present on the server, possibly with some confidence value if appropriate.


Solution

How about using a DAWG (directed acyclic word graph)? Except the nodes would store not letters, but URI pieces. Like this:

[Figure: a DAWG whose nodes hold URI pieces such as "article", "1", "view" instead of letters]

This is a very nice data structure: it has pretty minimal memory requirements, it's easy to traverse, and, being a DAG, there are plenty of easy and well-researched algorithms for it. It also happens to describe the state machine that accepts all URLs in the sample and rejects all others (so we might actually build a regular expression out of it, which is very neat, but I'm not clever enough to know how to go about it from there).

Anyhow, with a structure like this, your problem translates into finding the "bottlenecks". I'd guess there are proper algorithms for that, but with a large enough sample where the variables vary wildly enough, it basically comes down to this: the more nodes there are at a certain depth, the more likely it is to be a mutable part.

A probably naive approach to do it would be like this: keeping a separate DAWG for every starting part, I'd find the mean width of the DAWG (possibly weighted by depth). And if a level's width is above that mean, I'd consider it a variable, with the probability depending on how far away it is from the mean. You may very well unleash the power of statistics at this point, modeling the distribution of the widths.
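A rough sketch of that heuristic in Python, using a plain nested-dict trie as a stand-in for the DAWG; the above-the-mean test and the helper names are illustrative choices, not a definitive implementation:

from collections import defaultdict

def build_tries(paths):
    # One nested-dict trie per starting part; a trie approximates the DAWG.
    tries = defaultdict(dict)
    for path in paths:
        pieces = [p for p in path.split('/') if p]
        if not pieces:
            continue
        node = tries[pieces[0]]
        for piece in pieces[1:]:
            node = node.setdefault(piece, {})
    return tries

def level_widths(trie):
    # Count the distinct piece values at each depth below the starting part;
    # distinct values approximate the merged nodes of a DAWG level.
    widths, level = [], [trie]
    while True:
        values = {key for node in level for key in node}
        if not values:
            break
        widths.append(len(values))
        level = [child for node in level for child in node.values()]
    return widths

def flag_levels(trie):
    # Flag levels wider than the mean width as likely variables.
    widths = level_widths(trie)
    if not widths:
        return []
    mean = sum(widths) / len(widths)
    return [(depth, w, w > mean) for depth, w in enumerate(widths, start=2)]

paths = ['/article/1/view', '/article/2/view', '/article/1/view',
         '/article/1323/view', '/article/1/edit', '/article/8/edit']
for first, trie in build_tries(paths).items():
    print(first, flag_levels(trie))
# article [(2, 4, True), (3, 2, False)]  -> depth 2 looks like a variable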

This approach wouldn't fare well with independent patterns starting with the same part, like "shop/?/?" and "shop/admin/?/edit". This could perhaps be mitigated by examining the DAWGs in a more dynamic fashion, using a sliding window of sorts, always examining only a part of the DAWG at once, but I don't know how. Oh, and the whole thing fails horribly if the very first part is a variable, but that's thankfully rare.

You may also look out for certain little things, like all nodes of the same level having numerical values (more likely to be a variable). And I'd certainly check for common date patterns in the sample before building the DAWGs; factoring them out would make handling the blog-like patterns easier.
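Those two checks might look something like this in Python; the year pattern and the <date> token are assumptions for illustration, and factor_out_dates expects year/month/day as three consecutive pieces, as in the blog example:

import re

YEAR = re.compile(r'^(19|20)\d{2}$')  # illustrative; adjust to your sample

def all_numeric(level_values):
    # A level whose values are all numeric (IDs, years, ...) is a strong
    # variable candidate.
    return all(value.isdigit() for value in level_values)

def factor_out_dates(pieces):
    # Collapse year/month/day runs like ['2012', '06', '01'] into one token
    # before building the DAWGs.
    out, i = [], 0
    while i < len(pieces):
        if (i + 2 < len(pieces) and YEAR.match(pieces[i])
                and pieces[i + 1].isdigit() and pieces[i + 2].isdigit()):
            out.append('<date>')
            i += 3
        else:
            out.append(pieces[i])
            i += 1
    return out

factor_out_dates(['blog', '2012', '06', '01', 'how-i-will-spend-my-summer-vacation'])
# -> ['blog', '<date>', 'how-i-will-spend-my-summer-vacation']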

(Oh and, adding the "algorithm" tag would probably attract more attention to the question.)

OTHER TIPS

A simple solution would be to count path occurrences and learn which values correspond to templates. Assume that the file input contains the URLs from your first snippet. Then compute the visit count for every path prefix:

awk -F '/' '{ for (i=2; i<=NF; ++i) { for (j=2; j<=i; ++j) printf "/%s", $j; printf "\n" }}' input \
    | sort \
    | uniq -c \
    | sort -rn

This yields:

7 /article
4 /article/1
3 /article/1/view
2 /contact
1 /help
1 /contact/thank-you
1 /article/8/edit
1 /article/8
1 /article/2/view
1 /article/2
1 /article/1323/view
1 /article/1323
1 /article/1/edit

Now you have a weight for each path, which you can feed into a score function f(x, y), where x represents the count and y the depth of the path. For example, the first line would result in the invocation f(7, 2) and might return a value in [0, 1], say 0.8, to tell you that the given parametrization corresponds to a template with 80% confidence. Of course, all the magic happens in f, and you would have to come up with reasonable values based on the paths that you see being accessed. To develop a good f, you could use logistic regression on a small data set and see whether it predicts the binary feature of being a template or not well.
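As a sketch of that last idea, here is what the logistic-regression route could look like with scikit-learn; the training pairs and labels below are made up purely for illustration:

from sklearn.linear_model import LogisticRegression

# Made-up training data: (count, depth) pairs hand-labeled as template or not.
X = [[7, 2], [4, 3], [3, 4], [2, 2], [1, 3], [1, 4]]
y = [1, 1, 1, 1, 0, 0]

model = LogisticRegression().fit(X, y)

def f(count, depth):
    # Probability that a path with this count and depth is a template.
    return model.predict_proba([[count, depth]])[0][1]

f(7, 2)  # -> a probability in [0, 1]; high for a frequent, shallow path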

You can also take a mundane route: just drop the tail, e.g., all paths with a count <= 1.
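Concretely, that amounts to appending one more filter to the pipeline above (the threshold of 1 is just the example value):

awk -F '/' '{ for (i=2; i<=NF; ++i) { for (j=2; j<=i; ++j) printf "/%s", $j; printf "\n" }}' input \
    | sort \
    | uniq -c \
    | sort -rn \
    | awk '$1 > 1'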

Licensed under: CC-BY-SA with attribution