Auto-tokenize user agent strings for statistics?

https://stackoverflow.com/questions/1948235

21-09-2019

Question

We keep track of user agent strings on our website. I want to run some statistics on them, to see how many IE6 users we have (so we know what we have to develop against), and also how many mobile users we have.

So we have log entries like this:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0; .NET CLR 2.0.50727)

Ideally, I would like to see all the 'meaningful' strings, which probably just means strings longer than a certain length. For instance, I might like to see how many entries contain FunWebProducts, or .NET CLR, or .NET CLR 1.0.3705, but I don't want to see how many contain a semicolon. So I'm not necessarily looking for unique strings, but for all strings, including substrings of other strings. For example, I would want to see the count for Mozilla, knowing that it includes the counts for Mozilla/5.0 and Mozilla/4.0. A nested display would be nice, starting with the shortest strings and working its way down. Something perhaps like:

4,2093 Mozilla
 1,093 Mozilla/5.0
    468 Mozilla/5.0 (Windows;
     47 Mozilla/5.0 (Windows; U 
 2,398 Mozilla/4.0

This sounds like a computer science homework problem. What would this be called? Does something like this already exist, or do I write my own?

Solution

You are looking at a longest common substring problem, or, given your specific example above, a longest common prefix problem, which can be approached with a trie.
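As a rough illustration of the trie idea, here is a minimal Python sketch. It assumes tokens are split on the characters [ ;/], and the names tokenize, insert and lookup are made up for this example. Each node stores how many records passed through it, so the count for a prefix is the count on its last token.

```python
import re

def tokenize(ua):
    # Assumption: tokens are runs of characters other than space, ';' and '/'.
    return re.findall(r"[^ ;/]+", ua)

def insert(trie, tokens):
    # Walk down the trie, creating nodes as needed and bumping each count.
    node = trie
    for tok in tokens:
        child = node.setdefault(tok, {"count": 0, "children": {}})
        child["count"] += 1
        node = child["children"]

def lookup(trie, tokens):
    # Return how many inserted records start with this token sequence.
    node, count = trie, 0
    for tok in tokens:
        if tok not in node:
            return 0
        count = node[tok]["count"]
        node = node[tok]["children"]
    return count

trie = {}
for ua in [
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1)",
]:
    insert(trie, tokenize(ua))
```

A nested display like the one in the question would then be a depth-first walk over the nodes, indenting by depth.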

However, going from your example above, you probably don't even need to be efficient about this. Instead, simply:

1. Tokenize the strings on some subset of punctuation, like [ ;/]
2. Save each unique prefix of one or more tokens, putting the original delimiters back between the tokens
3. For each prefix, count the records it matches and save that count
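The steps above could be sketched in Python roughly as follows. This is one possible reading, not a definitive implementation: the function names are invented here, and a prefix is taken to end at a token boundary, so the trailing delimiter is not part of it.

```python
import re
from collections import Counter

# Capturing group so re.split keeps the delimiters in the result.
DELIMS = re.compile(r"([ ;/])")

def prefixes(ua):
    # Tokens land at even indexes, delimiters at odd indexes.
    parts = DELIMS.split(ua)
    found = []
    current = ""
    for i, part in enumerate(parts):
        current += part
        if i % 2 == 0 and part:  # skip empty tokens, e.g. between "; "
            found.append(current)
    return found

def prefix_counts(user_agents):
    # Count each prefix once per record, even if it could repeat.
    counts = Counter()
    for ua in user_agents:
        counts.update(set(prefixes(ua)))
    return counts

logs = [
    "Mozilla/4.0 (compatible; MSIE 7.0)",
    "Mozilla/4.0 (compatible; MSIE 6.0)",
    "Mozilla/5.0 (Windows; U)",
]
counts = prefix_counts(logs)
for prefix in sorted(counts):
    print(f"{counts[prefix]:>6} {prefix}")
```

Sorting the prefixes as plain strings already puts each prefix directly above its extensions, which gets close to the nested display asked for in the question.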

OTHER TIPS

If you break it up into the major name (part before the opening paren), and then store each part separated by semicolon as a child record, you could do whatever analysis you want. For example, store it in a relational database:

BrowserID   BrowserText
---------   -----------
1           Mozilla/4.0
2           Mozilla/5.0

FeatureID   FeatureText
---------   -----------
1           compatible
2           MSIE 7.0
3           Windows NT 5.1
4           FunWebProducts
5           .NET CLR 1.0.3705
6           .NET CLR 1.1.4322
7           Media Center PC 4.0
8           .NET CLR 2.0.50727

Then log references to the browser and its parts, and you can run any type of analysis you want.
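A minimal sketch of this layout, using Python's built-in sqlite3 module. The Browser and Feature tables follow the example above; the Hit and HitFeature tables and all function names are assumptions added here to hold the logged references. It also assumes the features are whatever sits inside the first pair of parentheses.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Browser (BrowserID INTEGER PRIMARY KEY, BrowserText TEXT UNIQUE);
CREATE TABLE Feature (FeatureID INTEGER PRIMARY KEY, FeatureText TEXT UNIQUE);
-- Hypothetical log tables: one row per request, one row per feature seen.
CREATE TABLE Hit (HitID INTEGER PRIMARY KEY, BrowserID INTEGER REFERENCES Browser);
CREATE TABLE HitFeature (HitID INTEGER REFERENCES Hit, FeatureID INTEGER REFERENCES Feature);
"""

def split_ua(ua):
    # Major name is the part before "("; features are split on ";".
    head, _, rest = ua.partition("(")
    inside = rest.partition(")")[0]
    return head.strip(), [f.strip() for f in inside.split(";") if f.strip()]

def get_id(conn, table, id_col, text_col, text):
    # Insert the text if it is new, then return its ID.
    conn.execute(f"INSERT OR IGNORE INTO {table} ({text_col}) VALUES (?)", (text,))
    row = conn.execute(
        f"SELECT {id_col} FROM {table} WHERE {text_col} = ?", (text,)
    ).fetchone()
    return row[0]

def log_hit(conn, ua):
    browser, features = split_ua(ua)
    browser_id = get_id(conn, "Browser", "BrowserID", "BrowserText", browser)
    hit_id = conn.execute(
        "INSERT INTO Hit (BrowserID) VALUES (?)", (browser_id,)
    ).lastrowid
    for feature in features:
        feature_id = get_id(conn, "Feature", "FeatureID", "FeatureText", feature)
        conn.execute("INSERT INTO HitFeature VALUES (?, ?)", (hit_id, feature_id))

def feature_counts(conn):
    # Number of logged hits that carried each feature.
    rows = conn.execute(
        "SELECT f.FeatureText, COUNT(DISTINCT hf.HitID) "
        "FROM HitFeature hf JOIN Feature f ON f.FeatureID = hf.FeatureID "
        "GROUP BY f.FeatureText"
    )
    return dict(rows)

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
log_hit(conn, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts)")
log_hit(conn, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts; .NET CLR 1.0.3705)")
```

With the data in this shape, questions like "how many hits had FunWebProducts" become ordinary GROUP BY queries.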

What about using a regex to parse the user agent string into its relevant component parts? The basic form of each part of a user agent string is '[name]/[version]' or '[name] [version]'. With that information we can use a regex like ([^\(\)\/\\;\n]+)([ ]((?=\d*\.+\d*|\d*_+\d*)[\d\.Xx_]+)|[/]([^\(\)\/; \n]+)) to get match sets where the first match in a set is the [name] and the second match in a set is the [version]. Of course, you'll have to strip the leading space or / from the second match in the set, or modify the regex to use lookbehind (which several regex flavors don't support, so I didn't include it here).

After you get all these tuples you can manipulate and count them however you want.
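As a sketch of how this might look in Python, the snippet below uses the pattern from this answer unchanged with the standard re module. The function names are invented for the example. Rather than stripping the space or / from the second group, it reads the version from the inner groups (3 and 4), and it strips the leading space that the name group picks up after a semicolon.

```python
import re
from collections import Counter

# The pattern from the answer, split across two raw strings for readability.
UA_PART = re.compile(
    r"([^\(\)\/\\;\n]+)"
    r"([ ]((?=\d*\.+\d*|\d*_+\d*)[\d\.Xx_]+)|[/]([^\(\)\/; \n]+))"
)

def parse_parts(ua):
    # Return (name, version) tuples in the order they appear.
    parts = []
    for m in UA_PART.finditer(ua):
        name = m.group(1).strip()
        # Group 3 is set for "name version", group 4 for "name/version".
        version = m.group(3) or m.group(4)
        parts.append((name, version))
    return parts

def count_parts(user_agents):
    # Count each (name, version) tuple once per user agent string.
    counts = Counter()
    for ua in user_agents:
        counts.update(set(parse_parts(ua)))
    return counts

ua1 = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts)"
ua2 = ("Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts; "
       ".NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0; .NET CLR 2.0.50727)")
part_counts = count_parts([ua1, ua2])
```

Note that parts without a version, such as compatible or FunWebProducts, do not match this pattern at all, so they would need separate handling if you want to count them.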

Licensed under: CC-BY-SA with attribution
