With "three page subset sequence" I am guessing they mean the three pages must be next to each other, but their internal order does not matter. (A B C
= C A B
)
public Tuple<string,string,string> GetMostFrequentTriplet(
IEnumerable<LogEntry> entries,
TimeSpan? maxSeparation = null)
{
// Assuming 'entries' is already ordered by timestamp
// Store the last two URLs for each user
var lastTwoUrls = new Dictionary<int,Tuple<string,string,DateTime>>();
// Count the number of occurences of each triplet of URLs
var counters = new Dictionary<Tuple<string,string,string>,int>();
foreach (var entry in entries)
{
Tuple<string,string,DateTime> lastTwo;
if (!lastTwoUrls.TryGetValue(entry.UserId, out lastTwo))
{
// No previous URLs
lastTwoUrls[entry.UserId] = Tuple.Create((string) null, entry.Url, entry.Timestamp);
}
// (comparison with null => false)
else if (entry.Timestamp - lastTwo.Item3 > maxSeparation) {
// Treat a longer separation than maxSeparation as two different sessions.
lastTwoUrls[entry.UserId] = Tuple.Create((string) null, entry.Url, entry.Timestamp);
}
else
{
// One or two previous URLs
if (lastTwo.Item1 != null)
{
// Two previous URLs; Three with this one.
// Sort the three URLs, so that their internal order won't matter
var urls = new List<string> { lastTwo.Item1, lastTwo.Item2, entry.Url };
urls.Sort();
var key = Tuple.Create(urls[0], urls[1], urls[2]);
// Increment count
int count;
counters.TryGetValue(key, out count); // sets to 0 if not found
counters[key] = count + 1;
}
// Shift in the new value, discarding the oldest one.
lastTwoUrls[entry.UserId] = Tuple.Create(lastTwo.Item2, entry.Url, entry.Timestamp);
}
}
Tuple<string,string,string> maxKey = null;
int maxCount = 0;
// Find the key with the maximum count
foreach (var pair in counters)
{
if (maxKey == null || pair.Value > maxCount)
{
maxKey = pair.Key;
maxCount = pair.Value;
}
}
return maxKey;
}
The code goes over the log entries and separates the stream for each user. For each three consecutive URLs for any user, we increment the count for that triplet. Since the order of the three pages are not important, we reorder them in a consistent way, by sorting. In the end, we return the triplet that has the highest count.
Since we only need the last three URLs for each user, we only store the previous two. Combined with the current URL, that makes the triplet we need.
For n URLs, m unique URLs, u users, and s single-visit users, the method will do 2 n - 2 u + s (= O(n)) dictionary lookups, and store up to C(m,3) + u (= O(m3 + u)) tuples.
Edit:
Infer sessions by the duration between requests. If they differ by more than maxSeparation
, the new request is treated as the first from that user.