Amazon interview: Timestamp sorting: Find the three page subset sequence repeated maximum number of times [closed]

StackOverflow https://stackoverflow.com/questions/21155884

  •  28-09-2022

Question

The Amazon interview question is:

Given a log file containing (User_Id, URL, Timestamp) records, where a user can navigate from one page to another, find the three page subset sequence repeated the maximum number of times. Records are sorted by Timestamp.

I found this question in this Reddit thread.

The poster stated the same problem, adding:

(Although I wasn't told until late in the interview that the file was sorted by timestamp. One of the first things I had asked was whether the log was sorted, and my interviewer said no.)

I do think I gave it my all -- I seemed to be on the right track using a hashmap. I always let my interviewer know what I was thinking and gave possible outcomes, time complexities, etc.

I am not sure how to approach this problem. What does "Find the three page subset sequence repeated maximum number of times" mean? And if the question didn't say "Records are sorted by Timestamp" (as happened to the poster), then how would that affect the problem?


Solution

By "three page subset sequence" I am guessing they mean the three pages must be next to each other, but their internal order does not matter (A B C = C A B).

using System;
using System.Collections.Generic;

// LogEntry is assumed to expose: int UserId, string Url, DateTime Timestamp
public Tuple<string,string,string> GetMostFrequentTriplet(
        IEnumerable<LogEntry> entries,
        TimeSpan? maxSeparation = null)
{
    // Assuming 'entries' is already ordered by timestamp

    // The last two URLs (and the latest timestamp) seen for each user
    var lastTwoUrls = new Dictionary<int,Tuple<string,string,DateTime>>();
    // The number of occurrences of each triplet of URLs
    var counters = new Dictionary<Tuple<string,string,string>,int>();

    foreach (var entry in entries)
    {
        Tuple<string,string,DateTime> lastTwo;
        if (!lastTwoUrls.TryGetValue(entry.UserId, out lastTwo))
        {
            // First URL seen for this user
            lastTwoUrls[entry.UserId] = Tuple.Create((string) null, entry.Url, entry.Timestamp);
        }
        else if (entry.Timestamp - lastTwo.Item3 > maxSeparation)
        {
            // Treat a separation longer than maxSeparation as two different
            // sessions and start over. When maxSeparation is null, this lifted
            // comparison is always false, so sessions are never split.
            lastTwoUrls[entry.UserId] = Tuple.Create((string) null, entry.Url, entry.Timestamp);
        }
        else
        {
            // One or two previous URLs
            if (lastTwo.Item1 != null)
            {
                // Two previous URLs; three with this one.

                // Sort the three URLs, so that their internal order won't matter
                var urls = new List<string> { lastTwo.Item1, lastTwo.Item2, entry.Url };
                urls.Sort();
                var key = Tuple.Create(urls[0], urls[1], urls[2]);

                // Increment count
                int count;
                counters.TryGetValue(key, out count); // leaves count at 0 if not found
                counters[key] = count + 1;
            }

            // Shift in the new URL, discarding the oldest one
            lastTwoUrls[entry.UserId] = Tuple.Create(lastTwo.Item2, entry.Url, entry.Timestamp);
        }
    }

    // Find the key with the maximum count (null if there was no triplet)
    Tuple<string,string,string> maxKey = null;
    int maxCount = 0;

    foreach (var pair in counters)
    {
        if (maxKey == null || pair.Value > maxCount)
        {
            maxKey = pair.Key;
            maxCount = pair.Value;
        }
    }

    return maxKey;
}

The code goes over the log entries and separates the stream for each user. For every three consecutive URLs from the same user, we increment the count for that triplet. Since the order of the three pages is not important, we reorder them in a consistent way, by sorting. In the end, we return the triplet that has the highest count.

Since we only need the last three URLs for each user, we only store the previous two. Combined with the current URL, that makes the triplet we need.
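
For illustration, here is a minimal sketch of how the method might be called. The LogEntry class and the sample data are assumptions for the sketch; the original answer does not define them.

using System;
using System.Collections.Generic;

// Sample log, already sorted by timestamp
var t = new DateTime(2022, 9, 28, 12, 0, 0);
var entries = new List<LogEntry>
{
    new LogEntry { UserId = 1, Url = "/a", Timestamp = t },
    new LogEntry { UserId = 1, Url = "/b", Timestamp = t.AddMinutes(1) },
    new LogEntry { UserId = 1, Url = "/c", Timestamp = t.AddMinutes(2) },
    new LogEntry { UserId = 2, Url = "/c", Timestamp = t.AddMinutes(3) },
    new LogEntry { UserId = 2, Url = "/a", Timestamp = t.AddMinutes(4) },
    new LogEntry { UserId = 2, Url = "/b", Timestamp = t.AddMinutes(5) },
};

// Both users visited { /a, /b, /c } consecutively, in different orders,
// so the sorted key ("/a", "/b", "/c") is counted twice and wins.
var best = GetMostFrequentTriplet(entries); // assumes the method above is in scope
Console.WriteLine(best); // prints (/a, /b, /c)

// Assumed shape of a log record (not defined in the original answer)
public class LogEntry
{
    public int UserId { get; set; }
    public string Url { get; set; }
    public DateTime Timestamp { get; set; }
}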

For n URLs, m unique URLs, u users, and s single-visit users, the method does 2n - 2u + s (= O(n)) dictionary lookups, and stores up to C(m,3) + u (= O(m³ + u)) tuples.
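
To see where the lookup count comes from: each of the n entries triggers one lookup in lastTwoUrls, and a user with k page views contributes max(k - 2, 0) lookups in counters. Summing k - 2 over all users gives n - 2u, and the + s corrects for single-visit users, whose k - 2 would be -1 rather than 0 (ignoring session splits, which only reduce the count). For example, u = 3 users with 5, 3, and 1 visits give n = 9 and s = 1, hence 9 - 6 + 1 = 4 triplet lookups and 2·9 - 6 + 1 = 13 lookups in total.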

Edit: sessions are inferred from the time between requests. If two consecutive requests from the same user are separated by more than maxSeparation, the new request is treated as the first in a new session.
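
Regarding the asker's follow-up: if the records were not sorted by timestamp, nothing fundamental changes; you would sort them first, paying O(n log n) up front, and then run the same O(n) scan. A minimal sketch, reusing the hypothetical entries list from above and an arbitrary 30-minute session cutoff:

using System;
using System.Linq;

// OrderBy is a stable sort, so entries with equal timestamps keep their order.
var ordered = entries.OrderBy(e => e.Timestamp).ToList();
var result = GetMostFrequentTriplet(ordered, TimeSpan.FromMinutes(30));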

Licensed under: CC-BY-SA with attribution