Question

Given a set of 1 million (very large) no. of URL's. Find the "first" "unique" URL from the list.

My Approach: Build a hash using perfect hashing function, that can help. But my question is to hash large data is not possible., then how can I solve this question.

Is there any method to do inplace? Please help. Thanks in advance.

Was it helpful?

Solution

Given an input list of ["c","a","b","a","c"], my first approach would be:

  1. Convert the list of URLs into a list of tuples which associates each element which its position in the list. Now you have [(0,"c"),(1,"a"),(2,"b"),(3,"a"),(4,"c")].
  2. Sort the list lexicographically by the second tuple element (the URL). Now you have [(1,"a"),(3,"a"),(2,"b"),(0,"c"),(4,"c")].
  3. Group sequences of subsequent equal tuples (a tuple is equal if the second element equals) into sub-lists. Now you have [[(1,"a"),(3,"a")],[(2,"b")],[(0,"c"),(4,"c")]].
  4. Filter the list so that you only have lists of length 1. Now you have [[(2,"b")]].
  5. If the resulting list is empty, there is no unique URL in the list. If it is non-empty, Sort the list by the first tuple element (the position in the string). In this case, you get the same list back - [[(2,"b")]].
  6. take the first element of the list. Now you have [(2,"b")].
  7. The (only) tuple in this list tells you the first unique URL, and the position in the input list: it's the URL b at position 2 in the input list.
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top