Frage

I have a huge set (S) of long unsigned integers in a .txt file. How can I find the max subset (Pmax) of S with the following property:

P{X1,X2,X3,...,Xn) | X1>=(Xn/4)

More details:

  1. When I say max subset I mean the subset with the greatest number of elements(n->max).
  2. I can't load .txt into an array because of limited memory.
  3. My system memory is 200MB
  4. txt file has 10^6 integers. Each integer can be long unsigned 32bit.
  5. I need to find the biggest subset of S with the condition:

X1 < X2 < X3 < ... < Xn-1 < Xn such as X1 >= (XN/4)

For example if the txt file has the following: 15,14,13,4,2,2,3,10,1,2,2 then these are the possible subsets:

P1(4,10,13,14,15)

P2(3,4,10)

P3(1,2,2,2,2,3,4)

so Pmax(1,2,2,2,2,3,4) because it has more elements.

In fact I don't want to find exactly which is the Pmax. I just want to find the number of elements of the subset Pmax. So here it is 7.

The algorithm should be really fast.

I don't look for someone to do my work. I just need a corresponding problem so I can look for the efficient solution. Thanks in advance!!!

War es hilfreich?

Lösung 2

The easiest solution is:

  1. Sort the list first (Complexity O(nlogn)
  2. With a moving window, find the largest acceptable window. (Complexity O(n))

Complexity: O(nlogn).

More details about step2:

Let low keep track of the lowest element and high the highest element.

Initialization: Set low to the first element. Do a binary search for 4*x[low], and that is your high location. Set maxWindow=high-low+1.

At every step: Increment high by 1, and increment low such that x[low]>=x[high]. Calculate number of elements = high-low+1, and update maxWindow accordingly.

Andere Tipps

Assuming your condition means "where all elements in the subset are larger than X1 divided by 4" you'd need 2 simple nested loops and some helper variables.

In pseudocode something like this should work:

var idx = 0, largest = 0, currentIdx = 0;

while(var current = getIntegerFromFileById(currentIdx))
{
  var size = 1;
  while(getIntegerFromFileById(currentIdx + size++) > current / 4);
  if(size > largest) {
    idx = currentIdx;
    largest = size;
  }
  currentIdx++;
}
print "Longest subset is at index {idx}.";
print "It contains {largest} consecutive elements.";

This is also the de facto optimal implementation. The most obvious optimization would be to load the integers progressively in an in-memory buffer during the scan to prevent double I/O operations.

In case I misunderstood the condition this should still be easily adaptable to most other conditions, the surrounding algorithm stays the same, you just modify the condition in the inner while.

Lizenziert unter: CC-BY-SA mit Zuschreibung
Nicht verbunden mit StackOverflow
scroll top