Fast algorithm for repeated calculation of a percentile?

https://stackoverflow.com/questions/3738349

03-10-2019

Question

In an algorithm I have to calculate the 75th percentile of a data set every time I add a value. Right now I am doing this:

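1. Get the value x
2. Insert x at the back of an already sorted array
3. Swap x down until the array is sorted again
4. Read the element at position array[array.size * 3/4]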

Point 3 is O(n) and the rest is O(1), but this is still quite slow, especially as the array grows larger. Is there any way to optimize this?

Update

Thanks Nikita! Since I am using C++, this is the solution that is easiest to implement. Here is the code:

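#include <algorithm>   // std::push_heap, std::pop_heap
#include <functional>  // std::less, std::greater
#include <vector>

template<class T>
class IterativePercentile {
public:
  /// Percentile has to be in range [0, 1)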
  IterativePercentile(double percentile)
    : _percentile(percentile)
  { }

  // Adds a number in O(log(n))
  void add(const T& x) {
    if (_lower.empty() || x <= _lower.front()) {
      _lower.push_back(x);
      std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
    } else {
      _upper.push_back(x);
      std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
    }

    unsigned size_lower = (unsigned)((_lower.size() + _upper.size()) * _percentile) + 1;
    if (_lower.size() > size_lower) {
      // lower to upper
      std::pop_heap(_lower.begin(), _lower.end(), std::less<T>());
      _upper.push_back(_lower.back());
      std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
      _lower.pop_back();
    } else if (_lower.size() < size_lower) {
      // upper to lower
      std::pop_heap(_upper.begin(), _upper.end(), std::greater<T>());
      _lower.push_back(_upper.back());
      std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
      _upper.pop_back();
    }
  }

  /// Access the percentile in O(1); call only after at least one add()
  const T& get() const {
    return _lower.front();
  }

  void clear() {
    _lower.clear();
    _upper.clear();
  }

private:
  double _percentile;
  std::vector<T> _lower;
  std::vector<T> _upper;
};

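Solution

You can do it with two heaps. I am not sure whether there is a less "contrived" solution, but this one gives O(log n) time per insertion, and heaps are included in the standard libraries of most programming languages.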

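The first heap (heap A) contains the smallest 75% of the elements; the other heap (heap B) contains the rest (the largest 25%). The first one has its largest element on top, the second one its smallest.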

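Adding an element.

Check whether the new element x is <= max(A). If it is, add it to heap A, otherwise to heap B.
Now, if we added x to heap A and it became too big (it holds more than 75% of the elements), we need to remove the largest element from A (O(log n)) and add it to heap B (also O(log n)).
The same applies, in the other direction, if heap B became too big.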

Finding the "0.75 median"

Just take the largest element of A (or the smallest of B). This takes O(log n) or O(1) time, depending on the heap implementation.

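Edit
As Dolphin noted, we need to specify precisely how big each heap should be for every n (if we want an exact answer). For example, if size(A) = floor(n * 0.75) and size(B) is the rest, then for every n > 0, array[array.size * 3/4] = min(B).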
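The sizing rule from the edit can be sketched with std::priority_queue. This is only an illustration under stated assumptions: the class name RunningP75 and the fixed int element type are made up for the example, and the percentile is hard-wired to 3/4 so that get() matches array[array.size * 3/4].

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical helper (not from the answer): tracks the element that would
// sit at index n * 3 / 4 of the sorted data.
class RunningP75 {
public:
    void add(int x) {
        // Route x by value: heap A holds the small elements, heap B the large ones.
        if (!low_.empty() && x <= low_.top()) {
            low_.push(x);
        } else {
            high_.push(x);
        }
        // Restore the invariant size(A) == floor(n * 3 / 4).
        const std::size_t n = low_.size() + high_.size();
        const std::size_t want_low = n * 3 / 4;
        while (low_.size() > want_low) {   // A too big: move max(A) to B
            high_.push(low_.top());
            low_.pop();
        }
        while (low_.size() < want_low) {   // A too small: move min(B) to A
            low_.push(high_.top());
            high_.pop();
        }
    }

    // min(B); precondition: at least one value has been added.
    int get() const { return high_.top(); }

private:
    std::priority_queue<int> low_;                                        // heap A, max on top
    std::priority_queue<int, std::vector<int>, std::greater<int>> high_;  // heap B, min on top
};
```

Each add() moves at most a couple of elements between the heaps, so the cost stays O(log n), and get() is O(1) because std::priority_queue exposes its top element directly.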

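Other tips

A simple order statistic tree is enough for this.

A balanced version of this tree supports insert/delete and access by rank in O(log n) time. So you get not only the 75th percentile, but also the 66th, the 50th or whatever you need, without having to change your code.

If you read the 75th percentile frequently but insert only rarely, you can cache the 75th percentile element and update it during each insert/delete operation.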

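Note that most standard library tree containers (such as Java's TreeMap) are balanced search trees but do not offer access by rank, so the tree has to be augmented with subtree sizes or taken from a library that provides them.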
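As a rough C++ illustration of access by rank, the sketch below leans on the policy-based tree that ships with GCC's libstdc++. That is an assumption about the toolchain: it is an extension, not standard C++. The wrapper name RankedSet and the trick of pairing each value with a unique id (so that duplicates are kept) are choices made for this example only.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// Hypothetical wrapper around the GNU policy-based red-black tree
// (available with libstdc++ only).
class RankedSet {
public:
    // O(log n) insertion; the unique id keeps equal values as separate keys.
    void insert(int x) {
        tree_.insert(std::make_pair(x, next_id_++));
    }

    std::size_t size() const { return tree_.size(); }

    // Value at sorted index floor(size * num / den), e.g. (3, 4) for 75%.
    // Precondition: size() > 0 and num < den.
    int percentile(std::size_t num, std::size_t den) const {
        // find_by_order returns an iterator to the k-th smallest key in O(log n).
        return tree_.find_by_order(tree_.size() * num / den)->first;
    }

private:
    typedef std::pair<int, std::size_t> Key;
    typedef __gnu_pbds::tree<Key, __gnu_pbds::null_type, std::less<Key>,
                             __gnu_pbds::rb_tree_tag,
                             __gnu_pbds::tree_order_statistics_node_update> Tree;
    Tree tree_;
    std::size_t next_id_ = 0;
};
```

With this, percentile(3, 4) and percentile(1, 2) read the 75th and 50th percentile from the same structure, which is the flexibility the answer describes.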

You can use binary search to find the correct position in O(log n). However, shifting the array elements to make room is still O(n).
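A minimal C++ sketch of that idea, assuming the data lives in a std::vector<int>; the function names insert_sorted and percentile75 are made up for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: insert x into an already sorted vector, keeping it sorted.
void insert_sorted(std::vector<int>& sorted, int x) {
    // Binary search for the insertion point: O(log n) comparisons.
    std::vector<int>::iterator pos =
        std::upper_bound(sorted.begin(), sorted.end(), x);
    // Shifting the tail to make room is still O(n) element moves.
    sorted.insert(pos, x);
}

// Element at index size * 3 / 4; precondition: sorted is non-empty.
int percentile75(const std::vector<int>& sorted) {
    return sorted[sorted.size() * 3 / 4];
}
```

This replaces the swap-down loop from the question with a single search plus a block move, which is usually faster in practice even though the asymptotic cost per insertion does not change.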

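Here is a JavaScript solution. Copy and paste it into the browser console and it works. $scores contains the list of scores and $percentile gives the n-th percentile of the list. So the 75th percentile is 76.8 and the 99th percentile is 87.9.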

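function get_percentile($percentile, $array) {
    // Sort a copy numerically; the default sort() compares elements as strings.
    var $sorted = $array.slice().sort(function (a, b) { return a - b; });
    // Valid for 0 < $percentile < 100.
    var $index = ($percentile / 100) * $sorted.length;
    var $result;
    if (Math.floor($index) === $index) {
        // The index is a whole number: average the two neighbouring elements.
        $result = ($sorted[$index - 1] + $sorted[$index]) / 2;
    } else {
        $result = $sorted[Math.floor($index)];
    }
    return $result;
}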

$scores = [22.3, 32.4, 12.1, 54.6, 76.8, 87.3, 54.6, 45.5, 87.9];

get_percentile(75, $scores);
get_percentile(90, $scores);

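If your values come from a known, bounded range, the following will be very fast:

Create a large array of integers (even bytes will work if the counts stay small) with one element per possible value, that is, the maximum value of your data plus one. For example, if the maximum value of t is 100,000, create an array

int[] index = new int[100001]; // about 400 KB; valid indices are 0..100000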

Now iterate over the entire set of values, like this:

for (int t : set_of_values) {
  index[t]++;
}

// You can catch ArrayIndexOutOfBoundsException here just in case :)

Now calculate your percentile as

int sum = 0, i = 0;
while (sum < 0.9*set_of_values.length) {
  sum += index[i++];
}

return i - 1; // i was already advanced past the last bucket that was added

If the values don't conform to these restrictions, you can also consider using a TreeMap instead of an array.
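The ordered-map variant might look like the following in C++, with std::map standing in for Java's TreeMap. The class name CountingPercentile and the fraction parameter are assumptions made for this sketch; the lookup mirrors the cumulative-count loop from the answer.

```cpp
#include <cstddef>
#include <map>

// Hypothetical helper: per-value counts kept in an ordered map, so values
// may be negative or sparse without needing a huge array.
class CountingPercentile {
public:
    // O(log k) for k distinct values seen so far.
    void add(int value) {
        ++counts_[value];
        ++total_;
    }

    // Smallest value whose cumulative count reaches fraction * total.
    // Precondition: at least one value added and 0 < fraction <= 1.
    int get(double fraction) const {
        const double target = fraction * static_cast<double>(total_);
        std::size_t seen = 0;
        int last = 0;
        // std::map iterates in ascending key order.
        for (std::map<int, std::size_t>::const_iterator it = counts_.begin();
             it != counts_.end(); ++it) {
            seen += it->second;
            last = it->first;
            if (static_cast<double>(seen) >= target) {
                break;
            }
        }
        return last;
    }

private:
    std::map<int, std::size_t> counts_;
    std::size_t total_ = 0;
};
```

The lookup walks the distinct values in order, so it costs O(k) rather than O(1); that pays off only when the number of distinct values is small compared to the number of insertions.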

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow