What is the best way to create a sparse array in C++?

https://stackoverflow.com/questions/4306

08-06-2019

Question

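I am working on a project that requires the manipulation of enormous matrices, specifically pyramidal summation for a copula calculation.

In short, I need to keep track of a relatively small number of values (usually 1, and in rare cases more than 1) in a sea of zeros in the matrix (a multidimensional array).

A sparse array lets the user store a small number of values and assumes that all undefined records hold a preset value. Since it is physically impossible to store all values in memory, I need to store only the few non-zero elements. This could be several million entries.

Speed is a huge priority, and I would also like to choose the number of variables in the class dynamically at runtime.

I am currently working on a system that uses a binary search tree (b-tree) to store the entries. Does anyone know of a better system?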

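Solution

For C++, a map works well. Several million objects won't be a problem. On my computer, 10 million items took about 4.4 seconds and about 57 MB.

My test application is as follows: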

#include <stdio.h>
#include <stdlib.h>
#include <map>

class triple {
public:
    int x;
    int y;
    int z;
    bool operator<(const triple &other) const {
        if (x < other.x) return true;
        if (other.x < x) return false;
        if (y < other.y) return true;
        if (other.y < y) return false;
        return z < other.z;
    }
};

int main(int, char**)
{
    std::map<triple,int> data;
    triple point;
    int i;

    for (i = 0; i < 10000000; ++i) {
        point.x = rand();
        point.y = rand();
        point.z = rand();
        //printf("%d %d %d %d\n", i, point.x, point.y, point.z);
        data[point] = i;
    }
    return 0;
}

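Now, to choose the number of variables dynamically, the easiest solution is to represent the index as a string and then use that string as the key of the map. For instance, an item located at [23][55] can be represented by the string "23,55". This extends to higher dimensions as well; for three dimensions, an arbitrary index looks like "34,45,56". A simple implementation of this technique is as follows: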

std::map<std::string,int> data;
char ix[100];

sprintf(ix, "%d,%d", x, y); // 2 vars
data[ix] = i;

sprintf(ix, "%d,%d,%d", x, y, z); // 3 vars
data[ix] = i;
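The two sprintf calls above fix the number of dimensions at compile time. As a minimal sketch of the same idea with the dimension count chosen at runtime, the key can be built from a vector of indices. The helper name make_key is hypothetical and not part of the original answer.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: joins the indices into a key such as "34,45,56".
std::string make_key(const std::vector<int>& index) {
    std::string key;
    for (std::size_t i = 0; i < index.size(); ++i) {
        if (i != 0) key += ',';            // separator between dimensions
        key += std::to_string(index[i]);
    }
    return key;
}
```

The result can be used directly as the map key, for example data[make_key(index)] = i, with data declared as std::map<std::string,int>.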

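Other tips

As a general piece of advice, using strings as indices is actually very slow. A much more efficient but otherwise equivalent solution is to use vectors/arrays. There is absolutely no need to write the indices into a string.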

typedef std::vector<std::size_t> index_t;

// Lexicographic comparison; assumes both indices have the same number of dimensions.
struct index_cmp_t {
    bool operator ()(index_t const& a, index_t const& b) const {
        for (index_t::size_type i = 0; i < a.size(); ++i)
            if (a[i] != b[i])
                return a[i] < b[i];
        return false;
    }
};

std::map<index_t, int, index_cmp_t> data;
index_t i(dims); // dims: number of dimensions, chosen at runtime
i[0] = 1;
i[1] = 2;
// … etc.
data[i] = 42;

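However, a map is often not very efficient in practice, because it is implemented as a balanced binary search tree. A better-performing data structure in this case is a hash table, such as std::unordered_map.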
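std::unordered_map has no built-in hash for a vector key, so one has to be supplied. The following is only a sketch: the hash functor index_hash_t and the helper value_at are assumed names, and the mixing step is in the style of boost::hash_combine rather than anything the answer prescribes.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

typedef std::vector<std::size_t> index_t;

// Assumed hash: folds each coordinate into the seed. Any reasonable
// mixing function could be substituted here.
struct index_hash_t {
    std::size_t operator()(const index_t& idx) const {
        std::size_t seed = idx.size();
        for (std::size_t v : idx)
            seed ^= std::hash<std::size_t>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};

typedef std::unordered_map<index_t, int, index_hash_t> sparse_t;

// Returns the stored value, or 0 for any index that was never set.
int value_at(const sparse_t& data, const index_t& idx) {
    sparse_t::const_iterator it = data.find(idx);
    return it == data.end() ? 0 : it->second;
}
```

Reading through value_at instead of operator[] avoids inserting a zero entry for every index that is merely looked up.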

Boost has a templated implementation of BLAS called uBLAS, which contains a sparse matrix.

http://www.boost.org/doc/libs/1_36_0/libs/numeric/ublas/doc/index.htm

A small detail in the index comparison: you need to do a lexicographical compare, otherwise:

a= (1, 2, 1); b= (2, 1, 2);
(a<b) == (b<a) is true, but b!=a

Edit: so the comparison should be:

return lhs.x<rhs.x
    ? true 
    : lhs.x==rhs.x 
        ? lhs.y<rhs.y 
            ? true 
            : lhs.y==rhs.y
                ? lhs.z<rhs.z
                : false
        : false
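The nested conditional above is a lexicographic comparison, which std::tie (C++11) can express more compactly. A sketch, assuming a triple struct with the same x, y, z members as in the first answer:

```cpp
#include <tuple>

struct triple {
    int x, y, z;
};

// Lexicographic order: compare x first, then y, then z.
bool operator<(const triple& lhs, const triple& rhs) {
    return std::tie(lhs.x, lhs.y, lhs.z) < std::tie(rhs.x, rhs.y, rhs.z);
}
```

With this ordering, a = (1, 2, 1) and b = (2, 1, 2) compare as a < b but not b < a, so the two are no longer treated as equivalent keys.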

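Hash tables offer fast insertion and lookup. You could write a simple hash function, since you know you will only be dealing with integer pairs as keys.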
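One possible shape for such a hash function, as a sketch only: it assumes 32-bit coordinates, and the names Cell, CellHash and SparseGrid are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

typedef std::pair<std::int32_t, std::int32_t> Cell;  // (row, column)

struct CellHash {
    std::size_t operator()(const Cell& c) const {
        // Pack both 32-bit coordinates into one 64-bit word, then hash that.
        std::uint64_t packed =
            (static_cast<std::uint64_t>(static_cast<std::uint32_t>(c.first)) << 32) |
            static_cast<std::uint32_t>(c.second);
        return std::hash<std::uint64_t>()(packed);
    }
};

typedef std::unordered_map<Cell, int, CellHash> SparseGrid;
```

Because the packing is reversible, two different cells never collapse to the same packed word; any collisions come only from the final hash step.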

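Eigen is a C++ linear algebra library that has an implementation of a sparse matrix. It even supports matrix operations and solvers (LU factorization, etc.) that are optimized for sparse matrices.

A complete list of solutions can be found on Wikipedia. For convenience, I have quoted the relevant sections below.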

https://en.wikipedia.org/wiki/Sparse_matrix#Dictionary_of_keys_.28DOK.29

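Dictionary of keys (DOK)

DOK consists of a dictionary that maps (row, column) pairs to the value of the elements. Elements that are missing from the dictionary are taken to be zero. The format is good for incrementally constructing a sparse matrix in random order, but poor for iterating over non-zero values in lexicographical order. One typically constructs a matrix in this format and then converts it to another, more efficient format for processing.[1]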

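List of lists (LIL)

LIL stores one list per row, with each entry containing the column index and the value. Typically, these entries are kept sorted by column index for faster lookup. This is another format that is good for incremental matrix construction.[2]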
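As a rough illustration of the LIL layout described above, here is a sketch with one sorted (column, value) list per row. The class name LilMatrix and its interface are assumptions made for this example, not a reference implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

class LilMatrix {
    typedef std::pair<std::size_t, double> Entry;  // (column, value)
    typedef std::vector<Entry> Row;

    static bool col_less(const Entry& e, std::size_t col) { return e.first < col; }

    std::vector<Row> rows_;  // one list per row, each sorted by column

public:
    explicit LilMatrix(std::size_t row_count) : rows_(row_count) {}

    void set(std::size_t row, std::size_t col, double value) {
        Row& r = rows_[row];
        Row::iterator it = std::lower_bound(r.begin(), r.end(), col, col_less);
        if (it != r.end() && it->first == col)
            it->second = value;               // overwrite the existing entry
        else
            r.insert(it, Entry(col, value));  // insert while keeping the row sorted
    }

    double get(std::size_t row, std::size_t col) const {
        const Row& r = rows_[row];
        Row::const_iterator it = std::lower_bound(r.begin(), r.end(), col, col_less);
        return (it != r.end() && it->first == col) ? it->second : 0.0;
    }
};
```

Keeping each row sorted makes lookups a binary search within the row, at the cost of shifting elements on insertion.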

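Coordinate list (COO)

COO stores a list of (row, column, value) tuples. Ideally, the entries are sorted (by row index, then by column index) to improve random access times. This is another format that is good for incremental matrix construction.[3]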
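The sorting step mentioned above is what makes random access reasonable: once the list is ordered by (row, column), a lookup becomes a binary search. A minimal sketch, with Entry, sort_entries and lookup as illustrative names:

```cpp
#include <algorithm>
#include <tuple>
#include <vector>

struct Entry {
    int row;
    int col;
    double value;
};

// Orders entries by row index, then by column index.
bool by_row_then_col(const Entry& a, const Entry& b) {
    return std::tie(a.row, a.col) < std::tie(b.row, b.col);
}

// Sort once, after all entries have been appended.
void sort_entries(std::vector<Entry>& coo) {
    std::sort(coo.begin(), coo.end(), by_row_then_col);
}

// Binary search in a sorted list; absent entries read as zero.
double lookup(const std::vector<Entry>& coo, int row, int col) {
    Entry probe = {row, col, 0.0};
    std::vector<Entry>::const_iterator it =
        std::lower_bound(coo.begin(), coo.end(), probe, by_row_then_col);
    if (it != coo.end() && it->row == row && it->col == col)
        return it->value;
    return 0.0;
}
```

This sketch assumes each (row, column) pair appears at most once in the list.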

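Compressed sparse row (CSR, CRS or Yale format)

The compressed sparse row (CSR) or compressed row storage (CRS) format represents a matrix M by three one-dimensional arrays that respectively contain the nonzero values, the extents of the rows, and the column indices. It is similar to COO, but compresses the row indices, hence the name. This format allows fast row access and matrix-vector multiplications (Mx).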
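To make the three arrays concrete, here is a sketch that converts a DOK matrix (held in a std::map, which already iterates in row-then-column order) into CSR and then computes Mx. The names Dok, Csr, to_csr and multiply are illustrative, and the code assumes every row index is smaller than the rows argument.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// DOK: (row, col) -> value
typedef std::map<std::pair<std::size_t, std::size_t>, double> Dok;

struct Csr {
    std::vector<double> values;        // nonzero values, row by row
    std::vector<std::size_t> col_idx;  // column index of each value
    std::vector<std::size_t> row_ptr;  // row r occupies [row_ptr[r], row_ptr[r + 1])
};

Csr to_csr(const Dok& dok, std::size_t rows) {
    Csr m;
    m.row_ptr.assign(rows + 1, 0);
    for (Dok::const_iterator it = dok.begin(); it != dok.end(); ++it) {
        m.values.push_back(it->second);
        m.col_idx.push_back(it->first.second);
        ++m.row_ptr[it->first.first + 1];  // count the entries in each row
    }
    for (std::size_t r = 0; r < rows; ++r)
        m.row_ptr[r + 1] += m.row_ptr[r];  // prefix sum turns counts into extents
    return m;
}

// Computes y = M x, touching only the stored nonzero values.
std::vector<double> multiply(const Csr& m, const std::vector<double>& x) {
    std::vector<double> y(m.row_ptr.size() - 1, 0.0);
    for (std::size_t r = 0; r + 1 < m.row_ptr.size(); ++r)
        for (std::size_t k = m.row_ptr[r]; k < m.row_ptr[r + 1]; ++k)
            y[r] += m.values[k] * x[m.col_idx[k]];
    return y;
}
```

This matches the workflow quoted under DOK: build the matrix incrementally in the dictionary, then convert once to a format that is cheaper to process.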

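The best way to implement sparse matrices is to not implement them, or at least not on your own. I would suggest BLAS (which I think is a part of LAPACK), which can handle really huge matrices.

Since only the entries at [a][b][c]...[w][x][y][z] are of consequence, we store only the indices themselves, not the value 1, which is the same almost everywhere and gives nothing to hash on. Given that the curse of dimensionality is present, I suggest going with an established tool from NIST or Boost, or at least reading its sources to avoid needless blunders.

If the work needs to capture the temporal dependence distributions and parametric tendencies of unknown data sets, then a map or B-tree with a uni-valued root is probably not practical. For all values equal to 1, we can store only the indices themselves, hashed if ordering (sensible for presentation) can be subordinated to reducing run time. Since non-zero values other than 1 are few, an obvious candidate for those is whatever data structure you can readily find and understand. If the data set is truly vast, I suggest some sort of sliding window in which you manage file/disk/persistent I/O yourself, moving portions of the data into scope as needed (and writing code that you can understand). If you are committed to delivering an actual solution to a working group, failing to do this leaves you at the mercy of consumer-grade operating systems whose sole goal is to take your lunch away from you.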
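One way to read the suggestion above is to keep a set of indices for the values equal to 1 and a separate small map for the rare other non-zero values. The sketch below uses the ordered std::set and std::map for brevity; hashed containers could be substituted if ordering is not needed. The class name MostlyOnes is hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <vector>

typedef std::vector<int> Index;  // one coordinate per dimension

class MostlyOnes {
public:
    void set(const Index& idx, int value) {
        ones_.erase(idx);    // drop any previous record for this index
        others_.erase(idx);
        if (value == 1)
            ones_.insert(idx);        // only the index is stored
        else if (value != 0)
            others_[idx] = value;     // rare case: keep the value as well
    }

    int get(const Index& idx) const {
        if (ones_.count(idx)) return 1;
        std::map<Index, int>::const_iterator it = others_.find(idx);
        return it == others_.end() ? 0 : it->second;
    }

    std::size_t stored() const { return ones_.size() + others_.size(); }

private:
    std::set<Index> ones_;         // indices whose value is exactly 1
    std::map<Index, int> others_;  // the few values other than 0 and 1
};
```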

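Here is a relatively simple implementation that should provide reasonably fast lookup (using a hash table) as well as fast iteration over the non-zero elements in a row/column.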

// Copyright 2014 Leo Osvald
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#ifndef UTIL_IMMUTABLE_SPARSE_MATRIX_HPP_
#define UTIL_IMMUTABLE_SPARSE_MATRIX_HPP_

#include <algorithm>
#include <limits>
#include <map>
#include <type_traits>
#include <unordered_map>
#include <utility>
#include <vector>

// A simple time-efficient implementation of an immutable sparse matrix
// Provides efficient iteration of non-zero elements by rows/cols,
// e.g. to iterate over a range [row_from, row_to) x [col_from, col_to):
//   for (int row = row_from; row < row_to; ++row) {
//     for (auto col_range = sm.nonzero_col_range(row, col_from, col_to);
//          col_range.first != col_range.second; ++col_range.first) {
//       int col = *col_range.first;
//       // use sm(row, col)
//       ...
//     }
//   }
template<typename T = double, class Coord = int>
class SparseMatrix {
  struct PointHasher;
  typedef std::map< Coord, std::vector<Coord> > NonZeroList;
  typedef std::pair<Coord, Coord> Point;

 public:
  typedef T ValueType;
  typedef Coord CoordType;
  typedef typename NonZeroList::mapped_type::const_iterator CoordIter;
  typedef std::pair<CoordIter, CoordIter> CoordIterRange;

  SparseMatrix() = default;

  // Reads a matrix stored in MatrixMarket-like format, i.e.:
  // <num_rows> <num_cols> <num_entries>
  // <row_1> <col_1> <val_1>
  // ...
  // Note: the header (lines starting with '%') is ignored.
  template<class InputStream, size_t max_line_length = 1024>
  void Init(InputStream& is) {
    rows_.clear(), cols_.clear();
    values_.clear();

    // skip the header (lines beginning with '%', if any)
    decltype(is.tellg()) offset = 0;
    for (char buf[max_line_length + 1];
         is.getline(buf, sizeof(buf)) && buf[0] == '%'; )
      offset = is.tellg();
    is.seekg(offset);

    size_t n;
    is >> row_count_ >> col_count_ >> n;
    values_.reserve(n);
    while (n--) {
      Coord row, col;
      typename std::remove_cv<T>::type val;
      is >> row >> col >> val;
      values_[Point(--row, --col)] = val;
      rows_[col].push_back(row);
      cols_[row].push_back(col);
    }
    SortAndShrink(rows_);
    SortAndShrink(cols_);
  }

  const T& operator()(const Coord& row, const Coord& col) const {
    static const T kZero = T();
    auto it = values_.find(Point(row, col));
    if (it != values_.end())
      return it->second;
    return kZero;
  }

  CoordIterRange
  nonzero_col_range(Coord row, Coord col_from, Coord col_to) const {
    CoordIterRange r;
    GetRange(cols_, row, col_from, col_to, &r);
    return r;
  }

  CoordIterRange
  nonzero_row_range(Coord col, Coord row_from, Coord row_to) const {
    CoordIterRange r;
    GetRange(rows_, col, row_from, row_to, &r);
    return r;
  }

  Coord row_count() const { return row_count_; }
  Coord col_count() const { return col_count_; }
  size_t nonzero_count() const { return values_.size(); }
  size_t element_count() const { return size_t(row_count_) * col_count_; }

 private:
  typedef std::unordered_map<Point,
                             typename std::remove_cv<T>::type,
                             PointHasher> ValueMap;

  struct PointHasher {
    size_t operator()(const Point& p) const {
      return p.first << (std::numeric_limits<Coord>::digits >> 1) ^ p.second;
    }
  };

  static void SortAndShrink(NonZeroList& list) {
    for (auto& it : list) {
      auto& indices = it.second;
      indices.shrink_to_fit();
      std::sort(indices.begin(), indices.end());
    }

    // insert a sentinel vector to handle the case of all zeroes
    if (list.empty())
      list.emplace(Coord(), std::vector<Coord>(Coord()));
  }

  static void GetRange(const NonZeroList& list, Coord i, Coord from, Coord to,
                       CoordIterRange* r) {
    auto lr = list.equal_range(i);
    if (lr.first == lr.second) {
      r->first = r->second = list.begin()->second.end();
      return;
    }

    auto begin = lr.first->second.begin(), end = lr.first->second.end();
    r->first = lower_bound(begin, end, from);
    r->second = lower_bound(r->first, end, to);
  }

  ValueMap values_;
  NonZeroList rows_, cols_;
  Coord row_count_, col_count_;
};

#endif  /* UTIL_IMMUTABLE_SPARSE_MATRIX_HPP_ */

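For simplicity, it is immutable, but you can make it mutable; be sure to change std::vector to std::set if you want reasonably efficient "insertions" (changing a zero to a non-zero).

I would suggest doing something like: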

#include <tuple>
#include <unordered_map>
#include <boost/functional/hash.hpp>

typedef std::tuple<int, int, int> coord_t;
typedef boost::hash<coord_t> coord_hash_t;
typedef std::unordered_map<coord_t, int, coord_hash_t> sparse_array_t;

sparse_array_t the_data;
the_data[ { x, y, z } ] = 1; /* list-initialization is cool */

for( const auto& element : the_data ) {
    int xx, yy, zz;
    std::tie( xx, yy, zz ) = element.first; /* unpack the coordinate key */
    int val = element.second;
    /* ... */
}

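To help keep your data sparse, you might want to write a subclass of unordered_map whose iterators automatically skip over (and erase) any items with a value of 0.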
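A related way to get a similar effect, sketched here as an alternative rather than the subclass described above, is a thin wrapper that erases an entry as soon as it is set to 0, so zeros never reach the container. The names ZeroFreeArray and CoordHash are hypothetical, and the hash is a simple multiply-and-add stand-in for boost::hash.

```cpp
#include <cstddef>
#include <functional>
#include <tuple>
#include <unordered_map>

typedef std::tuple<int, int, int> Coord;

// Stand-in hash so the sketch needs only the standard library.
struct CoordHash {
    std::size_t operator()(const Coord& c) const {
        std::size_t h = std::hash<int>()(std::get<0>(c));
        h = h * 31 + std::hash<int>()(std::get<1>(c));
        h = h * 31 + std::hash<int>()(std::get<2>(c));
        return h;
    }
};

class ZeroFreeArray {
public:
    void set(const Coord& c, int value) {
        if (value == 0)
            data_.erase(c);   // writing a zero removes the entry
        else
            data_[c] = value;
    }

    int get(const Coord& c) const {
        auto it = data_.find(c);
        return it == data_.end() ? 0 : it->second;
    }

    std::size_t size() const { return data_.size(); }

private:
    std::unordered_map<Coord, int, CoordHash> data_;
};
```

Since every write goes through set, iterating over the underlying map never encounters a zero, and no special iterator is needed.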

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow