Reading integers from a memory-mapped formatted file

https://stackoverflow.com/questions/4198404

11-10-2019

Question

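I have memory mapped a large formatted (text) file that contains one integer per line.

So I have a pointer to the first byte of the mapped memory and a pointer to the last byte. I am trying to read all of these integers into an array as fast as possible. Initially I wrote a specialized std::streambuf class to read from that memory through std::istream, but it seems to be relatively slow.

Do you have any suggestions on how to efficiently parse a string like "1231232\r\n123123\r\n123\r\n1231\r\n2387897..." into an array {1231232, 123123, 123, 1231, 2387897, ...}?

The number of integers in the file is not known in advance.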

Solution

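std::vector<int> array;
char *p = ...;   // start of memory mapped block
char *end = ...; // one past the last byte of the memory mapped block
while (p != end)
{
    // strtol parses one number and moves p to the first unparsed character
    array.push_back(static_cast<int>(strtol(p, &p, 10)));
    // skip the line terminator and anything else that is not a digit
    while (p != end && !isdigit(static_cast<unsigned char>(*p)))
        ++p;
}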

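This code is a little unsafe, because there is no guarantee that strtol will stop at the end of the memory mapped block, but it is a start. It should be very fast even with the additional checks added.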
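One way to close the gap mentioned above is to use a parser that takes an explicit end pointer. The following is a minimal sketch, assuming a C++17 compiler; it swaps strtol for std::from_chars, which never reads past the end it is given. The function name parse_ints_bounded is hypothetical and is not part of the original answer.

```cpp
#include <charconv>
#include <system_error>
#include <vector>

// Hypothetical helper (not from the original answer): collects every decimal
// integer found in [first, last) without ever reading at or beyond last.
std::vector<int> parse_ints_bounded(const char *first, const char *last)
{
    std::vector<int> out;
    while (first != last) {
        int value = 0;
        // from_chars accepts an optional '-' followed by digits and stops at last
        std::from_chars_result r = std::from_chars(first, last, value);
        if (r.ec == std::errc()) {
            out.push_back(value);
            first = r.ptr;  // continue right after the parsed number
        } else if (r.ec == std::errc::result_out_of_range) {
            first = r.ptr;  // drop a number that does not fit in int
        } else {
            ++first;        // no number here ('\r', '\n', stray byte): skip it
        }
    }
    return out;
}
```

Unlike strtol, std::from_chars does not skip leading whitespace, which is why the loop advances one byte at a time over anything that is not the start of a number. Silently dropping out-of-range values is a choice made for this sketch; a real parser may prefer to report them.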

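Other answers

This was a very interesting task for me, and a chance to learn a bit more about C++.

Admittedly, the code is quite long and does a lot of error checking, but that only shows how many things can go wrong during parsing.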

#include <ctype.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>    // exit, EXIT_FAILURE

#include <algorithm>   // std::for_each
#include <iterator>
#include <vector>
#include <string>

static void
die(const char *reason)
{
  fprintf(stderr, "aborted (%s)\n", reason);
  exit(EXIT_FAILURE);
}

template <class BytePtr>
static bool
read_uint(BytePtr *begin_ref, BytePtr end, unsigned int *out)
{
  const unsigned int MAX_DIV = UINT_MAX / 10;
  const unsigned int MAX_MOD = UINT_MAX % 10;

  BytePtr begin = *begin_ref;
  unsigned int n = 0;

  while (begin != end && '0' <= *begin && *begin <= '9') {
    unsigned digit = *begin - '0';
    if (n > MAX_DIV || (n == MAX_DIV && digit > MAX_MOD))
      die("unsigned overflow");
    n = 10 * n + digit;
    begin++;
  }

  if (begin == *begin_ref)
    return false;

  *begin_ref = begin;
  *out = n;
  return true;
}

template <class BytePtr, class IntConsumer>
void
parse_ints(BytePtr begin, BytePtr end, IntConsumer out)
{
  while (true) {
    while (begin != end && *begin == (unsigned char) *begin && isspace(*begin))
      begin++;
    if (begin == end)
      return;

    bool negative = *begin == '-';
    if (negative) {
      begin++;
      if (begin == end)
        die("minus at end of input");
    }

    unsigned int un;
    if (!read_uint(&begin, end, &un))
      die("no number found");

    if (!negative && un > INT_MAX)
      die("too large positive");
    if (negative && un > -((unsigned int)INT_MIN))
      die("too small negative");

    int n = negative ? -un : un;
    *out++ = n;
  }
}

static void
print(int x)
{
  printf("%d\n", x);
}

int
main()
{
  std::vector<int> result;
  std::string input("2147483647 -2147483648 0 00000 1 2 32767 4 -17 6");

  parse_ints(input.begin(), input.end(), back_inserter(result));

  std::for_each(result.begin(), result.end(), print);
  return 0;
}

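I tried hard not to invoke any kind of undefined behavior, which can get quite tricky when converting unsigned numbers to signed ones or when calling isspace on an unknown data type.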
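The two pitfalls named above can be isolated in small helpers. This is only a sketch: the names is_space_byte and to_signed are hypothetical, and the second helper assumes a two's complement int, so that INT_MIN equals -INT_MAX - 1.

```cpp
#include <cctype>
#include <climits>

// Hypothetical helper: isspace has undefined behavior for negative values other
// than EOF, so a plain char is converted to unsigned char before the call.
inline bool is_space_byte(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical helper: combines a sign and an unsigned magnitude into an int.
// Returns false when the result does not fit. Assumes two's complement.
inline bool to_signed(bool negative, unsigned int magnitude, int *out)
{
    const unsigned int max_pos = static_cast<unsigned int>(INT_MAX);
    if (!negative) {
        if (magnitude > max_pos)
            return false;
        *out = static_cast<int>(magnitude);
        return true;
    }
    if (magnitude > max_pos + 1u)
        return false;                   // below INT_MIN
    if (magnitude == max_pos + 1u) {
        *out = INT_MIN;                 // negating INT_MAX + 1 as int would overflow
        return true;
    }
    *out = -static_cast<int>(magnitude); // magnitude <= INT_MAX, safe to negate
    return true;
}
```

Every conversion to int here happens only after the value is known to be representable, so no step relies on how an out-of-range unsigned value would be converted.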

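Since the input is memory mapped, simply copying the characters of each line into a small stack array and running atoi on it, with the results written to an integer array that sits on top of another memory mapped file, would be very efficient. This way the pagefile is not used at all for these large buffers.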

open memory mapped file for the output int buffer

declare small stack buffer of 20 chars
while not end of char array
  while not end of char array and current char is not line feed
    copy char to stack buffer
  end while
  null terminate the stack buffer
  store atoi(stack buffer) in the output buffer
  increment the output buffer pointer
  skip the line feed
end while

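While this does not use a library, it has the advantage of keeping memory usage down to the memory mapped files: the only temporary buffers are the one on the stack and whatever atoi uses internally. The output buffer can be thrown away or kept as a saved file, as needed.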
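The pseudocode above could look roughly like this in C++. It is a sketch under several assumptions: the function name lines_to_ints is hypothetical, the out pointer stands in for the memory mapped output buffer (the mapping itself is not shown), and lines longer than 19 characters are truncated to fit the stack buffer.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical sketch: converts each line in [first, last) with atoi and stores
// the results in out, which is assumed to have room for capacity ints.
// Returns the number of integers written.
std::size_t lines_to_ints(const char *first, const char *last,
                          int *out, std::size_t capacity)
{
    std::size_t count = 0;
    while (first != last && count < capacity) {
        char buf[20];         // small stack buffer for one line
        std::size_t len = 0;
        // copy one line, always leaving room for the terminator
        while (first != last && *first != '\n') {
            if (len < sizeof(buf) - 1)
                buf[len++] = *first;
            ++first;
        }
        if (first != last)
            ++first;          // skip the line feed
        buf[len] = '\0';      // atoi stops by itself at a trailing '\r'
        if (len > 0 && buf[0] != '\r')   // ignore empty lines
            out[count++] = std::atoi(buf);
    }
    return count;
}
```

Note that atoi reports no errors and its behavior is undefined when the value does not fit in an int, so this version is only appropriate when the input is known to be well formed.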

NOTE: This answer has been edited a few times.

Reads the memory line by line (based on link and link).

class line 
{
   std::string data;
public:
   friend std::istream &operator>>(std::istream &is, line &l) 
   {
      std::getline(is, l.data);
      return is;
   }
   // const, because std::istream_iterator hands out const references
   operator std::string() const { return data; }
};

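// std::streambuf cannot be instantiated directly and setg is protected,
// so the mapped block is wrapped in a small derived class
struct membuf : std::streambuf
{
   membuf(char *first, char *last) { setg(first, first, last); }
};

membuf osrb(ptr, ptr + size); // ptr and size describe the mapped block
std::istream istr(&osrb);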

std::vector<int> ints;

std::istream_iterator<line> begin(istr);
std::istream_iterator<line> end;
std::transform(begin, end, std::back_inserter(ints), &boost::lexical_cast<int, std::string>);
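For comparison, the same stream-buffer idea can be sketched without Boost by letting the stream extract the integers directly. The names membuf and read_ints_via_stream are hypothetical, and this is the kind of std::istream-based approach the question describes as relatively slow, so it is shown for clarity rather than speed.

```cpp
#include <istream>
#include <iterator>
#include <streambuf>
#include <vector>

// Hypothetical wrapper: exposes an existing memory block as a read-only stream
// buffer. setg is a protected member, hence the derived class.
struct membuf : std::streambuf {
    membuf(char *first, char *last) { setg(first, first, last); }
};

// Hypothetical helper: reads whitespace-separated integers from [first, last).
std::vector<int> read_ints_via_stream(char *first, char *last)
{
    membuf buf(first, last);
    std::istream in(&buf);
    // operator>> skips whitespace, including '\r' and '\n', before each number
    return std::vector<int>(std::istream_iterator<int>(in),
                            std::istream_iterator<int>());
}
```

Extraction stops at the first token that is not a valid int, so malformed input ends the result early instead of raising an error.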

Licensed under: CC-BY-SA with attribution