Question

I have a class for handling allocation of arrays. My class is simple and it is defined as follows:

DimArray.hpp:

#ifndef DIMARRAY_HPP_INCLUDED
#define DIMARRAY_HPP_INCLUDED

#include <algorithm> // for std::copy
#include <vector>

template<typename T>
class DimArray
{
    private:
        int Width, Height;
        std::vector<T> Data;

    public:
        DimArray(int Width, int Height);
        DimArray(T* Data, int Width, int Height);
        DimArray(T** Data, int Width, int Height);

        DimArray(const DimArray &da);
        DimArray(DimArray &&da);

        inline int size() {return Width * Height;}
        inline int size() const {return Width * Height;}

        inline int width() {return Width;}
        inline int width() const {return Width;}

        inline int height() {return Height;}
        inline int height() const {return Height;}

        inline T* operator [](int Index) {return Data.data() + Height * Index;}
        inline const T* operator [](int Index) const {return Data.data() + Height * Index;}

        DimArray& operator = (DimArray da);
};

template<typename T>
DimArray<T>::DimArray(int Width, int Height) : Width(Width), Height(Height), Data(Width * Height, 0) {}

template<typename T>
DimArray<T>::DimArray(T* Data, int Width, int Height) : Width(Width), Height(Height), Data(Width * Height, 0) {std::copy(Data, Data + Width * Height, this->Data.data());}

template<typename T>
DimArray<T>::DimArray(T** Data, int Width, int Height) : Width(Width), Height(Height), Data(Width * Height, 0) {std::copy(Data[0], Data[0] + Width * Height, this->Data.data());}

template<typename T>
DimArray<T>::DimArray(const DimArray &da) : Width(da.Width), Height(da.Height), Data(da.Data) {}

template<typename T>
DimArray<T>::DimArray(DimArray &&da) : Width(std::move(da.Width)), Height(std::move(da.Height)), Data(std::move(da.Data)) {}

template<typename T>
DimArray<T>& DimArray<T>::operator = (DimArray<T> da)
{
    this->Width = da.Width;
    this->Height = da.Height;
    this->Data.swap(da.Data);
    return *this;
}

#endif // DIMARRAY_HPP_INCLUDED

For benchmark timing, I use the following:

Timer.hpp:

#ifndef TIME_HPP_INCLUDED
#define TIME_HPP_INCLUDED

#include <chrono>

#if defined _WIN32 || defined _WIN64
#include <windows.h>

template<typename T>
class Timer
{
    private:
        typedef T duration;
        typedef typename T::rep rep;
        typedef typename T::period period;
        typedef std::chrono::time_point<Timer, duration> time_point;
        std::chrono::time_point<Timer, duration> Time;
        static const bool is_steady = true;

        const rep g_Frequency = []() -> rep
        {
            LARGE_INTEGER frequency;
            QueryPerformanceFrequency(&frequency);
            return frequency.QuadPart;
        }();

        inline std::chrono::time_point<Timer, duration> now()
        {
            LARGE_INTEGER count;
            QueryPerformanceCounter(&count);
            return time_point(duration(count.QuadPart * static_cast<rep>(period::den) / g_Frequency));
        }

    public:
        inline void Start() {this->Time = this->now();}
        inline rep End() {return std::chrono::duration_cast<T>(this->now() - this->Time).count();}
};
#else
template<typename T>
class Timer
{
    private:
        static const bool is_steady = true;
        std::chrono::high_resolution_clock Clock;
        std::chrono::time_point<std::chrono::high_resolution_clock> Time;

    public:
        inline void Start() {this->Time = this->Clock.now();}
        inline typename T::rep End() {return std::chrono::duration_cast<T>(this->Clock.now() - this->Time).count();}
};
#endif

#endif // TIME_HPP_INCLUDED

And my benchmark is as follows:

#include <iostream>
#include "DimArray.hpp"
#include "Timer.hpp"

int main()
{
    Timer<std::chrono::nanoseconds> T;

    T.Start();
    for (int i = 0; i < 100; ++i)
    {
        int** T2DArray = new int*[10000];
        for (int i = 0; i < 10000; ++i)
        {
            T2DArray[i] = new int[10000];
        }

        for (int i = 0; i < 10000; ++i)
        {
            delete[] T2DArray[i];
        }
        delete[] T2DArray;
    }
    std::cout<<T.End()<<" ns\n\n";

    T.Start();
    for (int i = 0; i < 100; ++i)
    {
        DimArray<int> TwoDArray(10000, 10000);
    }
    std::cout<<T.End()<<" ns\n\n";
}

The results it printed were:

4.9599256 seconds  //for int**
42.9303941 seconds //for DimArray<int>

That's a huge difference! I cannot figure out why.

So I changed it to:

#include <iostream>
#include "Timer.hpp"

int main()
{
    Timer<std::chrono::nanoseconds> T;

    T.Start();
    for (int i = 0; i < 100; ++i)
    {
        int** T2DArray = new int*[10000];
        for (int i = 0; i < 10000; ++i)
        {
            T2DArray[i] = new int[10000];
        }

        for (int i = 0; i < 10000; ++i)
        {
            delete[] T2DArray[i];
        }
        delete[] T2DArray;
    }
    std::cout<<T.End()<<" ns\n\n";

    T.Start();
    for (int i = 0; i < 100; ++i)
    {
        int* TwoDArray = new int[10000 * 10000];
        delete[] TwoDArray;
    }
    std::cout<<T.End()<<" ns\n\n";
}

and the results were:

4.6085725 seconds //for int**
0.1142958 seconds //for int*

Any ideas why my class that uses std::vector is so slow compared to using a raw pointer?


Solution

vector value-initializes (zeroes) the memory it allocates for you, while your code with new gives you default-initialized memory with indeterminate contents. Thus the allocation TwoDArray(10000, 10000) gives you an array full of zeros, while new int[10000 * 10000] gives you an array of undefined contents (merely reading them causes undefined behavior).

Note that this means that in the vector case, your program actually writes to all 100000000 ints, while in the new case your program only sets aside address space for that many ints.

For comparable measurements you'd have to zero the newed array out first, e.g. with int* TwoDArray = new int[10000 * 10000](); (note the parentheses, which request value-initialization) instead of int* TwoDArray = new int[10000 * 10000];.

Other tips

To illustrate Billy ONeal's suggestion, and the dramatic results, I did this:

template <class T>
class no_init_alloc
    : public std::allocator<T>
{
public:
    using std::allocator<T>::allocator;

    template <class U, class... Args> void construct(U*, Args&&...) {}
};

template<typename T>
class DimArray
{
    private:
        int Width, Height;
        std::vector<T, no_init_alloc<T>> Data;

    public:
        …

And my results changed from:

2959854464 ns

28734347029 ns

to:

2980402236 ns

850190 ns

I am a little late to this New Year's Eve party.


Apart from providing a custom allocator, one can also call reserve() and then push_back() / emplace_back() when the actual data becomes available. It is also safe from the point of view of undefined behavior: you can only access already-initialized data. Unfortunately, this approach may not be suitable for your application.


One has to be very careful with this kind of benchmarking. For example, Linux does lazy allocation; if I run the code below, I get the following:

$ /usr/bin/time --verbose ./a.out
int[]: 0 ms
vector: 0 ms
...
Maximum resident set size (kbytes): 1004

If I uncomment either of the commented-out lines, the maximum resident set size becomes 391 MB.

#include <chrono>
#include <cstdio>
#include <vector>

using namespace std;
using namespace std::chrono;

constexpr size_t SIZE = 100000000;

int main(int argc, char* argv[]) {

  auto t1 = high_resolution_clock::now();

  {
    int* pInt = new int[SIZE];
    //int* pInt = new int[SIZE]();
    pInt[0] = 0;
    delete[] pInt;
  }

  auto t2 = high_resolution_clock::now();  

  {
    vector<int> v;
    v.reserve(SIZE);
    v.push_back(0);
    //v.resize(SIZE, 0);
  }

  auto t3 = high_resolution_clock::now();

  printf("int[]:  %ld  ms\n", duration_cast<milliseconds>(t2-t1).count());
  printf("vector: %ld  ms\n", duration_cast<milliseconds>(t3-t2).count());

}
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow