Question

I have a fairly complex C++ component whose performance has become a problem. Profiling shows that most of the execution time is simply spent allocating memory for std::strings.

I know that there's a lot of redundancy among those strings. A handful of values repeat very frequently, but there are also lots of unique values. The strings are typically fairly short.

I'm now wondering whether it would make sense to somehow reuse those frequent allocations. Instead of 1000 pointers to 1000 distinct "foobar" values, I could have 1000 pointers to one "foobar" value. The fact that this would be more memory efficient is a nice bonus, but I'm mostly concerned about latency here.

I guess one option would be to maintain some kind of registry of already allocated values, but is it even possible to make the registry lookups faster than redundant memory allocations? Is this a viable approach?

Solution

I lean heavily on interned strings, as Basile suggests, where a string lookup translates to a 32-bit index to store and compare. That's useful in my case since I sometimes have hundreds of thousands to millions of components with a property named "x", say, which still needs to be a user-friendly string name since scripters often access it by name.

I use a trie for the lookup (I also experimented with unordered_map, but my tuned trie backed by memory pools at least started out performing better and was also easier to make thread-safe without locking on every access), though it isn't as fast to construct an interned string as it is to create a std::string. The point is more to speed up the subsequent operations like checking for string equality, which, in my case, just boils down to comparing two integers, and to drastically reduce memory usage.
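Just to illustrate the registry side of it, a minimal sketch along these lines follows; it uses unordered_map rather than my pooled trie, and InternTable/intern are placeholder names, not code from my actual system:

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

class InternTable
{
public:
    // Returns a stable 32-bit id for s, allocating only for new values.
    std::uint32_t intern(const std::string& s)
    {
        auto it = ids.find(s);
        if (it != ids.end())
            return it->second;          // frequent case: value already seen
        std::uint32_t id = static_cast<std::uint32_t>(strings.size());
        strings.push_back(s);           // one allocation per *unique* value
        ids.emplace(s, id);
        return id;
    }

    // Recovers the user-friendly text for an id (e.g. for scripting/display).
    const std::string& lookup(std::uint32_t id) const { return strings[id]; }

private:
    std::unordered_map<std::string, std::uint32_t> ids;
    std::vector<std::string> strings;   // id -> string
};

With that, checking two interned strings for equality is just comparing two 32-bit ids.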

"I guess one option would be to maintain some kind of registry of already allocated values, but is it even possible to make the registry lookups faster than redundant memory allocations?"

It's going to be tough to make a search through a data structure much faster than a single malloc. If you have a case where you are reading a boatload of strings from an external input like a file, for example, then my temptation would be to use a sequential allocator if possible. That comes with the downside that you can't free the memory of an individual string: all the memory pooled by the allocator has to be freed at once or not at all. But a sequential allocator can be handy in cases where you just need to allocate a boatload of tiny variable-sized chunks of memory in a straight sequential fashion, only to then toss it all away later. I don't know if that applies in your case or not, but when applicable, it can be an easy way to fix a hotspot related to frequent teeny memory allocations (which might have more to do with cache misses and page faults than with the underlying algorithm used by, say, malloc).
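For illustration, a minimal sequential ("bump") allocator along those lines might look like this sketch; the block size and names are arbitrary, and it ignores alignment since raw character strings don't need any:

#include <cstddef>
#include <cstdlib>
#include <vector>

// Sketch of a sequential allocator: carves tiny variable-sized chunks out
// of big blocks and frees everything at once on destruction.
class SequentialAllocator
{
public:
    explicit SequentialAllocator(std::size_t block_size = 65536)
        : block_size(block_size), pos(block_size) {}

    ~SequentialAllocator()
    {
        for (char* block : blocks)
            free(block);                // all-or-nothing deallocation
    }

    // Assumes n <= block_size; error handling omitted for brevity.
    void* allocate(std::size_t n)
    {
        if (pos + n > block_size)       // current block exhausted:
        {
            blocks.push_back(static_cast<char*>(malloc(block_size)));
            pos = 0;                    // start a fresh block
        }
        void* mem = blocks.back() + pos;
        pos += n;                       // just bump the offset
        return mem;
    }

private:
    std::vector<char*> blocks;
    std::size_t block_size;
    std::size_t pos;
};

Note that the common-case allocation is just a comparison and an addition, and that there is deliberately no per-chunk deallocate, which is exactly the constraint mentioned above.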

Fixed-sized allocations are easier to speed up without the sequential-allocator constraint that prevents you from freeing specific chunks of memory for later reuse. But making variable-sized allocation faster than the default allocator is pretty tough; making any kind of memory allocator that's faster than malloc is generally extremely tough if you don't apply constraints that narrow its applicability. One solution is to use a fixed-sized allocator for, say, all strings of 8 bytes or less, if you have a boatload of them and longer strings are a rare case (for which you can just use the default allocator). That does mean 7 bytes are wasted for a 1-byte string, but it should eliminate allocation-related hotspots if, say, 95% of the time your strings are very short.

Another solution that just occurred to me is to use unrolled linked lists which might sound crazy but hear me out.

The idea here is to make each unrolled node fixed-size instead of variable-sized. When you do that, you can use an extremely fast fixed-sized chunk allocator which pools memory, allocating fixed-sized chunks for variable-sized strings linked together. That won't reduce memory use; it'll tend to add to it because of the cost of the links, but you can play with the unrolled size to find a balance suitable for your needs. It's kind of a wacky idea, but it should eliminate memory-related hotspots since you can now effectively pool memory in bulky contiguous blocks already allocated and still have the benefit of freeing strings individually. Here's a simple ol' fixed allocator I wrote (an illustrative one I made for someone else, devoid of production-related fluff) which you can freely use:

#ifndef FIXED_ALLOCATOR_HPP
#define FIXED_ALLOCATOR_HPP

class FixedAllocator
{
public:
    /// Creates an allocator handing out chunks of type_size bytes,
    /// pooled in blocks of roughly block_size bytes each.
    explicit FixedAllocator(int type_size, int block_size = 2048);

    /// Destroys the allocator.
    ~FixedAllocator();

    /// @return A pointer to a newly allocated chunk.
    void* allocate();

    /// Frees the specified chunk.
    void deallocate(void* mem);

private:
    struct Block;
    struct FreeElement;

    FreeElement* free_element;
    Block* head;
    int type_size;
    int num_block_elements;
};

#endif

#include "FixedAllocator.hpp"
#include <cstdlib>

struct FixedAllocator::FreeElement
{
    FreeElement* next_element;
};

struct FixedAllocator::Block
{
    Block* next;
    char* mem;
};

FixedAllocator::FixedAllocator(int type_size, int block_size): free_element(0), head(0)
{
    // Chunks must be at least large enough to hold a free-list link when
    // unused. (A production version would also round up to pointer alignment.)
    this->type_size = type_size > static_cast<int>(sizeof(FreeElement))
                    ? type_size : static_cast<int>(sizeof(FreeElement));
    num_block_elements = block_size / this->type_size;
    if (num_block_elements == 0)
        num_block_elements = 1;
}

FixedAllocator::~FixedAllocator()
{
    // Free each block in the list, popping a block until the stack is empty.
    while (head)
    {
        Block* block = head;
        head = head->next;
        free(block->mem);
        free(block);
    }
    free_element = 0;
}

void* FixedAllocator::allocate()
{
    // Common case: just pop free element and return.
    if (free_element)
    {
        void* mem = free_element;
        free_element = free_element->next_element;
        return mem;
    }

    // Rare case when we're out of free elements.
    // Create new block.
    Block* new_block = static_cast<Block*>(malloc(sizeof(Block)));
    new_block->mem = static_cast<char*>(malloc(type_size * num_block_elements));
    new_block->next = head;
    head = new_block;

    // Push all but one of the new block's elements to the free stack.
    char* mem = new_block->mem;
    for (int j=1; j < num_block_elements; ++j)
    {
        void* ptr = mem + j*type_size;
        FreeElement* element = static_cast<FreeElement*>(ptr);
        element->next_element = free_element;
        free_element = element;
    }
    return mem;
}

void FixedAllocator::deallocate(void* mem)
{
    // Just push a free element to the stack.
    FreeElement* element = static_cast<FreeElement*>(mem);
    element->next_element = free_element;
    free_element = element;
}
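To sketch how the unrolled idea can sit on top of this allocator, something like the following might do; the node layout, the 24-byte capacity, and make_string are made-up illustrations, not a tuned design:

#include <cstddef>
#include <cstring>
#include "FixedAllocator.hpp"

// Hypothetical unrolled string node: every node has the same fixed size,
// so all nodes can come from a single FixedAllocator pool.
struct StringNode
{
    StringNode* next;   // next chunk of the string, or null for the last one
    char chars[24];     // unrolled capacity per node; tune to your needs
};

// Copies s (including its terminating '\0') into a chain of pooled nodes.
StringNode* make_string(FixedAllocator& pool, const char* s)
{
    std::size_t remaining = std::strlen(s) + 1;
    StringNode* head = 0;
    StringNode** tail = &head;
    while (remaining > 0)
    {
        StringNode* node = static_cast<StringNode*>(pool.allocate());
        std::size_t n = remaining < sizeof(node->chars)
                      ? remaining : sizeof(node->chars);
        std::memcpy(node->chars, s, n);
        s += n;
        remaining -= n;
        node->next = 0;
        *tail = node;
        tail = &node->next;
    }
    return head;
}

Usage would be along the lines of FixedAllocator pool(sizeof(StringNode)); followed by StringNode* str = make_string(pool, "foobar");. Unlike with the sequential allocator, each node can later be handed back individually through pool.deallocate.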

Other tips

You might want some interned string machinery (but the strings should be immutable, so use const std::strings). You might want some symbols (unique, interned identifiers). You might look into smart pointers (e.g. std::shared_ptr), or even std::string_view in C++17.
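As a rough sketch of the smart-pointer flavor of this (intern here is a made-up helper, and this naive version is not thread-safe), repeated values can share one immutable allocation:

#include <memory>
#include <string>
#include <unordered_map>

// Sketch: hands out shared pointers to immutable strings, so 1000
// occurrences of "foobar" share a single allocation.
std::shared_ptr<const std::string> intern(const std::string& s)
{
    static std::unordered_map<std::string,
                              std::shared_ptr<const std::string>> registry;
    auto it = registry.find(s);
    if (it == registry.end())
        it = registry.emplace(s, std::make_shared<const std::string>(s)).first;
    return it->second;
}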

Once upon a time in compiler construction we used something called a data-chair (instead of data-bank, a colloquial German translation for DB). This simply created a hash for a string and used that for allocation. So any string was not some piece of memory on the heap/stack but a hash code into this data-chair. You could replace String with such a class. It needs quite some code rework, though. And of course this is only usable for read-only strings.

Notice how both the act of allocating memory and the amount of memory actually used relate to poor performance:

The cost of actually allocating the memory is, of course, very high. For that reason, std::string might already use in-place storage for small strings (the small string optimization), and the number of actual allocations might therefore be lower than you first assume. In case the size of this buffer isn't big enough, you might be inspired by e.g. Facebook's string class (https://github.com/facebook/folly/blob/master/folly/FBString.h), which stores up to 23 characters in place before allocating.
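If you're curious what your standard library does, a quick check like this hints at the in-place buffer; the value is implementation-defined (e.g. libstdc++ reports 15 and libc++ 22):

#include <iostream>
#include <string>

int main()
{
    std::string s;
    // capacity() of an empty string hints at the SSO buffer size.
    std::cout << "sizeof(std::string): " << sizeof(std::string) << '\n';
    std::cout << "in-place capacity:   " << s.capacity() << '\n';
}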

The cost of using a lot of memory is also worth noting. This is perhaps the biggest offender: you might have plenty of RAM in your machine; however, cache sizes are still small enough that performance suffers when accessing memory that isn't already cached. You can read about this here: https://en.wikipedia.org/wiki/Locality_of_reference

Instead of making string operations faster, another approach is to reduce the number of string operations. Would it be possible to replace strings with an enum, for example?
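For example, if the frequent values are known up front, the hot path could traffic in an enum and only convert to text at the edges; a made-up sketch:

#include <string>

// Hypothetical example: known frequent values become enumerators; the
// string form is only materialized when needed (e.g. for display).
enum class Property { X, Y, Z, Unknown };

const char* to_string(Property p)
{
    switch (p)
    {
        case Property::X: return "x";
        case Property::Y: return "y";
        case Property::Z: return "z";
        default:          return "?";
    }
}

Property parse_property(const std::string& s)
{
    if (s == "x") return Property::X;
    if (s == "y") return Property::Y;
    if (s == "z") return Property::Z;
    return Property::Unknown;   // fall back for the rare unique values
}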

Another approach that might be useful is one used in Cocoa: there are cases where you have hundreds or thousands of dictionaries, all with mostly the same keys. So Cocoa lets you create an object which is a set of dictionary keys, and there is a dictionary constructor that takes such an object as an argument. The dictionary behaves the same as any other dictionary, but when you add a key/value pair with a key in that key set, the key isn't duplicated; just a pointer to the key in the key set is stored. So these thousands of dictionaries need only one copy of each key string in the set.
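A rough C++ analogue of that idea (SharedKeySet and SharedKeyDict are invented names, not Cocoa's API): many dictionaries share one key table and store values positionally by key index.

#include <cstddef>
#include <string>
#include <vector>

// The only copy of each key string, shared by many dictionaries.
struct SharedKeySet
{
    std::vector<std::string> keys;

    int index_of(const std::string& key) const
    {
        for (std::size_t i = 0; i < keys.size(); ++i)
            if (keys[i] == key)
                return static_cast<int>(i);
        return -1;
    }
};

struct SharedKeyDict
{
    const SharedKeySet* key_set;        // referenced, not duplicated
    std::vector<std::string> values;    // one value slot per key index

    explicit SharedKeyDict(const SharedKeySet* ks)
        : key_set(ks), values(ks->keys.size()) {}

    // Assumes the key is in the set; a real version would handle misses.
    std::string& operator[](const std::string& key)
    {
        return values[key_set->index_of(key)];
    }
};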

Licensed under: CC-BY-SA with attribution