Question

Are there faster alternatives to memcpy() in C++?

Solution

Unlikely. Your compiler/standard library will almost certainly ship a very efficient, carefully tuned implementation of memcpy. And memcpy is basically the lowest-level API there is for copying one region of memory to another.

If you want further speedups, find a way to not need any memory copying.

OTHER TIPS

First, a word of advice. Assume that the people who wrote your standard library are not stupid. If there were a faster way to implement a general memcpy, they would have done it.

Second, yes, there are better alternatives.

  • In C++, use the std::copy function. It does the same thing, but it is 1) safer, and 2) potentially faster in some cases. It is a template, so it can be specialized for specific types, making it potentially faster than the general C memcpy.
  • Or, you can use your superior knowledge of your specific situation. The implementers of memcpy had to write it so it performed well in every case. If you have specific information about the situation where you need it, you might be able to write a faster version. For example, how much memory do you need to copy? How is it aligned? That might allow you to write a more efficient memcpy for this specific case. But it won't be as good in most other cases (if it works at all).
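As a small sketch of the first bullet, here is the same copy expressed both ways. For trivially copyable element types, mainstream standard libraries (libstdc++, libc++) lower std::copy to the equivalent of memmove, so the type-safe version normally costs nothing:

```cpp
#include <algorithm>
#include <cstring>
#include <cstddef>

// Both functions copy n ints; the std::copy version cannot get the byte
// count wrong and refuses incompatible pointer types at compile time.
void copy_with_memcpy(int* dst, const int* src, std::size_t n) {
    std::memcpy(dst, src, n * sizeof(int));  // easy to forget sizeof(int)
}

void copy_with_std_copy(int* dst, const int* src, std::size_t n) {
    std::copy(src, src + n, dst);  // type-safe; no sizeof arithmetic
}
```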

Optimization expert Agner Fog has published optimized memory functions: http://agner.org/optimize/#asmlib. It's under GPL though.

Some time ago Agner said that these functions should replace the GCC builtins because they're a lot faster. I don't know whether that has been done since.

This answer to a very similar question (about memset()) applies here, too.

It basically says that compilers generate highly optimized code for memcpy()/memset(), and different code depending on the nature of the objects (size, alignment, etc.).

And remember: in C++, you may only memcpy() trivially copyable (POD) types.
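One way to enforce that rule is a thin wrapper (the name checked_memcpy is made up for this sketch) that rejects non-trivially-copyable types at compile time, since memcpy'ing them is undefined behavior:

```cpp
#include <cstring>
#include <cstddef>
#include <type_traits>

// Hypothetical helper: a memcpy that only compiles for types it is
// actually legal to copy byte-by-byte.
template <typename T>
void checked_memcpy(T* dst, const T* src, std::size_t count) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy is only valid for trivially copyable types");
    std::memcpy(dst, src, count * sizeof(T));
}
```

Passing a std::string, for instance, would now fail at compile time instead of silently corrupting its internal state.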

In order to find or write a fast memory copy routine, we should understand how processors work.

Processors since the Intel Pentium Pro perform "out-of-order execution". They can execute many instructions in parallel if those instructions have no dependencies on each other. But this only holds when the instructions operate on registers alone. If they touch memory, additional CPU units are involved: "load units" (to read data from memory) and "store units" (to write data to memory). Most CPUs have two load units and one store unit, i.e. they can execute in parallel two instructions that read from memory and one instruction that writes to memory (again, provided they don't affect each other).

The width of these units usually matches the maximum register size: if the CPU has XMM registers (SSE), it is 16 bytes; if it has YMM registers (AVX), it is 32 bytes; and so on. All instructions that read or write memory are translated into micro-operations (micro-ops) that go into a common pool and wait there for a load or store unit to serve them. A single load or store unit can serve only one micro-op at a time, regardless of the data size it needs to load or store, be it 1 byte or 32 bytes.

So the fastest memory copy moves data to and from registers of the maximum size. For AVX-enabled processors, the fastest way to copy memory is to repeat the following sequence, loop-unrolled:

vmovdqa     ymm0,ymmword ptr [rcx]
vmovdqa     ymm1,ymmword ptr [rcx+20h]
vmovdqa     ymmword ptr [rdx],ymm0
vmovdqa     ymmword ptr [rdx+20h],ymm1
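The same loop can be sketched in C++ with AVX intrinsics (GCC/Clang; the target attribute lets it compile without -mavx, but the caller must still check for AVX support at runtime). It assumes 32-byte-aligned pointers and a byte count that is a multiple of 64:

```cpp
#include <immintrin.h>
#include <cstddef>

// Sketch of the vmovdqa sequence above: two YMM loads, two YMM stores
// per iteration. Requires 32-byte alignment and n % 64 == 0.
__attribute__((target("avx")))
void avx_copy(char* dst, const char* src, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 64) {
        __m256i a = _mm256_load_si256(
            reinterpret_cast<const __m256i*>(src + i));
        __m256i b = _mm256_load_si256(
            reinterpret_cast<const __m256i*>(src + i + 32));
        _mm256_store_si256(reinterpret_cast<__m256i*>(dst + i), a);
        _mm256_store_si256(reinterpret_cast<__m256i*>(dst + i + 32), b);
    }
}
```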

The Google code posted earlier by hplbsh is not very good, because it uses all 8 XMM registers to hold the data before writing it back, which is unnecessary: since we only have two load units and one store unit, just two registers give the best results, and using more registers does not improve performance in any way.

A memory copy routine may also use some "advanced" techniques like “prefetch” to instruct the processor to load memory into cache in advance and “non-temporal writes” (if you are copying very large memory chunks and don’t need the data from the output buffer to be immediately read), aligned vs unaligned writes, etc.
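As a sketch of the prefetch and non-temporal-write techniques just mentioned (using SSE2, which is baseline on x86-64; nt_copy is a made-up name, and it assumes 16-byte alignment and a size that is a multiple of 16):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics, baseline on x86-64
#include <cstddef>

// Non-temporal copy for large buffers whose destination will not be
// read soon: the streaming stores bypass the cache instead of evicting
// useful data from it.
void nt_copy(char* dst, const char* src, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 16) {
        // Hint the prefetcher to pull source data a few lines ahead;
        // prefetching past the end of the buffer is harmless.
        _mm_prefetch(src + i + 256, _MM_HINT_T0);
        __m128i v = _mm_load_si128(
            reinterpret_cast<const __m128i*>(src + i));
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);
    }
    _mm_sfence();  // make the non-temporal stores globally visible
}
```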

Modern processors, released since about 2013, that report the ERMS bit in CPUID have so-called "enhanced rep movsb", so for large memory copies "rep movsb" may be used: the copy will be very fast, even faster than with the YMM registers, and it interacts properly with the cache. However, the startup cost of this instruction is very high, about 35 cycles, so it only pays off on large memory blocks.
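On GCC/Clang for x86-64, "rep movsb" can be issued with a few lines of inline asm. The instruction is correct on any x86-64 CPU; only its speed depends on ERMS, so in practice you would check CPUID before preferring it over a vector loop (erms_memcpy is a hypothetical name for this sketch):

```cpp
#include <cstddef>

// x86-64 only: copy n bytes with "rep movsb". RDI = destination,
// RSI = source, RCX = count; the instruction advances all three.
void* erms_memcpy(void* dst, const void* src, std::size_t n) {
    void* d = dst;
    asm volatile("rep movsb"
                 : "+D"(d), "+S"(src), "+c"(n)
                 :
                 : "memory");
    return dst;
}
```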

I hope it is now easier for you to choose or write the best memory copy routine for your case.

You can even keep the standard memcpy/memmove, but get your own special largememcpy() for your needs.

Depending on what you're trying to do... if it's a big enough memcpy, and you will only be writing to the copy sparsely, an mmap with MAP_PRIVATE to create a copy-on-write mapping could conceivably be faster.
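A POSIX sketch of that idea (cow_copy_of_file is a made-up helper name): instead of reading a file and memcpy'ing it into a private buffer, map it MAP_PRIVATE. Pages are shared with the page cache until first written, at which point the kernel copies only the touched pages:

```cpp
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>   // for the usage example below

// Map a whole file copy-on-write. Writes through the returned pointer
// affect only this mapping, never the underlying file.
char* cow_copy_of_file(const char* path, std::size_t* len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    off_t len = lseek(fd, 0, SEEK_END);
    void* p = mmap(nullptr, static_cast<std::size_t>(len),
                   PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping keeps its own reference to the file
    if (p == MAP_FAILED) return nullptr;
    *len_out = static_cast<std::size_t>(len);
    return static_cast<char*>(p);
}
```

Release the mapping with munmap(ptr, len) when done.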

Depending on your platform, there may be faster alternatives for specific use cases, for example if you know the source and destination are aligned to a cache line and the size is an integer multiple of the cache line size. In general, though, most compilers will produce fairly optimal code for memcpy.

I'm not sure that using the default memcpy is always the best option. Most memcpy implementations I've looked at try to align the data at the start and then do aligned copies. If the data is already aligned, or is quite small, this wastes time.

Sometimes it's beneficial to have specialized word-copy, half-word-copy, and byte-copy versions of memcpy, as long as they don't have too negative an effect on the caches.
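For example, when you already know both buffers are word-aligned, you can skip the alignment preamble entirely (aligned_word_copy is a hypothetical helper, not a drop-in memcpy replacement):

```cpp
#include <cstdint>
#include <cstddef>

// Specialized copy for buffers known to be 4-byte aligned: one plain
// word loop, no prologue/epilogue to fix up misaligned edges.
void aligned_word_copy(std::uint32_t* dst, const std::uint32_t* src,
                       std::size_t words) {
    for (std::size_t i = 0; i < words; ++i)
        dst[i] = src[i];
}
```

The compiler is free to vectorize this loop; the point is that the size and alignment preconditions are encoded in the interface instead of being re-derived at run time.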

Also, you may want finer control over the actual allocation algorithm. In the games industry it's exceptionally common for people to write their own memory allocation routines, no matter how much effort the toolchain developers put into theirs in the first place. The games I've seen almost always tend to use Doug Lea's Malloc.

Generally speaking though, you'd be wasting time trying to optimize memcpy as there'll no doubt be lots of easier bits of code in your application to speed up.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow