Introduction

So of course I wanted a highly controversial title. How many times have we seen `the fastest algorithm EVER` before? But I needed your attention, and I was successful in that! However, my title is not without justification!

The title of `fastest` does NOT belong to me for EVERY copy size, since optimizing for any one size is a tradeoff. I think the only size where I was consistently outperformed by Assembler versions was a 16-byte copy (and possibly some other small sizes, such as 8 bytes and 32 bytes, if you optimize specifically to beat these algorithms). This code is faster for copies with size > 16 bytes. For less than that I only have a 1 or 2 clock cycle penalty! But for all the other sizes, that title belongs to me! These functions will also reach your MAXIMUM memory bandwidth limits very quickly!

Q. Why is this faster than my built-in memcpy/memmove?

A. (I thought I would answer this question early as it must be on your mind!)

1) This is NOT a SINGLE memcpy/memmove function. It is actually THREE separate functions with different characteristics, algorithms and optimizations; the code will choose the best function for the CPU at runtime. Each function is built specifically for one of 3 different CPU architectures: 64-bit processors with SSE4.2 (Core i generation without penalties for unaligned memory), older 64-bit processors (Core, Core 2 and AMD equivalents) and 32-bit processors with SSE2. If none of these features are found, the functions will fall back to using the built-in memcpy/memmove!

2) The first time you call either memcpy/memmove, they do a CPU feature detection (CPUID) and select the optimal function for your CPU at runtime. This is a ONCE-off penalty the first time you call them. After that, it's gravy! So there are actually 3 self-contained functions here, and the code will select and use the optimal one automatically!

3) Most built-in memcpy/memmove functions (including MSVC and GCC) use an extremely optimized QWORD (64-bit) copy loop. The apex functions use SSE2 load/loadu/store/storeu and SSE streaming, with/without data pre-fetching depending on the situation. In other words, everything adapts to the situation for small or large copies!

4) If you don't have SSE2, they will default to the built-in functions which use QWORD. Basically the PC has to be older than 15 years for this to happen, as P4s in 2001 already had SSE2! We detect the presence of SSE4.2; CPUs with SSE4.2 have NO penalty for reading/writing unaligned SSE data (loadu/storeu). This is more or less the Core i generation!

5) In the LARGE data copy loop, they use SSE2 streaming intrinsics, these are the fastest data copy methods; I include a high performance 4K data prefetch (CPU hint). As the function is copying, it constantly issues a prefetching command 4K ahead. This design has never been done like this before! `tiberium` (one of the functions) will pre-align the memory (to 16-byte boundaries) to avoid misaligned penalties, then execute SSE streaming on the aligned bytes! The streaming intrinsics are designed by Intel (and AMD) for high performance! You WILL copy at the MAXIMUM bandwidth throughput of your machine!

6) These functions use EVERY trick in the book! They come from a LONG family of functions. Some of the techniques and algorithms I've used have never been published before. They were developed over several months in 2013/2014. Every copy size uses a different technique. The shortest code path is optimized for `size <= 112` bytes, then `size >= 16` then the rest. So small byte copies have the shortest code path with least jumps!

7) Although these are C/C++ functions, they were `designed by disassembly` ... meaning I was paying close attention to the compiler output. This is NOT normally advisable (we don't want the compiler to dictate how we design our functions); however, when you are designing things for maximum performance, it's important to pay attention to how the code will compile! The strange layout of the functions is a testament to my close observation of the compiler output. The way instructions are ordered also prevents the compiler from making some `assumptions`! They were designed primarily on the Visual Studio 2010 compiler; however, GCC never produced worse code, so they will compile equally well (usually even better) on GCC, and probably LLVM/Clang as well!

8) An optimized assembler version of these algorithms WILL be faster (I know because I have built assembler versions), but assembler versions are sometimes harder to implement/add to existing libraries. I wanted to give a copy/paste version of code that could be used anywhere. Also, the algorithms used by these functions are what make them faster, not micro-optimizations, although I did everything I could to help the compilers build the most optimal code!

9) There are several code paths, each one optimized for a different size.

size < 16 bytes
size <= 112 bytes
size < 256KB
size >= 256KB (SSE streaming + 4KB `prefetch look ahead`) My 4KB prefetch ahead is unique! Optimized for large copies! Allowing the code to reach maximum memory bandwidth throughput!
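
As a rough illustration of how the four size classes above could be tested, here is a minimal sketch. It is not the apex code: the enum, its labels and `select_copy_path` are hypothetical names, and the only things taken from the article are the thresholds (16, 112, 256KB) and the order of the tests (`size <= 112` first, then `size >= 16`, then the rest), so that small copies hit the fewest branches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the four size classes listed in the article. */
typedef enum {
    PATH_TINY,      /* size < 16 bytes */
    PATH_SMALL,     /* 16 <= size <= 112 bytes */
    PATH_MEDIUM,    /* 112 < size < 256KB */
    PATH_STREAMING  /* size >= 256KB */
} copy_path;

/* Tests the small sizes first so they take the shortest route. */
static copy_path select_copy_path(size_t size)
{
    if (size <= 112) {
        if (size >= 16)
            return PATH_SMALL;
        return PATH_TINY;
    }
    if (size < 256u * 1024u)
        return PATH_MEDIUM;
    return PATH_STREAMING;
}
```

In a real implementation each branch would jump straight into the copy code for that class instead of returning a label; the enum is only there to make the ordering visible.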

Background

Please note, I'm writing/releasing this article over 2 YEARS AFTER writing these functions!

In late 2013, my OCD took over and I became totally obsessed with writing the fastest memcpy/memmove function in the world; which took over my work and life. I became so obsessed that I wrote 80,000 lines of code in over 140 variations of memmove, mostly copies with small variations and tweaks; benchmarking them on a P4, a Core laptop, a Core 2 E6600, 3rd generation i5 3550 and i7, against the best of the best algorithms I could find (including Agner Fog's excellent A_memmove() from asmlib).

Originally I started by disassembling and studying the memcpy() of Visual Studio, then GCC etc. I wrote several QWORD copy implementations but struggled to outperform the built-in functions with only C code. Eventually I started to study Agner Fog's A_memmove, which I had been using for several years. One of the code paths was an AVX (256-bit) version. Eventually I re-engineered his code into C so I could analyze the algorithm (my C version of his algorithm is called avx_memcpy0 in the string.zip file). It took at least 20 functions before I could outperform Agner's code and the built-in versions, mainly through my own ignorance. Eventually after about 100 function combinations I was able to consistently beat them.

So in order to prevent myself from feeling the urge to go back and re-live this dark madness, I must release what I have and hope that someone can make sense of my madness!

Benchmarks

These are only ESTIMATES taken from the original article, which did not include my fastest implementations which were yet to come; so these estimates are from older slower variations.

large copy (>= 128 bytes)

32-bit = 40% faster

64-bit = 30% faster

small copy (< 128-bytes)

15%~40% faster

These are very old numbers! The functions included here are faster! Depending on hardware of course!

Code

To be as brief as I can; the code consists of 3 files, a header (.h), .c file for C and .cpp file for C++ using the `apex` namespace! Choose if you want the C or C++ version ... no difference in terms of performance!

You don't have to worry about this; but the code uses a memcpy/memmove dispatcher function based on your CPU features (inspired by Agner Fog). This is a ONCE-off penalty the first time you call the functions to detect your CPU features like SSE4.2 (Core i) and SSE2 ... and then route the function pointer to the appropriate/most optimal function at runtime. I've included 3 functions for different scenarios. But as I said, you don't need to worry about this! Just call `apex::memmove()` in C++ or `apex_memmove()` in C. They are all safe to call on overlapping data. For overlapping data they read/write in reverse direction!
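
The once-off dispatch described above can be sketched with a function pointer that initially points at the dispatcher and is overwritten on the first call. This is only a sketch under assumptions, not the code from `apex_memmove.zip`: `pick_implementation`, `dispatch_first_call`, `sketch_memmove` and the `detect_calls` counter are hypothetical names, and the placeholder selection simply returns the standard `memmove`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*memmove_fn)(void *dst, const void *src, size_t size);

/* Counter used only to make the one-time behaviour observable. */
static int detect_calls = 0;

/* Placeholder for the CPUID-based choice; a real dispatcher would
   return one of several tuned functions here. */
static memmove_fn pick_implementation(void)
{
    detect_calls++;
    return memmove;
}

static void *dispatch_first_call(void *dst, const void *src, size_t size);

/* Starts out pointing at the dispatcher itself. */
static memmove_fn current_memmove = dispatch_first_call;

/* Runs once: picks an implementation, stores it, then forwards the call. */
static void *dispatch_first_call(void *dst, const void *src, size_t size)
{
    current_memmove = pick_implementation();
    return current_memmove(dst, src, size);
}

/* Public entry point: always one indirect call. */
static void *sketch_memmove(void *dst, const void *src, size_t size)
{
    return current_memmove(dst, src, size);
}
```

After the first call, every later call goes directly to the selected function. The sketch ignores threads; if two threads race on the first call they would both store the same pointer, but strictly conforming code would want an atomic store there.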

Note: the CPU feature detection is NOT for the SSE 4.2 instruction set, it's for the ARCHitecture of the computers with that instruction set, i.e. computers with SSE4.2 have fast `UNaligned` memory reads. Meaning they have no clock cycle penalty from using `loadu`, which loads unaligned memory. So we don't have to `align` the memory by reading 1-15 bytes first. `kryptonite` does the copies on machines that have NO penalty for UNaligned reads (e.g. Core i), and `tiberium` for machines that require alignment for optimal efficiency. `mithril` is used on 32-bit + SSE2 machines, or they will default to the normal/built-in memmove()/memcpy() so they will ALWAYS be safe to copy no matter what hardware they run on!
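
The selection rule in the note above can be written down as a small pure function. This is a sketch of the rule as the article states it, not the actual dispatcher: `choose_variant` is a hypothetical name, and the three flags are plain inputs here (real code would derive them from the build's pointer width and from CPUID).

```c
#include <assert.h>
#include <string.h>

/* Returns the name of the variant the article associates with each case. */
static const char *choose_variant(int is_64bit, int has_sse2, int has_sse42)
{
    if (is_64bit)   /* every 64-bit x86 CPU has SSE2 */
        return has_sse42 ? "kryptonite" : "tiberium";
    if (has_sse2)   /* 32-bit build with SSE2 */
        return "mithril";
    return "builtin"; /* fall back to the compiler's memmove/memcpy */
}
```

Note that SSE4.2 is used purely as a marker for CPUs with cheap unaligned loads; none of the variants needs an SSE4.2 instruction.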

I gave my fastest functions code names, hence the names `kryptonite`, `tiberium` and `mithril`. I wanted to present these general purpose functions because I believe they could make a significant contribution to the world!

`apex` is the name of my general purpose function library, which includes many other functions which are ALL faster than stdlib, GCC, MSVC etc. I have faster functions for string manipulation, lcase, ucase, strlen, strcpy etc. But I have not released any others except these two. Anyways, enjoy the madness!

Download apex_memmove.zip

memmove-OLD-archive.zip

I uploaded this file ONLY for RESEARCH purposes, for anyone investigating and doing research on this topic! It just includes most/many of my original functions with a lot of comments. About the first 100 functions were named `sse2_memcpy##`, then I changed the naming convention to `memmove##` because the `move` test is only about 1 or 2 clock cycles! It should include many DWORD/QWORD variations as well, although I haven't even looked in the file, it has been 2 years and if I look at that file it will haunt me. Just take what you can get and be happy please! Don't ask me questions about it unless you are desperate. I don't want to be sucked into that dark world again! Actually, the `string.zip` file below should contain a more complete account of my functions, as it contains my original AVX experiments etc.

Download memmove-OLD-archive.zip

string.zip

Read Update 5 below! This file contains the original 80,000 lines of code! Note: this is for RESEARCH purposes! Don't read this unless you wanna go nuts! This includes my original conversion of Agner Fog's `A_memmove` AVX (256-bit) function written in Assembler but converted to C source code (avx_memcpy0)! I did the conversion with line-by-line comments!

Download string.zip

benchmarks.zip

Some OLD benchmarks, again just for posterity. Don't ask what all the numbers mean, I knew what they meant at one time. Read Update 3 below for more info.

Download benchmarks-OLD.zip

memmove64-asm.zip

The file contains an `optimized` assembler version of `mithril`. `mithril` is NOT my fastest version. This was a CONVERSION of the disassembled code. Read update 2 below!

Download memmove64-OLDER-asm.zip

Q. Isn't an Assembler function faster?

Yes, however, I'll get you 99% of the way with these functions! I give other details on this below in the section where I copied my original unpublished article from 2 years ago, but I thought I would answer this question here anyway.

I included my 64-bit Assembler version of `mithril` (memmove13/mm13) function in Update 2, and you can go through and analyse the changes I made to the code. Please remember that this is now almost 3 years after I wrote that assembler code, but I'm pretty confident that it's still probably one of the fastest memmove functions ever written (since I have some highly efficient copy algorithms in it that I never published!)!

An assembler version WILL be faster, however, there are several things to keep in mind. When I wrote each C function, I looked VERY closely at the disassembly (that's why you see some highly irregular things like `size = -size`, and other weird tests). I made MANY algorithmic changes based on what I saw my compiler (Visual Studio 2010) was doing. You can see the implication of looking at the disassembly when you see things like (size <= 112) instead of 128, because Visual Studio was not using ALL the available registers. Same thing goes for XMM0-XMM5 ... I couldn't use 2 of the XMM registers; as soon as I used 7 or 8 XMM registers Visual Studio would just f*** up the whole thing and start writing data to temporary stack variables etc. (Visual Studio I believe reserves 2 XMM registers, can't remember why). GCC would be better, but I wanted to target the `least` common denominator between them! GCC can benefit even more from some of the other algorithms I wrote; they can be found in the zip file, but they are more specific, and I wanted a general purpose, copy/paste version that would compile fast on most guys' machines!

So, these C algorithms were `written by disassembly`, or at least `optimized by disassembly` ... so I'm confident they will compile to very good, high enough quality machine code! To optimize each function took me 2~3 hours, and in most cases I could only save a few clock cycles. I'm sure a much better assembler writer like Agner Fog (who I've been in contact with) can improve them further, but then you lose the convenience of copy/paste C code. To have a copy/paste version in C that can be added to anyone's library was of far greater importance to me. Most guys would probably not attempt to implement/replace their memmove with assembler code, but C should have a much bigger audience!

So yes, an optimized assembler version is faster, but these are faster than anything you've had before! Just copy/paste and compile (and sort out anything missing from your build)!

size = -size

I honestly can't remember why I was doing this. I KNOW it's an UNSIGNED value and I'm using a negative, I just can't remember why I was doing it. Somewhere in my code I saw a comment about `it saves 2 clock cycles`, but this is like 3 years later, and I'm confident in the algorithm because I tested and VERIFIED ALL memory I copied/moved to make sure my algorithm was functioning correctly. This statement had something to do with the way Visual Studio was handling the alternatives. It was a hack/trick I was using to reduce instructions.

I KNOW I'm not supposed to do stuff like this, but I don't give a damn. Use the code or be slow and don't use it ... I really don't care. Just leave the instruction. It was important!

Self Update: I think this is supposed to be `size = ~size` ... but I'm just not sure anymore! Why the hell was I doing this? seriously, isn't it supposed to be `size = ~size`? Why wasn't I using that instruction then??
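
Since the original reason is lost, the following is only a guess at what the negation may have been for, not a reconstruction of the apex code. Two facts are certain in C: negating an unsigned value is well defined (it wraps modulo 2^N), and `-size` equals `~size + 1`, so `size = -size` and `size = ~size` differ by exactly one. A common idiom that uses a negated size is to point at the END of both buffers and run an index from `-size` up to zero, so the loop counter doubles as the offset and the exit test is a simple compare against zero. The function names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Forward copy using an index that counts up from -size to 0.
   Assumes non-overlapping buffers and size <= PTRDIFF_MAX. */
static void copy_with_negative_index(unsigned char *dst,
                                     const unsigned char *src, size_t size)
{
    const unsigned char *src_end = src + size;
    unsigned char *dst_end = dst + size;
    ptrdiff_t i = -(ptrdiff_t)size;

    for (; i != 0; ++i)
        dst_end[i] = src_end[i];
}

/* Checks the two's complement identity -size == ~size + 1 for size_t. */
static int negation_identity_holds(size_t size)
{
    return (size_t)(0 - size) == (size_t)(~size + 1u);
}
```

Whether this is what saved the `2 clock cycles` mentioned in the old comment cannot be confirmed from the article.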

Conclusion

I must end this article here and just present the code to you, or I will never finish it. I want to release this code in the hopes that it will be useful to someone/ANYONE else. Maybe Microsoft or the GCC/LLVM/Clang/stdlib guys can get some ideas from this; even if I help them improve their versions by 5%, 10% or 20%, I feel it would have been worth the madness.

I know this article is much shorter than I wanted it to be, but I find that if I go into too much detail I get lost, frustrated to explain it and just go round in circles; so even if I feel this article is not very professional (I'm not a writer) I must release it as soon as I can!

May this code go forth and improve humanity!

Update 1:

"Mithril"

I've decided to present `mithril` to you. It's located in the zip file as memmove13 (mm13). Ah, I just remembered that I have an optimized assembler version of this function as well, I'll release it after this!

Q. What is `mithril`?

A. `mithril` is one of the fastest, general purpose, multi-platform (32-bit AND 64-bit) implementations I have. This is a general purpose replacement for ALL built-in memmove/memcpy implementations in ALL compilers! This function WILL outperform both Visual Studio (2010) AND GCC memmove/memcpy, as long as you have a P4 (circa 2001) or newer! Since all MY PCs are less than 15 years old, and ALL 64-bit processors have SSE2, it's pretty safe to use! You could also just put an `#if is64bit` (`#if _WIN64`) test in front of it if you really want. Or if you are really serious, you can do a CPU feature detection like above. However, this function is ALSO optimized for 32-bit as they have fewer general purpose registers (the function uses fewer variables, which uses fewer registers so it's still well suited for 32-bit machines, but maintains an optimal and high efficiency inner loop, especially for larger data). This function WILL SIGNIFICANTLY outperform BOTH MSVC AND GCC built-in functions (in 99.9% of the situations) when compiled because of the algorithm! `tiberium` and `kryptonite` above are faster because they use more registers, optimized more for 64-bit.

`mithril` is my proof that you can write a general purpose memmove/memcpy function in C that outperforms the built-in ones! You just need to spend a few months writing, testing, benchmarking and optimizing it! Or you can just copy/paste and compile my code!

The original mm13 code included several alternative code paths, but I've removed them here for presentation purposes. If you want to study the original mm13 then search for `mithril` in the zip file!

You can copy this code into any library/namespace you have, or just leave it global! This function will do BOTH memmove AND memcpy faster than anything else you have! Enjoy!

Update on `mithril` code

I was asked to reduce the code length of the article. So I have included `mithril` in the `apex_memmove.zip` file above! `mithril` is used for the 32-bit code when SSE2 is detected!

Update 2

`mithril` above in (optimized) 64-bit assembler = the `Atomic Edition`

`mithril` above was the last function I converted to assembler. It took almost 3 hours to optimize each function after compiling and disassembling. So I stopped doing it after mithril. I've given this function to Agner Fog as well, so I thought I would release it here for academic studies. I will also upload my full/final 64-bit assembler file. Please note that this file is 3 years old, and it DOES NOT include `tiberium`/`kryptonite` above, only a few versions I disassembled and optimized. You can view the following code as the fastest 64-bit assembler version I ever produced (after disassembly and cleanup)!

PS: It was MUCH faster and easier to produce/test/debug/benchmark these functions in C; than in assembler! Because I could just copy/paste entire sections/blocks of code in C to test various combinations for different sizes than to write these in pure assembler to begin with! That's why I could produce 140 different variations, it would take me months to write & test all those combinations in assembler (even with copy/paste you need to make sure your registers for each block/section are still the same, I have nightmares about it!)!

FOR ACADEMIC PURPOSES!

Download memmove64-OLDER-asm.zip

Update

I removed the <code> section that was here. The assembler listing is too long. Look inside memmove64-OLDER-asm.zip for the asm_memmove13 function!

Update 3

OLD Benchmarks files

Now I've uploaded ALL the files I can find.

This file includes the original article template, which was unfinished and unpublished.

As well as 4 `benchmarks.xls` files. I cannot remember what the hell all those benchmarks are about. I believe the value used was my `Bpc` value `bytes-per-counter` ... so more was better I guess. I just remember that it probably took me about 20~40 functions before I could start beating other implementations.

If you flip through some of the tabs in the `benchmarks` files, especially in file 2 and 3, you should see some nice graphs I was plotting to analyze the characteristics of each function. The functions were tested the most on my Core i5 and Core 2 E6600. So you should see i5 and C2, but I also used a Core i7, Core (solo) etc. for testing. I tested them on all the machines I had available to me!

Also, I remember testing all the sizes from 1~128 bytes; as well as MANY various large sizes; aligned, MISaligned, UNaligned, cached and uncached! I can't even remember what the difference between MISaligned and UNaligned was. I think MISaligned means that both source and destination are BOTH on unaligned addresses!?!?

I had a VERY advanced testing/benchmarking test suite but I cannot find it now! :( If I ever find it, I'll upload it here! It was VERY impressive! I'm just so sorry I can't find it now!

Download benchmarks-OLD.zip

Update 4

NEW/Improved `memmove_dispatcher()`

This is a complete re-write of the memmove_dispatcher() function which detects the compiler, the CPU architecture AND the CPU features.

I re-wrote this function because I'm sure there will be a lot of GCC guys that want to test/benchmark my functions, and __get_cpuid() is a bit of a pain to implement. So this should be a good copy/paste version for you!

Please note, you will need to ADD the code for `mithril`, `tiberium` and `kryptonite` above! You need to rename the one function from `memmove` to `mithril` when you add it. `mithril` is used on 32-bit code when the CPU has SSE2 instructions (ie. 32-bit + SSE2 = `mithril`). If it's a super old CPU, then we just default to the standard built-in memmove/memcpy functions, which usually just use QWORDS for copies!

Basically, once you've implemented this function, and added `mithril`, `tiberium` and `kryptonite` to the list, you have EVERY possible combination, a COMPLETELY faster memmove/memcpy implementation no matter what CPU architecture you are running! You'll cater for every situation! These functions WILL MAXIMIZE the memory bandwidth you have, particularly for large copies; which is quite possibly one of the fastest loops in the world!
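
For readers who want to see what the feature detection involves, here is a minimal sketch that reads the two relevant CPUID bits on both compiler families. It is not the `memmove_dispatcher()` from the zip file; `cpu_features` and `detect_cpu_features` are hypothetical names. The bit positions are from the CPUID specification (leaf 1: SSE2 is EDX bit 26, SSE4.2 is ECX bit 20). `__get_cpuid` comes from GCC/Clang's `<cpuid.h>` and `__cpuid` from MSVC's `<intrin.h>`. On non-x86 targets the sketch reports both features as absent.

```c
#include <assert.h>

#if defined(_MSC_VER)
#include <intrin.h>   /* __cpuid */
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>    /* __get_cpuid */
#endif

typedef struct {
    int sse2;   /* 1 if CPUID leaf 1 reports SSE2 */
    int sse42;  /* 1 if CPUID leaf 1 reports SSE4.2 */
} cpu_features;

static cpu_features detect_cpu_features(void)
{
    cpu_features f = {0, 0};
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4];  /* EAX, EBX, ECX, EDX */
    __cpuid(regs, 1);
    f.sse2  = (int)(((unsigned)regs[3] >> 26) & 1u);
    f.sse42 = (int)(((unsigned)regs[2] >> 20) & 1u);
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.sse2  = (int)((edx >> 26) & 1u);
        f.sse42 = (int)((ecx >> 20) & 1u);
    }
#endif
    return f;
}
```

The result would be computed once and then used to pick between `kryptonite`, `tiberium`, `mithril` and the built-in functions.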

Happy copying/moving! And good luck trying to beat the speed of these functions!

Update on the improved memmove_dispatcher()

I have already included this code in the apex_memmove.zip file!

Update 5

The FINAL 80,000 lines of code (string.zip)

While in communication with Agner Fog, I found my final implementation of all this code. It was sitting in a file called `string.hpp` ... no wonder I couldn't find it before!

This file contains my original conversion of Agner Fog's A_memcpy() function into C code, the function is called avx_memcpy0() ... that's a zero! This is a very interesting function to study! It is a conversion of the AVX (256-bit) version of Agner Fog's code; he has several code paths based on CPU features (as I do), and this was the most advanced one I found at the time I wrote the functions. My original goal was to better understand his code paths/structure, so I could investigate the design further! I wrote about 23 more AVX based functions after this before giving up on AVX entirely. In theory, AVX is supposed to be faster, however, in practice, design and testing, using AVX had little or no benefit! As I said to Agner Fog, I was testing on 3rd generation Core i, so it's entirely possible that there is a significant benefit in 4th generation, but I already reached the memory bandwidth limits, so there was no benefit for me!

I release all this code WITHOUT warranty! `string.zip` is released for educational/research purposes. Especially if you want to compare/understand Agner Fog's functions in C code. It's usually easier to read and understand C code than assembler!

Download string.zip

Original UNFINISHED & UNPUBLISHED Article (for posterity) - 2 years old!

This is actually my second attempt at writing this article. My first draft was written during this dark time in my life, it was very long and detailed but was never published (until today); because I started writing it after having written 22 functions, but kept coming up with new ideas and the document fell behind the details.

I'm going to copy the original article here for reference. It documents SOME of the ideas I had at the time, but remember that it was only written after writing 22 functions, so I was only 20% of the way to insanity! The fastest functions I wrote were memmove09 for "size <= 112"; and memmove40 / memmove41 for "size > 112"

Also, please note that I don't want to be asked what I was thinking at the time, it was over 2 years ago and much of the detail is lost to me. I only include this article for people really researching this topic! If you are just a regular developer wanting to use the functions, don't even look at the original article! It's very confusing!

... Start of original UNFINISHED/UNPUBLISHED article ...

Introduction

This is the story of my journey into obsession with writing a faster memcpy() implementation, one that I hope culminates and ends with this article! My only wish is that someone, somewhere will benefit from the many days I spent writing and profiling different algorithms, from this article and from the code I present, whether directly or indirectly!

I will present an SSE2 intrinsic based memcpy() implementation written in C/C++ that runs over 40% faster than the 32-bit memcpy() function in Visual Studio 2010 for large copy sizes, and 30% faster than memcpy() in 64-bit builds. For small copy sizes, the speed will vary anywhere from 15% to 40% faster for various sizes below 128 bytes. This is only one of at least 22 SSE2 memcpy() functions I've written, each one with various characteristics, such as improvements for aligned/unaligned memory, various cache prefetching schemes and improvements for various copy sizes from small size with less than 16, 32, 64 or 128 bytes or larger copy sizes. This is NOT the fastest version I've written or the most compact, as each copy size both large and small have different ways to optimize them, but this is a nice general purpose implementation to just copy/paste, with some interesting characteristics!

Background

What started out as a way to implement a faster strlen() function in C/C++ (which I did), developed into a week long obsession to write a faster memcpy(). I have analysed every single aspect of a memcpy() procedure, where every "if" or "switch" statement you add is a trade-off, and every loop or additional variable influences performance. In fact, I have so much data from days of profiling various functions, I don't even know where to begin or how to present it all, let alone the 22 SSE2 memcpy() versions I wrote and what characteristics they represent, or the 26+ other types of memcpy() functions I wrote to test other methods, ideas or aspects. In total I've written more than 50 memcpy() or related functions, some were removed because they were just specific experiments, others became the basis for further study.

While searching online for faster memcpy() implementations on several developer related forums and community sites, I find one response to be the most common (and most repulsive); "just use the standard memcpy() provided by your compiler, it's already been heavily optimized". Well, I don't think anyone spouting that nonsense has ever profiled memcpy(), and I doubt they've actually tried writing a better implementation, and if they're saying something like that and actually tried to write a better one and failed, then I have no respect for their skills or opinion on the subject! Now, the reason I call this "repulsive" isn't that memcpy() is in fact slower (in MSVC) than a custom written SSE2 implementation, it's the fact that the question being asked was about addressing a "time-critical" portion of their code, and memcpy() was in fact a "bottleneck" of sorts, and any speed ups would be beneficial to the project.

Assembler vs. C/C++

One thing I must say before you continue reading, is that all my functions are purely C/C++ implementations. Why? Well the main reason is the fact that I can't use inline assembler in 64-bit builds in Visual Studio. And the project I was working on is a Windows desktop/client application, which I usually build with Visual Studio, and the server applications I build with GCC on CentOS. Besides that, my assembler days ended with 32-bit CPU, FPU and MMX instructions, about 200~250 of them but I stopped because Visual Studio didn't support 64-bit inline assembler. I use MASM and I know I could build this in MASM as a separate library and link it, but I just wanted to investigate various algorithms first, and making quick changes or copy/pasting code from the middle of a C function is a lot quicker and easier to do than Assembler!

Unfortunately, there are some assumptions and things that can be done in Assembler that I just can't properly simulate in C/C++, like jump-tables. I know GCC has a fairly nice jump-table construct for C, where you can use "&&label" to put the labels in an array, but Visual Studio doesn't have this. So the closest I can get in Visual Studio is trying to make the switch statements as close to the jump-table that I want in the hopes that the compiler will see the benefit of using a jump-table internally, but I know there are some switch statements or cases that it evaluates "manually", depending somewhat on your compiler settings.
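
To illustrate the kind of switch statement meant here, the following is a minimal portable sketch (not one of the article's functions; `copy_up_to_8` is a hypothetical name). The case labels are dense and consecutive and every case falls through to the next, which is the shape a compiler can turn into a jump table, although whether it actually does so depends on the compiler and its settings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies 0..8 bytes between non-overlapping buffers.
   Larger sizes are ignored in this sketch. */
static void copy_up_to_8(unsigned char *dst, const unsigned char *src,
                         size_t size)
{
    switch (size) {
    case 8: dst[7] = src[7]; /* fall through */
    case 7: dst[6] = src[6]; /* fall through */
    case 6: dst[5] = src[5]; /* fall through */
    case 5: dst[4] = src[4]; /* fall through */
    case 4: dst[3] = src[3]; /* fall through */
    case 3: dst[2] = src[2]; /* fall through */
    case 2: dst[1] = src[1]; /* fall through */
    case 1: dst[0] = src[0]; /* fall through */
    case 0: break;
    default: break; /* out of scope for this sketch */
    }
}
```

With GCC's labels-as-values extension the same table could be built by hand from `&&label` addresses, but that form does not compile under Visual Studio.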

Intel vs. AMD

I know there are some architectural differences, but I don't have an AMD to test. I only have 3 more recent Intel processors, and newer processors are all I'm interested in. Also, I did have a look at the various AMD instruction timings and latencies in Agner Fog's "Instruction tables" for the various AMD architectures, and they look really similar to Intel's on newer architectures. One thing I must point out, is that the SSE2 implementation I will present here uses `loadu` (unaligned load) instructions to load less than 128 bytes of data. This instruction is slower on older CPUs, however, the main issue would be copying between 17~32 bytes, it `might` be a few cycles slower, because this requires 2x 16-byte loadu/storeu instructions, not enough to compensate for the 3 cycle loadu instruction. For example, the MOVDQU (loadu/storeu) instruction which loads and stores unaligned memory on the AMD K10 (2007) uses 1 cycle for `loadu`, 3 cycles for `storeu`, and 2 cycles for MOVNTDQ (streaming). However, from Bulldozer (2011), all 3 instructions take 1 clock cycle like Intel processors from Nehalem (2008), Sandy Bridge (2011) and Ivy Bridge (2012). I try to do what I can with "prefetching" very early but I don't have an old Intel or AMD to test; my hope is that the early prefetch will overcome some of the older CPU deficiencies. But the prefetch statements represent a measurable 2% improvement even on the latest Intel CPUs!
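
The `2x 16-byte loadu/storeu` case for 17~32 bytes can be sketched as follows. This is an illustration of the general technique, not code from the article: `copy_16_to_32` and the `SKETCH_HAVE_SSE2` macro are hypothetical names. The idea is to load the first 16 bytes and the last 16 bytes (the two blocks overlap when size < 32) and then store both, so no loop and no size-dependent tail handling is needed. A plain fallback is included so the sketch also builds on targets without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  /* SSE2 intrinsics */
#define SKETCH_HAVE_SSE2 1
#endif

/* Copies `size` bytes; the caller must guarantee 16 <= size <= 32. */
static void copy_16_to_32(void *dst, const void *src, size_t size)
{
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;
#ifdef SKETCH_HAVE_SSE2
    /* Both loads happen before either store, so overlapping
       source/destination ranges are handled correctly. */
    __m128i head = _mm_loadu_si128((const __m128i *)s);
    __m128i tail = _mm_loadu_si128((const __m128i *)(s + size - 16));
    _mm_storeu_si128((__m128i *)d, head);
    _mm_storeu_si128((__m128i *)(d + size - 16), tail);
#else
    unsigned char tmp[32];  /* portable fallback via a temporary */
    memcpy(tmp, s, size);
    memcpy(d, tmp, size);
#endif
}
```

On CPUs where `loadu`/`storeu` are slow, these four unaligned instructions are where the extra cycles mentioned above would come from.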

Intel Core architecture

From looking at Agner Fog's Instruction tables, I think the worst case scenario is going to be the Pentium M, Core Solo, Core Duo, Merom and Wolfdale architectures. On Pentium M, Core Solo and Core Duo, `loadu` is 4 cycles, `storeu` is 8 cycles and the `streaming` instruction is 4 cycles. I use the `streaming` instructions when the copy is larger than 128-bytes to bypass the CPU cache, but `storeu` is used below 128-bytes. On these CPUs, you can use the aligned MOVDQA (load/store) instructions which use 2 cycles for `load` and 2 for `store`, but that means you would need to add more checks to align <= 128-bytes. I already handle the alignment on more than 128-byte copies, but not sizes below and including 128-bytes! I do have the early `prefetching`, which could compensate for these cases. Alternatively, you can remove the "if size <= 128" statement completely, and let the main code, which also does the aligning, handle less than 128-bytes. The reason I left this in is that it's faster on newer CPUs, and it demonstrates a few interesting cases! It's just an interesting piece of code to study and analyze!
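
A streaming copy loop with a prefetch hint can be sketched like this. It is an illustration under assumptions, not the article's loop: `stream_copy_aligned`, `PREFETCH_AHEAD` and `SKETCH_HAVE_SSE2` are hypothetical names, the 4096-byte look-ahead is the figure quoted in the introduction, and the choice of `_MM_HINT_NTA` and of one prefetch per 64 bytes (a typical cache line) are guesses. `_mm_stream_si128` (MOVNTDQ) requires a 16-byte aligned destination, so the sketch demands that of the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  /* SSE2 intrinsics, also pulls in _mm_prefetch */
#define SKETCH_HAVE_SSE2 1
#endif

#define PREFETCH_AHEAD 4096u  /* look-ahead distance in bytes */

/* Requirements: dst is 16-byte aligned, size is a multiple of 16,
   and the buffers do not overlap. */
static void stream_copy_aligned(void *dst, const void *src, size_t size)
{
#ifdef SKETCH_HAVE_SSE2
    const char *s = (const char *)src;
    char *d = (char *)dst;
    size_t i;

    for (i = 0; i < size; i += 16) {
        __m128i v;
        /* Hint once per 64 bytes, only while the target is in range. */
        if ((i & 63) == 0 && i + PREFETCH_AHEAD < size)
            _mm_prefetch(s + i + PREFETCH_AHEAD, _MM_HINT_NTA);
        v = _mm_loadu_si128((const __m128i *)(s + i));
        _mm_stream_si128((__m128i *)(d + i), v);  /* bypasses the cache */
    }
    _mm_sfence();  /* make the streaming stores globally visible */
#else
    memcpy(dst, src, size);  /* fallback for targets without SSE2 */
#endif
}
```

A tuned version would unroll the loop and handle the unaligned head and tail itself; this sketch leaves both to the caller.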

On Merom (2006~2009) and Wolfdale (2007~2011), I don't know what Intel did with the `storeu` (save unaligned memory) MOVDQU instruction, but it went from 8 cycles to 9 cycles (Unless Agner's document is wrong?). However, the aligned `load` and `store` instructions MOVDQA and the `streaming` MOVNTDQ instruction are all 1 cycle now.

Method of profiling

I used Visual Studio 2010 in Release build, in general with "Full Optimization" and "Favour Fast Code". All functions were timed with QueryPerformanceCounter() over several million calls, usually lasting several minutes or hours, depending on the nature of the test. I have been running tests overnight for the past week, I would run shorter tests during the day and then create some longer tests before going to bed, and run them while I sleep. Many of the tests were scheduled to run so long that they would take months to complete. 2x 1GB buffers were allocated in 64-bit mode, and 2x 512MB buffers in 32-bit builds. I have a Core i5 as my primary development machine, but I also ran my tests on an i7 and Core 2 Duo. The specific timings and findings on various architectures didn't interest me as much as the timings of each function in relation to each other. I stopped timing on the other machines because the trend was generally the same. I don't have an AMD to use for testing. I did try one of their suggestions in the "Software Optimization Guide for AMD64 Processors" ... which I know is a super old article, but there are not many articles on optimizing memcpy(). Anyway, I tried it but I feel there are better ways to optimize in light of more recent architecture.

So let me remind the reader that this didn't start out as a `scientific` study and documenting my progress wasn't on the agenda, so I've actually lost many of my early notes. But I collected so much data, and spent so many hours investigating memcpy() that I feel it would be a shame if nobody else benefited from my findings and observations, and I had no idea it would take more than a week. I started by simply profiling memcpy() with a few simple functions I found online and wrote or modified myself. All the functions I wrote have exactly the same input and output as memcpy() from the standard library. I haven't profiled GCC's memcpy() implementation, because the original purpose of this was a Windows desktop/client application.

One important aspect of my method of profiling is that the functions were all put into a function pointer array. This completely eliminates any potential "inlining" benefit these functions might have over each other, and it puts them all on a level playing field. All the functions in the test array were tested over exactly the same range of data, with random source and destination addresses within the allocated buffers. Before each function's test, srand(0) was called to reset the PRNG, so that each function used exactly the same random numbers for the src and dst addresses as all the other functions.
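A minimal sketch of this kind of harness is shown here. The names `copy_fn`, `naive_copy`, `candidates` and `run_one` are made up for illustration, timing is left out, and the buffer and copy sizes are parameters rather than the sizes used in my actual tests.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Same signature as memcpy(), so every candidate fits in one table. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t size);

/* Hypothetical candidate: a plain byte-by-byte copy. */
static void *naive_copy(void *dst, const void *src, size_t size)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t i;
    for (i = 0; i < size; i++)
        d[i] = s[i];
    return dst;
}

/* Calling through this table removes any inlining advantage. */
static copy_fn candidates[] = { memcpy, naive_copy };

/* Run one candidate over `calls` pseudo-random copies and return the total
   bytes copied. Requires copy_size <= buf_size. srand(0) resets the PRNG so
   that every candidate sees exactly the same src/dst offsets. */
static unsigned long long run_one(copy_fn fn, unsigned char *dst_buf,
                                  const unsigned char *src_buf,
                                  size_t buf_size, size_t copy_size, int calls)
{
    unsigned long long total = 0;
    int i;

    srand(0);
    for (i = 0; i < calls; i++) {
        size_t src_off = (size_t)rand() % (buf_size - copy_size + 1);
        size_t dst_off = (size_t)rand() % (buf_size - copy_size + 1);
        fn(dst_buf + dst_off, src_buf + src_off, copy_size);
        total += copy_size;
    }
    return total;
}
```

Because every candidate replays the same offset sequence, two candidates run against identical source data should leave identical destination buffers, which doubles as a cheap correctness check.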

The tests and results

I've run every test you can imagine. Aligned/unaligned, small/large, with/without prefetching. The data I collected was actually so massive, the results file is more than 10,000 lines. I think that instead of explaining all the results, I will try to focus exclusively on the implementation I present compared to the 64-bit memcpy().

Aligned memory; aligned src, dst and size

So when we talk about "aligned" memory, we could be talking about the `alignment` of the source or destination addresses on a 16-byte boundary, or an `aligned` copy size, e.g. a 16, 32 or 48-byte copy. Alignment plays a significant role when copying memory with SSE2, because some instructions are designed for aligned memory only, and they have significant performance benefits. This usually applies to older technology, but I've noticed that some of the newer `low power` CPUs from Intel also have a 2 clock cycle penalty for unaligned memory. Besides that, I have noticed that even on my i5, aligned memory gives a very small performance boost. There are several reasons for this, such as unaligned memory often requiring the CPU to read 2 cache lines, but I won't go into too many technical details here. Many SSE2 implementations don't handle unaligned memory at all, but I feel that this really limits you, so my implementation will accept unaligned memory and, in the worst-case scenario, will align the destination address but leave the source unaligned. Since the source is the "read" address (dst is "write"), it has a worst-case scenario of 4 clock cycles per 16-byte read, but since we spread things out a bit and use prefetching, this penalty should be minimized on older CPUs anyway. No matter if you send aligned or unaligned memory, the code will align the destination address for copies of more than 128 bytes.
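For reference, the two kinds of alignment can be checked with a little pointer arithmetic. This is only a sketch, and the helper names (`is_aligned16`, `is_aligned_size16`, `bytes_to_align16`) are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Non-zero if the address sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}

/* Non-zero if the copy size is a multiple of 16 bytes (16, 32, 48, ...). */
static int is_aligned_size16(size_t size)
{
    return (size & 15) == 0;
}

/* Number of bytes (0..15) to copy before p reaches the next 16-byte boundary. */
static size_t bytes_to_align16(const void *p)
{
    return (size_t)((16 - ((uintptr_t)p & 15)) & 15);
}
```

Aligning only the destination means copying `bytes_to_align16(dst)` bytes first, then using aligned stores for the bulk of the copy while the loads stay unaligned.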

Bpc

So what is Bpc? Well, it's a funny story. When I started profiling, I would have to look at this huge 64-bit number (the result of calls to QueryPerformanceCounter()) and try to figure out which were the fastest functions. This became very annoying, so I eventually changed to a "bytes-per-counter" ratio. It's basically the total bytes copied, divided by the difference between the start and end calls to QueryPerformanceCounter(). Think of it as "bytes-per-cycle" or "bytes-per-call" or "bytes-per-copy" or "bytes-per-counter" or whatever ... I'll just call it "Bpc". In general, the bytes copied are astronomical numbers like 2015000000000; that's trillions of bytes over several minutes. These are very impractical numbers to work with, and the cycle counts are much worse, so Bpc just gives me a number like 3515.685421 Bpc, which was the peak throughput I was able to achieve!
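The calculation itself is trivial. Here is a sketch with a made-up function name, `bpc`, where `start` and `end` stand in for the two counter readings:

```c
#include <stdint.h>

/* Bytes-per-counter: total bytes copied divided by elapsed counter ticks.
   `start` and `end` stand in for two QueryPerformanceCounter() readings. */
static double bpc(uint64_t total_bytes, int64_t start, int64_t end)
{
    int64_t ticks = end - start;
    if (ticks <= 0)
        return 0.0;               /* guard against an empty interval */
    return (double)total_bytes / (double)ticks;
}
```

On Windows, the two readings would come from the `QuadPart` member of the `LARGE_INTEGER` filled in by QueryPerformanceCounter(). Note that the counter frequency varies between machines, so Bpc values are only comparable between functions timed on the same machine.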

Function Categories

So, in order to test various copy ranges, I wrote some specific functions, and some more general purpose functions. Some functions were written with one goal in mind, others to test a range of goals, or to see how the different methods could be linked together to form the final function. I believe a fast memcpy() implementation should be fast over every range of sizes, and in many cases the best way to do that is to write code specific to each range. So the size ranges are 0-16 bytes, 17-128 bytes and then greater than 128 bytes. Within each one of these 3 categories of sizes, there are sub-ranges. For instance, 0-16 has 1-3, 4, 5-7, 8, 9-15 and 16 bytes. Each one of these has a "best method", but then the others will suffer. It's always a trade-off! For example, to copy 8 bytes, the fastest method would be to use a single 8-byte (64-bit) `long long`/__int64/int64_t copy, but then how do you handle the rest? More tests (if's) or a loop means a slower copy. Do you do 16 bytes with 1x SSE2 copy, 2x 8-byte copies or 4x 4-byte copies? A 4-byte copy loop can handle 4, 8, 12 AND 16 bytes, but it's slower than 2x 8-byte copies. Anyway, there are so many different ways to copy data, it's mind-blowing. Every time you think you've developed the "silver bullet", there's some case that just tanks!

loword

The functions in this category were written to test small copies, generally 16 bytes or less. Some use a for-loop to copy 1, 2 or 4 bytes at a time, and others use bitwise "&". In general, the for-loop overhead slows a function down, so bitwise operations are favourable. But you can't use bitwise for all 64 bits; here we are just testing very small sizes and doing one or two copies at a time. There's actually so much to say about this category, but I'm really not sure how interesting it is for others, and in the grand scheme of things, I should have spent less time trying to optimize my 16-byte copies :p
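One possible form of the bitwise approach is sketched here, assuming the caller guarantees size <= 16. The name `small_copy_bits` is made up, and this is an illustration of the idea rather than one of the functions I profiled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch for size <= 16: each set bit of `size` selects one
   fixed-width copy, so there is no loop counter to maintain. Fixed-size
   memcpy() calls are used for the individual moves so the code stays valid
   for unaligned pointers; optimizing compilers typically reduce them to
   plain load/store instructions. */
static void *small_copy_bits(void *dst, const void *src, size_t size)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    if (size & 16) { memcpy(d, s, 16); return dst; }   /* exactly 16 bytes */
    if (size & 8)  { memcpy(d, s, 8); d += 8; s += 8; }
    if (size & 4)  { memcpy(d, s, 4); d += 4; s += 4; }
    if (size & 2)  { memcpy(d, s, 2); d += 2; s += 2; }
    if (size & 1)  { *d = *s; }
    return dst;
}
```

A 13-byte copy (binary 1101), for example, becomes one 8-byte, one 4-byte and one 1-byte move, with four tests and no loop.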

dword

This is the common `naive` 32-bit copy. The 32-bit version of memcpy() in Visual Studio almost certainly uses this method to copy, because my implementation had identical performance. The main differences will probably be how they copy the last 3 bytes, and how they "address" the source/destination: do they increment both pointers, or use a common offset like I do? It doesn't really matter either way on modern processors; almost all methods result in the same speed because the processor is able to do more than one operation per cycle.
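The two addressing styles can be sketched as follows. The function names `dword_copy_offset` and `dword_copy_ptrs` are made up for illustration; they are not the dword_memcpy1() to dword_memcpy4() functions from my tests.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Style 1: one shared offset indexes both src and dst. */
static void *dword_copy_offset(void *dst, const void *src, size_t size)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t i;

    for (i = 0; i + 4 <= size; i += 4) {
        uint32_t v;
        memcpy(&v, s + i, 4);     /* 32-bit load, safe for unaligned src */
        memcpy(d + i, &v, 4);     /* 32-bit store */
    }
    for (; i < size; i++)         /* last 0..3 bytes */
        d[i] = s[i];
    return dst;
}

/* Style 2: both pointers are incremented on every iteration. */
static void *dword_copy_ptrs(void *dst, const void *src, size_t size)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    while (size >= 4) {
        uint32_t v;
        memcpy(&v, s, 4);
        memcpy(d, &v, 4);
        d += 4; s += 4; size -= 4;
    }
    while (size--)                /* last 0..3 bytes */
        *d++ = *s++;
    return dst;
}
```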

qword

I also wrote a few 8-byte (64-bit) (long long) copy functions. These functions performed within 10% of the 64-bit memcpy() function from Visual Studio. But this margin is not enough to convince me they use 8-byte copies internally. It's possible that they've unrolled the loop or there might be some assembler tricks they're using to get the extra 10% boost, I'm just not sure.
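A naive 8-byte copy of this kind can be sketched as below. The name `qword_copy` is made up, and the details (loop form, handling of trailing bytes) are assumptions for illustration rather than the exact code I profiled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copy 64 bits per iteration, then the trailing bytes. */
static void *qword_copy(void *dst, const void *src, size_t size)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t i;

    for (i = 0; i + 8 <= size; i += 8) {
        uint64_t v;
        memcpy(&v, s + i, sizeof v);   /* 64-bit load, safe for unaligned src */
        memcpy(d + i, &v, sizeof v);   /* 64-bit store */
    }
    for (; i < size; i++)              /* last 0..7 bytes */
        d[i] = s[i];
    return dst;
}
```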

SSE2 copy

This was the main body of research. I have kept 22 functions. They just represent various attempts at improving different aspects of the copy. Many of them look similar, except for just a few lines, and those were the lines I was testing. I've actually lost a few of the functions: sometimes I wrote a function and tested it, and when it failed to prove its point, I removed it.

Findings

Intrinsic vs. non-intrinsic memcpy()

I profiled both the 32-bit and 64-bit, intrinsic and function call versions of memcpy(). I used the "#pragma intrinsic(memcpy)" or "#pragma function(memcpy)" statement for this. Maybe I'm wrong, but I can honestly say that there's little to no perceivable difference between the "intrinsic" and non-intrinsic versions. Maybe someone else can find a better way of forcing the compiler to use the 2 "theoretically" different versions. Under some conditions, I was able to force different results by adjusting compiler settings like "Minimize Size", disabling inline expansion etc. In those tests, against better judgement, the non-intrinsic version was always faster. There were no tests I performed where the intrinsic version was faster than the non-intrinsic (function call) version! So I would actually advocate disabling the memcpy() intrinsic with the above pragma, even if you use other intrinsics! Nothing good came from using the memcpy() intrinsic on any machine I tested! I'd be happy to be proven wrong on this: tell me what to do, what compiler settings to enable/disable, and what code to test/run that demonstrates the differences clearly!
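For reference, this is roughly how the pragma is applied in a source file. The wrapper name `copy_via_call` is made up, and the `_MSC_VER` guard is only there so that the sketch also compiles on other compilers, where the pragma does not apply.

```c
#include <stddef.h>
#include <string.h>

#ifdef _MSC_VER
/* Ask MSVC to emit a real call to memcpy() instead of expanding it inline.
   Swap in `#pragma intrinsic(memcpy)` to request the intrinsic form. */
#pragma function(memcpy)
#endif

/* Hypothetical wrapper: with the pragma above, MSVC compiles this memcpy()
   as an ordinary function call. */
static void *copy_via_call(void *dst, const void *src, size_t size)
{
    return memcpy(dst, src, size);
}
```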

32-bit vs. 64-bit memcpy()

Over a 256MB range of data, copied millions of times, with random addresses, the 64-bit memcpy() was 12.5% faster than 32-bit version (2700 vs. 2400 Bpc). To put this in perspective, here are some actual results, over millions of runs and several hours of profiling:

32-bit build

memcpy() = 2407.409763 Bpc
memcpy8() = 2426.479289 Bpc **
dword_memcpy1() = 2199.207560 Bpc ***
dword_memcpy2() = 2400.391856 Bpc
dword_memcpy3() = 2387.596476 Bpc
dword_memcpy4() = 2406.398597 Bpc

64-bit build

memcpy() = 2703.055754 Bpc
memcpy8() = 2460.156299 Bpc
dword_memcpy1() = 2341.839341 Bpc
dword_memcpy2() = 2340.425519 Bpc
dword_memcpy3() = 2343.732592 Bpc
dword_memcpy4() = 2342.167511 Bpc

** memcpy8() is a "naive" simple memcpy() implementation I wrote, which just copies a "long long" (64-bit value) in a for-loop, and trailing bytes.

*** Just some notes on the dword implementations so you understand them. dword_memcpy1() uses a while-loop, but the main problem is that it decrements a "bytes-copied" counter. The other implementations are just various memory addressing methods, I wanted to see how different memory calls affect performance. Even though there are some different numbers, I would say that the other functions have basically the same performance. But a difference of 200 Bpc from 2400 Bpc to 2200 Bpc is notable!

Now, that brings me to the most important observations of the methods above. ????????????????

... END OF UNFINISHED ORIGINAL ARTICLE ...