Measuring performance is a tricky business.
I was comparing the timings of two benchmarks: a function I wrote versus a similar function from a library I didn't want to depend on. The other function was 10 times faster, and after reading its code, I couldn't understand why. I spent hours optimizing and shuffling bits in mine, but the results remained slower...
Now I've just realized that, because of how I tested it, the compiler performed an unexpected and quite impressive simplification of the other function. Both functions return a vector at each iteration, and I checked only one element of it, to make sure the compiler didn't optimize the call away entirely. Even though computing that element implies the others were computed too, the compiler still managed to remove enough work, including the creation of the vector itself, to make that function look much faster.
Fortunately, I finally considered that possibility and used a trick to tell the compiler that the whole result had to be produced normally.
At least my code is well optimized now...
PS: Seriously, it's tricky. There are entire books on the subject: which measurements to perform, how to measure time properly, how to analyze the results, how to remove noise, the different kinds of load to apply, and so on.