What architecture did you run on? Did you compile with good optimization settings? I just tried your code, with and without the sort (the C++ variant) and did not find any runtime difference. Having a look at the assembler output (gcc.godbolt.org is handy for that) I could also see that there is no branch done on the if, but a cmovge is being used. When using -O2 I see a difference in speed only, but not with -O3... – PlasmaHH Jun 27 '12 at 14:10
159
@GManNickG: I did investigate a bit further, and things are "funny". With O3, both versions (sort/non sort) are the same speed (4.5) but with O2, both are different (3.1/15.7) so I looked at the O2 version. There is a branch. So gcc seems to optimize for "random data" here. To further test if it is branch prediction, I tested the O2 code not with sort, but in the creation phase I set/removed the top bit of the byte for one half, but not the other. Things are the same result here, so it really has nothing to do with the data being sorted, but with the if condition being true/false for one half. – PlasmaHH Jun 27 '12 at 14:16
102
Just to add more fun, on my CPU, when alternating the bits in the input, the branch predictor seems to be able to recognize the pattern. The same for some other alternating bit patterns. – PlasmaHH Jun 27 '12 at 14:37
17
instead of doing a complete sorting then summing, in this particular case try doing a partial sorting (i.e. partitioning) with pivot of 128, then sum from the pivot to the end without any branch statements or unreadable bitwise twiddling. – Lie Ryan Jun 28 '12 at 2:57
24
I think what's most interesting is that the Java VM executes the same code faster than the native C++ version for in original case