Hello,
I have a question regarding performance of GMP. The gist of it is, I can't
quite make my own code (which calls GMP) perform the way I think it should,
and was really hoping someone could help. Not sure if this is the right place
for such a question, but somehow I thought I would get a quicker reply on the
friendly Arch Linux forum. Here are the details:
I built GMP from source with my own PKGBUILD and, as far as I can tell, the
configure step is selecting all the right assembly routines for my processor
(an Intel Core 2). I built src/tune/speed and ran some tests. One important
test is multiplication. Here are the results:
$ ./speed -s 16-32 -t 8 -c mpn_mul_basecase
overhead 6.13 cycles, precision 10000 units of 7.69e-10 secs, CPU freq 1300.00 MHz
mpn_mul_basecase
16 1192.10
24 2657.00
32 4649.67
This should just test the naive multiplication routine (long multiplication).
Doing some math, we see that for 1024-bit integers (16 limbs) we are spending
1192.10 / 256 = 4.657 clock cycles per multiplication instruction (long
multiplication needs 16**2 = 256 of them). Considering that the theoretical
lower bound for the throughput of the multiplication instruction on a Core 2
is 4 clock cycles, that's pretty good.
Now the problem: I wrote my own test code to try this out in practice, and it
does not perform so well. I created two random integers of the above sizes,
and in a loop of 100000 iterations, multiplied them together with GMP. I
recorded the TSC before and after, and then divided the difference by 100000.
For 16 limbs (1024 bits) I get 2100 clock cycles. For 32, I get 7000. This
is not even close. (Note: the reason the ratio is a little better at 32 limbs
is that GMP has switched over to some version of Karatsuba at that point,
whereas the first test only measures the naive algorithm.) Does anyone know what on
earth I'm doing wrong? I have a hard time believing that little errors in
measurement aren't going to be drowned out in 100000 trials, which I've run
well over 50 times today. I also measured time with the clock, and then
computed the cycle count based on my clock frequency, and obtained the exact
same results.
Here are things I've tried:
* My CPU has a constant TSC, so I set the frequency governor to always keep
the clock at the maximum (1.3 GHz).
* To eliminate loop overhead (which should have been *tiny*) I called the
function repeatedly inside a single iteration.
* I statically linked to the GMP library.
* I messed around with different compiler / linker options, and tried to make
sure mine were similar to that of src/tune/Makefile.
* Instead of calling the high-level routines, I went straight to the "mpn_"
routines (which act on vectors of limbs directly).
* I tried different random numbers (although the run time of multiplication is
probably not very data-dependent).
* I ran the code in the terminal after a fresh boot. No wm, no X, no network,
very few daemons; system was pretty much silent except for this test code.
None of this made more than a tiny bit of difference. Below is the code I
used. I am frustrated, out of ideas, and probably doing something very dumb.
:( Many thanks for reading this.
Here is some sample output:
GMP info:
cflags: -march=native -O2 -pipe
compiler: gcc -std=gnu99
bits/limb 64
gmp ver. 5.0.4
average TSCs: 7120.15
And here are the compiler options I used:
g++ -march=native -O2 -Wall -std=c++0x -Wl,--sort-common,--as-needed,-z,relro,--hash-style=gnu -static main.cpp -o main -lrt -lgmpxx -lgmp
Last edited by wes (2012-04-03 04:44:48)
*Update*: I performed the same sequence of steps on my desktop computer at
work (which has an essentially identical software setup) and everything worked
the way I expected: my test code gave the very same results as what ./speed
produces. The processor there is also a core 2, FWIW.
So, whatever strangeness is at work is specific to my laptop. Perhaps it is
best to not waste your time trying to figure it out. I will post an update if
progress is made.
I have figured out the answer. It is disappointing, and is apparently a bug
in Arch Linux, or maybe just the Linux kernel:
https://bugs.archlinux.org/task/28859
I have tons of interrupts from acpi. If I boot with acpi=off, my test program
gives me the exact same answers as the speed program. It is interesting that
the speed program has made itself immune to this, though.
Hopefully this bug is fixed soon.
It is interesting that the speed program has made itself immune to this, though.
In case anyone else is interested, I browsed the source a bit, and then spoke
with the GMP developers. They tell me that the trick is to make several
smaller measurements, rather than one big one, and then throw out all the
measurements that seem erroneous.