You are not logged in.
Hi guys,
Problem: I have an array of arguments, I should apply some function (which calculation is quite time-consuming) and sum the results. Since all calculation are independent I thought it would be great try to parallel my code with pthreads.
I would expect to get the faster program by approximately factor of 1/NUM_THREADS (if NUM_THREADS <= num_of_cpus) compared to the serial varial.
However I get the same time of execution.
Since this is my first experience with threads I'm out of ideas why there is no advantage. Here is the sample code that mimics the problem.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 2
#define NUM_ELEMENTS 36
#define FIB 37
long fib(long n);
long fib(long n)
{
if (n == 0) return 0;
if (n == 1) return 1;
return fib(n - 1) + fib(n - 2);
}
double work(long start, long num)
{
long i, ie = start + num, a;
double res = 0.;
for ( i = start; i < ie; ++i) {
a = fib(FIB);
res += (i + 1.) / a;
}
return res;
}
typedef struct
{
long tid;
long start, num;
double res;
} my_thread_data_t;
void *BusyWork(void *t)
{
int i;
my_thread_data_t * data = (my_thread_data_t *) t;
long tid = data->tid;
printf("Thread %ld starting...\n",tid);
data->res = work(data->start, data->num);
printf("Thread %ld res = %e\n", tid, data->res);
printf("Thread %ld done. \n",tid);
pthread_exit(NULL);
}
int main (int argc, char *argv[])
{
pthread_t thread[NUM_THREADS];
pthread_attr_t attr;
my_thread_data_t data[NUM_THREADS];
int rc;
long num = NUM_ELEMENTS / NUM_THREADS;
long t;
double res;
/* Initialize and set thread detached attribute */
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
for(t=0; t<NUM_THREADS; t++) {
printf("Main: creating thread %ld\n", t);
(data + t)->tid = t;
(data + t)->start = t * num;
(data + t)->num = num;
rc = pthread_create(&thread[t], &attr, BusyWork, data + t);
if (rc) {
printf("ERROR; return code from pthread_create() is %d\n", rc);
exit(-1);
}
}
/* Free attribute and wait for the other threads */
pthread_attr_destroy(&attr);
for(t=0; t<NUM_THREADS; t++) {
rc = pthread_join(thread[t], NULL);
if (rc) {
printf("ERROR; return code from pthread_join() is %d\n", rc);
exit(-1);
}
printf("Main: completed join with thread %ld\n",t);
}
res = 0.;
for(t = 0; t < NUM_THREADS; t++) {
res += (data + t)->res;
}
printf("Main: res = %e\n", res);
printf("Main: program completed. Exiting.\n");
return 0;
}
Changing NUM_THREADS from 2 to 1. Makes no difference in time of calculation.
Last edited by ogronom (2010-10-17 02:51:45)
Offline
The reason in this case seems to be that your fib() function takes GIGANTICALLY more time for larger inputs than for smaller ones (in fact, it has complexity O(2^n) to find the nth Fibonacci number), so the second thread does way more work than the first thread. For this algorithm, the fair parallelization is to have the first thread calculate the first FIB-1 numbers, and the second thread only calculate the last number.
Look here for a more reasonable way to calculate the nth Fibonacci number. For the linear (O(n)) algorithm, the fair split is at FIB/sqrt(2) for two threads, but FIB/2 will at least show some speed improvement.
Also, the variable `i' is unused in BusyWork().
Edit: formula fix, clarity, typo.
Last edited by tavianator (2010-10-14 20:30:59)
Offline
@tavianator: Aren't both threads always calculating fib(FIB) == fib(37) for every input value?
I've replaced the fib function with a non-recursive one:
long fib(long n)
{
long prev = -1;
long sum, result = 1;
long i;
for (i = 0; i <= n; ++i) {
sum = result + prev;
prev = result;
result = sum;
}
return result;
}
Doing this results in almost no runtime at all. Is it possible that the recursion is the problem? Does your original code also use recursion?
Offline
@tavianator: Aren't both threads always calculating fib(FIB) == fib(37) for every input value?
Um no, I think the intent of the OP is that thread 1 calculates the sub of fib(i) from i=1 to FIB/NUM_THREADS, and thread 2 from i=FIB/NUM_THREADS + 1 to 2*FIB/NUM_THREADS, etc.
I've replaced the fib function with a non-recursive one ... Doing this results in almost no runtime at all. Is it possible that the recursion is the problem? Does your original code also use recursion?
Yeah, with a non-recursive fib() I needed FIB to be 3700000 before I could get any reasonably accurate timings.
Last edited by tavianator (2010-10-14 20:39:57)
Offline
Maybe I was not quite clear (or my English is that bad), but the code posted above is just an example to illustrate the problem. The original code too large to post here.
Fib was taken in the recursive form on purpose to simulate the complex time-consuming function. There is no more possible (or sufficient) optimization for the original function.
Both threads calculate Fib(37), then do some minor fast modification (in the example (i+1.)/Fib(37)). This is done to get approximately the same (quite large) time of calculation for any input parameter i.
Offline
Sorry, I misinterpreted the code, I confused FIB with NUM_ELEMENTS, and thought you were calling fib(i) rather than fib(FIB). Ignore everything I said above
In this case the lack of increased performance is caused by loop invariant code motion -- meaning that gcc moved the a = fib(FIB); calculation out of the loop because it doesn't depend on anything in the loop and it has no side effects, so fib() is only called once. Compiling without optimization shows giant speed improvements as NUM_THREADS is increased, because fib(FIB) is computed inside the loop (13.202s to 6.539).
Edit: I doubt that this is the problem with your real code, so to help diagnose the real problem, it'd help to have access to the real code or a more thorough code example. I'd be happy to look at it even if it's huge; I happen to like implementing concurrent algorithms.
Last edited by tavianator (2010-10-15 04:20:43)
Offline
Fib was taken in the recursive form on purpose to simulate the complex time-consuming function. There is no more possible (or sufficient) optimization for the original function.
And my question was, if the original code also uses recursion? So now I guess it does not...
Have you already tried a profiling tool like gprof?
Offline
Hi
@ber_t: no I don't have recursion. Yep I profiled my program with gprof. But the question is about threads.
@tavianator: I'll think about posting my code. But right now I grew interest in this simple one. You said that you receive speed up
Here is some modification of the current code. http://pastebin.org/245755
And the results what I have.
% gcc -O0 -DNUM_THREADS=1 pthreads.c -lpthread
% time ./a.out
Thread 0 starting...
Thread 0 res = 2.756872e-05
Thread 0 done.
./a.out 38.99s user 0.28s system 99% cpu 39.460 total
% gcc -O0 -DNUM_THREADS=2 pthreads.c -lpthread
% time ./a.out
Main: creating thread 0
Main: creating thread 1
Thread 1 starting...
Thread 0 starting...
Thread 1 res = 2.049026e-05
Thread 1 done.
Thread 0 res = 7.078454e-06
Thread 0 done.
Main: completed join with thread 0
Main: completed join with thread 1
Main: res = 2.756872e-05
Main: program completed. Exiting.
./a.out 35.42s user 0.21s system 187% cpu 18.956 total
I have
% cat /proc/cpuinfo | grep name
model name : Intel(R) Pentium(R) Dual CPU T2370 @ 1.73GHz
model name : Intel(R) Pentium(R) Dual CPU T2370 @ 1.73GHz
Could you post you results of time calculation?
Offline
Sure:
tavianator@superluminal> gcc -O0 -DNUM_THREADS=1 pthreads.c -pthread && time ./a.out >/dev/null
./a.out > /dev/null 12.68s user 0.00s system 99% cpu 12.702 total
tavianator@superluminal> gcc -O0 -DNUM_THREADS=2 pthreads.c -pthread && time ./a.out >/dev/null
./a.out > /dev/null 13.19s user 0.00s system 199% cpu 6.618 total
tavianator@superluminal> gcc -O0 -DNUM_THREADS=12 pthreads.c -pthread && time ./a.out >/dev/null
./a.out > /dev/null 13.87s user 0.00s system 935% cpu 1.483 total
tavianator@superluminal> gcc -O0 -DNUM_THREADS=18 pthreads.c -pthread && time ./a.out >/dev/null
./a.out > /dev/null 16.40s user 0.00s system 1594% cpu 1.029 total
tavianator@superluminal> gcc -O0 -DNUM_THREADS=36 pthreads.c -pthread && time ./a.out >/dev/null
./a.out > /dev/null 19.47s user 0.00s system 2076% cpu 0.938 total
I have:
tavianator@superluminal> grep name /proc/cpuinfo
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
Offline
Thanks a lot for the help.
I just looked at wrong place
Offline