Hello,
I am seeing strange behaviour with C functions.
On an AMD Athlon X2 4800+ overclocked to 2.7 GHz:
In Arch Linux:
------------------
cmp_i686
--------------
C strcpy: 3764.7 MB/second (2560.0 MB in 680000 clocks)
our strcpy1: 2976.7 MB/second (2560.0 MB in 860000 clocks)
our strcpy2: 3084.3 MB/second (2560.0 MB in 830000 clocks)
C memcpy: 3084.3 MB/second (2560.0 MB in 830000 clocks)
our memcpy: 1969.2 MB/second (2560.0 MB in 1300000 clocks)
cmp_athlon
-----------------
C strcpy: 3764.7 MB/second (2560.0 MB in 680000 clocks)
our strcpy1: 4000.0 MB/second (2560.0 MB in 640000 clocks)
our strcpy2: 3122.0 MB/second (2560.0 MB in 820000 clocks)
C memcpy: 3122.0 MB/second (2560.0 MB in 820000 clocks)
our memcpy: 1984.5 MB/second (2560.0 MB in 1290000 clocks)
in windows:
-----------------------
C:\DundE\anh\Eigene Dateien\downloads>cmp_athlon.exe
C strcpy: 4821.1 MB/second (2560.0 MB in 531 clocks)
our strcpy1: 4196.7 MB/second (2560.0 MB in 610 clocks)
our strcpy2: 3148.8 MB/second (2560.0 MB in 813 clocks)
C memcpy: 2873.2 MB/second (2560.0 MB in 891 clocks)
our memcpy: 1973.8 MB/second (2560.0 MB in 1297 clocks)
C:\DundE\anh\Eigene Dateien\downloads>cmp_i686.exe
C strcpy: 4812.0 MB/second (2560.0 MB in 532 clocks)
our strcpy1: 2976.7 MB/second (2560.0 MB in 860 clocks)
our strcpy2: 3091.8 MB/second (2560.0 MB in 828 clocks)
C memcpy: 2825.6 MB/second (2560.0 MB in 906 clocks)
our memcpy: 1949.7 MB/second (2560.0 MB in 1313 clocks)
Okay, that is not so bad, but on my i5 430M:
linux:
--------------
[dd@lappy Downloads]$ ./cmp_686
C strcpy: 1113.0 MB/second (2560.0 MB in 2300000 clocks)
our strcpy1: 1497.1 MB/second (2560.0 MB in 1710000 clocks)
our strcpy2: 1523.8 MB/second (2560.0 MB in 1680000 clocks)
C memcpy: 1630.6 MB/second (2560.0 MB in 1570000 clocks)
our memcpy: 1207.5 MB/second (2560.0 MB in 2120000 clocks)
[dd@lappy Downloads]$ ./cmp_core2
C strcpy: 1075.6 MB/second (2560.0 MB in 2380000 clocks)
our strcpy1: 1741.5 MB/second (2560.0 MB in 1470000 clocks)
our strcpy2: 1706.7 MB/second (2560.0 MB in 1500000 clocks)
C memcpy: 1600.0 MB/second (2560.0 MB in 1600000 clocks)
our memcpy: 1213.3 MB/second (2560.0 MB in 2110000 clocks)
windows:
------------------
C:\Users\Lappi\Desktop\xcp>cmp_686.exe
C strcpy: 3731.8 MB/second (2560.0 MB in 686 clocks)
our strcpy1: 3417.9 MB/second (2560.0 MB in 749 clocks)
our strcpy2: 3417.9 MB/second (2560.0 MB in 749 clocks)
C memcpy: 2562.6 MB/second (2560.0 MB in 999 clocks)
our memcpy: 1823.4 MB/second (2560.0 MB in 1404 clocks)
C:\Users\Lappi\Desktop\xcp>cmp_core2.exe
C strcpy: 4238.4 MB/second (2560.0 MB in 604 clocks)
our strcpy1: 3699.4 MB/second (2560.0 MB in 692 clocks)
our strcpy2: 3459.5 MB/second (2560.0 MB in 740 clocks)
C memcpy: 2552.3 MB/second (2560.0 MB in 1003 clocks)
our memcpy: 2051.3 MB/second (2560.0 MB in 1248 clocks)
What is going on here? I always compiled with gcc 4.5.1 -O2 -s.
I thought that Linux was fast, but with this benchmark ... I don't know; these are the most-used functions in all applications -.-"
P.S.: Sorry for my bad English.
Here is the C test code:
/****
 *
 * modified version from Preston L. Bannister
 *
 * Note: the "::" prefixes are C++ syntax, so this has to be built as C++.
 *
 **/
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <string.h>
/*******************************************************************************
 *******************************************************************************
 * configs
 *******************************************************************************
 ******************************************************************************/
#define LOOPS 10000000
static const char sOut1[] = "QBTnetfnh8TpTWvPzARBNWr2gMFofe3AzwMXVOGbdL2xOOACwMefrMxpxZ62qakW";
static const char sOut2[] = "ct6V7lZ42RoryDlvM1EzT54T5qV3DGUA4UIIhVv0TSK0lTx0TKIFc4E4YIdfjfKp";
/*******************************************************************************
 *******************************************************************************
 * engine
 *******************************************************************************
 ******************************************************************************/
unsigned int nLength = ::strlen(sOut1);
unsigned int dtLoop = 0;    /* clocks spent in the empty calibration loop */
unsigned int nTotal = 0;    /* byte count used for the MB figures */
char sWork[256];            /* shared destination buffer */
typedef void (*doit)(const char*, const char*);

/* Print throughput for one test from its clock() delta. */
void report_times(const char* s, unsigned int dt)
{
    double ts = (double)dt / CLOCKS_PER_SEC;
    double mb = (double)(nTotal) / 1000000;
    double rate = mb / ts;
    printf("%s:\t\t %0.1f MB/second (%0.1f MB in %u clocks)\n", s, rate, mb, dt);
}

/* Call fn LOOPS times with varying offsets; return clocks minus loop overhead. */
int time_function(doit fn)
{
    clock_t t0 = ::clock();
    for (int i = 0; i < LOOPS; ++i) {
        const char* s1 = sOut1 + (15 & i);
        const char* s2 = sOut2 + nLength - (15 & i);
        (*fn)(s1, s2);
    }
    return (int)(::clock() - t0) - dtLoop;
}

/* Calibration pass: adds 4 * nLength to nTotal per call and copies nothing. */
void do_total(const char*, const char*)
{
    nTotal += 4 * nLength;
}
/*******************************************************************************
 * end engine
 ******************************************************************************/
/*******************************************************************************
 *******************************************************************************
 * benchmark
 *******************************************************************************
 ******************************************************************************/
void do_c_strcpy(const char* s1, const char* s2)
{
    ::strcpy(sWork, s1);
    ::strcpy(sWork, s2);
}

/* Pointer-increment copy, including the terminating NUL. */
void our_strcpy1(char* s1, const char* s2)
{
    while ((*s1++ = *s2++) != '\0') {}
}

/* Index-based copy, including the terminating NUL. */
void our_strcpy2(char* s1, const char* s2)
{
    unsigned int i;
    for (i = 0; s2[i] != 0; ++i)
        s1[i] = s2[i];
    s1[i] = 0;
}

void do_our_strcpy1(const char* s1, const char* s2)
{
    our_strcpy1(sWork, s1);
    our_strcpy1(sWork, s2);
}

void do_our_strcpy2(const char* s1, const char* s2)
{
    our_strcpy2(sWork, s1);
    our_strcpy2(sWork, s2);
}

/* Note: the strlen() calls are inside the timed region. */
void do_c_memcpy(const char* s1, const char* s2)
{
    int l1 = (int)strlen(s1);
    int l2 = (int)strlen(s2);
    ::memcpy(sWork, s1, l1);
    ::memcpy(sWork, s2, l2);
}

/* Byte-by-byte copy of size bytes (no terminating NUL is written). */
void our_memcpy(char* dest, const char* src, int size)
{
    for (int i = 0; i < size; ++i)
        dest[i] = src[i];
}

void do_our_memcpy(const char* s1, const char* s2)
{
    int l1 = (int)strlen(s1);
    int l2 = (int)strlen(s2);
    our_memcpy(sWork, s1, l1);
    our_memcpy(sWork, s2, l2);
}
/*******************************************************************************
 * end benchmark
 ******************************************************************************/
/*******************************************************************************
 *******************************************************************************
 * main program
 *******************************************************************************
 ******************************************************************************/
int main()
{
    dtLoop = time_function(do_total);   /* must run first: sets nTotal and dtLoop */
    report_times("C strcpy", time_function(do_c_strcpy));
    report_times("our strcpy1", time_function(do_our_strcpy1));
    report_times("our strcpy2", time_function(do_our_strcpy2));
    report_times("C memcpy", time_function(do_c_memcpy));
    report_times("our memcpy", time_function(do_our_memcpy));
    return 0;
}
Offline
I'm kind of confused. It looks like the linux benchmarks are the same or better than the windows benchmarks, especially with the i5. So what's the problem?
If they're still "kind of high," then you have to consider the numbers in reference to something else. Perhaps bench Linux against one of the BSDs or Plan 9 or something and see how it stands up.
Last edited by alexandrite (2010-08-29 18:19:15)
Offline
No, Windows is faster: by about 1 GB/s on the AMD and by about 3 GB/s on the i5 ...
Last edited by anhadikal (2010-08-29 18:29:12)
Offline
I'm not sure why you think timing the speed of a loop moving memory around is an accurate representation of an operating system's speed.
Offline
cmtptr wrote: I'm not sure why you think timing the speed of a loop moving memory around is an accurate representation of an operating system's speed.
String manipulation and memory copying are used very often, as I said, so if they are slow, other functionality based on them will also be slowed down.
greets
Offline
Interesting comparison. Could it be that the standard library is compiled without optimization?
Offline
Check the generated code.
Offline
What compiler are you using on Windows? The Intel one, by chance?
Offline
What compiler are you using on Windows? The Intel one, by chance?
On Windows it is also gcc 4.5.1 (MinGW) ...
Offline
Is that even legal C code? What do the colons in e.g.:
::strcpy(sWork,s1); ::strcpy(sWork,s2);
mean?
No, it's C++. The leading double colon is the scope resolution operator with nothing on its left, which names the global namespace; it is the same operator as in "std::cout".
Offline
Is that even legal C code? What do the colons in e.g.:
::strcpy(sWork,s1); ::strcpy(sWork,s2);
mean?
The code the OP provided looks like a bastardized love-child of C and C++.
He's using C libraries with C++ syntax.
I couldn't get it to compile with gcc, I had to use g++.
Offline
cmtptr wrote: I'm not sure why you think timing the speed of a loop moving memory around is an accurate representation of an operating system's speed.
String manipulation and memory copying are used very often, as I said, so if they are slow, other functionality based on them will also be slowed down.
greets
You're missing the point. Just because your little "mov reg, reg" loop looks like it runs slower, there is a lot more happening than you're acknowledging. If anything, the most you've proven is that maybe your program on Windows gets more timeslices than it does in Linux? Or any number of other things...
Offline
Could this be related to the kernel scheduling? Another benchmark using something like BFS would be interesting.
Offline
Here are the results on my i7 laptop.
mingw64 4.5.1
$ ./a.exe
C strcpy: 7071.8 MB/second (2560.0 MB in 362 clocks)
our strcpy1: 7420.3 MB/second (2560.0 MB in 345 clocks)
our strcpy2: 7130.9 MB/second (2560.0 MB in 359 clocks)
C memcpy: 3944.5 MB/second (2560.0 MB in 649 clocks)
our memcpy: 4406.2 MB/second (2560.0 MB in 581 clocks)
vmware arch64
% ./a.out
C strcpy: 11130.4 MB/second (2560.0 MB in 230000 clocks)
our strcpy1: 8000.0 MB/second (2560.0 MB in 320000 clocks)
our strcpy2: 7314.3 MB/second (2560.0 MB in 350000 clocks)
C memcpy: 7757.6 MB/second (2560.0 MB in 330000 clocks)
our memcpy: 7111.1 MB/second (2560.0 MB in 360000 clocks)
Both compiled with -march=native -mtune=generic -ftree-vectorize -funroll-all-loops -O2.
Last edited by yejun (2010-08-31 03:48:21)
Offline
Lol... Athlon II 240 (2.8 GHz) with BFS (-pf7 kernel).
~ % ./a.out
C strcpy: 3878.8 MB/second (2560.0 MB in 660000 clocks)
our strcpy1: 543.5 MB/second (2560.0 MB in 4710000 clocks)
our strcpy2: 1326.4 MB/second (2560.0 MB in 1930000 clocks)
C memcpy: 3200.0 MB/second (2560.0 MB in 800000 clocks)
our memcpy: 717.1 MB/second (2560.0 MB in 3570000 clocks)
./a.out 12,16s user 0,00s system 99% cpu 12,215 total
But then I have pretty slow RAM...
฿ 18PRsqbZCrwPUrVnJe1BZvza7bwSDbpxZz
Offline
anhadikal,
Regardless of anything that can be said about the validity of your tests as a means to gauge performance, you definitely cannot do it by writing code the way you do. Write proper standard code, so that everyone can be sure no compiler unknowns affect the results between compilers/systems in any significant way, especially when it comes to optimization.
Then post results and we can debate strcpy, intrinsic functions, and whatnot.
Last edited by marfig (2010-08-31 15:41:59)
I probably made this post longer than it should only because I lack the time to make it shorter.
- Paraphrased from Blaise Pascal
Offline
alexandrite wrote: I'm kind of confused. It looks like the linux benchmarks are the same or better than the windows benchmarks, especially with the i5. So what's the problem?
If they're still "kind of high," then you have to consider the numbers in reference to something else. Perhaps bench Linux against one of the BSDs or Plan 9 or something and see how it stands up.
The Linux benchmarks are worse. Much worse, in fact. Higher means better in this chart.
yejun wrote: Both compiled with -march=native -mtune=generic -ftree-vectorize -funroll-all-loops -O2.
I don't know what these flags do, but WOW, the difference they make. Intel i3:
No extra flags ("gcc test.cpp"):
C strcpy: 9142.9 MB/second (2560.0 MB in 280000 clocks)
our strcpy1: 820.5 MB/second (2560.0 MB in 3120000 clocks)
our strcpy2: 1142.9 MB/second (2560.0 MB in 2240000 clocks)
C memcpy: 7314.3 MB/second (2560.0 MB in 350000 clocks)
our memcpy: 633.7 MB/second (2560.0 MB in 4040000 clocks)
With your flags:
C strcpy: 11130.4 MB/second (2560.0 MB in 230000 clocks)
our strcpy1: 8533.3 MB/second (2560.0 MB in 300000 clocks)
our strcpy2: 7757.6 MB/second (2560.0 MB in 330000 clocks)
C memcpy: 8827.6 MB/second (2560.0 MB in 290000 clocks)
our memcpy: 8000.0 MB/second (2560.0 MB in 320000 clocks)
<keanu>Whoa</keanu>
EDIT: Also, for some reason beyond my understanding, your i7 and my i3 (with your flags) both report exactly "11130.4 MB/second" for "C strcpy".
Last edited by spupy (2010-08-31 22:28:18)
There are two types of people in this world - those who can count to 10 by using their fingers, and those who can count to 1023.
Offline
I think you are just testing compiler optimization.
% ./a.out
C strcpy: 3710.1 MB/second (2560.0 MB in 690000 clocks)
our strcpy1: 4491.2 MB/second (2560.0 MB in 570000 clocks)
our strcpy2: 3555.6 MB/second (2560.0 MB in 720000 clocks)
C memcpy: 3413.3 MB/second (2560.0 MB in 750000 clocks)
our memcpy: 3240.5 MB/second (2560.0 MB in 790000 clocks)
./a.out 3,70s user 0,00s system 99% cpu 3,730 total
฿ 18PRsqbZCrwPUrVnJe1BZvza7bwSDbpxZz
Offline
This all looks like an OS-level hardware optimization: presumably Windows uses DMA for memory-to-memory copies while Linux spends CPU cycles on them. If so, these tests don't measure anything comparable.
If anyone finds the time to get down to the basics, some other memory copy tests could reveal this. It all boils down to how memcpy-like procedures are implemented at the OS level.
Last edited by bernarcher (2010-09-01 07:58:15)
To know or not to know ...
... the questions remain forever.
Offline