33 Comments
Hugh R - Thursday, September 23, 2004 - link
Thanks for this article. It has been needed for about a year. Every previous benchmark of AMD 64 seemed to be 32-bit mode which is rather missing the point.
Firefox 1.0PR on LINUX did not show the 64-bit results until I went to edit:preferences:web features:enable java advanced... and turned on lots of crap (I don't know which item made the difference).
The information was fascinating but the presentation was very awkward.
When you see a surprising benchmark result, it is a good idea to analyze why you were surprised. For example, I would guess that the poor showing for 64-bit code on John the Ripper might be due to hand-coded x86 assembly code. Note: just a guess.
The fact that Wine is only 32-bit seems pretty uninteresting/unsurprising: Win32 binaries are also only 32-bit.
Few things in the LINUX world are binary-only, so almost anything for which CPU performance matters can and should be run in 64-bit mode on a 64-bit processor.
bobbozzo - Tuesday, September 21, 2004 - link
You should be running all the compilation tests with -j2 or higher, as otherwise the CPU is waiting for the disk more often.
uyu - Tuesday, September 21, 2004 - link
Consider re-evaluating the test with the icc compiler: http://www.intel.com/software/products/compilers/c...
I do not think it will only favor the results of Intel processors.
Zebo - Tuesday, September 21, 2004 - link
Why separate the graphs? Afraid of people easily visualizing major A64 ownage? Gawd that's hard to compare that way... I had to get out pen and paper.
Shalmanese - Tuesday, September 21, 2004 - link
"throw an alternative opterating system"
I like the attempt at subliminal advertising :D.
TrogdorJW - Monday, September 20, 2004 - link
On the LAME encoding benchmark, isn't the actual value really "Play time divided by encoding time"? Or perhaps "Relative encoding rate"? Anyway, the text explains the graph better (in 1 second the 64-bit FX-53 encoded 25 seconds of audio). Otherwise, good stuff.
injinj - Monday, September 20, 2004 - link
Crafty does have a bit of hand-tuned asm for both x86 and x86_64. Most of the operations are done with boards packed into bit representations. For example, like this:
while (moves) {
to=FirstOne(moves);
*move++=temp|(to<<6)|(PcOnSq(to)<<15);
Clear(to,moves);
}
The FirstOne() function utilizes the bitscan ops of x86 (bsr = bit scan reverse), but notice the cmpl at the top:
cmpl $1, 8(%esp)
sbbl %eax, %eax
movl 8(%esp,%eax,4), %edx
bsr %edx, %ecx
jz l4
andl $32, %eax
subl $31, %ecx
subl %ecx, %eax
ret
l4: movl $64, %eax
The cmpl is there because the 64-bit word has to be handled as separate 32-bit hi and lo words, so Crafty will naturally exploit 64-bit instructions.
This same function on x86_64 can be done in far fewer instructions:
asm (
" bsrq %0, %1" "\n\t"
" jnz 1f" "\n\t"
" movq $-1, %1" "\n\t"
"1: movq $63, %0" "\n\t"
" subq %1, %0" "\n\t"
: "=r&" (dummy), "=r&" (dummy2)
: "0" ((long) (word))
: "cc");
These are critical functions in crafty and if you see benchmarks comparing 64 bit crafty to 32 bit crafty, this is primarily why 64 bits is faster.
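For anyone who would rather not read the asm, here is a rough C sketch of the same idea. It is not Crafty's source: the names first_one_32 and first_one_64 are made up for illustration, and __builtin_clz / __builtin_clzll are GCC/Clang builtins standing in for the bsr instructions. Both versions return the number of leading zero bits, or 64 for an empty word, which is what the two asm listings above compute.

```c
#include <stdint.h>

/* 64-bit style: one leading-zero count over the whole word.
   The builtin is undefined for 0, so that case is handled first
   (the asm above returns 64 for an empty word). */
static int first_one_64(uint64_t word) {
    if (word == 0)
        return 64;
    return __builtin_clzll(word);
}

/* 32-bit style: the word is handled as hi and lo halves, and the
   scan runs on whichever half holds the first set bit.
   Assumes unsigned int is 32 bits wide. */
static int first_one_32(uint64_t word) {
    uint32_t hi = (uint32_t)(word >> 32);
    uint32_t lo = (uint32_t)word;

    if (hi != 0)
        return __builtin_clz(hi);      /* bit is in the high half */
    if (lo != 0)
        return 32 + __builtin_clz(lo); /* bit is in the low half */
    return 64;                         /* empty word */
}
```

On a 64-bit build the compiler can reduce first_one_64 to a single bit scan on the full word, while the 32-bit version has to decide which half to scan first, which is the extra work the cmpl/sbbl pair does in the asm above.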
mczak - Monday, September 20, 2004 - link
what's up with the encryption benchmarks? "OpenSSL's crypt libraries are probably heavily optimized for 32-bit operation; we see the difference in the two architectures very clearly."
But the results show that 64bit mode is more than two times as fast as 32bit mode in one case (RSA), and 50% faster in the other case (AES)?
(and btw I haven't looked at johntheripper, but it might contain hand-optimized assembly for x86, but only generic c code for other architectures such as x86_64.)
PrinceGaz - Monday, September 20, 2004 - link
The mouseover images work fine for me (Firefox 0.9.3)
Cheval - Monday, September 20, 2004 - link
Using Firefox 1.0PR and those graphs don't work either.
jensend - Monday, September 20, 2004 - link
Were any of the 32-bit binaries (incl kernel) compiled with -mregparm=x where x!=0? See e.g. http://lwn.net/Articles/66965/ - improvements in the use of registers are generally the main source of performance improvements for x86-64, and using this parameter can significantly improve gcc's register usage on regular x86. Generally, -mregparm=3 is recommended for the kernel and =1 for C++ code.
RyanHirst - Monday, September 20, 2004 - link
o, i c.k.
ryan
LittleKing - Monday, September 20, 2004 - link
The article is good, but the Rollover images don't work in FireFox 9.2.
KristopherKubicki - Monday, September 20, 2004 - link
I had trouble compiling crafty. The numbers were more to show the impact of compiler options rather than actual chess numbers themselves.
Kristopher
RyanHirst - Monday, September 20, 2004 - link
Hello,
Liked the article! I was disappointed to see you stuck with the only chess engine on the planet that is faster on a 3.6GHz P4 than a 2.4GHz A64. The Crafty benches looked odd, but they were more realistic. Even HT-optimized engines like Fritz8 (which has competed internationally for as much as $1 million on Xeon machines, including one 4-way Xeon "donation" from Intel) pull almost identical numbers between the top A64 and the top P4.
If, as I assume, you left HT off [which you should for benchmarks. there are some odd issues with HT and chess], there just isn't a chess program around (except apparently TSCP) that pulls these numbers.
I know there is a risk of sounding fanboyish. That is not my intent. I play in the computer engine room on playchess.com, and I know the numbers I get from other machines. The benchmark you are using is simply not representative of chess engines. Please take a look at Fritz benchmarks at: www.beepworld.de/members39/computerschach2/chessmarks.htm [disregard the top dual xeon score; "Deep Fritz 8" calculates many more nodes/s than regular "Fritz8", even on a single processor]. Again, this is an engine that is optimized for the Pentium architecture.
Less dedicated engines like Crafty show the results that, unfortunately, you found questionable in the previous article. Bob Hyatt has been programming chess for decades and Crafty is available on every major desktop OS. It's part of the SPEC2000 benchmark [where it performs identically on a lowly XP3200 and a Xeon 3.4]. It is also the first engine out the door with 64-bit clean code! In one of the few fields where 64-bit computing can offer a near perfect doubling of calculations/s, why leave out the 64-bit bench? If you're concerned Crafty is Athlon optimized, check out Hyatt's homepage: www.cis.uab.edu/info/faculty/hyatt/hyatt.html ...his ICC account pet machine is a dual Xeon.
Cheers,
Ryan
KristopherKubicki - Monday, September 20, 2004 - link
johnsonx: sorry about that- i put in the 530 score for the 3500+. The correct score is 175.
Kristopher
johnsonx - Monday, September 20, 2004 - link
Good article Kris.
I think you've got a graph error on the 32-bit MEncoder graph. You show the P4 530 and the A64 3500+ tied at 146fps, but then show the A64 3800+ at 193fps; that's a 32% higher score for a CPU that is only 9% higher-clocked and otherwise identical. Methinks the 146fps for the A64 3500+ is an error; it should be somewhere between 165 & 175, right around the P4EE.
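The arithmetic behind that, as a small C sketch. percent_gain is a made-up helper, and the 2.2 GHz and 2.4 GHz clocks for the 3500+ and 3800+ are assumed here rather than taken from the article.

```c
/* Percentage by which `after` exceeds `before`.
   Hypothetical helper for illustration; `before` must be non-zero. */
static double percent_gain(double before, double after) {
    return (after / before - 1.0) * 100.0;
}
```

percent_gain(146.0, 193.0) comes out at roughly 32, against roughly 9 for percent_gain(2.2, 2.4), which is the mismatch described above. With the corrected 3500+ score of 175 from Kristopher's reply, percent_gain(175.0, 193.0) is roughly 10, much closer to the clock difference.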
WooDaddy - Monday, September 20, 2004 - link
"whenever I get cornered by a processor on campus or guest speak at a Linux Users Group"
OH NO!!!! ROGUE PROCESSORS ARE ATTACKING PEOPLE ON COLLEGE CAMPUSES!!!! LOCK YOUR DOORS!! GRAB YOUR SHOTGUN!!
heheheh
Kris.. I think you meant professors ;)
I'd still lock your doors and grab weapons of minimal destruction. Professors are scary. Especially the fat ones with suspenders who talk about overclocking their PDP-11s.
Illissius - Monday, September 20, 2004 - link
Nice review, and you actually compared 32- and 64-bit for once ;). Would've been more interesting to do it back when you had some 64-bit Intel processors in the mix as well, though...
Why no 64-bit results on the kernel compile? :/ That's probably the single benchmark out of all of them I'd be most interested in (Gentoo :D).
Also, UT2004 has both 32- and 64-bit Linux versions, and nVidia has both 32- and 64-bit Linux drivers. Seeing as this was a desktop review, that would've been nice to see.
I'd personally have been more interested in s754 processors, but they're the same architecture anyways so I can mostly extrapolate their performance from the ones tested, so it isn't a big deal either way.
Aquila76 - Monday, September 20, 2004 - link
It's happened again: the cheapest AMD64 939 processor is the better choice of the processors reviewed here for price/performance. Granted the Linux market isn't a huge chunk of the market for either company, but among those who do use Linux there is no reason to use anything but AMD.
fitten - Monday, September 20, 2004 - link
It would be interesting to see these tests run on several S754 boxes as well. I know this one was entitled "cutting edge performance" but with the cost difference between the S939 solutions and the S754 solutions, many will opt (as I have) to go with the S754 parts. I run SuSE 9.1 Professional AMD64 on an Athlon 64 3000+ and have been pleased with it.
Aquila76 - Monday, September 20, 2004 - link
This may be a little off-topic, but shows yet another reason California hates MS. This is what happens when you move to Windows...
http://software.silicon.com/applications/0,3902465...
Love the caption: Someone forgot to reboot
fitten - Monday, September 20, 2004 - link
"Our processor test bed is completely caseless, and if we have issues with our 3.6GHz processor out of a normal case, we can't imagine what issues might exist in a full enclosure."
Today's processors ("today" meaning back to the days of the original Pentium) typically run hotter with the case open than with it closed. With the case closed, cooler air is forced through the case and across the right places to help lower component temperatures, assuming that the fans are placed appropriately, are venting in the right direction, and cables/components are arranged to allow the air to flow properly.
Newer form factors, such as Intel's BTX, are designed in large part to cope with cooling the thing down.
You should be running your test bed in a case with appropriate cooling.
Araemo - Monday, September 20, 2004 - link
A note about the compilation benchmark:
Make and gcc themselves are not multi-threaded, but make does understand a -j option to specify the # of concurrent make jobs to run. The general rule of thumb I have heard is to specify a -j equal to n+1, where n is the number of processors in the system (one compile job per processor, and one control job). So, to test if hyperthreading lowers the compile time, specify `time make -j 3`.
I would personally like to see this, in addition to the single-threaded results. I read some quick-and-dirty benchmark results suggesting hyperthreading does help the compile time, but those results were published by curious people when hyperthreading was new.. I haven't seen any results on recent processors.
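A minimal sketch of that rule of thumb in C, for illustration only. suggested_make_jobs and online_processors are hypothetical helper names; sysconf(_SC_NPROCESSORS_ONLN) is assumed to be available, as it is on Linux with glibc. It reports logical processors, so a single Hyper-Threading P4 would typically show up as 2 when HT is enabled and the kernel is built for SMP.

```c
#include <unistd.h>

/* Rule of thumb from the comment above: one job per processor plus one.
   Falls back to a single processor if the count is unknown (< 1). */
static long suggested_make_jobs(long processors) {
    if (processors < 1)
        processors = 1;
    return processors + 1;
}

/* Number of processors the OS currently reports as online.
   Returns -1 if the value cannot be determined. */
static long online_processors(void) {
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```

With 2 logical processors this gives 3, i.e. the `time make -j 3` suggested above; a value of 1 gives -j 2.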
garfield - Monday, September 20, 2004 - link
Isn't it normal procedure for a P4 that's getting too hot to throttle the clock speed down? Maybe that would explain the extremely bad results, though it seems a bit unrealistic that the results have been reproducible, unless each iteration of the benchmark has been run for a long time.
ceefka - Monday, September 20, 2004 - link
First of all: good article. It really shows the early benefits of well-written 64-bit software. The 3500+ is definitely on my wishlist ;-)
"This does not bode well for the processor. Our processor test bed is completely caseless, and if we have issues with our 3.6GHz processor out of a normal case, we can't imagine what issues might exist in a full enclosure."
Rather confusing this bit, Kristopher. Anyway, I read that a good case will offer convection which apparently a caseless testbed does not. How were the temperatures of the other CPUs? Were they also a tad above average or typical peak?
balzi - Monday, September 20, 2004 - link
Some thoughts --
can we please have Graphs where the order of the legend is the same as the order of the bars in the bar-graph. Surely that's possible.
also, the strange "Thermal issue" error. It seems that you thought it was weird but immediately assumed that it was correct; that the Intel CPU(s) were getting too hot.
Did you verify this somehow?? It seems strange to call it "An unusual problem" and then trust that it's correct without question or explanation.
Thanks
Shinei - Monday, September 20, 2004 - link
"This does not bode well for the processor. Our processor test bed is completely caseless, and if we have issues with our 3.6GHz processor out of a normal case, we can't imagine what issues might exist in a full enclosure."
That quote made me laugh, and I'm not entirely sure why. :D
Anyway, I see that going to 64-bit is definitely worth the price of admission, considering the huge gains the processors get in the jump. One thing I had a question on, though: Why does the result from the SSL benchmark halve between 32-bit and 64-bit? Is it that the keys are longer in 64-bit?
gherald - Monday, September 20, 2004 - link
I too would prefer longer images that include both 32 and 64-bit results. Mouseover comparisons are cumbersome.
Is sample.wav 800mb or 700mb? I'm guessing the 7 was probably just a typo.
Nice analysis of DDR1 vs 2.
My only gripe is I wish a full complement of "lower end" processors were included in all these benchmarks (754s, slower prescotts, and heck even northwoods)... but I guess that'd be too much work.
ravedave - Monday, September 20, 2004 - link
What klah is trying to say, in too many and too-big words: make the scale the same for the mouseover pics.
Also make the picture height the same as well if possible.
Otherwise a very good article.
Has anyone thought of making an open office benchmark for linux?
klah - Monday, September 20, 2004 - link
Good article, but I have a comment on the mouse-over graphs. They work well in other articles such as the recent DVR-108D article where the scale and axes remain constant. In this case however the layout and in some cases even the scale are different between the two graphs. It would be easier to compare the two if the scale was the same and processors were in the same layout (spacing/location), with the inapplicable processors still listed to maintain the same appearance between the two.
If that explanation is nonsensical I can create a few images to try to elucidate my point.
Decoder - Monday, September 20, 2004 - link
"Hold your mouse over for the 64-bit graph."
I like to see the 32 and 64 bits on the same graph. Why not use Athlon FX-53 (32) and Athlon FX-53 (64) for labels?