After yesterday’s post with some Python 2.6 compiler results, I wanted to show how GCC’s architecture-specific flags affect the execution of pybench. As I explained in the comments, I think most people will never even touch the flags passed to Python’s build. Nonetheless, some people asked whether I had tuned anything; Pádraig Brady asked if I had used the optimal GCC architecture flags. On my FreeBSD 7.0-STABLE machine at home (AMD Athlon(tm) 64 X2 Dual Core Processor 4600+ (2411.13-MHz K8-class CPU)), his script suggested “-m32 -march=k8 -mfpmath=sse”. My machine is fully 64-bit, so I left out the -m32 (it will not link anyway) and used “-march=k8 -mfpmath=sse”. Using -march=native instead of k8 gave a result about 0.1 seconds faster, and -mtune=native -march=native was about 0.1–0.2 seconds faster.
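For reference, a build along these lines might look as follows. This is a sketch, not my exact invocation: the source directory name is hypothetical, and passing OPT to configure is one way to inject the extra flags into CPython’s default optimization flags.

```shell
# Hypothetical sketch of building Python 2.6 with the architecture flags
# discussed above; the directory name and OPT override are assumptions.
cd Python-2.6a2

# OPT replaces the optimization part of Python's default CFLAGS, so the
# defaults are repeated here with -march/-mfpmath appended.
./configure --with-threads --enable-unicode=ucs4 --enable-ipv6 \
    OPT="-DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -march=k8 -mfpmath=sse"

make
```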
The default compiler flags on my system are: -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes.
Considering some other comments about my not using a zero origin for the y-axis, I have to point out two things: firstly, given the sometimes close results, zooming out too far can hide detail (of course, you have to be careful not to zoom in too much either); secondly, I like to make sure the graph itself is properly centered so that whitespace does not skew the resulting image. As a follower of the Edward Tufte school of graphical display, I think I did reasonably well. The graphs were made with a tool called Ploticus.
I was curious how the optimization level influenced the resulting program, so I removed the -O3 option from the compiler flags. As is evident from the graph, you are looking at slightly more than a doubling of execution time: an average of 14.2 seconds versus the previous 6.6 and 6.5 seconds.
So, given the huge performance hit from merely leaving out -O3, I was interested in how the other optimization levels worked out. Holger Hoffstätte asked me to try -O2 -fomit-frame-pointer instead of -O3. The results of -O3 (average of 6.5 seconds) and -O2 -fomit-frame-pointer (average of 6.5 seconds) were essentially equal; adding -fomit-frame-pointer made no discernible difference in either the -O1 or the -O2 case. The result of using -O1 was quite interesting: it already improves execution speed by ~86% over the unoptimized build. From -O1 to -O2/-O3 we are looking at another increase of ~16%, and from no optimization to -O2/-O3 execution improves by ~118%.
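As a quick sanity check on those percentages (treating “X% faster” as the ratio of run times minus one), with the averaged run times quoted above:

```python
# Averaged pybench run times quoted above, in seconds.
no_opt = 14.2   # no -O flag at all
o2_o3 = 6.5     # -O2/-O3 (their averages were equal)
# The -O1 average is not quoted directly; "~86% faster than the
# unoptimized build" implies roughly 14.2 / 1.86, about 7.6 seconds.

def pct_faster(slow, fast):
    """How much faster 'fast' is than 'slow', as a percentage of run time."""
    return (slow / fast - 1.0) * 100.0

print(round(pct_faster(no_opt, o2_o3)))  # -> 118
```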
I tried a profile-guided optimization build, but I ran into some issues on my FreeBSD 7.0-STABLE with libgcov: apparently only a static libgcov.a is provided, and linking gives me a relocation warning. Thankfully I also had a GCC 4.2.4 snapshot from March installed and did a PGO build with that, but I only managed to shave off about 0.2 seconds on the average time.
Due to recent concerns about memory use and execution speed, I was curious how Python would behave when built with different compilers. I took Python 2.6a2 r62288 from the Subversion repository and configured it with: --with-threads --enable-unicode=ucs4 --enable-ipv6. The machine is an HP dc7700p with 1 GB of memory and an Intel Core2 6300 @ 1.86 GHz, running Ubuntu 7.10. I installed GCC 3.3.6, 3.4.6, 4.1.3, and 4.2.1 from the Gutsy repository, and Intel CC 10.1.015. The Visual Studio 2008 build of Python was the MSI snapshot of 2008-04-10 from the main Python site, which I ran through Wine 0.9.46 after installing the VC2008 runtime.
First various GCC versions: 3.3.6, 3.4.6, 4.1.3, 4.2.1:
It is good to see that the 3.4 series is faster than the 3.3 series and the 4.2 series is faster than the 4.1 series. I am a bit worried about the 4.1 series’ drop in performance compared to the 3.x series, though.
Next we have Python compiled with GCC 3.4.6, 4.2.1, Intel CC 10.1.015, MSC from Visual Studio 2008:
It is nice to see that the Microsoft Visual Studio 2008 compiler produces a binary that, even when run through Wine, still performs quite well compared to GCC; I am not sure whether Wine incurs a performance penalty or not. What is quite impressive is the performance of the Intel CC-compiled Python. If we take the fastest GCC, which is 4.2.1 at the moment, and average its 10 rounds of execution (6.574 seconds), then compare that to the average for ICC (5.412 seconds), ICC is about 21% faster. If we take the slowest, GCC 4.1.3 with an average of 7.002 seconds, ICC is even about 29% faster.
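Using the same “ratio of run times minus one” definition of “X% faster”, those figures check out:

```python
# Averaged pybench run times from the compiler comparison, in seconds.
gcc_421 = 6.574   # fastest GCC (4.2.1), average over 10 rounds
gcc_413 = 7.002   # slowest GCC (4.1.3)
icc     = 5.412   # Intel CC 10.1.015

def pct_faster(slow, fast):
    """How much faster 'fast' is than 'slow', as a percentage of run time."""
    return (slow / fast - 1.0) * 100.0

print(round(pct_faster(gcc_421, icc)))  # -> 21
print(round(pct_faster(gcc_413, icc)))  # -> 29
```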
So it seems that for people who want to get the full performance out of Python, compiling with ICC might be quite beneficial. I also want to check how ICC progressed from version 8 to version 10 performance-wise.
The raw data can be found at http://www.in-nomine.org/~asmodai/python-pybench.txt.
Some of those days are a mixture between the truly wonderful and truly weird.
I have been actively sending in patches to other projects to support DragonFly natively. For projects using autoconf you need to bug the maintainers to update config.guess and config.sub so that DragonFly is detected.
With TenDRA I am currently adding Amos’ -y flag, since it just makes sense to get rid of any hardwired paths. I have been looking at SuperH, ARM, and some other processors. Right now I am busy getting STLport working, of course, after I fix the -y flag and the building/* issues.
This is one of those days where you would put on Sigur Ros, The Counting Crows, Muse, or any other of those loveable melancholic bands…
Well, thankfully it is the weekend again. I find I need more and more time to get my energy levels back up. I guess getting up every morning at 06:30 is asking a bit too much of my body. Darn that glandular fever (Pfeiffer) I once suffered from; it never leaves your body.
Oh well, at least the good news is that the apartment in Schiedam I was interested in about two or three weeks ago is available again. Guess the financing did not work out for the people who intended to buy it. Going to see it on Monday. Yay!
Bought a bunch of books, amongst which: Classics of Buddhism and Zen – Volume One, Linkers and Loaders, Optimizing Compilers for Modern Architectures, Engineering a Compiler, C.S. Lewis’ first two Narnia books (The Magician’s Nephew and The Lion, the Witch, and the Wardrobe), Frank Herbert’s Dune (wanted to reread it again), Oliver Twist, Moby Dick, Poe’s Spirits of the Dead, Michael Moore’s Stupid White Men and Dude Where’s my Country?, and Asimov’s I, Robot and Caves of Steel.
I have been coding a lot lately. I expanded the r9 bot with some modules for RDF fetching and Bugzilla bug fetching. Next up: notifications.