So after yesterday’s post about some compiler results with Python 2.6 I wanted to show how some of GCC’s architecture-specific compiler flags affect the execution of pybench. As I explained in comments I think most people will never even touch the flags passed to Python’s build. Nonetheless, some people asked if I had tuned it in any way. Pádraig Brady had asked me if I had used the optimal GCC architecture flags. On my FreeBSD 7.0-STABLE machine at home (AMD Athlon(tm) 64 X2 Dual Core Processor 4600+ (2411.13-MHz K8-class CPU)) his script stated I had to pass along “-m32 -march=k8 -mfpmath=sse”. My machine is fully 64 bits so I left out the -m32 (since it will not link anyway) and used “-march=k8 -mfpmath=sse” (using -march=native instead of k8 resulted in a 0,1 seconds faster result and -mtune=native -march=native instead of k8 resulted in a 0,1 – 0,2 seconds faster result).
The default option flags are on my system: -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes.
Considering some other comments about how I did not use a 0-origin for my y-axis I have to point out two things: firstly, given the sometimes close results zooming out too much can eliminate detailed information (of course you have to be careful not to zoom in too much as well); secondly, I like to make sure the graph itself is appropriately centered so you do not get a whitespace skewing in the resulting image. I think, being a follower of the Edward Tufte school of graphic displaying, I did reasonably well. The graphs were made with a tool called Ploticus.
I was curious how the optimization level influenced the resulting program and as such I removed the -O3 option from the compiler flags. As is evident from the graph you are looking at a bit more than a doubling of execution time (an average of 14,2 seconds versus the previous 6,6 and 6,5 seconds).
So, given the huge performance hit by merely leaving out the -O3, I was interested how the other optimization levels worked out. Holger Hoffstätte asked to use -O2 -fomit-frame-pointer instead of -O3. Basically the results of -O3 (average of 6,5 seconds) and -O2 -fomit-frame-pointer (average of 6,5 seconds) were equal. The result of using -O1 (I could not really discern much of a speed difference by adding -fomit-frame-pointer, also for the -O2 case it was still an average of 6,5 seconds) was quite interesting. It already improves execution by ~86%. From -O1 to -O2/-O3 we are looking at another increase of ~16%. From the no optimization case to -O2/-O3 execution improves by ~118%
I tried a profile-guided optimization build, but I have some issues on my FreeBSD 7.0-STABLE with libgcov. Apparently only a libgconv.a is provided and linking gives me a relocation warning. Thankfully I also had a GCC 4.2.4 snapshot from March installed and did a PGO build, but I managed to only shave of about 0,2 seconds on the average time.


