Acovea 5.0

Analysis of GCC 4.0.0 for Pentium 4

1 May 2005
 



 

Acovea Logo

Downloads
acovea-gtk-1.0.1.tar.gz
libacovea-5.1.1.tar.gz
libevocosm-3.1.0.tar.gz
libcoyotl-3.1.0.tar.gz

About Acovea
Acovea Overview
A GUI for Acovea: Acovea-GTK
Configuration Files
FAQ
About the Genetic Algorithm

Analyses
Acovea 5.0, GCC 4.0, Opteron
Acovea 5.0, GCC 4.0, Pentium 4
Acovea 5.0, GCC 3.4, Opteron
Acovea 4.0, GCC 3.4, P4, AMD64 (May-04)
Acovea 3.3, GCC 3.x, P4 (Dec-03)

Licensing
GNU General Public License (GPL)
Commercial License

If you find this article useful, please consider supporting the author's free software efforts with a donation, no matter how small.

Abstract

This article describes results obtained by running Acovea 5.0.0 to test GCC 4.0.0 on an Intel Pentium 4 workstation. When applied to several example benchmark programs, Acovea identified optimization sets that reduced three individual benchmark run times by between 2 and 8%, when compared against code generated using the compiler's predefined -On optimization sets. In no case did Acovea fail to find a solution that at least equalled the performance of the -On options

Past articles have combined results from testing on both Opteron and Pentium 4 processors. This article focuses solely on the Pentium 4, to avoid the non sequitur and perhaps misleading comparison of processors that has accompanied past articles.

Herein, I am writing about Acovea as it performs on the Pentium 4 workstation I own, Tycho. All tests were performed using GCC 4.0.0, a major new version released in mid-April 2005.

Results obtained with Acovea 5.0 were clearer than those from older versions. This is due to better statistical analysis; the underlying algorithm is the same. In every case, Acovea suggested a set of options that produced faster code than any of the preset -On options.

Benchmark tests

The current benchmark suite consists of six algorithm-specific programs, all written to the 1999 ISO C Standard. I intend to update the test suite following the release of Acovea 5.0. Such grand plans are, of course, predicated on my finding some of that mythical "spare time" I keep hearing about...

The individual tests are:

  • alma
    Calculates the daily ephemeris (at noon) for the years 2000-2099; tests array handling, floating-point math, and mathematical functions such as sin() and cos().
  • evo
    A simple genetic algorithm that maximizes a two-dimensional function; tests 64-bit math, loop generation, and floating-point math.
  • fft
    Uses a Fast Fourier Transform to multiply two very (very) large polynomials; tests the C99 _Complex type and basic floating-point math.
  • huff
    Compresses a large block of data using the Huffman algorithm; tests string manipulation, bit twiddling, and the use of large memory blocks.
  • lin
    Solves a large linear equation via LUP decomposition; tests basic floating-point math, two-dimensional array performance, and loop optimization.
  • mat1
    Multiplies two very large matrices using the brute-force algorithm; tests loop optimization.

Past articles included a test called "tree" which is not present here. The reason: It broke, and I didn't want to spend time hunting bugs in a benchmark I'll be replacing soon. The "tree" test was a quick hack that I'll replace with a new benchmark based on the production-quality code in Itzam.

Compiler Options

For the purpose of this article, I evolved the best set of options for the seven benchmark programs listed above; I also created a composite program that invokes all seven benchmarks to produce a single, overall result. At left is a chart showing the relative performance of each benchmark as compiled with different optimization settings.

  • O1, O2, O3 - baseline compiles with those collective options defined by GCC.
  • O3 fast math - baseline compiles with -O3 -ffast-math
  • acovea - compiled with options evolved by Acovea, with -O1 as the base level of optimization (though it implements switches to turn off options implied by -O1)

Acovea evolves optimization sets from the -O1 level. -O1 must be included if any optimization is to occur; I have included switches in the GCC configuration files to turn off various optimizations (e.g., -fno-merge-constants) implied by -O1, thus allowing evolution to remove options implied at the base level. For GCC 3.4 on the Pentium 4, Acovea evolves optimization sets selected from 60-some options, some of which include multiple possibilities (e.g., -mfpath). The -fnew-ra option causes compiler crashes and is disabled for these tests.

Discussions on the GCC mailing list (thread 1, thread 2, thread 3) revealed different perceptions about the implications of the -ffast-math option. For the tests shown here, I included -ffast-math in the evolutionary mix. I've performed experiments that show how -ffast-math does not reduce (and in some cases, improves) accuracy when used with industry-standard benchmarks like Paranoia. The "floating-point accuracy" story is very complicated, but it deserves its own, to-be-written-when-I-have-time article. And no, I haven't had time in the year since I last wrote that sentence!

Results

Acovea 5.0 was somewhat successful when applied to the Pentium; the results were not as dramatic as they were when I applied Acovea to the Opteron. For every single benchmark, Acovea produced a set of options that generated faster code than any of the built-in -On sets, although in three cases the improvement was trivial. You can download the complete set of test runs here. These are test files generated by the command-line and GUI drivers, and include the full sets of optimizations evolved by Acovea.

For every test compile on Corwin, Acovea used gcc -lrt -lm -std=gnu99 -O1 -march=pentium4; the -march=pentium4 implies -msse, -msse2, -m3dnow, and -mmmx.

For each test, I provide Acovea's top options set along with lists of strongly pessimistic and optimistic flags. An "optimistic" option was common in "good" solutions during the entire evolutionary process. A "pessimistic" option, on the other hand, was absent from the "good" solutions, either because it hurt run times or it simply made no difference at all and was ignored as a "junk gene". It's important to realize that Acovea's best solution may contain options that had no practical effect on code generation; they didn't hurt, didn't help, and were just "there." Such it is human genes for producing tails; most of us have the genes (option), but it doesn't express itself in any way.

For the seven benchmarks on the Opteron, Acovea produced the following results:

  • alma
    Acovea's solution:
    -fno-defer-pop -fno-thread-jumps -ftree-ccp -ftree-dce -ftree-dse -ftree-ter -ftree-lrs -ftree-sra -ftree-fre -ftree-ch -fcrossjumping -foptimize-sibling-calls -fcse-follow-jumps -fexpensive-optimizations -fcaller-saves -fforce-addr -fschedule-insns -fschedule-insns2 -fregmove -fstrict-aliasing -fdelete-null-pointer-checks -fgcse-lm -freorder-functions -funit-at-a-time -falign-functions -falign-jumps -finline-functions -fomit-frame-pointer -fpeel-loops -funroll-all-loops -freschedule-modulo-scheduled-loops -ftree-loop-im -fivopts -fvariable-expansion-in-unroller -mieee-fp -mno-push-args -mno-align-stringops -minline-all-stringops -mfpmath=387 -funsafe-math-optimizations -fno-trapping-math -finline-limit=700


    Strongly Optimistic Options:
    ftree-fre -finline-functions -fpeel-loops -funsafe-math-optimizations -finline-limit


    Strongly Pessimistic Options:
    -floop-optimize2 -momit-leaf-frame-pointer -ffloat-store -fno-inline -mfpmath=sse -mfpmath=sse,387 -fno-math-errno


    Acovea turned in its best performance on this benchmark, gaining 8% over the -On options.
     
  • evo
    Acovea's solution:
    -fno-thread-jumps -fno-cprop-registers -fno-if-conversion2 -floop-optimize2 -ftree-dominator-opts -ftree-lrs -ftree-fre -ftree-ch -fmerge-constants -fcrossjumping -fgcse -fcaller-saves -fforce-mem -fschedule-insns -fschedule-insns2 -fregmove -fstrict-aliasing -fthread-jumps -fsched-interblock -freorder-blocks -freorder-functions -funit-at-a-time -falign-functions -falign-loops -falign-labels -finline-functions -funswitch-loops -fgcse-after-reload -fpeel-loops -funswitch-loops -fbranch-target-load-optimize2 -fmodulo-sched -freschedule-modulo-scheduled-loops -maccumulate-outgoing-args -minline-all-stringops -fno-math-errno -fno-trapping-math


    Strongly Optimistic Options:
    -funswitch-loops -fmodulo-sched


    Strongly Pessimistic Options:
    -fforce-addr -fpeephole2 -momit-leaf-frame-pointer -ffloat-store -funroll-all-loops -mfpmath=387 -mfpmath=sse -mfpmath=sse,387 -funsafe-math-optimizations


    Results for this benchmark are just plain odd. The fastest code was turned in by -Os, although Acovea was able to match it.
     
  • fft
    Acovea's solution:
    -fno-merge-constants -fno-thread-jumps -fno-guess-branch-probability -fno-delayed-branch -ftree-dce -ftree-dominator-opts -ftree-ter -ftree-lrs -ftree-sra -ftree-copyrename -foptimize-sibling-calls -fcse-follow-jumps -fcse-skip-blocks -fgcse -frerun-cse-after-loop -frerun-loop-opt -fcaller-saves -fregmove -fstrict-aliasing -freorder-blocks -fthread-jumps -fsched-interblock -fsched-spec -fgcse-after-reload -fno-inline -funswitch-loops -funroll-all-loops -fmodulo-sched -fgcse-sm -freschedule-modulo-scheduled-loops -ftree-loop-im -fivopts -fvariable-expansion-in-unroller -mno-push-args -maccumulate-outgoing-args -minline-all-stringops -funsafe-math-optimizations -ffinite-math-only -fno-signaling-nans -fcx-limited-range -finline-limit=600


    Strongly Optimistic Options:
    -fstrict-aliasing (3.219) -maccumulate-outgoing-args (1.647) -funsafe-math-optimizations


    Strongly Pessimistic Options:
    -fno-guess-branch-probability -fno-loop-optimize -finline-functions -momit-leaf-frame-pointer -ffloat-store -fbranch-target-load-optimize -mfpmath=387 -mfpmath=sse,387


    Acovea's solution was no better the -O2 and -O3. Overall, there seemed to be little room for optimizing this algorithm at all on the Pentium 4.
     
  • huff
    Acovea's solution:
    -fno-defer-pop -fno-guess-branch-probability -fno-if-conversion2 -fno-loop-optimize -ftree-ccp -ftree-ter -ftree-lrs -ftree-sra -ftree-copyrename -ftree-fre -ftree-ch -fmerge-constants -fcrossjumping -fcse-skip-blocks -fgcse -fexpensive-optimizations -fschedule-insns2 -fregmove -freorder-blocks -fthread-jumps -fgcse-lm -freorder-blocks -freorder-functions -funit-at-a-time -falign-functions -falign-jumps -ftree-pre -fgcse-after-reload -fomit-frame-pointer -ffloat-store -fprefetch-loop-arrays -fmodulo-sched -fno-function-cse -fgcse-sm -freschedule-modulo-scheduled-loops -ftree-loop-im -mieee-fp -minline-all-stringops -mfpmath=sse,387 -funsafe-math-optimizations -fno-trapping-math


    Strongly Optimistic Options:
    -ftree-sra -freorder-blocks -fsched-interblock -freorder-blocks -fomit-frame-pointer -fgcse-sm


    Strongly Pessimistic Options:
    -fno-if-conversion -fforce-addr -momit-leaf-frame-pointer -ftracer -funroll-all-loops -fbranch-target-load-optimize2 -mfpmath=387 -mfpmath=sse


    Acovea improved generate code performance by 3% as compared to the next-best solution, -O3.
     
  • lin
    Acovea's solution:
    -fno-merge-constants -fno-cprop-registers -fno-if-conversion2 -ftree-ter -ftree-lrs -ftree-sra -ftree-ch -fexpensive-optimizations -fstrength-reduce -frerun-cse-after-loop -fcaller-saves -fpeephole2 -fschedule-insns -fstrict-aliasing -fdelete-null-pointer-checks -freorder-blocks -fsched-spec -freorder-blocks -falign-loops -falign-labels -ftree-pre -finline-functions -momit-leaf-frame-pointer -fno-inline -fpeel-loops -ftracer -funroll-all-loops -fmodulo-sched -freschedule-modulo-scheduled-loops -ftree-loop-im -fivopts -ftree-vectorize -mieee-fp -maccumulate-outgoing-args -mno-align-stringops -minline-all-stringops -funsafe-math-optimizations -fno-trapping-math -fno-signaling-nans


    Strongly Optimistic Options:
    -fstrict-aliasing -fno-inline -funroll-all-loops


    Strongly Pessimistic Options:
    -fno-loop-optimize -floop-optimize2 -falign-jumps -fomit-frame-pointer -ffloat-store -funroll-loops -mfpmath=387 -mfpmath=sse -mfpmath=sse,387


    Acovea found an option set for producing code that is produced code that is 2% faster than any -On option.
     
  • mat1
    Acovea's solution:
    -fno-merge-constants -fno-defer-pop -fno-guess-branch-probability -fno-cprop-registers -fno-delayed-branch -floop-optimize2 -ftree-ccp -ftree-dominator-opts -ftree-ter -ftree-lrs -ftree-copyrename -fmerge-constants -fcrossjumping -foptimize-sibling-calls -fcse-follow-jumps -fgcse -fexpensive-optimizations -fcaller-saves -fforce-mem -fpeephole2 -fschedule-insns -fschedule-insns2 -fregmove -fdelete-null-pointer-checks -freorder-blocks -fsched-interblock -freorder-functions -falign-jumps -falign-loops -falign-labels -finline-functions -funswitch-loops -fgcse-after-reload -fprefetch-loop-arrays -fno-inline -fpeel-loops -funswitch-loops -fgcse-las -fivopts -maccumulate-outgoing-args -minline-all-stringops -fno-trapping-math -fno-signaling-nans -finline-limit=500


    Strongly Optimistic Options:
    -ftree-ccp -fgcse


    Strongly Pessimistic Options:
    -fforce-addr -momit-leaf-frame-pointer -ffloat-store -funroll-loops -funroll-all-loops -fbranch-target-load-optimize -mfpmath=sse -mfpmath=sse,387


    This is the single test where Acovea did not beat the -On options. Looking at the numbers, I'm not surprised: The fastest code was generated by -O1 alone, beating both -O2 and -O3. In other words, adding more options to -O1 doesn;t improve performance, so Acovea had nothing to work with.
     

Conclusions

The results I've presented describe the effects of different optimizations on a set of relatively straight-forward programs that perform well-defined tasks. My conclusions so far have been these:

Acovea 5.0 is an improvement over older versions.
With version 5, Acovea identified a "better" solution in every single case, though in three cases the improvement was negligible, and in one case (mat1), -O1 came out ahead of Acovea, -O2 and -O3. However, the results for Pentium 4 were not as significant as those Acovea identified for the Opteron, a difference that can likely be attributed to the relative maturity of the two code generators. The Pentium 4 has been around longer, and I'm not surprised that it's code generator is more stable than is the Opteron back end.

A higher "-O" level does not guarantee faster code.
By implication, -O3 should produce code that is faster than -O1, but it doesn't always do so in practice. The complex interactions of many different optimization techniques seem very likely to conflict; the complexity is simply too much to understand. Acovea can test these interactions, and produce better optimization sets that avoid conflicts and mutual pessimisms.

Different algorithms are most effective with different optimizations.
Again, an obvious pearl of wisdom that is sometimes forgotten. Real (i.e., non-benchmark) programs do many things; using a profiler, the critical loops can be identified, and their algorithms refined as much as possible by hand. Once the algorithms are tuned, a tool like Acovea will identify the best possible optimization settings. Applying -O3 to an entire program is not likely to produce the fastest program; using algorithm-specific optimizations to compile specific, critical routines is likely to produce faster code.

As always, I look forward to considered comments.

-- Scott




E-mail
LinkedIn
Facebook

•• HIRE SCOTT ••

Computer Books
Fiction
Articles
Reviews

FAQ
Bibliography


Link to Scott Ladd's Syraqua site

© 2010
Scott Robert Ladd
All rights reserved.
Established 1996


The grey-and-purple dragon logo, the blue coyote logo, Coyote Gulch Productions, Itzam, Evocosm, and Acovea are all Trademarks of Scott Robert Ladd.

Privacy Policy
Legal Stuff