Adam Back (aba@dcs.ex.ac.uk)
Wed, 25 Mar 1998 13:47:40 GMT
OK, here's the same again without BF_PTR2.
Using raw ECB bytes/sec as the performance metric, again with all CPUs
at 166 MHz.
                        Intel   AMD k6  AMD k5  Intel   Pentium Pentium
                        MMX     MMX     non-MMX non-MMX pro     II
======================================================================
bfs-072m-gcc            3.8     5.3     4.0
bfs-082b-gcc            4.0     4.8     5.9     3.0     7.4
bfs-082b-586-asm        8.3     5.9     5.8
bfs-082b-686-asm        6.8     6.1     4.3
======================================================================
bfs-072m-gcc-ptr2       4.6     5.8     4.9
bfs-082b-gcc-ptr2       2.9     3.2     3.3     1.5     4.5
bfs-082b-586-asm-ptr2   8.3     5.9     5.8
bfs-082b-686-asm-ptr2   6.8     4.8     5.9             7.5
======================================================================
(Intel non-MMX adjusted from Eric's 133 MHz non-MMX(?) NT figures,
Ppro adjusted from Eric's Ppro 200 Linux figures).
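The post does not say how the adjustment was done. A minimal sketch,
assuming the figures were scaled linearly by clock speed (an assumption,
since memory and bus speed do not scale with the core clock), might look
like this. The function name scale_to_clock is hypothetical.

```c
#include <assert.h>

/* Scale a throughput figure measured at one clock speed to another.
 * ASSUMPTION: throughput is proportional to core clock frequency.
 * scale_to_clock is a hypothetical helper, not from the original post. */
static double scale_to_clock(double measured, double from_mhz, double to_mhz)
{
    assert(from_mhz > 0.0);          /* guard against division by zero */
    return measured * to_mhz / from_mhz;
}
```

Under that assumption a 133 MHz figure is multiplied by 166/133 and a
200 MHz figure by 166/200 to bring both to the 166 MHz baseline.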
I am still finding that the fastest C code yet on an Intel MMX was
0.7.2m compiled with BF_PTR2.
Also the big difference between k6 686-asm and k6 686-asm-ptr2 is
surprising. I didn't realise it was even using any ptr2 macros when
compiled with asm... and on checking, it's not!  Benchmarks are indeed
good random number generators.
I find the poor performance of Pentium Pros interesting as it backs
up something I had considered: that Ppros are a waste of time, and
people `upgrading' from say a Pentium 200 to a Pentium Pro 200 are
wasting their time.  There are however improvements other than MMX
between Pentium MMXs and non-MMXs, and the Pentium Pro predates the
MMX Pentium (I think).  So perhaps the pecking order is:
non-MMX < Ppro < MMX < AMD k6 MMX < P-II
Are P-IIs similar to Ppros, or are they yet another architecture
with different scheduling optimisations necessary?
Other CPUs worth trying might be the Cyrix CPUs and the IBM 6x86s
(tho' I think the IBM ones are Cyrix designs manufactured by IBM).
One thing that occurs to me is that if VTune is fairly automatic, and
able to do its job quickly, perhaps someone could hack up a loader
(as in load/run time) VTune which optimised for your particular CPU at
load time.  Or if VTune isn't that fast, store some of VTune's
decisions in a compact format ready to speed up load-time VTuning.
A project for some pgcc folks perhaps.
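A much simpler relative of this idea, and not what VTune does, is to
ship several pre-built variants of a routine and time them once at
startup, keeping whichever is fastest on the CPU at hand. The sketch
below assumes that approach. The names variant_fn, time_variant and
pick_fastest, and the two toy rotate routines, are hypothetical
stand-ins for the real cipher variants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef uint32_t (*variant_fn)(uint32_t);

/* Two hypothetical variants computing the same result (rotate left by 8). */
static uint32_t rotl8_a(uint32_t x) { return (x << 8) | (x >> 24); }
static uint32_t rotl8_b(uint32_t x)
{
    return ((x & 0x00ffffffu) << 8) | ((x & 0xff000000u) >> 24);
}

/* CPU time spent running one variant for `iters` iterations. */
static clock_t time_variant(variant_fn f, unsigned long iters)
{
    volatile uint32_t sink = 1;      /* keeps the loop from being removed */
    clock_t start = clock();
    for (unsigned long i = 0; i < iters; i++)
        sink = f(sink + (uint32_t)i);
    return clock() - start;
}

/* Index of the variant with the lowest measured time on this machine. */
static size_t pick_fastest(const variant_fn *variants, size_t n,
                           unsigned long iters)
{
    assert(n > 0);
    size_t best = 0;
    clock_t best_t = time_variant(variants[0], iters);
    for (size_t i = 1; i < n; i++) {
        clock_t t = time_variant(variants[i], iters);
        if (t < best_t) {
            best_t = t;
            best = i;
        }
    }
    return best;
}
```

A program would call pick_fastest once at startup and route all later
calls through the chosen function pointer. The measurement is noisy, so
the choice can differ from run to run when the variants are close.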
Generally speaking AMD k6s are nice fast unfussy CPUs.  An additional
advantage of them is that they successfully overclock.  I have reports
of AMD k6 233s running reliably at 292 MHz and 249 MHz (83 x 3.5, and
83 x 3 respectively).  Probably they would work at 300 MHz too, if you
had a board with a 100 MHz bus.  350 MHz might be pushing it :-)
At 292 or 300 it might even rival the performance of a P-II 300/333 at
a fraction of the cost.
Another comment on your c2ln macros: on Linux you should use htonl.
I found with some other algorithms (looking at SHA1) that it works out
faster than anything you can get gcc to compile from C.  It is
implemented thusly in /usr/include/asm/byteorder.h:
extern __inline__ unsigned long int
__ntohl(unsigned long int x)
{
#if defined(__KERNEL__) && !defined(CONFIG_M386)
        __asm__("bswap %0" : "=r" (x) : "0" (x));
#else
        __asm__("xchgb %b0,%h0\n\t"     /* swap lower bytes     */
                "rorl $16,%0\n\t"       /* swap words           */
                "xchgb %b0,%h0"         /* swap higher bytes    */
                :"=q" (x)
                : "0" (x));
#endif  
        return x;
}
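To make the comparison concrete, here is a minimal sketch of the two
ways of loading a big-endian 32-bit word: byte-by-byte shifts in plain
C, and a straight copy followed by ntohl. The function names
load_be32_shifts and load_be32_ntohl are hypothetical, and this is not
the actual c2ln macro, which also handles partial blocks.

```c
#include <arpa/inet.h>  /* ntohl (POSIX) */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Plain C: assemble a big-endian 32-bit word one byte at a time. */
static uint32_t load_be32_shifts(const unsigned char *p)
{
    assert(p != NULL);               /* debug-only precondition */
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Same result via ntohl: copy the bytes unchanged, then convert from
 * network (big-endian) order to host order. */
static uint32_t load_be32_ntohl(const unsigned char *p)
{
    uint32_t w;
    assert(p != NULL);               /* debug-only precondition */
    memcpy(&w, p, sizeof w);         /* avoids an unaligned access */
    return ntohl(w);
}
```

Both give the same value on any host byte order. Which one is faster
depends on the compiler and CPU, so it is worth timing both.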
Adam