[lug] Various Arch/Compiler Binaries living together

John Karns jkarns at csd.net
Wed Jun 26 11:52:10 MDT 2002


On Tue, 25 Jun 2002, Nate Duehr said:

> P-III's, Durons, and Celerons -- and the P-IV, high end Athlons and Xeons
> are serious overkill right now for most end-users.

A few months ago, while searching the newsgroups for info on Linux / P3
issues, I came across something interesting regarding P3 vs. P4
performance:

		=============================

If you want to see a REAL bug, and one that both severely cripples
performance and is NOT fixable, look at this "errata" in the Intel 860
chipset ( the chipset for dual XEON P4 motherboards):

In the file found at:
ftp://download.intel.com/design/chipsets/specupdt/29071501.pdf

Intel lists errata for the 860 chipset.
One of these states:

"5. Sustained PCI Bandwidth Problem:
During a memory read multiple operation, a PCI master will read more
than one complete cache line from memory. In this situation, the MCH
pre-fetches information from memory to provide optimal performance.
However, the MCH cannot provide information to the PCI master fast enough.
Therefore, the ICH2 terminates the read cycle early to free up the PCI bus
for other PCI masters to claim.

Implication: The early termination limits the maximum bandwidth to ~90
MB/s.

Workaround: None

Status: Intel has no fix planned for this erratum."

This effectively limits the bandwidth of the PCI bus to 90MB per second.
Considering that this is a chipset designed for servers, equipped with
64-bit, 66 MHz PCI slots, it should have a bandwidth of over
300MB/sec.  If you buy one of these and spend money on high-performance
SCSI, gigabit, or other devices, you are wasting your $$.

		=============================

and this:

I've just installed a Dual Xeon/1.7 GHz Dell machine with RH7.2.
It performs pretty nicely, but I happened to write a very stupid and very
simple program that uses the math library extensively, just to keep the
processors busy in order to test the new UPS.

The program is as simple as:

#include <stdio.h>
#include <stdlib.h>   /* for atoi() */
#include <math.h>

#define SIZE 20000000

int main(int argc, char *argv[])
{
  int i, j, nloop = 1;
  double x, y, z, sum = 0.0;

  if (argc > 1) nloop = atoi(argv[1]);
  for (i = 0; i < nloop; i++) {
    for (j = 0; j < SIZE; j++) {
      x = (double)j / SIZE;
      y = exp(x);
      z = 1.23 * sin(x) + acos(log10(y)) / 2.75;
      sum += (z > 0.0 ? log(z) : pow(fabs(z), 1.23));
    }
    printf("%d %e\n", i, sum);
  }
  return 0;
}

Out of pure curiosity, I compared the timing of this program on the Xeon
1.7 and a Pentium III 850: same version of RH7.2, the very same binary
run on both machines, a single job, nothing to do with SMP.

Results are very surprising:

PIII/850:
26.160u 0.000s 0:26.08 100.3%   0+0k 0+0io 115pf+0w
Xeon/1.7:
24.300u 0.000s 0:24.30 100.0%   0+0k 0+0io 114pf+0w

Does it make any sense? Both CPUs have the same size cache (256 kB), the
CPU clock is twice as fast on the Xeon, and the memory is four times
faster on the Xeon (400 MHz RAMBUS vs. a 100 MHz bus on the PIII). So
what could be the possible reason for such strange results? I can't
imagine that the Xeon's floating point operations take twice as many
clock ticks to complete as on the PIII.

On other, not-so-stupid applications, the Xeon performs roughly 1.7-2
times faster than the PIII. Still, I'm very curious why it is so bad on
this simple program.

----------------------------------------------------------------
John Karns                                        jkarns at csd.net




