[lug] More wood on the fire (High resolution timers)
Bear Giles
bgiles at coyotesong.com
Sun Feb 23 20:54:31 MST 2003
At the risk of starting a riot, I have to ask a really dumb question.
Why do you care to know the time to a resolution of 1 usec?
If you just need the time - a single measurement - you can
certainly use gettimeofday() and a combination of NTP and local
clock slewing to get the actual time to within 10 msec or so.
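For that single-measurement case, something along these lines is all
it takes (a minimal sketch; the microsecond field is only as
trustworthy as NTP keeps your clock):

#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval now;

    /* one call, one timestamp: no interval arithmetic, so a context
     * switch can't land "inside" the measurement */
    if (gettimeofday(&now, NULL) != 0) {
        perror("gettimeofday");
        return 1;
    }
    printf("%ld.%06ld seconds since the epoch\n",
           (long) now.tv_sec, (long) now.tv_usec);
    return 0;
}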
But if you're trying to create a high-resolution interval timer,
you'll need to make multiple calls to gettimeofday(), and the
context-switch latency that can occur between those calls becomes
a huge issue.
Consider the following code. (And run it on your system, although
you may need to tweak some of the numbers.)
#include <stdio.h>
#include <sys/time.h>

long process (long count, const char *buffer)
{
    struct timeval start, stop;
    long i, j, delay;
    volatile long x;    /* volatile so the work loop can't be optimized away */

    gettimeofday(&start, NULL);

    /* do something that can't be optimized out of existence. */
    x = 0;
    for (i = 0; i < count; i++) {
        for (j = 0; j < 1<<13; j++) {
            x = (x << 1) ^ buffer[j % 20];
        }
    }

    gettimeofday(&stop, NULL);

    delay = 1000000 * (stop.tv_sec - start.tv_sec);
    delay = delay + stop.tv_usec - start.tv_usec;
    return delay;
}
int main()
{
    int i;
    long count;
    long baseline = 0;
    long delay[20];

    for (i = 0; i < 20; i++) delay[i] = 0;

    /* accumulate 20 runs of each count to average out some of the noise */
    for (i = 0; i < 20; i++) {
        for (count = 1; count < 20; count++) {
            delay[count] += process(count,
                "the quick brown fox jump over the lazy dog");
        }
    }

    /* report the excess over count * baseline as a percentage of the baseline */
    baseline = delay[1];
    for (i = 1; i < 20; i++) {
        printf("%2d: %6.2f\n", i,
               100.0*(float)(delay[i]-(i*baseline))/
               (float)baseline);
    }
    return 0;
}
The time consumed by process() should scale linearly with the
count, so the percentages printed below should hover around 0%.
But they don't. Even with 20 runs averaged there's a lot of
variability, but this run is representative:
1: 0.00
2: 3.14
3: 160.48
4: 17.94
5: 18.10
6: 22.68
7: 34.80
8: 44.18
9: 43.51
10: 85.85
11: 58.61
12: 63.25
13: 66.86
14: 120.00
15: 89.56
16: 77.52
17: 92.06
18: 94.81
19: 104.92
The numbers are a bit weird: the second column is the "excess"
time beyond n times the baseline, expressed as a percentage of the
single-count baseline (so one unit of work has arbitrary units of
100). For n=19, the time should have been 1900 but was actually
1900 + 104.92 = 2004.92.
For n=3, the time should have been 300 but was actually 460.48.
The huge hit at n=3 comes because that run is close to the length of
the system time slice: shorter runs often avoid a context switch
entirely, while longer runs can amortize the cost over more slices
spent entirely in the delay loop.
The point of this exercise is that the latency in the time slices
adds an anomalous, and highly unpredictable, overhead to any time
measurement of a software task that spans multiple time slices.
If you can stay within a single time slice (perhaps tossing any
anomalously high numbers) you can get good results on a stock
kernel, but if you're trying to measure delays that are within an
order of magnitude of the time slice you're hosed.
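A minimal sketch of that "toss the anomalously high numbers" idea,
assuming the task under test fits comfortably inside one time slice
(the task() body here is just a placeholder): time it repeatedly and
keep the minimum, on the theory that the smallest sample is the one
that didn't eat a context switch.

#include <stdio.h>
#include <sys/time.h>

/* placeholder task: something short enough to fit in one time slice */
static volatile long sink;
static void task(void)
{
    long j;
    for (j = 0; j < 1 << 13; j++)
        sink = (sink << 1) ^ j;
}

/* time one run of fn() in microseconds with a pair of gettimeofday() calls */
static long time_once(void (*fn)(void))
{
    struct timeval start, stop;

    gettimeofday(&start, NULL);
    fn();
    gettimeofday(&stop, NULL);
    return 1000000 * (stop.tv_sec - start.tv_sec)
           + (stop.tv_usec - start.tv_usec);
}

int main(void)
{
    long best = -1, t;
    int i;

    /* repeat the measurement and keep the minimum: samples that straddled
     * a time slice come out anomalously high and are effectively tossed */
    for (i = 0; i < 100; i++) {
        t = time_once(task);
        if (best < 0 || t < best)
            best = t;
    }
    printf("best of 100 runs: %ld usec\n", best);
    return 0;
}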
Bear