[lug] q: multithreading on smp weirdness

karheng at softhome.net karheng at softhome.net
Tue Nov 30 21:14:57 MST 2004


that's an interesting guess! i've gotta watch out for
that.. i don't think it's the reason though, at least
not for the test app. 

for the test app, i've replaced the chunk-fetching
function with a routine that simply does some computation
over a small buffer. that buffer doesn't grow or
shrink, & all i vary is the number of iterations
over that buffer. so it should exclude issues caused by
cpu cache lines... 

actually, the test app i have is part of a bigger app.
thread 1 of the bigger app fetches data & thread 2
processes them.
i've extracted & simplified that code out and made both
threads only do computation on a small buffer to
reproduce the problem from the bigger app. 

anyway, i've also tested the threads in the bigger app,
& i've found that data returned from thread 1 is
consistently sufficient to keep thread 2 busy
(the thread1:thread2 elapsed-time ratio is approx 6:4;
in my test app it's made to be 1:1).
since the two stages overlap, total elapsed time should
approach that of the slower stage, so in theory i should
be getting about 30% to 40% performance improvement in
my bigger app, & always a 40% to 50% improvement in my
test case. 

i've attached source code from my test app, for better
clarity. it can't be compiled because i've excluded some
custom classes. i can produce a completely isolated
version of the code if required. 

i've been trying to look up multithreading topics/faqs
all over the net... so far still no clue. 

will also try it on another dual CPU itanium box soon
after it's set up. 

rgds, 

kh 

ps: sorry for the double post earlier... the web based
mailer i used didn't indicate that the first mail had
been sent successfully. 


>
>Zan Lynx wrote:
>>On Mon, 2004-11-29 at 19:45 -0700, karheng at softhome.net wrote: 
>>
>>>greetings. 
>>>i've got a test app with 2 threads processing
>>>several chunks of data. thread 1 fetches chunk 1
>>>& processes it. once done, it's relayed to thread 2.
>>>then thread 1 fetches & processes chunk 2, while
>>>thread 2 receives & processes chunk 1. this goes
>>>on until all chunks are processed by both threads. 
>>>my problem is that if the chunk size is made small
>>>enough, the elapse time suddenly almost doubles...
>>>and i need these chunks to be small eventually. 
>>>more specifically:
>>>(chunk size * iteration count always == 6000)
>>>single threaded:
>>>chunk size of 6000 unit, 1 iteration, 12+ secs elapse time
>>>multi-threaded:
>>>chunk size of 12 unit, 500 iterations, 6+ secs elapse time
>>>chunk size of 10 unit, 600 iterations, 6+ secs elapse time
>>>chunk size of 8 unit, 750 iterations, 6++ secs elapse time
>>>chunk size of 6 unit, 1000 iterations, 6++ secs elapse time most of the time
>>>chunk size of 5 unit, 1200 iterations, 6+++ secs elapse time most of the time, 11+ secs at other times.
>>>chunk size of 4 unit, 1500 iterations, 11+ secs elapse time
>>>chunk size of 3 unit, 2000 iterations, 11++ secs elapse time
>>>chunk size of 2 unit, 3000 iterations, 11+++ secs elapse time
>>>chunk size of 1 unit, 6000 iterations, 12+ secs elapse time
>>>(note that in above, elapse times of 7, 8, 9 & 10 secs are
>>>mostly not present, just 1 or 2 sporadic ones). 
>>>i ran this on a 4 CPU itanium linux server. 
>>>this problem doesn't appear when i ran a port
>>>of this app on a windows hyperthreading machine. 
>>
>>An exotic guess: maybe 4 units is just under the size of a cache line?
>>Then thread 1's 4 units might be sharing a cache line with thread 2's 4
>>units.  Memory accesses will cause the cache line to bounce from cpu 1
>>to cpu 2 and back.
>
>Actually I would go with this as my guess too. Having worked with some
>serious game developers (just for fun), we've gone through some serious
>optimizing. The most extraordinary losses occur when transferring data
>in pieces that don't match the cache line size, and this can be
>demonstrated with ease in a simple C for loop. What you do is loop
>through a data copy 1 byte at a time at each iteration of the loop.
>Then you try the alternative, say 4 bytes at a time if that matches
>your cache line size (using modulus, and if the data isn't an exact
>multiple of the line size, falling back to the smaller copy on the
>last loop), which results in a dramatic improvement. FYI, if you are
>copying enough data then an MMX copy will again be an enormous
>boost... but I'm not sure how MMX could be used, or what the gotchas
>are, using MMX to copy between threads. On a 32-bit system (what I was
>using) try 4-byte copies; on a 64-bit system, try 8-byte copies.
>Consider, if you have data which is less than 8 bytes, padding it with
>NULL bytes so you can copy whole line sizes (double-check the cache
>line sizes; it's been a while and my numbers might be off).
>
>D. Stimits, stimits AT comcast DOT net
 
-------------- next part --------------
/*
  Test app for testing thread sync overhead & stuff.
*/

/*
  defines platform stuff, such as MS_WINDOWS_NATIVE,
  APIFUNCATTRH, APIFUNCATTRB, APIFUNCATTRT, etc.
*/
#include        <myconfig.h>
#include        <stdio.h>
#include        <stdlib.h>
#include        <ctype.h>
#include        <time.h>
#include        <sys/timeb.h>
#include        <iostream.h>
#if             !defined(MS_WINDOWS_NATIVE)
#include        <stdexcept>
#else        /* !defined(MS_WINDOWS_NATIVE) */
#if             !defined(__BORLANDC__)
#include        <stdexcept>
#else        /* !defined(__BORLANDC__) */
#include        <stdexcep.h>
#endif       /* !defined(__BORLANDC__) */
#endif       /* !defined(MS_WINDOWS_NATIVE) */
/*
  Wrapper for OS file map functions.
  Provides void * map(filename,mapmode,size) & unmap(void *).
  mapmode is 'i' for create+read+write, 'w' for read+write,
  & 'r' for read only.
*/
#include        <Filemapper.h>
/*
  Wrapper for OS threading functions.
  Uses pthreads when pthreads is available;
  on windows, it uses windows threads.
  Provides classes MyThread & MyThreadSync.

  MyThread is a class that must be inherited, with
  its pure virtual MyThread::main()=0 implemented.

  MyThreadSync manages wait events for thread
  synchronization & resource management.
  Wait events are recorded into this class via
  MyThreadSync::Push()
  MyThreadSync::Pop()
  MyThreadSync::NonBlockPush()
  MyThreadSync::NonBlockPop()

  ::Push() pushes a wait event into an instance
  of MyThreadSync.
  ::Pop() pops a wait event from an instance
  of MyThreadSync.

  When the resulting wait event count in an
  instance is >0, ::Push() & ::Pop() cause the
  calling thread to block.
  When the resulting wait event count in an
  instance is <1, ::Push() & ::Pop() cause any
  thread currently blocking on this instance
  to unblock.

  The NonBlockPush() & NonBlockPop() variants
  never block the calling thread.
*/
#include        <MyThread.h>
#include        <MyThreadSync.h>


#define ctxcnt 1500 // count of syncs desired
unsigned itrcnt=6000/ctxcnt; // count of iterations.
#define bufcnt 128 // size of a buffer we will use in getbusy()
int buff1[bufcnt]; // scratch buffer
int buff2[bufcnt]; // scratch buffer
int * buf1=buff1;
int * buf2=buff2;
MyThreadSync s0,s1; // thread synchronizers

struct timeb start_time[100]; // used to store elapse times.
int itimer = 0;

void start_timer () {
  ftime(&start_time[itimer++]);
}

void print_timer () {
  struct timeb e_time;
  long tdiff;

  ftime(&e_time);

  itimer--;
  tdiff=(long)(difftime(e_time.time,start_time[itimer].time)*1000);
  tdiff=tdiff+e_time.millitm-start_time[itimer].millitm;

  printf("Elapsed time %ld milliseconds\n",tdiff);
}

// simply does something to keep a CPU busy for a short time (+-2ms).
void
getbusy(int * buf)
{
  unsigned c,i;

  //puts("Gonna be busy...");
  for(c=bufcnt;c-->0;)
  {
    buf[c]=0;
    buf[c]=(int)((((double)157*(double)7)/(double)10)*(double)22/(double)7*(double)c);
  };
  for(i=0;i<bufcnt;i++)
  for(c=bufcnt;c-->0;)
  {
    buf[c]*=buf[i];
  };
  //puts("Gonna be free...");
}

class
thrd
:public MyThread
{
  public:
    APIFUNCATTRH void APIFUNCATTRB main(void) APIFUNCATTRT;
};

#if             !defined(__GNUC__)
#if             !defined(__attribute__)
#define         __attribute__(attr)
#endif       /* !defined(__attribute__) */
#endif       /* !defined(__GNUC__) */

void
duh1(void)
{
  unsigned n,i;
  thrd mythread;

  buf1=(int *)map("buf1",'i',sizeof(int)*bufcnt);
  buf2=(int *)map("buf2",'i',sizeof(int)*bufcnt);

  start_timer();

  mythread.Start();
  for(n=0;n<ctxcnt;n++)
  {
    for(i=0;i<itrcnt;i++)
      getbusy(buf1);
    s1.Pop(); // refer to top of this file for function documentation
    s0.Push(); // refer to top of this file for function documentation
  };
  mythread.Join();

  print_timer();

  unmap(buf1);
  unmap(buf2);
}

void
thrd::main(void)
{
  unsigned n,i;

  for(n=0;n<ctxcnt;n++)
  {
    for(i=0;i<itrcnt;i++)
      getbusy(buf2);
    s0.Pop(); // refer to top of this file for function documentation
    s1.Push(); // refer to top of this file for function documentation
  };
}

int
main(void)
{
  //duh00();
  duh1();

  return(0);
}
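
/*
  for readers without the custom headers: the Push()/Pop() counter
  semantics documented at the top of this file could be sketched on
  plain pthreads roughly as below. this is only an illustration of the
  documented behaviour; the real MyThreadSync implementation may differ.
*/

```cpp
// Sketch of the documented MyThreadSync semantics: Push() adds a wait
// event, Pop() removes one.  If the count after the operation is >0
// the caller blocks; once it drops below 1, blocked threads wake up.
#include <pthread.h>

class SyncSketch {
  pthread_mutex_t m;
  pthread_cond_t  cv;
  int count;                        // outstanding wait events
public:
  SyncSketch() : count(0) {
    pthread_mutex_init(&m, 0);
    pthread_cond_init(&cv, 0);
  }
  ~SyncSketch() {
    pthread_cond_destroy(&cv);
    pthread_mutex_destroy(&m);
  }
  int Count() {                     // for inspection only
    pthread_mutex_lock(&m);
    int c = count;
    pthread_mutex_unlock(&m);
    return c;
  }
  void NonBlockPush() { pthread_mutex_lock(&m); bump(+1); pthread_mutex_unlock(&m); }
  void NonBlockPop()  { pthread_mutex_lock(&m); bump(-1); pthread_mutex_unlock(&m); }
  void Push() { blocking(+1); }
  void Pop()  { blocking(-1); }
private:
  void bump(int d) {
    count += d;
    if (count < 1)                  // release anyone blocked here
      pthread_cond_broadcast(&cv);
  }
  void blocking(int d) {
    pthread_mutex_lock(&m);
    bump(d);
    while (count > 0)               // block while events are outstanding
      pthread_cond_wait(&cv, &m);
    pthread_mutex_unlock(&m);
  }
};
```

/*
  with this reading, the s1.Pop()/s0.Push() pairing in duh1() & the
  mirrored s0.Pop()/s1.Push() in thrd::main() form a per-chunk
  rendezvous between the two threads.
*/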


