[lug] q: multithreading on smp weirdness

D. Stimits stimits at comcast.net
Tue Nov 30 00:54:39 MST 2004


Zan Lynx wrote:
> On Mon, 2004-11-29 at 19:45 -0700, karheng at softhome.net wrote:
> 
>>greetings. 
>>
>>i've got a test app with 2 threads processing
>>several chunks of data. thread 1 fetches chunk 1
>>& processes it. once done, it's relayed to thread 2.
>>then thread 1 fetches & processes chunk 2, while
>>thread 2 receives & processes chunk 1. this goes
>>on until all chunks are processed by both threads. 
>>
>>my problem is that if the chunk size is made small
>>enough, the elapsed time suddenly almost doubles...
>>and i need these chunks to be small eventually. 
>>
>>more specifically:
>>(chunk size * iteration count always == 6000)
>>single-threaded:
>>chunk size of 6000 units, 1 iteration: 12+ secs elapsed time
>>multi-threaded:
>>chunk size of 12 units, 500 iterations: 6+ secs elapsed time
>>chunk size of 10 units, 600 iterations: 6+ secs elapsed time
>>chunk size of 8 units, 750 iterations: 6++ secs elapsed time
>>chunk size of 6 units, 1000 iterations: 6++ secs elapsed time most of the time
>>chunk size of 5 units, 1200 iterations: 6+++ secs elapsed time most of
>>the time, 11+ secs at other times
>>chunk size of 4 units, 1500 iterations: 11+ secs elapsed time
>>chunk size of 3 units, 2000 iterations: 11++ secs elapsed time
>>chunk size of 2 units, 3000 iterations: 11+++ secs elapsed time
>>chunk size of 1 unit, 6000 iterations: 12+ secs elapsed time
>>(note that in the above, elapsed times of 7, 8, 9 & 10 secs are
>>mostly absent, just 1 or 2 sporadic ones). 
>>
>>i ran this on a 4 CPU itanium linux server. 
>>
>>this problem doesn't appear when i run a port
>>of this app on a windows hyperthreading machine. 
> 
> 
> An exotic guess: maybe 4 units is just under the size of a cache line?
> Then thread 1's 4 units might be sharing a cache line with thread 2's 4
> units.  Memory accesses will cause the cache line to bounce from cpu 1
> to cpu 2 and back.

Actually, I would go with this as my guess too. Having worked with some 
serious game developers (just for fun), we've gone through some serious 
optimizing, and cache line behavior is where the most extraordinary 
losses show up.
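
Here is a rough sketch of the bouncing Zan describes (the pthread 
harness and the 64-byte pad are my own assumptions, not anything from 
the original app; check your CPU's real line size). Two threads hammer 
two counters, and moving the counters onto separate lines is the whole 
difference:

/* False-sharing demo: two threads increment adjacent counters.
 * With PAD 1 the counters will (almost always) sit on one cache
 * line, so the line ping-pongs between CPUs; raising PAD to the
 * line size (64 is only a common guess) separates them. */
#include <pthread.h>
#include <stdio.h>

#define PAD   1              /* try 1, then your cache line size */
#define LOOPS 100000000L

static struct {
    volatile long a;
    char gap[PAD];
    volatile long b;
} counters;

static void *bump_a(void *unused)
{
    long i;
    for (i = 0; i < LOOPS; i++)
        counters.a++;
    return NULL;
}

static void *bump_b(void *unused)
{
    long i;
    for (i = 0; i < LOOPS; i++)
        counters.b++;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", counters.a, counters.b);
    return 0;
}

Build with gcc -O2 -lpthread and time it once with PAD 1 and once 
with PAD 64; the PAD 1 run should be dramatically slower on a 
multi-CPU box.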
The cost of moving data in sizes that don't match the hardware can 
also be demonstrated with ease in a simple C for loop. Loop through a 
data copy 1 byte at a time, then try the alternative, say 4 bytes at 
a time, using modulus so that a final chunk which isn't a full word 
falls back to the byte-at-a-time copy; the wider copy gives a 
dramatic improvement. FYI, if you are copying enough data, an MMX 
copy will again be an enormous boost... but I'm not sure how MMX 
could be used, or what the gotchas are, when copying between threads. 
On a 32-bit system (what I was using) try 4-byte copies; on a 64-bit 
system, try 8-byte copies. If a piece of data is smaller than that, 
consider padding it with NUL bytes so every copy can be done in 
full-width units (double-check the cache line and word sizes; it's 
been a while and my numbers might be off).
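
A minimal sketch of that word-at-a-time copy (my own code, not from 
the app in question; it assumes the buffers are 8-byte aligned and 
non-overlapping, and on a 32-bit machine you would swap uint64_t for 
uint32_t):

#include <stddef.h>
#include <stdint.h>

/* Copy n bytes 8 at a time, with a byte loop for whatever the
 * modulus leaves over. */
static void copy_words(void *dst, const void *src, size_t n)
{
    uint64_t *d = dst;
    const uint64_t *s = src;
    size_t words = n / sizeof *d;   /* full 8-byte words */
    size_t tail  = n % sizeof *d;   /* leftover bytes    */
    size_t i;
    unsigned char *db;
    const unsigned char *sb;

    for (i = 0; i < words; i++)
        d[i] = s[i];

    db = (unsigned char *)(d + words);
    sb = (const unsigned char *)(s + words);
    for (i = 0; i < tail; i++)
        db[i] = sb[i];
}

In practice your libc's memcpy() already does this (and better), so 
the loop is mostly useful for watching the effect yourself.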
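
And since the hand-off itself came up at the top of the thread: the 
chunk relay karheng describes is the usual one-slot mailbox between 
two stages. A minimal pthread version (every name here is mine, a 
stand-in for the real app):

#include <pthread.h>
#include <stdio.h>

#define CHUNKS 8

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int slot, slot_full;         /* the one-chunk mailbox */

static void *stage1(void *unused)   /* fetch & process, then relay */
{
    int i;
    for (i = 0; i < CHUNKS; i++) {
        int chunk = i * i;          /* stand-in for real work */
        pthread_mutex_lock(&lock);
        while (slot_full)           /* wait for stage 2 to take it */
            pthread_cond_wait(&cond, &lock);
        slot = chunk;
        slot_full = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *stage2(void *unused)   /* receive & process */
{
    int i;
    for (i = 0; i < CHUNKS; i++) {
        int chunk;
        pthread_mutex_lock(&lock);
        while (!slot_full)
            pthread_cond_wait(&cond, &lock);
        chunk = slot;
        slot_full = 0;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        printf("stage 2 got %d\n", chunk);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_create(&t2, NULL, stage2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Note the per-chunk cost here: every hand-off is a lock plus a 
possible sleep/wake, which is another reason very small chunks hurt; 
the synchronization overhead stops being amortized.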

D. Stimits, stimits AT comcast DOT net


