[lug] q: multithreading on smp weirdness

Chan Kar Heng karheng at softhome.net
Sun Dec 5 10:02:29 MST 2004



At 2004-12-02 01:25 AM, you wrote:
>karheng at softhome.net wrote:
>>That's an interesting guess! I've got to watch out for
>>that... I don't think it's the reason though, at least
>>not for the test app.
>>For the test app, I've replaced the chunk-fetching
>>function with a routine that simply does some computation
>>over a small buffer. That buffer doesn't grow or
>>shrink, and all I do is toggle the number of iterations
>>over that buffer, so it should exclude issues caused by
>>the CPU cache line...
>>Actually, the test app I have is part of a bigger app.
>>Thread 1 of the bigger app fetches data and thread 2
>>processes it.
>>I've extracted and simplified that code and made both
>>threads do computation only on a small buffer to
>>reproduce the problem from the bigger app.
>>Anyway, I've also tested the threads in the bigger app,
>>and I've found that the data returned from thread 1 is
>>consistently sufficient to keep thread 2 busy.
>>(The thread1:thread2 elapsed-time ratio is approx 6:4; in my
>>test app it's made to be 1:1.)
>>In theory, I should be getting about 30% to 40%
>>performance improvement in my bigger app, and always 40%
>>to 50% improvement in my test case.
>>I've attached source code from my test app, for better
>>clarity. It can't be compiled because I've excluded some
>>custom classes. I can produce a completely isolated
>>version of the code if required.
>>I've been trying to look up multithreading topics/FAQs
>>all over the net... so far still no clue.
>>Will also try it on another dual-CPU Itanium box soon
>>after it's set up.
>
>I suspect the threading wrappers themselves hold a good chunk of the clues. You could possibly compile with profiling on (see man gprof, and search for "profile" under man g++), which would give you a lot of information.

You might be right. :)
Considering I wrote the wrappers... lol.
The wrappers are actually quite simple:

{ //opcode...
  pthread_mutex_lock(&mtx);
  while (waiteventcount > 0)
    pthread_cond_wait(&cnd, &mtx);
  pthread_mutex_unlock(&mtx);
}

... with the other complementary parts similarly simple.
... something to that effect.
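
Spelled out with real pthread calls, and with the complementary
signal side filled in, the whole thing is roughly like this (a
sketch only; the names here are invented for illustration, not
the actual wrapper members):

#include <pthread.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cnd = PTHREAD_COND_INITIALIZER;
static int waiteventcount = 0;

/* producer side: post one event for the other thread */
void post_event(void)
{
  pthread_mutex_lock(&mtx);
  waiteventcount++;
  pthread_mutex_unlock(&mtx);
}

/* consumer side: consume one event; wake waiters when drained */
void consume_event(void)
{
  pthread_mutex_lock(&mtx);
  if (--waiteventcount == 0)
    pthread_cond_signal(&cnd);
  pthread_mutex_unlock(&mtx);
}

/* block until all pending events have been consumed */
void wait_all_done(void)
{
  pthread_mutex_lock(&mtx);
  while (waiteventcount > 0)
    pthread_cond_wait(&cnd, &mtx);
  pthread_mutex_unlock(&mtx);
}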

But here's something new...

I managed to get it tested on a new dual-Itanium box.
It gave me the expected 50% performance improvement
no matter how many thread syncs I put in.
(Thread sync overheads were visible and only added a
few milliseconds progressively.)

I'm going to try it on 2 other dual-Itanium boxes from
a nearby friendly dept soon to verify this.

I'll take your advice and gprof it soon.
I'm curious, though, what effect it has on multithreaded
apps.


>Based on the code below though, it sounds like hard drive access is also going on. If so, does /proc/interrupts show all CPUs handling interrupts? If not, the SMP APIC is not enabled (and it may or may not be safe to enable it, depending on the chipset). On SMP machines only CPU0 handles hardware IRQs, *unless* the APIC is enabled. Disk access and most hardware require hardware IRQs, so with the SMP APIC enabled any CPU can service hardware, but without it only one CPU can, and that one CPU can easily get IRQ-starved under heavy hardware IRQ activity (e.g., a highly active ethernet card plus disk activity). Many of the newer Intel chipsets do not get the APIC activated by default, simply because Intel doesn't provide enough public chipset information. Profiling would probably also tell you whether disk I/O is consuming too much time.

Hmmm... will have to check this...
I was away on Friday and didn't manage to try it.

From what you've mentioned... does it mean
that if the APIC is not enabled, even non-multithreaded
apps are affected?
(E.g., a visible slowdown when one person is compiling all
of GNOME on one end and another is doing something else.)



>Are these disk reads/writes that follow? If so, then you can be almost certain that this is your bottleneck, especially since no matter how many threads you have, the disk is serialized.
>...
>>/*
>>  Wrapper for OS file map functions.
>>  Provides void * map(filename,mapmode,size) & unmap(void *).
>>  mapmode is 'i' for create+read+write, 'w' for read+write,
>>  & 'r' for read only.
>>*/
>>#include        <Filemapper.h>
>...
>>  buf1=(int *)map("buf1",'i',sizeof(int)*bufcnt);
>>  buf2=(int *)map("buf2",'i',sizeof(int)*bufcnt);
>...
>>  unmap(buf1);
>>  unmap(buf2);
>...
>
>D. Stimits, stimits AT comcast DOT net

In my test app, I actually started out without the mmap
function calls.
I read somewhere that only one thread can be in the kernel
at a time, and was wondering whether these mmap calls
would cause this situation.

Anyway, I tried having them and removing them, with the same
results.
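
For reference, the map() wrapper in the quoted code is conceptually
just open() plus mmap(). A simplified sketch of that interface (not
the actual Filemapper code, which also has to remember each mapping's
size so unmap(void *) can call munmap()):

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

void *map(const char *filename, char mapmode, size_t size)
{
  int oflags = (mapmode == 'i') ? (O_RDWR | O_CREAT)
             : (mapmode == 'w') ? O_RDWR
             : O_RDONLY;
  int prot   = (mapmode == 'r') ? PROT_READ
             : (PROT_READ | PROT_WRITE);
  int fd = open(filename, oflags, 0644);
  void *p;

  if (fd < 0)
    return NULL;
  /* 'i' creates the file, so give it its full length first */
  if (mapmode == 'i' && ftruncate(fd, (off_t)size) < 0) {
    close(fd);
    return NULL;
  }
  p = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
  close(fd); /* the mapping stays valid after close */
  return (p == MAP_FAILED) ? NULL : p;
}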

Disk IO didn't seem that high though.
(I assume mostly because the entire file was cached/paged
into memory and left there, since there was no reason
to page it out until the mmap was to be closed.)

For the duration the test app was running, the disk was only
accessed in one go during app startup and another go during
app termination.
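
If anyone wants to verify that caching assumption, mincore(2)
reports how much of a mapping is actually resident in memory. A
quick sketch (report_residency is just an illustrative name; buf
and len stand for one of the mapped buffers and its size):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

void report_residency(void *buf, size_t len)
{
  long pagesz = sysconf(_SC_PAGESIZE);
  size_t pages = (len + pagesz - 1) / pagesz;
  unsigned char *vec = malloc(pages);
  size_t resident = 0, i;

  if (vec && mincore(buf, len, vec) == 0) {
    for (i = 0; i < pages; i++)
      if (vec[i] & 1) /* low bit set means page is resident */
        resident++;
    printf("%zu of %zu pages resident\n", resident, pages);
  }
  free(vec);
}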


Thanks. :)

rgds,

kh








