[lug] HW Puzzle

Lee Woodworth blug-mail at duboulder.com
Fri Oct 19 15:17:36 MDT 2018


Hi All,

Here is a puzzle to ponder:

An old workstation randomly hard-locks when idle.
   The PS and case fans keep running, but there are no ping responses, SysRq keys
   are ignored, and the video output is black.

   A remote monitoring loop dumping /sys/class/hwmon/hwmon0/device/temp*_input
   shows CPU/case temps of around 40C in readings taken no more than about 120s
   before the lockup. A similar loop dumping smartctl temps for the SSD shows 31C.
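A minimal sketch of such a logging loop follows. It is written in Python as an assumption (the original loop's language and sampling interval aren't stated), and the module name `hwmon_log`, the 10 s default interval and the `count` parameter are hypothetical. hwmon exposes temperatures in millidegrees Celsius, and the sketch scans both the `hwmonN/temp*_input` layout and the older `hwmonN/device/temp*_input` layout used above.

```python
import glob
import os
import sys
import time


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {path: degrees C} for every temp*_input file under root.

    hwmon reports temperatures in millidegrees Celsius. Both the
    hwmonN/temp*_input layout and the older hwmonN/device/temp*_input
    layout are scanned.
    """
    temps = {}
    patterns = (
        os.path.join(root, "hwmon*", "temp*_input"),
        os.path.join(root, "hwmon*", "device", "temp*_input"),
    )
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path) as f:
                    temps[path] = int(f.read().strip()) / 1000.0
            except (OSError, ValueError):
                # Unreadable sensor or non-numeric value: skip it.
                continue
    return temps


def monitor(interval=10.0, root="/sys/class/hwmon", out=sys.stdout, count=None):
    """Print one timestamped line of temperatures every `interval` seconds.

    Runs forever unless `count` limits the number of samples (a
    hypothetical parameter, handy for testing).
    """
    samples = 0
    while count is None or samples < count:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        readings = " ".join(
            f"{os.path.relpath(path, root)}={temp:.1f}C"
            for path, temp in read_hwmon_temps(root).items()
        )
        # Flush so each line leaves the machine before a possible lockup.
        print(stamp, readings, file=out, flush=True)
        samples += 1
        time.sleep(interval)
```

With the code saved as hwmon_log.py in the workstation's home directory, it could be run from a second machine, e.g. `ssh workstation "python3 -c 'import hwmon_log; hwmon_log.monitor()'" | tee temps.log` (host name hypothetical), so the last samples before a lockup end up on the machine that keeps running.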

   This happens with multiple 4.18 kernel versions, up to 4.18.13. There are
   no lockups while the system is under continuous heavy load (e.g. building a
   toolchain). No hibernate, suspend or frequency scaling is configured in the
   kernel, and no userland power agents are running either.

An oddity is that the kernel and gcc disagree about whether the CPU
(Athlon 64, K8 architecture) supports SSE3. There is no sse3 flag in /proc/cpuinfo, but the output of
    gcc -E -v -march=native -</dev/null 2>&1 | grep cc1
includes -msse3. The cpuid2cpuflags command also lists sse3 as a CPU feature.
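Before treating this as a real disagreement, it may be worth ruling out a naming issue: the Linux kernel lists SSE3 in /proc/cpuinfo under the flag name `pni` (Prescott New Instructions), not `sse3`, so searching for "sse3" finds nothing, or only matches `ssse3`, which is a different extension. The Python sketch below (all function names are hypothetical) checks the cpuinfo flags and the cc1 line from the gcc command above in a comparable way.

```python
import subprocess


def cpuinfo_has_sse3(cpuinfo_text):
    """True if a 'flags' line of /proc/cpuinfo text reports SSE3.

    The kernel names the SSE3 bit 'pni' (Prescott New Instructions);
    'ssse3' is a different extension and is deliberately not counted.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags = value.split()
            if "pni" in flags or "sse3" in flags:
                return True
    return False


def gcc_native_has_sse3(cc1_line):
    """True if the cc1 command line contains the exact option -msse3.

    Comparing whole tokens keeps -mno-sse3 and -mssse3 from matching.
    """
    return "-msse3" in cc1_line.split()


def gcc_cc1_line():
    """Run `gcc -E -v -march=native -` on empty input; return the cc1 line."""
    result = subprocess.run(
        ["gcc", "-E", "-v", "-march=native", "-"],
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
        check=True,
    )
    # gcc -v reports the commands it runs on stderr (hence 2>&1 above).
    for line in result.stderr.splitlines():
        if "cc1" in line:
            return line
    return ""
```

For example, `cpuinfo_has_sse3(open("/proc/cpuinfo").read())` and `gcc_native_has_sse3(gcc_cc1_line())`; if both return True, the kernel and gcc actually agree and only the flag name differs.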

-- PS issues would be the first guess, but the low temps and the absence of
    crashes under load argue against it; it's a small form-factor case with a
    non-standard PS, so swapping the PS is a pain; past web reports of similar
    issues have blamed CPU power states or watchdog timers
-- a failing component could be the cause, but it isn't obvious to me how
    heavy loads could avoid stressing a failing component
-- a cracked motherboard trace or solder joint is a possibility, since failures
    happen when idle; I don't have an easy way to check that
-- WAG: maybe there is SSE3-related processor state the kernel isn't saving
    because it doesn't think it needs to, while one of the screen savers uses
    SSE3; I will rebuild the toolchains and software with SSE3 disabled and shut
    off the screen saver to see what happens.
-- GPU issue: web searches turn up reports attributing lockups to GPU overload;
    the desktop is xfce4, so we'll see if turning off the screen saver makes a
    difference; there are no crashes, though, when multiple Firefox instances
    are actively in use with hardware acceleration

