[lug] HW Puzzle

mad.scientist.at.large at tutanota.com mad.scientist.at.large at tutanota.com
Fri Oct 19 17:00:46 MDT 2018


I'd definitely look at the graphics card and reseat it. If the graphics card hangs for any reason, it can bring down the rest of the machine.  It's probably worth trying another graphics card for testing.

Intermittent problems can be very deceptive about what appears to be failing.  Cold solder joints, cracked board traces, and failing bond wires can produce all kinds of false correlations, and any of them can get better or worse under load.  These are the things that drive techs nuts, chasing down one false lead after another: just touching the board or replacing something else can affect connections elsewhere.  Also note that current flow can keep failed conductors touching, due to the magnetic field the current produces, especially in the power supply and its wiring, the power traces on a board, and the bond wires in chips that draw significant power.
This sounds like possible overheating despite what the CPU reports temperature-wise.  It could also be a failing CPU given the odd reports, e.g. a bond wire in the CPU or a trace/solder joint on the board may be failing.  That kind of fault can go away (or seem to) once things warm up.  The processor may also need to be removed and reseated; ZIF sockets don't do much wiping of the contacts, but there is some.
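
On the odd sse3 reports: before blaming hardware, it is worth ruling out a naming difference.  Linux lists SSE3 in /proc/cpuinfo under the flag name "pni", not "sse3" ("ssse3" is a separate, later extension).  A minimal sketch of the check follows; the function names are made up for illustration.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_sse3(cpuinfo_text):
    # The kernel reports SSE3 as "pni"; "ssse3" is a different extension.
    return "pni" in cpu_flags(cpuinfo_text)
```

To try it on the affected machine, pass it the contents of /proc/cpuinfo, e.g. has_sse3(open("/proc/cpuinfo").read()).  If that returns True, the kernel and gcc agree after all.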

I'd also monitor the power supply rails and look at the SMART data for the hard drives.  Maybe one of them is taking a lot of tries to spin up at startup, or sometimes draws excessive current to do it.  When power supplies get old, the caps age and lose some of their capacitance, which is hard on the rest of the supply.  You said it was old; if you have a known-good power supply with enough capacity, it would be worth swapping in for testing.  I've seen marginal power supplies produce a wide range of symptoms that look like something else, because a small power supply problem can cause malfunctions throughout the rest of the system.
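
Watching the rails could be folded into the same kind of monitoring loop already used for the temps.  The sketch below is a rough idea only.  It assumes the sensor chip exposes in*_input files (millivolts) in the hwmon directory, which not every board does.  The raw value is the voltage at the chip's pin, so a +12V rail usually sits behind a divider and needs the scaling that lm-sensors applies from its config.  The function names and the 5% figure (the usual ATX tolerance on the +3.3V, +5V and +12V rails) are my choices, not anything from your setup.

```python
import glob
import os

ATX_TOLERANCE = 0.05  # +/-5% on the +3.3 V, +5 V and +12 V rails


def read_voltages(hwmon_dir):
    """Return {label: volts} for every in*_input file under hwmon_dir."""
    readings = {}
    for path in sorted(glob.glob(os.path.join(hwmon_dir, "in*_input"))):
        name = os.path.basename(path)[:-len("_input")]  # e.g. "in0"
        label = name
        label_path = os.path.join(hwmon_dir, name + "_label")
        if os.path.exists(label_path):
            with open(label_path) as f:
                label = f.read().strip()
        with open(path) as f:
            # sysfs reports millivolts; unscaled, as seen at the chip pin
            readings[label] = int(f.read().strip()) / 1000.0
    return readings


def out_of_tolerance(measured, nominal, tolerance=ATX_TOLERANCE):
    """True if a (scaled) reading is outside nominal +/- tolerance."""
    return abs(measured - nominal) > abs(nominal) * tolerance
```

Point read_voltages() at whichever directory holds the files on your box (the hwmon0 directory itself or its device subdirectory) and log the result remotely alongside the temps.  A rail that sags or jumps shortly before a lockup would be a strong hint.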

Also consider cleaning, since the temperature sensors on the CPU don't detect hot spots elsewhere in the CPU, so the reported temperature can look fine while something is overheating.

It's also possible the CPU fan is dirty and spins up slowly, so that the load at boot may overheat the CPU, yet the fan usually ends up spinning fast enough once it gets going.  I've seen old fans do that from dirt: once they get going the gunk heats up and becomes less of a problem, though it still makes the fan run a bit slow.
In any case, a visual inspection of the inside, plus unplugging and replugging the power cables and other cables, may solve the problem.  This will be hard to figure out barring good luck.
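
The slow-fan theory can be checked from software if the board reports fan speeds.  Here is a minimal sketch, assuming the hwmon directory has fan*_input files (these are in RPM).  The function names and the min_rpm threshold are placeholders to pick per fan, not known values for this machine.

```python
import glob
import os
import time


def read_fans(hwmon_dir):
    """Return {name: rpm} for every fan*_input file under hwmon_dir."""
    fans = {}
    for path in sorted(glob.glob(os.path.join(hwmon_dir, "fan*_input"))):
        name = os.path.basename(path)[:-len("_input")]  # e.g. "fan1"
        with open(path) as f:
            fans[name] = int(f.read().strip())
    return fans


def slow_fans(fans, min_rpm):
    """Names of fans spinning below min_rpm."""
    return sorted(name for name, rpm in fans.items() if rpm < min_rpm)


def log_fans(hwmon_dir, min_rpm, samples, interval=1.0):
    """Print timestamped fan speeds, flagging any below min_rpm."""
    for _ in range(samples):
        fans = read_fans(hwmon_dir)
        print(time.strftime("%H:%M:%S"), fans,
              "SLOW:", slow_fans(fans, min_rpm), flush=True)
        time.sleep(interval)
```

Starting log_fans() as early as possible after boot and sending the output to another host would show whether the fan lags behind for the first minute or so.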



Democracy now!


19. Oct 2018 15:17 by blug-mail at duboulder.com:


> Hi All,
>
> Here is a puzzle to ponder:
>
> An old workstation is randomly hard locking when idle.
>   The PS and Case fans are running, no ping responses, SYSREQ keys are ignored
>   and the video output is black.
>
>   A remote monitoring loop dumping /sys/class/hwmon/hwmon0/device/temp*_input
>   shows CPU/case temps of around 40C about 120s or less before the lockup. A
>   similar dump for smartctl temps on the ssd shows 31C.
>
>   This happens for multiple versions of the 4.18 kernel, up to 4.18.13.
>   No lockups when the system is under a continuous heavy load (e.g. building a
>   tool chain). No hibernate/suspend/frequency-adjust configured for the kernel
>   and no userland agents either.
>
> An oddity is that the kernel and gcc report differences as to whether the cpu
> (athlon64, k8 arch) supports sse3. No sse3 flag in /proc/cpuinfo, but the output of
>    gcc -E -v -march=native -</dev/null 2>&1 | grep cc1
> includes -msse3. The cpuid2cpuflags command also includes sse3 as a cpu feature.
>
> -- ps issues would be the first guess, but the low temps and no crashes under
>    load don't suggest it; small form-factor case with non-standard PS so swapping
>    the ps is a pain; past web reports for similar issues have been about cpu
>    power states or watch dog timers
> -- a failing component could be the cause; but how heavy loads could fail to
>    stress a failing component isn't obvious to me
> -- cracked mb trace/solder joint is a possibility since failures happen when idle;
>    don't have an easy way to check that
> -- WAG: maybe there is processor state related to sse3 the kernel isn't saving
>    because it doesn't think it needs to, but one of the screen savers use sse3;
>    will be rebuilding the tool chains and software with sse3 disabled and shutoff
>    the screen saver to see what happens.
> -- gpu issue -- web searches find reports attributing lockups to gpu overload;
>    desktop is xfce4, we'll see if turning off the screen saver makes a difference;
>    no crashes when multiple instances of firefox are actively in use w/ hw
>    acceleration though
> _______________________________________________
> Web Page: http://lug.boulder.co.us
> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> Join us on IRC: irc.hackingsociety.org port=6667 channel=#hackingsociety