[lug] debugging workstation issue

Maxwell Spangler lists at maxwellspangler.com
Fri Apr 10 13:55:44 MDT 2020


At this point, if I were in your circumstances and I were me, I would:

* Set up a serial port on this system and try to capture some console
output during the crash via that.
* If it has a serial port, use it. If it doesn't, then add a USB-to-serial
adapter.
* Configure the kernel to use the serial port as a console
so all critical console messages go to the serial port.
* Set up a second laptop to capture the serial output.
* Use the desktop as you normally would. When there is a crash, it
should output something to the serial port even if the storage system
is read-only or broken, or if it otherwise won't log anything and
won't let you SSH in because it's a system fault rather than just a
graphical one.

If this were a server with an out-of-band management system like an
iLO, iDRAC or ILOM, I would SSH in from another system, connect to the
virtual serial console and do the same capture process.
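
The capture step on the second laptop could be sketched roughly as
follows. This is only an illustration under several assumptions that are
not from the thread: the adapter shows up as /dev/ttyUSB0 on the laptop,
its speed has already been set to match the other end (for example with
stty), and the crashing machine boots with something like
console=ttyS0,115200n8 on its kernel command line. The names
timestamp_lines, capture and console-capture.log are made up for the
example.

```python
import time


def timestamp_lines(stream, out, clock=time.strftime):
    """Copy lines from a binary stream to a text log, one timestamp per line.

    Returns the number of lines copied.
    """
    count = 0
    for raw in stream:
        # Console output may contain odd bytes during a crash; never fail on them.
        text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        out.write(f"{clock('%Y-%m-%d %H:%M:%S')} {text}\n")
        out.flush()  # flush every line: the last message is the interesting one
        count += 1
    return count


def capture(device="/dev/ttyUSB0", log_path="console-capture.log"):
    """Read the (assumed, pre-configured) serial device and append to a log file."""
    with open(device, "rb", buffering=0) as port, open(log_path, "a") as log:
        return timestamp_lines(port, log)
```

Calling capture() and leaving it running until the desktop hangs should
leave the final kernel messages, each with the time it arrived, in the
log file on the laptop. Appending rather than overwriting keeps earlier
crashes around for comparison.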

>> Anyway, the problem is that sometimes at boot and sometimes after
the screensaver is engaged the machine goes into a "dead" status. I
suspect it can be a hibernation mode or something like that, which
does not awake when I hit the keyboard, mouse or power button. A long
press on the power button does trigger a reboot, with all the usual
consequences of such a thing (possible files not closed, filesystem
checks, loss of not-saved data, etc). When it is in this status, trying
to ssh into it from another machine hangs with no response.

You might also try swapping out components if you have spare parts
available.  Bad memory might cause the system to lock up hard like
this.  PCIe bus issues with video and other cards could do the same as
well.  I'd replace the memory with some spares and try to use the
system.  If this is a desktop, you're not likely to have the greatest
tolerance for memory errors, and even on servers I've seen bad ECC
memory cause a server to just hang.  In those cases, on the server,
upon reboot of the hung server, the BIOS often recognizes the memory as
bad and the mystery is solved.  But on a non-ECC PC with few firmware
features to support this kind of debugging, you're forced to do much of
it yourself.

Also, I wonder if you could plug in a standard (non-Apple) USB keyboard
as a second keyboard and attempt CONTROL-ALT-F2 to get to a console on
that?  If it won't work because your primary keyboard is an Apple one,
perhaps a second simultaneous keyboard would allow this.

I hope this helps. Keep us posted if you try more things?

On Fri, 2020-04-10 at 07:46 -0600, Davide Del Vento wrote:
> Thanks Maxwell and Michael,
> 
> The video card of this machine is older than the rest of the box.
> It's an ATI dual-monitor DVI (details below). There is no 3D enabled.
> One of the first things I tried when this was happening was indeed
> trying CONTROL-ALT-Fn and that did not work. I say "did" because I
> recently switched to a more ergonomic Apple keyboard (otherwise I would
> have had my hands chopped off with the inordinate amount of time I now
> have to spend there....) and on that keyboard the Fn keys are not
> recognized, so this test is a moot point.
> Yet, I think the key point here is the fact that the box does not
> respond to ssh attempts, so it must go into some weird state, not just
> a "broken display mode" one.
> 
> Maxwell, I am sorry to hear that even you have to suffer these
> things. Sigh.
> 
> Thanks,
> Davide
> 
> 
> VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
> RV370 GL [FireMV 2200] (rev 80)
> Display controller: Advanced Micro Devices, Inc. [AMD/ATI] RV370 GL
> [FireMV 2200] (Secondary) (rev 80)
> 
> 
> On Thu, Apr 9, 2020 at 9:17 PM Maxwell Spangler <
> lists at maxwellspangler.com> wrote:
> > Sometimes on my Intel-based laptop with integrated Intel GMA
> > drivers, the GUI display will lock up.
> > 
> > Can you CONTROL-ALT-F2 (or F3, F4, etc) to get to an alternate
> > console?
> > 
> > If so, you can then log in on a text console and use 'dmesg' and
> > 'journalctl -f' to see what's going on.
> > 
> > Due to bugs in X Window/GNOME/Intel drivers/whatever this happens
> > to me a lot and for all my love for Linux I have to reboot it a lot
> > to get back to a normal GUI display.
> > 
> > M
> > 
> > On Thu, 2020-04-09 at 16:45 -0600, Davide Del Vento wrote:
> > > Thanks to both of you.
> > > 
> > > Zan: it is indeed a systemd system and "journalctl -b -1"
> > > provides what I was looking for. I don't see anything suspicious
> > > other than perhaps
> > > 
> > > Apr 09 13:23:43 buzzicone systemd[1]: Starting Message of the
> > > Day...
> > > Apr 09 13:23:43 buzzicone systemd[1]: Started Message of the Day.
> > > 
> > > > (and then the log ends). That time is around when the problem
> > > > occurred, so I tried to trigger it live by e.g. opening a new
> > > > shell, starting bash with the -l option, ssh'ing to localhost,
> > > > and triggering the screensaver. Nothing caused it to happen....
> > > > so that is weird.
> > > 
> > > > D Stimits: ssh'ing prior to the failure assumes that I have a
> > > > spare system to do that, which unfortunately I don't have at the
> > > > moment (plus, as I said, it sometimes happens at boot before I
> > > > get a chance to ssh into it). Thanks for the tip about unplugging
> > > > and replugging USB devices, I will try that next time!
> > > 
> > > Cheers,
> > > Davide
> > > 
> > > 
> > > On Thu, Apr 9, 2020 at 3:10 PM D. Stimits <stimits at comcast.net>
> > > wrote:
> > > >   
> > > >    
> > > >  
> > > >  
> > > >   
> > > >    
> > > > 
> > > >   
> > > >   
> > > >    
> > > > 
> > > >   
> > > >   
> > > > > > On April 9, 2020 at 2:14 PM Davide Del Vento <
> > > > > > davide.del.vento at gmail.com> wrote:
> > > > > > 
> > > > > > Folks,
> > > > > > 
> > > > > > My workstation, a desktop-sized server computer used as
> > > > > > a desktop, is having a very annoying problem, which is even
> > > > > > more severe these days now that I have to rely on it for
> > > > > > basically everything (so far the only thing I don't use it
> > > > > > for is when I use the restroom, but that may change
> > > > > > soon)...
> > > > > > 
> > > > > > Anyway, the problem is that sometimes at boot and
> > > > > > sometimes after the screensaver is engaged the machine goes
> > > > > > into a "dead" status. I suspect it can be a hibernation mode
> > > > > > or something like that, which does not awake when I hit the
> > > > > > keyboard, mouse or power button. A long press on the power
> > > > > > button does trigger a reboot, with all the usual consequences
> > > > > > of such a thing (possible files not closed, filesystem
> > > > > > checks, loss of not-saved data, etc). When it is in this
> > > > > > status, trying to ssh into it from another machine hangs with
> > > > > > no response.
> > > > > > 
> > > > > > It's been a long time since I debugged something like
> > > > > > this and dmesg shows only messages since last reboot, which
> > > > > > are clearly useless. Any clues on how to look at the logs
> > > > > > immediately BEFORE that? Bonus points if you have any ideas
> > > > > > on what might be going on or specifically what to look for.
> > > > 
> > > >   
> > > >    
> > > >  Ssh in and run "dmesg --follow" prior to the failure. The
> > > > computer which is displaying this will still be running.
> > > >    
> > > > 
> > > >   
> > > >   
> > > >    
> > > > 
> > > >   
> > > >   
> > > >    Tip: Often USB devices do not correctly handle or respond to
> > > > low power mode events. If this is the case, then when
> > > > mouse/keyboard fails to wake up the system, you might try to
> > > > unplug and replug the mouse/keyboard and test again if you can
> > > > now resume.
> > > >    
> > > > 
> > > >   
> > > >   
> > > >    
> > > > 
> > > >   
> > > >   
> > > >    PS: If I were more motivated I'd edit my window manager and
> > > > display manager code and remove the option to manually
> > > > sleep/suspend. I hate accidentally clicking that instead of log
> > > > out or shut down or reboot.
> > > >    
> > > > 
> > > >    
> > > >  
> > > > _______________________________________________
> > > > 
> > > > Web Page:  http://lug.boulder.co.us
> > > > 
> > > > Mailing List: 
> > > > http://lists.lug.boulder.co.us/mailman/listinfo/lug
> > > > 
> > > > Join us on IRC: irc.hackingsociety.org port=6667
> > > > channel=#hackingsociety
> > -- 
> > Maxwell Spangler
> > 
> > ===================================================================
> > Denver, Colorado, USA
> > 
> > maxwellspangler.com
-- 
Maxwell Spangler

===================================================================
Denver, Colorado, USA

maxwellspangler.com