[lug] Interface CRC error on USB connected SATA drive
Davide Del Vento
davide.del.vento at gmail.com
Fri Sep 9 20:40:58 MDT 2016
Try something like
tar --mtime='2010-01-01' -cf - . | md5sum
from both locations to rule out file corruption (remove --mtime if you
want to include mtime in the checksum). Try first on a subdir to
check if this is correct as is (I haven't tested it).
You might want to do this in smaller chunks.
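One way to work in smaller chunks is to checksum file by file rather
than hashing one big tar stream. That also avoids relying on tar reading
both trees in the same order. A minimal sketch, assuming GNU coreutils
md5sum and a find that supports "-exec ... {} +"; verify_copy is a
made-up helper name, not an existing tool:

```shell
# verify_copy SRC DEST  (hypothetical helper, for illustration only)
# Checksums every regular file under SRC, then re-checks the same
# relative paths under DEST. Prints only the files that fail and
# returns non-zero if any file differs or cannot be read.
verify_copy() {
    src=$1
    dest=$2
    list=$(mktemp) || return 1
    # Record "checksum  ./relative/path" for each file in the source tree...
    (cd "$src" && find . -type f -exec md5sum {} + > "$list") &&
    # ...and verify those same relative paths inside the destination tree.
    (cd "$dest" && md5sum --quiet -c "$list")
    status=$?
    rm -f "$list"
    return $status
}
```

For example, "verify_copy /path/source/subdir /path/dest/subdir" should
stay silent when the contents match and print a FAILED line for each
file that differs or is missing. Running it one subdirectory at a time
keeps each pass short.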
My best practices:
- always use tmux, and set your terminal's scrollback to unlimited (in
reality, either one suffices)
- never put long-running jobs in the background; always use a new tmux
window (or terminal tab) for doing other stuff
- if I anticipate there might be a problem, redirect stderr to stdout
and pipe into a file with tee (easier to grep and parse than tmux or
terminal scrollback) -- if it weren't so complicated I would also keep
separate copies of stdout and stderr, besides the merged one
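The tee idea in the last item can be sketched as below. This is only an
illustration in plain POSIX sh: run_logged and the PREFIX.log /
PREFIX.err / PREFIX.status file names are made up for the example. It
keeps the merged log plus a separate copy of stderr (not a separate
stdout copy), and the relative order of stdout and stderr lines in the
merged log is only approximate because stderr takes an extra hop
through a pipe.

```shell
# run_logged PREFIX CMD [ARGS...]  (hypothetical helper, for illustration)
# Runs CMD, still showing its output on the terminal, while saving:
#   PREFIX.log  stdout and stderr merged
#   PREFIX.err  stderr only
# Returns CMD's exit status.
run_logged() {
    prefix=$1
    shift
    {
        # Inside this group fd 3 is a copy of the outer pipe. The
        # command's stderr goes through the inner tee (which saves
        # PREFIX.err), while its stdout bypasses that tee via fd 3.
        # Both end up in the outer tee, which saves PREFIX.log.
        { "$@"; echo "$?" > "$prefix.status"; } 2>&1 1>&3 3>&- |
            tee "$prefix.err"
    } 3>&1 | tee "$prefix.log"
    # The pipeline hides CMD's exit status, so it is passed via a file.
    status=$(cat "$prefix.status")
    rm -f "$prefix.status"
    return "$status"
}
```

A copy could then be started in its own tmux window with something like
"run_logged /tmp/copy sh -c 'cd /path/source && tar cf - . | (cd /path/dest; tar xf -)'"
and any tar complaints would still be in /tmp/copy.err afterwards, no
matter what else scrolled past. For the scrollback itself, tmux has a
history-limit option that sets how many lines each window keeps.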
On Fri, Sep 9, 2016 at 8:22 PM, Jed S. Baer <blug at jbaer.cotse.net> wrote:
> Hi Folks.
>
> I'm in the middle of some disk migration, owing to upgrading external
> storage from 1TB to 2TB. Configuration is a new Toshiba 2TB SATA drive in
> an external enclosure, connected via USB-2.
>
> I copied a large number of files, appx 167GB, using my favorite method:
> cd /path/source
> tar cf - . | (cd /path/dest; tar xf -)
>
> I'm trying to be on the lookout for any problems with the new drive,
> before I fully commit to the rest of the process, so I periodically fire
> up gsmartcontrol and see if anything's amiss. Now I have two instances of
> an "interface CRC error, command aborted", with further logging indicating
> this was during a DMA WRITE.
>
> I see nothing informative in /var/log/syslog.
>
> (I do see a gripe from smartd about /usr/bin/mail not being there, but
> that's a separate irritation. Possibly it would've mailed something
> useful.)
>
> The other fun part of this is that while tar was running in the
> background, I fired up bluefish to tinker with some HTML. I launch most
> of the things I use from the command line. bluefish has a nasty habit, I
> discovered, of generating huge amounts of mindless bitspew to stdout (or
> stderr) while it's running, thus, when I checked to see if the tar was
> finished, any error messages it might have given were no longer available
> in the terminal scrollback.
>
> File count, and size (according to du -cs) are correct.
>
> A web search indicates this is bad communication between the drive and
> the controller - unsurprising, given the USB2 in the middle.
>
> Finally, here's what I'm wondering: which of the following is more likely?
> 1) Down in the kernel, the ATA driver noticed the error, retried, and
> succeeded
> 2) I have corruption in a file or files
>
> Here's one of the 2 instances of this error, from SMART
>
> SMART Error Log Version: 1
> ATA Error Count: 2
> CR = Command Register [HEX]
> FR = Features Register [HEX]
> SC = Sector Count Register [HEX]
> SN = Sector Number Register [HEX]
> CL = Cylinder Low Register [HEX]
> CH = Cylinder High Register [HEX]
> DH = Device/Head Register [HEX]
> DC = Device Command Register [HEX]
> ER = Error register [HEX]
> ST = Status register [HEX]
> Powered_Up_Time is measured from power on, and printed as
> DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
> SS=sec, and sss=millisec. It "wraps" after 49.710 days.
>
> Error 2 occurred at disk power-on lifetime: 70 hours (2 days + 22 hours)
> When the command that caused the error occurred, the device was active
> or idle.
>
> After command completion occurred, registers were:
> ER ST SC SN CL CH DH
> -- -- -- -- -- -- --
> 84 51 40 a0 58 44 0d Error: ICRC, ABRT 64 sectors at LBA = 0x0d4458a0
> = 222582944
>
> Commands leading to the command that caused the error were:
> CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
> -- -- -- -- -- -- -- -- ---------------- --------------------
> 35 d5 f0 f0 57 44 e0 00 2d+19:22:18.897 WRITE DMA EXT
> 35 d5 f0 00 57 44 e0 00 2d+19:22:18.893 WRITE DMA EXT
> 35 d5 f0 10 56 44 e0 00 2d+19:22:18.889 WRITE DMA EXT
> 35 d5 f0 20 55 44 e0 00 2d+19:22:18.885 WRITE DMA EXT
> 35 d5 f0 30 54 44 e0 00 2d+19:22:18.881 WRITE DMA EXT
>
> Error 1 occurred at disk power-on lifetime: 70 hours (2 days + 22 hours)
> When the command that caused the error occurred, the device was active
> or idle.
> _______________________________________________
> Web Page: http://lug.boulder.co.us
> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> Join us on IRC: irc.hackingsociety.org port=6667 channel=#hackingsociety