[lug] Interface CRC error on USB connected SATA drive

Davide Del Vento davide.del.vento at gmail.com
Fri Sep 9 20:40:58 MDT 2016


Try something like
tar --mtime='2010-01-01' -cf - . | md5sum
from both locations to rule out file corruption (remove --mtime if you
want to include mtime in the checksum). Try it on a subdir first to
check that it works as written (I haven't tested it); the checksums
can also differ if tar reads the files in a different order on the two
filesystems.
You might want to do this in smaller chunks.
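
A variant on the same idea, as an untested-on-your-data sketch: hash
each file separately instead of hashing one big tar stream, so that a
mismatch names the file that differs and directory order stops
mattering. The function names manifest and compare_trees are made up
for this example, and it assumes md5sum, find, sort, diff and mktemp
are installed.

```shell
# manifest DIR: print "checksum  path" for every regular file under DIR,
# sorted by path so both trees list their files in the same order.
manifest() {
    (cd "$1" && find . -type f -exec md5sum {} + | LC_ALL=C sort -k 2)
}

# compare_trees SRC DEST: print nothing and return 0 if the file contents
# match; otherwise print the differing manifest lines and return non-zero.
compare_trees() {
    src_list=$(mktemp) && dest_list=$(mktemp) || return 2
    manifest "$1" > "$src_list"
    manifest "$2" > "$dest_list"
    diff "$src_list" "$dest_list"
    status=$?
    rm -f "$src_list" "$dest_list"
    return "$status"
}
```

For the copy described below that would be
compare_trees /path/source /path/dest
and running it on one subdirectory at a time gives the smaller chunks.
This only compares file contents, not ownership, permissions or mtime.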

My best practices:
- always use tmux, and (in reality "or" suffices) set the scrollback
options in your shell to infinite
- never put long-running jobs in the background; always use a new tmux tab
(or terminal tab) for doing other stuff
- if I anticipate there might be a problem, I redirect stderr to stdout
and pipe into a file with tee (easier to grep and parse than tmux or
terminal scrollback) -- if it weren't so complicated I would also keep
separate copies of stdout and stderr, besides the merged one
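
The tee habit in the last item could be wrapped up roughly like this.
It is a minimal sketch: run_logged is a made-up name, and it only
keeps the merged log, not the separate stdout and stderr copies.
Because a pipeline normally reports tee's exit status, the command's
own status is parked in a side file and returned afterwards.

```shell
# run_logged LOGFILE COMMAND [ARGS...]   (hypothetical helper)
# Run COMMAND with stderr merged into stdout, show the output live,
# and keep a copy of it in LOGFILE.
run_logged() {
    log=$1
    shift
    # The echo writes the command's exit status to a side file, not
    # into the pipe, so the log contains only the command's output.
    { "$@" 2>&1; echo "$?" > "$log.status"; } | tee "$log"
    status=$(cat "$log.status")
    rm -f "$log.status"
    return "$status"
}
```

For example:
run_logged copy.log sh -c 'cd /path/source && tar cf - . | (cd /path/dest && tar xf -)'
Afterwards copy.log can be grepped for errors no matter what else
scrolled past in the terminal.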

On Fri, Sep 9, 2016 at 8:22 PM, Jed S. Baer <blug at jbaer.cotse.net> wrote:
> Hi Folks.
>
> I'm in the middle of some disk migration, owing to upgrading external
> storage from 1TB to 2TB. Configuration is a new Toshiba 2TB SATA drive in
> an external enclosure, connected via USB-2.
>
> I copied a large number of files, appx 167GB, using my favorite method:
>   cd /path/source
>   tar cf - . | (cd /path/dest; tar xf -)
>
> I'm trying to be on the lookout for any problems with the new drive,
> before I fully commit to the rest of the process, so I periodically fire
> up gsmartcontrol and see if anything's amiss. Now I have two instances of
> an "interface CRC error, command aborted", with further logging indicating
> this was during a DMA WRITE.
>
> I see nothing informative in /var/log/syslog.
>
> (I do see a gripe from smartd about /usr/bin/mail not being there, but
> that's a separate irritation. Possibly it would've mailed something
> useful.)
>
> The other fun part of this is that while tar was running in the
> background, I fired up bluefish to tinker with some HTML. I launch most
> of the things I use from the command line. bluefish has a nasty habit, I
> discovered, of generating huge amounts of mindless bitspew to stdout (or
> stderr) while it's running, thus, when I checked to see if the tar was
> finished, any error messages it might have given were no longer available
> in the terminal scrollback.
>
> File count, and size (according to du -cs) are correct.
>
> A web search indicates this is bad communication between the drive and
> the controller - unsurprising, given the USB2 in the middle.
>
> Finally, here's what I'm wondering: which of the following is more likely?
> 1) Down in the kernel, the ATA driver noticed the error, retried, and
> succeeded
> 2) I have corruption in a file or files
>
> Here's one of the 2 instances of this error, from SMART
>
> SMART Error Log Version: 1
> ATA Error Count: 2
>     CR = Command Register [HEX]
>     FR = Features Register [HEX]
>     SC = Sector Count Register [HEX]
>     SN = Sector Number Register [HEX]
>     CL = Cylinder Low Register [HEX]
>     CH = Cylinder High Register [HEX]
>     DH = Device/Head Register [HEX]
>     DC = Device Command Register [HEX]
>     ER = Error register [HEX]
>     ST = Status register [HEX]
> Powered_Up_Time is measured from power on, and printed as
> DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
> SS=sec, and sss=millisec. It "wraps" after 49.710 days.
>
> Error 2 occurred at disk power-on lifetime: 70 hours (2 days + 22 hours)
>   When the command that caused the error occurred, the device was active or idle.
>
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 40 a0 58 44 0d  Error: ICRC, ABRT 64 sectors at LBA = 0x0d4458a0 = 222582944
>
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   35 d5 f0 f0 57 44 e0 00   2d+19:22:18.897  WRITE DMA EXT
>   35 d5 f0 00 57 44 e0 00   2d+19:22:18.893  WRITE DMA EXT
>   35 d5 f0 10 56 44 e0 00   2d+19:22:18.889  WRITE DMA EXT
>   35 d5 f0 20 55 44 e0 00   2d+19:22:18.885  WRITE DMA EXT
>   35 d5 f0 30 54 44 e0 00   2d+19:22:18.881  WRITE DMA EXT
>
> Error 1 occurred at disk power-on lifetime: 70 hours (2 days + 22 hours)
>   When the command that caused the error occurred, the device was active or idle.

