[lug] Interface CRC error on USB connected SATA drive
Lee Woodworth
blug-mail at duboulder.com
Fri Sep 9 22:27:40 MDT 2016
I would vote for a driver retry of a command that didn't
complete within the driver's deadline. That said, I would
still want to know why the command failed or was slow. Dmesg
is where I would start looking for driver error messages.
A full diff after a complete cache flush would test for corruption
and somewhat exercise the drives. Completely powering off the external
drive would be a start for the cache flush. Then a smart short
test might show something abnormal.
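
A minimal sketch of the full-diff check, assuming a diff that accepts
-r and -q (GNU, BSD and busybox diff do). The function name
compare_trees is only illustrative, and /path/source and /path/dest are
the placeholders from the quoted tar command. It is meant to be run
after the external drive has been unmounted and power cycled, so the
reads are more likely to come from the disk than from cached data.

```shell
#!/bin/sh
# Compare two directory trees byte for byte.
# Prints one line per differing or missing file.
# Exit status 0 means no differences were found.
compare_trees() {
    diff -rq "$1" "$2"
}

# Example with the placeholder paths from the original post:
# compare_trees /path/source /path/dest && echo "trees match"
```

Any output at all is worth a closer look, since it names the files
that differ or exist on only one side.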
Possible things going on with the drive/driver:
1) Not sure that I would necessarily blame the USB controller or
protocol**.
If your disk enclosure is USB port powered then it might be the port
isn't providing enough/stable power. RPI users have reported seeing
intermittent disk errors with port-powered drives go away when they
switched to a beefier power brick for the RPI.
2) If you think multiple I/O streams are an issue, upgrading/changing
drivers might help. We have a Thermaltake external USB3-SATA dock that
intermittently times out using the uas driver (USB attached SCSI) but
works fine using just usb-storage. The uas driver has issues with
multiple I/O streams for this dock.
3) I would look at the drive temp from smart. For reference one of
our external USB backup drives has: recorded Min/Max 19/40 (C).
It's possible lots of seeks could increase the temp, but unless the
drive temp is near the max allowed it's not obvious that would produce
your errors. I have seen high temps cause retries, but they were not
intermittent.
4) If you keep getting errors for the same LBA range, it would suggest
there may be a bad-block issue. You would probably also see other
smart errors in that case.
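
To compare LBAs across error entries, the address can be rebuilt from
the register bytes in the SMART error log. The sketch below is only an
illustration (lba_from_regs is a made-up name). It assumes the 28-bit
layout that matches the quoted log entry: the low nibble of DH, then
CH, CL and SN, from most to least significant.

```shell
#!/bin/sh
# Rebuild a 28-bit LBA from the SN, CL, CH and DH register bytes.
# Arguments are two-digit hex values without a 0x prefix, in the
# order SN CL CH DH (the column order used in the SMART error log).
lba_from_regs() {
    sn=$1
    cl=$2
    ch=$3
    dh=$4
    echo $(( ((0x$dh & 0x0f) << 24) | (0x$ch << 16) | (0x$cl << 8) | 0x$sn ))
}

# Registers from the quoted "Error 2" entry: SN=a0 CL=58 CH=44 DH=0d
lba_from_regs a0 58 44 0d   # prints 222582944
```

The result agrees with the LBA = 0x0d4458a0 = 222582944 that smartctl
printed for that entry. Running it on later errors shows whether they
land in the same neighborhood.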
------------------
** An external USB 2 backup drive with 42,042 power on hours (1751+ days) with
3,795,286,730 LBA writes and 2,281,970,819 LBA reads reports no smart errors
at all. That's about 3.47 million LBA reads+writes for each daily backup. This
drive has its own power brick.
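
The per-day figure above can be checked with shell integer arithmetic.
This is just a sketch: lba_per_day is an illustrative name, and it
assumes a shell with 64-bit arithmetic, which is usual for bash and
dash on current systems.

```shell
#!/bin/sh
# Average LBAs transferred per day of power-on time.
# Arguments: power-on hours, LBAs written, LBAs read.
lba_per_day() {
    days=$(( $1 / 24 ))            # whole days of power-on time
    echo $(( ($2 + $3) / days ))   # integer division, fraction dropped
}

# Values from the footnote: 42,042 hours is 1751 whole days
lba_per_day 42042 3795286730 2281970819   # prints 3470735
```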
On 09/09/2016 08:22 PM, Jed S. Baer wrote:
> Hi Folks.
>
> I'm in the middle of some disk migration, owing to upgrading external
> storage from 1TB to 2TB. Configuration is a new Toshiba 2TB SATA drive in
> an external enclosure, connected via USB-2.
>
> I copied a large number of files, appx 167GB, using my favorite method:
> cd /path/source
> tar cf - . | (cd /path/dest; tar xf -)
>
> I'm trying to be on the lookout for any problems with the new drive,
> before I fully commit to the rest of the process, so I periodically fire
> up gsmartcontrol and see if anything's amiss. Now I have two instances of
> an "interface CRC error, command aborted", with further logging indicating
> this was during a DMA WRITE.
>
> I see nothing informative in /var/log/syslog.
>
> (I do see a gripe from smartd about /usr/bin/mail not being there, but
> that's a separate irritation. Possibly it would've mailed something
> useful.)
>
> The other fun part of this is that while tar was running in the
> background, I fired up bluefish to tinker with some HTML. I launch most
> of the things I use from the command line. bluefish has a nasty habit, I
> discovered, of generating huge amounts of mindless bitspew to stdout (or
> stderr) while it's running, thus, when I checked to see if the tar was
> finished, any error messages it might have given were no longer available
> in the terminal scrollback.
>
> File count, and size (according to du -cs) are correct.
>
> A web search indicates this is bad communication between the drive and
> the controller - unsurprising, given the USB2 in the middle.
>
> Finally, here's what I'm wondering: which of the following is more likely?
> 1) Down in the kernel, the ATA driver noticed the error, retried, and
> succeeded
> 2) I have corruption in a file or files
>
> Here's one of the 2 instances of this error, from SMART
>
> SMART Error Log Version: 1
> ATA Error Count: 2
> CR = Command Register [HEX]
> FR = Features Register [HEX]
> SC = Sector Count Register [HEX]
> SN = Sector Number Register [HEX]
> CL = Cylinder Low Register [HEX]
> CH = Cylinder High Register [HEX]
> DH = Device/Head Register [HEX]
> DC = Device Command Register [HEX]
> ER = Error register [HEX]
> ST = Status register [HEX]
> Powered_Up_Time is measured from power on, and printed as
> DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
> SS=sec, and sss=millisec. It "wraps" after 49.710 days.
>
> Error 2 occurred at disk power-on lifetime: 70 hours (2 days + 22 hours)
> When the command that caused the error occurred, the device was active
> or idle.
>
> After command completion occurred, registers were:
> ER ST SC SN CL CH DH
> -- -- -- -- -- -- --
> 84 51 40 a0 58 44 0d Error: ICRC, ABRT 64 sectors at LBA = 0x0d4458a0
> = 222582944
>
> Commands leading to the command that caused the error were:
> CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
> -- -- -- -- -- -- -- -- ---------------- --------------------
> 35 d5 f0 f0 57 44 e0 00 2d+19:22:18.897 WRITE DMA EXT
> 35 d5 f0 00 57 44 e0 00 2d+19:22:18.893 WRITE DMA EXT
> 35 d5 f0 10 56 44 e0 00 2d+19:22:18.889 WRITE DMA EXT
> 35 d5 f0 20 55 44 e0 00 2d+19:22:18.885 WRITE DMA EXT
> 35 d5 f0 30 54 44 e0 00 2d+19:22:18.881 WRITE DMA EXT
>
> Error 1 occurred at disk power-on lifetime: 70 hours (2 days + 22 hours)
> When the command that caused the error occurred, the device was active
> or idle.
> _______________________________________________
> Web Page: http://lug.boulder.co.us
> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> Join us on IRC: irc.hackingsociety.org port=6667 channel=#hackingsociety
>