[lug] Interface CRC error on USB connected SATA drive

Fri Sep 9 22:27:40 MDT 2016

I would vote for a driver retry of a command that didn't
complete within the driver's deadline. That said, I would
still want to know why the command failed or was slow. Dmesg
is where I would start looking for driver error messages.
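
For what it's worth, here is a rough sketch of how the kernel log could be
narrowed down to lines that look like storage trouble. The two patterns are
my own guesses at useful keywords, not an official list of driver messages,
and the names suspicious_lines and read_dmesg are made up for illustration.

```python
import re
import subprocess

# Assumption: these keyword lists are guesses, tune them to your kernel's output.
SUBSYSTEM = re.compile(r"\b(usb|uas|scsi|sd[a-z]+|ata\d+)\b", re.IGNORECASE)
TROUBLE = re.compile(r"error|reset|timeout|timed out|abort|fail", re.IGNORECASE)


def suspicious_lines(log_text):
    """Return log lines that mention a storage subsystem and a trouble keyword."""
    return [
        line
        for line in log_text.splitlines()
        if SUBSYSTEM.search(line) and TROUBLE.search(line)
    ]


def read_dmesg():
    """Return the current kernel log as text (may need root on some systems)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return result.stdout
```

Calling suspicious_lines(read_dmesg()) right after a large copy would be the
idea; an empty list does not prove nothing happened, only that none of the
guessed keywords appeared.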

A full diff after a complete cache flush would test for corruption
and somewhat exercise the drives. Completely powering off the external
drive would be a start for the cache flush. Then a SMART short
self-test might show something abnormal.
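
If a plain "diff -r" is awkward, a checksum comparison along these lines is
one way to do the full compare. This is only a sketch: the function names are
hypothetical, it checks regular files only (symlinks are skipped), and it
should be run after the cache flush so the data really comes off the disk.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source, dest):
    """Return sorted (relative_path, problem) pairs for files under source
    that are missing from dest or whose contents differ."""
    problems = []
    for root, _dirs, files in os.walk(source):
        for name in files:
            src_path = os.path.join(root, name)
            if os.path.islink(src_path):
                continue  # assumption: symlinks are checked some other way
            rel = os.path.relpath(src_path, source)
            dst_path = os.path.join(dest, rel)
            if not os.path.isfile(dst_path):
                problems.append((rel, "missing"))
            elif file_digest(src_path) != file_digest(dst_path):
                problems.append((rel, "differs"))
    return sorted(problems)
```

For the copy described below, compare_trees("/path/source", "/path/dest")
would return an empty list if every file made it across intact.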

Possible things going on with the drive/driver:

1) Not sure that I would necessarily blame the USB controller or
   protocol**.

   If your disk enclosure is USB port powered then it might be the port
   isn't providing enough/stable power. RPI users have reported seeing
   intermittent disk errors with port-powered drives go away when they
   switched to a beefier power brick for the RPI.

2) If you think multiple I/O streams are an issue, upgrading/changing
   drivers might help. We have a Thermaltake external USB3-SATA dock that
   intermittently times out using the uas driver (USB Attached SCSI) but
   works fine using just usb-storage. The uas driver has issues with
   multiple I/O streams for this dock.

3) I would look at the drive temp from SMART. For reference, one of
   our external USB backup drives has: recorded Min/Max 19/40 (C).
   It's possible lots of seeks could increase the temp, but unless the
   drive temp is near the max allowed it's not obvious that would produce
   your errors. I have seen high temps cause retries, but they were not
   intermittent.

4) If you keep getting errors for the same LBA range, it would suggest
   there may be a bad-block issue. You would probably also see other
   SMART errors in that case.
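
To make the LBA check in item 4 concrete, here is a small sketch. The first
function rebuilds the 28-bit LBA from the SN/CL/CH/DH registers the way the
SMART error log prints them; for the one error quoted below it gives the same
value smartctl reported. The second groups error LBAs that sit close
together. The 2048-sector window and both function names are my own
assumptions, not anything smartctl defines.

```python
def lba_from_registers(sn, cl, ch, dh):
    """Rebuild a 28-bit LBA from SMART error log registers.

    SN, CL and CH hold LBA bits 0-23; the low nibble of DH holds bits 24-27.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def cluster_lbas(lbas, window=2048):
    """Group LBAs so that neighbours within `window` sectors share a cluster.

    Assumption: 2048 sectors is an arbitrary closeness threshold.
    """
    clusters = []
    for lba in sorted(lbas):
        if clusters and lba - clusters[-1][-1] <= window:
            clusters[-1].append(lba)
        else:
            clusters.append([lba])
    return clusters


# Registers from the quoted error entry: SN=a0 CL=58 CH=44 DH=0d
error_lba = lba_from_registers(0xA0, 0x58, 0x44, 0x0D)
print(hex(error_lba), error_lba)
```

If later errors keep landing in one cluster, a bad-block problem looks more
plausible; errors scattered across the disk point back toward the cable,
power, or bridge.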

------------------

** An external USB 2 backup drive with 42,042 power-on hours (1751+ days) with
   3,795,286,730 LBA writes and 2,281,970,819 LBA reads reports no SMART errors
   at all. That's about 3.47 million LBA reads+writes for each daily backup. This
   drive has its own power brick.
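
   The per-day figure comes from the numbers above; a quick check, assuming
   the drive has done roughly one backup per power-on day:

```python
power_on_hours = 42_042
lba_writes = 3_795_286_730
lba_reads = 2_281_970_819

# Assumption: one backup per power-on day, so days stands in for backup count.
days = power_on_hours / 24
per_day = (lba_writes + lba_reads) / days

print(f"{days:.2f} days, {per_day / 1e6:.2f} million LBAs per day")
```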

On 09/09/2016 08:22 PM, Jed S. Baer wrote:
> Hi Folks.
> 
> I'm in the middle of some disk migration, owing to upgrading external
> storage from 1TB to 2TB. Configuration is a new Toshiba 2TB SATA drive in
> an external enclosure, connected via USB-2.
> 
> I copied a large number of files, appx 167GB, using my favorite method:
>   cd /path/source
>   tar cf - . | (cd /path/dest; tar xf -)
> 
> I'm trying to be on the lookout for any problems with the new drive,
> before I fully commit to the rest of the process, so I periodically fire
> up gsmartcontrol and see if anything's amiss. Now I have two instances of
> an "interface CRC error, command aborted", with further logging indicating
> this was during a DMA WRITE.
> 
> I see nothing informative in /var/log/syslog.
> 
> (I do see a gripe from smartd about /usr/bin/mail not being there, but
> that's a separate irritation. Possibly it would've mailed something
> useful.)
> 
> The other fun part of this is that while tar was running in the
> background, I fired up bluefish to tinker with some HTML. I launch most
> of the things I use from the command line. bluefish has a nasty habit, I
> discovered, of generating huge amounts of mindless bitspew to stdout (or
> stderr) while it's running, thus, when I checked to see if the tar was
> finished, any error messages it might have given were no longer available
> in the terminal scrollback.
> 
> File count, and size (according to du -cs) are correct.
> 
> A web search indicates this is bad communication between the drive and
> the controller - unsurprising, given the USB2 in the middle.
> 
> Finally, here's what I'm wondering: which of the following is more likely?
> 1) Down in the kernel, the ATA driver noticed the error, retried, and
> succeeded
> 2) I have corruption in a file or files
> 
> Here's one of the 2 instances of this error, from SMART
> 
> SMART Error Log Version: 1
> ATA Error Count: 2
>     CR = Command Register [HEX]
>     FR = Features Register [HEX]
>     SC = Sector Count Register [HEX]
>     SN = Sector Number Register [HEX]
>     CL = Cylinder Low Register [HEX]
>     CH = Cylinder High Register [HEX]
>     DH = Device/Head Register [HEX]
>     DC = Device Command Register [HEX]
>     ER = Error register [HEX]
>     ST = Status register [HEX]
> Powered_Up_Time is measured from power on, and printed as
> DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
> SS=sec, and sss=millisec. It "wraps" after 49.710 days.
> 
> Error 2 occurred at disk power-on lifetime: 70 hours (2 days + 22 hours)
>   When the command that caused the error occurred, the device was active or idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 40 a0 58 44 0d  Error: ICRC, ABRT 64 sectors at LBA = 0x0d4458a0 = 222582944
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   35 d5 f0 f0 57 44 e0 00   2d+19:22:18.897  WRITE DMA EXT
>   35 d5 f0 00 57 44 e0 00   2d+19:22:18.893  WRITE DMA EXT
>   35 d5 f0 10 56 44 e0 00   2d+19:22:18.889  WRITE DMA EXT
>   35 d5 f0 20 55 44 e0 00   2d+19:22:18.885  WRITE DMA EXT
>   35 d5 f0 30 54 44 e0 00   2d+19:22:18.881  WRITE DMA EXT
> 
> Error 1 occurred at disk power-on lifetime: 70 hours (2 days + 22 hours)
>   When the command that caused the error occurred, the device was active or idle.