[lug] Stale NFS File Handle
Dan Ferris
dan at usrsbin.com
Thu Aug 4 19:42:47 MDT 2005
It was occurring under load. I wrote a shell script to see if the
automounter was causing the problem. I'm running it on our Sun E450
Solaris POS, a Fedora Core 3 box, and a Red Hat Enterprise 3 box. Most
of the problems were actually on the Sun box. The Sun box is MUCH
more susceptible to this problem than any of the Linux boxes. The Sun
box runs Apache from /usr/local, which is an NFS mount. Its DocumentRoot
is also coming via NFS. I have grad students that use the web
server on the Sun box all day, every day. Plus there are people from all
over the world that access it 24x7. So the load isn't high, but it's
constant. The rest of the load on the NFS server comes from the Linux
workstations. During the mornings the load does get pretty high. I've
seen load averages well over 5 at times and over 10 when it gets really bad.
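For anyone who wants to run the same kind of test, a minimal sketch of
such a loop (the mount point below is a placeholder, not our real path)
would be:

    #!/bin/sh
    # Repeatedly walk an automounted tree and collect any
    # "Stale NFS file handle" errors for later review.
    MNT=/home/testuser      # placeholder automount point
    LOG=/tmp/stale.log
    while true; do
        ls -lR "$MNT" > /dev/null 2>> "$LOG"
        sleep 1
    done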
According to the NFS FAQ (http://nfs.sourceforge.net/), there are only
five reasons you will get this error:
- The file resides in an export that is not accessible. It could have
been unexported, the export's access list may have changed, or the
server could be up but simply not exporting its shares.
I don't think this is my problem. All the clients and the server have
been rebooted since the migration to LVM2. Some have been rebooted
multiple times.
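This one is also easy to rule out from a client, for what it's worth;
"fileserver" below is a stand-in for the real hostname:

    # Ask the server what it is currently exporting
    showmount -e fileserver
    # Compare against what this client thinks it has mounted
    mount | grep fileserver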
- The file handle refers to a deleted file. After a file is deleted on
the server, clients don't find out until they try to access the file
with a file handle they had cached from a previous LOOKUP. Using rsync
or mv to replace a file while it is in use on another client is a common
scenario that results in an ESTALE error.
I doubt this as well. Most people use only one workstation; if somebody
deleted somebody else's file, the problem would only happen to one or
two people, once or twice.
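That failure mode is easy to demonstrate by hand if anyone is curious.
A rough sketch with made-up paths:

    # Client A: open the file so a handle gets cached
    tail -f /mnt/nfs/report.txt &
    # Client B: replace the file, which unlinks the inode
    # that client A's cached handle points at
    mv /mnt/nfs/report.new /mnt/nfs/report.txt
    # Client A's next access via the old handle gets ESTALE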
- The file was renamed to another directory, and subtree checking is
enabled on a share exported by a Linux NFS server. See question C7
<http://nfs.sourceforge.net/#faq_c7> for more details on subtree
checking on Linux NFS servers.
This makes the most sense, especially with the automounter. I did turn
off subtree checking, which seems to have helped.
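For anyone else who hits this, it's one option in /etc/exports on the
Linux server; the path and netmask here are just examples:

    # /etc/exports -- turn off subtree checking on the export
    /export/home  192.168.1.0/24(rw,no_subtree_check)

    # Re-read the exports table without restarting nfsd
    exportfs -ra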
- The device ID of the partition that holds your exported files has
changed. File handles often contain all or part of a physical device
ID, and that ID can change after a reboot, RAID-related changes, or a
hardware hot-swap event on your server. Using the "fsid" export option
on Linux will force the fsid of an exported partition to remain the
same. See the "exports" man page for more details.
I might set this at a later date, but it's not worth the hassle of
rebooting everything again, especially since I'm not doing any failover
or redundancy.
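If I do get around to it, it's just another export option; the fsid
value is arbitrary as long as it stays unique and constant per export:

    # /etc/exports -- pin the filesystem ID so file handles
    # survive device renumbering after reboots or hot swaps
    /export/home  192.168.1.0/24(rw,no_subtree_check,fsid=1)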
- The exported file system doesn't support permanent inode numbers.
Exporting FAT file systems via NFS is problematic for this reason. This
problem can be avoided by exporting only local filesystems which have
good NFS support. See question C6 <http://nfs.sourceforge.net/#faq_c6>
for more information.
Nope. According to the FAQ, ext3, jfs, xfs, and reiser should all work
fine over NFS.
The only other possibility is "other," which would mean problems with LVM2.
At the moment everything finally seems stable.
I can use rsync to put a fairly high load on the ext3 volume to see if
the problem occurs there.
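Something along these lines should do it (the source tree and mount
point are placeholders):

    # Generate sustained load on the NFS-mounted ext3 volume
    # and watch the rsync output for stale handle errors
    while true; do
        rsync -a --delete /usr/share/ /mnt/ext3test/ 2>&1 | grep -i stale
    done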
Dan Ferris
Lee Woodworth wrote:
> Dan Ferris wrote:
>
>> An Update...
>>
>> Turning off subtree checking seems to have mitigated the problem, but
>> not solved it.
>
> Do things change under load? The problem I had didn't show up unless
> there was some load. But once there was load the stale handle error
> consistently occurred -- rsyncing a dir tree with 110,000+ files would
> always trigger the problem. I use xfs with lvm2. I didn't change file
> systems to fix the problem.
>
>>
>> Tomorrow I'm going to test out an LVM volume with ext3 instead of
>> jfs. I'm going to test with both fstab mounts and the automounter to
>> see if the automounter makes any difference. If ext3 solves the
>> problem, I guess I'll just have to figure out a way to live with its
>> limits. Or I'll try reiserfs to see if that works. I've had
>> problems with reiser in the past, maybe it's better now. It would
>> kind of suck to move off of jfs, I've been very happy with it.
>>
>> I'll report back to the list what I find out.
>>
>> Thank you again for all the responses, they have been most helpful.
>>
>> Dan Ferris
>>