[lug] Cluster File Systems

Lee Woodworth blug-mail at duboulder.com
Wed Aug 8 02:12:31 MDT 2018


Thanks Davide for the link. Interesting info.

For the curious these might be interesting:
    https://en.wikipedia.org/wiki/List_of_file_systems
    https://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems

I naively thought that after 35+ years, simple cluster-file-system
setup would have gotten easier. Instead, both the scope and the
complexity have grown.

What I am after is removing SPOFs due to server and/or disk outages (e.g.
a server kernel upgrade or server replacement). The file server exports
NFSv4 from raid-backed file systems (xfs mostly) on lvm. The uses range
from /home to the postfix mail queue/mailboxes, a backup OSM database
(postgres), nearline file storage, a video archive and build roots (e.g.
rpi, rock64, apu2). The non-hot-swap disks are in the server case and
there isn't another system available that can accept them.

For those interested in what I understand about some of the options:

o These days, cluster/distributed with respect to storage implies lots of
   clients, hundreds of terabytes or more, and a desire for high
   'performance' among the clients. Whether performance means latency or
   throughput seems to vary by file system.

o The 'largeness' orientation means the hardware requirements might be
   substantial even for small deployments (e.g. ceph).

o Lustre -- complicated to set up, very much tied to the RH distros via
   kernel requirements.

o Ceph -- looks complicated to set up, and memory requirements are pretty
   large for small deployments. Operating RAM for OSDs is ~500M each, but
   recovery operations could require 1GB/TB of storage, i.e.
   4TB disks -> 4GB RAM for recovery (rough sizing sketch below).
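
   For a rough feel of the numbers, here is my own back-of-the-envelope
   sizing sketch in python, using the ~500M operating and ~1GB/TB recovery
   figures above (not official requirements; adjust for your Ceph version):

     # Back-of-the-envelope OSD RAM sizing for a small Ceph node.
     # The constants are the rough figures cited above, not official numbers.
     OPERATING_MB_PER_OSD = 500    # steady-state RAM per OSD daemon
     RECOVERY_GB_PER_TB = 1.0      # RAM per TB of OSD storage during recovery

     def node_ram_gb(osd_sizes_tb):
         """Worst-case RAM (GB) for one node hosting the given OSD disks."""
         operating = len(osd_sizes_tb) * OPERATING_MB_PER_OSD / 1024.0
         recovery = sum(osd_sizes_tb) * RECOVERY_GB_PER_TB
         return operating + recovery

     # e.g. a small node with four 4TB OSDs:
     print(node_ram_gb([4, 4, 4, 4]))   # ~18 GB during recovery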

o BeeGFS/fhGFS -- Mirroring (replication) and file ACLs are only
   available in the enterprise version (see the LICENSE file).

o OpenStack Swift -- swift by itself is an object store. The swift-on-
   file backend provides 'POSIX' file access (maybe via NFS protocols),
   but does not do replication.

   https://github.com/openstack/swiftonfile:
   >> Swift-On-File currently works only with Filesystems with extended
   >> attributes support. It is also recommended that these Filesystems
   >> provide data durability as Swift-On-File should not use Swift's
   >> replication mechanisms.

o GlusterFS -- distributed, replicated file system (glusterfs.org)
   Not recommended for big file access (e.g. databases). FUSE-based
   client access to cluster data allows for automatic failover (a rough
   FUSE-mount sketch follows this entry). I don't know if a cluster
   server can also be a client, say via FUSE -- e.g. a server running
   postfix that writes to the shared mailstore.

   https://docs.gluster.org/en/latest/Administrator%20Guide/Setting%20Up%20Clients/
     says NFS access is limited to NFSv3.

   ~3yrs ago I had problems with lots of rapid reads/writes to gluster-
   served files. Maybe it's generally better now, but
     https://docs.gluster.org/en/latest/Administrator%20Guide/Linux%20Kernel%20Tuning/
   says:
   >> Having had a fair bit of experience looking at large memory
   >> systems with heavily loaded regressions, be it CAD, EDA or similar
   >> tools, we've sometimes encountered stability problems with Gluster.
   >> We had to carefully analyse the memory footprint and amount of
   >> disk wait times over days. This gave us a rather remarkable story
   >> of disk trashing, huge iowaits, kernel oops, disk hangs etc.
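
   For reference, the FUSE-mount sketch mentioned above (my guess at a
   minimal setup; I'm assuming the standard glusterfs mount helper and its
   backup-volfile-servers option, and the host/volume names are made up):

     # Sketch: mount a Gluster volume via FUSE with a fallback volfile server,
     # so the client can fail over if the first server is unreachable.
     # Hostnames and the volume name ("gv0") are placeholders.
     import subprocess

     subprocess.run(
         ["mount", "-t", "glusterfs",
          "-o", "backup-volfile-servers=gluster2.example.net",
          "gluster1.example.net:/gv0", "/mnt/gv0"],
         check=True)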

o OpenAFS -- distributed, descriptions don't mention replication.
   Requires Kerberos 5 (added complexity in user management).
   IIRC it has non-POSIX file semantics.

o LizardFS (based on MooseFS) -- distributed, replicating
   Metadata servers (suggest 32GB RAM, but 4GB may be ok for 'small' cases)
     dedicated machine per metadata server, need 2: master, shadow
   MetaLogger and ChunkServers (suggest 2GB RAM)
     metaloggers are optional, used when both the master & shadow metadata
     servers fail
   replication can use raid5-like modes, but requires n-data, 1-parity,
   and 1-extra-recovery chunk servers (see the arithmetic sketch after
   this entry)
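
   The arithmetic sketch mentioned above, as I understand the raid5-like
   (erasure-coding) goals -- my own reading, not straight from the docs:

     # Minimum chunk servers for a raid5-like LizardFS goal, as I read it:
     # n data parts + 1 parity part + 1 extra server for recovery.
     def min_chunkservers(n_data_parts):
         return n_data_parts + 1 + 1

     print(min_chunkservers(2))   # 4 chunk servers for a 2-data + 1-parity goal
     print(min_chunkservers(4))   # 6 chunk servers for a 4-data + 1-parity goal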

o OrangeFS (based on PVFS) -- distributed, non-replicating
   might not support POSIX locking
   direct-access client is in mainline kernels 4.6+
   fault tolerance handled through OS/HW raid
     metadata is _not_ replicated even with multiple metadata servers

o XtreemFS -- distributed, file-level replication
   files are replicated on close (big-file issues?)
   server code is in Java, the FUSE client in C++
   stable release 1.5.1 is from 2015; the github repo has some minor
     activity in the last few months

o NAS systems -- still a SPOF
   - need duplicate systems for quick recovery from non-disk failures
   - need some sort of regular syncing to a standby, or manually moving
     disks on failure
   - NFS clients have to kill processes and remount exports
   - firmware lock-in, limited support period for (most?) commercial versions
     2-bay units ~$250 with no disks (netgear, synology, qnap)
     can end users replace the firmware on any of these?
   - rockstor.com Pro 8, $1000 with no disks; mentions DIY, so it might be
     possible to fully manage the system (standard motherboard?)
   - helios 4 DIY NAS (http://kobol.io/helios4)
     kickstarter, 4 non-hot-swap drive bays, 2GB RAM, Marvell Armada 388 SoC
     ~$200 no drives, shipped from HK

o md raid1/5... on iSCSI targets over GbE (tgt for linux)
   - will this work?
   - use rock64s as iSCSI targets w/ 4GB cache, use available
       systems for the primary/standby nfs servers
   - one of the simpler things to do
   - requires a manual switch-over when the nfs server fails; on the
       standby (sketched after this list):
       connect to iSCSI targets
       bring up preconfigured /dev/mdX
       mount exported fs
       update nfs exports
       update dns with new nfs server address
       kill processes, remount exports on clients
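
   To make that switch-over concrete, here is a rough sketch of what the
   standby could run (just my guess at the sequence, wrapped in python;
   the portal addresses, device names, export path, DNS names and key file
   are all placeholders, and the last step assumes dynamic DNS via nsupdate):

     # Sketch of a manual failover to the standby NFS server.
     # All names and addresses below are placeholders for illustration.
     import subprocess

     def run(*cmd):
         print("+", " ".join(cmd))
         subprocess.run(cmd, check=True)

     # 1. Log in to the iSCSI targets exported by the rock64s.
     for portal in ("10.0.0.11", "10.0.0.12"):
         run("iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", portal)
         run("iscsiadm", "-m", "node", "-p", portal, "--login")

     # 2. Assemble the preconfigured md array from the iSCSI disks.
     run("mdadm", "--assemble", "--scan")

     # 3. Mount the filesystem and (re)export it over NFS
     #    (assumes /srv/export is already listed in /etc/exports here).
     run("mount", "/dev/md0", "/srv/export")
     run("exportfs", "-ra")

     # 4. Point DNS at the standby's address.
     update = ("server 10.0.0.1\n"
               "update delete nfs.example.net A\n"
               "update add nfs.example.net 300 A 10.0.0.20\n"
               "send\n")
     subprocess.run(["nsupdate", "-k", "/etc/ddns.key"],
                    input=update.encode(), check=True)

     # Clients still have to kill processes and remount the export by hand.

   The ordering matters: the md array can't assemble until the iSCSI
   sessions are up, and exportfs needs the filesystem mounted first.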

On 08/04/2018 05:31 AM, Davide Del Vento wrote:
> Ops. The link: http://moo.nac.uci.edu/~hjm/fhgfs_vs_gluster.html
> 
> On Fri, Aug 3, 2018 at 7:16 PM, Davide Del Vento <davide.del.vento at gmail.com
>> wrote:
> 
>> Since you and others mentioned a few things I did not know about, while
>> trying to learn more about them I've found this which seems to be quite
>> informative regarding BeeGFS (aka fhGFS) as compared to gluster and a
>> little bit to Lustre.

