[lug] Cluster File Systems
Lee Woodworth
blug-mail at duboulder.com
Wed Aug 8 02:12:31 MDT 2018
Thanks Davide for the link. Interesting info.
For the curious these might be interesting:
https://en.wikipedia.org/wiki/List_of_file_systems
https://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems
I naively thought that after 35+ years, setting up a simple cluster
file system would have gotten easier. Instead the scope has grown,
and the complexity along with it.
What I am after is removing the SPOF caused by server and/or disk outages
(e.g. a server kernel upgrade or server replacement). The file server
exports NFSv4 from RAID-backed file systems (mostly XFS) on LVM. The uses
range from /home to the postfix mail queue/mailboxes, a backup OSM
database (postgres), nearline file storage, a video archive, and build
roots (e.g. rpi, rock64, apu2). The non-hot-swap disks are in the server
case, and there isn't another system available that can accept them.
For those interested in what I understand about some of the options:
o These days, cluster/distributed wrt storage implies lots of clients,
hundreds of terabytes or more, and a desire for high 'performance'
among the clients. Whether performance means latency or throughput
seems to vary by file system.
o The 'largeness' orientation means the hardware requirements might be
substantial even for small deployments (e.g. ceph)
o Lustre -- complicated to set up, very much tied to the RH distros via
kernel requirements.
o Ceph -- looks complicated to set up, and the memory requirements are
pretty large for small deployments. Operating RAM for OSDs is ~500MB
each, but recovery operations could require 1GB of RAM per TB of
storage, i.e. 4TB disks -> 4GB RAM for recovery.
o BeeGFS/fhGFS -- Mirroring (replication) and file ACLs are only
available in the enterprise version (see the LICENSE file).
o OpenStack Swift -- swift by itself is an object store. The swift-on-
file backend provides 'POSIX' file access (maybe via NFS protocols),
but does not do replication.
https://github.com/openstack/swiftonfile:
>> Swift-On-File currently works only with Filesystems with extended
>> attributes support. It is also recommended that these Filesystems
>> provide data durability as Swift-On-File should not use Swift's
>> replication mechanisms.
o GlusterFS -- distributed, replicated file system (glusterfs.org)
Not recommended for big-file access (e.g. databases). FUSE-based
client access to cluster data allows for automatic failover.
I don't know if a cluster server can also be a client, say with FUSE,
e.g. a server running postfix which writes to the shared mail store.
https://docs.gluster.org/en/latest/Administrator%20Guide/Setting%20Up%20Clients/
says NFS access is limited to NFSv3.
~3 years ago I had problems with lots of rapid reads/writes to
gluster-served files. Maybe it's generally better now, but
https://docs.gluster.org/en/latest/Administrator%20Guide/Linux%20Kernel%20Tuning/
says:
>> Having had a fair bit of experience looking at large memory
>> systems with heavily loaded regressions, be it CAD, EDA or similar
>> tools, we've sometimes encountered stability problems with Gluster.
>> We had to carefully analyse the memory footprint and amount of
>> disk wait times over days. This gave us a rather remarkable story
>> of disk trashing, huge iowaits, kernel oops, disk hangs etc.
o OpenAFS -- distributed; the descriptions don't mention replication.
Requires Kerberos 5 (added complexity in user management).
IIRC, non-POSIX file semantics.
o LizardFS (based on MooseFS) -- distributed, replicating
metadata servers (32GB RAM suggested, but 4GB may be ok for 'small'
cases); a dedicated machine per metadata server, and you need 2:
master and shadow
MetaLogger and ChunkServers (2GB RAM suggested)
metaloggers are optional, used when both the master & shadow metadata
servers fail
replication can use raid5-like modes, but that requires n data, 1 parity,
and 1 extra-recovery chunk server
o OrangeFS (based on PVFS) -- distributed, non-replicating
might not support posix locking
direct access client is in mainline kernels 4.6+
fault tolerance handled through os/hw raid
metadata is _not_ replicated even with multiple metadata servers
o XtreemFS - distributed, file-level replication
files are replicated on close (big file issues?)
server code in java, fuse client in C++
stable release 1.5.1 from 2015, github repo has some minor
activity in the last few months
o NAS systems -- still a SPOF
- need duplicate systems for quick recovery from non-disk failures
- need some sort of regular syncing to standby, or manually move
disks on failure
- NFS clients have to kill processes, remount exports
- firmware lock-in, limited support period for (most?) commercial versions
2-bay units are ~$250 with no disks (netgear, synology, qnap);
can end users replace the firmware on any of these?
- rockstor.com Pro 8, $1000 with no disks; they mention DIY, so it might
be possible to fully manage the system (standard motherboard?)
- helios 4 DIY NAS (http://kobol.io/helios4)
kickstarter, 4 non-hot-swap drive bays, 2GB RAM, Marvell Armada 388 SoC
~$200 no drives, shipped from HK
o md raid1/5... on iSCSI targets over GbE (tgt for linux)
- will this work?
- use rock64s as iSCSI targets w/ 4GB cache, use available
systems for the primary/standby NFS servers
- one of the simpler things to do
- requires a manual switch-over when the NFS server fails; on the
standby (a rough sketch of these steps follows after this list):
connect to the iSCSI targets
bring up the preconfigured /dev/mdX
mount the exported fs
update the nfs exports
update dns with the new NFS server address
kill processes and remount exports on the clients
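
To make that switch-over concrete, here is a rough, untested sketch of
what a standby-side fail-over script might look like, assuming open-iscsi
(iscsiadm) on the initiator side, mdadm, and md/fstab/exports entries
already configured on the standby. The IQNs, portal addresses, device
name and mount point are made-up placeholders, and the DNS/client steps
are left as reminders since they depend on the local setup:

    #!/usr/bin/env python3
    # Rough sketch of the standby-side NFS fail-over steps listed above.
    # Assumes: open-iscsi initiator (iscsiadm, with target discovery
    # already done), mdadm, and that the md array, filesystem and
    # /etc/exports entries are preconfigured on the standby machine.
    # All names below (IQNs, portals, device, mount point) are placeholders.
    import subprocess

    ISCSI_TARGETS = [
        # (target IQN, portal) pairs for the rock64 tgt servers
        ("iqn.2018-08.example:rock64-a.disk0", "192.168.1.11:3260"),
        ("iqn.2018-08.example:rock64-b.disk0", "192.168.1.12:3260"),
    ]
    MD_DEVICE = "/dev/md0"          # preconfigured in /etc/mdadm.conf
    MOUNT_POINT = "/srv/nfs/share"  # listed in /etc/fstab and /etc/exports

    def run(*cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. connect to the iSCSI targets
    for iqn, portal in ISCSI_TARGETS:
        run("iscsiadm", "-m", "node", "-T", iqn, "-p", portal, "--login")

    # 2. bring up the preconfigured md array
    run("mdadm", "--assemble", MD_DEVICE)

    # 3. mount the exported filesystem
    run("mount", MOUNT_POINT)

    # 4. (re)export everything listed in /etc/exports
    run("exportfs", "-ra")

    # 5. remaining manual steps: update DNS to point at the standby,
    #    then kill stuck processes and remount the exports on the clients.
    print("Manual: update DNS, then remount on the clients.")

I haven't tested any of this; it's just the listed steps written down,
and the part I'd want to verify first is how md behaves when one of the
iSCSI targets drops off the network.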
On 08/04/2018 05:31 AM, Davide Del Vento wrote:
> Ops. The link: http://moo.nac.uci.edu/~hjm/fhgfs_vs_gluster.html
>
> On Fri, Aug 3, 2018 at 7:16 PM, Davide Del Vento <davide.del.vento at gmail.com
>> wrote:
>
>> Since you and others mentioned a few things I did not know about, while
>> trying to learn more about them I've found this which seems to be quite
>> informative regarding BeeGFS (aka fhGFS) as compared to gluster and a
>> little bit to Lustre.