[lug] Reliable SATA server?

Rob Nagler nagler at bivio.biz
Tue May 8 08:36:41 MDT 2012


Hi Sean,

I hope we're not boring the rest of BLUG, but this is something I don't
see discussed often...  I like learning about ZFS from someone who actually
uses it in production. :)

> We had a customer with a few Dell 29xx boxes, they seemed to be pretty
> good.  Their $300 rack-mount kits, as I've mentioned here before, are
> fantastic.

Yes, the rails are much better than whitebox rails I've had.  They just
snap in and roll.

The refurb box with rails is about $900 delivered.  It has 2 x quad-core
3 GHz Xeons, 8GB RAM, a redundant power supply, dual Ethernet, and a remote
access card.  It's really quite a beast, way overkill for my problem, and
quite a deal IMHO.

> For example, copying my home directory with "cp -al" takes 75 seconds:

On 7.5M files and 0.5M directories, it takes about 2:45 on my slow server.
On the faster server (same disks), it takes about 2:30.
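
In case anyone isn't following the hard-link trick Sean and I are timing,
it's roughly this (the paths are made up for illustration):

    # Make a hard-linked copy of the previous backup tree; unchanged files
    # share inodes, so only directory metadata actually gets written.
    cp -al /backup/home.yesterday /backup/home.today
    # Removing an old copy later only decrements link counts; data still
    # linked from other trees is untouched.
    rm -rf /backup/home.oldest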

> And then removing it takes 63 seconds.

Removes take approx 3:00 and 2:20, respectively.

> Comparatively, a snapshot takes well less than a second to create *AND*
> delete:

That makes sense.  ZFS is surely well-designed for high performance and
reliability.  My contention is that it is "new", and I'll let others work
out the bugs.  (There was one data loss bug as recently as 2010, I believe.)
That's certainly selfish, but I do enough for the common green in other
areas. :)
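
For reference, the constant-time operations Sean is comparing against look
roughly like this (pool and dataset names are hypothetical):

    # Creating and destroying a ZFS snapshot are metadata-only operations,
    # so both finish in well under a second regardless of tree size.
    zfs snapshot tank/home@2012-05-08
    zfs destroy tank/home@2012-05-08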

>> Thrashing is running out of resources.
>
> *EXACTLY*.  You only have so many I/Os you can do in a second.  If you have
> millions of I/Os that you are trying to do, doing anything else that
> generates I/Os, like logging into a system, will cause each one to have a
> dramatically increased latency.

I think we are talking apples and oranges here.  My server is dedicated to
backup.  It is designed for "peak load", that is, it consumes all the
resources on the machine that are available to do its job, and no more.
This is similar to graphics cards, number-crunching boxes, etc.  These are
all "batch" load machines, and logging in to them and poking around is
going to be slow when they are under peak load.
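
If you want to see that saturation directly rather than infer it from a
sluggish login, the usual check is something like this (the interval is
just what I'd reach for):

    # iostat from the sysstat package: %util near 100 with high await means
    # the disks themselves are the bottleneck, not CPU or memory.
    iostat -x 5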

All I require is that the machine be done with its daily batch job in 24
hours.  It does that in plenty of time.  The biggest job, in fact, is not
the cp --link, but the weekly tars.  They take well over 24 hours on both
machines.  The tar runs between different disks, and the writes are very
sequential (typical file size is 100MB).  The compression is probably a big
cost on the slow boxes.  I'm curious how the Dell will fare.
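
In sketch form, the weekly job is something along these lines (the paths,
the gzip choice, and the naming are assumptions for illustration):

    # Stream one complete backup tree onto a separate disk, compressed.
    # With ~100MB files the writes are nicely sequential; gzip is likely
    # the real cost on the slower boxes.
    tar -czf /mnt/vault/weekly-2012-05-08.tar.gz -C /backup home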

Why do I do the tar?  So I can put disks in a vault, and so I can also
store multiple complete backups online on independent disks.  I've been
fortunate that the compressed version of my backup data has kept pace with
the largest 2.5" drive available for the last couple of years.  That makes
it possible to store a lot of data in one vault.

You may say that I could make it all much faster with ZFS.  And, someday,
I may do that.  However, these backups are the lifeblood of my and my
customers' businesses.  I have seen too many scary stories about backup
software going awry.  If someone attacks my systems and destroys
everything, I can get them back up and running with the data in my
vault(s).

As it is, the systems have plenty of capacity (as long as they stay running).
I don't care if the disks are rattling 7x24 as long as the complete process
including weeklies finishes in a week.

The only thing that needs to finish ASAP is pulling the data off the disks
of the other systems.  That's a network problem for the most part.  That's
why I'm also building standby servers, which will update and be validated
in real time.
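
That pull is basically an rsync-style job; a minimal sketch with a
hypothetical host and paths:

    # Mirror a production host's data onto the backup/standby server.
    # -a preserves ownership, permissions, and timestamps; --delete
    # propagates removals so the copy stays an exact mirror.
    rsync -a --delete prodhost:/home/ /backup/prodhost/home/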

> file).  If each one of these operations now takes several seconds, a simple
> operation like running "ls" can start taking minutes to complete.

I created a red herring with the login thing.  I expect it to be slow on a busy
system.  The problem is not slow logins but system failures.  For some reason,
under these loads, whitebox servers I've owned don't cut the mustard
-- to be fair, I was able to assemble one which has worked pretty well, but
it can only have one CPU.  Hopefully, the Dell will work better.

> I'm not doing some theoretical discussion here, I have personally observed
> that "cp -l" can bring a system to its knees and make it unresponsive.

Agreed.

Rob
