[geeks] R720 NAS, continued
Jonathan Patschke
jp at celestrion.net
Fri Jan 21 01:13:50 CST 2022
On Tue, 18 Jan 2022, Phil Stracchino wrote:
> Unfortunately at the time I bought the SFF R720, I didn't realize that *all*
> 2.5" mechanical drives over 1TB are SMR. I also didn't realize that the
> write performance impact was so gigantic.
Write performance is an utter disaster with filesystems that are not
SMR-aware. A copy-on-write filesystem like ZFS can do decently with a
very large recordsize, provided you have the SLOG writing to something
fast like mirrored SSD.
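For a rough sense of why the big recordsize matters, here's some
back-of-the-envelope arithmetic (all sizes below are made up for the
example, not tuning advice):

# Why a large recordsize helps on drive-managed SMR: copy-on-write means
# each dirty record goes out as one contiguous write, so fewer, larger
# records give the drive's translation layer far fewer chances to kick
# off a band read-modify-write. Sizes are illustrative assumptions.

DIRTY_DATA   = 256 * 1024 * 1024   # 256 MiB of scattered updates to flush
SMALL_RECORD = 16 * 1024           # recordsize=16K
LARGE_RECORD = 1024 * 1024         # recordsize=1M

print(DIRTY_DATA // SMALL_RECORD)  # 16384 separate writes reach the disk
print(DIRTY_DATA // LARGE_RECORD)  #   256 separate writes reach the disk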
On SMR drives that have a "conventional" range of LBAs, you can use that
to hold things like SLOG in a pinch. Really, though, you want a fully
SMR-aware filesystem with fast random-access scratchspace on another
device.
I got to design and start the implementation of such a filesystem at
$job[-1] because there's a type of NVMe device (so-called "Zoned
Namespace" devices) that exposes each flash page for sequential writes
in much the same way[0]. All sorts of fundamental assumptions turn into
performance-killing hills when you can't "just go back and update"
nearly anything.
A rather typical sort of organization in a SAN controller would be a
B+tree with the leaves on SMR and the internal index nodes on
conventional storage. With a copy-on-write approach to whole-page
updates, something like that could make SMR really sing. I'd fully
expect something like that to back the super-cheap "object storage"
that's popular on hosting platforms if they weren't just using Ceph.
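A minimal sketch of that split, in Python rather than anything
resembling real controller code (the 64 MiB zone size and every name
below are my own illustrative assumptions): leaf data only ever appends
to the tail of an SMR zone, while the small index updates in place on
conventional storage, so an overwrite never touches already-shingled
data.

# Illustrative sketch only: leaves append to SMR zones, the index updates
# in place on conventional storage. Zone size and all names are assumptions.

ZONE_SIZE = 64 * 1024 * 1024            # assumed SMR zone size, in bytes

class SmrZones:
    """Shingled region: writes may only land at the tail of the open zone."""
    def __init__(self):
        self.zones = [bytearray()]

    def append(self, record: bytes) -> tuple[int, int, int]:
        if len(self.zones[-1]) + len(record) > ZONE_SIZE:
            self.zones.append(bytearray())   # zone full: open a fresh one
        zone = self.zones[-1]
        offset = len(zone)
        zone += record                       # strictly sequential within the zone
        return (len(self.zones) - 1, offset, len(record))

class ConventionalIndex:
    """Random-write region: key -> location of the newest leaf copy."""
    def __init__(self):
        self.where: dict[str, tuple[int, int, int]] = {}

class CowStore:
    """Copy-on-write: an update appends a new leaf and repoints the index."""
    def __init__(self):
        self.leaves = SmrZones()
        self.index = ConventionalIndex()

    def put(self, key: str, value: bytes):
        self.index.where[key] = self.leaves.append(value)  # cheap in-place repoint

    def get(self, key: str) -> bytes:
        zone, offset, length = self.index.where[key]
        return bytes(self.leaves.zones[zone][offset:offset + length])

store = CowStore()
store.put("blob-1", b"first version")
store.put("blob-1", b"second version")   # no rewrite of the old leaf
print(store.get("blob-1"))               # b'second version'

A real B+tree would also keep interior nodes, free-space accounting,
and some way to reclaim dead leaves, but the point is that the shingled
medium only ever sees sequential appends.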
> Having now experienced it firsthand, I think Seagate et al downplay
> the write performance deficit to such an extent that there are
> potential grounds for a class action suit against the entire storage
> industry for misrepresentation of the technology.
I think only WD have exposure there because they pushed SMR storage for
entry-level NAS applications, where it's wholly inappropriate. For an
end user's Wintendo box, sluggish writes will get adequately cached,
minimizing performance woes on a read-mostly use case.
> Because this is just terrible. It's not, say, 10% or 15% slower on writes
> than the CMR drives in the failed X4540. It is 3 or 4 *TIMES* slower.
Well, yes. It's on-disk write amplification via read-modify-write
cycles. Every synchronous write means the whole (usually 64MB) zone
gets read, updated in cache, and rewritten. SMR drives usually have
enough cache for one or two full zones, so you get the added bonuses of
implicitly flushing the activity queue and evicting anything useful from
the readahead cache, too.
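Back-of-the-envelope, taking the 64 MB zone figure above and a 4 KiB
synchronous write (both numbers illustrative; the drive's cache and
coalescing keep the observed slowdown far below this worst case):

# Worst-case media traffic for one small sync write into a drive-managed
# SMR zone, per the read-modify-write cycle described above. Sizes assumed.

ZONE_BYTES  = 64 * 1024 * 1024    # whole zone read, then written back
WRITE_BYTES = 4 * 1024            # the application's actual write

media_traffic = 2 * ZONE_BYTES                 # read 64 MiB + rewrite 64 MiB
amplification = media_traffic / WRITE_BYTES

print(media_traffic // 2**20, "MiB of media traffic")   # 128 MiB
print(int(amplification), "x write amplification")      # 32768x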
[0] Flash SSDs mostly work like SMR drives, and the controller does a
*lot* of LBA remapping for wear-leveling and other concerns.
Getting the controller out of the way and doing that in software
    (thus, taking on the responsibility of writing sequentially within a
page) can really remove a lot of jitter from write latencies.
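    A hedged sketch of what "doing that in software" can look like,
    assuming a ZNS-style write-pointer-per-zone model (zone counts,
    sizes, and every name below are mine, purely for illustration): the
    host keeps its own logical-to-physical map and only ever writes at a
    zone's write pointer, so the device never remaps behind its back.

# Host-side LBA remapping over append-only zones, ZNS-style. Everything
# here (sizes, names) is an illustrative assumption, not a real driver API.

ZONE_BLOCKS = 16                  # blocks per zone (tiny, for the example)
NUM_ZONES   = 4

class ZonedDevice:
    """Fake zoned device: each zone only accepts writes at its write pointer."""
    def __init__(self):
        self.zones = [[None] * ZONE_BLOCKS for _ in range(NUM_ZONES)]
        self.write_ptr = [0] * NUM_ZONES

    def zone_append(self, zone: int, data) -> int:
        wp = self.write_ptr[zone]
        if wp >= ZONE_BLOCKS:
            raise IOError("zone full")
        self.zones[zone][wp] = data           # sequential within the zone
        self.write_ptr[zone] = wp + 1
        return wp                             # physical block within the zone

class HostFtl:
    """Host-side translation layer: logical block -> (zone, block)."""
    def __init__(self, dev: ZonedDevice):
        self.dev = dev
        self.l2p: dict[int, tuple[int, int]] = {}
        self.open_zone = 0

    def write(self, lba: int, data):
        try:
            block = self.dev.zone_append(self.open_zone, data)
        except IOError:                       # zone full: move to the next one
            self.open_zone += 1
            block = self.dev.zone_append(self.open_zone, data)
        self.l2p[lba] = (self.open_zone, block)   # an overwrite is just a remap

    def read(self, lba: int):
        zone, block = self.l2p[lba]
        return self.dev.zones[zone][block]

ftl = HostFtl(ZonedDevice())
ftl.write(7, "v1")
ftl.write(7, "v2")            # logical overwrite lands on a new physical block
print(ftl.read(7))            # "v2"; reclaiming the stale block is GC's job

    Because nothing in the data path waits on the device to do its own
    remapping or garbage collection, write latencies stay predictable,
    which is the jitter point above.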
--
Jonathan Patschke
Austin, TX
USA