Andriy Gapon
2016-08-03 14:25:45 UTC
I would like to get more comments about the following review request:
https://reviews.freebsd.org/D6533
Both about the code changes and about the general direction of the changes.
I've been asked to try to get this change into 11.0 because of
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158
So, there is some urgency to my request.
To be honest I've never wanted to rush in this change, but I probably
have been sitting on it for too long.
ZFS POSIX Layer is originally written for Solaris VFS which is very
different from FreeBSD VFS. Many things that FreeBSD VFS manages on
behalf of all filesystems are also (re-)implemented in ZPL in a
different way. As a result, ZPL contains code that is redundant on
FreeBSD or duplicates VFS functionality or, in the worst cases, badly
interacts / interferes with VFS.
I am talking mostly about the operations that work on a filesystem
namespace rather than on contents and properties of files.
The former operations typically work on 2 or more vnodes. The vnodes
are usually locked in a particular way or are expected to be locked in a
particular way by the filesystem.
ZPL on the other hand has quite an elaborate locking system of its own.
It includes a lock that protects a znode, a lock that protects the
znode's notion of its parent, locks that protect a name to znode
relationship for the znode's children.
So, we have two competing locking systems for the filesystem nodes and
they are not aware of each other.
The most prominent problem is a deadlock caused by the lock order
reversal of vnode locks that may happen with concurrent zfs_rename() and
lookup(). The deadlock is a result of zfs_rename() not observing the
vnode locking contract expected by the VFS. I should mention here that
VOP_RENAME is the weirdest method with respect to locking requirements.
Every filesystem implementation is expected to perform quite an
elaborate locking "dance" in VOP_RENAME. This is in contrast to all
other operations where all necessary locking is done by the VFS before
calling into the filesystem code.
So, I started by fixing zfs_rename() but that required to make some
changes to the "bottom half" of the ZPL (the "upper half" being the code
that implements the vnode and VFS operations). And that required making
changes in other VOPs anyway. And so the next step was to remove
redundant re-lookup of child nodes in methods like zfs_rmdir and
zfs_remove. Before I could stop I removed all ZPL internal locking that
was supposed to protect parent-child relationships of filesystem nodes.
Those relationships are now protected exclusively by the vnode locks and
the code is changed to take advantage of that fact and to properly
interact with VFS.
As a minor detail, the removal of the internal locking allowed all ZPL
dmu_tx_assign calls to use TXG_WAIT mode.
Another change that was not strictly required and which is probably too
intrusive is killing the support for case insensitive operations. My
thinking was that FreeBSD VFS does not provide support for those anyway.
But I'll probably restore the code, at least in the bottom half of the
ZPL, before committing the change.
On a more general level, I decided that the upper half of ZPL would
necessarily be quite different between different VFS models and so it
does not make much sense to keep huge chunks of ifdef-ed out code (or
compiled in, but never reached code) that's not useful in FreeBSD at
all. I think that keeping that code makes it harder to maintain the
FreeBSD code and to _correctly_ merge upstream changes. E.g., if a
change can be merged without conflicts to an ifdef-ed out blocked, that
does not mean that the merge is correct.
Last, but not least, the work was sponsored by HybridCluster / ClusterHQ
Inc back in 2013.
https://reviews.freebsd.org/D6533
Both about the code changes and about the general direction of the changes.
I've been asked to try to get this change into 11.0 because of
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158
So, there is some urgency to my request.
To be honest I've never wanted to rush in this change, but I probably
have been sitting on it for too long.
ZFS POSIX Layer is originally written for Solaris VFS which is very
different from FreeBSD VFS. Many things that FreeBSD VFS manages on
behalf of all filesystems are also (re-)implemented in ZPL in a
different way. As a result, ZPL contains code that is redundant on
FreeBSD or duplicates VFS functionality or, in the worst cases, badly
interacts / interferes with VFS.
I am talking mostly about the operations that work on a filesystem
namespace rather than on contents and properties of files.
The former operations typically work on 2 or more vnodes. The vnodes
are usually locked in a particular way or are expected to be locked in a
particular way by the filesystem.
ZPL on the other hand has quite an elaborate locking system of its own.
It includes a lock that protects a znode, a lock that protects the
znode's notion of its parent, locks that protect a name to znode
relationship for the znode's children.
So, we have two competing locking systems for the filesystem nodes and
they are not aware of each other.
The most prominent problem is a deadlock caused by the lock order
reversal of vnode locks that may happen with concurrent zfs_rename() and
lookup(). The deadlock is a result of zfs_rename() not observing the
vnode locking contract expected by the VFS. I should mention here that
VOP_RENAME is the weirdest method with respect to locking requirements.
Every filesystem implementation is expected to perform quite an
elaborate locking "dance" in VOP_RENAME. This is in contrast to all
other operations where all necessary locking is done by the VFS before
calling into the filesystem code.
So, I started by fixing zfs_rename() but that required to make some
changes to the "bottom half" of the ZPL (the "upper half" being the code
that implements the vnode and VFS operations). And that required making
changes in other VOPs anyway. And so the next step was to remove
redundant re-lookup of child nodes in methods like zfs_rmdir and
zfs_remove. Before I could stop I removed all ZPL internal locking that
was supposed to protect parent-child relationships of filesystem nodes.
Those relationships are now protected exclusively by the vnode locks and
the code is changed to take advantage of that fact and to properly
interact with VFS.
As a minor detail, the removal of the internal locking allowed all ZPL
dmu_tx_assign calls to use TXG_WAIT mode.
Another change that was not strictly required and which is probably too
intrusive is killing the support for case insensitive operations. My
thinking was that FreeBSD VFS does not provide support for those anyway.
But I'll probably restore the code, at least in the bottom half of the
ZPL, before committing the change.
On a more general level, I decided that the upper half of ZPL would
necessarily be quite different between different VFS models and so it
does not make much sense to keep huge chunks of ifdef-ed out code (or
compiled in, but never reached code) that's not useful in FreeBSD at
all. I think that keeping that code makes it harder to maintain the
FreeBSD code and to _correctly_ merge upstream changes. E.g., if a
change can be merged without conflicts to an ifdef-ed out blocked, that
does not mean that the merge is correct.
Last, but not least, the work was sponsored by HybridCluster / ClusterHQ
Inc back in 2013.
--
Andriy Gapon
Andriy Gapon