[Bug 209158] node / npm triggering zfs rename deadlock

Discussion:

(too old to reply)

b***@freebsd.org

2016-05-02 02:56:10 UTC

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

Mark Linimon <***@FreeBSD.org> changed:

What |Removed |Added
----------------------------------------------------------------------------
Assignee|freebsd-***@FreeBSD.org |freebsd-***@FreeBSD.org

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-23 21:51:32 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

Doug Luce <***@freebsd.con.com> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |***@freebsd.con.com

--- Comment #1 from Doug Luce <***@freebsd.con.com> ---
Same thing is happening for me:

% procstat -kk 2291
PID TID COMM TDNAME KSTACK
2291 101307 node - mi_switch+0xe1
sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d kern_kevent+0x401
sys_kevent+0x12a amd64_syscall+0x351 Xfast_syscall+0xfb
2291 101353 node V8 WorkerThread mi_switch+0xe1
sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125
do_sem_wait+0x4aa __umtx_op_sem_wait+0x73 amd64_syscall+0x351
Xfast_syscall+0xfb
2291 101354 node V8 WorkerThread mi_switch+0xe1
sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125
do_sem_wait+0x4aa __umtx_op_sem_wait+0x73 amd64_syscall+0x351
Xfast_syscall+0xfb
2291 101355 node V8 WorkerThread mi_switch+0xe1
sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125
do_sem_wait+0x4aa __umtx_op_sem_wait+0x73 amd64_syscall+0x351
Xfast_syscall+0xfb
2291 101356 node V8 WorkerThread mi_switch+0xe1
sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125
do_sem_wait+0x4aa __umtx_op_sem_wait+0x73 amd64_syscall+0x351
Xfast_syscall+0xfb
2291 101357 node - mi_switch+0xe1 sleepq_wait+0x3a
sleeplk+0x15d __lockmgr_args+0x902 vop_stdlock+0x3c VOP_LOCK1_APV+0xab
_vn_lock+0x43 zfs_lookup+0x3c2 zfs_freebsd_lookup+0x6d
VOP_CACHEDLOOKUP_APV+0xa1 vfs_cache_lookup+0xd6 VOP_LOOKUP_APV+0xa1
lookup+0x59c namei+0x4d4 kern_renameat+0x213 amd64_syscall+0x351
Xfast_syscall+0xfb
2291 101358 node - mi_switch+0xe1 sleepq_wait+0x3a
sleeplk+0x15d __lockmgr_args+0x902 vop_stdlock+0x3c VOP_LOCK1_APV+0xab
_vn_lock+0x43 zfs_lookup+0x3c2 zfs_freebsd_lookup+0x6d
VOP_CACHEDLOOKUP_APV+0xa1 vfs_cache_lookup+0xd6 VOP_LOOKUP_APV+0xa1
lookup+0x59c namei+0x4d4 kern_renameat+0x213 amd64_syscall+0x351
Xfast_syscall+0xfb
2291 101359 node - mi_switch+0xe1 sleepq_wait+0x3a
sleeplk+0x15d __lockmgr_args+0xc9e vop_stdlock+0x3c VOP_LOCK1_APV+0xab
_vn_lock+0x43 vputx+0x21f zfs_rename_unlock+0x3e zfs_freebsd_rename+0xcdf
VOP_RENAME_APV+0xab kern_renameat+0x3ef amd64_syscall+0x351 Xfast_syscall+0xfb
2291 101360 node - mi_switch+0xe1 sleepq_wait+0x3a
sleeplk+0x15d __lockmgr_args+0x902 vop_stdlock+0x3c VOP_LOCK1_APV+0xab
_vn_lock+0x43 zfs_lookup+0x3c2 zfs_freebsd_lookup+0x6d
VOP_CACHEDLOOKUP_APV+0xa1 vfs_cache_lookup+0xd6 VOP_LOOKUP_APV+0xa1
lookup+0x59c namei+0x4d4 kern_renameat+0x213 amd64_syscall+0x351
Xfast_syscall+0xfb

Full thread dump at
https://gist.github.com/anonymous/3be664a893c8face928b678956b86302

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-24 00:17:48 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

Rick Macklem <***@FreeBSD.org> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |***@FreeBSD.org

--- Comment #2 from Rick Macklem <***@FreeBSD.org> ---
If it was possible, it might be useful to test with
sysctl debug.vfscache=0

Why? Well there was a similar recent occurrence on Freefall,
but where the vnode lock was on NFS. The similarity is that
several of the threads that were waiting for the vnode lock
were: cache_lookup()->vget()->_vn_lock()->
and r285632 changed cache_lookup() from using VI_LOCK() to
vhold() before the vget() call. I am wondering if this change
somehow broke the code.

Anyhow, disabling name caching would avoid doing the code in
cache_lookup().

Disabling name caching will have a performance hit, but it
would be nice to see if this avoids the deadlock?

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-24 00:23:08 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #3 from Doug Luce <***@freebsd.con.com> ---
(In reply to Rick Macklem from comment #2)

OK, did that on the same machine as before. A new npm install gives me
something slightly different:

PID TID COMM TDNAME KSTACK
21557 100965 node - mi_switch+0xe1
sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d kern_kevent+0x401
sys_kevent+0x12a amd64_syscall+0x351 Xfast_syscall+0xfb
21557 101212 node V8 WorkerThread mi_switch+0xe1
sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125
do_sem_wait+0x4aa __umtx_op_sem_wait+0x73 amd64_syscall+0x351
Xfast_syscall+0xfb
21557 101213 node V8 WorkerThread mi_switch+0xe1
sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125
do_sem_wait+0x4aa __umtx_op_sem_wait+0x73 amd64_syscall+0x351
Xfast_syscall+0xfb
21557 101214 node - mi_switch+0xe1
sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125
do_wait+0x329 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x351
Xfast_syscall+0xfb
21557 101215 node - mi_switch+0xe1 sleepq_wait+0x3a
sleeplk+0x15d __lockmgr_args+0x902 vop_stdlock+0x3c VOP_LOCK1_APV+0xab
_vn_lock+0x43 zfs_lookup+0x3c2 zfs_freebsd_lookup+0x6d
VOP_CACHEDLOOKUP_APV+0xa1 vfs_cache_lookup+0xd6 VOP_LOOKUP_APV+0xa1
lookup+0x59c namei+0x4d4 vn_open_cred+0x21d kern_openat+0x26f
amd64_syscall+0x351 Xfast_syscall+0xfb
21557 101646 node V8 WorkerThread mi_switch+0xe1
sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125
do_sem_wait+0x4aa __umtx_op_sem_wait+0x73 amd64_syscall+0x351
Xfast_syscall+0xfb
21557 101651 node V8 WorkerThread mi_switch+0xe1
sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125
do_sem_wait+0x4aa __umtx_op_sem_wait+0x73 amd64_syscall+0x351
Xfast_syscall+0xfb
21557 101654 node - mi_switch+0xe1
sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125
do_wait+0x329 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x351
Xfast_syscall+0xfb
21557 101655 node - mi_switch+0xe1
sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125
do_wait+0x329 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x351
Xfast_syscall+0xfb

Process still hangs, acourse.

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-24 01:03:54 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #4 from Rick Macklem <***@FreeBSD.org> ---
Oh well. Thanks for trying it. I am still not 100% convinced that
recent changes haven't broken cache_lookup(), but your problem isn't that.

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-24 06:53:01 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #5 from Andriy Gapon <***@FreeBSD.org> ---
(In reply to Rick Macklem from comment #4)

Rick, there is certainly a deadlock problem with rename on ZFS (on FreeBSD).
It's been there since day one.

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-24 12:02:03 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #6 from Rick Macklem <***@FreeBSD.org> ---
I knew there was a ZFS rename issue, but I thought it
occurred rarely and typically (always??) when snapshots
were being renamed.

This problem seems to have become frequent.
Of course, just about any change could cause changes in
when things get executed and "tickle" the deadlock bug.

There was a case on freefall that looked similar (due to
a deadlock involving vnode locks, mostly through cache_lookup())
and I had "hoped" this was the same problem (since the freefall
case occurred once in several months). It doesn't appear to
be the same problem.

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-24 12:42:20 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #7 from Andriy Gapon <***@FreeBSD.org> ---
(In reply to Rick Macklem from comment #6)

Post by b***@freebsd.org
I knew there was a ZFS rename issue, but I thought it
occurred rarely and typically (always??) when snapshots
were being renamed.

It is quite rare indeed. Just in case, I meant rename(2), so not sure how
snapshot renaming, which is done through a special ioctl, can get into the
picture.

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-24 12:43:58 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #8 from Doug Luce <***@freebsd.con.com> ---
I"m finding this highly reproducible at the moment. Anything I might do to aid
in diagnosis?

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-24 12:58:16 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #9 from Andriy Gapon <***@FreeBSD.org> ---
(In reply to Doug Luce from comment #8)
I believe that I know the root cause (if it's the same problem that I diagnosed
earlier). But my fix is very non-trivial, because it has a domino like effect
on the related ZFS code. So, I am not sure how to proceed from here.

I have a dusty and somewhat logically incomplete but working patch.
At the very least I can rebase it to the latest head and attach it to this bug.
Maybe tomorrow.

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-24 15:00:08 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

Bengt Ahlgren <***@sics.se> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |***@sics.se

--- Comment #10 from Bengt Ahlgren <***@sics.se> ---
(In reply to Andriy Gapon from comment #9)
You write that you think that a zfs rename (file, NOT snapshot) deadlock has
been around for a long time. Do you think this could be the same issue (on
9.3)?

https://lists.freebsd.org/pipermail/freebsd-fs/2015-December/022370.html

The trigger there was a massively parallel java program creating (and renaming)
a lot of output files.

I would be interested in testing patches, but that server is unfortunately
still running 9.3. Later in the summer perhaps.

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-24 15:46:47 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #11 from Andriy Gapon <***@FreeBSD.org> ---
(In reply to Bengt Ahlgren from comment #10)
Yes, that looks like the same thing.

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-24 16:58:35 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #12 from Andriy Gapon <***@FreeBSD.org> ---
I've just created a phabricator review request with my patch:
https://reviews.freebsd.org/D6533
I haven't added any reviewers or subscribers (other than umbrella ZFS
subscriber), because I view this review request more as an invitation to a
discussion or a request for comments.
But please feel free to add yourself in either capacity.

If you would like just to test the change, then there is 'Download Raw Diff'
link. The patch is for head as of r300617.

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-24 19:35:35 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #13 from Doug Luce <***@freebsd.con.com> ---
Running the patch give me this on bootup:

kernel: lock order reversal:
kernel: 1st 0xfffffe00f766fba0 bufwait (bufwait) @
/usr/src/sys/kern/vfs_bio.c:3512
kernel: 2nd 0xfffff800141e0200 dirhash (dirhash) @
/usr/src/sys/ufs/ufs/ufs_dirhash.c:281
kernel: stack backtrace:
kernel: #0 0xffffffff80a76bc3 at witness_debugger+0x73
kernel: #1 0xffffffff80a76ac3 at witness_checkorder+0xea3
kernel: #2 0xffffffff80a21cc5 at _sx_xlock+0x75
kernel: #3 0xffffffff80cd5ded at ufsdirhash_move+0x3d
kernel: #4 0xffffffff80cd831d at ufs_direnter+0x35d
kernel: #5 0xffffffff80ce08a1 at ufs_mkdir+0x8b1
kernel: #6 0xffffffff80fd2e67 at VOP_MKDIR_APV+0xf7
kernel: #7 0xffffffff80ae3b09 at kern_mkdirat+0x209
kernel: #8 0xffffffff80e7d2ae at amd64_syscall+0x2de
kernel: #9 0xffffffff80e5da1b at Xfast_syscall+0xfb

A bit later on:

kernel: acquiring duplicate lock of same type: "kqueue"
kernel: 1st kqueue @ /usr/src/sys/kern/kern_event.c:2026
kernel: 2nd kqueue @ /usr/src/sys/kern/subr_taskqueue.c:282
kernel: stack backtrace:
kernel: #0 0xffffffff80a76bc3 at witness_debugger+0x73
kernel: #1 0xffffffff80a76ac3 at witness_checkorder+0xea3
kernel: #2 0xffffffff809fb2b8 at __mtx_lock_flags+0xa8
kernel: #3 0xffffffff80a6945f at taskqueue_enqueue+0x3f
kernel: #4 0xffffffff809d04f9 at knote_enqueue+0x119
kernel: #5 0xffffffff809d28c1 at knote+0x1c1
kernel: #6 0xffffffff80ad9843 at vop_rename_post+0x143
kernel: #7 0xffffffff80fd2d3b at VOP_RENAME_APV+0x13b
kernel: #8 0xffffffff80ae3896 at kern_renameat+0x4a6
kernel: #9 0xffffffff80e7d2ae at amd64_syscall+0x2de
kernel: #10 0xffffffff80e5da1b at Xfast_syscall+0xfb

and the machine locks up.

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-24 19:54:58 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #14 from Konstantin Belousov <***@FreeBSD.org> ---
(In reply to Doug Luce from comment #13)
First LOR is known, false positive and not related.

Second LOR is more interesting, but it should be innocent. The cause is due to
witness matching locks by names, so it is just formal and is yet another
consequence of the taskqueue changes. Still, the patch attached should fix it.
What program do you run which needs vnode notifications ?

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-24 19:55:28 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #15 from Konstantin Belousov <***@FreeBSD.org> ---
Created attachment 170634
--> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=170634&action=edit
Silence false LOR

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-24 20:36:24 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #16 from Doug Luce <***@freebsd.con.com> ---
Maybe btsync is leading to the second LOR?

With the patch, I'm getting a locked machine after a couple minutes.
Unfortunately it's an EC2 instance so no console access. Not sure what the best
way to proceed would be...

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-24 21:02:48 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #17 from Doug Luce <***@freebsd.con.com> ---
A little more luck, got the panic message at least:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x118
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff8229551d
stack pointer = 0x28:0xfffffe01232905e0
frame pointer = 0x28:0xfffffe0123290600
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 3233 (emacs-25.0.93)
trap number = 12
panic: page fault
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01232900a0
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0123290150
vpanic() at vpanic+0x189/frame 0xfffffe01232901d0
panic() at panic+0x43/frame 0xfffffe0123290230
trap_fatal() at trap_fatal+0x36b/frame 0xfffffe0123290290
trap_pfault() at trap_pfault+0x200/frame 0xfffffe0123290310
trap() at trap+0x4cc/frame 0xfffffe0123290520
calltrap() at calltrap+0x8/frame 0xfffffe0123290520
--- trap 0xc, rip = 0xffffffff8229551d, rsp = 0xfffffe01232905f0, rbp =
0xfffffe0123290600 ---
rrm_exit() at rrm_exit+0xd/frame 0xfffffe0123290600
zfs_freebsd_rename() at zfs_freebsd_rename+0x12e/frame 0xfffffe0123290710
VOP_RENAME_APV() at VOP_RENAME_APV+0xfc/frame 0xfffffe0123290740
kern_renameat() at kern_renameat+0x4a6/frame 0xfffffe01232909a0
amd64_syscall() at amd64_syscall+0x2de/frame 0xfffffe0123290ab0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0123290ab0
--- syscall (128, FreeBSD ELF64, sys_rename), rip = 0x807c5fbaa, rsp =
0x7fffffffc418, rbp = 0x7fffffffc450 ---
Uptime: 28m4s
Automatic reboot in 15 seconds - press a key on the console to abort
--> Press a key on the console to reboot,
--> or switch off the system now.

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-24 21:20:41 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #18 from Andriy Gapon <***@FreeBSD.org> ---
(In reply to Doug Luce from comment #17)
As Konstantin said earlier those reported LORs are unlikely to have anything to
do with the lock-up.

The panic looks like quite unexpected too...
Could you please disassemble zfs_freebsd_rename in kgdb (kgdb /path/to/zfs.ko)
and attach the output?
Thank you for testing!

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-24 21:22:31 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #19 from Doug Luce <***@freebsd.con.com> ---
Will this do?

https://gist.github.com/anonymous/c18e9f08e7a36aef5c4ed2abb047e8f6

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-25 12:48:09 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #20 from Andriy Gapon <***@FreeBSD.org> ---
(In reply to Doug Luce from comment #19)
Yes, thank you.
Now I see one problem in the code, but it was present even without my patch and
now something in the patch triggers that bug. So, I have a fix for that
problem, but more investigation might be required.

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-25 13:08:03 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #21 from Andriy Gapon <***@FreeBSD.org> ---
Created attachment 170652
--> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=170652&action=edit
delta patch

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-25 22:02:29 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #22 from Doug Luce <***@freebsd.con.com> ---
With that patch, the npm install that used to hang no longer does. The panic
didn't happen until the thing had been up a few hours:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address = 0xc0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff82310db4
stack pointer = 0x28:0xfffffe01236e0600
frame pointer = 0x28:0xfffffe01236e0710
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 2533 (emacs-25.0.93)
trap number = 12
panic: page fault
cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01236e00c0
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe01236e0170
vpanic() at vpanic+0x189/frame 0xfffffe01236e01f0
panic() at panic+0x43/frame 0xfffffe01236e0250
trap_fatal() at trap_fatal+0x36b/frame 0xfffffe01236e02b0
trap_pfault() at trap_pfault+0x200/frame 0xfffffe01236e0330
trap() at trap+0x4cc/frame 0xfffffe01236e0540
calltrap() at calltrap+0x8/frame 0xfffffe01236e0540
--- trap 0xc, rip = 0xffffffff82310db4, rsp = 0xfffffe01236e0610, rbp =
0xfffffe01236e0710 ---
zfs_freebsd_rename() at zfs_freebsd_rename+0xe4/frame 0xfffffe01236e0710
VOP_RENAME_APV() at VOP_RENAME_APV+0xfc/frame 0xfffffe01236e0740
kern_renameat() at kern_renameat+0x4a6/frame 0xfffffe01236e09a0
amd64_syscall() at amd64_syscall+0x2de/frame 0xfffffe01236e0ab0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe01236e0ab0
--- syscall (128, FreeBSD ELF64, sys_rename), rip = 0x807c5fbaa, rsp =
0x7fffffffc418, rbp = 0x7fffffffc450 ---
Uptime: 2h36m30s
Dumping 1470 out of 4064 MB: (CTRL-C to abort)
..2%..11%..21%..31%..41%..51%..61%..71%..81%..91%
Dump complete
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...
cpu_reset: Restarting BSP
cpu_reset_proxy: Stopped CPU 1

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-25 22:39:02 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #23 from Doug Luce <***@freebsd.con.com> ---
A second panic happened after only a few seconds of uptime after rebooting.
Looks the same:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0xc0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff82310db4
stack pointer = 0x28:0xfffffe0123585600
frame pointer = 0x28:0xfffffe0123585710
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 2734 (emacs-25.0.93)
trap number = 12
panic: page fault
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01235850c0
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0123585170
vpanic() at vpanic+0x189/frame 0xfffffe01235851f0
panic() at panic+0x43/frame 0xfffffe0123585250
trap_fatal() at trap_fatal+0x36b/frame 0xfffffe01235852b0
trap_pfault() at trap_pfault+0x200/frame 0xfffffe0123585330
trap() at trap+0x4cc/frame 0xfffffe0123585540
calltrap() at calltrap+0x8/frame 0xfffffe0123585540
--- trap 0xc, rip = 0xffffffff82310db4, rsp = 0xfffffe0123585610, rbp =
0xfffffe0123585710 ---
zfs_freebsd_rename() at zfs_freebsd_rename+0xe4/frame 0xfffffe0123585710
VOP_RENAME_APV() at VOP_RENAME_APV+0xfc/frame 0xfffffe0123585740
kern_renameat() at kern_renameat+0x4a6/frame 0xfffffe01235859a0
amd64_syscall() at amd64_syscall+0x2de/frame 0xfffffe0123585ab0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0123585ab0
--- syscall (128, FreeBSD ELF64, sys_rename), rip = 0x807c5fbaa, rsp =
0x7fffffffcf88, rbp = 0x7fffffffcfc0 ---
Uptime: 12m25s
Dumping 634 out of 4064 MB: (CTRL-C to abort)
..3%..11%..21%..31%..41%..51%..61%..71%..81%..91%
Dump complete

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-26 07:01:03 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

Andriy Gapon <***@FreeBSD.org> changed:

What |Removed |Added
----------------------------------------------------------------------------
Status|New |Open

--- Comment #24 from Andriy Gapon <***@FreeBSD.org> ---
(In reply to Doug Luce from comment #22)
I am afraid that I really need to be able to poke inside the crash dump to be
able to understand what's happening here...
Or, before that, could you please compile your kernel with DEBUG_VFS_LOCKS
options if it's not enabled yet?

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-05-26 19:27:49 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #25 from Doug Luce <***@freebsd.con.com> ---
What's your SSH public key? I'll set you up with access.

Recompiled with the flag, got a panic and a dump.

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0xc0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff82311194
stack pointer = 0x28:0xfffffe0123717600
frame pointer = 0x28:0xfffffe0123717710
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 2621 (emacs-25.0.93)
trap number = 12
panic: page fault
cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01237170c0
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0123717170
vpanic() at vpanic+0x189/frame 0xfffffe01237171f0
panic() at panic+0x43/frame 0xfffffe0123717250
trap_fatal() at trap_fatal+0x36b/frame 0xfffffe01237172b0
trap_pfault() at trap_pfault+0x200/frame 0xfffffe0123717330
trap() at trap+0x4cc/frame 0xfffffe0123717540
calltrap() at calltrap+0x8/frame 0xfffffe0123717540
--- trap 0xc, rip = 0xffffffff82311194, rsp = 0xfffffe0123717610, rbp =
0xfffffe0123717710 ---
zfs_freebsd_rename() at zfs_freebsd_rename+0xe4/frame 0xfffffe0123717710
VOP_RENAME_APV() at VOP_RENAME_APV+0xfc/frame 0xfffffe0123717740
kern_renameat() at kern_renameat+0x4a6/frame 0xfffffe01237179a0
amd64_syscall() at amd64_syscall+0x2de/frame 0xfffffe0123717ab0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0123717ab0
--- syscall (128, FreeBSD ELF64, sys_rename), rip = 0x807c5fbaa, rsp =
0x7fffffffcdd8, rbp = 0x7fffffffce10 ---
Uptime: 14m43s
Dumping 565 out of 4064 MB: (CTRL-C to abort)
..3%..12%..23%..32%..43%..51%..63%..71%..83%..91%
Dump complete

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-06-01 22:14:17 UTC

Permalink

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-06-01 22:11:58 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #26 from Andriy Gapon <***@FreeBSD.org> ---
I've refreshed the phabricator request with a new version:
https://reviews.freebsd.org/D6533
The new version contains quite a few fixes on top of the previous one.

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-06-03 17:48:39 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #27 from Doug Luce <***@freebsd.con.com> ---
Built and installed the patched kernel this morning. I was able to do the npm
install that was failing before, and copies/moves between ZFS and UFS seem to
be going just fine.

While I don't feel qualified to provide a review on that diff, please let me
know if it would be useful. I'll be happy to get myself versed on this stuff.

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-06-06 14:21:41 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #28 from Andriy Gapon <***@FreeBSD.org> ---
(In reply to Doug Luce from comment #27)
Thank you very much again for testing and reporting!
Your testing is a valuable data point for further work on this patch.
Probably I'll now invite more ZFS / FS people to discuss and review the patch
and we'll see how it goes.

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-06-14 18:03:07 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #29 from Peter Wemm <***@FreeBSD.org> ---
I was wondering what the current state is. Which patch did Doug test? We've
had some more explosions with poudriere even without npm and its threaded
operations.

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-06-15 17:33:59 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #30 from Doug Luce <***@freebsd.con.com> ---
The last thing I tried was https://reviews.freebsd.org/D6533 on top of that
day's head. Looks like there are some reviewers tagged on there but nobody
seems to have gotten around to it yet.

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-06-23 06:08:22 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #31 from Steve Wills <***@FreeBSD.org> ---
(In reply to Doug Luce from comment #30)
I'm currently testing this patch and so far it fixes the issue and doesn't
cause any adverse affects, but it's only been running for a few hours. More
testing would be good, but this looks promising to me.

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-07-02 19:00:24 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

Kirk McKusick <***@FreeBSD.org> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |***@FreeBSD.org

--- Comment #32 from Kirk McKusick <***@FreeBSD.org> ---
Commits 301996 and 301997 made on 2016-06-17 to relieve pressure by ZFS on
vfs.numvnodes should help moderate this problem. This is not to say that it
will fully resolve the problem, so the ongoing work should continue.

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-07-08 13:43:58 UTC

Permalink

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-07-14 23:14:23 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

Laurie Odgers <***@gmail.com> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |***@gmail.com

--- Comment #33 from Laurie Odgers <***@gmail.com> ---
Hi All
We are seeing the same problem using NPM 6.2.2 within Jails with root-on-zfs.
This has been a problem for some time, but seems to randomly pop up. We have
run for months without this error happening, however today it seems to happen
repeatedly and relatively reproducible.

We have reverted back to Node v4 and NPM v2 and this bug does not seem to
happen (at least thus far).

Procstat trace below.

# procstat -kk 12590
PID TID COMM TDNAME KSTACK
12590 101356 node - mi_switch+0xe1
sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d kern_kevent_fp+0x399
kern_kevent+0x9f sys_kevent+0x12a amd64_syscall+0x40f Xfast_syscall+0xfb
12590 101358 node - mi_switch+0xe1
sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125
do_sem_wait+0x4db __umtx_op_sem_wait+0x73 amd64_syscall+0x40f
Xfast_syscall+0xfb
12590 101359 node V8 WorkerThread mi_switch+0xe1
sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125
do_sem_wait+0x4db __umtx_op_sem_wait+0x73 amd64_syscall+0x40f
Xfast_syscall+0xfb
12590 101360 node V8 WorkerThread mi_switch+0xe1
sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125
do_sem_wait+0x4db __umtx_op_sem_wait+0x73 amd64_syscall+0x40f
Xfast_syscall+0xfb
12590 101361 node V8 WorkerThread mi_switch+0xe1
sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125
do_sem_wait+0x4db __umtx_op_sem_wait+0x73 amd64_syscall+0x40f
Xfast_syscall+0xfb
12590 101362 node V8 WorkerThread mi_switch+0xe1
sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125
do_sem_wait+0x4db __umtx_op_sem_wait+0x73 amd64_syscall+0x40f
Xfast_syscall+0xfb
12590 101363 node - mi_switch+0xe1 sleepq_wait+0x3a
sleeplk+0x15d __lockmgr_args+0x91a vop_stdlock+0x3c VOP_LOCK1_APV+0xab
_vn_lock+0x43 vget+0x73 cache_lookup+0x5d5 vfs_cache_lookup+0xac
VOP_LOOKUP_APV+0xa1 lookup+0x5a1 namei+0x4d4 kern_renameat+0x1b7
amd64_syscall+0x40f Xfast_syscall+0xfb
12590 101364 node - mi_switch+0xe1 sleepq_wait+0x3a
sleeplk+0x15d __lockmgr_args+0xca0 vop_stdlock+0x3c VOP_LOCK1_APV+0xab
_vn_lock+0x43 vputx+0x21f

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-07-15 00:33:51 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

***@gmail.com changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |***@gmail.com

--- Comment #34 from ***@gmail.com ---
Same with node 6.3.0:

procstat -kk 95115
PID TID COMM TDNAME KSTACK
95115 101118 node - mi_switch+0xe1
thread_suspend_switch+0x170 thread_single+0x4e5 exit1+0xce sigexit+0x925
postsig+0x286 ast+0x427 doreti_ast+0x1f
95115 101294 node V8 WorkerThread mi_switch+0xe1
thread_suspend_switch+0x170 cursig+0x613 ast+0x42f doreti_ast+0x1f
95115 101301 node V8 WorkerThread mi_switch+0xe1
thread_suspend_switch+0x170 cursig+0x613 ast+0x42f doreti_ast+0x1f
95115 101303 node V8 WorkerThread mi_switch+0xe1
thread_suspend_switch+0x170 cursig+0x613 ast+0x42f doreti_ast+0x1f
95115 101305 node - mi_switch+0xe1 sleepq_wait+0x3a
sleeplk+0x15d __lockmgr_args+0x91a vop_stdlock+0x3c VOP_LOCK1_APV+0xab
_vn_lock+0x43 vget+0x73 cache_lookup+0x5d5 vfs_cache_lookup+0xac
VOP_LOOKUP_APV+0xa1 lookup+0x5a1 namei+0x4d4 kern_renameat+0x1b7
amd64_syscall+0x40f Xfast_syscall+0xfb
95115 101312 node - mi_switch+0xe1 sleepq_wait+0x3a
sleeplk+0x15d __lockmgr_args+0x91a vop_stdlock+0x3c VOP_LOCK1_APV+0xab
_vn_lock+0x43 vget+0x73 cache_lookup+0x5d5 vfs_cache_lookup+0xac
VOP_LOOKUP_APV+0xa1 lookup+0x5a1 namei+0x4d4 kern_renameat+0x1b7
amd64_syscall+0x40f Xfast_syscall+0xfb
95115 101313 node - mi_switch+0xe1 sleepq_wait+0x3a
sleeplk+0x15d __lockmgr_args+0xca0 vop_stdlock+0x3c VOP_LOCK1_APV+0xab
_vn_lock+0x43 vputx+0x21f zfs_rename_unlock+0x3e zfs_freebsd_rename+0xe39
VOP_RENAME_APV+0xab kern_renameat+0x4a6 amd64_syscall+0x40f Xfast_syscall+0xfb
95115 101314 node - mi_switch+0xe1 sleepq_wait+0x3a
sleeplk+0x15d __lockmgr_args+0x91a vop_stdlock+0x3c VOP_LOCK1_APV+0xab
_vn_lock+0x43 zfs_lookup+0x45d zfs_freebsd_lookup+0x6d
VOP_CACHEDLOOKUP_APV+0xa1 vfs_cache_lookup+0xd6 VOP_LOOKUP_APV+0xa1
lookup+0x5a1 namei+0x4d4 kern_renameat+0x1b7 amd64_syscall+0x40f
Xfast_syscall+0xfb

Nasty node, it really does something very unusual with files, whenever I met
this issue it was always caused by node process, this is really funny...

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-07-15 05:02:20 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #35 from ***@gmail.com ---
Yes, after some npm modules were updated today, npm install hangs on every run
after full system reboot.
I have latest FreeBSD with all patches installed:

FreeBSD 10.3-RELEASE-p4

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-07-21 15:59:52 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #36 from Glen Barber <***@FreeBSD.org> ---
Can we please get some feedback on the review for this?

https://reviews.freebsd.org/D6533

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-08-04 21:26:14 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

***@hardenedbsd.org changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |***@hardenedbsd.org

--- Comment #37 from ***@hardenedbsd.org ---
The error in the following PR possible relates to this PR too:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209580

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-08-04 21:28:07 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #38 from Andriy Gapon <***@FreeBSD.org> ---
(In reply to op from comment #37)
Most likely, no.

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-08-05 06:23:46 UTC

Permalink

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158

--- Comment #39 from commit-***@freebsd.org ---
A commit references this bug:

Author: avg
Date: Fri Aug 5 06:23:07 UTC 2016
New revision: 303763
URL: https://svnweb.freebsd.org/changeset/base/303763

Log:
zfs: honour and make use of vfs vnode locking protocol

ZFS POSIX Layer is originally written for Solaris VFS which is very
different from FreeBSD VFS. Most importantly many things that FreeBSD VFS
manages on behalf of all filesystems are implemented in ZPL in a different
way.
Thus, ZPL contains code that is redundant on FreeBSD or duplicates VFS
functionality or, in the worst cases, badly interacts / interferes
with VFS.

The most prominent problem is a deadlock caused by the lock order reversal
of vnode locks that may happen with concurrent zfs_rename() and lookup().
The deadlock is a result of zfs_rename() not observing the vnode locking
contract expected by VFS.

This commit removes all ZPL internal locking that protects parent-child
relationships of filesystem nodes. These relationships are protected
by vnode locks and the code is changed to take advantage of that fact
and to properly interact with VFS.

Removal of the internal locking allowed all ZPL dmu_tx_assign calls to
use TXG_WAIT mode.

Another victim, disputable perhaps, is ZFS support for filesystems with
mixed case sensitivity. That support is not provided by the OS anyway,
so in ZFS it was a buch of dead code.

To do:
- replace ZFS_ENTER mechanism with VFS managed / visible mechanism
- replace zfs_zget with zfs_vget[f] as much as possible
- get rid of not really useful now zfs_freebsd_* adapters
- more cleanups of unneeded / unused code
- fix / replace .zfs support

PR: 209158
Reported by: many
Tested by: many (thank you all!)
MFC after: 5 days
Sponsored by: HybridCluster / ClusterHQ
Differential Revision: https://reviews.freebsd.org/D6533

Changes:
head/sys/cddl/compat/opensolaris/sys/vnode.h
head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_dir.h
head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_vfsops.h
head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h
head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c
head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_dir.c
head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_sa.c
head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c

--
You are receiving this mail because:
You are the assignee for the bug.

b***@freebsd.org

2016-08-05 06:25:11 UTC

Permalink

--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.