DRBD resources on diskful nodes are diskless after resize if hosts are hard power-cycled #423
Thanks for the very nice bug report.
This issue was produced with a single backing disk on each node, attached to a SATA controller in AHCI mode, with no RAID. I reproduced the issue with both enterprise and consumer SSD backing disks (Intel S4620 and Crucial MX500). The issue also occurs if I run

I know the log of stuff I tried is very long, so let me call out this part here:

EDIT: The above appears incorrect. I tried again and was not able to manually perform a resize that did not exhibit this issue after a power-cycle of a host where the volume was in Secondary state. (also, I see now that I had
My first guess was also some caching issue. I tried but failed to reproduce this issue, although I am only trying this with DRBD and LINSTOR running in VMs.
You might want to enable trace logging on the satellites, using
Hi, is there a way to do a shrinking (decreasing) resize of my ZFS storage with Linstor? Cheers!
@ghernadi - I have repeated the process with trace logging enabled. Notably, when I attempted to re-create the resize sequence that did not fail earlier, both tests resulted in the problem on the Secondary. I said in my initial report that manually using

For the normal process - resizing with Linstor - here are all the syslog entries from around the time of the resize. And here are the same logs with all 3 nodes folded together and sorted by time, in case the order of events between nodes is important.

Here are my notes from the various scenarios I tried today:

- Resize with Linstor
- Repeat, but run all commands manually and in the same order as Linstor
- Repeat, but with `drbdadm resize` on pve3 instead of pve1
- Repeat, with a vanilla `drbdadm resize` command
- Repeat, with the process I thought did not result in the issue earlier

Same problem, so this might be at the DRBD layer?
I was able to re-create this with just DRBD and with VMs - so this is not actually a Linstor bug. This can probably be closed, and I will open a bug report on the drbd repo.
I encountered an issue with a Proxmox cluster running Linstor where, after a power outage and full cluster reboot, VMs whose disks had been resized failed to start. I narrowed the issue down to a problem with DRBD replicas in Secondary state when one replica is Primary and Linstor is used to resize the volume.
If a Linstor volume is resized while Primary on a host, then a DRBD Secondary for that volume is powered off, the Secondary's DRBD resource will not be able to come online after the host is back up. If the Primary is power cycled, the Primary's DRBD resource will sometimes not come back online after the host is back up. If multiple nodes are power-cycled, all Secondaries and sometimes the Primary will have this issue.
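The failure scenario above can be sketched as a command sequence. This is a hedged illustration, not a transcript from the report: the resource name `pm-xxxxxxxx`, the size, and the node roles are placeholders, and `linstor volume-definition set-size` is assumed to be the resize entry point the LINSTOR client uses.

```shell
# On a node with the linstor client: grow the volume while one peer holds it Primary.
# Resource name and size are placeholders.
linstor volume-definition set-size pm-xxxxxxxx 0 20G

# Hard power-cycle a node where the resource is Secondary
# (e.g. via IPMI or by pulling power), then boot it again.

# After boot, per the report, the resource cannot re-attach on that node:
drbdadm adjust pm-xxxxxxxx   # fails, complaining there is no activity log
drbdadm status pm-xxxxxxxx   # local disk shows Diskless despite an intact backing volume
```

These commands require a live DRBD/LINSTOR cluster, so they are shown as a sketch rather than something runnable in isolation.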
The issue can be identified by running `drbdadm adjust` on power-cycled replicas when they are back up, which will fail with an error saying there is no activity log. This situation results in an inability to re-establish quorum when multiple hosts power-cycle at once, such as in a power outage.

This can be fixed by running `drbdadm create-md pm-nnnnnnnn` on the nodes that have backing volumes but are showing as diskless, then, if necessary, forcing one node to be Primary to start a sync.

This issue occurs no matter how far in the past the disk was resized. The first time I encountered the issue was a power outage over two months after having resized a VM disk, and that DRBD resource did not come back online.
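The recovery procedure described above, as a hedged command sketch: the resource name is a placeholder, and `drbdadm primary --force` is the standard way to force a node Primary, which should only be used on the node whose data you want to keep.

```shell
# On each node that has a backing volume but shows as Diskless:
drbdadm create-md pm-nnnnnnnn   # recreate the on-disk DRBD metadata
drbdadm adjust pm-nnnnnnnn      # re-attach the disk and reconnect to peers

# If the peers will not resync on their own, force one node Primary to
# start a full sync (the other copies will be overwritten):
drbdadm primary --force pm-nnnnnnnn
```

As these commands modify on-disk metadata on a real cluster, they are a sketch of the steps in the report, not a tested recipe.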
The issue does not occur if the host is shut down gracefully with `shutdown -h now`. If the host is power-cycled after a graceful shutdown or reboot, the issue will not occur.
The issue does not occur if the volume is resized outside of Linstor - `lvextend` or `zfs set volsize`, then `drbdadm resize`, with or without `--assume-clean`. I am not sure what Linstor is doing differently in the resize process.

My full notes on troubleshooting this issue are attached. Reproduction attempts 3 through 7 are the most interesting. Attempts 1 and 2 contain a bunch of actions / thoughts that turned out to be unrelated.
Linstor Single Replica after Resize then Cluster Restart.md
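For contrast, the manual resize path that does not trigger the issue can be sketched as follows. Volume-group, dataset, and resource names are placeholders; the sequence mirrors the `lvextend` / `zfs set volsize` then `drbdadm resize` steps described above.

```shell
# LVM-backed volume: grow the backing LV on every diskful node first.
lvextend -L 20G /dev/vg_drbd/pm-nnnnnnnn_00

# ZFS-backed volume: grow the zvol on every diskful node instead.
zfs set volsize=20G tank/pm-nnnnnnnn_00

# Then, on one node, tell DRBD to adopt the new backing size:
drbdadm resize pm-nnnnnnnn
# optionally skip resyncing the newly added area:
# drbdadm resize --assume-clean pm-nnnnnnnn
```

This requires diskful nodes with real backing storage, so it is illustrative only.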
Debugging Summary

- Recovered by running `drbdadm create-md` on the affected nodes for the affected volume.
- Checked the `/var/lib/drbd/drbd-minor-nnnn.lkbd` files after resize, and found that they were not updated with the new size on nodes where the volume was Secondary. The file did contain the correct volume size on the node where the volume was Primary.
- Ran `drbdadm check-resize` on the resized volume on each node. This caused the `lkbd` file to get updated, but the same metadata problem occurred after power-cycle.
- Tried `drbdadm down` and `up` to bring the volume offline and back online. The first run of `drbdadm up` complained with `No usable activity log found.`, but succeeded anyway. I was not able to provoke the same issue without the reboot.

Version Info
Linstor controller/satellite are 1.29.1.
DRBD version info: