storage: LVM2 RAID support #17226

Merged: mvollmer merged 5 commits into cockpit-project:main on Oct 30, 2023

Conversation

mvollmer
Member

@mvollmer mvollmer commented Apr 7, 2022

Demo: https://youtu.be/OElG4yHxc2w
Dialog demo: https://youtu.be/WChc7-vfM1A

For later:


Storage: Support for RAID layouts with LVM2

On recent operating systems such as Fedora 39, Cockpit can now create LVM2 logical volumes with RAID layouts, display how they are stored on their physical volumes, and help you repair them.

[screenshot]

[screenshot]

@mvollmer mvollmer changed the title from "WIP storage: LVM2 RAID support" to "storage: LVM2 RAID support" Apr 7, 2022
@mvollmer mvollmer added the no-test label (for doc/workflow changes, or experiments which don't need a full CI run) Apr 7, 2022
@mvollmer
Member Author

I was confused about the "degraded" state; I somehow thought "lvs -o health_status" would return "degraded" for a degraded raid5 volume. This will never happen; both health_status and attr[8] will indicate "partial" when a raid5 lv is missing a single device. I guess we need to deduce that information somewhere in the stack. At that place, we need the full segment structure of the lv, since being degraded is a per-segment property.

@mvollmer
Member Author

mvollmer commented Apr 26, 2022

both health_status and attr[8] will indicate "partial" when a raid5 lv is missing a single device.

Actually, they will only report "partial" until the missing PV is removed via "vgreduce --removemissing". After that they report "refresh needed". So I guess they are not as helpful as I initially thought.

When drilling into the segments, devices, and subvolumes, we can identify missing pieces by their segment type ("error" after vgreduce --removemissing) and their "devices" ("[unknown]" before vgreduce --removemissing).
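
Just to illustrate the drill-down (Cockpit itself gets this through UDisks2/storaged rather than by running lvs, and the function and VG name below are made up), a rough Python sketch of that heuristic on top of lvs' JSON report:

import json
import subprocess

def find_missing_segments(vg):
    # Report per-segment rows (segtype/devices are segment fields) including
    # hidden rimage/rmeta subvolumes, and keep the broken ones: segtype
    # "error" after vgreduce --removemissing, devices containing "[unknown]"
    # before it.
    out = subprocess.run(
        ["lvs", "--reportformat", "json", "--all",
         "-o", "lv_name,segtype,devices", vg],
        check=True, capture_output=True, text=True).stdout
    report = json.loads(out)["report"][0]
    rows = next(iter(report.values()))  # one row per segment
    return [row for row in rows
            if row["segtype"] == "error" or "[unknown]" in row["devices"]]

# Hypothetical usage:
#   for seg in find_missing_segments("vgroup0"):
#       print(seg["lv_name"], seg["segtype"], seg["devices"])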

@mvollmer
Member Author

mvollmer commented Jul 4, 2022

Just for the record, this just happened after growing the volume:

[screenshot: stripe layout after growing the volume]

Either LVM2 has made fantastically bad decisions about where to put stuff, or I messed up the data structures...

(A single failure of either sdc or sda1 will kill two stripes.)

@garrett
Member

garrett commented Jul 4, 2022

(A single failure of either sdc or sda1 will kill two stripes.)

We probably want to warn about unsafe configurations.

Some people may want to run with unsafe setups, as they might back data up anyway... but some people don't. Warnings are useful.

But then: What would be "unsafe"? Anything with worse fail rates than a single disk? Or anything not redundant?

This isn't about preventing people from setting up unsafe RAIDs; just letting them know that what they're doing would lead to data loss... probably with a helper text warning at the bottom of a dialog and possibly even an (i) explaining it more in depth?

@mvollmer
Member Author

mvollmer commented Jul 4, 2022

This isn't about preventing people from setting up unsafe RAIDs; just letting them know that what they're doing would lead to data loss... probably with a helper text warning at the bottom of a dialog and possibly even an (i) explaining it more in depth?

Yes, this would be useful. In this case, LVM2 has made all the decisions, but we will also allow people to select which drives to use, and there we should warn about selecting "unsafe" ones, like putting two stripes on the same physical disk.

But again, what has just happened in this concrete case counts as a bug in LVM2 (if Cockpit presents the situation correctly, which I think it does). LVM2 should have used sdc to extend the second stripe instead of the first, and sda1 to extend the third stripe instead of the second. This should have been possible since all stripe extensions are the same size. *shrug* I'll keep an eye out for this.
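
For the record, a rough sketch of what such a check could look like once we let people pick the PVs (the helper name and the example layout are made up; the real code would live in the dialog logic, this is just the shape of the idea): map every physical disk to the stripes it backs and flag any disk that backs more than one.

from collections import defaultdict

def disks_backing_multiple_stripes(stripe_disks):
    # stripe_disks: {stripe number: [disks backing that stripe]}
    # Returns the disks whose single failure would take out several stripes.
    by_disk = defaultdict(list)
    for stripe, disks in stripe_disks.items():
        for disk in disks:
            by_disk[disk].append(stripe)
    return {disk: stripes for disk, stripes in by_disk.items() if len(stripes) > 1}

# A layout roughly like the screenshot above (hypothetical stripe numbers):
print(disks_backing_multiple_stripes({1: ["sdb", "sdc"], 2: ["sdc", "sda1"], 3: ["sda1", "sdd"]}))
# -> {'sdc': [1, 2], 'sda1': [2, 3]}: losing either sdc or sda1 kills two stripes.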

test/verify/check-storage-lvm2 (review comment, resolved)
@mvollmer mvollmer requested a review from jelly October 27, 2023 06:27
jelly previously approved these changes Oct 27, 2023
Otherwise the order of PVs passed to calls like CreatePlainVolumeWithLayout depends on the order in which the PVs were selected in the dialog, which would not be wrong, but is unexpected.

So that extending this to "errors" in addition to "warnings" later on becomes slightly cleaner.

It gets spuriously closed sometimes.
@jelly
Member

jelly commented Oct 27, 2023

centos-9 still seems unhealthy:

ok 47 test/verify/check-storage-raid1 TestStorageRaid1.testRaidLevelOne [ND@-1]
Traceback (most recent call last):
  File "/var/ARTIFACTS/work-optional2f9criud/plans/all/optional/discover/default-0/tests/bots/test-failure-policy", line 294, in <module>
    sys.exit(main())
  File "/var/ARTIFACTS/work-optional2f9criud/plans/all/optional/discover/default-0/tests/bots/test-failure-policy", line 67, in main
    number, img = check_all_known_issues(api, output, opts.image)
  File "/var/ARTIFACTS/work-optional2f9criud/plans/all/optional/discover/default-0/tests/bots/test-failure-policy", line 134, in check_all_known_issues
    number = check_known_issue(api, trace, img)
  File "/var/ARTIFACTS/work-optional2f9criud/plans/all/optional/discover/default-0/tests/bots/test-failure-policy", line 172, in check_known_issue
    if fnmatch.fnmatchcase(trace, match) or fnmatch.fnmatchcase(trace, match.replace("[", "?")):
  File "/usr/lib64/python3.9/fnmatch.py", line 76, in fnmatchcase
    match = _compile_pattern(pat)
  File "/usr/lib64/python3.9/fnmatch.py", line 52, in _compile_pattern
    return re.compile(res).match
  File "/usr/lib64/python3.9/re.py", line 252, in compile
    return _compile(pattern, flags)
  File "/usr/lib64/python3.9/re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib64/python3.9/sre_compile.py", line 788, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib64/python3.9/sre_parse.py", line 955, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib64/python3.9/sre_parse.py", line 444, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib64/python3.9/sre_parse.py", line 841, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "/usr/lib64/python3.9/sre_parse.py", line 444, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib64/python3.9/sre_parse.py", line 755, in _parse
    p = _parse_sub(source, state, verbose, nested + 1)
  File "/usr/lib64/python3.9/sre_parse.py", line 444, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib64/python3.9/sre_parse.py", line 841, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "/usr/lib64/python3.9/sre_parse.py", line 444, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib64/python3.9/sre_parse.py", line 599, in _parse
    raise source.error(msg, len(this) + 1 + len(that))
re.error: bad character range k-j at position 229 (line 4, column 49)

And

WARNING: Waiting for !ph_is_present("#dialog") took 38.8 seconds, which is 64% of the timeout.
WARNING: Waiting for !ph_is_present("#dialog") took 43.7 seconds, which is 72% of the timeout.
WARNING: Waiting for !ph_is_present("#dialog") took 30.7 seconds, which is 51% of the timeout.
> log: {"problem":null,"name":"org.freedesktop.DBus.Error.UnknownInterface","message":"Unknown interface 'org.freedesktop.systemd1.Service'."}
WARNING: Waiting for !ph_is_present("#dialog") took 34.3 seconds, which is 57% of the timeout.
Traceback (most recent call last):
  File "/var/ARTIFACTS/work-optional2f9criud/plans/all/optional/discover/default-0/tests/test/verify/check-storage-stratis", line 263, in testBasic
    self.wait_mounted(1, 1)
  File "/var/ARTIFACTS/work-optional2f9criud/plans/all/optional/discover/default-0/tests/test/common/storagelib.py", line 492, in wait_mounted
    self.content_tab_wait_in_info(row, col, "Mount point",
  File "/var/ARTIFACTS/work-optional2f9criud/plans/all/optional/discover/default-0/tests/test/common/storagelib.py", line 215, in content_tab_wait_in_info
    self.retry_in_content_tab(row_index, tab_index, func)
  File "/var/ARTIFACTS/work-optional2f9criud/plans/all/optional/discover/default-0/tests/test/common/storagelib.py", line 190, in retry_in_content_tab
    self.browser.wait(step)
  File "/var/ARTIFACTS/work-optional2f9criud/plans/all/optional/discover/default-0/tests/test/common/testlib.py", line 574, in wait
    raise Error('timed out waiting for predicate to become true')
testlib.Error: timed out waiting for predicate to become true

Wrote screenshot to TestStorageStratis-testBasic-centos-9-stream-localhost-22-FAIL.png
Wrote HTML dump to TestStorageStratis-testBasic-centos-9-stream-localhost-22-FAIL.html
Wrote JS log to TestStorageStratis-testBasic-centos-9-stream-localhost-22-FAIL.js.log
Journal extracted to TestStorageStratis-testBasic-centos-9-stream-localhost-22-FAIL.log.gz
umount: /run/fsys1: not mounted.
umount: /dev/loop14: not mounted.
umount: /dev/loop13: not mounted.
umount: /dev/loop12: not mounted.
umount: /dev/loop11: not mounted.
umount: /dev/loop10: not mounted.
Execution failed:
stratisd failed to perform the operation that you requested. It returned the following information via the D-Bus: ERROR: low-level ioctl error due to nix error; ioctl number: 4, input header: Some(DeviceInfo { version: Version { major: 4, minor: 0, patch: 0 }, data_size: 16384, data_start: 312, target_count: 0, open_count: 0, flags: (empty), event_nr: 4199615, dev: Device { major: 0, minor: 0 }, name: Some(DmNameBuf { inner: "stratis-1-01ad51d0bd22451e89e9714281b60d65-thin-fs-2d7c6df211fc4ef3a4143589ea9b0b74" }), uuid: None }), header result: Some(DeviceInfo { version: Version { major: 4, minor: 48, patch: 0 }, data_size: 16384, data_start: 312, target_count: 0, open_count: 0, flags: (empty), event_nr: 4199615, dev: Device { major: 0, minor: 0 }, name: Some(DmNameBuf { inner: "stratis-1-01ad51d0bd22451e89e9714281b60d65-thin-fs-2d7c6df211fc4ef3a4143589ea9b0b74" }), uuid: None }), error: EBUSY: Device or resource busy. 

Execution failed:
stratisd failed to perform the operation that you requested. It returned the following information via the D-Bus: ERROR: filesystems remaining on pool. 

@mvollmer
Member Author

re.error: bad character range k-j at position 229 (line 4, column 49)

I think this happens because of the 5106-journalctl-since-until-behaviour-change-2 pattern, which has this line in it:

})(["January 1, 2037", ["check-journal", "AFTER BOOT", ""], ["", "Reboot", ""], ["check-journal", "BEFORE BOOT", ""]])): Uncaught (in promise) Error: condition did not become true

The [ character is special for fnmatch and we should probably escape it in that pattern.
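
Something like this, as a rough sketch of the escaping idea (the helper name is made up, and the actual fix may well look different; in glob syntax a literal [ is written as [[]):

import fnmatch

def fnmatch_literal_brackets(text, pattern):
    # Treat every '[' in the known-issue pattern as a literal character
    # instead of the start of a character set, so text like "check-journal"
    # inside [ ] can no longer turn into an invalid regex range.
    return fnmatch.fnmatchcase(text, pattern.replace("[", "[[]"))

# The pattern line quoted above then no longer makes re.compile() blow up:
fnmatch_literal_brackets("some journal output",
                         '})(["January 1, 2037", ["check-journal", "AFTER BOOT", ""],')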

@mvollmer
Member Author

Green as it gets: fedora-38/devel fails while posting the coverage comments, and testing-farm:centos-stream-9-x86_64:self fails with Stratis.

@jelly
Member

jelly commented Oct 30, 2023

})(["January 1, 2037", ["check-journal", "AFTER BOOT", ""], ["", "Reboot", ""], ["check-journal", "BEFORE BOOT", ""]])): Uncaught (in promise) Error: condition did not become true

Blegh, this is a Python 3.9 specific issue; on 3.11 this works fine.

It's not the [ or ], it's the -:

re.error: bad character range k-" at position 36

fnmatch.fnmatchcase("lol", '})(["January 1, 2037", ["check-"]')

@mvollmer mvollmer requested a review from jelly October 30, 2023 14:20
@mvollmer
Member Author

It's not the [ or ], it's the -

I guess it is the - inside [ ].

@jelly
Member

jelly commented Oct 30, 2023

It's not the [ or ], it's the -

I guess it is the - inside [ ].

It is, sent cockpit-project/bots#5475.

Member

@jelly jelly left a comment


Still needs some screenshots for the Wednesday release.

@mvollmer mvollmer merged commit 1e2e3f4 into cockpit-project:main Oct 30, 2023
95 of 97 checks passed