Ceph-mon verifier
Currently, the “reboot” and “shutdown” actions are supported, and both perform the same set of checks:
check minimum Juju version
check Ceph Monitor quorum
check Ceph clusters health
$ juju-verify reboot --unit ceph-mon/0
===[ceph-mon/0]===
Checks:
[OK] check_affected_machines check passed
[OK] check_has_sub_machines check passed
[OK] Minimum juju version check passed.
[OK] Ceph-mon quorum check passed.
[OK] ceph-mon/0: Ceph cluster is healthy
Result: OK (All checks passed)
check Juju version
Ceph-mon verification relies on Juju features introduced in 2.8.10. If this minimum
version requirement is not met, the verification stops and returns a Failed result
immediately (the same behavior as if juju-verify had been run with the
--stop-on-failure flag).
Example response when the check fails because the Juju agent on the unit runs version 2.7.5:
[FAIL] Juju agent on unit ceph-mon/0 has lower than minimum required version. 2.7.5 < 2.8.10
Conversely, if the minimum version requirement is met:
[OK] Minimum juju version check passed.
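The comparison behind this check can be illustrated with a minimal sketch. It is not taken from the juju-verify source: the function names are hypothetical, and it assumes purely numeric dotted versions such as 2.8.10 (no pre-release suffixes). Versions are compared as tuples of integers because a plain string comparison would rank "2.10.0" below "2.8.10".

```python
from typing import Tuple

MINIMUM_VERSION = "2.8.10"  # minimum Juju version stated in this section


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '2.8.10' into (2, 8, 10).

    Assumption: the version consists only of dot-separated integers.
    """
    return tuple(int(part) for part in version.split("."))


def meets_minimum(agent_version: str, minimum: str = MINIMUM_VERSION) -> bool:
    """Return True if agent_version is at least the minimum version."""
    return parse_version(agent_version) >= parse_version(minimum)


print(meets_minimum("2.7.5"))   # False
print(meets_minimum("2.8.10"))  # True
print(meets_minimum("2.10.0"))  # True
```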
check Ceph Monitor quorum
This check verifies that the intended action will not take down so many monitors that an affected Ceph cluster loses quorum. In each affected cluster, a majority of the Ceph monitors must remain alive after the change.
If rebooting or shutting down the ceph-mon unit(s) does not endanger the monitor quorum, the following message is displayed:
[OK] Ceph-mon quorum check passed.
Conversely, if rebooting or shutting down the unit(s) would cause a loss of Ceph quorum:
[FAIL] Rebooting or shutting down the unit ceph-mon/0 will lose ceph-mon quorum
The check also fails if the result cannot be read from the output of the “get-quorum-status” action. In this case, the following message is shown, including the action ID:
[FAIL] Failed to parse quorum status from action 24.
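The majority rule described in this section can be sketched as follows. This is an illustration under stated assumptions, not the juju-verify implementation: the function name and the monitor names are hypothetical, and the sketch does not show how the sets of known and online monitors are derived from the “get-quorum-status” output.

```python
from typing import Set


def keeps_quorum(known_mons: Set[str], online_mons: Set[str], affected: Set[str]) -> bool:
    """Return True if a strict majority of known monitors stays online.

    known_mons:  all monitors that are members of the cluster
    online_mons: monitors currently online
    affected:    monitors that the reboot/shutdown would take down
    """
    remaining = online_mons - affected
    # Exactly half is not a majority, hence the strict comparison.
    return len(remaining) > len(known_mons) // 2


mons = {"mon-a", "mon-b", "mon-c"}  # hypothetical monitor names
print(keeps_quorum(mons, mons, {"mon-a"}))           # True: 2 of 3 remain
print(keeps_quorum(mons, mons, {"mon-a", "mon-b"}))  # False: 1 of 3 remains
```

Note that in a cluster with an even number of monitors, taking down exactly half of them already loses quorum.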
check Ceph clusters health
This check runs the get-health action on one of the targeted ceph-mon units to get
the cluster’s health. If the targeted units belong to multiple Juju applications, the
get-health action is run on one unit per application. A cluster is considered healthy
if the action’s output contains HEALTH_OK. All affected clusters must be healthy for
verification to succeed.
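Selecting one unit per application can be sketched as below. This is only an illustration: the function name and the second application name (ceph-mon-backup) are hypothetical, and picking the first unit seen for each application is an assumption, since the text does not say which unit juju-verify chooses.

```python
from typing import Dict, Iterable


def one_unit_per_application(units: Iterable[str]) -> Dict[str, str]:
    """Map each application name to the first of its units in the input."""
    selected: Dict[str, str] = {}
    for unit in units:
        # Juju unit names have the form "<application>/<number>".
        application = unit.split("/", 1)[0]
        # Keep only the first unit seen for each application (assumption).
        selected.setdefault(application, unit)
    return selected


targets = ["ceph-mon/0", "ceph-mon/1", "ceph-mon-backup/3"]
print(one_unit_per_application(targets))
# {'ceph-mon': 'ceph-mon/0', 'ceph-mon-backup': 'ceph-mon-backup/3'}
```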
The successful result message should look like this:
[OK] ceph-mon/1: Ceph cluster is healthy
On the other hand, the check fails if the output does not contain HEALTH_OK. A Ceph
cluster is marked as unhealthy if the output contains HEALTH_WARN or HEALTH_ERR,
and as being in an unknown state if it contains none of these strings. For
example:
[FAIL] ceph-mon/1: Ceph cluster is in a warning state
HEALTH_WARN too few PGs per OSD (8 < min 30)
A Ceph cluster can be unhealthy for many reasons, not all of which can be listed here. For more information, see ceph-monitoring.
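The classification rules in this section can be summarized in a short sketch. The function name and the returned labels are hypothetical, and checking HEALTH_OK first when several markers appear in the same output is an assumption; this is not the code juju-verify itself runs.

```python
def classify_health(output: str) -> str:
    """Classify the text returned by the get-health action.

    Returns "healthy", "unhealthy", or "unknown" (hypothetical labels).
    """
    if "HEALTH_OK" in output:
        return "healthy"
    if "HEALTH_WARN" in output or "HEALTH_ERR" in output:
        return "unhealthy"
    # None of the known markers is present.
    return "unknown"


print(classify_health("HEALTH_OK"))                                     # healthy
print(classify_health("HEALTH_WARN too few PGs per OSD (8 < min 30)"))  # unhealthy
print(classify_health(""))                                              # unknown
```

Only the "healthy" outcome lets the verification succeed; both other outcomes fail the check.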