Ceph-mon verifier
=================

Currently, the "reboot" and "shutdown" actions are supported, and both perform
the same set of checks:

* check minimum Juju version
* check Ceph Monitor quorum
* check Ceph clusters health

::

  $ juju-verify reboot --unit ceph-mon/0
  ===[ceph-mon/0]===
  Checks:
  [OK] check_affected_machines check passed
  [OK] check_has_sub_machines check passed
  [OK] Minimum juju version check passed.
  [OK] Ceph-mon quorum check passed.
  [OK] ceph-mon/0: Ceph cluster is healthy
  Result: OK (All checks passed)

check Juju version
------------------

Ceph-mon verification relies on Juju features introduced in 2.8.10. If this
minimum version requirement is not met, the verification stops immediately and
returns a ``Failed`` result (the same behavior as if juju-verify were run with
the ``--stop-on-failure`` flag).

Example response when the check fails because the Juju agent on the unit runs
version 2.7.5:

::

  [FAIL] Juju agent on unit ceph-mon/0 has lower than minimum required version. 2.7.5 < 2.8.10

Conversely, if the minimum version requirement is met:

::

  [OK] Minimum juju version check passed.

check Ceph Monitor quorum
-------------------------

This check verifies that the intended action will not take down so many
monitors that quorum is lost in any affected Ceph cluster. A majority of the
Ceph monitors must stay alive after the change to maintain quorum.

If rebooting or shutting down the ceph-mon unit(s) does not endanger the
monitor quorum, the following message is displayed:

::

  [OK] Ceph-mon quorum check passed.

Conversely, if rebooting or shutting down the unit(s) would cause a loss of
Ceph quorum:

::

  [FAIL] Rebooting or shutting down the unit ceph-mon/0 will lose ceph-mon quorum

The check can also fail when the quorum status cannot be parsed from the output
of the "get-quorum-status" action. In that case, the following message is
shown, including the action ID:

::

  [FAIL] Failed to parse quorum status from action 24.

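The majority rule behind the quorum check can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not
juju-verify's actual implementation: the function name ``quorum_survives`` and
its arguments are hypothetical, and the sketch assumes that quorum requires a
strict majority of all monitors known to the cluster.

```python
def quorum_survives(total_monitors: int, affected_monitors: int) -> bool:
    """Return True if a strict majority of monitors stays up.

    Hypothetical helper for illustration; not part of juju-verify.
    """
    if total_monitors <= 0:
        raise ValueError("total_monitors must be positive")
    if not 0 <= affected_monitors <= total_monitors:
        raise ValueError("affected_monitors is out of range")

    remaining = total_monitors - affected_monitors
    # Strict majority: more than half of all monitors must remain.
    # Integer arithmetic avoids any floating-point comparison.
    return remaining * 2 > total_monitors
```

Under this assumption, rebooting one monitor out of three keeps quorum, while
rebooting two out of three, or two out of four, does not.
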

check Ceph clusters health
--------------------------

This check runs the ``get-health`` action on one of the targeted ``ceph-mon``
units to get the cluster's health. If the targeted units belong to multiple
Juju applications, the ``get-health`` action is run on one unit per
application.

A cluster is considered healthy if the action's output contains ``HEALTH_OK``.
All affected clusters must be healthy for verification to succeed. A successful
result message looks like this:

::

  [OK] ceph-mon/1: Ceph cluster is healthy

The check fails if the output does not contain ``HEALTH_OK``. A Ceph cluster is
marked as unhealthy if the output contains ``HEALTH_WARN`` or ``HEALTH_ERR``,
and as being in an unknown state if it contains none of these expressions.

::

  [FAIL] ceph-mon/1: Ceph cluster is in a warning state HEALTH_WARN too few PGs per OSD (8 < min 30)

A Ceph cluster can be unhealthy for many reasons, and not all of them can be
listed here. For more information, see `ceph-monitoring`_.

.. _ceph-monitoring: https://docs.ceph.com/en/pacific/rados/operations/monitoring/

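The classification rules of the health check can be sketched as follows. This
is a minimal illustration, not juju-verify's actual code: the names
``classify_health`` and ``all_clusters_healthy`` are hypothetical, and the
sketch assumes the ``get-health`` output is available as plain text keyed by
unit name.

```python
from typing import Dict


def classify_health(output: str) -> str:
    """Map raw health output to 'healthy', 'unhealthy' or 'unknown'.

    Hypothetical helper for illustration; not part of juju-verify.
    """
    if "HEALTH_OK" in output:
        return "healthy"
    if "HEALTH_WARN" in output or "HEALTH_ERR" in output:
        return "unhealthy"
    # None of the known markers was found in the output.
    return "unknown"


def all_clusters_healthy(outputs: Dict[str, str]) -> bool:
    """Return True only if every checked cluster reports HEALTH_OK.

    `outputs` maps a unit name (one per application) to its health output.
    """
    return all(classify_health(text) == "healthy" for text in outputs.values())
```

With these rules, a single cluster reporting ``HEALTH_WARN`` is enough to make
the combined result fail, which matches the requirement that all affected
clusters be healthy.
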