Ceph-mon verifier
=================

So far, the ``reboot`` and ``shutdown`` actions are supported, and both perform the
same set of checks:

* check minimum Juju version
* check Ceph Monitor quorum
* check Ceph clusters health

::

  $ juju-verify reboot --unit ceph-mon/0
  ===[ceph-mon/0]===
  Checks:
  [OK] check_affected_machines check passed
  [OK] check_has_sub_machines check passed
  [OK] Minimum juju version check passed.
  [OK] Ceph-mon quorum check passed.
  [OK] ceph-mon/0: Ceph cluster is healthy

  Result: OK (All checks passed)


check Juju version
------------------

Ceph-mon verification relies on Juju features introduced in 2.8.10. If this minimum
version requirement is not met, the verification stops immediately and returns a
``Failed`` result (the same behavior as running juju-verify with the
``--stop-on-failure`` flag).

Example of the response when the check fails because the Juju agent is at version 2.7.5:

::

  [FAIL] Juju agent on unit ceph-mon/0 has lower than minimum required version. 2.7.5 < 2.8.10

Conversely, if the minimum version requirement is met:

::

  [OK] Minimum juju version check passed.
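
The comparison behind this check can be sketched as follows. This is a minimal
illustration, not the actual juju-verify implementation: the helper names
``parse_version`` and ``meets_minimum`` are hypothetical, and the sketch assumes plain
dotted numeric versions such as ``2.8.10``. Versions are compared as tuples of integers
because a plain string comparison would rank ``2.8.9`` above ``2.8.10``.

.. code-block:: python

  from typing import Tuple

  MINIMUM_VERSION = "2.8.10"  # minimum version stated in this document


  def parse_version(version: str) -> Tuple[int, ...]:
      """Turn a dotted version such as "2.8.10" into (2, 8, 10).

      Assumption: the version has only numeric components.
      """
      return tuple(int(part) for part in version.split("."))


  def meets_minimum(version: str, minimum: str = MINIMUM_VERSION) -> bool:
      """Return True if ``version`` is at least ``minimum``."""
      return parse_version(version) >= parse_version(minimum)

With these helpers, ``meets_minimum("2.7.5")`` is false and ``meets_minimum("2.8.10")``
is true.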


check Ceph Monitor quorum
-------------------------

This check verifies that the intended action leaves a majority of the monitors alive in
each affected Ceph cluster. A majority of Ceph monitors must be kept alive after the
change to maintain quorum.
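
The majority rule can be sketched with a few lines of arithmetic. This is a hedged
illustration rather than the code juju-verify runs: the function name
``quorum_survives`` is hypothetical, and the sketch assumes that every known monitor is
currently online and that quorum needs a strict majority of the known monitors.

.. code-block:: python

  def quorum_survives(total_monitors: int, removed_monitors: int) -> bool:
      """Return True if a strict majority of monitors remains after the change.

      Assumption: all ``total_monitors`` are online before the change.
      """
      remaining = total_monitors - removed_monitors
      # A strict majority means more than half of the known monitors.
      return remaining > total_monitors // 2

Under these assumptions, a cluster of three monitors tolerates the loss of one monitor
but not two, and a cluster of four monitors also tolerates the loss of only one.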

If the reboot/shutdown of the ceph-mon unit(s) from the Ceph cluster does not endanger
the monitor quorum, the following message is displayed:

::

  [OK] Ceph-mon quorum check passed.

Conversely, if rebooting or shutting down the unit(s) would cause a loss of Ceph quorum:

::

  [FAIL] Rebooting or shutting down the unit ceph-mon/0 will lose ceph-mon quorum

The check can also fail if the result cannot be read from the output of the
``get-quorum-status`` action. In this case, the following message is displayed along
with the action ID:

::

  [FAIL] Failed to parse quorum status from action 24.


check Ceph clusters health
--------------------------

This check runs the ``get-health`` action on one of the targeted ``ceph-mon`` units to
get the cluster's health. If the targeted units belong to multiple Juju applications,
the ``get-health`` action is run on one unit per application. A cluster is considered
healthy if the action's output contains ``HEALTH_OK``. All affected clusters must be
healthy for verification to succeed.
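
Selecting one unit per application can be sketched as shown here. The function name
``one_unit_per_application`` and the application name ``ceph-mon-b`` are hypothetical,
and picking the first unit seen for each application is an assumption made for
illustration; the document only states that one unit per application is used. The
sketch relies on Juju unit names having the form ``<application>/<number>``.

.. code-block:: python

  from typing import Dict, Iterable, List


  def one_unit_per_application(units: Iterable[str]) -> List[str]:
      """Pick the first listed unit of each application.

      Assumption: unit names look like "<application>/<number>".
      """
      chosen: Dict[str, str] = {}
      for unit in units:
          application = unit.split("/", 1)[0]
          # Keep only the first unit encountered for each application.
          chosen.setdefault(application, unit)
      return list(chosen.values())


  # "ceph-mon-b" is a made-up second application used only for this example.
  targets = one_unit_per_application(["ceph-mon/0", "ceph-mon/1", "ceph-mon-b/3"])

Here ``targets`` contains ``ceph-mon/0`` and ``ceph-mon-b/3``, so the health action
would run once for each of the two applications.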

The successful result message should look like this:

::

  [OK] ceph-mon/1: Ceph cluster is healthy


On the other hand, the check fails if the output does not contain ``HEALTH_OK``. A Ceph
cluster will be marked as unhealthy if the output contains ``HEALTH_WARN`` or
``HEALTH_ERR``, and in an unknown state if it does not contain any of the above
expressions.

::

  [FAIL] ceph-mon/1: Ceph cluster is in a warning state
    HEALTH_WARN too few PGs per OSD (8 < min 30)
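
The classification described above can be sketched as a simple substring test. This is
an illustration under stated assumptions, not the actual juju-verify code: the function
name ``classify_health`` and the returned labels are hypothetical, and the sketch checks
for ``HEALTH_OK`` first.

.. code-block:: python

  def classify_health(output: str) -> str:
      """Map the text output of the health action to a coarse state label."""
      if "HEALTH_OK" in output:
          return "healthy"
      if "HEALTH_WARN" in output or "HEALTH_ERR" in output:
          return "unhealthy"
      # None of the known health markers were found in the output.
      return "unknown"

For the warning output shown above, ``classify_health`` returns ``"unhealthy"``. An
empty output is reported as ``"unknown"``.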

A Ceph cluster can be unhealthy for many reasons, and not all of them can be listed
here. For more information, see `ceph-monitoring`_.


.. _ceph-monitoring: https://docs.ceph.com/en/pacific/rados/operations/monitoring/