We would like to share our experience of using PMM for DB cluster monitoring. This is a great tool with useful dashboards that help us improve the availability and performance of our infrastructure.
We perform weekly reviews of our main cluster using these dashboards:
The main concept on this screen is ‘Load’, which is the average query execution time multiplied by queries per second. Essentially, this metric indicates how much of MySQL’s resources a query consumes. For example, a query that ran for 5 minutes but was executed only once won’t make it to the top.
Usually, we look at the top 10 queries.
You can click on each query digest and see more details about it.
Several times it has helped us detect inefficient queries, as well as cases where a query was executed far more often than anticipated.
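To make the ‘Load’ metric concrete, here is a minimal Python sketch of the formula as described above. It is not how PMM computes or stores the value internally; the `DigestStats` type, the function names, and the sample numbers are all hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DigestStats:
    """Aggregated statistics for one query digest over a time window."""
    digest: str            # normalized query text (hypothetical sample data)
    total_time_sec: float  # summed execution time within the window
    calls: int             # number of executions within the window


def load(stats: DigestStats, window_sec: float) -> float:
    """Average execution time multiplied by queries per second."""
    if stats.calls == 0:
        return 0.0
    avg_time = stats.total_time_sec / stats.calls
    qps = stats.calls / window_sec
    return avg_time * qps


def top_by_load(all_stats, window_sec, n=10):
    """Return the n digests with the highest load, highest first."""
    return sorted(all_stats, key=lambda s: load(s, window_sec), reverse=True)[:n]


WINDOW_SEC = 7 * 24 * 3600  # one week, matching a weekly review

# Hypothetical sample: one slow query run once, one fast query run very often.
SAMPLE = [
    DigestStats("SELECT * FROM big_report", total_time_sec=300.0, calls=1),
    DigestStats("SELECT name FROM users WHERE id = ?", total_time_sec=100_000.0, calls=5_000_000),
]

for s in top_by_load(SAMPLE, WINDOW_SEC):
    print(f"{load(s, WINDOW_SEC):.6f}  {s.digest}")
```

The product simplifies to total execution time divided by the window length, which is why the single 5-minute query ranks far below the 20 ms query that runs millions of times.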
A supplementary dashboard that displays basic OS metrics. We usually look here to examine moments of peak load and determine the performance bottleneck.
The same as (2), but it shows MySQL-specific metrics.
This is handy for staying aware of vulnerabilities in your version, configuration flaws, etc. The warnings in the screenshot above concern the minor Percona Server version (“Current version is 8.0.22, the latest available version is 8.0.29.”).
- Our last check is PMM Query errors, but we found its presentation inconvenient, so we replaced it with our own implementation, which collects MySQL query errors every 2 seconds and inserts them into ClickHouse for further inspection. The results look like this:
In the screenshot above we can see a lot of lock insert collisions (which are expected) and 15 DELETE deadlocks that we have yet to investigate.
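We don't show our collector here, but the core of any poller like this can be sketched in a few lines of Python. The sketch assumes each poll yields a snapshot of cumulative error counters and that only the increase since the previous poll gets stored. The snapshot layout (keys of MySQL error number and statement type), the function names, and the sample values are hypothetical; reading from MySQL and writing to ClickHouse are left out.

```python
from datetime import datetime, timezone


def error_deltas(previous, current):
    """Return the per-key increase between two cumulative counter snapshots.

    Keys that did not change are omitted. If a counter went down (for example
    because a server restart reset it), the current value is used as the delta.
    """
    deltas = {}
    for key, count in current.items():
        before = previous.get(key, 0)
        delta = count - before if count >= before else count
        if delta > 0:
            deltas[key] = delta
    return deltas


def to_rows(deltas, collected_at):
    """Flatten deltas into tuples suitable for a bulk insert."""
    return [
        (collected_at, errno, statement, count)
        for (errno, statement), count in sorted(deltas.items())
    ]


# Hypothetical snapshots keyed by (MySQL error number, statement type).
# 1062 is ER_DUP_ENTRY, 1205 is ER_LOCK_WAIT_TIMEOUT, 1213 is ER_LOCK_DEADLOCK.
prev_snapshot = {(1062, "INSERT"): 100, (1213, "DELETE"): 3}
curr_snapshot = {(1062, "INSERT"): 140, (1213, "DELETE"): 5, (1205, "UPDATE"): 1}

# A fixed sample timestamp; a real poller would use the time of the poll.
rows = to_rows(
    error_deltas(prev_snapshot, curr_snapshot),
    datetime(2022, 1, 1, tzinfo=timezone.utc),
)
for row in rows:
    print(row)
```

Storing deltas rather than raw counters keeps each stored row meaningful on its own, so summing rows over any time range gives the number of errors in that range.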