How to set up alerts using PMM

Percona Monitoring and Management allows you to set different alerts related to your infrastructure.  Before we start setting up alerting, lets make sure this option is enabled on your PMM Server. To do it go to Configuration —> PMM Settings —>  Advanced Settings —>  Alerting

PMM Settings
Advanced Settings Alerting

PMM presents several types of alerts, depending on used datasources and complexity of needed alerts: Percona templated alerts, Grafana managed alerts, Mimir or Loki alerts, Mimir or Loki recording rules. As far as PMM Server gives you the most necessary alert templates out of the box, we will use them to set up our alerting system. And that is a good point to start. You may create rule from a template to fire alert about node high memory or cpu utilization, availability of your databases, and some most common parameters of different databases usage. To see necessary template go to Alerting —> Alert rule templates section

Alert rule templates

and chose one you need

alert templates

After that edit the name of alert rule, set up such  parameters as threshold, alerting level, event duration and select a folder where your rule will be stored

Alert Rule Settings

If you need more complex or just different metrics alert, you may create your own rule template. Templates should be written in YAML format. Here is an example of such file

templates:
 - name: pmm_mongodb_high_memory_usage
   version: 1
   summary: Memory used by MongoDB
   expr: |-
      sum by (node_name) (mongodb_ss_mem_resident * 1024 * 1024)
      / on (node_name) (node_memory_MemTotal_bytes)
       * 100
       > [[ .threshold ]]
   params:
     - name: threshold
       summary: A percentage from configured maximum
       unit: "%"
       type: float
       range: [0, 100]
       value: 80
   for: 5m
   severity: warning
   labels:
      custom_label: demo
   annotations:
      summary: MongoDB high memory usage ({{ $labels.service_name }})
      description: |-
         {{ $value }}% of memory (more than [[ .threshold ]]%) is used
         by {{ $labels.service_name }} on {{ $labels.node_name }}

It should contain the next options:

  • name (required field): uniquely identifies template. Spaces and special characters are not allowed.
  • version (required): defines the template format version.
  • summary (required field): a template description.
  • expr (required field): a MetricsQL query string with parameter placeholders.
    • params: contains parameter definitions required for the query. Each parameter has a name, type, and summary. It also may have a unit, available range, and default value.
    • name (required): the name of the parameter. Spaces and special characters are not allowed.
    • summary (required): a short description of what this parameter represents.
    • unit (optional): PMM currently supports either s (seconds) or % (percentage).
    • type (required):PMM currently supports the float type. string, bool, and other types will be available in a future release.
    • range (optional): defines the boundaries for the value of a float parameter.
  • value (optional): default parameter value. Value strings must not include any of these special characters: < > ! @ # $ % ^ & * ( ) _ / \ ‘ + – = (space).
  • for (required): specifies the duration of time that the expression must be met before the alert will be fired.
  • severity (required): specifies default alert severity level.
  • labels (optional): are additional labels to be added to generated alerts.
  • annotations (optional): are additional annotations to be added to generated alerts.

After alert rules are created from build on or custom template, setup contact points. You may chose one or more notification channels like Slack, Email, Telegram and others.

Contact Points

You may also silence some of alerts on some timetable to avoid spamming your communication channels. There is special section maned Silences in Alerting menu section 

Silences

You may setup much more complex alerts, using different data sources with Grafana managed alerts. But this is out of this article topic. Percona templated alerts will cover the majority of cases. So PMM server can help you not only to watch you databases server but also will notify you if something is going wrong.

DB performance troubleshooting workflow using Percona PMM

DB performance troubleshooting workflow using Percona PMM

Some time you may face with the situation when your DB became slow. You may notice it during developing your system. Maybe database on staging worked fine, and after deploy to prod you realised that is works not so good as you expected. Or even maybe you didn’t make any changes and the performance of your DB started degrading.  In any case you probably would want to get to know the reason, to improve performance and to make your app or infrastructure better. And PMM can help you to investigate this situation. It gives you very informative visualisation of your databases state and tools for more deep analysis of how they are working.

Let’s try to troubleshoot performance issues, using Percona Monitoring and Management. There are two good entry points to start your investigation. 

Step 1. Check your resource limits

They are hardware resources and network limits. First of all take a look at Network IO dashboard on your home dashboards page.

Network IO

If you see that it is very close to your network throughput that means you should expand the your channel. The next thing is to pay attention to  CPU and Memory usage. You may see the whole dashboard, and if you discovered that almost all resources are utilised, you may get some more details about the load time and correlate it with some events in your system.

CPU Busy
CPU Busy Inspect
Inspect: CPU Busy

The same with Memory Limits.  Another common limit is a disk speed. If you see that read or write is close to disk limit, probably you should change it for faster one. 

Disk Reads/Writes

Step 2. Explore your DB work

If you just build your app and faced with resources limitation, the only way for you is to tune your hardware. But what if your current system performance began degrade without any  changes from your side? Or maybe everything is OK with resources limits, but your database is still slow? In this case lets deep dive into database metrics. 
First of all take a look on the database connections dashboard. If you see significant increase of connections try to realise if these are new customers, or maybe some bugs in your system. One more important metric is the number of queries per second. If you see that its number is much higher than usual, compare it with the number of connections. Increasing on connections number logically will increase the number of queries. But if you see that the number of queries per second is too high with the same number of connections, you should explore your system to find the reason of such behaviour. 

DB connections and queries per second dashboards

Increasing of customers number is a good sign for business, and if you see that your database fails to cope with it, it means that it is time to scale your system. But increasing of connections and queries numbers can also indicate that you have some bugs in your system, like unclosed connection, duplicated queries etc.

Step 3. Deep dive into queries

It is good if you have solved performance issues an didn’t get to this step. But what if your resources are fine, connections and queries are also in normal range, but you still face with performance issues. That is a time to deep dive into exploring of queries. PMM gives you good tool to do this — Query Analytics

PMM Query Analytics

On this dashboard you may see queries, ranked by Load. Load is a metric which defined as Average Active Queries which is a mix of query time execution and the query count. And you can see that possibly some too heavy queries executed too many times. Inspect them, try to understand if you can optimise query expressions or maybe the frequency of their execution. Try to specify query parameters and use received data more effectively. 

To conclude, we may say that PMM can’t solve your database performance issues. But it gives you very powerful tools to determine issues reasons. Using it you may find the weakest places of your system and improve them the most effectively and economically.