You’ve certainly heard it before: “What gets measured gets done.”
It’s true: what you observe and measure is what you can improve.
The key to any improvement is to first identify what to measure and then collect the related metrics. Using those metrics, you can tune the underlying work and analyze the effectiveness of any changes. Then repeat the cycle until you’ve sufficiently improved.
At Couchbase, we needed to improve some of our day-to-day operations, so we created observability dashboards to help us identify issues and track improvement. We used a combination of Prometheus, which simplifies storing and querying time-series data, and Grafana, which can be used to make stunning data visualizations. In addition, we used Couchbase to store historical data for later use with its Full-Text Search and Analytics tools.
In this article, we’ll walk you through how to build your own observability dashboard using Prometheus, Grafana and Couchbase.
Your in-house data source pipelines may vary – as might your data visualization software. However, the steps we’ll show you today should be applicable across a number of tools and deployments.
Generic Observability Dashboard: Design & Architecture
In order to build a reusable and scalable tool, it’s better to work from common designs and templates as a first step. From there, you can customize as needed. With this approach, it’s quick and easy to develop future dashboards.
The diagram below shows the generic architecture of the observability dashboards we’ll build together:
In this architecture, two different data inputs form an interface to the dashboard service. Let’s take a closer look at each of these below.
- JSON metadata about the dashboard:
  - Data source definitions, including information about the data sources (like DB URL, SQL, credentials), file paths, and Jenkins artifact URLs.
  - The Grafana layout template (or visual dashboard view), which we’ll design first and then use as a template for panels in our later dashboards.
- The actual data source files, from .json and .csv files and from Couchbase:
  - The design of these observability dashboards supports various data sources like Couchbase Server, as well as direct files like JSON documents and CSV (Comma Separated Values) files. You can extend the dashboard proxy service code (in dashboard.py) to parse other data formats as needed.
The expected output is a Grafana dashboard UI plus Prometheus time-series metrics collected from the two inputs listed above. The central part of the diagram shows the different services that support the creation of the dashboards.
Let’s take a closer look at the different facets and services included in the architecture diagram:
- Dashboard proxy service:
  - This is a generic Python Flask web app service (dashboard.py) that interacts with the Grafana service to serve tabular data and other APIs through /query, /add, /import and /export endpoints. You can develop a similar one that keeps a generic template (JSON) for the Grafana panels and attaches the graph data points and tabular data points as target JSON to display on your Grafana dashboard.
- Prometheus exporter service:
  - This is a custom Prometheus exporter (say prometheus.py) – a Flask web app service that connects to the data sources and serves requests from Prometheus itself. At a high level, it acts as a bridge between Prometheus and the data sources. Note that this service is only needed when the data needs to be maintained as a time series (many trends need this).
- Grafana service:
- This is the regular Grafana tool itself that you use to create panels and display them as dashboards.
- Prometheus service:
- This is the regular Prometheus tool itself that holds your metrics as time-series data.
- Alert Manager:
- Custom alert rules fire alerts when certain thresholds are met; the Alert Manager receives and routes these alerts.
- Other services:
- Couchbase: You might already be using this NoSQL document database, but if not, you can install it through a container or directly on a different host. Couchbase stores your data as JSON documents, or you can have it store required fields as separate documents for historic trends while preparing your health or trend data.
- Docker: You’ll need to install the Docker engine on the host in order to use this containerized service deployment.
Sample Dashboard JSON Structure
Below is a sample of the structure of both the input metadata and the input data sources: the input metadata is a JSON document describing the dashboard, and the data source definitions include entries such as a Couchbase source. An illustrative sketch follows.
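As a purely illustrative sketch (not the project’s actual targets.json format), the metadata could combine dashboard information with data source definitions. The type and source fields below mirror the fields handled by the dashboard proxy code shown later; the remaining field names and values are placeholders:

{
  "dashboard": "functional-regression-cycles",
  "targets": [
    {
      "type": "table",
      "source": "couchbase",
      "host": "<ip-address>",
      "query": "SELECT ... FROM `server` WHERE ..."
    },
    {
      "type": "timeseries",
      "source": "csv",
      "url": "http://<jenkins-host-job-url>/lastSuccessfulBuild/artifact/vm_health_info.csv/"
    }
  ]
}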
Deploying the Observability Dashboard Services
Use the docker-compose file below to bring up all of the required services – e.g., dashboard proxy, Grafana, Prometheus, exporter, Alert Manager – that appear in the architecture diagram above for our observability dashboards. You can install Couchbase on a different host to store your growing high-volume data.
To bring up: docker-compose up
Next, visit http://host:3000 for the Grafana page.
To bring down: docker-compose down
version: "2"
services:
  dashboard:
    restart: unless-stopped
    build: ../../
    ports:
      - 5001:5000
    environment:
      - GRAFANA_HOST=http://admin:password@grafana:3000
    volumes:
      - ./config/targets.json:/app/targets.json
  grafana:
    image: grafana/grafana:8.0.1
    restart: unless-stopped
    volumes:
      - ./config/grafana:/var/lib/grafana
    environment:
      GF_INSTALL_PLUGINS: "simpod-json-datasource,marcusolsson-csv-datasource,ae3e-plotly-panel"
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS: "ae3e-plotly-panel"
      GF_RENDERING_SERVER_URL: http://renderer:8081/render
      GF_RENDERING_CALLBACK_URL: http://grafana:3000/
    ports:
      - 4000:3000
  renderer:
    image: grafana/grafana-image-renderer:latest
  prometheus:
    restart: unless-stopped
    image: prom/prometheus
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./config/alert.rules.yml:/etc/prometheus/alert.rules.yml
  exporter:
    restart: unless-stopped
    build: ../../exporter
    volumes:
      - ./config/queries.json:/app/queries.json
  alertmanager:
    restart: unless-stopped
    image: prom/alertmanager
    ports:
      - 9093:9093
    volumes:
      - ./config/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - ./config/alert_templates:/etc/alertmanager/templates
The contents of the service reference files above – or snippets of them, for brevity – can be found in the implementation section below.
Using these tools, you can create a wide variety of dashboards to suit your requirements. We’ll walk through three types of example dashboards to give you an idea of what’s possible.
Example Dashboards: Overview
# | Dashboard | Measurements | Metrics |
1 | Functional Regression Testing Cycles dashboards | Trends among functional regression testing cycles at both the build level & component level | total tests, passed, failed, aborts, total time, fresh run time, etc. |
2 | Infra VMs usage dashboards, including Static VMs & Dynamic VMs | Resources utilization & history | active count, available count, compute hours/max/created per day, week, month |
3 | Infra VMs Health dashboards, Static Servers, Jenkins Slaves VMs | VM health monitoring, alerts & history tracking of VMs | ssh_fail, pool_os vs real_os, cpu-memory-disk-swap usages, file descriptors, firewall rules, pool_mac_address vs real_mac_address, booted days, total and product processes, installed app versions and services etc. |
Dashboard #1: Functional Regression Testing Cycles Dashboard
Problem: Before we created this dashboard, there were no trend graphs for the regression test cycles – nothing showing metrics like total time taken, pass rate, fresh runs vs. reruns (e.g., due to infrastructure issues), or the inconsistent number of aborts and failures – and no separate component- or module-level trends.
Solution: The plan was to create a run analyzer script that analyzes the test data already stored in the Couchbase bucket. From that, we get the time-series data for the last n builds and the targeted metrics for each build.
Dashboard snapshots:
Dashboard #2: Infrastructure Resources / VMs Usage Dashboard
Problem: Prior to building this dashboard, we had a large number of static and dynamic virtual machines but there was no tracking of how the hardware resources were utilized. We had no insight into metrics such as active VMs used at the time, available count, machine time used, or compute hours on a daily, weekly or monthly basis.
Solution: Our plan was to first collect the data for all the VMs, such as dynamically allocated and released IPs, exact creation and release times, as well as any groupings such as pools, etc. Most of this data already existed in Couchbase Server (managed by the respective service managers). Using the flexibility of the SQL++ query language (aka N1QL), we were able to extract that data into a format suitable for the graphs we wanted to show in this observability dashboard.
Dashboard snapshots:
Dashboard #3: Infrastructure VMs Health Dashboard
Problem: Before we had this dashboard, regression test runs were failing inconsistently, and there were low-hanging issues with the VMs. Common issues included SSH failures, OS mismatches, VM IP switches, too many open files, swap issues, needed reboots, duplicate IPs among multiple runs, high memory usage, full disks (/ or /data), firewall rules blocking endpoint connections, and Jenkins slave issues due to high memory and disk usage. There was no observability dashboard to observe these metrics and no checks or alerts for the health of the test infrastructure.
Solution: We decided to create an automatic periodic health check that captures metrics data for the targeted VMs, such as ssh_fail, pool_os vs real_os, cpu-memory-disk-swap usages, file descriptors, firewall rules, pool_mac_address vs real_mac_address, booted days, total and Couchbase processes, and installed Couchbase version and services (in sum, we captured ~50 metrics). These metrics are exposed as a Prometheus endpoint that is displayed in Grafana, and the information is also saved in Couchbase for future data analysis. Alerts were also created to monitor the key health metrics for issues, allowing for quick intervention and ultimately increased stability of the test runs.
Dashboard snapshots:
Implementation
So far, you’ve seen the high-level architecture of the observability dashboards, what services are required, what kind of dashboards you might need, and also how to deploy these services. Now, it’s time to look at some implementation details.
Our first stop is the collection and storage of metrics and the data visualization of the dashboards. Most of the data storage and display steps are similar for all use cases, but the metrics data collection depends on which metrics you choose to target.
How to Get Health Data for Your Dashboards
For the infrastructure monitoring use case, you have to collect various health metrics from hundreds of VMs to create a stable infrastructure.
For this step, we created a Python script that does the SSH connection to the VMs in parallel (multiprocessing pool) and collects the required data. In our case, we also have a Jenkins job that periodically runs this script and collects the health data (CSV), and then saves it to the Couchbase database.
The reason we created this custom script – rather than using the readily available node exporter provided by Prometheus – is that some of the required metrics were not supported. In addition, this solution was simpler than deploying and maintaining new software on 1,000+ servers. The code snippet below shows some of the checks being done at the VM level.
def check_vm(os_name, host):
    client = SSHClient()
    client.set_missing_host_key_policy(AutoAddPolicy())
    ...
    cpus = get_cpuinfo(client)
    meminfo = get_meminfo(client)
    diskinfo = get_diskinfo(client)
    uptime = get_uptime(client)
    ...
    return ssh_status, '', ssh_resp_time, real_os_version, cpus, meminfo, diskinfo, uptime, uptime_days, systime, cpu_load, cpu_total_processes, fdinfo, \
        iptables_rules_count, mac_address, swapinfo, cb_processes, cb_version, cb_running_serv, cb_ind_serv

def get_cpuinfo(ssh_client):
    return ssh_command(ssh_client, "cat /proc/cpuinfo |egrep processor |wc -l")

def get_meminfo(ssh_client):
    return ssh_command(ssh_client, "cat /proc/meminfo |egrep Mem |cut -f2- -d':'|sed 's/ //g'|xargs|sed 's/ /,/g'|sed 's/kB//g'")

def get_diskinfo(ssh_client):
    return ssh_command(ssh_client, "df -ml --output=size,used,avail,pcent / |tail -1 |sed 's/ \+/,/g'|cut -f2- -d','|sed 's/%//g'")

def get_uptime(ssh_client):
    return ssh_command(ssh_client, "uptime -s")

def get_cpu_users_load_avg(ssh_client):
    return ssh_command(ssh_client, "uptime |rev|cut -f1-4 -d','|rev|sed 's/load average://g'|sed 's/ \+//g'|sed 's/users,/,/g'|sed 's/user,/,/g'")

def get_file_descriptors(ssh_client):
    return ssh_command(ssh_client, "echo $(cat /proc/sys/fs/file-nr;ulimit -n)|sed 's/ /,/g'")

def get_mac_address(ssh_client):
    return ssh_command(ssh_client, "ifconfig `ip link show | egrep eth[0-9]: -A 1 |tail -2 |xargs|cut -f2 -d' '|sed 's/://g'`|egrep ether |xargs|cut -f2 -d' '")

def get_mac_address_ip(ssh_client):
    return ssh_command(ssh_client, "ip a show `ip link show | egrep eth[0-9]: -A 1 |tail -2 |xargs|cut -f2 -d' '|sed 's/://g'`|egrep ether |xargs|cut -f2 -d' '")
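The checks above run against each VM over SSH; the wrapper that fans them out in parallel is not shown. Below is a minimal sketch of what such a wrapper might look like – the pool size, host-list handling, and CSV columns are illustrative assumptions rather than the exact script we run:

from multiprocessing import Pool
import csv

def collect_health(hosts, os_name="centos", workers=32):
    # Run check_vm() for every host in parallel using a multiprocessing pool
    with Pool(processes=workers) as pool:
        results = pool.starmap(check_vm, [(os_name, host) for host in hosts])
    return dict(zip(hosts, results))

def write_health_csv(results, path="vm_health_info.csv"):
    # Write one row per VM; a Jenkins job can then archive this CSV as a build artifact
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ipaddr", "ssh_status", "ssh_error", "ssh_resp_time"])  # truncated header
        for host, values in results.items():
            writer.writerow([host, *values[:3]])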
The code snippet below shows how to connect to Couchbase using the Python SDK 3.x and perform key-value operations – getting a document or saving a document in the database.
# Imports for the Couchbase Python SDK 3.x (module paths may vary slightly across 3.x minor versions)
from datetime import timedelta
from couchbase.cluster import Cluster, ClusterOptions, ClusterTimeoutOptions
from couchbase.auth import PasswordAuthenticator

try:
    self.cb_cluster = Cluster("couchbase://"+self.cb_host, ClusterOptions(PasswordAuthenticator(self.cb_username, self.cb_userpassword), \
        timeout_options=ClusterTimeoutOptions(kv_timeout=timedelta(seconds=10))))
    self.cb_b = self.cb_cluster.bucket(self.cb_bucket)
    self.cb = self.cb_b.default_collection()
except Exception as e:
    print('Connection Failed: %s ' % self.cb_host)
    print(e)

def get_doc(self, doc_key, retries=3):
    # ..
    return self.cb.get(doc_key)

def save_doc(self, doc_key, doc_value, retries=3):
    #...
    self.cb.upsert(doc_key, doc_value)
    #...
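Assuming the snippet above lives in a small wrapper class (hypothetically called CouchbaseStore here), saving and retrieving a health document would look something like this; the key format and fields are illustrative only:

# Hypothetical usage of the wrapper class holding the connection code above
store = CouchbaseStore()
store.save_doc("vm_health::172.23.100.10::2021-06-01", {"ssh_status": 1, "disk_use": 42})
doc = store.get_doc("vm_health::172.23.100.10::2021-06-01")
print(doc.content_as[dict])  # GetResult content accessor in SDK 3.x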
Implementing the Dashboard Proxy Service
For the tests observability use cases, the data is in a Jenkins artifact URL and also in Couchbase Server. To bridge these multiple data sources together (CSV, DB), we created a proxy API service that would accept requests from Grafana and return the data format understood by Grafana.
The code snippets below give the implementation and service preparation details.
dashboard.py
# Dashboard API service
@app.route("/query", methods=['POST'])
def query():
    """
    /query responds to a Grafana data request and is formatted as
    either data points for time series data or rows and columns for tabular data
    """
    for target in request.json['targets']:
        data_type = target['type']
        if data_type == "timeseries":
            datapoints = calculate_datapoints(target)
        elif data_type == "table":
            datapoints = calculate_rows_and_columns(target)
    ...

def calculate_datapoints(target):
    """
    Returns data in a time series format
    datapoints is formatted as a list of 2 item tuples in the format [value, timestamp]
    """
    ...
    if target['source'] == "couchbase":
        ...
    elif target['source'] == "json":
        ...
    elif target['source'] == "csv":
        ...
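The JSON datasource plugin expects tabular responses as a list of objects with columns and rows fields. As a hedged sketch of what calculate_rows_and_columns could return – the shape follows the plugin's table format, while the column names and row values below are placeholders, not real results and not the project's exact implementation:

def calculate_rows_and_columns(target):
    # Illustrative only: the real implementation reads the CSV/JSON/Couchbase source named in `target`
    return [{
        "type": "table",
        "columns": [{"text": "build", "type": "string"},
                    {"text": "pass_rate", "type": "number"}],
        "rows": [["7.0.0-0001", 0.0]]  # placeholder row, not real data
    }]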
Dockerfile
FROM ubuntu:latest
ENV DEBIAN_FRONTEND "noninteractive"
RUN apt-get update -y && apt-get install -y python3-dev python3-pip python3-setuptools cmake build-essential
RUN mkdir /app
COPY ./requirements.txt /app
WORKDIR /app
RUN pip3 install -r requirements.txt
COPY ./dashboard.py /app
COPY ./entrypoint.sh /app
ENTRYPOINT ["./entrypoint.sh"]
entrypoint.sh
#!/bin/bash
python3 dashboard.py $GRAFANA_HOST
requirements.txt
couchbase==3.0.7
Flask==1.1.2
requests==2.24.0
How to Get the Tabular Data in Grafana
Grafana is a great tool for viewing time-series data. However, sometimes you want to show some non-time-series data in the same interface.
We achieved this goal using the Plotly plugin, which is based on a JavaScript graphing library. Our main use case was to illustrate trends across a variety of important metrics for our weekly regression testing runs. Some of the most important metrics we wanted to track were pass rate, the number of tests, aborted jobs, and total time taken. Since the release of Grafana 8, there is limited support for bar graphs. At the time of writing, the bar graph functionality is still in beta and doesn’t offer all of the features we require, such as stacking.
Our goal was to support generic CSV/JSON files or a Couchbase SQL++ query and view the data as a table in Grafana. For maximum portability, we wanted to have a single file that would define both the data sources and Grafana template layout together.
To display the tabular data, there are two viable options:
- Write a UI plugin for Grafana
- Provide a JSON proxy using the JSON datasource plugin
We chose option 2 for our implementation, since it seemed simpler than trying to learn the Grafana plugin tools and creating a separate UI plugin for the configuration.
Note that since finishing this project, a new plugin has been released that allows you to add CSV data to Grafana directly. If viewing tabular data from a CSV is your only requirement, then this plugin is a good solution.
Implementing the Prometheus Service
prometheus.yml
# Prometheus global config
global:
  scrape_interval: 1m # Set the scrape interval to every 1 minute (the Prometheus default).
  scrape_timeout: 30s

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "alert.rules.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "automation_exporter"
    static_configs:
      - targets: ["exporter:8000"]
alert.rules.yml
groups:
  - name: alert.rules
    rules:
      - alert: PoolVMDown
        expr: vm_health_ssh_status == 0
        for: 1m
        annotations:
          title: "Server Pool VM {{ $labels.ipaddr }} SSH Failed"
          description: "{{ $labels.ipaddr }} SSH failed with error: {{ $labels.ssh_error }}."
        labels:
          severity: "critical"
      - alert: PoolVMHighDiskUsage
        expr: disk_usage >= 95
        for: 1m
        annotations:
          title: "Server Pool VM {{ $labels.ipaddr }} high disk usage"
          description: "{{ $labels.ipaddr }} has disk usage of {{ $value }}%"
        labels:
          severity: "critical"
How to Get Custom Metrics through the Prometheus Exporter
Many cloud-native services integrate directly with Prometheus to allow centralized metrics collection for all of your services.
We wanted to see how we could utilize this technology to monitor our existing infrastructure. If you have services that don’t directly expose a Prometheus metrics endpoint, the way to solve this is to use an exporter. In fact, there is even a Couchbase exporter to expose all of the important metrics from your cluster. (Note: In Couchbase Server 7.0, a Prometheus endpoint is directly available, and internally, Couchbase 7 uses Prometheus for server stats collection and management to service the web UI.)
While creating our observability dashboards, we had various data stored in JSON files, in CSV files, and in Couchbase buckets. We wanted a way to expose all of this data and show it in Grafana both in tabular format and as time-series data using Prometheus.
Prometheus expects a simple line-based text output. Here’s an example from our server pool monitoring:
available_vms{pool="12hrreg"} 1
available_vms{pool="regression"} 16
Let’s take a closer look at how to implement data sources from both CSV files and from Couchbase directly.
CSV Files as Your Data Source
Each time Prometheus polls the endpoint, we fetch the CSV, and for each column, we expose a metric, appending labels for multiple rows if a label is supplied in the config.
For the above example, the CSV looks like:
pool,available_count
12hrreg,1
regression,16
Couchbase as Your Data Source
Each time Prometheus polls the endpoint, we execute the SQL++ queries defined in the config, and for each query, we expose a metric, appending labels for multiple rows if a label is supplied in the config.
Below is an example SQL++ response that produces the above metrics:
[{
    "pool": "12hrreg",
    "count": 1
},
{
    "pool": "regression",
    "count": 16
}]
This exporter Python service exposes a /metrics endpoint to be used by Prometheus. The metrics are defined in queries.json, which specifies which queries and CSV columns should be exposed as metrics. See the JSON snippet below (reduced for brevity) as an example.
queries.json
{
    "clusters": {
        "static_vms": {
            "host": "<ip-address>",
            "username": "Administrator",
            "password": "xxxx"
        },
        ...
    },
    "queries": [
        {
            "name": "available_vms",
            "cluster": "static_vms",
            "query": "SELECT poolId as `pool`, COUNT(*) AS count FROM (SELECT poolId FROM `QE-server-pool` WHERE IS_ARRAY(poolId)=FALSE and state='available' UNION ALL SELECT poolId FROM `QE-server-pool` UNNEST poolId where `QE-server-pool`.state = 'available' ) AS pools group by poolId",
            "description": "Available VMs for each server pool",
            "value_key": "count",
            "labels": ["pool"]
        },
        ...
    ],
    "csvs": {
        "vm_health": "http://<jenkins-host-job-url>/lastSuccessfulBuild/artifact/vm_health_info.csv/",
        ...
    },
    "columns": [
        {
            "name": "memory_usage",
            "csv": "vm_health",
            "description": "Memory usage",
            "column": "memory_use(%)",
            "labels": ["ipaddr"]
        },
        {
            "name": "disk_usage",
            "csv": "vm_health",
            "description": "Disk usage",
            "column": "disk_use%",
            "labels": ["ipaddr"]
        },
        {
            "name": "cpu_load_avg_5mins",
            "csv": "vm_health",
            "description": "CPU load average (5mins)",
            "column": "cpu_load_avg_5mins",
            "labels": ["ipaddr"]
        },
        {
            "name": "vm_health_ssh_status",
            "csv": "vm_health",
            "description": "SSH Status",
            "column": "ssh_status",
            "labels": ["ipaddr", "ssh_error", "pool_state", "couchbase_version", "pool_ids"]
        },
        ...
    ]
}
exporter.py
for option in config['queries'] + config["columns"]:
    log.info("Registered metrics collection for {}".format(option['name']))

def get_labels(row, option):
    rename_map = option.get("rename", {})
    return ["{}=\"{}\"".format(rename_map[label] if label in rename_map else label, row[label]) for label in option["labels"]]

def collect_cb(clusters, metrics, option):
    rows = clusters[option["cluster"]].query(option["query"]).rows()
    for row in rows:
        if len(option["labels"]) > 0:
            labels = get_labels(row, option)
            metrics.append("{}{{{}}} {}".format(
                option["name"], ",".join(labels), row[option["value_key"]]))
        else:
            metrics.append("{} {}".format(
                option["name"], row[option["value_key"]]))

def collect_csv(metrics, option):
    csvfile = requests.get(csvs[option["csv"]]).text.splitlines()
    reader = DictReader(csvfile)
    for row in reader:
        if option["column"] not in row or row[option["column"]] == "":
            continue
        if len(option["labels"]) > 0:
            labels = get_labels(row, option)
            metrics.append("{}{{{}}} {}".format(
                option["name"], ",".join(labels), row[option["column"]]))
        else:
            metrics.append("{} {}".format(
                option["name"], row[option["column"]]))

@app.route("/metrics")
def metrics():
    metrics = []
    clusters = {}
    for [cluster_name, options] in config['clusters'].items():
        if cluster_name not in clusters:
            try:
                clusters[cluster_name] = Cluster('couchbase://'+options['host'], ClusterOptions(
                    PasswordAuthenticator(options['username'], options['password'])))
            except Exception as e:
                log.warning("Couldn't connect to cluster {}".format(e))
            log.debug("Connected to {}".format(options['host']))
    for option in config["queries"] + config["columns"]:
        log.debug("Collecting metrics for {}".format(option["name"]))
        try:
            if "cluster" in option:
                collect_cb(clusters, metrics, option)
            elif "csv" in option:
                collect_csv(metrics, option)
            else:
                raise Exception("Invalid type")
        except Exception as e:
            log.warning("Error while collecting {}: {}".format(
                option["name"], e))
    return Response("\n".join(metrics), mimetype="text/plain")
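The snippet above omits the Flask wiring. A minimal sketch of how the exporter could be started so it listens on the port referenced by the automation_exporter job in prometheus.yml is below; the exact startup code in the project may differ:

from flask import Flask, Response

app = Flask(__name__)

# ... the /metrics route and collection helpers shown above ...

if __name__ == "__main__":
    # Port 8000 matches the "exporter:8000" target configured in prometheus.yml
    app.run(host="0.0.0.0", port=8000)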
Implementing the Alert Manager Service
Prometheus also supports alerting: it tracks specific alert expressions for you over time, and if an expression starts returning results, it triggers an alert.
For the example above, you could add an alert for when the regression pool has no servers available. If you specify the query as available_vms{pool="regression"} == 0, it will return a series when there are 0 available. Once added, Prometheus evaluates this for you (by default, every minute). If that is all you do, you can visit the Prometheus UI, and the Alerts tab will show you which alerts are firing. A sketch of such a rule is shown below.
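Such a rule could be added to the alert.rules.yml shown earlier. Here is a hedged sketch – the alert name, duration, and annotations are illustrative, not taken from our actual rules file:

- alert: RegressionPoolEmpty
  expr: available_vms{pool="regression"} == 0
  for: 5m
  annotations:
    title: "No available VMs in the regression server pool"
    description: "The regression pool has had 0 available VMs for 5 minutes."
  labels:
    severity: "warning"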
With the Alert Manager, you can take this a step further and connect communications services so that Prometheus can alert you via email or a Slack channel, for example, when an alert fires. This means you can be informed immediately via your preferred method when something goes wrong. At Couchbase, we set up alerts to be notified of high disk usage on servers as well as when servers could not be reached via SSH. See the example below:
alertmanager.yml
global:
  resolve_timeout: 1m
  smtp_from: qa@couchbase.com
  smtp_smarthost: mail-com.mail.protection.outlook.com:25

route:
  group_by: ["alertname"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 24h
  receiver: "infra-email"
  matchers:
    - alertname =~ "PoolVMDown|PoolVMOSMismatch|PoolVMHighDiskUsage|SlaveVMHighDiskUsage|SlaveVMHighDiskUsageData"

receivers:
  - name: "infra-email"
    email_configs:
      - to: jake.rawsthorne@couchbase.com,jagadesh.munta@couchbase.com
Conclusion
In conclusion, we hope you can learn from our experience of creating observability dashboards that help you home in on the metrics that matter most in your implementation or use case, with the power of data visualization.
In our case, this effort allowed us to find server infrastructure and test stability issues. Building dashboards also reduced the number of failed tests as well as the total regression time for multiple product releases.
We hope this walkthrough helps you build better observability dashboards in the future.
Also, we’d like to extend special thanks to Raju and the QE team for their feedback on improving the targeted metrics.