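Prometheus Monitoring in Practice: From Zero to Production Deployment
In modern microservice architectures, monitoring is key infrastructure for keeping services stable. Prometheus, a CNCF graduated project, has become the de facto standard for cloud-native monitoring. This article covers Prometheus's architecture, deployment practices, and performance tuning.
Prometheus Architecture Overview
Core Components
The configuration below shows the core pieces of a Prometheus server setup: global settings, rule files, and scrape jobs (static targets and Kubernetes service discovery):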
# prometheus.yml: core configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-west-2'
rule_files:
  - "rules/*.yml"
# Send firing alerts to Alertmanager (service name as used in docker-compose.yml)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    scrape_interval: 10s
    metrics_path: /metrics
    static_configs:
      - targets: ['node1:9100', 'node2:9100', 'node3:9100']
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
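The `keep` relabel rule above can be hard to reason about without seeing the matching logic. The following is a minimal sketch of how such a rule decides whether a target survives, assuming the documented behavior that source label values are joined with `;` and that the regex is fully anchored. The function name `keepTarget` is hypothetical and is not part of Prometheus.

```go
package main

import (
	"fmt"
	"regexp"
)

// keepTarget mimics the `keep` relabel action (hypothetical helper):
// the target survives only if the joined source label values fully
// match the pattern. Relabel regexes are anchored, so the pattern is
// wrapped in ^(?:...)$ here.
func keepTarget(labels map[string]string, sourceLabels []string, pattern string) (bool, error) {
	re, err := regexp.Compile("^(?:" + pattern + ")$")
	if err != nil {
		return false, err
	}
	// Source label values are concatenated with ";" (the default separator).
	joined := ""
	for i, name := range sourceLabels {
		if i > 0 {
			joined += ";"
		}
		joined += labels[name] // a missing label contributes an empty string
	}
	return re.MatchString(joined), nil
}

func main() {
	src := []string{"__meta_kubernetes_pod_annotation_prometheus_io_scrape"}
	pods := []map[string]string{
		{src[0]: "true"},   // annotated: kept
		{src[0]: "truthy"}, // anchored match fails: dropped
		{},                 // no annotation: dropped
	}
	for _, p := range pods {
		keep, err := keepTarget(p, src, "true")
		if err != nil {
			fmt.Println("bad pattern:", err)
			return
		}
		fmt.Println(p, "->", keep)
	}
}
```

Because the match is anchored, an annotation value such as `truthy` does not pass a `regex: "true"` rule.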
Data Model
Prometheus uses a multi-dimensional time series data model: each series is identified by a metric name plus a set of key-value labels.
# Sample metrics
http_requests_total{method="GET", handler="/api/users", status="200"} 1027
http_requests_total{method="POST", handler="/api/users", status="201"} 94
http_requests_total{method="GET", handler="/api/users", status="500"} 3
# Example queries
# HTTP request success rate (percent)
sum(rate(http_requests_total{status=~"2.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# P95 response time
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
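To make the P95 query above less of a black box, here is a simplified sketch of the idea behind `histogram_quantile`: find the bucket that contains the requested rank and interpolate linearly inside it. It is not the Prometheus implementation and skips several edge cases; the `bucket` type, `quantile` function, and the sample data are all hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket is one classic histogram bucket (hypothetical type).
type bucket struct {
	upperBound float64 // the `le` label value
	count      float64 // cumulative count (or rate) of observations <= upperBound
}

// exampleBuckets returns made-up data: 100 observations in total.
func exampleBuckets() []bucket {
	return []bucket{
		{0.1, 50}, {0.5, 90}, {1, 98}, {math.Inf(1), 100},
	}
}

// quantile estimates the q-quantile by linear interpolation within the
// bucket that contains rank q*total. Note that it sorts the input slice in place.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 || q < 0 || q > 1 {
		return math.NaN()
	}
	sort.Slice(buckets, func(i, j int) bool {
		return buckets[i].upperBound < buckets[j].upperBound
	})
	last := len(buckets) - 1
	total := buckets[last].count
	if total == 0 || !math.IsInf(buckets[last].upperBound, 1) {
		return math.NaN() // a +Inf bucket is required
	}
	rank := q * total
	// First finite bucket whose cumulative count reaches the rank.
	b := sort.Search(last, func(i int) bool { return buckets[i].count >= rank })
	if b == last {
		// Rank falls into the +Inf bucket: report the highest finite bound.
		return buckets[last-1].upperBound
	}
	start, prev := 0.0, 0.0
	if b > 0 {
		start, prev = buckets[b-1].upperBound, buckets[b-1].count
	}
	return start + (buckets[b].upperBound-start)*(rank-prev)/(buckets[b].count-prev)
}

func main() {
	for _, q := range []float64{0.5, 0.95, 0.99} {
		fmt.Printf("q=%.2f -> %.4f\n", q, quantile(q, exampleBuckets()))
	}
}
```

The estimate is only as precise as the bucket layout: with the sample data, any quantile that lands in the +Inf bucket is reported as the highest finite bound, which is why bucket boundaries should bracket the latencies you alert on.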
Production Deployment Architecture
High-Availability Deployment
# docker-compose.yml
# Two identical Prometheus replicas scrape the same targets independently
version: '3.8'
services:
  prometheus-1:
    image: prom/prometheus:v2.45.0
    container_name: prometheus-1
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
      - prometheus-1-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=50GB'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
      # The admin API exposes delete and snapshot endpoints; enable it only
      # when the port is not reachable from untrusted networks
      - '--web.enable-admin-api'
    restart: unless-stopped
  prometheus-2:
    image: prom/prometheus:v2.45.0
    container_name: prometheus-2
    ports:
      - "9091:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
      - prometheus-2-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=50GB'
    restart: unless-stopped
  alertmanager:
    image: prom/alertmanager:v0.25.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped
volumes:
  prometheus-1-data:
  prometheus-2-data:
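With two replicas exposed on ports 9090 and 9091, a client needs some way to pick one that is up. The sketch below shows one possible approach: probe each replica's `/-/healthy` endpoint in order and use the first that answers. The helper names `firstHealthy` and `httpProbe` are hypothetical, and the localhost URLs assume the port mappings from the Compose file above.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// firstHealthy returns the first replica for which probe reports healthy.
// The probe is injected so the selection logic can be tested without a network.
func firstHealthy(replicas []string, probe func(baseURL string) bool) (string, bool) {
	for _, r := range replicas {
		if probe(r) {
			return r, true
		}
	}
	return "", false
}

// httpProbe checks the Prometheus health endpoint with a short timeout.
func httpProbe(baseURL string) bool {
	client := http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(baseURL + "/-/healthy")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	// Assumed addresses, matching the published ports in docker-compose.yml.
	replicas := []string{"http://localhost:9090", "http://localhost:9091"}
	if r, ok := firstHealthy(replicas, httpProbe); ok {
		fmt.Println("querying", r)
	} else {
		fmt.Println("no healthy replica")
	}
}
```

In practice this role is usually played by a load balancer or a query layer in front of the replicas; the sketch only illustrates the selection logic.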
Alert Rule Configuration
# rules/alerts.yml
groups:
  - name: infrastructure
    rules:
      # rate() is used instead of irate(): irate() only looks at the last two
      # samples and is too volatile for alerting
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}"
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 85% on {{ $labels.instance }}"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space is running low"
          description: "Disk space is below 10% on {{ $labels.instance }} mount {{ $labels.mountpoint }}"
  - name: application
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 > 5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for more than 2 minutes"
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time detected"
          description: "95th percentile response time is above 500ms"
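Every rule above carries a `for:` clause, which delays firing until the condition has held continuously. The following sketch models that lifecycle (inactive, pending, firing) over a sequence of rule evaluations. It is a simplified illustration, not Prometheus code; the `evaluate` function and its parameters are hypothetical.

```go
package main

import "fmt"

type alertState string

const (
	inactive alertState = "inactive"
	pending  alertState = "pending"
	firing   alertState = "firing"
)

// evaluate takes one boolean per evaluation tick (true = expr returned a
// result) and reports the alert state after each tick. An alert fires once
// the condition has held for at least forSec seconds without interruption.
func evaluate(results []bool, intervalSec, forSec int) []alertState {
	states := make([]alertState, 0, len(results))
	activeAt := -1 // time in seconds when the condition first became true
	for i, ok := range results {
		now := i * intervalSec
		if !ok {
			activeAt = -1 // any gap resets the pending timer
			states = append(states, inactive)
			continue
		}
		if activeAt < 0 {
			activeAt = now
		}
		if now-activeAt >= forSec {
			states = append(states, firing)
		} else {
			states = append(states, pending)
		}
	}
	return states
}

func main() {
	// Evaluate every 60s with `for: 2m`; the condition drops out once.
	results := []bool{true, true, true, false, true}
	fmt.Println(evaluate(results, 60, 120))
}
```

The reset on the fourth tick is the important part: a single evaluation where the expression returns nothing sends the alert back to inactive, so a flapping signal may never fire with a long `for:` duration.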
Performance Optimization in Practice
Storage Optimization
#!/bin/bash
# Storage optimization script
# 1. Choose a suitable retention policy
#    (passed to Prometheus as --storage.tsdb.retention.time / --storage.tsdb.retention.size)
RETENTION_TIME="30d"
RETENTION_SIZE="50GB"
# 2. Tune scrape intervals
# Adjust scrape frequency to business needs:
#   Infrastructure monitoring: 15s-30s
#   Application monitoring: 10s-15s
#   Business monitoring: 5s-10s
# 3. Precompute expensive expressions with recording rules
# Written under rules/ so the file matches rule_files: "rules/*.yml"
mkdir -p rules
cat > rules/recording_rules.yml << 'EOF'
groups:
  - name: cpu_rules
    interval: 30s
    rules:
      - record: instance:cpu_usage:rate5m
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - record: instance:memory_usage:ratio
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
  - name: http_rules
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_request_duration:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
EOF
# 4. Configure remote storage
# The credentials below are placeholders; in production prefer password_file
# over an inline password
cat >> prometheus.yml << 'EOF'
remote_write:
  - url: "https://prometheus-remote-write.example.com/api/v1/write"
    basic_auth:
      username: "prometheus"
      password: "secure_password"
    queue_config:
      max_samples_per_send: 1000
      max_shards: 200
      capacity: 2500
remote_read:
  - url: "https://prometheus-remote-read.example.com/api/v1/read"
    basic_auth:
      username: "prometheus"
      password: "secure_password"
EOF
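When choosing retention settings like the 30d / 50GB pair above, it helps to estimate how much disk the local TSDB will need. The Prometheus storage documentation gives the rule of thumb `needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample`, with roughly 1 to 2 bytes per sample on average. The sketch below applies that formula; the series count and the 2 bytes per sample figure are illustrative assumptions, and `estimateDiskBytes` is a hypothetical helper.

```go
package main

import "fmt"

// estimateDiskBytes applies the capacity-planning rule of thumb:
// retention (s) * ingested samples per second * bytes per sample.
// Samples per second is approximated as active series / scrape interval.
func estimateDiskBytes(activeSeries, scrapeIntervalSec, retentionSec, bytesPerSample float64) float64 {
	samplesPerSec := activeSeries / scrapeIntervalSec
	return retentionSec * samplesPerSec * bytesPerSample
}

func main() {
	const day = 86400.0
	// Assumed workload: 100,000 active series scraped every 15s,
	// kept for 30 days at a pessimistic 2 bytes per sample.
	need := estimateDiskBytes(100_000, 15, 30*day, 2)
	fmt.Printf("estimated TSDB size: %.2f GB\n", need/1e9)
}
```

For this assumed workload the estimate comes to about 34.6 GB, which fits under the 50GB size limit. With substantially more series, the size-based retention would trigger before the 30-day time limit is reached.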
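Query Optimization
# Before: summing a raw counter (the value only grows and drops on restarts)
sum(http_requests_total) by (instance)
# After: use rate() to get per-second request throughput
sum(rate(http_requests_total[5m])) by (instance)
# Before: aggregating a long range over raw data
avg_over_time(cpu_usage[1d])
# After: aggregate the precomputed recording rule instead
avg_over_time(instance:cpu_usage:rate5m[1d])
# Complex aggregations
# A subquery evaluates the inner expression once per 5m step over the last hour.
# For queries that run often, a recording rule for the inner expression is cheaper.
max_over_time(
  (
    sum(rate(http_requests_total[5m])) by (instance)
  )[1h:5m]
)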
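The first pair of queries above hinges on how counters behave across restarts. The sketch below illustrates the core idea behind `rate()` and `increase()`: sum the deltas between consecutive samples and treat any drop as a counter reset. It is a simplification (the real functions also extrapolate to the window boundaries), and `increase` and `perSecond` are hypothetical helper names.

```go
package main

import "fmt"

// increase sums the growth of a counter across consecutive samples.
// A drop between two samples is treated as a reset: the counter restarted
// from zero, so the new value itself is the growth since the reset.
func increase(samples []float64) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		delta := samples[i] - samples[i-1]
		if delta < 0 {
			delta = samples[i]
		}
		total += delta
	}
	return total
}

// perSecond divides the increase by the window length in seconds.
func perSecond(samples []float64, windowSec float64) float64 {
	return increase(samples) / windowSec
}

func main() {
	// The process restarted between the third and fourth scrape.
	samples := []float64{100, 130, 160, 10, 40}
	fmt.Println("naive last-first:", samples[len(samples)-1]-samples[0])
	fmt.Println("reset-aware increase:", increase(samples))
	fmt.Println("per-second over 60s:", perSecond(samples, 60))
}
```

Subtracting the first raw value from the last gives a negative number here, while the reset-aware sum recovers the real growth. This is why dashboards and alerts should aggregate `rate()` output rather than raw counter values.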
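Monitoring Best Practices
1. Metric Design Principles
// Example metrics for a Go application
package main

import (
	"log"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter: a monotonically increasing value
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)
	// Histogram: distribution of observed values
	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration in seconds",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint"},
	)
	// Gauge: a value that can go up and down
	activeConnections = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_connections",
			Help: "Number of active connections",
		},
	)
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

func instrumentHandler(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		// Run the business logic
		next(rec, r)
		// Record metrics with the real status code instead of a hardcoded "200".
		// Note: a raw URL path as a label value can create unbounded
		// cardinality; prefer a route template where possible.
		duration := time.Since(start).Seconds()
		httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
		httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rec.status)).Inc()
	}
}

func main() {
	http.HandleFunc("/api/users", instrumentHandler(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	// Expose metrics for Prometheus to scrape (the port is illustrative)
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}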
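The handler above labels each request with `r.URL.Path`, which can produce one time series per distinct URL when paths contain IDs. One common mitigation is to normalize the path before using it as a label value. The sketch below is one possible approach under the assumption that variable segments are purely numeric; `routeLabel` is a hypothetical helper, and a router that exposes the matched route template would be a more robust source.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// routeLabel replaces purely numeric path segments with ":id" so that
// /api/users/1 and /api/users/2 map to the same label value.
func routeLabel(path string) string {
	segments := strings.Split(path, "/")
	for i, s := range segments {
		if s == "" {
			continue // leading, trailing, or doubled slash
		}
		if _, err := strconv.ParseUint(s, 10, 64); err == nil {
			segments[i] = ":id"
		}
	}
	return strings.Join(segments, "/")
}

func main() {
	for _, p := range []string{"/api/users", "/api/users/42", "/api/users/42/orders/7"} {
		fmt.Printf("%-26s -> %s\n", p, routeLabel(p))
	}
}
```

With this in place, the `endpoint` label would be set from `routeLabel(r.URL.Path)`, keeping the number of series bounded by the number of routes rather than the number of users.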
2. Alerting Strategy
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@company.com'
  smtp_auth_username: 'alerts@company.com'
  smtp_auth_password: 'app_password'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    # `matchers` replaces the deprecated `match` syntax
    - matchers:
        - severity="critical"
      receiver: 'critical-alerts'
      group_wait: 5s
      repeat_interval: 30m
    - matchers:
        - severity="warning"
      receiver: 'warning-alerts'
      repeat_interval: 4h
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://webhook.example.com/alerts'
        send_resolved: true
  - name: 'critical-alerts'
    email_configs:
      # email_configs has no `subject` or `body` field: the subject is set
      # through `headers` and the plain-text body through `text`
      - to: 'oncall@company.com'
        headers:
          Subject: '[CRITICAL] {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
          {{ end }}
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts-critical'
        title: 'Critical Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
  - name: 'warning-alerts'
    email_configs:
      - to: 'team@company.com'
        headers:
          Subject: '[WARNING] {{ .GroupLabels.alertname }}'
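The default `web.hook` receiver above posts a JSON document to the configured URL. To show what the receiving side has to handle, here is a minimal sketch that decodes a subset of that payload and builds one summary line per alert. The struct covers only a few of the documented fields, and `webhookMessage`, `summarize`, and the sample payload are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// webhookMessage models a subset of the Alertmanager webhook payload.
type webhookMessage struct {
	Status   string `json:"status"`
	Receiver string `json:"receiver"`
	Alerts   []struct {
		Status      string            `json:"status"`
		Labels      map[string]string `json:"labels"`
		Annotations map[string]string `json:"annotations"`
	} `json:"alerts"`
}

// samplePayload is made-up data shaped like a webhook notification.
const samplePayload = `{
  "version": "4",
  "status": "firing",
  "receiver": "web.hook",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "HighErrorRate", "severity": "critical"},
      "annotations": {"summary": "High error rate detected"}
    }
  ]
}`

// summarize decodes the payload and returns one line per alert.
func summarize(body []byte) ([]string, error) {
	var msg webhookMessage
	if err := json.Unmarshal(body, &msg); err != nil {
		return nil, err
	}
	lines := make([]string, 0, len(msg.Alerts))
	for _, a := range msg.Alerts {
		lines = append(lines, fmt.Sprintf("[%s] %s: %s",
			a.Status, a.Labels["alertname"], a.Annotations["summary"]))
	}
	return lines, nil
}

func main() {
	lines, err := summarize([]byte(samplePayload))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

In a real receiver, `summarize` would be called from an HTTP handler with the request body. Because `send_resolved: true` is set, the same endpoint also receives payloads whose alerts have status `resolved`.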
Troubleshooting Cases
Case 1: High Memory Usage Alert
# 1. Check the memory usage trend (percent used)
curl -G 'http://prometheus:9090/api/v1/query_range' \
  --data-urlencode 'query=(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100' \
  --data-urlencode 'start=2023-03-15T10:00:00Z' \
  --data-urlencode 'end=2023-03-15T12:00:00Z' \
  --data-urlencode 'step=60s'
# 2. Break down memory usage (PromQL): memory held by applications, excluding buffers and cache
node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes
# 3. Find the processes using the most memory (PromQL)
topk(10, process_resident_memory_bytes)
Case 2: Abnormal API Response Time
# 1. Check the response time distribution
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# 2. Break down by endpoint
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (endpoint, le)
)
# 3. Check the error rate (percent) per endpoint
sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint) /
sum(rate(http_requests_total[5m])) by (endpoint) * 100
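The `curl` call in step 1 can also be issued from code, where getting the URL encoding of the PromQL expression right matters. The sketch below builds an equivalent `query_range` URL with the standard library. It passes Unix timestamps, which the HTTP API accepts alongside RFC 3339; `queryRangeURL` is a hypothetical helper and the base URL is taken from the example above.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
	"time"
)

// queryRangeURL builds a /api/v1/query_range URL. url.Values takes care
// of escaping characters such as spaces, slashes, and parentheses.
func queryRangeURL(base, query string, startUnix, endUnix int64, stepSec int) string {
	v := url.Values{}
	v.Set("query", query)
	v.Set("start", strconv.FormatInt(startUnix, 10))
	v.Set("end", strconv.FormatInt(endUnix, 10))
	v.Set("step", strconv.Itoa(stepSec))
	return strings.TrimRight(base, "/") + "/api/v1/query_range?" + v.Encode()
}

func main() {
	start := time.Date(2023, time.March, 15, 10, 0, 0, 0, time.UTC)
	end := start.Add(2 * time.Hour)
	query := "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"
	fmt.Println(queryRangeURL("http://prometheus:9090", query, start.Unix(), end.Unix(), 60))
}
```

A two-hour window at a 60-second step returns about 120 points per series, which is a reasonable size for a trend chart; much longer windows call for a larger step.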
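The per-endpoint error rate query in step 3 returns an instant vector, and during an incident it is handy to rank the offenders automatically. The sketch below decodes an instant-query response from `/api/v1/query` and lists endpoints above a threshold, worst first. In that response format each sample value arrives as a `[timestamp, "string"]` pair. The `endpointsAbove` helper and the sample response are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strconv"
)

// queryResponse models the parts of an instant-query response used here.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]interface{}    `json:"value"` // [unix timestamp, "value"]
		} `json:"result"`
	} `json:"data"`
}

// sampleResponse is made-up data shaped like an API response.
const sampleResponse = `{"status":"success","data":{"resultType":"vector","result":[
  {"metric":{"endpoint":"/api/users"},"value":[1678874400,"1.2"]},
  {"metric":{"endpoint":"/api/orders"},"value":[1678874400,"7.5"]},
  {"metric":{"endpoint":"/api/login"},"value":[1678874400,"12"]}]}}`

// endpointsAbove returns endpoints whose value exceeds threshold, highest first.
func endpointsAbove(body []byte, threshold float64) ([]string, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" || resp.Data.ResultType != "vector" {
		return nil, fmt.Errorf("unexpected response: status=%q type=%q", resp.Status, resp.Data.ResultType)
	}
	type entry struct {
		name  string
		value float64
	}
	var hits []entry
	for _, r := range resp.Data.Result {
		s, ok := r.Value[1].(string)
		if !ok {
			return nil, fmt.Errorf("sample value is not a string")
		}
		v, err := strconv.ParseFloat(s, 64)
		if err != nil {
			return nil, err
		}
		if v > threshold {
			hits = append(hits, entry{r.Metric["endpoint"], v})
		}
	}
	sort.Slice(hits, func(i, j int) bool { return hits[i].value > hits[j].value })
	names := make([]string, 0, len(hits))
	for _, h := range hits {
		names = append(names, h.name)
	}
	return names, nil
}

func main() {
	names, err := endpointsAbove([]byte(sampleResponse), 5)
	fmt.Println(names, err)
}
```

The 5 percent threshold mirrors the `HighErrorRate` alert rule, so the output answers the follow-up question that alert leaves open: which endpoints are responsible.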
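Summary
A successful Prometheus deployment depends on the following key elements:
- Architecture: plan component placement carefully and ensure high availability
- Performance: tune storage, queries, and scrape strategy
- Alerting: design sensible alert rules and notification routing
- Operations: establish solid troubleshooting and performance tuning processes
With the practices in this guide, you can build a stable, efficient Prometheus monitoring system that provides reliable coverage for your business services.
This article is based on Prometheus 2.45.0 and covers best practices for production environments. Questions and suggestions are welcome in the comments.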