Prometheus – Open Source Monitoring and Alerting Toolkit
What is Prometheus?
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, Prometheus has become a cornerstone of cloud-native monitoring; it joined the Cloud Native Computing Foundation (CNCF) in 2016 as the foundation's second hosted project, after Kubernetes. Its pull-based architecture, powerful query language, and first-class Kubernetes integration have made it a de facto standard for monitoring containerized environments.
Unlike traditional push-based monitoring systems, Prometheus scrapes metrics from configured targets at regular intervals and stores the resulting time series locally with efficient compression. Pulling over HTTP keeps instrumented services simple, makes unreachable targets immediately visible, pairs naturally with service discovery, and enables sophisticated querying and alerting over dimensional data.
Prometheus excels at recording numeric time series, making it ideal for machine-centric monitoring and highly dynamic service-oriented architectures. Combined with Grafana for visualization and Alertmanager for alert handling, Prometheus forms a complete observability solution for modern infrastructure.
Key Features and Capabilities
Multi-Dimensional Data Model
Prometheus stores all data as time series identified by metric names and key-value pairs (labels). This dimensional data model enables flexible querying, allowing users to slice and dice data across multiple dimensions like instance, job, environment, or any custom label.
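For example, each of the following is a distinct time series: they share the metric name http_requests_total but differ in their label sets (the label names and values here are purely illustrative):
http_requests_total{job="api", instance="10.0.0.5:8080", method="GET", status="200"}
http_requests_total{job="api", instance="10.0.0.5:8080", method="POST", status="500"}
http_requests_total{job="api", instance="10.0.0.6:8080", method="GET", status="200"}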
PromQL Query Language
PromQL (Prometheus Query Language) provides powerful expressions for selecting, aggregating, and manipulating time-series data. Functions for rate calculations, histograms, aggregations, and predictions enable complex analysis directly within Prometheus.
Pull-Based Architecture
Prometheus actively scrapes targets over HTTP, simplifying architecture and enabling easy service discovery. Targets expose metrics on designated endpoints, and Prometheus handles the collection schedule and storage.
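What a scrape returns is plain text in the Prometheus exposition format; a minimal, illustrative /metrics response might look like this:
# HELP http_requests_total Total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1543
http_requests_total{method="POST",status="500"} 7
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 28262400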
Service Discovery
Automatic service discovery integrates with Kubernetes, Consul, EC2, and other platforms. Static configurations support traditional environments, while file-based discovery enables custom integration patterns.
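As a sketch of the file-based pattern, Prometheus watches a set of JSON or YAML files and picks up target changes without a restart; the file paths, targets, and labels below are placeholders:
# prometheus.yml (excerpt)
scrape_configs:
  - job_name: 'file-discovery'
    file_sd_configs:
      - files:
          - 'targets/*.json'
        refresh_interval: 5m
# targets/web.json (maintained by your own tooling)
[
  {
    "targets": ["10.0.0.11:9100", "10.0.0.12:9100"],
    "labels": { "env": "production", "team": "web" }
  }
]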
Alerting
Prometheus evaluates alerting rules against collected metrics, pushing alerts to Alertmanager for deduplication, grouping, and routing to notification channels. This separation of concerns enables sophisticated alert management.
Federation
Hierarchical federation enables aggregating metrics from multiple Prometheus servers. This pattern supports large-scale deployments, organizational boundaries, and global views of distributed systems.
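As a minimal sketch, a global Prometheus server scrapes the /federate endpoint of lower-level servers and selects which series to pull via match[] parameters; the selector and hostnames below are placeholders:
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'  # pull only pre-aggregated recording rules
    static_configs:
      - targets:
          - 'prometheus-us-east:9090'
          - 'prometheus-eu-west:9090'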
System Requirements
Hardware Requirements
Prometheus is designed for efficiency. A single server can handle millions of time series with reasonable hardware. Typical deployments need 4-8 GB RAM for moderate workloads. SSD storage is strongly recommended due to the high I/O requirements of time-series workloads.
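Disk consumption is driven mainly by the number of active series, the scrape interval, and the retention window; as a rough sketch, retention can be capped by time or size with the standard storage flags (the values below are examples, not recommendations):
./prometheus --config.file=prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB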
Supported Platforms
Prometheus runs on Linux, macOS, and Windows. Pre-built binaries are available for multiple architectures. Docker images and Kubernetes operators simplify deployment in containerized environments.
Installation Guide
Installing on Linux
# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
# Extract
tar xvfz prometheus-2.48.0.linux-amd64.tar.gz
cd prometheus-2.48.0.linux-amd64
# Run Prometheus
./prometheus --config.file=prometheus.yml
# Install the binary, create a service user, and prepare directories
sudo cp prometheus promtool /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo cp prometheus.yml /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
# Create systemd service
sudo tee /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus \
--web.enable-lifecycle
[Install]
WantedBy=multi-user.target
EOF
# Start service
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
# Access at http://localhost:9090
Installing with Docker
# Run Prometheus container
docker run -d \
--name prometheus \
-p 9090:9090 \
-v prometheus-data:/prometheus \
-v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
prom/prometheus
# Docker Compose with complete stack
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'

volumes:
  prometheus-data:
Installing on Kubernetes
# Using Helm - Prometheus Stack (includes Grafana)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace
# Access Prometheus
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
# Using Prometheus Operator
# Note: use kubectl create (or kubectl apply --server-side); a plain kubectl apply can fail because the bundled CRDs exceed the annotation size limit
kubectl create -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml
Configuration
Basic prometheus.yml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Rule files
rule_files:
  - "rules/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'
          - 'server2:9100'
          - 'server3:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):\d+'
        replacement: '${1}'

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

  # Application metrics with basic auth
  - job_name: 'application'
    basic_auth:
      username: prometheus
      password: secret
    static_configs:
      - targets: ['app:8080']
    metrics_path: /metrics
PromQL Query Examples
Basic Queries
# Instant vector selector
http_requests_total
# With label matching
http_requests_total{method="GET", status="200"}
# Regex matching
http_requests_total{method=~"GET|POST"}
# Negative matching
http_requests_total{status!="200"}
# Time range (range vector)
http_requests_total[5m]
# Offset modifier
http_requests_total offset 1h
Aggregation and Functions
# Rate of requests per second over 5 minutes
rate(http_requests_total[5m])
# Increase over time period
increase(http_requests_total[1h])
# Sum across all instances
sum(rate(http_requests_total[5m]))
# Sum by label
sum by (method) (rate(http_requests_total[5m]))
# Average
avg(node_cpu_seconds_total)
# Maximum
max(node_memory_MemTotal_bytes)
# Count
count(up == 1)
# Quantile (percentile)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Top 5 by value
topk(5, rate(http_requests_total[5m]))
Common Monitoring Queries
# CPU usage percentage
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk usage percentage
100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100)
# HTTP error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# Request latency (95th percentile)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Availability (uptime)
avg_over_time(up[24h]) * 100
# Network I/O rate
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
# Container memory usage (Kubernetes)
sum by (pod) (container_memory_usage_bytes{container!=""})
# Pod CPU usage (Kubernetes)
sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
Alerting Rules
Alert Configuration
# rules/alerts.yml
groups:
  - name: infrastructure
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.2f\" }}%"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.2f\" }}%"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | printf \"%.2f\" }}% disk space remaining"

      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute"

  - name: application
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High HTTP error rate"
          description: "Error rate is {{ $value | humanizePercentage }} of requests"

      - alert: SlowResponseTime
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow response times detected"
          description: "95th percentile latency is {{ $value | printf \"%.2f\" }}s"
Alertmanager Configuration
alertmanager.yml
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'
  slack_api_url: 'https://hooks.slack.com/services/xxx'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
Recording Rules
Precomputed Queries
# rules/recording.yml
groups:
  - name: recording_rules
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: instance:node_cpu:ratio
        expr: 1 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]))

      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
Exporters
Common Exporters
# Node Exporter (system metrics)
./node_exporter --web.listen-address=":9100"
# Blackbox Exporter (probing)
./blackbox_exporter --config.file=blackbox.yml
# MySQL Exporter
./mysqld_exporter --config.my-cnf=".my.cnf"
# PostgreSQL Exporter (connection details are supplied via the DATA_SOURCE_NAME environment variable)
DATA_SOURCE_NAME="postgresql://user:password@localhost:5432/postgres?sslmode=disable" ./postgres_exporter
# Redis Exporter
./redis_exporter --redis.addr=redis://localhost:6379
# Nginx Exporter
./nginx-prometheus-exporter -nginx.scrape-uri=http://localhost/stub_status
Best Practices
Naming Conventions
Metric naming:
- Use snake_case
- Include the base unit as a suffix (_seconds, _bytes)
- Use the _total suffix for counters
- Be specific but concise
Example metrics:
- http_requests_total
- http_request_duration_seconds
- node_memory_MemAvailable_bytes
- process_cpu_seconds_total
Conclusion
Prometheus provides a powerful, scalable foundation for monitoring modern infrastructure and applications. Its pull-based architecture, dimensional data model, and sophisticated query language make it ideal for dynamic, containerized environments.
Combined with Alertmanager for notifications and Grafana for visualization, Prometheus delivers a complete observability solution for cloud-native architectures.