Prometheus – Open Source Monitoring and Alerting Toolkit
What is Prometheus?
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, Prometheus has become a cornerstone of cloud-native monitoring; it joined the Cloud Native Computing Foundation (CNCF) in 2016 as the foundation's second hosted project, after Kubernetes. Its pull-based architecture, powerful query language, and first-class Kubernetes integration have made it a de facto standard for monitoring containerized environments.
Unlike traditional push-based monitoring systems, Prometheus scrapes metrics from configured targets at regular intervals and stores the resulting time series locally with efficient compression. Pulling over HTTP keeps instrumented services simple, makes unreachable targets immediately visible, pairs naturally with service discovery, and enables sophisticated querying and alerting over dimensional data.
Prometheus excels at recording numeric time series, making it ideal for machine-centric monitoring and highly dynamic service-oriented architectures. Combined with Grafana for visualization and Alertmanager for alert handling, Prometheus forms a complete observability solution for modern infrastructure.
Key Features and Capabilities
Multi-Dimensional Data Model
Prometheus stores all data as time series identified by metric names and key-value pairs (labels). This dimensional data model enables flexible querying, allowing users to slice and dice data across multiple dimensions like instance, job, environment, or any custom label.
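For example, each of the following is a distinct time series: they share the metric name http_requests_total but differ in their label sets (the label names and values here are purely illustrative):
http_requests_total{job="api", instance="10.0.0.5:8080", method="GET", status="200"}
http_requests_total{job="api", instance="10.0.0.5:8080", method="POST", status="500"}
http_requests_total{job="api", instance="10.0.0.6:8080", method="GET", status="200"}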
PromQL Query Language
PromQL (Prometheus Query Language) provides powerful expressions for selecting, aggregating, and manipulating time-series data. Functions for rate calculations, histograms, aggregations, and predictions enable complex analysis directly within Prometheus.
Pull-Based Architecture
Prometheus actively scrapes targets over HTTP, simplifying architecture and enabling easy service discovery. Targets expose metrics on designated endpoints, and Prometheus handles the collection schedule and storage.
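What a scrape returns is plain text in the Prometheus exposition format; a minimal, illustrative /metrics response might look like this:
# HELP http_requests_total Total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1543
http_requests_total{method="POST",status="500"} 7
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 28262400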
Service Discovery
Automatic service discovery integrates with Kubernetes, Consul, EC2, and other platforms. Static configurations support traditional environments, while file-based discovery enables custom integration patterns.
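As a sketch of the file-based pattern, Prometheus watches a set of JSON or YAML files and picks up target changes without a restart; the file paths, targets, and labels below are placeholders:
# prometheus.yml (excerpt)
scrape_configs:
  - job_name: 'file-discovery'
    file_sd_configs:
      - files:
          - 'targets/*.json'
        refresh_interval: 5m
# targets/web.json (maintained by your own tooling)
[
  {
    "targets": ["10.0.0.11:9100", "10.0.0.12:9100"],
    "labels": { "env": "production", "team": "web" }
  }
]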
Alerting
Prometheus evaluates alerting rules against collected metrics, pushing alerts to Alertmanager for deduplication, grouping, and routing to notification channels. This separation of concerns enables sophisticated alert management.
Federation
Hierarchical federation enables aggregating metrics from multiple Prometheus servers. This pattern supports large-scale deployments, organizational boundaries, and global views of distributed systems.
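As a minimal sketch, a global Prometheus server scrapes the /federate endpoint of lower-level servers and selects which series to pull via match[] parameters; the selector and hostnames below are placeholders:
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'  # pull only pre-aggregated recording rules
    static_configs:
      - targets:
          - 'prometheus-us-east:9090'
          - 'prometheus-eu-west:9090'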
System Requirements
Hardware Requirements
Prometheus is designed for efficiency. A single server can handle millions of time series with reasonable hardware. Typical deployments need 4-8 GB RAM for moderate workloads. SSD storage is strongly recommended due to the high I/O requirements of time-series workloads.
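Disk consumption is driven mainly by the number of active series, the scrape interval, and the retention window; as a rough sketch, retention can be capped by time or size with the standard storage flags (the values below are examples, not recommendations):
./prometheus --config.file=prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB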
Supported Platforms
Prometheus runs on Linux, macOS, and Windows. Pre-built binaries are available for multiple architectures. Docker images and Kubernetes operators simplify deployment in containerized environments.
Installation Guide
Installing on Linux
# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
# Extract
tar xvfz prometheus-2.48.0.linux-amd64.tar.gz
cd prometheus-2.48.0.linux-amd64
# Run Prometheus
./prometheus --config.file=prometheus.yml
# Install the binary, create a service user, and prepare directories
sudo cp prometheus promtool /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo cp prometheus.yml /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
# Create systemd service
sudo tee /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus \
--web.enable-lifecycle
[Install]
WantedBy=multi-user.target
EOF
# Start service
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
# Access at http://localhost:9090
Installing with Docker
# Run Prometheus container
docker run -d \
--name prometheus \
-p 9090:9090 \
-v prometheus-data:/prometheus \
-v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
prom/prometheus
# Docker Compose with complete stack
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'

volumes:
  prometheus-data:
Installing on Kubernetes
# Using Helm - Prometheus Stack (includes Grafana)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace
# Access Prometheus
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
# Using Prometheus Operator
# Note: use kubectl create (or kubectl apply --server-side); a plain kubectl apply can fail because the bundled CRDs exceed the annotation size limit
kubectl create -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml
Configuration
Basic prometheus.yml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Rule files
rule_files:
  - "rules/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'
          - 'server2:9100'
          - 'server3:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):\d+'
        replacement: '${1}'

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

  # Application metrics with basic auth
  - job_name: 'application'
    basic_auth:
      username: prometheus
      password: secret
    static_configs:
      - targets: ['app:8080']
    metrics_path: /metrics
PromQL Query Examples
Basic Queries
# Instant vector selector
http_requests_total
# With label matching
http_requests_total{method="GET", status="200"}
# Regex matching
http_requests_total{method=~"GET|POST"}
# Negative matching
http_requests_total{status!="200"}
# Time range (range vector)
http_requests_total[5m]
# Offset modifier
http_requests_total offset 1h
Aggregation and Functions
# Rate of requests per second over 5 minutes
rate(http_requests_total[5m])
# Increase over time period
increase(http_requests_total[1h])
# Sum across all instances
sum(rate(http_requests_total[5m]))
# Sum by label
sum by (method) (rate(http_requests_total[5m]))
# Average
avg(node_cpu_seconds_total)
# Maximum
max(node_memory_MemTotal_bytes)
# Count
count(up == 1)
# Quantile (percentile)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Top 5 by value
topk(5, rate(http_requests_total[5m]))
Common Monitoring Queries
# CPU usage percentage
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk usage percentage
100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100)
# HTTP error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# Request latency (95th percentile)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Availability (uptime)
avg_over_time(up[24h]) * 100
# Network I/O rate
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
# Container memory usage (Kubernetes)
sum by (pod) (container_memory_usage_bytes{container!=""})
# Pod CPU usage (Kubernetes)
sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
Alerting Rules
Alert Configuration
# rules/alerts.yml
groups:
  - name: infrastructure
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.2f\" }}%"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.2f\" }}%"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | printf \"%.2f\" }}% disk space remaining"

      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute"

  - name: application
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High HTTP error rate"
          description: "Error rate is {{ $value | humanizePercentage }} of requests"

      - alert: SlowResponseTime
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow response times detected"
          description: "95th percentile latency is {{ $value | printf \"%.2f\" }}s"
Alertmanager Configuration
alertmanager.yml
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'
  slack_api_url: 'https://hooks.slack.com/services/xxx'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
Recording Rules
Precomputed Queries
# rules/recording.yml
groups:
  - name: recording_rules
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: instance:node_cpu:ratio
        expr: 1 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]))

      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
Exporters
Common Exporters
# Node Exporter (system metrics)
./node_exporter --web.listen-address=":9100"
# Blackbox Exporter (probing)
./blackbox_exporter --config.file=blackbox.yml
# MySQL Exporter
./mysqld_exporter --config.my-cnf=".my.cnf"
# PostgreSQL Exporter (connection details are supplied via the DATA_SOURCE_NAME environment variable)
DATA_SOURCE_NAME="postgresql://user:password@localhost:5432/postgres?sslmode=disable" ./postgres_exporter
# Redis Exporter
./redis_exporter --redis.addr=redis://localhost:6379
# Nginx Exporter
./nginx-prometheus-exporter -nginx.scrape-uri=http://localhost/stub_status
Best Practices
Naming Conventions
Metric naming:
- Use snake_case
- Include the base unit as a suffix (_seconds, _bytes)
- Use the _total suffix for counters
- Be specific but concise
Example metrics:
- http_requests_total
- http_request_duration_seconds
- node_memory_MemAvailable_bytes
- process_cpu_seconds_total
Conclusion
Prometheus provides a powerful, scalable foundation for monitoring modern infrastructure and applications. Its pull-based architecture, dimensional data model, and sophisticated query language make it ideal for dynamic, containerized environments.
Combined with Alertmanager for notifications and Grafana for visualization, Prometheus delivers a complete observability solution for cloud-native architectures.