CHORUS Metrics Package

Overview

The pkg/metrics package provides comprehensive Prometheus-based metrics collection for the CHORUS distributed system. It exposes detailed operational metrics across all system components, including P2P networking, DHT operations, PubSub messaging, elections, task management, and resource utilization.

Architecture

Core Components

  • CHORUSMetrics: Central metrics collector managing all Prometheus metrics
  • Prometheus Registry: Custom registry for metric collection
  • HTTP Server: Exposes metrics endpoint for scraping
  • Background Collectors: Periodic system and resource metric collection

Metric Types

The package uses three Prometheus metric types:

  1. Counter: Monotonically increasing values (e.g., total messages sent)
  2. Gauge: Values that can go up or down (e.g., connected peers)
  3. Histogram: Distribution of values with configurable buckets (e.g., latency measurements)
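
These map directly onto the upstream client library. As an illustrative sketch using prometheus/client_golang directly (the CHORUSMetrics helper methods described below hide this plumbing):

import "github.com/prometheus/client_golang/prometheus"

// Counter: monotonically increasing.
sent := prometheus.NewCounter(prometheus.CounterOpts{
    Name: "chorus_p2p_messages_sent_total",
    Help: "Total number of P2P messages sent",
})
sent.Inc()

// Gauge: can rise and fall.
peers := prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "chorus_p2p_connected_peers",
    Help: "Number of connected P2P peers",
})
peers.Set(5)

// Histogram: observations bucketed by configurable boundaries.
latency := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "chorus_p2p_message_latency_seconds",
    Help:    "P2P message round-trip latency",
    Buckets: []float64{0.001, 0.01, 0.1, 1, 10},
})
latency.Observe(0.042)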

Configuration

MetricsConfig

type MetricsConfig struct {
    // HTTP server configuration
    ListenAddr  string        // Default: ":9090"
    MetricsPath string        // Default: "/metrics"

    // Histogram buckets
    LatencyBuckets []float64  // Default: 0.001s to 10s
    SizeBuckets    []float64  // Default: 64B to 16MB

    // Node identification labels
    NodeID      string        // Unique node identifier
    Version     string        // CHORUS version
    Environment string        // Deployment environment (dev/staging/prod)
    Cluster     string        // Cluster identifier

    // Collection intervals
    SystemMetricsInterval   time.Duration  // Default: 30s
    ResourceMetricsInterval time.Duration  // Default: 15s
}

Default Configuration

config := metrics.DefaultMetricsConfig()
// Returns:
// - ListenAddr: ":9090"
// - MetricsPath: "/metrics"
// - LatencyBuckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
// - SizeBuckets: [64, 256, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216]
// - SystemMetricsInterval: 30s
// - ResourceMetricsInterval: 15s
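
These defaults can be overridden before the collector is constructed. As a sketch using only the configuration fields shown above, a latency-sensitive deployment might tighten the histogram buckets and sample resources more often:

config := metrics.DefaultMetricsConfig()
config.LatencyBuckets = []float64{0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0}
config.SizeBuckets = []float64{64, 1024, 65536, 1048576, 16777216}
config.ResourceMetricsInterval = 5 * time.Second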

Metrics Catalog

System Metrics

chorus_system_info

Type: Gauge
Description: System information with version labels
Labels: node_id, version, go_version, cluster, environment
Value: Always 1 when present

chorus_uptime_seconds

Type: Gauge
Description: System uptime in seconds since start
Value: Current uptime in seconds

P2P Network Metrics

chorus_p2p_connected_peers

Type: Gauge
Description: Number of currently connected P2P peers
Value: Current peer count

chorus_p2p_messages_sent_total

Type: Counter
Description: Total number of P2P messages sent
Labels: message_type, peer_id
Usage: Track outbound message volume per type and destination

chorus_p2p_messages_received_total

Type: Counter
Description: Total number of P2P messages received
Labels: message_type, peer_id
Usage: Track inbound message volume per type and source

chorus_p2p_message_latency_seconds

Type: Histogram
Description: P2P message round-trip latency distribution
Labels: message_type
Buckets: Configurable latency buckets (default: 1ms to 10s)

chorus_p2p_connection_duration_seconds

Type: Histogram
Description: Duration of P2P connections
Labels: peer_id
Usage: Track connection stability

chorus_p2p_peer_score

Type: Gauge
Description: Peer quality score
Labels: peer_id
Value: Score between 0.0 (poor) and 1.0 (excellent)

DHT (Distributed Hash Table) Metrics

chorus_dht_put_operations_total

Type: Counter
Description: Total number of DHT put operations
Labels: status (success/failure)
Usage: Track DHT write operations

chorus_dht_get_operations_total

Type: Counter
Description: Total number of DHT get operations
Labels: status (success/failure)
Usage: Track DHT read operations

chorus_dht_operation_latency_seconds

Type: Histogram
Description: DHT operation latency distribution
Labels: operation (put/get), status (success/failure)
Usage: Monitor DHT performance

chorus_dht_provider_records

Type: Gauge
Description: Number of provider records stored in DHT
Value: Current provider record count

chorus_dht_content_keys

Type: Gauge
Description: Number of content keys stored in DHT
Value: Current content key count

chorus_dht_replication_factor

Type: Gauge
Description: Replication factor for DHT keys
Labels: key_hash
Value: Number of replicas for specific keys

chorus_dht_cache_hits_total

Type: Counter
Description: DHT cache hit count
Labels: cache_type
Usage: Monitor DHT caching effectiveness

chorus_dht_cache_misses_total

Type: Counter
Description: DHT cache miss count
Labels: cache_type
Usage: Monitor DHT caching effectiveness

PubSub Messaging Metrics

chorus_pubsub_topics

Type: Gauge
Description: Number of active PubSub topics
Value: Current topic count

chorus_pubsub_subscribers

Type: Gauge
Description: Number of subscribers per topic
Labels: topic
Value: Subscriber count for each topic

chorus_pubsub_messages_total

Type: Counter
Description: Total PubSub messages
Labels: topic, direction (sent/received), message_type
Usage: Track message volume per topic

chorus_pubsub_message_latency_seconds

Type: Histogram
Description: PubSub message delivery latency
Labels: topic
Usage: Monitor message propagation performance

chorus_pubsub_message_size_bytes

Type: Histogram
Description: PubSub message size distribution
Labels: topic
Buckets: Configurable size buckets (default: 64B to 16MB)

Election System Metrics

chorus_election_term

Type: Gauge
Description: Current election term number
Value: Monotonically increasing term number

chorus_election_state

Type: Gauge
Description: Current election state (1 for the active state, 0 for others)
Labels: state (idle/discovering/electing/reconstructing/complete)
Usage: Only one state should have value 1 at any time

chorus_heartbeats_sent_total

Type: Counter
Description: Total number of heartbeats sent by this node
Usage: Monitor leader heartbeat activity

chorus_heartbeats_received_total

Type: Counter
Description: Total number of heartbeats received from the leader
Usage: Monitor follower connectivity to the leader

chorus_leadership_changes_total

Type: Counter
Description: Total number of leadership changes
Usage: Monitor election stability (lower is better)

chorus_leader_uptime_seconds

Type: Gauge
Description: Current leader's tenure duration
Value: Seconds since the current leader was elected

chorus_election_latency_seconds

Type: Histogram
Description: Time taken to complete the election process
Usage: Monitor election efficiency

Health Monitoring Metrics

chorus_health_checks_passed_total

Type: Counter
Description: Total number of health checks passed
Labels: check_name
Usage: Track health check success rate

chorus_health_checks_failed_total

Type: Counter
Description: Total number of health checks failed
Labels: check_name, reason
Usage: Track health check failures and reasons

chorus_health_check_duration_seconds

Type: Histogram
Description: Health check execution duration
Labels: check_name
Usage: Monitor health check performance

chorus_system_health_score

Type: Gauge
Description: Overall system health score
Value: 0.0 (unhealthy) to 1.0 (healthy)
Usage: Monitor overall system health

chorus_component_health_score

Type: Gauge
Description: Component-specific health score
Labels: component
Value: 0.0 (unhealthy) to 1.0 (healthy)
Usage: Track individual component health

Task Management Metrics

chorus_tasks_active

Type: Gauge
Description: Number of currently active tasks
Value: Current active task count

chorus_tasks_queued

Type: Gauge
Description: Number of queued tasks waiting for execution
Value: Current queue depth

chorus_tasks_completed_total

Type: Counter
Description: Total number of completed tasks
Labels: status (success/failure), task_type
Usage: Track task completion and success rate

chorus_task_duration_seconds

Type: Histogram
Description: Task execution duration distribution
Labels: task_type, status
Usage: Monitor task performance

chorus_task_queue_wait_time_seconds

Type: Histogram
Description: Time tasks spend in the queue before execution
Usage: Monitor task scheduling efficiency

SLURP (Context Generation) Metrics

chorus_slurp_contexts_generated_total

Type: Counter
Description: Total number of SLURP contexts generated
Labels: role, status (success/failure)
Usage: Track context generation volume

chorus_slurp_generation_time_seconds

Type: Histogram
Description: Time taken to generate SLURP contexts
Buckets: [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0]
Usage: Monitor context generation performance

chorus_slurp_queue_length

Type: Gauge
Description: Length of the SLURP generation queue
Value: Current queue depth

chorus_slurp_active_jobs

Type: Gauge
Description: Number of active SLURP generation jobs
Value: Currently running generation jobs

chorus_slurp_leadership_events_total

Type: Counter
Description: SLURP-related leadership events
Usage: Track leader-initiated context generation

SHHH (Secret Sentinel) Metrics

chorus_shhh_findings_total

Type: Counter
Description: Total number of SHHH redaction findings
Labels: rule, severity (low/medium/high/critical)
Usage: Monitor secret detection effectiveness

UCXI (Protocol Resolution) Metrics

chorus_ucxi_requests_total

Type: Counter
Description: Total number of UCXI protocol requests
Labels: method, status (success/failure)
Usage: Track UCXI usage and success rate

chorus_ucxi_resolution_latency_seconds

Type: Histogram
Description: UCXI address resolution latency
Usage: Monitor resolution performance

chorus_ucxi_cache_hits_total

Type: Counter
Description: UCXI cache hit count
Usage: Monitor caching effectiveness

chorus_ucxi_cache_misses_total

Type: Counter
Description: UCXI cache miss count
Usage: Monitor caching effectiveness

chorus_ucxi_content_size_bytes

Type: Histogram
Description: Size of resolved UCXI content
Usage: Monitor content distribution

Resource Utilization Metrics

chorus_cpu_usage_ratio

Type: Gauge
Description: CPU usage ratio
Value: 0.0 (idle) to 1.0 (fully utilized)

chorus_memory_usage_bytes

Type: Gauge
Description: Memory usage in bytes
Value: Current memory consumption

chorus_disk_usage_ratio

Type: Gauge
Description: Disk usage ratio
Labels: mount_point
Value: 0.0 (empty) to 1.0 (full)

chorus_network_bytes_in_total

Type: Counter
Description: Total bytes received from the network
Usage: Track inbound network traffic

chorus_network_bytes_out_total

Type: Counter
Description: Total bytes sent to the network
Usage: Track outbound network traffic

chorus_goroutines

Type: Gauge
Description: Number of active goroutines
Value: Current goroutine count

Error Metrics

chorus_errors_total

Type: Counter
Description: Total number of errors
Labels: component, error_type
Usage: Track error frequency by component and type

chorus_panics_total

Type: Counter
Description: Total number of panics recovered
Usage: Monitor system stability

Usage Examples

Basic Initialization

import "chorus/pkg/metrics"

// Create metrics collector with default config
config := metrics.DefaultMetricsConfig()
config.NodeID = "chorus-node-01"
config.Version = "v1.0.0"
config.Environment = "production"
config.Cluster = "cluster-01"

metricsCollector := metrics.NewCHORUSMetrics(config)

// Start metrics HTTP server
if err := metricsCollector.StartServer(config); err != nil {
    log.Fatalf("Failed to start metrics server: %v", err)
}

// Start background metric collection
metricsCollector.CollectMetrics(config)

Recording P2P Metrics

// Update peer count
metricsCollector.SetConnectedPeers(5)

// Record message sent
metricsCollector.IncrementMessagesSent("task_assignment", "peer-abc123")

// Record message received
metricsCollector.IncrementMessagesReceived("task_result", "peer-def456")

// Record message latency
startTime := time.Now()
// ... send message and wait for response ...
latency := time.Since(startTime)
metricsCollector.ObserveMessageLatency("task_assignment", latency)

Recording DHT Metrics

// Record DHT put operation
startTime := time.Now()
err := dht.Put(key, value)
latency := time.Since(startTime)

if err != nil {
    metricsCollector.IncrementDHTPutOperations("failure")
    metricsCollector.ObserveDHTOperationLatency("put", "failure", latency)
} else {
    metricsCollector.IncrementDHTPutOperations("success")
    metricsCollector.ObserveDHTOperationLatency("put", "success", latency)
}

// Update DHT statistics
metricsCollector.SetDHTProviderRecords(150)
metricsCollector.SetDHTContentKeys(450)
metricsCollector.SetDHTReplicationFactor("key-hash-123", 3.0)

Recording PubSub Metrics

// Update topic count
metricsCollector.SetPubSubTopics(10)

// Record message published
metricsCollector.IncrementPubSubMessages("CHORUS/tasks/v1", "sent", "task_created")

// Record message received
metricsCollector.IncrementPubSubMessages("CHORUS/tasks/v1", "received", "task_completed")

// Record message latency
startTime := time.Now()
// ... publish message and wait for delivery confirmation ...
latency := time.Since(startTime)
metricsCollector.ObservePubSubMessageLatency("CHORUS/tasks/v1", latency)

Recording Election Metrics

// Update election state
metricsCollector.SetElectionTerm(42)
metricsCollector.SetElectionState("idle")

// Record heartbeat sent (leader)
metricsCollector.IncrementHeartbeatsSent()

// Record heartbeat received (follower)
metricsCollector.IncrementHeartbeatsReceived()

// Record leadership change
metricsCollector.IncrementLeadershipChanges()

Recording Health Metrics

// Record health check success
metricsCollector.IncrementHealthCheckPassed("database-connectivity")

// Record health check failure
metricsCollector.IncrementHealthCheckFailed("p2p-connectivity", "no_peers")

// Update health scores
metricsCollector.SetSystemHealthScore(0.95)
metricsCollector.SetComponentHealthScore("dht", 0.98)
metricsCollector.SetComponentHealthScore("pubsub", 0.92)

Recording Task Metrics

// Update task counts
metricsCollector.SetActiveTasks(5)
metricsCollector.SetQueuedTasks(12)

// Record task completion
startTime := time.Now()
// ... execute task ...
duration := time.Since(startTime)

metricsCollector.IncrementTasksCompleted("success", "data_processing")
metricsCollector.ObserveTaskDuration("data_processing", "success", duration)

Recording SLURP Metrics

// Record context generation
startTime := time.Now()
// ... generate SLURP context ...
duration := time.Since(startTime)

metricsCollector.IncrementSLURPGenerated("admin", "success")
metricsCollector.ObserveSLURPGenerationTime(duration)

// Update queue length
metricsCollector.SetSLURPQueueLength(3)

Recording SHHH Metrics

// Record secret findings
findings := scanForSecrets(content)
for _, finding := range findings {
    metricsCollector.IncrementSHHHFindings(finding.Rule, finding.Severity, 1)
}

Recording Resource Metrics

import "runtime"

// Get runtime stats
var memStats runtime.MemStats
runtime.ReadMemStats(&memStats)

metricsCollector.SetMemoryUsage(float64(memStats.Alloc))
metricsCollector.SetGoroutines(runtime.NumGoroutine())

// Record system resource usage
metricsCollector.SetCPUUsage(0.45)  // 45% CPU usage
metricsCollector.SetDiskUsage("/var/lib/CHORUS", 0.73)  // 73% disk usage

Recording Errors

// Record error occurrence
if err != nil {
    metricsCollector.IncrementErrors("dht", "timeout")
}

// Record recovered panic
defer func() {
    if r := recover(); r != nil {
        metricsCollector.IncrementPanics()
        // Handle panic...
    }
}()

Prometheus Integration

Scrape Configuration

Add the following to your prometheus.yml:

scrape_configs:
  - job_name: 'chorus-nodes'
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: '/metrics'
    static_configs:
      - targets:
          - 'chorus-node-01:9090'
          - 'chorus-node-02:9090'
          - 'chorus-node-03:9090'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [__address__]
        regex: '([^:]+):.*'
        target_label: node
        replacement: '${1}'

Example Queries

P2P Network Health

# Average connected peers across cluster
avg(chorus_p2p_connected_peers)

# Message rate per second
rate(chorus_p2p_messages_sent_total[5m])

# 95th percentile message latency
histogram_quantile(0.95, rate(chorus_p2p_message_latency_seconds_bucket[5m]))

DHT Performance

# DHT operation success rate
rate(chorus_dht_get_operations_total{status="success"}[5m]) /
rate(chorus_dht_get_operations_total[5m])

# Average DHT operation latency
rate(chorus_dht_operation_latency_seconds_sum[5m]) /
rate(chorus_dht_operation_latency_seconds_count[5m])

# DHT cache hit rate
rate(chorus_dht_cache_hits_total[5m]) /
(rate(chorus_dht_cache_hits_total[5m]) + rate(chorus_dht_cache_misses_total[5m]))

Election Stability

# Leadership changes per hour
rate(chorus_leadership_changes_total[1h]) * 3600

# Nodes by election state
sum by (state) (chorus_election_state)

# Heartbeat rate
rate(chorus_heartbeats_sent_total[5m])

Task Management

# Task success rate
rate(chorus_tasks_completed_total{status="success"}[5m]) /
rate(chorus_tasks_completed_total[5m])

# Average task duration
histogram_quantile(0.50, rate(chorus_task_duration_seconds_bucket[5m]))

# Task queue depth
chorus_tasks_queued

Resource Utilization

# CPU usage by node
chorus_cpu_usage_ratio

# Memory usage by node
chorus_memory_usage_bytes / (1024 * 1024 * 1024)  # Convert to GB

# Disk usage alert (>90%)
chorus_disk_usage_ratio > 0.9

System Health

# Overall system health score
chorus_system_health_score

# Component health scores
chorus_component_health_score

# Health check failure rate
rate(chorus_health_checks_failed_total[5m])

Alerting Rules

Example Prometheus alerting rules for CHORUS:

groups:
  - name: chorus_alerts
    interval: 30s
    rules:
      # P2P connectivity alerts
      - alert: LowPeerCount
        expr: chorus_p2p_connected_peers < 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low P2P peer count on {{ $labels.instance }}"
          description: "Node has {{ $value }} peers (minimum: 2)"

      # DHT performance alerts
      - alert: HighDHTFailureRate
        expr: |
          rate(chorus_dht_get_operations_total{status="failure"}[5m]) /
          rate(chorus_dht_get_operations_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High DHT failure rate on {{ $labels.instance }}"
          description: "DHT failure rate: {{ $value | humanizePercentage }}"

      # Election stability alerts
      - alert: FrequentLeadershipChanges
        expr: rate(chorus_leadership_changes_total[1h]) * 3600 > 5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Frequent leadership changes"
          description: "{{ $value }} leadership changes per hour"

      # Task management alerts
      - alert: HighTaskQueueDepth
        expr: chorus_tasks_queued > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High task queue depth on {{ $labels.instance }}"
          description: "{{ $value }} tasks queued"

      # Resource alerts
      - alert: HighMemoryUsage
        expr: chorus_memory_usage_bytes > 8 * 1024 * 1024 * 1024  # 8GB
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage: {{ $value | humanize1024 }}B"

      - alert: HighDiskUsage
        expr: chorus_disk_usage_ratio > 0.9
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High disk usage on {{ $labels.instance }}"
          description: "Disk usage: {{ $value | humanizePercentage }}"

      # Health monitoring alerts
      - alert: LowSystemHealth
        expr: chorus_system_health_score < 0.75
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low system health score on {{ $labels.instance }}"
          description: "Health score: {{ $value }}"

      - alert: ComponentUnhealthy
        expr: chorus_component_health_score < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Component {{ $labels.component }} unhealthy"
          description: "Health score: {{ $value }}"

HTTP Endpoints

Metrics Endpoint

URL: /metrics
Method: GET
Description: Prometheus metrics in text exposition format

Response Format:

# HELP chorus_p2p_connected_peers Number of connected P2P peers
# TYPE chorus_p2p_connected_peers gauge
chorus_p2p_connected_peers 5

# HELP chorus_dht_put_operations_total Total number of DHT put operations
# TYPE chorus_dht_put_operations_total counter
chorus_dht_put_operations_total{status="success"} 1523
chorus_dht_put_operations_total{status="failure"} 12

# HELP chorus_task_duration_seconds Task execution duration
# TYPE chorus_task_duration_seconds histogram
chorus_task_duration_seconds_bucket{task_type="data_processing",status="success",le="0.001"} 0
chorus_task_duration_seconds_bucket{task_type="data_processing",status="success",le="0.005"} 12
chorus_task_duration_seconds_bucket{task_type="data_processing",status="success",le="0.01"} 45
...

Health Endpoint

URL: /health
Method: GET
Description: Basic health check for the metrics server

Response: 200 OK with body OK
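
Assuming the default ListenAddr of :9090, the endpoint can be checked directly:

curl -s http://localhost:9090/health
OK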

Best Practices

Metric Naming

  • Use descriptive metric names with chorus_ prefix
  • Follow Prometheus naming conventions: component_metric_unit
  • Use _total suffix for counters
  • Use _seconds suffix for time measurements
  • Use _bytes suffix for size measurements
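
The catalog above already follows these conventions; for example:

chorus_dht_put_operations_total      # counter: chorus_ prefix + _total suffix
chorus_task_duration_seconds         # histogram of time: _seconds suffix
chorus_pubsub_message_size_bytes     # histogram of size: _bytes suffix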

Label Usage

  • Keep label cardinality low (avoid high-cardinality labels like request IDs)
  • Use consistent label names across metrics
  • Document label meanings and expected values
  • Avoid labels that change frequently

Performance Considerations

  • Metrics collection is lock-free for read operations
  • Histogram observations are optimized for high throughput
  • Background collectors run on separate goroutines
  • Custom registry prevents pollution of default registry

Error Handling

  • Metrics collection should never panic
  • Failed metric updates should be logged but not block operations
  • Use nil checks before accessing metrics collectors (see the sketch below)
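
A minimal nil-safe sketch of that last point, assuming the collector may be left unset (e.g., when metrics are disabled in tests); IncrementErrors is the helper method documented above:

// recordError never panics and never blocks the caller,
// even when the metrics collector was never constructed.
func recordError(m *metrics.CHORUSMetrics, component, errType string) {
    if m == nil {
        return // metrics disabled; drop the observation
    }
    m.IncrementErrors(component, errType)
}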

Testing

func TestMetrics(t *testing.T) {
    config := metrics.DefaultMetricsConfig()
    config.NodeID = "test-node"

    m := metrics.NewCHORUSMetrics(config)

    // Test metric updates
    m.SetConnectedPeers(5)
    m.IncrementMessagesSent("test", "peer1")

    // Verify metrics are collected
    // (Use prometheus testutil for verification)
}
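
The verification step can be filled in with prometheus/client_golang's testutil package. The sketch below assumes a hypothetical Registry() accessor exposing the custom registry; adapt it to the actual API:

import "github.com/prometheus/client_golang/prometheus/testutil"

// Registry() is a hypothetical accessor for the custom registry.
n, err := testutil.GatherAndCount(m.Registry(), "chorus_p2p_connected_peers")
if err != nil {
    t.Fatalf("gather failed: %v", err)
}
if n != 1 {
    t.Fatalf("expected 1 chorus_p2p_connected_peers series, got %d", n)
}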

Troubleshooting

Metrics Not Appearing

  1. Verify metrics server is running: curl http://localhost:9090/metrics
  2. Check configuration: ensure correct ListenAddr and MetricsPath
  3. Verify Prometheus scrape configuration
  4. Check for errors in application logs

High Memory Usage

  1. Review label cardinality (check for unbounded label values; see the query below)
  2. Adjust histogram buckets if too granular
  3. Reduce metric collection frequency
  4. Consider metric retention policies in Prometheus
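
On the Prometheus side, a series-count query helps locate the offending metric; for example, this ranks chorus metrics by the number of series they expose:

topk(10, count by (__name__) ({__name__=~"chorus_.*"}))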

Missing Metrics

  1. Ensure metric is being updated by application code
  2. Verify metric registration in initializeMetrics()
  3. Check for race conditions in metric access
  4. Review metric type compatibility (Counter vs Gauge vs Histogram)

Migration Guide

From Default Prometheus Registry

// Old approach
prometheus.MustRegister(myCounter)

// New approach
config := metrics.DefaultMetricsConfig()
m := metrics.NewCHORUSMetrics(config)
// Use m.IncrementErrors(...) instead of direct counter access

Adding New Metrics

  1. Add metric field to CHORUSMetrics struct
  2. Initialize metric in initializeMetrics() method
  3. Add helper methods for updating the metric
  4. Document the metric in this file
  5. Add Prometheus queries and alerts as needed
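
A sketch of steps 1-3, assuming the internal layout implied above (the field and method names here are illustrative, not the actual source):

// 1. Add the metric field to the CHORUSMetrics struct.
type CHORUSMetrics struct {
    // ... existing fields ...
    registry       *prometheus.Registry
    cacheEvictions prometheus.Counter // hypothetical new metric
}

// 2. Initialize and register it in initializeMetrics().
func (m *CHORUSMetrics) initializeMetrics() {
    // ... existing initialization ...
    m.cacheEvictions = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "chorus_cache_evictions_total",
        Help: "Total number of cache evictions",
    })
    m.registry.MustRegister(m.cacheEvictions)
}

// 3. Expose a helper method for callers.
func (m *CHORUSMetrics) IncrementCacheEvictions() {
    if m == nil {
        return
    }
    m.cacheEvictions.Inc()
}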