
Enterprise Observability Stack

Building production-grade, self-hosted observability with Prometheus, Grafana, and Loki—achieving full-stack visibility at a fraction of cloud costs

2024-2025
20+ Services • 4 Environments

The Challenge

The platform had zero observability across 20 microservices running on Kubernetes. Critical operational questions remained unanswered:

  • "Which service is causing the 500 errors?"
  • "Why did the pod restart 5 times in the last hour?"
  • "What's our Kafka consumer lag right now?"
  • "Are we hitting CPU/memory limits?"
  • "Where are the logs for that failed deployment?"

The Cost Problem

Cloud observability solutions (Datadog, New Relic, Splunk) would cost $50K-150K/year at our scale, prohibitively expensive for our budget

The mandate: Build enterprise-grade observability in-house at minimal cost while maintaining production reliability

The Solution: Self-Hosted Observability Platform

I architected and deployed a complete self-hosted observability stack using open-source tooling, achieving enterprise-grade monitoring at <5% of cloud costs.

  • Metrics: Prometheus + Thanos for long-term storage
  • Visualisation: Grafana with custom dashboards
  • Logs: Loki (microservices mode) + Promtail

Architecture & Implementation

Enterprise Observability Stack Architecture

[Architecture diagram] Infrastructure exporters (Node Exporter, Kube State Metrics, Kafka/Postgres/Redis) and 20+ microservices expose /metrics endpoints with custom business metrics; Prometheus scrapes them on a 30s interval and evaluates 50+ alert rules, with Thanos handling long-term storage (compression, downsampling) in AWS S3 for unlimited retention. Promtail runs as a DaemonSet, enriching pod logs with pod/namespace labels and shipping them to Loki (microservices mode, label-based indexing); OpenTelemetry traces are stored in Tempo. Grafana (25+ dashboards spanning IoT, platform, and business views) queries Prometheus, Loki, and Tempo for unified observability; Alertmanager applies smart grouping and inhibition rules and routes notifications to Teams via Power Automate (Dev: business hours, QA/Prod: 24/7).

Metrics Flow: Services (expose /metrics) + Exporters → Prometheus (scrape) → Thanos → S3

Logs Flow: Services (stdout/stderr) → Promtail (DaemonSet collector) → Loki (aggregate)

Traces Flow: Services (OpenTelemetry instrumentation) → Tempo (distributed trace storage)

Visualization: Grafana queries Prometheus, Loki, and Tempo for unified observability (metrics + logs + traces)

Alerting: Prometheus → Alertmanager (routing + inhibition) → Teams (environment-specific channels)

Core Stack Components

1. Prometheus (Metrics Collection)

Configuration: Kustomize base + environment overlays (dev, qa, preprod, prod)
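
A minimal sketch of what one environment overlay might look like; the directory layout, patch file name, and namespace are illustrative assumptions rather than the actual repository structure:

# overlays/prod/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: monitoring                  # assumed namespace for the observability stack
resources:
  - ../../base                         # shared Prometheus manifests
patches:
  - path: prometheus-retention.yaml    # prod-specific retention, resources, external labels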

Service Discovery: Kubernetes SD for automatic service detection

Scrape Targets: API servers, nodes, pods, services, exporters (15+ scrape jobs)

Long-term Storage: Thanos sidecar → S3 for historical data (cost-effective)

Retention: 15 days local, unlimited S3 storage
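
Below is a minimal sketch of the pod-level Kubernetes SD scrape job; it assumes the conventional prometheus.io/* pod annotations, and the job name, relabel rules, and external labels are illustrative rather than the production configuration:

# prometheus.yml (excerpt, illustrative) - auto-discovering pod /metrics endpoints
global:
  scrape_interval: 30s                 # matches the 30s interval noted in the architecture
  external_labels:
    cluster: prod                      # assumed label so Thanos can distinguish data sources

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods that opt in via the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Carry namespace and pod name through as labels for dashboards and alert routing
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod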

2. Exporters (Data Sources)

Deployed comprehensive exporters for full-stack visibility:

  • Node Exporter: CPU, memory, disk, network metrics from EC2 instances
  • Kube State Metrics: Kubernetes object state (pods, deployments, nodes)
  • Kafka Exporter: MSK consumer lag, partition offsets, topic metrics
  • Postgres Exporter: Database connections, query performance, replication lag
  • Redis Exporter: Cache hit ratio, memory usage, key evictions
  • CloudWatch Exporter: AWS service metrics (RDS, MSK, ELB)
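
As a representative example, a node-exporter DaemonSet along the lines below runs one collector per node; the image tag, namespace, and mounts are assumptions for illustration, not the deployed manifest:

# node-exporter DaemonSet (excerpt, illustrative) - host metrics on :9100 from every node
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring                # assumed namespace
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
      annotations:
        prometheus.io/scrape: "true"   # picked up by the pod scrape job sketched above
        prometheus.io/port: "9100"
    spec:
      hostNetwork: true                # report host-level network and filesystem metrics
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.8.1   # illustrative version
          args: ["--path.rootfs=/host"]
          ports:
            - containerPort: 9100
          volumeMounts:
            - name: rootfs
              mountPath: /host
              readOnly: true
      volumes:
        - name: rootfs
          hostPath:
            path: /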

3. Grafana (Visualisation)

Deployment: Persistent storage with PVC for dashboard retention

Data Sources: Prometheus (metrics), Loki (logs), AWS CloudWatch

Dashboards Created: 25+ service-specific and platform-wide dashboards

Access: Istio VirtualService with environment-specific routing
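
Data sources can be provisioned declaratively so every environment gets the same backends; the file below is a sketch with assumed in-cluster service URLs:

# grafana provisioning/datasources/datasources.yaml (illustrative)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.monitoring.svc:9090           # assumed in-cluster service
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki-query-frontend.monitoring.svc:3100  # assumed query-frontend endpoint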

4. Loki (Log Aggregation)

Architecture: Microservices mode (distributor, ingester, querier, query-frontend, compactor)

Storage: S3 backend for cost-effective log storage

Collection: Promtail DaemonSet scraping pod logs

Indexing: Label-based indexing (namespace, pod, container) for fast queries
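
A condensed sketch of the shared Loki configuration; in microservices mode each component (distributor, ingester, querier, query-frontend, compactor) runs the same config with its own -target flag. The bucket, schema version, and replication factor here are illustrative assumptions:

# loki config (excerpt, illustrative) - S3-backed storage with label-based indexing
auth_enabled: false

common:
  replication_factor: 3                # assumed ingester replication
  ring:
    kvstore:
      store: memberlist

storage_config:
  aws:
    s3: s3://eu-west-1/loki-chunks     # placeholder region/bucket

schema_config:
  configs:
    - from: "2024-01-01"
      store: tsdb                      # assumed index type
      object_store: aws
      schema: v13
      index:
        prefix: index_
        period: 24h

compactor:
  working_directory: /loki/compactor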

5. Alertmanager (Alert Routing)

Integration: Teams webhooks via Power Automate workflows

Smart Routing: Environment-specific channels (Dev → business hours, Prod → 24/7)

Alert Inhibition: Suppress low-severity alerts when critical alerts fire

Grouping: Batch alerts by namespace/service to reduce noise
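
The routing and inhibition behaviour described above roughly corresponds to a configuration like the following; receiver names, matcher labels, and webhook URLs are placeholders:

# alertmanager.yml (excerpt, illustrative)
route:
  group_by: [namespace, alertname]     # batch related alerts to reduce noise
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: teams-dev                  # default: business-hours Dev channel
  routes:
    - matchers: ['env = "prod"']
      receiver: teams-prod             # 24/7 Prod channel

receivers:
  - name: teams-dev
    webhook_configs:
      - url: https://example-power-automate-flow/dev    # placeholder Power Automate webhook
  - name: teams-prod
    webhook_configs:
      - url: https://example-power-automate-flow/prod   # placeholder Power Automate webhook

inhibit_rules:
  # Suppress warning-level alerts for a target already covered by a firing critical alert
  - source_matchers: ['severity = "critical"']
    target_matchers: ['severity = "warning"']
    equal: [namespace, alertname]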

Alert Rules Implemented

  • Node Alerts: High CPU/Memory/Disk (85% warning, 95% critical)
  • Kubernetes Alerts: Pod crash loops, nodes not ready, replica mismatches, PV usage
  • Network Alerts: High latency, dropped packets, interface errors
  • Application Alerts: Service-specific metrics (HTTP errors, consumer lag, query latency)
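
For example, the node CPU thresholds above might be expressed as a pair of Prometheus rules like these; the exact expressions, hold times, and annotation text are illustrative:

# node CPU rules (illustrative) - 85% warning / 95% critical thresholds
groups:
  - name: node-alerts
    rules:
      - alert: NodeHighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m                       # assumed hold time before firing
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
      - alert: NodeCriticalCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU on {{ $labels.instance }}"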

Service Instrumentation

Instrumented all 20 microservices to expose custom Prometheus metrics for business intelligence:

Service A (Data Ingestion): Gateway throughput, active connections, telemetry ingestion rate, data pipeline success rate

Service B (Message Processing): Kafka consumer lag, message processing rate, error rates, device processing outcomes

Service C (Rules Engine): Rule execution count, rule evaluation latency, alarm creation rate

Service D (API Gateway): HTTP request rates, response times, error rates, authentication success/failure
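
With Kubernetes SD in place, an instrumented service only needs to advertise its endpoint; the annotation convention below matches the pod scrape job sketched earlier, and the service name and port are hypothetical:

# Deployment pod template (excerpt, illustrative) - opting a service into scraping
spec:
  template:
    metadata:
      labels:
        app: service-d                 # hypothetical service name
      annotations:
        prometheus.io/scrape: "true"   # matched by the kubernetes-pods relabel rule
        prometheus.io/port: "8080"     # assumed metrics port; port/path annotations
        prometheus.io/path: /metrics   # rely on relabel rules omitted from the earlier sketch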

Business Intelligence Dashboards

Created 25+ Grafana dashboards providing comprehensive visibility from infrastructure to business metrics.

Platform Overview Dashboards

  • Kubernetes Cluster Health: Node status, pod health, resource utilisation
  • Infrastructure Metrics: CPU, memory, disk, network across all nodes
  • Service Mesh: Istio traffic flow, success rates, latencies
  • Database Performance: Postgres connections, query times, replication lag
  • Message Queue Health: Kafka consumer lag, partition metrics, throughput

Service-Specific Dashboards

  • IoT Gateway Intelligence: Throughput (req/s), active connections, data pipeline success, vendor performance ranking
  • Multi-Tenant Analytics: Tenant activity championship, ingestion rates by tenant, retail chain data flow
  • Integration Performance: Vendor system performance, $750K+ annual savings tracking
  • Data Quality: Validation rates, error rates, processing latencies

Example: IoT Gateway Dashboard

Mission Control Section: Gateway throughput (20.7 req/s), active IoT connections (3), availability (100%), data pipeline success (100%)

Business Intelligence: Retail chain telemetry ingestion trends, tenant activity rankings, multi-vendor integration performance

Cost Savings Tracking: Annual savings from optimisations ($750K+), vendor performance comparison

Dashboard Previews

Real production dashboards showing metrics collection, business intelligence, and infrastructure monitoring in action.

IoT Gateway Mission Control
Service Metrics

Real-time gateway throughput (61.9 req/s), active IoT connections, availability tracking, and data pipeline success rates. Includes multi-tenant business intelligence showing retail chain telemetry ingestion and tenant activity rankings.

Kafka Consumer Lag & Topic Health
Data Platform

Comprehensive Kafka monitoring showing consumer lag trends by group and topic, partition-level lag visualisation, and topic size tracking. Essential for preventing data pipeline bottlenecks and ensuring message processing SLAs.

Node Exporter Infrastructure Metrics
Infrastructure

System-level monitoring via Node Exporter showing CPU pressure, memory usage, disk I/O, network traffic, and system load. Critical for infrastructure health and capacity planning across Kubernetes cluster nodes.

Operational Runbooks

Created comprehensive runbooks for every alert, enabling rapid incident response and knowledge sharing.

Runbook Structure

1. Symptom: Clear description of what triggered the alert

2. Investigation Steps: Decision tree with kubectl commands, log queries, metric queries

3. Common Resolution Commands: Copy-paste commands for typical fixes (restart pods, scale deployment, clear cache)

4. Escalation Path: Who to contact if standard resolution doesn't work

Example: Pod Crash Loop Runbook
# Check pod status
kubectl get pods -l app=service-a -n production

# View recent logs
kubectl logs service-a-abc123 -n production --tail=100

# Check events
kubectl describe pod service-a-abc123 -n production

# Common fixes:
kubectl rollout restart deployment/service-a -n production
kubectl delete pod service-a-abc123 -n production

Business Impact

<$5K/yr

Total infrastructure cost (vs $50K-150K for cloud solutions)—95%+ savings

25+

Grafana dashboards providing platform-wide to service-specific visibility

50+

Alert rules covering infrastructure, Kubernetes, network, and application metrics

~70%

Reduction in mean time to detect (MTTD)—proactive alerting catches issues early

100%

Service coverage—all 20 microservices instrumented and monitored

4 Envs

Consistent observability across Dev, QA, PreProd, and Production

Technical Highlights

  • Deployed complete self-hosted stack (Prometheus, Grafana, Loki, Alertmanager) with Kustomize
  • Configured Loki microservices architecture for scalable log aggregation
  • Integrated Thanos for long-term metrics storage in S3 (cost-effective historical data)
  • Deployed 6+ specialized exporters for comprehensive infrastructure monitoring
  • Created 50+ alert rules with smart routing and inhibition logic
  • Built 25+ Grafana dashboards from platform infrastructure to business intelligence
  • Instrumented all 20 microservices with custom Prometheus metrics
  • Authored comprehensive runbooks for rapid incident response
  • Achieved 95%+ cost savings vs commercial observability solutions