Enterprise Observability Stack
Building production-grade, self-hosted observability with Prometheus, Grafana, and Loki—achieving full-stack visibility at a fraction of cloud costs
The Challenge
The platform had zero observability across 20 microservices running on Kubernetes. Critical operational questions remained unanswered:
- "Which service is causing the 500 errors?"
- "Why did the pod restart 5 times in the last hour?"
- "What's our Kafka consumer lag right now?"
- "Are we hitting CPU/memory limits?"
- "Where are the logs for that failed deployment?"
The Cost Problem
Cloud observability solutions (Datadog, New Relic, Splunk) would have cost $50K-150K/year at our scale, prohibitively expensive for the budget.
The mandate: build enterprise-grade observability in-house at minimal cost while maintaining production reliability.
The Solution: Self-Hosted Observability Platform
I architected and deployed a complete self-hosted observability stack using open-source tooling, achieving enterprise-grade monitoring at <5% of cloud costs.
Metrics
Prometheus + Thanos for long-term storage
Visualisation
Grafana with custom dashboards
Logs
Loki (microservices mode) + Promtail
Architecture & Implementation
Enterprise Observability Stack Architecture
Metrics Flow: Services (expose /metrics) + Exporters → Prometheus (scrape) → Thanos → S3
Logs Flow: Services (stdout/stderr) → Promtail (DaemonSet collector) → Loki (aggregate)
Traces Flow: Services (OpenTelemetry instrumentation) → Tempo (distributed trace storage)
Visualization: Grafana queries Prometheus, Loki, and Tempo for unified observability (metrics + logs + traces)
Alerting: Prometheus → Alertmanager (routing + inhibition) → Teams (environment-specific channels)
Core Stack Components
1. Prometheus (Metrics Collection)
Configuration: Kustomize base + environment overlays (dev, qa, preprod, prod)
Service Discovery: Kubernetes SD for automatic service detection
Scrape Targets: API servers, nodes, pods, services, exporters (15+ scrape jobs)
Long-term Storage: Thanos sidecar → S3 for historical data (cost-effective)
Retention: 15 days local, unlimited S3 storage
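A condensed sketch of the scrape configuration, using Kubernetes service discovery as described above (job names, intervals, and the opt-in annotation convention are illustrative; the real config is managed through the Kustomize overlays):

```yaml
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod            # discover every pod in the cluster
    relabel_configs:
      # Only scrape pods that opt in via a prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Carry the namespace through as a queryable label
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```

The Thanos sidecar runs alongside this Prometheus instance and uploads completed TSDB blocks to S3, which is what allows the short 15-day local retention.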
2. Exporters (Data Sources)
Deployed comprehensive exporters for full-stack visibility:
- Node Exporter: CPU, memory, disk, network metrics from EC2 instances
- Kube State Metrics: Kubernetes object state (pods, deployments, nodes)
- Kafka Exporter: MSK consumer lag, partition offsets, topic metrics
- Postgres Exporter: Database connections, query performance, replication lag
- Redis Exporter: Cache hit ratio, memory usage, key evictions
- CloudWatch Exporter: AWS service metrics (RDS, MSK, ELB)
3. Grafana (Visualisation)
Deployment: Persistent storage with PVC for dashboard retention
Data Sources: Prometheus (metrics), Loki (logs), AWS CloudWatch
Dashboards Created: 25+ service-specific and platform-wide dashboards
Access: Istio VirtualService with environment-specific routing
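Data sources can be provisioned declaratively so fresh Grafana pods come up pre-wired; a sketch (service URLs are illustrative placeholders for the in-cluster addresses):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus.monitoring.svc:9090   # illustrative in-cluster address
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki-query-frontend.monitoring.svc:3100
```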
4. Loki (Log Aggregation)
Architecture: Microservices mode (distributor, ingester, querier, query-frontend, compactor)
Storage: S3 backend for cost-effective log storage
Collection: Promtail DaemonSet scraping pod logs
Indexing: Label-based indexing (namespace, pod, container) for fast queries
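The Promtail scrape configuration mirrors Prometheus service discovery, attaching exactly the labels Loki indexes on (a sketch; relabeling kept to the essentials):

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Attach the labels Loki indexes on: namespace, pod, container
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
```

With those labels in place, a typical LogQL query stays fast because it narrows by label before filtering text, e.g. `{namespace="production", container="service-a"} |= "error"`.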
5. Alertmanager (Alert Routing)
Integration: Teams webhooks via Power Automate workflows
Smart Routing: Environment-specific channels (Dev → business hours, Prod → 24/7)
Alert Inhibition: Suppress low-severity alerts when critical alerts fire
Grouping: Batch alerts by namespace/service to reduce noise
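The routing, grouping, and inhibition behaviour described above can be sketched as an Alertmanager config (receiver names and the time-interval name are illustrative):

```yaml
route:
  group_by: [namespace, service]     # batch related alerts to reduce noise
  receiver: teams-prod               # default: 24/7 production channel
  routes:
    - matchers: ['env="dev"']
      receiver: teams-dev
      active_time_intervals: [business-hours]   # Dev alerts only in work hours
inhibit_rules:
  # A firing critical alert mutes warnings for the same namespace/service
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: [namespace, service]
```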
Alert Rules Implemented
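The 50+ rules span infrastructure, Kubernetes, network, and application concerns. A representative pair, sketched as a Prometheus rule file (thresholds, durations, and the runbook URL are illustrative):

```yaml
groups:
  - name: platform-alerts
    rules:
      - alert: KafkaConsumerLagHigh
        # kafka_consumergroup_lag is exposed by the Kafka exporter
        expr: sum by (consumergroup, topic) (kafka_consumergroup_lag) > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }} lagging on {{ $labels.topic }}"
          runbook_url: https://wiki.example.com/runbooks/kafka-lag
      - alert: NodeMemoryPressure
        # node_memory_* metrics come from Node Exporter
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} memory above 90%"
```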
Service Instrumentation
Instrumented all 20 microservices to expose custom Prometheus metrics for business intelligence:
Service A (Data Ingestion): Gateway throughput, active connections, telemetry ingestion rate, data pipeline success rate
Service B (Message Processing): Kafka consumer lag, message processing rate, error rates, device processing outcomes
Service C (Rules Engine): Rule execution count, rule evaluation latency, alarm creation rate
Service D (API Gateway): HTTP request rates, response times, error rates, authentication success/failure
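The production services emit these metrics through a Prometheus client library. To make the exposition format concrete, here is a dependency-free sketch of what a `/metrics` endpoint returns; the metric and label names are illustrative, not the production ones:

```python
class Counter:
    """Minimal counter in the Prometheus text exposition format."""

    def __init__(self, name, help_text):
        self.name, self.help_text = name, help_text
        self.values = {}  # label tuple -> float

    def inc(self, labels=(), amount=1.0):
        # labels is a tuple of (key, value) pairs so it can be a dict key
        self.values[labels] = self.values.get(labels, 0.0) + amount

    def render(self):
        # Emit HELP/TYPE headers followed by one sample line per label set
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        for labels, value in sorted(self.values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            suffix = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines)


ingested = Counter("telemetry_ingested_total",
                   "Telemetry messages accepted by the gateway")
ingested.inc(labels=(("tenant", "acme"),))
ingested.inc(labels=(("tenant", "acme"),))
print(ingested.render())
```

Prometheus scrapes this plain-text output on each interval; counters only ever increase, and rates are derived at query time with `rate()`.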
Business Intelligence Dashboards
Created 25+ Grafana dashboards providing comprehensive visibility from infrastructure to business metrics.
Platform Overview Dashboards
- Kubernetes Cluster Health: Node status, pod health, resource utilisation
- Infrastructure Metrics: CPU, memory, disk, network across all nodes
- Service Mesh: Istio traffic flow, success rates, latencies
- Database Performance: Postgres connections, query times, replication lag
- Message Queue Health: Kafka consumer lag, partition metrics, throughput
Service-Specific Dashboards
- IoT Gateway Intelligence: Throughput (req/s), active connections, data pipeline success, vendor performance ranking
- Multi-Tenant Analytics: Tenant activity championship, ingestion rates by tenant, retail chain data flow
- Integration Performance: Vendor system performance, $750K+ annual savings tracking
- Data Quality: Validation rates, error rates, processing latencies
Example: IoT Gateway Dashboard
Mission Control Section: Gateway throughput (20.7 req/s), active IoT connections (3), availability (100%), data pipeline success (100%)
Business Intelligence: Retail chain telemetry ingestion trends, tenant activity rankings, multi-vendor integration performance
Cost Savings Tracking: Annual savings from optimisations ($750K+), vendor performance comparison
Dashboard Previews
Real production dashboards showing metrics collection, business intelligence, and infrastructure monitoring in action.

IoT Gateway Mission Control
Real-time gateway throughput (61.9 req/s), active IoT connections, availability tracking, and data pipeline success rates. Includes multi-tenant business intelligence showing retail chain telemetry ingestion and tenant activity rankings.

Kafka Consumer Lag & Topic Health
Comprehensive Kafka monitoring showing consumer lag trends by group and topic, partition-level lag visualisation, and topic size tracking. Essential for preventing data pipeline bottlenecks and ensuring message processing SLAs.

Node Exporter Infrastructure Metrics
System-level monitoring via Node Exporter showing CPU pressure, memory usage, disk I/O, network traffic, and system load. Critical for infrastructure health and capacity planning across Kubernetes cluster nodes.
Operational Runbooks
Created comprehensive runbooks for every alert, enabling rapid incident response and knowledge sharing.
Runbook Structure
Trigger: Clear description of what fired the alert
Diagnosis: Decision tree with kubectl commands, log queries, and metric queries
Remediation: Copy-paste commands for typical fixes (restart pods, scale deployment, clear cache)
Escalation: Who to contact if standard resolution doesn't work
```shell
# Check pod status
kubectl get pods -l app=service-a -n production

# View recent logs
kubectl logs service-a-abc123 --tail=100

# Check events
kubectl describe pod service-a-abc123 -n production

# Common fixes:
kubectl rollout restart deployment/service-a -n production
kubectl delete pod service-a-abc123 -n production
```
Business Impact
- 95%+ savings on total infrastructure cost vs the $50K-150K/year of cloud solutions
- 25+ Grafana dashboards providing platform-wide to service-specific visibility
- 50+ alert rules covering infrastructure, Kubernetes, network, and application metrics
- Reduced mean time to detect (MTTD): proactive alerting catches issues early
- 100% service coverage: all 20 microservices instrumented and monitored
- Consistent observability across Dev, QA, PreProd, and Production
Technical Highlights
- ✓ Deployed complete self-hosted stack (Prometheus, Grafana, Loki, Alertmanager) with Kustomize
- ✓ Configured Loki microservices architecture for scalable log aggregation
- ✓ Integrated Thanos for long-term metrics storage in S3 (cost-effective historical data)
- ✓ Deployed 6+ specialized exporters for comprehensive infrastructure monitoring
- ✓ Created 50+ alert rules with smart routing and inhibition logic
- ✓ Built 25+ Grafana dashboards from platform infrastructure to business intelligence
- ✓ Instrumented all 20 microservices with custom Prometheus metrics
- ✓ Authored comprehensive runbooks for rapid incident response
- ✓ Achieved 95%+ cost savings vs commercial observability solutions