
Platform & MLOps Engineer

I get models off laptops and onto clusters that don't fall over.

Production infrastructure for AI workloads and distributed systems: Kubernetes, GPU scheduling, GitOps, observability. MSc AI finishing August 2026. Available for fully remote B2B contracts starting September 2026.

Kubernetes, MLOps, AWS, GitOps, Observability, PyTorch
<expertise />

What I Build

Resilient infrastructure for distributed systems and AI workloads. The boring fundamentals, done well.

Kubernetes & EKS

Cluster operations, GPU node pools, zero-downtime upgrades, right-sizing for cost

Observability Stack

Prometheus, Grafana, Loki, distributed tracing, alerting

GitOps & CI/CD

ArgoCD, pipeline automation, deployment strategies, security scanning

Security Automation

SAST/SCA integration, runtime security, policy-as-code

AWS Cloud

EKS, IAM, VPC, Route 53, S3, MSK, RDS — the boring fundamentals

MLOps & AI Infrastructure

Model serving on K8s, GPU scheduling, reproducible training pipelines, drift monitoring

Data Platforms

Kafka, stream processing, training data pipelines, schema evolution

Or just look at what I've shipped.

View Case Studies
<projects />

Featured Work

Four things I've owned end-to-end. What they are, what changed, and a few decisions worth flagging.

01

Heimdall

Deployment intelligence platform

The dashboard the platform team checks every morning. Answers one question: where is my ticket right now? Used daily by 20+ engineers across 17 services.

17 services tracked
20+ engineers daily
10 min data freshness
Python, Flask, TimescaleDB, Prometheus, ArgoCD, Kubernetes
Read Case Study
heimdall
$ curl heimdall/api/v1/debug | jq .
collection.age_seconds: 142
db_pool.checked_out: 2 / 10
circuit_breakers: all closed
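
To illustrate how a debug payload like the one above could be checked against the 10 min freshness figure, here is a minimal Python sketch. The field names mirror the sample output, but the nested payload shape, the `is_healthy` helper, the breaker names and the pool-headroom rule are all assumptions made for illustration, not Heimdall's actual code.

```python
# Hypothetical health check over a debug payload shaped like the sample output.
# Payload structure, breaker names and thresholds are illustrative assumptions.

FRESHNESS_LIMIT_SECONDS = 10 * 60  # the 10 min data-freshness figure


def is_healthy(debug: dict, limit: int = FRESHNESS_LIMIT_SECONDS) -> bool:
    """Return True when data is fresh, the pool has headroom, and no breaker is open."""
    fresh = debug["collection"]["age_seconds"] <= limit
    pool = debug["db_pool"]
    has_headroom = pool["checked_out"] < pool["size"]
    breakers_closed = all(
        state == "closed" for state in debug["circuit_breakers"].values()
    )
    return fresh and has_headroom and breakers_closed


sample = {
    "collection": {"age_seconds": 142},
    "db_pool": {"checked_out": 2, "size": 10},
    "circuit_breakers": {"argocd": "closed", "prometheus": "closed"},
}

print(is_healthy(sample))
```

With the sample values (142 seconds old, 2 of 10 connections checked out, all breakers closed) every condition holds. Raising `age_seconds` past 600 would fail the check.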
02

Pipeline Platform

Shared CI/CD library

One Bitbucket pipeline library, imported by every Java and Node service. Tests live in their own repo, promotion belongs to ArgoCD. ~400 deploys/month across 20 services, each onboarded with a single .ci/builds.yaml file.

20 services, one library
~400 deploys/month
1 file to onboard
Bitbucket Shared Pipelines, ArgoCD, Image Updater, Kubernetes, Kustomize
Read Case Study
pipeline-platform
$ cat .ci/builds.yaml
service: payments-api
import: java-shared-pipeline:1.4.0
→ Image Updater handles the rest
03

Observability Stack

Self-hosted monitoring

Prometheus, Grafana and Loki for 20 services across four environments. Built it ourselves because the commercial quotes were ~£100k and we already had the cluster capacity.

~£5k/yr vs ~£100k commercial
~25 dashboards
50+ alerts, runbook each
Prometheus, Grafana, Loki, Thanos, Alertmanager
Read Case Study
observability
$ prometheus targets
20/20 targets healthy
25 dashboards active
50+ alert rules configured
04

Smart Home on K3s

Self-hosted home automation

Single-node Kubernetes cluster on a Raspberry Pi 5, GitOps-reconciled by ArgoCD, observable end-to-end through Prometheus and Grafana. Twenty-plus lights, plugs and sensors. Zero ports exposed to the internet. Same discipline I apply at work, sized to a flat.

Single-node: K3s + ArgoCD + Prometheus
20+ lights, plugs and sensors
0 ports exposed to the internet
K3s, ArgoCD, Home Assistant, Zigbee2MQTT, Prometheus, Grafana, Tailscale
Read Case Study
smart-home
$ kubectl get apps -n argocd
home-assistant Synced Healthy
zigbee2mqtt Synced Healthy
prometheus + grafana Synced Healthy
<contact />

Let's talk.

For teams that need specialised infrastructure for AI workloads, GPU-aware Kubernetes, or a platform that holds up under production load. Outside IR35 or international B2B equivalent. I usually reply within a day.
