Site Reliability Engineer

We’re hiring a Site Reliability Engineer (SRE) to help ensure the reliability and performance of the Targon platform around the clock. You’ll work at the intersection of systems engineering and DevOps to keep our infrastructure scalable, observable, and resilient. You will focus on:

Ensuring our services stay online and performant, including during off-hours
Optimizing our Kubernetes clusters, including service mesh, metrics, and logging
Benchmarking services and identifying bottlenecks in our current infrastructure
Improving observability and alerting systems to catch issues before they impact users
Scaling services to minimize downtime under load
Developing CI/CD pipelines for new and existing services

Ideal Experience

Hands-on experience with Kubernetes in production environments
Proficiency with Golang for systems and infrastructure tooling
Familiarity with confidential virtual machines (CVMs)
Experience with Prometheus, Loki, and Grafana for monitoring and observability

Bonus Skills

Experience with CI/CD tools and best practices
Familiarity with tools like GitHub, Discord, Notion, and Linear for modern team collaboration