Site Reliability Engineer
We’re hiring a Site Reliability Engineer (SRE) to help ensure the reliability
and performance of the Targon platform around the clock. You’ll work at the
intersection of systems engineering and DevOps to keep our infrastructure
scalable, observable, and resilient. You will focus on:
- Ensuring our services stay online and performant, including during off hours
- Optimizing our Kubernetes clusters, including service mesh, metrics, and
logging
- Benchmarking services and identifying bottlenecks in our current
infrastructure
- Improving observability and alerting systems to catch issues before they
impact users
- Scaling services to minimize downtime under load
- Developing CI/CD pipelines for new and existing services
Ideal Experience
- Hands-on experience with Kubernetes in production environments
- Proficiency with Golang for systems and infrastructure tooling
- Familiarity with confidential virtual machines (CVMs)
- Experience with Prometheus, Loki, and Grafana for monitoring and observability
Bonus Skills
- Experience with CI/CD tools and best practices
- Familiarity with tools like GitHub, Discord, Notion, and Linear for modern
team collaboration