Senior Site Reliability Engineer
As we expand our global footprint, we are building a strong Site Reliability Engineering capability to ensure the resilience, scalability, and performance of mission-critical systems for our clients.
We are seeking a Senior Site Reliability Engineer (SRE) to join our international team on a remote basis. This role combines software engineering, systems engineering, and operations expertise to design and maintain highly reliable distributed systems. You will work closely with engineering teams to improve system availability, performance, and efficiency through automation, monitoring, and modern SRE practices.
This is a hands-on, senior-level role with ownership over reliability engineering, where you will be expected to guide best practices, mentor engineers, and help scale production systems for global enterprises.
Key Responsibilities
- Design, build, and maintain highly available and scalable infrastructure across cloud platforms (AWS, Azure, or GCP)
- Develop automation and tooling to reduce toil, improve deployments, and ensure reliability
- Establish SLIs, SLOs, and SLAs with engineering teams and drive a culture of reliability
- Build observability stacks with Prometheus, Grafana, ELK, or similar tools
- Implement incident management, root cause analysis, and postmortem processes
- Partner with development teams to design resilient architectures and enforce production readiness standards
- Optimize performance and reliability of distributed systems, databases, and networks
- Own capacity planning, scalability, and disaster recovery strategies
- Participate in on-call rotations and ensure smooth incident response processes
- Contribute to infrastructure-as-code (Terraform, CloudFormation) and CI/CD automation pipelines
Requirements
- 7+ years of experience in systems engineering, DevOps, or SRE roles, including at least 3 years in an SRE-focused capacity
- Strong Linux/Unix systems expertise and deep understanding of OS fundamentals
- Expertise in cloud infrastructure (AWS strongly preferred; Azure/GCP also valuable)
- Hands-on experience with Kubernetes and container orchestration in production
- Strong automation and scripting skills (Python, Bash, or Go)
- Proven experience implementing monitoring, logging, and observability at scale (Grafana, Prometheus, ELK, etc.)
- Solid troubleshooting skills for distributed systems, networking, and application-level issues
- Experience with CI/CD systems (Jenkins, GitLab CI, or similar)
- Strong knowledge of reliability principles: error budgets, SLIs/SLOs, chaos engineering, capacity planning
- Excellent communication skills, strong attention to detail, and the ability to operate independently in a remote team environment
Nice to Have
- Certifications in AWS, Kubernetes, or SRE/DevOps tooling
- Experience with large-scale database systems (Cassandra, MySQL, Postgres)
- Familiarity with incident management platforms (PagerDuty, Opsgenie)
- Exposure to security, compliance, and data governance in cloud environments
What We Offer
- Flexible remote role with opportunities to shape our SRE practice
- Work with international clients across South Africa, the Middle East, and Europe
- Engineering-first culture with autonomy and ownership
- Competitive compensation commensurate with senior-level expertise
- Opportunity to build and run production-grade systems at scale in cloud-native environments