Senior Site Reliability Engineer

Location: Remote - Eastern Europe - South Africa

Type: Full-time

Level: Senior

As we expand our global footprint, we are building a strong Site Reliability Engineering capability to ensure the resilience, scalability, and performance of mission-critical systems for our clients.

We are seeking a Senior Site Reliability Engineer (SRE) to join our international team on a remote basis. This role combines software engineering, systems engineering, and operations expertise to design and maintain highly reliable distributed systems. You will work closely with engineering teams to improve system availability, performance, and efficiency through automation, monitoring, and modern SRE practices.

This is a hands-on, senior-level role with ownership over reliability engineering, where you will be expected to guide best practices, mentor engineers, and help scale production systems for global enterprises.

Key Responsibilities

Design, build, and maintain highly available and scalable infrastructure across cloud platforms (AWS, Azure, or GCP)
Develop automation and tooling to reduce toil, improve deployments, and ensure reliability
Establish SLIs, SLOs, and SLAs with engineering teams and drive a culture of reliability
Build observability stacks with Prometheus, Grafana, ELK, or similar tools
Implement incident management, root cause analysis, and postmortem processes
Partner with development teams to design resilient architectures and enforce production readiness standards
Optimize performance and reliability of distributed systems, databases, and networks
Own capacity planning, scalability, and disaster recovery strategies
Participate in on-call rotations and ensure smooth incident response processes
Contribute to infrastructure-as-code (Terraform, CloudFormation) and CI/CD automation pipelines

Requirements

7+ years in systems engineering, DevOps, or SRE roles (with at least 3 years in an SRE-focused capacity)
Strong Linux/Unix systems expertise and deep understanding of OS fundamentals
Expertise in cloud infrastructure (AWS strongly preferred; Azure/GCP also valuable)
Hands-on experience with Kubernetes and container orchestration in production
Strong automation and scripting skills (Python, Bash, or Go)
Proven experience implementing monitoring, logging, and observability at scale (Grafana, Prometheus, ELK, etc.)
Solid troubleshooting skills for distributed systems, networking, and application-level issues
Experience with CI/CD systems (Jenkins, GitLab CI, or similar)
Strong knowledge of reliability principles: error budgets, SLIs/SLOs, chaos engineering, capacity planning
Excellent communication skills, attention to detail, and the ability to operate independently in a remote team environment

Nice to Have

Certifications in AWS, Kubernetes, or SRE/DevOps tooling
Experience with large-scale database systems (Cassandra, MySQL, Postgres)
Familiarity with incident management platforms (PagerDuty, Opsgenie)
Exposure to security, compliance, and data governance in cloud environments

What We Offer

Flexible remote role with opportunities to shape our SRE practice
Work with international clients across South Africa, the Middle East, and Europe
Engineering-first culture with autonomy and ownership
Competitive compensation based on senior-level expertise
Opportunity to build and run production-grade systems at scale in cloud-native environments