YOUR IMPACT:
We are seeking a highly skilled and experienced Level 3 Site Reliability Engineer (SRE) to join our Cloud Operations team. This role is critical in driving advanced engineering initiatives to ensure infrastructure reliability, scalability, and automation across multi-cloud environments. As an L3 SRE, you will lead complex cloud support operations, troubleshoot infrastructure as code, implement observability frameworks, and guide junior SREs while helping shape future architectural direction.
This role demands hands-on expertise in AWS, Azure, or GCP, advanced scripting, and deep observability integration, contributing directly to uptime, automation maturity, and strategic improvements to cloud infrastructure.
WHAT THE ROLE OFFERS:
Cloud Infrastructure & Architecture
- Architect and maintain scalable, resilient systems across AWS, Azure, and GCP.
- Lead cloud adoption and migration strategies while ensuring minimal disruption and high reliability.
- Implement security and governance controls including VPC, Security Groups, Route 53, ACM, and Security Hub.
- Perform deep infrastructure troubleshooting and root cause analysis, especially with IaC-based deployments.
Infrastructure as Code (IaC) & Configuration Management
- Design and manage infrastructure using Terraform, Terragrunt, and CloudFormation.
- Oversee configuration management using tools like AWS SSM, SaltStack, and Packer.
- Review and remediate issues within Git-based CI/CD workflows for IaC and service deployment.
Observability & Monitoring
- Build and maintain monitoring/alerting pipelines using CloudWatch, EventBridge, SNS, and Hund.io.
- Develop custom observability tooling for end-to-end visibility and proactive issue detection.
- Lead incident response and contribute to post-incident reviews and reliability reports.
Automation, Scripting & CI/CD
- Develop and maintain automation tools using Bash, Python, Ruby, or PHP.
- Integrate deployment pipelines into secure, scalable CI/CD processes.
- Automate vulnerability assessments and compliance scans aligned with ISO 27001 standards.
Containerization & Microservices Support
- Lead container platform deployments using EKS, ECS, ECR, and Fargate.
- Guide engineering teams in Kubernetes resource optimization and troubleshooting.
Database & Storage Management
- Provide advanced operational support for RDS, PostgreSQL, and Elasticsearch.
- Monitor database performance and ensure availability across distributed systems.
Mentorship & Strategy
- Mentor L1 and L2 SREs on technical tasks and troubleshooting best practices.
- Contribute to cloud architecture planning, operational readiness, and process improvements.
- Help define and track Key Performance Indicators (KPIs) related to system uptime, MTTR, and automation coverage.
WHAT YOU NEED TO SUCCEED:
- 7-12 years of experience in Site Reliability Engineering or DevOps roles.
- Advanced expertise in multi-cloud environments (AWS, Azure, GCP).
- Strong Linux and Windows administration background (Fedora, Debian, Microsoft Windows).
- Proficiency in Terraform, Terragrunt, CloudFormation, and config management tools.
- Hands-on experience with monitoring tools such as CloudWatch, SNS, and EventBridge, plus third-party integrations.
- Advanced scripting skills in Python, Bash, Ruby, or PHP.
- Knowledge of container platforms including EKS, ECS, and Fargate.
- Familiarity with Vulnerability Management, ISO 27001, and audit-readiness practices.