As an SRE Manager, you will lead a team of talented Site Reliability Engineers (SREs) responsible for ensuring the reliability, scalability, and performance of Infrastructure systems.
You ll collaborate with cross-functional teams to drive operational excellence, implement best practices, and maintain high availability for our Top client critical services.
Responsibilities:
Technical Expertise:
Proficiency in cloud services (AWS, GCP, Azure), container orchestration (Kubernetes), and infrastructure as code (Terraform, Ansible).
Knowledge of networking , storage , and database management .
Familiarity with scripting and programming languages (Python, Go, Shell scripting).
Systems Thinking:
Ability to understand complex systems and the interdependencies within them.
Skill in identifying potential points of failure and implementing resilience strategies .
Operational Excellence:
Experience in setting up and tracking SLOs/SLIs and error budgets .
Proficiency in monitoring and observability tools (Splunk, ITSI, BMC TrueSight, Prometheus, Grafana, ELK stack).
Automation and Tooling:
Strong background in automating deployment, scaling, and recovery processes.
Automate deployment, monitoring, and recovery processes using tools like Terraform, Kubernetes, Salt, Chef and Prometheus.
Experience with CI/CD pipelines and version control systems (Git).
Leadership and Communication:
Proven ability to lead and mentor a team of engineers.
Excellent communication skills to collaborate with cross-functional teams and stakeholders.
Manage and mentor a team of SREs, providing guidance, performance feedback, and career development
Job Classification
Industry: IT Services & ConsultingFunctional Area / Department: Engineering - Software & QARole Category: DevOpsRole: Release ManagerEmployement Type: Full time