We are looking for an experienced Engineering Manager to lead our Site Reliability Engineering (SRE) team. The ideal candidate will have a strong background in SRE principles and practices, as well as experience managing and mentoring engineers. The SRE Manager will be responsible for the overall success of the SRE team, including ensuring that our systems are reliable, scalable, and secure. The team is responsible for monitoring the stability and availability of mission-critical production systems, managing incidents for quicker resolution, and establishing business-as-usual (BAU) operations. The team also builds tools and infrastructure that all development teams use for monitoring and troubleshooting.
As a Site Reliability Engineering Manager at Arcesium, you are expected to:
Manage a team of SREs / SRE Leads
Own end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence
Work closely with engineering managers and development teams to ensure that platforms are designed with scale and operability in mind
Help manage the team's infrastructure, e.g. container infrastructure using Docker and Kubernetes clusters, Kafka clusters, etc.
Manage the team's AWS accounts and other infrastructure provisioning.
Provide day-to-day support for the dashboard, including responding to outages and triaging cases escalated by clients/internal teams
Manage on-call rotations to provide 24-hour coverage
Ensure systems are always DR-ready
Manage team projects using Agile methodologies (Scrum/Kanban).
Review processes periodically and drive continual improvement.
Mentor SREs with incident case-studies and technical workshops
Mentor and coach engineers to be curious and effective at discovering and solving technical challenges
What you'll need:
10+ years of experience in DevOps/Site Reliability/Automation, with 4+ years of people/team management exposure
Experience with a variety of tools that help manage, understand, and debug large, complex distributed systems
Good knowledge of Unix systems, networking, web technologies, databases, and public cloud platforms such as AWS
Reliability: Exposure to Chaos Engineering and other reliability practices, including disaster recovery, is good to have
IT Service Management: Incident Management, Problem Management, Change Management
Agile: Scrum, Kanban
Job Classification
Industry: Financial Services
Functional Area / Department: Engineering - Software & QA
Role Category: DevOps
Role: Head - DevOps
Employment Type: Full time