SRE Manager - Distributed Systems @ Arcesium

Job Description

We are looking for an experienced Engineering Manager to lead our Site Reliability Engineering (SRE) team. The ideal candidate will have a strong background in SRE principles and practices, as well as experience managing and mentoring engineers. The SRE Manager will be responsible for the overall success of the SRE team, including ensuring that our systems are reliable, scalable, and secure. The team monitors the stability and availability of mission-critical production systems, manages incidents to speed up resolution, and establishes business-as-usual (BAU) operations. It also builds tools and infrastructure that all development teams use for monitoring and troubleshooting.

As a Site Reliability Engineering Manager at Arcesium, you are expected to:
  • Manage a team of SREs and SRE Leads
  • Own the end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence
  • Work closely with engineering managers and development teams to ensure that platforms are designed with scale and operability in mind
  • Help manage the team's infrastructure, e.g. container infrastructure on Docker and Kubernetes clusters, Kafka clusters, etc.
  • Manage the team's AWS accounts and other infrastructure provisioning
  • Provide day-to-day support for the dashboard, including responding to outages and triaging cases escalated by clients and internal teams
  • Manage on-call rotations to provide 24-hour coverage
  • Ensure systems are always DR ready
  • Manage team projects using Agile methodologies (Scrum/Kanban)
  • Review processes periodically and drive continual improvement
  • Mentor SREs with incident case-studies and technical workshops
  • Mentor and coach engineers to be curious and effective at discovering and solving technical challenges
What you'll need:
  • 10+ years of experience in DevOps/Site reliability/Automation with 4+ years of People/Team Management exposure
  • Experience with a variety of tools that help manage, understand, and debug large, complex distributed systems
  • Good knowledge of Unix systems, networking, web technologies, databases, and public cloud platforms such as AWS
  • Reliability: Exposure to Chaos Engineering and other reliability practices, including disaster recovery, is good to have
  • IT Service Management: Incident Management, Problem Management, Change Management
  • Languages: Any of Python/Java/Node.js/Ruby
  • Linux: System Administration + Shell Scripting
  • Cloud Computing: Amazon Web Services
  • Microservices & Containerization -- Docker, Kubernetes
  • Version Control -- Git, GitHub, GitLab, etc.
  • Configuration Management -- Ansible/Chef/Puppet
  • Agile: Scrum, Kanban

Job Classification

Industry: Financial Services
Functional Area / Department: Engineering - Software & QA
Role Category: DevOps
Role: Head - DevOps
Employment Type: Full time

Contact Details:

Company: Arcesium
Location(s): Hyderabad

Keyskills: Unix, Cloud computing, Automation, Change management, Networking, Configuration management, Shell scripting, Troubleshooting, Ruby, Python

₹ Not Disclosed

Similar positions

Gen AI- Bangalore

  • Imaginators Try Going
  • 2 - 5 years
  • Bengaluru
  • 2 days ago
₹ 2.5-5.5 Lacs P.A.

Gen AI- Bangalore

  • Imaginators Try Going
  • 2 - 5 years
  • Bengaluru
  • 3 days ago
₹ 2.5-5.5 Lacs P.A.

Senior Infrastructure Engineer - Observability and Python

  • Wells Fargo
  • 4 - 9 years
  • Hyderabad
  • 3 days ago
₹ Not Disclosed

Urgently Hiring - Devops Engineer

  • Talent Sketchers
  • 5 - 10 years
  • Hyderabad
  • 3 days ago
₹ Not Disclosed

Arcesium

https://www.arcesium.com/