Manage production incidents, perform root cause analysis, and ensure preventive actions are implemented.
Collaborate with development, QA, and infrastructure teams to ensure applications are production-ready and reliable.
Implement and maintain observability tools (monitoring, logging, alerting) for proactive issue detection and resolution.
Support CI/CD pipelines and help enforce SRE best practices.
Automate routine production tasks and manual interventions.
Participate in on-call rotations for incident response and escalation handling.
Drive continuous improvement in system reliability, performance, and supportability.
Ensure compliance with internal controls, security standards, and disaster recovery protocols.
Required Skills & Experience:
6+ years of experience in Production Support or SRE roles.
Strong knowledge of Linux/Unix systems and scripting (Shell, Python, etc.).
Experience with monitoring tools like Prometheus, Grafana, AppDynamics, Splunk, or ELK Stack.
Familiarity with cloud platforms such as AWS.
Exposure to containerization with Docker and container orchestration with Kubernetes.
Experience with incident management processes and tools (ServiceNow, JIRA).
Understanding of SRE principles such as SLAs, SLOs, SLIs, and error budgets.
Background in Core Banking/Financial applications is a plus.
Job Classification:
Industry: IT Services & Consulting
Functional Area / Department: Engineering - Software & QA
Role Category: Software Development
Role: Software Development - Other
Employment Type: Full time