Job Description
About the Role:
Grade Level (for internal use): 13
S&P Global Ratings
The Role: Director, Application Operations, SRE (Site Reliability Engineering)
The Team:
This team is part of the global SRE group that provides Site Reliability Engineering services for the critical applications that analysts use to conduct business. The Application Operations team is responsible for the Stability (Uptime), Reliability (Quality & Performance) and Engineering of these applications to improve business outcomes, user experience and efficiencies. The team operates at the intersection of IT operations and software development, ensuring that our services are not only robust but also agile enough to adapt to ever-evolving business needs.
Impact and Responsibilities:
The impact of this role extends far beyond the immediate team. You will be instrumental in shaping the reliability and performance standards of our critical applications, ensuring they meet the highest benchmarks. By driving advancements in automation and cloud technologies, you will contribute significantly to the organization's strategic goals and to toil reduction, enhancing both the user experience and operational efficiency. You will nurture team members to be best-in-class through upskilling and cross-skilling.
General & Team management:
Ensure the team balances its focus between daily operational tasks and strategic long-term projects
Drive the adoption of new technologies and processes through training and mentoring
Lead/Mentor/Guide/Coach and transform a team of Application Operations to SREs
Create/maintain documentation for systems and processes to ensure continuity and knowledge sharing within the team. Adoption of Gen AI to leverage knowledge repository
Collaborate with cross-functional teams to ensure seamless integration and support for new technologies and initiatives
Oversee daily operations and ensure the shifts are adequately managed
Set the roadmap; derive goals for each team member; review, motivate and support to make them successful
Stability:
Build an SRE practice that improves system stability with Monitoring & AIOps. Avert P1/P2 incidents and minimize business impact
Analyze system vulnerabilities and SPOFs, and address them proactively to improve stability
Refactor monolithic apps and databases to containerized services to improve delivery/scale
Work with business users to understand their needs and issues, develop root cause analyses and work with cross-functional teams to address them permanently
Reliability:
Monitor system performance and create strategies to improve it
Reduce the number of incidents and the time taken to resolve them (MTTR)
Develop and implement disaster recovery plans to ensure business continuity
Lead DevOps transformation to improve the delivery of value to business, reduction of costs & manual errors, increased velocity of releases and improved config management
Engineering:
Involvement in Architecture and Development design reviews (Shift-left) for new implementation and integration projects to build SRE best practices into the SDLC
Continuously look for opportunities to automate tasks, simplify processes and enable self-service to reduce toil
Value Stream Alignment:
While alignment as a horizontal lead is expected to begin with, it's expected that you also take on the role of an SRE value stream lead going forward.
Ensure smooth inter-working with value streams (VS) to meet the objectives & realize value
Foster 2-way knowledge sharing with VS and reduce dependency on SRE
Help shepherd VS to improve SRE maturity levels; implement & prioritize best practices like monitoring, post-mortem, toil reduction, retrospectives etc.
Application to User Journey orientation and transformation
What's in it for you:
In this role, you will have the opportunity to collaborate with a diverse and talented team, working on cutting-edge technology solutions to drive efficiency and innovation within the organization. You will be at the forefront of implementing best practices in site reliability engineering, with a strong emphasis on automation, cloud technologies, and performance optimization. You will interface with the value stream leads to improve the SRE practices and maturity levels within the value streams.
What We're Looking For:
Basic Qualifications:
Bachelor's degree in computer science or equivalent is required, or in lieu, a demonstrated equivalence in work experience
15+ years of experience in the Information Technology domain including cloud, systems & database administration, networking, performance, and application operations
Proven experience in IT Operations and/or Site Reliability Engineering, successful handling of Application Operations in a complex IT setup
Manage Multi-cloud (AWS/Azure) environments
Engineering and implementing proactive monitoring of applications, infrastructure & databases. Engineering automation to self-heal and mature towards AIOps
Manage, innovate, and create processes, software and tools that continuously improve the availability, reliability, scalability, latency and efficiency of platforms
Engineer Self-service portals, Scalable platforms and repeatable processes that allow product teams to own the entire life cycle of their products, reducing the SRE dependency
Excellent communication skills with experience in managing, coaching, and building highly effective teams
Manage and inspire a team of full stack Site Reliability Engineers across regions and time zones, emphasizing collaboration and efficiency
Establish relationships with business teams & other IT partners. Identifying and measuring KPIs like CSAT/NPS scores, establishing feedback channels which have a direct correlation to UX
Cost management through forecasting consumption, budgeting, tagging assets & tracking cost, disposing unused allocations & right sizing, optimizing usage & correlating cost to business value
Establish incident & defect review process to help guide and continually improve stability of applications
Shapes and leverages advanced conceptual thinking to solve complex and/or completely new or novel situations that have never been dealt with before
Actively pursues innovative solutions that align with the company's tolerance for risk (business and reputational)
Looks at external companies, products and capabilities and how they may accelerate Ratings technology initiatives
Preferred Qualifications:
Experience in application & data architecture, system design, algorithms, data structures, complexity analysis, and software design
Ability to architect high availability applications and servers on cloud, adhering to best practices
Ability to perform technical deep-dives into code, networking, systems, databases and storage configuration
Experience working in Agile software product development
Experience working with stakeholders and collaborating across organizational boundaries
Configuration management, automation of patching, threat and vulnerability management, security monitoring, network security, endpoint security, cloud application and data security
Awareness of security frameworks like NIST to address technology, information and resilience risk, information security and risk management
Support & transform ITSM processes (Incident, Change & Problem management) to align with DevOps maturity
Job Classification:
Industry: Banking
Functional Area / Department: Engineering - Software & QA
Role Category: Software Development
Role: Head - Engineering
Employment Type: Full time
Contact Details:
Company: S&P Global Market
Location(s): Hyderabad
Keyskills:
AWS
NPS scores
Change management
product development
Incident management
CSAT
IT Operations