Primary Responsibilities:
Undergraduate degree or equivalent experience
10-12 years of overall experience in the IT industry across the entire SDLC
Proven work experience as a Site Reliability Engineer or in a similar role
5+ years of experience in integrating monitoring and alerting into cloud software solutions
3+ years of coding experience with one or more of the following languages: Java, C#, C/C++, Go, Python, Perl, PowerShell, or JavaScript, with a willingness and ability to learn new ones
2+ years of experience building and programmatically consuming REST APIs
3+ years of experience with Splunk, Dynatrace, Datadog, Grafana, Telemetry, or similar monitoring tools
Experience interacting programmatically with a relational database (SQL Server, MySQL, or PostgreSQL)
Experience planning and supporting 99.999% availability for critical applications in production
Solid understanding of engineering fundamentals: unit testing, performance testing, code reviews, telemetry, agile and DevOps
Define and set up industry best practices for alerting and monitoring across the line of business, and design and architect efficient monitoring dashboards on Splunk, Dynatrace, and Grafana that are common to all applications and products
Experience with any database
Knowledge of any scripting or programming language
Experience in operations support for any application
ServiceNow experience
Participate in the 5-9 program and other peak-season readiness initiatives, and collaborate with application teams to evaluate applications from a resiliency, availability, and reliability perspective
Act as a gatekeeper for changes rolling into production
Embrace continuous learning of engineering practices to ensure industry best practices and technology adoption, including DevOps, Cloud and Agile thinking
Drive tech debt reduction and tech transformation, including open source and inner source adoption, cloud adoption, and HCP assessment and adoption
Improve processes and runbooks, and lead automation of manual support tasks to cut down toil
Participate in on-call rotation
Improve operational tooling and frameworks, and perform chaos engineering activities
Respond to platform emergencies, alerts, and escalations from Customer Support
Key skills: Site Reliability Engineering, Disaster Recovery, Dynatrace, Splunk, SLI, Performance Testing, SLA