Write, configure, and deploy code that improves service reliability for existing or new systems; set the standard for others with respect to code quality.
Provide helpful and actionable feedback and review for code or production changes.
Drive repair/optimization of complex systems with consideration towards a wide range of contributing factors.
Lead debugging, troubleshooting, and analysis of service architecture and design.
Participate in the on-call rotation and provide 24x7 support.
Write documentation: design, system analysis, runbooks, playbooks. Provide design feedback and uplevel design skills of others.
Implement and manage SRE monitoring application backends using Java, Postgres, React, NoSQL and OpenTelemetry.
Develop tooling using Terraform and other IaC tools to ensure visibility and proactive issue detection across our platforms.
Work within GCP infrastructure, optimizing performance and cost and scaling resources to meet demand.
Collaborate with development teams to enhance system reliability and performance, applying a platform engineering mindset to system administration tasks.
Develop and maintain automated solutions for operational aspects such as on-call monitoring, performance tuning, and disaster recovery.
Troubleshoot and resolve issues in our dev, test, and production environments.
Participate in postmortem analysis and create preventative measures for future incidents.
Skills Required:
Application Support
Experience Required:
4+ years of experience as an SRE, DevOps Engineer, Software Engineer, or in a similar role
Experience Preferred:
Willingness to work 24/7 shifts
Education Required:
Bachelor's Degree.
Job Classification:
Industry: IT Services & Consulting
Functional Area / Department: Engineering - Software & QA
Role Category: DBA / Data Warehousing
Role: Database Administrator
Employment Type: Full time