Job Description
The SRE team at GreyOrange is responsible for monitoring the stability and availability of
mission-critical production systems, managing incidents for quicker resolution, and
establishing BAU. The team also manages and maintains the internal tools and infrastructure
used by other development teams.
The experienced SRE will play a crucial role in ensuring the reliability, scalability, capacity
planning, and performance of our infrastructure and applications. The ideal candidate will
have a strong background in software engineering, system administration, containerization,
and cloud technologies.
Requirements
- Should have 6 to 11 years of experience
- Well-versed in scripting/programming languages (Python/Bash/PowerShell, etc.) to automate manual work, particularly within cloud environments
- Well-versed in observability tools (Grafana, Splunk, Dynatrace) for monitoring, alerting, and logging, used to identify and address potential issues, especially in cloud infrastructure
- Working experience with automation tools (Jenkins, GitLab, Ansible/Chef for configuration management) and processes to streamline deployment, monitoring, and management of systems and applications in the cloud
- Hands-on experience with containerization and orchestration technologies such as Docker, Kubernetes, or similar, particularly in cloud-native environments
- Solid understanding of SLI, SLO, SLA, and error budget concepts and their implementation; willingness to provide on-call support and participate in incident management and response activities as needed
- Expert at troubleshooting production issues and bugs.
- Good knowledge of Unix systems, networking, web technologies, and databases.
- Incident management experience for production workloads, coupled with effective communication skills.
- Working knowledge of one of the cloud platforms (AWS or GCP)
What you'll do
- Lead reliability engineering projects and drive them to closure.
- Ensure system stability and high availability by proactively monitoring performance and troubleshooting issues
- Design, build and maintain efficient, reliable, and scalable cloud-based infrastructure and services
- Automate processes and find opportunities to improve the observability and availability of the Platform to reduce toil.
- Implement and manage observability tools for comprehensive monitoring, alerting, and logging
- Own end-to-end availability and performance of different services & tools.
- Practice sustainable incident response and blameless postmortems.
- Provide on-call support for incident management and participate actively in response activities
Job Classification
Industry: Analytics / KPO / Research
Functional Area / Department: Engineering - Software & QA
Role Category: DevOps
Role: Site Reliability Engineer
Employment Type: Full time
Contact Details:
Company: GreyOrange
Location(s): Noida, Gurugram
Keyskills:
DevOps
Jenkins
Terraform
Docker
SRE
Ansible
Kafka
Site Reliability Engineering
DevOps Engineer
CI/CD
Kubernetes