Job Description
We are looking for a highly skilled Site Reliability Engineer (SRE) / DevOps Engineer with expertise in platform engineering, tooling, and application reliability. The ideal candidate will enhance our infrastructure, ensure observability, and improve the performance and scalability of our platforms. You will work closely with development teams to implement best practices for system reliability, scalability, and efficiency.
Key Responsibilities:
1. Platform Engineering:
- Design, build, and maintain scalable and resilient platform solutions using Kubernetes, Docker, and other container orchestration tools.
- Implement and manage Kafka clusters for real-time data streaming and event-driven architecture.
- Collaborate with application teams to design and deploy microservices-based architectures, focusing on scalability, performance, and reliability.
2. Tooling & Automation:
- Develop and maintain CI/CD pipelines to automate application deployment and infrastructure provisioning using tools such as Jenkins and Git.
- Create automation scripts and tooling to reduce manual intervention in operational tasks, using languages such as Python and Bash.
- Implement Infrastructure as Code (IaC) using Ansible to manage cloud and on-premises environments efficiently.
3. Observability & Monitoring:
- Implement observability solutions using Grafana, Prometheus, and ELK Stack to monitor application performance, infrastructure health, and system reliability.
- Develop dashboards, alerts, and runbooks to enable proactive incident management and quick response to service disruptions.
- Conduct performance tuning and capacity planning to ensure optimal operation of platforms and applications.
4. Application Engineering Support:
- Work closely with development teams to optimize application performance and troubleshoot production issues.
- Implement service mesh solutions for microservices management, ensuring secure and efficient communication between services.
- Assist in the design and implementation of scalable data pipelines and workflows using Kafka and other streaming technologies.
5. Security & Compliance:
- Ensure platform security through effective access controls, secure deployment practices, and regular vulnerability assessments.
- Collaborate with security teams to implement policies and tools that safeguard data and application integrity.
6. Collaboration & Documentation:
- Document infrastructure, processes, and best practices to ensure knowledge sharing across teams.
- Work in a cross-functional environment, collaborating with software developers, QA engineers, and other SREs to continuously improve system reliability.
Qualifications:
- Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or Platform Engineering roles.
- Strong knowledge of Kubernetes, Docker, and container orchestration platforms.
- Proficiency in managing Kafka clusters and understanding data streaming technologies.
- Experience with observability tools such as Grafana, Prometheus, and ELK Stack.
- Hands-on experience with CI/CD pipelines and automation tools.
- Expertise in scripting languages such as Python and Bash.
- Familiarity with cloud platforms (AWS, GCP, Azure) and Infrastructure as Code (IaC) tools like Ansible.
Job Classification
Industry: IT Services & Consulting
Functional Area / Department: Engineering - Software & QA
Role Category: DevOps
Role: Site Reliability Engineer
Employment Type: Full time
Contact Details:
Company: Euclid Innovations
Location(s): Mumbai
Keyskills:
Performance tuning
Git
Orchestration
Incident management
Vulnerability
Engineering Design
Operations
Python
Capacity planning