We are seeking a skilled Site Reliability Engineer (SRE) to join our team and help build, maintain, and scale reliable, high-performance systems.
Key Responsibilities:
Design, build, and maintain scalable infrastructure using automation and Infrastructure-as-Code (IaC) tools.
Monitor system performance, reliability, and availability using observability tools (e.g., Prometheus, Grafana, Datadog).
Develop automation scripts to reduce manual tasks and improve system efficiency.
Collaborate with development teams to design for reliability, scalability, and performance.
Conduct root cause analysis and postmortems for incidents, ensuring follow-up on action items.
Implement and improve CI/CD pipelines for faster, more reliable deployments.
Define and maintain service-level objectives (SLOs), service-level agreements (SLAs), and error budgets.
Participate in on-call rotations and incident response.
Continuously improve system security, compliance, and resilience.
Qualifications:
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience.
Strong knowledge of Linux/Unix systems and networking fundamentals.
Proficiency in at least one programming or scripting language (e.g., Python, Go, Bash).
Experience with cloud platforms (AWS, GCP, Azure), containers (Docker), and container orchestration (Kubernetes).
Familiarity with monitoring, logging, and alerting tools.
Hands-on experience with CI/CD tools and practices.
Strong troubleshooting and problem-solving skills.
Excellent communication and collaboration abilities.
Preferred:
Experience in designing and managing large-scale, distributed systems.
Familiarity with configuration management and provisioning tools (e.g., Ansible, Terraform, Chef, Puppet).
Knowledge of database management and optimization (SQL, NoSQL).
Prior experience in a DevOps or SRE role.
Job Classification:
Industry: Software Product
Functional Area / Department: Engineering - Software & QA
Role Category: DevOps
Role: Site Reliability Engineer
Employment Type: Full time