As we scale our business and customer base, we seek an experienced SRE to deliver real-time insights from massive-scale data. We are looking for someone who brings fresh ideas, demonstrates a unique and informed perspective, and enjoys collaborating with a cross-functional team to develop real-world solutions and positive user experiences at every interaction.
Role and Responsibilities:
Participate in an on-call rotation for incident response and implement proactive measures to prevent incidents.
Develop monitoring alerts and incident response processes to ensure high availability and reliability.
Document actions taken during incidents and create automated solutions to improve incident response.
Collaborate with the engineering team as an expert in reliability, performance, and efficiency to support ongoing projects.
Consistently deliver high-quality managed services, ensuring optimal uptime and scalability of infrastructure, applications, and cloud services.
Automate the detection and resolution of recurring issues to enhance system stability.
Build tools and automation frameworks to eliminate repetitive tasks and prevent incidents.
Continuously improve engineering, operational processes, and team practices to enhance efficiency and productivity.
Demonstrate strong programming skills and a deep understanding of systems to support the reliability and scalability of services.
Foster a culture of continuous improvement by promoting process changes and best practices.
Engage in continuous learning to expand skills through experimentation or training.
Soft Skills:
Ability to work asynchronously and independently.
Strong collaboration skills and willingness to work as part of a team.
Excellent problem-solving skills with the ability to think clearly under pressure.
Strong analytical and management skills.
Effective communication and documentation skills.
Qualifications:
Bachelor's or graduate degree in Computer Engineering, Computer Science, Engineering, Information Systems Management, or equivalent experience.
Experience with Monitoring/Observability/Log tools such as AWS CloudWatch, Datadog, Prometheus/Grafana, and ELK.
Proficiency with public cloud platforms, Linux/UNIX environments, and programming languages such as Java, Python, or Go.
Familiarity with Agile methodologies, SaaS environments, RDBMS, NoSQL databases, Cloud Architecture, and Frontend/Backend Systems and tools.
Comfortable with scripting and debugging production systems and services.
Strong collaboration skills with a mindset for continuous improvement.
Expertise in scalability and root cause analysis.
Job Classification
Industry: IT Services & Consulting
Functional Area / Department: Engineering - Software & QA
Role Category: DevOps
Role: Site Reliability Engineer
Employment Type: Full-time