Flexera is looking for an experienced Member Technical Staff - Site Reliability Engineer to join our SRE team. We're a fast-growing, category-leading organization with ambitious objectives and a positive, inclusive culture. We're looking for passionate professionals who want to grow their talents and achieve great things. If that sounds like you, we want to talk to you about joining our team.
As a Site Reliability Engineer, you will be tasked with everything from helping with product design to diagnosing issues and writing automated scripts that remediate issues occurring in our production systems.
You will be driven to build fault-tolerant, scalable systems and automate away as much operational toil as you can. You align with the goals of the DevOps movement in improving collaboration between the development and operations disciplines.
We are seeking someone with extensive experience working on a SaaS/Cloud product with a microservices architecture.
Responsibilities:
Help to eliminate operational toil - seek to automate repetitive operations work
Work with product development teams to ensure that our new features are able to meet SLAs
Help mature the delivery process for teams: defining and managing automated deployment pipelines (such as Jenkins pipelines), designing canary release deploys, building in automated fallbacks, optimizing the build chain, and adopting infrastructure and pipeline as code. You help craft the appropriate solution for the product
Optimize product service code to ensure that it's secure, scalable and performant
Optimize testing capabilities to increase the assurances we have with each release
Improve the fault detection for our services
Create dashboards which help communicate the metrics for a given product service
Work with product owners and product engineering teams to perform capacity planning
Work with product engineering teams to understand performance and behavior patterns
Be part of an on-call rotation for alerts that require engineering expertise to diagnose
Help carry out root cause analysis for incidents, and design solutions (both software and human processes) that will help to ensure the same problem doesn't happen in the same way again
Contribute to platform security
Minimum Qualifications:
Computer Science degree, or related industry experience managing a mission-critical production system in AWS (or equivalent Azure/Google Cloud) for at least 2 years
Critical Skills / Competencies:
Agile software delivery methodologies
Experience managing cloud-based services like AWS or Azure at scale
Experience with DevOps
Experience with Docker containers, Kubernetes, EKS, ECS
Experience with infrastructure as code, e.g. Terraform, CloudFormation
Experience with IaaS and Serverless services from a cloud provider
Experience implementing fault detection, and automating fixes
Experience designing scalable services
Experience designing distributed, fault-tolerant systems
A good understanding of SQL databases
A solid understanding of data structures and algorithms
A positive attitude and willingness to learn
Strong conflict resolution competence
Excellent written and verbal communication skills
Detail-oriented. The ideal candidate is one who naturally digs as deep as needed to understand the why
Bonus Skills:
Python / Ruby / Golang / Bash experience
Experience with monitoring systems such as Prometheus / Grafana
Security background
MySQL / Amazon RDS
Elasticsearch
Relevant certification, e.g. AWS, GCP, Azure
Experience with Disciplined Agile Delivery (DAD)
Job Classification
Industry: IT Services & Consulting
Functional Area / Department: Engineering - Software & QA
Role Category: DevOps
Role: Site Reliability Engineer
Employment Type: Full time