As a Site Reliability Engineering (SRE) Technical Leader on the Network Assurance Data Platform (NADP) team at Cisco ThousandEyes, you will be responsible for ensuring the reliability, scalability, and security of the cloud and big data platforms. Your role will involve representing the NADP SRE team, contributing to the technical roadmap, and collaborating with cross-functional teams to design, build, and maintain SaaS systems operating at multi-region scale. Your efforts will be crucial in supporting machine learning (ML) and AI initiatives by ensuring the platform infrastructure is robust, efficient, and aligned with operational excellence. You will be tasked with designing, building, and optimizing cloud and data infrastructure to guarantee high availability, reliability, and scalability of big-data and ML/AI systems. This will involve implementing SRE principles such as monitoring, alerting, error budgets, and fault analysis. Additionally, you will collaborate with various teams to create secure and scalable solutions, troubleshoot technical problems, lead the architectural vision, and shape the technical strategy and roadmap. Your role will also encompass mentoring and guiding teams, fostering a culture of engineering and operational excellence, engaging with customers and stakeholders to understand use cases and feedback, and utilizing your strong programming skills to integrate software and systems engineering. Furthermore, you will develop strategic roadmaps, processes, plans, and infrastructure to efficiently deploy new software components at an enterprise scale while enforcing engineering best practices. To be successful in this role, you should have relevant experience (8-12 yrs) and a bachelor's engineering degree in computer science or its equivalent. You should possess the ability to design and implement scalable solutions, hands-on experience in Cloud (preferably AWS), Infrastructure as Code skills, experience with observability tools, proficiency in programming languages such as Python or Go, and a good understanding of Unix/Linux systems and client-server protocols. Experience in building Cloud, Big data, and/or ML/AI infrastructure is essential, along with a sense of ownership and accountability in architecting software and infrastructure at scale. Additional qualifications that would be advantageous include experience with the Hadoop Ecosystem, certifications in cloud and security domains, and experience in building/managing a cloud-based data platform. Cisco encourages individuals from diverse backgrounds to apply, as the company values perspectives and skills that emerge from employees with varied experiences. Cisco believes in unlocking potential and creating diverse teams that are better equipped to solve problems, innovate, and make a positive impact.,
Employement Category:
Employement Type: Full timeIndustry: IT Services & ConsultingRole Category: Not SpecifiedFunctional Area: Not SpecifiedRole/Responsibilies: Site Reliability Engineering Technical Leader,