Senior, Software Engineer (SRE) @ Walmart

Job Description

Skill: Site Reliability Engineer (Java, Splunk, Dynatrace, Devops)
Experience range: 5 - 10 years
Location: Bangalore

As a Site Reliability Engineer in the CPC team, you will work with L2, other dependent applications, the platform team, DevOps, and engineering practitioners to proactively maintain mission-critical infrastructure, cloud platforms, microservices, tools, and processes that ensure the highest levels of availability and reliability of CPC applications.

Our team works closely with our US stores and eCommerce business to better serve customers by empowering team members, stores, and merchants with technological innovation. From groceries and entertainment to sporting goods and crafts, Walmart U.S. offers an extensive selection that our customers value, whether they shop online at Walmart.com, through one of our mobile apps, or in-store. Focus areas include customers, stores and employees, in-store service, merchant tools, merchant data science, and search and personalization.
What you'll do:
  • Incident triage, escalation, and resolution: Triage site-impacting production issues by quantifying impact, severity, and urgency; analyzing systems for quick remediation; engaging the right teams for recovery [reduce MTTE, Mean Time to Engage]; and focusing on immediate restoration [reduce MTTR, Mean Time to Restore] of large-scale enterprise systems.
  • Alerting, monitoring, and log analysis: Analyze monitoring graphs and alerts to identify the systems causing production impact, using tools such as Grafana, Prometheus, MMS, ServiceNow, JIRA, Dynatrace, and Splunk [reduce MTTD, Mean Time to Detect].
  • Enhance alerting solutions: Design and implement JavaScript integrations between the alerting tool and the service API endpoints of tools such as ServiceNow, Spotlight, Splunk, and xMatters. Requires knowledge of monitoring and alerting tools; monitoring metrics and key performance indicators (for example, availability, MTBF, MTTR); SLIs and SLOs (for example, request latency, availability, error rates, saturation); distributed tracing; and alerting logic. Demonstrate awareness of the metrics used to monitor software or system performance, of the different types of alerts generated by the monitoring tools, and of infrastructure and application metrics. Monitor current performance data to ensure adherence to defined SLOs and SLIs for simple applications/systems.
  • Disaster recovery planning: Requires knowledge of disaster recovery procedures and processes, and of enterprise disaster recovery systems. Work with business partners to identify and document critical applications.
  • Performance and optimization: Requires knowledge of Unix/Linux performance tuning; Java/Node.js/Tomcat/Apache tuning and optimization; and chaos tools. Use established criteria (for example, probability of failure, frequency of failure) to measure site reliability. Monitor site reliability conditions and new reliability requirements.
  • Work on Product Enrichment & Content Services projects at Walmart: Develop enterprise monitoring and use tooling such as Grafana and Splunk to improve visibility, proactively detect issues, and restore system availability.
  • Develop tools and support: Design and develop solutions for widespread internal communications for cloud application support, or workflows for infrastructure availability issues across various internal applications, using programming languages such as Java, JavaScript (React, Node.js), Python, and shell scripting, along with technologies such as Prometheus and database query languages. Design and develop a UI tool to display Item Content Quality data on a dashboard using AngularJS, ReactJS, HTML5 & CSS3, etc.
  • Create and maintain playbooks.
  • Define the steps for correctly analyzing issues and engaging the right teams for CPC, dependent downstream services, and platform teams.
  • Handle deployments. Streamline the deployment process and own it as a single team. Understand and explore post-deployment validations and back-out steps to make the app more resilient.
  • Coordinate with platform teams for non-app releases such as VM upgrades, DB maintenance, and other component or environment-related tasks.
  • Participate in rotating on-call duties and work across different time zones with a multinational team.
  • Responsible for timely root cause analysis [RCA] of production issues.
  • Develop reusable tooling and processes to drive and improve customer experience and lower operational costs.
  • Understand DevOps industry best practices
  • Help teams build highly observable and resilient systems
  • Collaborate with developers to capture requirements and understand pain points
  • Build reusable tools, libraries, and dashboards that can be used across DevOps/SRE teams
What you'll bring:
  • Bachelor's degree in Computer Science, Engineering, or a related discipline
  • 5+ years of hands-on SRE, Operations & Development experience with JavaScript, Java, RESTful services, Git, Maven, Jenkins, DevOps, containerization, Docker, Kubernetes, Azure, Google Cloud, Kafka, Azure Cosmos, Azure SQL, Mega cache, CI/CD, Prometheus, Grafana, Splunk, etc.
  • Automation and Self-healing: Demonstrate knowledge of scripting and software development for automation and self-healing of multi-cloud environments. Help enhance existing solutions by developing automation with Docker, Kubernetes and working with DevOps and Engineering partners.
  • Excellent end to end technical understanding of core infrastructure, cloud services, platforms, and micro-services.
  • Ability to triage effectively: detect issues and distinguish symptom from cause.
  • Identify and drive continuous improvement efforts to reduce waste (eliminate, automate or streamline).
  • Influence the design of system architecture and tactical solutions.
  • Familiarity with log-centric tooling. Produce time series data and reusable dashboards for use both during and after an event.

Job Classification

Industry: Retail
Functional Area / Department: Engineering - Software & QA
Role Category: Software Development
Role: Software Development - Other
Employment Type: Full time

Contact Details:

Company: Walmart
Location(s): Bengaluru


Keyskills: Unix, System architecture, Networking, Linux, DevOps, Splunk, Cosmos, Information technology, SQL, Python

₹ Not Disclosed

Walmart

Walmart Global Tech India