Job Description
Roles and Responsibilities
Desired Candidate Profile
Experience: 9+ years
Location: Hyderabad, Pune, Bangalore, Kolkata, Gurgaon, Mumbai, Chennai
Requirement:
Knowledgeable in compute, network, and storage configurations and architecture for HPC/SuperPOD design, deployment, and operations
Education:
BE (Computer), Master's degree, or equivalent experience in Computer Architecture, Computer Science, Electrical Engineering, or a related field.
Job description:
- Experience in the design, deployment, and operation of production-grade HPC environments leveraging both SLURM and Kubernetes clusters
- Deep understanding of scale-out compute, networking, and external storage architectures for optimizing the performance and acceleration of AI/HPC workloads
- Working experience with DGX/SuperPOD, DGX A100 compute nodes, fabrics (storage/compute), management networks and software (DeepOps), and key system software for optimizing GPU communication, I/O, and application performance
- Establish storage management guidelines for allocating RAM/NVMe (internal storage) and external high-speed storage (DDN, NetApp, etc.) to optimize the performance and cost of running varying datasets and workloads
- Management servers: infrastructure design and setup for enabling user logins, provisioning (OS images and other internal infrastructure services for the pod), workload management (resource management and scheduling/orchestration), container management, and system monitors/logs
- Operations and run-time optimization of A100 compute resources (MIG partitions) for varying workloads
- Deployment of NVIDIA DGX catalog containers to process AI/ML/DL workloads in an HPC environment
- Creating custom Python-based metrics and analytics solutions to profile HPC and Hadoop environments
- Proven experience deploying, upgrading, migrating, and driving user adoption of sophisticated enterprise-scale systems.
- Well versed in agile methodology.
Good to have:
- Working experience with Git, conda, pip, yum, apt, zypper, Julia, npm, and other package and installation tools
- Creating custom reporting dashboards in Grafana from Prometheus Kubernetes metrics.
- Programming skills to build distributed storage and compute systems, backend services, microservices, and web technologies.
If interested, please share your CV at sh******s@an***e.co.in
Regards
Sheetal Shewale
Job Classification
Industry: IT Services & Consulting
Functional Area: IT & Information Security
Role Category: IT Infrastructure Services
Role: IT Infrastructure Services
Employment Type: Full time
Education
Under Graduation: B.Tech/B.E. in Any Specialization
Post Graduation: M.Tech in Any Specialization, MS/M.Sc(Science) in Any Specialization, MBA/PGDM in Any Specialization, MCA in Any Specialization
Doctorate: Any Doctorate
Contact Details:
Company: nlage Infotech (I) Pvt. Ltd.
Location(s): Hyderabad
Keyskills:
High Performance Computing
HPC Engineer
SuperPOD Design
HPC
InfraOps Architect
HPC architect
NVIDIA DGX Architect
AI/ML/DL workloads
HPC engineering
InfiniBand Architect
High-Speed Storage Engineering
AI Infrastructure Architect
AI/HPC workloads