What You’ll Be Doing
- Innovate and Implement: Design, implement, and maintain large-scale HPC/AI clusters with state-of-the-art monitoring, logging, and alerting systems.
- Infrastructure as Code (IaC): Utilize and develop tools to manage infrastructure as code, ensuring scalable and repeatable deployments.
- Streamline CI/CD Pipelines: Develop and maintain continuous integration and continuous delivery (CI/CD) pipelines to automate and streamline deployment processes.
- Automate Everything: Develop scripts and tools to automate deployment, configuration management, and operational monitoring.
- Develop complex networking automation.
- Troubleshoot Complex Issues: Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.
- Lead and Educate: Serve as a technical resource, developing and sharing best practices with internal teams.
- Drive Innovation: Support R&D activities and engage in proofs of concept (POCs) and proofs of value (POVs) for future improvements.
Requirements
- B.Sc. in Computer Science, Engineering, or a related field with 5+ years of experience.
- Deep knowledge of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software.
- Advanced proficiency in programming and scripting languages, with a solid understanding of object-oriented programming principles.
- Familiarity with Jenkins, Ansible, Puppet/Chef.
- Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu), networking, and OS-level security.
- Deep understanding of networking protocols such as InfiniBand and Ethernet.
- Experience with workload scheduling and orchestration tools such as Slurm and Kubernetes.
- Background with multiple storage solutions like Lustre, GPFS, ZFS, and XFS.
- Expertise with virtual systems (VMware, Hyper-V, KVM, Citrix).
- Familiarity with cloud platforms (AWS, Azure, Google Cloud).
- Proven networking experience or strong knowledge through professional networking training.
- Architectural Insight: Knowledge of CPU and/or GPU architecture.
- Container Expertise: Understanding of Kubernetes and container-related microservice technologies.
- GPU Focus: Experience with GPU-focused hardware/software (DGX, CUDA).
- RDMA Fabrics: Background with RDMA (InfiniBand or RoCE) fabrics.
Job requisition: JR2010371
Join NVIDIA and take your career to the next level!

