EPAM Systems

Senior HPC Network Engineer - AI Infrastructure

Argentina · Full-time · Mid-Senior

We are seeking a Senior HPC Network Engineer to support advanced AI, research, and Kubernetes-based GPU infrastructure for a major global technology client.

The role focuses on architecting, operating, and optimizing high-performance network fabrics for large-scale LLM and distributed AI workloads, including InfiniBand/RDMA, high-speed Ethernet, Kubernetes networking, host-side GPU networking, SmartNIC/DPU technologies, and deep network observability.

The ideal candidate has strong hands-on experience with InfiniBand NDR/HDR and next-generation fabrics, RDMA/RoCE, NVIDIA/Mellanox networking, NCCL/MSCCL communication patterns, Linux host networking, PCIe/GPU/NIC topology, and Kubernetes networking for GPU clusters.

 

Responsibilities

  • Architect, operate, and troubleshoot high-performance InfiniBand/RDMA and Ethernet fabrics for large-scale GPU clusters and distributed AI/LLM workloads
  • Design and evaluate cluster network topologies, including fat-tree, Clos, rail-optimized, and dragonfly, based on workload scale and performance needs
  • Optimize host-side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement, PCIe topology, and GPU-to-NIC communication paths
  • Tune and troubleshoot RDMA/RoCE, NCCL/MSCCL, and collective communication performance for multi-node GPU training workloads
  • Design and maintain Kubernetes networking for GPU clusters, including CNI plugins, network policies, multi-NIC pods, RDMA/GPU device plugins, and workload orchestration integration
  • Support SmartNIC/DPU technologies such as NVIDIA BlueField where applicable, including SR-IOV, offload, isolation, and security use cases
  • Build and improve network observability, including metrics, dashboards, alerts, congestion detection, latency tracing, SLO reporting, and capacity/performance analysis
  • Collaborate with Kubernetes, storage, GPU infrastructure, observability, and AI research teams to resolve network and I/O bottlenecks and improve workload reliability

Requirements

  • 5+ years of experience in network, infrastructure, HPC, SRE, or similar engineering roles, with 2+ years focused on HPC, AI/ML, or GPU cluster networking
  • Proven hands-on experience with InfiniBand/RDMA fabrics, high-speed Ethernet, and Linux networking in performance-critical distributed compute environments
  • Understanding of host-side networking, including NICs, drivers, and firmware, along with PCIe topology, NUMA awareness, and GPU-to-NIC affinity
  • Knowledge of distributed AI training communication patterns, including NCCL-based workloads and collective operations such as all-reduce and all-gather
  • Expertise in Kubernetes and container networking for GPU or distributed workloads, including CNI concepts, network policies, multi-NIC patterns, and RDMA/GPU device integration
  • Proficiency in RDMA networking concepts, including InfiniBand, RoCE/RoCEv2, GPUDirect-related patterns, congestion behavior, and performance tuning
  • Skills in Linux networking and host-side troubleshooting, including IRQ affinity, MTU, offloads, and performance diagnostics
  • Background in network observability and performance management: telemetry, traffic monitoring, congestion detection, latency analysis, SLOs, capacity planning, alerting, and troubleshooting across L1-L4, fabric, and RDMA layers
  • Strong troubleshooting, root-cause analysis, documentation, and communication skills for working with client engineering teams, researchers, and platform stakeholders
  • English proficiency at B2 (Upper-Intermediate) level or higher for effective communication

Nice to have

  • Familiarity with Azure Networking, Ethernet, and GPGPU/GPU technologies
  • Competency in Grafana, Prometheus, and network administration
  • Ability to develop and maintain Infrastructure as Code
  • Ability to use Python and UNIX shell scripting for automation and tooling

 

We offer

  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn

Key Skills

Kubernetes, Ethernet, AI, Linux, shell scripting, Prometheus, Grafana, storage, Python, UNIX
Posted: May 14, 2026
Type: Full-time
Level: Mid-Senior
Location: Argentina

Industries: Software Development, IT Services, IT Consulting, Technology, Information, Internet

Categories: Business Development, Information Technology, Engineering
