AI Inference Optimization Engineer

We are looking for a highly capable AI Inference Optimization Engineer to drive the performance and efficiency of AI models during inference, especially large language models (LLMs). This role involves close collaboration with data scientists, ML researchers, and infrastructure engineers to scale cutting-edge AI systems across real-world business scenarios.


Key Responsibilities

  • Optimize AI models for speed, resource efficiency, and accuracy
  • Improve and maintain APIs for AI inference, ensuring system reliability, observability, and scalability
  • Benchmark inference systems to identify and resolve performance bottlenecks
  • Design and tune inference systems for high concurrency, low latency, and high reliability
  • Work with GPU hardware and frameworks like CUDA and Triton for performance optimization
  • Implement and monitor GPU utilization, memory usage, throughput, and system health
  • Manage resource allocation for real-time and batch inference workloads
  • Design and maintain ML infrastructure for training, inference, model and dataset management, and orchestration
  • Collaborate with business teams to optimize the integration and use of AI applications across various scenarios
  • Research and implement state-of-the-art techniques in LLMs, AIGC, NLP, CV, and ML system engineering
  • Stay current with emerging trends in model inference, distributed systems, and AI infrastructure


Qualifications

  • Strong understanding of machine learning, deep learning, and inference principles
  • Proficiency in Python and C++
  • Experience with deep learning frameworks such as PyTorch or TensorFlow
  • Familiarity with CUDA, Triton, or similar GPU frameworks
  • Hands-on experience with inference optimization techniques (e.g., quantization, speculative decoding, continuous batching)
  • Experience with distributed systems, large-scale data processing, and parallel computing
  • Familiarity with ML infrastructure for large-scale training and deployment
  • Ability to analyze and optimize GPU kernel performance
  • Advanced experience with frameworks like DeepSpeed, Megatron, FSDP, or GSPMD
  • In-depth CUDA programming and tuning experience, using tools such as CUTLASS or Triton
  • Experience in LLM application and agent development
  • Research or industry experience in generative AI, NLP, CV, or multimodal models
  • Competitive programming accolades (e.g., ACM/ICPC, NOI/IOI, Kaggle, Topcoder)
  • Familiarity with the latest LLM research trends, including long-context handling, active learning, alignment, and agent ecosystems



About Us

Hyperfusion provides high-performance computing and AI solutions for businesses of all sizes. We focus on innovation, security, and exceptional performance, supported by reliable hardware and certified data centers.

Post Date
2025-05-22
Job Type
-
Employment type
Full-time
Category
Engineering, Information Technology
Level
Entry
Country
United Arab Emirates
Industry
IT System Data Services