We are looking for a highly capable AI Inference Optimization Engineer to drive the performance and efficiency of AI models during inference—especially large language models (LLMs). This role involves close collaboration with data scientists, ML researchers, and infrastructure engineers to scale cutting-edge AI systems across real-world business scenarios.
Key Responsibilities
- Optimize AI models for speed, resource efficiency, and accuracy
- Improve and maintain APIs for AI inference, ensuring system reliability, observability, and scalability
- Benchmark inference systems to identify and resolve performance bottlenecks
- Design and tune inference systems for high concurrency, low latency, and high reliability
- Work with GPU hardware and frameworks like CUDA and Triton for performance optimization
- Implement and monitor GPU utilization, memory usage, throughput, and system health
- Manage resource allocation for real-time and batch inference workloads
- Design and maintain ML infrastructure for training, inference, model and dataset management, and orchestration
- Collaborate with business teams to optimize the integration and use of AI applications across various scenarios
- Research and implement state-of-the-art techniques in LLMs, AIGC, NLP, CV, and ML system engineering
- Stay current with emerging trends in model inference, distributed systems, and AI infrastructure
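The benchmarking responsibility above typically starts with measuring single-request latency percentiles and throughput. As an illustrative sketch (not part of the posting), here is a minimal Python benchmark harness; `dummy_infer` is a stand-in for a real model forward pass:

```python
import time

def benchmark(infer, requests, warmup=10):
    """Time repeated inference calls and report latency percentiles.

    `infer` is any zero-argument callable representing one inference request.
    """
    for _ in range(warmup):              # warm up caches before measuring
        infer()
    latencies = []
    for _ in range(requests):
        start = time.perf_counter()
        infer()
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return {
        "p50_ms": 1000 * latencies[len(latencies) // 2],
        "p99_ms": 1000 * latencies[int(len(latencies) * 0.99) - 1],
        "throughput_rps": requests / sum(latencies),
    }

# Hypothetical stand-in for a model forward pass.
def dummy_infer():
    sum(i * i for i in range(1000))

stats = benchmark(dummy_infer, requests=100)
```

In practice the same loop would wrap a real inference call (and, on GPU, synchronize the device before reading the clock); comparing p50 against p99 is a quick way to spot the tail-latency bottlenecks the role targets.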
Qualifications
- Strong understanding of machine learning, deep learning, and inference principles
- Proficiency in Python and C++
- Experience with deep learning frameworks such as PyTorch or TensorFlow
- Familiarity with CUDA, Triton, or similar GPU frameworks
- Hands-on experience with inference optimization techniques (e.g., quantization, speculative decoding, continuous batching)
- Experience with distributed systems, large-scale data processing, and parallel computing
- Familiarity with ML infrastructure for large-scale training and deployment
- Ability to analyze and optimize GPU kernel performance
- Advanced experience with frameworks like DeepSpeed, Megatron, FSDP, or GSPMD
- In-depth CUDA programming and tuning experience using tools like Cutlass or Triton
- Experience in LLM application and agent development
- Research or industry experience in generative AI, NLP, CV, or multimodal models
- Competitive programming accolades (e.g., ACM/ICPC, NOI/IOI, Kaggle, Top Coder)
- Familiarity with the latest LLM research trends, including long-context handling, active learning, alignment, and agent ecosystems
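One of the optimization techniques named above, quantization, reduces memory and bandwidth by storing weights in low-precision integers. As a hedged illustration of the core idea only (real systems use per-channel scales and calibration), here is symmetric per-tensor int8 quantization in plain Python:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: map floats to [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # guard against all-zero input
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from int8 codes."""
    return [v * scale for v in q]

# Toy weight vector (illustrative values, not from a real model).
weights = [0.82, -1.27, 0.004, 0.5, -0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The round-trip error is bounded by half the quantization step (`scale / 2`), which is the accuracy-versus-footprint trade-off an inference engineer tunes when choosing bit widths.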
About Us
Hyperfusion provides high-performance computing and AI solutions for businesses of all sizes. We focus on innovation, security, and exceptional performance, supported by reliable hardware and certified data centers.