Machine Learning Performance Engineer - CUDA Python

Contract Type: Contract

Posted Date: September 17, 2025

ECCO Select is a talent acquisition and consulting company specializing in people, process and technology solutions. We provide the talent behind the technology enabling our clients to achieve their goals. For more information about ECCO Select, visit us at www.eccoselect.com.

Position Title: Machine Learning Performance Engineer – CUDA Python

Location Information: Remote – Travel 20%

Position Responsibilities:

Your role will focus on optimizing the performance of our machine learning models across both training and inference. This means ensuring efficient large-scale training, low-latency inference in real-time systems, and high-throughput inference in research settings. You will be expected to take a whole-system approach that spans storage systems, networking, and both host-level and GPU-level considerations. You will also be responsible for analyzing and improving platform performance at the lowest level, addressing factors such as throughput, goodput, and cache loading times.

The ideal candidate will bring:

  • An understanding of modern ML techniques and toolsets
  • The experience and systems knowledge required to debug a training run’s performance end to end
  • Low-level GPU knowledge of PTX, SASS, warps, cooperative groups, Tensor Cores, and the memory hierarchy
  • Debugging and optimization experience using tools like CUDA-GDB, Nsight Systems, and Nsight Compute
  • Library knowledge of Triton, CUTLASS, CUB, Thrust, cuDNN, and cuBLAS
  • Intuition about the latency and throughput characteristics of CUDA graph launch, tensor core arithmetic, warp-level synchronization, and asynchronous memory loads
  • Background in InfiniBand, RoCE, GPUDirect, PXN, rail optimization, and NVLink, and in how to use these networking technologies to link up GPU clusters
  • An understanding of the collective algorithms supporting distributed GPU training in NCCL or MPI
  • An inventive approach and the willingness to ask hard questions about whether we're taking the right approaches and using the right tools

Essential Skills, Experience:

– Proficiency in modern machine learning techniques and tools

– Strong experience debugging and optimizing training runs end to end

– In-depth knowledge of low-level GPU concepts such as PTX, SASS, warps, Tensor Cores, and memory hierarchy

– Hands-on experience with CUDA debugging and optimization tools like CUDA-GDB, Nsight Systems, and Nsight Compute

– Familiarity with libraries including Triton, CUTLASS, CUB, Thrust, cuDNN, and cuBLAS

– Understanding of networking technologies like InfiniBand, RoCE, and GPUDirect for connecting GPU clusters

– Proficiency in collective algorithms supporting distributed GPU training in NCCL or MPI

Qualifications:

ECCO Select is committed to hiring and retaining a diverse workforce. Our policy is to provide equal opportunity to all people without regard to race, color, religion, national origin, ancestry, marital status, veteran status, age, disability, pregnancy, genetic information, citizenship status, sex, sexual orientation, gender identity or any other legally protected category. Veterans of our United States Uniformed Services are specifically encouraged to apply for ECCO Select opportunities.

Equal Employment Opportunity is The Law
This Organization Participates in E-Verify