ML Engineer, Large-Scale AI Infrastructure
Company: Genbio
Location: Palo Alto
Posted on: June 2, 2025
Job Description:
Headquartered in Silicon Valley, we are a newly established
start-up, where a collective of visionary scientists, engineers,
and entrepreneurs are dedicated to transforming the landscape of
biology and medicine through the power of Generative AI. Our team
comprises leading minds and innovators in AI and Biological
Science, pushing the boundaries of what is possible. We are
dreamers who reimagine a new paradigm for biology and medicine. We
are committed to decoding biology holistically and enabling the
next generation of life-transforming solutions. As the first mover
in pan-modal Large Biological Models (LBM), we are pioneering a new
era of biomedicine, with our LBM training leading to
ground-breaking advancements and a transformative approach to
healthcare. Our exceptionally strong R&D team and leadership in
LLMs and generative AI position us at the forefront of this
revolutionary field. With headquarters in Silicon Valley,
California, and a branch office in Paris, we are poised to make a
global impact. Join us as we embark on this journey to redefine the
future of biology and medicine through the transformative power of
Generative AI.
Job Description:
- GPU Cluster Management: Design, deploy, and maintain
high-performance GPU clusters, ensuring their stability,
reliability, and scalability. Monitor and manage cluster resources
to maximize utilization and efficiency.
- Distributed/Parallel Training: Implement distributed computing
techniques to enable parallel training of large deep learning
models across multiple GPUs and nodes. Optimize data distribution
and synchronization to achieve faster convergence and reduced
training times.
- Performance Optimization: Fine-tune GPU clusters and deep
learning frameworks to achieve optimal performance for specific
workloads. Identify and resolve performance bottlenecks through
profiling and system analysis.
- Deep Learning Framework Integration: Collaborate with data
scientists and machine learning engineers to integrate distributed
training capabilities into GenBio AI's model development and
deployment frameworks.
- Scalability and Resource Management: Ensure that the GPU
clusters can scale effectively to handle increasing computational
demands. Develop resource management strategies to prioritize and
allocate computing resources based on project requirements.
- Troubleshooting and Support: Troubleshoot and resolve issues
related to GPU clusters, distributed training, and performance
anomalies. Provide technical support to users and resolve technical
challenges efficiently.
- Documentation: Create and maintain documentation related to GPU
cluster configuration, distributed training workflows, and best
practices to ensure knowledge sharing and seamless onboarding of
new team members.
Job Requirements:
- Master's or Ph.D. in computer science or a related field, with a
focus on High-Performance Computing, Distributed Systems, or Deep
Learning.
- 2+ years of proven experience managing GPU clusters, including
installation, configuration, and optimization.
- Strong expertise in distributed deep learning and parallel
training techniques.
- Proficiency in popular deep learning frameworks such as PyTorch,
Megatron-LM, and DeepSpeed.
- Programming skills in Python and experience with
GPU-accelerated libraries (e.g., CUDA, cuDNN).
- Knowledge of performance profiling and optimization tools for
HPC and deep learning.
- Familiarity with resource management and scheduling systems
(e.g., SLURM, Kubernetes).
- Strong background in distributed systems, cloud computing (AWS,
GCP), and containerization (Docker, Kubernetes).
We are an equal opportunity employer. We celebrate diversity and
are committed to creating an inclusive environment for all
employees.