Best practices & guides on how to write distributed pytorch training code
Updated Oct 22, 2025 - Python
Mixed-vendor GPU inference cluster manager with speculative decoding
An imperative command-line interface for AI workload orchestration
Distributed AI infrastructure on Linux: on-premise GPU cluster, 900+ autonomous agents, native GDPR compliance
Docker Images for the GPU Cluster
AI Inference Gateway - orchestrates Ollama, vLLM, cloud providers, and vision services into a unified, production-ready platform
A simple tool to expose only a specified number of GPUs, with the desired memory, to TensorFlow
Complete setup guide for a 2-node NVIDIA DGX Spark cluster — distributed training, CUDA inference with EXO, NCCL tuning for Grace Blackwell, NVMe-TCP shared storage, and 200 Gb/s direct fabric networking.
Multi-node GPU cluster for local LLMs: LM Studio + Ollama, load balancing, automatic failover
Async Bayesian-optimization controller with a persistent Slurm GPU worker pool
ProcPlan is a dependency-free GPU resource planner built entirely on the Python standard library and SQLite
Slack bot for monitoring GPU usage on a server.
Open-source Windows desktop tool for GPU monitoring, conda environment migration, and queue running across multiple Linux servers over SSH.
NXO — Distributed AI inference for NVIDIA/Linux. Fork of EXO focused on CUDA, tinygrad, and DGX Spark clusters.
Template for custom Dockerfiles on the GPU cluster
AI cluster debugging lab for distributed LLM and HPC workloads: GPU, NCCL, Kubernetes, failure analysis, and tuning recommendations.