gpu-cluster

Complete setup guide for a 2-node NVIDIA DGX Spark cluster — distributed training, CUDA inference with EXO, NCCL tuning for Grace Blackwell, NVMe-TCP shared storage, and 200 Gb/s direct fabric networking.

Updated Apr 11, 2026
Python

Turbo31150 / JARVIS-CLUSTER

Multi-node GPU cluster for local LLMs: LM Studio + Ollama, load balancing, automatic failover

Updated May 19, 2026
Python

qpwm06 / BO-Stratified-Random

Async Bayesian-optimization controller with a persistent Slurm GPU worker pool

hpc molecular-dynamics slurm scientific-computing bayesian-optimization gpu-cluster botorch ax-platform async-optimization resident-worker

Updated Apr 28, 2026
Python

ulvgard / procplan

ProcPlan is a dependency-free GPU resource planner built entirely on the Python standard library and SQLite

sqlite3 multi-user cli-app dependency-free gpu-cluster multi-gpu-training resource-planning python311

Updated Jan 26, 2026
Python

calgaryml / gpuslackbot

Slack bot for monitoring GPU usage on a server.

slack slackbot cuda nvidia nvml slack-bot gpu-computing slack-app gpu-monitoring gpu-cluster

Updated Apr 20, 2025
Python

lsjhaha / gpu-server-control

Open-source Windows desktop tool for GPU monitoring, conda environment migration, and job queuing across multiple Linux servers over SSH.

ssh conda job-queue nvidia-smi linux-server gpu-monitoring gpu-cluster remote-management conda-pack cluster-tools

Updated May 13, 2026
Python

ArgentAIOS / nxo

NXO — Distributed AI inference for NVIDIA/Linux. Fork of EXO focused on CUDA, tinygrad, and DGX Spark clusters.

cuda inference self-hosted nvidia gpu-cluster tinygrad llm distributed-inference dgx-spark grace-blackwell

Updated Apr 11, 2026
Python

acm-uiuc / nvdocker

gpu-cluster

Updated Feb 24, 2019
Python

acm-uiuc / gpu-image-template

Template for custom Dockerfiles on the GPU cluster

docker docker-image nvidia nvidia-docker gpu-cluster

Updated Feb 24, 2019
Python

ArttuAn / clusterscope

AI cluster debugging lab for distributed LLM and HPC workloads: GPU, NCCL, Kubernetes, failure analysis, and tuning recommendations.

kubernetes performance-engineering hpc nvidia roce observability distributed-training nccl gpu-cluster failure-analysis llm dcgm

Updated May 21, 2026
Python

gpu-cluster

Here are 19 public repositories matching this topic...

LambdaLabsML / distributed-training-guide

S-Lab-System-Group / ChronusArtifact

youngharold / tightwad

theoddden / Terradev

Turbo31150 / jarvis-linux

acm-uiuc / gpu-cluster-images

languageseed / valet-gateway

acm-uiuc / gpu-cluster-backend

saravanabalagi / mask-gpu

ArgentAIOS / dgx-spark-cluster

Turbo31150 / JARVIS-CLUSTER

qpwm06 / BO-Stratified-Random

ulvgard / procplan

calgaryml / gpuslackbot

lsjhaha / gpu-server-control

ArgentAIOS / nxo

acm-uiuc / nvdocker

acm-uiuc / gpu-image-template

ArttuAn / clusterscope
