tensorforger / FluxRT

Real-time stream editing pipeline powered by the FLUX.2-klein-4B model, optimized for consumer GPUs

gpu-optimization diffusion-models real-time-ai

Updated May 16, 2026
Python

GVProf / GVProf

GVProf: A Value Profiler for GPU-based Clusters

machine-learning patterns profiler gpu cuda data-flow instrumentation binary-analysis clusters redundancy gpu-optimization value-profiler

Updated Mar 24, 2024
Python

ai-infra-curriculum / ai-infra-performance-learning

AI Infrastructure Performance Engineer Learning Track - GPU optimization, inference optimization, and cost reduction

learning machine-learning performance curriculum advanced inference profiling tensorrt cost-optimization gpu-optimization ai-infrastructure

Updated May 23, 2026
Python

philtimmes / KeSSie

KeSSie: huge-context semantic recall for large language models

Updated Feb 21, 2026
Python

raj200501 / GPUOptimizerML

The GPU Optimizer for ML Models enhances GPU performance for machine learning. It offers advanced scheduling, real-time monitoring, and efficient resource management through a user-friendly web interface and robust API, integrating big data technologies for seamless data processing and model optimization. @NVIDIA

model-management gpu-optimization real-time-monitoring secure-api big-data-integration gpu-scheduling

Updated Dec 28, 2025
Python

OriginNeuralAI / OriginNeuralAI

Physics-based computation at scale — Hamiltonian dynamics, spectral theory, and statistical mechanics powering optimization, drug discovery, genomics, molecular proof, and agentic commerce.

genomics drug-discovery ising-model post-quantum-cryptography hamiltonian-dynamics gpu-optimization simulated-bifurcation blockchain-verification spectral-theory physics-based-computation

Updated Apr 8, 2026
Python

AMD-AGI / GPU-Optimization-for-LLM-Inference

This is a short course covering GPU optimization techniques for LLM inference

llamas gpu-optimization llm-inference

Updated May 11, 2026
Python

ZeroKernel798 / Triton-CUDA-Lab

Reproduces and optimizes common deep learning operators, with both CUDA and Triton implementations, for learning and reference.

triton gpu-optimization cuda-programming

Updated May 22, 2026
Python

ai-infra-curriculum / ai-infra-senior-engineer-learning

AI Infrastructure Senior Engineer Learning Track - Advanced ML infrastructure and technical leadership

kubernetes learning distributed-systems machine-learning performance curriculum advanced gpu-optimization mlops senior-engineer ai-infrastructure

Updated May 22, 2026
Python

JeyaPrakashI / Multi-Cloud-Governance-Ledger-FOCUS-1.3

Executive FinOps dashboard and automated governance engine using FOCUS 1.3 standards for AWS, Azure, and Snowflake.

automation power-bi data-engineering multi-cloud gpu-optimization finops platform-engineering cloud-governance cloud-ops cloud-economics azure-finops aws-finops ai-infrastructure focus-1-3 llmops-finance serverless-governance gcp-finops

Updated Feb 14, 2026
Python

RajTewari01 / image-generation

Lightweight Stable Diffusion engine with plugin-based pipelines, VRAM-safe execution, and full 4GB GPU support.

webgl nextjs pytorch gpu-optimization fastapi ai-art framer-motion stable-diffusion generative-ai low-vram

Updated Mar 31, 2026
Python

petroslamb / hardware-friction-scorecard-dataset

Quantitative dataset of 119 neural architectures (2017-2025) scored on hardware compatibility and ecosystem friction. Validates the Transformer Attractor thesis.

machine-learning dataset transformer gpu-optimization production-ml neural-architecture hardware-compatibility

Updated Dec 16, 2025
Python

flickleafy / ollama_consumer

🤖 Ollama Consumer - A Python-based interactive chat interface for Ollama models with advanced model management, comprehensive benchmarking, vision support, and automatic error recovery. Features dynamic model switching, GPU optimization, and intelligent service monitoring for seamless AI model interactions.

python benchmarking machine-learning automation ai chatbot configuration-management language-models error-recovery model-management cli-tool multimodal gpu-optimization service-monitoring interactive-chat vision-models llm ollama-api moe-models

Updated Aug 6, 2025
Python

anurag2796 / hybrid-ml-scheduler

An advanced hybrid scheduling framework that leverages Reinforcement Learning and ML to dynamically optimize CPU/GPU task allocation in real-time.

python machine-learning reinforcement-learning task-scheduler resource-allocation gpu-optimization

Updated Feb 24, 2026
Python

JonSnow1807 / Fused-LayerNorm-CUDA-Operator

High-performance CUDA implementation of LayerNorm for PyTorch achieving 1.46x speedup through kernel fusion. Optimized for large language models (4K-8K hidden dims) with vectorized memory access, warp-level primitives, and mixed precision support. Drop-in replacement for nn.LayerNorm with 25% memory reduction.

deep-learning cuda pytorch gpu-optimization kernel-fusion layernorm

Updated Aug 17, 2025
Python

leap21ai / autospark

DGX Spark (GB10/SM121) platform support for Meta's KernelAgent — auto-detect, hardware constraints, safe Triton configs

cuda nvidia triton gpu-optimization gb10 dgx-spark sm121 kernel-agent

Updated Mar 14, 2026
Python

Saurabh-66 / LLM-pretraining-Open-AI-Parameter-Golf-Challenge

LLM pretraining from scratch on FineWeb dataset (architecture and all components explained), plus optimal use of GPU on SLURM cluster

rope model-evaluation gpu-optimization gqa layernorm rmsnorm llm-training llm-inference llm-evaluation swiglu bpe-tokenizer flashattention muon-optimizer

Updated May 12, 2026
Python

ikaganacar1 / GPU_FanControl

The NVIDIA driver's fan control logic wasn't doing it for me — too conservative, too opaque — so I built my own. This is a Linux GUI application for independent NVIDIA GPU fan control without requiring Coolbits. Uses pynvml via a root helper subprocess for direct fan management.

gpu nvidia nvidia-gpu cooling-control gpu-optimization gpu-fan

Updated Mar 24, 2026
Python

iamrahulreddy / Prolepsis

Prolepsis is a speculative decoding implementation that accelerates LLM inference by 1.30x on an A100. By pairing a small draft model (Qwen 1.7B) with a larger target (Qwen 8B), it shifts generation workloads into a parallel verification pass. A rigorous rejection sampling pipeline guarantees the output distribution is preserved.

pytorch triton gpu-optimization inference-optimization huggingface a100 llm speculative-decoding

Updated Mar 26, 2026
Python
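
The Prolepsis description mentions a rejection sampling step that keeps the output distribution unchanged. The repository's own code is not shown here, so the following is only a minimal sketch of the standard speculative sampling acceptance rule for a single token, using toy probability lists. The names `residual_distribution` and `speculative_step` are hypothetical and are not taken from Prolepsis.

```python
import random


def residual_distribution(p, q):
    """Return normalized max(0, p - q), used to resample after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        # p and q are identical, so a rejection cannot occur; fall back to p.
        return list(p)
    return [d / total for d in diff]


def speculative_step(p, q, rng):
    """Propose one token from the draft distribution q, verify against target p.

    Returns (token, accepted). Assumes p and q are probability lists over the
    same toy vocabulary and that every entry of q is positive.
    """
    tokens = range(len(p))
    x = rng.choices(tokens, weights=q)[0]
    # Accept the draft token with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the residual so the overall draw follows p.
    y = rng.choices(tokens, weights=residual_distribution(p, q))[0]
    return y, False
```

Under these assumptions, calling `speculative_step` many times yields tokens whose frequencies approach the target `p`, whatever `q` is. In a real pipeline `p` and `q` would come from the target and draft models, and several draft tokens would be verified in one parallel pass.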

Gane2122 / nanoGPT_1GPU_SPEEDRUN

🚀 Achieve rapid training of NanoGPT (GPT-2 124M) on a single RTX 4090, targeting a validation loss below 3.28 with FineWeb-Edu data.

open-source benchmark machine-learning natural-language-processing deep-learning text-generation pytorch model-training gpu-optimization ai-research transformer-models single-gpu inference-speed nanogpt fast-training

Updated May 23, 2026
Python

gpu-optimization

Here are 43 public repositories matching this topic...
