Real-time stream editing pipeline powered by the FLUX.2-klein-4B model, optimized for consumer GPUs
Updated May 16, 2026 - Python
GVProf: A Value Profiler for GPU-based Clusters
AI Infrastructure Performance Engineer Learning Track - GPU optimization, inference optimization, and cost reduction
KeSSie HUGE Context Semantic recall for Large Language Models
The GPU Optimizer for ML Models enhances GPU performance for machine learning. It offers advanced scheduling, real-time monitoring, and efficient resource management through a user-friendly web interface and robust API, integrating big data technologies for seamless data processing and model optimization. @NVIDIA
Physics-based computation at scale — Hamiltonian dynamics, spectral theory, and statistical mechanics powering optimization, drug discovery, genomics, molecular proof, and agentic commerce.
This is a short course covering GPU optimization techniques for LLM inference
Reproduces and optimizes common deep learning operators, with both CUDA and Triton implementations, intended for learning and reference
AI Infrastructure Senior Engineer Learning Track - Advanced ML infrastructure and technical leadership
Executive FinOps dashboard and automated governance engine using FOCUS 1.3 standards for AWS, Azure, and Snowflake.
Lightweight Stable Diffusion engine with plugin-based pipelines, VRAM-safe execution, and full 4GB GPU support.
Quantitative dataset of 119 neural architectures (2017-2025) scored on hardware compatibility and ecosystem friction. Validates the Transformer Attractor thesis.
🤖 Ollama Consumer - A Python-based interactive chat interface for Ollama models with advanced model management, comprehensive benchmarking, vision support, and automatic error recovery. Features dynamic model switching, GPU optimization, and intelligent service monitoring for seamless AI model interactions.
An advanced hybrid scheduling framework that leverages Reinforcement Learning and ML to dynamically optimize CPU/GPU task allocation in real-time.
High-performance CUDA implementation of LayerNorm for PyTorch achieving 1.46x speedup through kernel fusion. Optimized for large language models (4K-8K hidden dims) with vectorized memory access, warp-level primitives, and mixed precision support. Drop-in replacement for nn.LayerNorm with 25% memory reduction.
DGX Spark (GB10/SM121) platform support for Meta's KernelAgent — auto-detect, hardware constraints, safe Triton configs
LLM pretraining from scratch on FineWeb dataset (architecture and all components explained), plus optimal use of GPU on SLURM cluster
The NVIDIA driver's fan control logic wasn't doing it for me — too conservative, too opaque — so I built my own. This is a Linux GUI application for independent NVIDIA GPU fan control without requiring Coolbits. Uses pynvml via a root helper subprocess for direct fan management.
Prolepsis is a speculative decoding implementation that accelerates LLM inference by 1.30x on an A100. By pairing a small draft model (Qwen 1.7B) with a larger target (Qwen 8B), it shifts generation workloads into a parallel verification pass. A rigorous rejection sampling pipeline guarantees the output distribution is preserved.
🚀 Achieve rapid training of NanoGPT (GPT-2 124M) on a single RTX 4090, targeting a validation loss below 3.28 with FineWeb-Edu data.