A Python framework for self-hosted LLM tool-calling and multi-step agentic workflows
Updated May 31, 2026 - Python
Run larger LLMs with longer contexts on Apple Silicon by using differentiated precision for KV cache quantization. KVSplit enables 8-bit keys & 4-bit values, reducing memory by 59% with <1% quality loss. Includes benchmarking, visualization, and one-command setup. Optimized for M1/M2/M3 Macs with Metal support.
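The 59% figure in the KVSplit description can be checked with back-of-the-envelope arithmetic. The sketch below assumes the cache uses llama.cpp-style block formats (`q8_0` for keys, `q4_0` for values, each storing one fp16 scale per 32 elements) against an FP16 baseline; whether KVSplit uses exactly these formats is an assumption, and the model shape in the usage section is purely hypothetical.

```python
# Effective bits per element, including one fp16 scale per 32-element block.
# ASSUMPTION: llama.cpp-style q8_0 / q4_0 layouts; KVSplit's actual formats may differ.
BITS_F16 = 16.0
BITS_Q8_0 = (32 * 8 + 16) / 32  # 8.5 bits per element
BITS_Q4_0 = (32 * 4 + 16) / 32  # 4.5 bits per element


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, k_bits, v_bits):
    """Total KV cache size in bytes for the given per-element bit widths."""
    # Keys and values each hold this many elements across all layers.
    elems_per_tensor = n_layers * n_kv_heads * head_dim * n_ctx
    return elems_per_tensor * (k_bits + v_bits) / 8


def memory_reduction(k_bits, v_bits, baseline_bits=BITS_F16):
    """Fraction of memory saved relative to storing K and V at baseline_bits."""
    return 1.0 - (k_bits + v_bits) / (2 * baseline_bits)


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)
    fp16 = kv_cache_bytes(**shape, k_bits=BITS_F16, v_bits=BITS_F16)
    k8v4 = kv_cache_bytes(**shape, k_bits=BITS_Q8_0, v_bits=BITS_Q4_0)
    print(f"FP16 cache: {fp16 / 2**20:.0f} MiB")
    print(f"K8V4 cache: {k8v4 / 2**20:.0f} MiB")
    print(f"Reduction:  {memory_reduction(BITS_Q8_0, BITS_Q4_0):.1%}")
```

Under these assumptions the saving is 1 - 13/32, roughly 59%, which agrees with the stated figure. Without the per-block scale overhead (plain 8 + 4 bits) it would be 62.5%.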
This repo showcases how to run a model locally and offline, free of OpenAI dependencies.
A custom ComfyUI node for MiniCPM vision-language models, supporting v4, v4.5, and v4 GGUF formats, enabling high-quality image captioning and visual analysis.
LLaMA Server combines the power of LLaMA C++ with the beauty of Chatbot UI.
DocMind AI is a powerful, open-source Streamlit application leveraging LlamaIndex, LangGraph, and local Large Language Models (LLMs) via Ollama, LMStudio, llama.cpp, or vLLM for advanced document analysis. Analyze, summarize, and extract insights from a wide array of file formats, securely and privately, all offline.
📚 Local PDF-Integrated Chat Bot: Secure Conversations and Document Assistance with LLM-Powered Privacy
llm-inference is a platform for publishing and managing LLM inference, providing a wide range of out-of-the-box features for model deployment, such as a UI, a RESTful API, auto-scaling, computing resource management, monitoring, and more.
BabyAGI-🦙: Enhanced for Llama models (running 100% local) and persistent memory, with smart internet search based on BabyCatAGI and document embedding in langchain based on privateGPT
Demos of Google's Gemma models running locally on NVIDIA Jetson Orin Nano, from the Tokyo Dev Day (Gemma 2) to the latest Gemma 4 VLA agent with voice + vision.
OpenVitamin is a local-first AI execution platform that unifies Agents, Workflows, and multi-model inference into a single programmable system — designed for building real, production-grade AI applications.
Reproducible local LLM setup and benchmark evidence for AMD Strix Halo / Ryzen AI MAX+ 395: 63-98.5 t/s direct Qwen MoE, 101.1 t/s MTP.
◉ Universal Intelligence: AI made simple.
Local diagnostic CLI for NVIDIA DGX Spark (GB10). Detects power caps, unified memory pressure, thermal risk, Docker/runtime issues, and validates vLLM/Ollama/llama.cpp/SGLang recipes.
Running Mixture of Agents on CPU: LFM2.5 Brain (1.2B) + Falcon-R Reasoner (600M) + Tool Caller (90M). CPU-only, 16GB RAM. Lightweight AI Legion.
GPU-accelerated LLaMA inference wrapper for legacy Vulkan-capable systems: a Pythonic way to run AI with knowledge (Ilm) on fire (Vulkan).
Local character AI chatbot with Chroma vector store memory, plus scripts to process documents for Chroma.