Skip to content

Ismail-2001/SQL-Query-Agent

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

💎 Nexus SQL

Autonomous AI Data Analyst — Query Databases in Plain English


Python LangChain DeepSeek V3 Streamlit Plotly SQLAlchemy PostgreSQL License: MIT


"Ask questions in English. Get answers in charts."

Nexus SQL is an autonomous AI Data Analyst that transforms natural language questions into optimized SQL queries, executes them safely, and instantly visualizes the results — all through a premium Streamlit interface. Powered by DeepSeek-V3 and LangChain's ReAct agent, it reduces time-to-insight by eliminating the need to know SQL entirely.

✨ Features · 🏗️ Architecture · 🚀 Setup · 💡 Examples


📌 The Problem

Business intelligence is bottlenecked by a single skill: SQL fluency.

  • Analysts wait days for engineering to answer data questions
  • Non-technical stakeholders can't self-serve — every insight requires a ticket
  • Writing complex JOINs, GROUP BYs, and subqueries is error-prone even for experienced developers
  • Raw query results are meaningless without visualization — creating another handoff to a BI tool
  • No safety net — a single DROP TABLE or DELETE FROM can cause irreversible damage

Nexus SQL solves all five. Type a question in English, get a chart, a summary, and the full chain of thought — with two layers of SQL injection protection ensuring your data is never at risk.


✨ Key Features

🧠 Autonomous ReAct Reasoning

Powered by LangChain's create_sql_agent with a Zero-Shot ReAct Description agent type:

  • Schema Inspection First: The agent always inspects database structure before writing SQL — using exact table and column names, never guessing
  • Multi-Step Planning: Complex questions trigger multi-step reasoning chains — the agent plans JOINs, decides groupings, and iterates on its own queries before returning an answer
  • Explicit Foreign Key Logic: The system prompt enforces explicit JOINs and correct foreign key resolution (e.g., orders.product_id = products.product_id)
  • Business Interpretation: After returning data, the agent provides an executive summary explaining the business significance

💬 Conversational Memory

Context-aware follow-up queries without repeating yourself:

  • Sliding Window: The agent maintains a history of the last 5 Q&A pairs in memory
  • Context Injection: Follow-up questions automatically receive the last 3 exchanges as context — so "Break that down by country" works after "Show total revenue"
  • Session History Panel: The sidebar displays all previous queries with one-click Re-run buttons

📈 Automatic Visualization Engine

The ResultVisualizer class in visualizer.py automatically selects the optimal chart type:

Data Pattern Detected Chart Type When It Triggers
Date/time column + numeric column 📈 Timeline (spline line) date, time, month in column name + any numeric column
Categorical + numeric (≤5 categories) 🍩 Donut Chart e.g., revenue per country (3–5 rows)
Categorical + numeric (>5 categories) 📊 Bar Chart e.g., product-level breakdown (6+ rows)
Two numeric columns 🔵 Scatter Plot e.g., quantity vs. price correlation
Single numeric column 📊 Bar Chart (fallback) Any single-metric result set

All charts use a futuristic color palette (#00f2ff, #7000ff, #ff007a, #00ffab, #ffea00) with plotly_dark template — transparent backgrounds, hidden grid lines, Inter font, and custom hover labels matching the UI.

Summary Statistics are also auto-generated: totals, averages, and currency formatting (detects columns with price, amount, total in the name).

🛡️ Two-Layer SQL Security

Layer 1 — Agent Pre-Audit (agent.py): Before any query reaches the LLM, _is_potentially_malicious() scans the raw user input for 8 forbidden keywords:

delete, drop, truncate, update, insert, alter, grant, revoke

If any are found, the request is immediately rejected — the agent never sees it.

Layer 2 — Tool-Level Execution Guard (tools.py): The SafeSQLExecutor LangChain tool enforces a second line of defense at execution time:

  • Only SELECT and WITH (CTE) statements are allowed
  • 10 dangerous SQL keywords are blocked: DELETE, DROP, TRUNCATE, UPDATE, INSERT, ALTER, CREATE, EXEC, GRANT, REVOKE
  • All queries are automatically limited to 100 rows if no LIMIT clause is present
  • Results are formatted through Pandas for clean table output

🔍 Query Complexity Analyzer

The QueryComplexityAnalyzer tool (tools.py) scores any SQL query for performance risk:

Factor Score Warning Threshold
Each JOIN +2 >3 JOINs: "High number of joins. May be slow."
SELECT * usage +1 "Consider selecting only necessary columns."
Each subquery (SELECT... +3 >1 subquery: "Multiple subqueries detected."
GROUP BY +1 (no warning)

Rating: Simple (0–4) · Moderate (5–9) · Complex (10+)

🐘 PostgreSQL Ready

Zero-config database migration via environment variables:

DATABASE_TYPE=postgresql
POSTGRES_DB_URL=postgresql://user:password@host:5432/dbname

When DATABASE_TYPE=postgresql is set, the entire system — agent, tools, visualizer — automatically switches from SQLite to the production PostgreSQL connection. The Streamlit UI also updates the Node Status panel and disables the Hard Reset button (which is SQLite-only).

📝 Evaluation & Feedback Loop

Every answer includes 👍 Correct / 👎 Incorrect buttons. Feedback is logged to evaluations.csv with:

  • Timestamp
  • Original question
  • SQL query (if captured)
  • Agent answer
  • Human feedback

This creates a ground-truth dataset for future model fine-tuning.

🎨 Premium Streamlit UI

A fully custom design system injected via st.markdown() CSS:

  • Radial gradient background: #1a1c2c → #0a0b10
  • Glassmorphism sidebar: backdrop-filter: blur(10px) with subtle border
  • Gradient headings: linear-gradient(90deg, #fff 0%, #a5b4fc 100%)
  • Gradient buttons: linear-gradient(135deg, #7000ff 0%, #4a00ff 100%) with hover lift (translateY(-2px))
  • Code blocks: JetBrains Mono font on dark #1e1e2e background
  • Fade-in animation: Results container uses @keyframes fadeIn with translate
  • Three-tab result layout: 📊 Analytics, 📋 Source Data, 🧠 Neural Logic

🏢 Discovery Hub

Pre-loaded example queries in the sidebar for instant exploration:

Label Natural Language Query
📈 Monthly Revenue Trend "Show me total sales by month for the last 6 months"
🌐 Regional Performance "What is the total revenue per country?"
👑 VIP Customers "Who are the top 5 customers by total spending?"
📦 Inventory Health "Which product categories have the highest and lowest average prices?"
🚛 Order Logistics "List the 10 most recent orders with customer name and product details"

🏗️ Architecture

System Flow

┌──────────────────────────────────────────────────────────────────────┐
│                       Streamlit Frontend (app.py)                    │
│                                                                      │
│  ┌────────────────┐   ┌──────────────────┐   ┌──────────────────┐   │
│  │  Query Input   │   │  Discovery Hub   │   │  Session History │   │
│  │  (text_area)   │   │  (5 examples)    │   │  (Re-run btns)  │   │
│  └───────┬────────┘   └────────┬─────────┘   └──────────────────┘   │
│          │                     │                                     │
│          ▼                     ▼                                     │
│  ┌─────────────────────────────────────────────────────────────┐     │
│  │              Layer 1: Pre-Audit Security Check              │     │
│  │              _is_potentially_malicious(question)             │     │
│  │              Blocks: delete, drop, truncate, update...      │     │
│  └─────────────────────────┬───────────────────────────────────┘     │
│                            │ ✅ Clean                                │
│                            ▼                                         │
│  ┌─────────────────────────────────────────────────────────────┐     │
│  │              SQLQueryAgent (agent.py)                        │     │
│  │                                                             │     │
│  │  ┌───────────────────────────┐  ┌────────────────────────┐  │     │
│  │  │  Conversational Memory   │  │  DeepSeek V3 LLM       │  │     │
│  │  │  Last 5 Q&A pairs        │  │  model: deepseek-chat  │  │     │
│  │  │  Last 3 injected as      │  │  temperature: 0        │  │     │
│  │  │  context for follow-ups  │  │  top_p: 0.1            │  │     │
│  │  └───────────────────────────┘  └────────────────────────┘  │     │
│  │                                                             │     │
│  │  LangChain create_sql_agent (zero-shot-react-description)   │     │
│  │  SQLDatabaseToolkit → Schema + Sample Rows (3 per table)    │     │
│  └─────────────────────────┬───────────────────────────────────┘     │
│                            │ ReAct Tool Loop                         │
│                            ▼                                         │
│  ┌─────────────────────────────────────────────────────────────┐     │
│  │              Layer 2: SafeSQLExecutor (tools.py)             │     │
│  │                                                             │     │
│  │  • Validates: SELECT/WITH only                              │     │
│  │  • Blocks: DELETE, DROP, CREATE, EXEC, GRANT... (10 KWs)   │     │
│  │  • Auto-appends: LIMIT 100                                  │     │
│  │  • Executes via: SQLAlchemy + Pandas                        │     │
│  └─────────────────────────┬───────────────────────────────────┘     │
│                            │ DataFrame result                        │
│                            ▼                                         │
│  ┌─────────────────────────────────────────────────────────────┐     │
│  │              ResultVisualizer (visualizer.py)                │     │
│  │                                                             │     │
│  │  • Detects data shape (date, categorical, numeric)          │     │
│  │  • Selects chart: Timeline / Bar / Donut / Scatter          │     │
│  │  • Generates summary stats with currency formatting         │     │
│  └─────────────────────────┬───────────────────────────────────┘     │
│                            │                                         │
│                            ▼                                         │
│  ┌──────────────────────────────────────────────────────────────┐    │
│  │  Three-Tab Result Panel                                      │    │
│  │  📊 Analytics: Chart + Key Metrics + 👍/👎 Feedback         │    │
│  │  📋 Source Data: Raw DataFrame table                         │    │
│  │  🧠 Neural Logic: Full Chain-of-Thought step expansion      │    │
│  └──────────────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────────────┘

File Responsibility Map

File Lines Responsibility
agent.py 110 SQLQueryAgent class — DeepSeek LLM init, LangChain SQL agent setup, conversation memory (5-entry sliding window), input pre-audit, schema exposure
tools.py 127 SafeSQLExecutor — read-only SQL execution with LIMIT 100 auto-append · QueryComplexityAnalyzer — JOIN/subquery scoring · SchemaInspector — table info dump
app.py 333 Full Streamlit UI — CSS design system, sidebar Discovery Hub, result tabs (Analytics / Source Data / Neural Logic), feedback logger, PostgreSQL toggle, @st.cache_resource agent singleton
visualizer.py 104 ResultVisualizer — auto chart selection (Timeline / Bar / Donut / Scatter), futuristic color palette, get_summary_stats() with currency detection
setup_db.py 119 Demo e-commerce database genesis — 3 tables (customers, products, orders), 40 customers, 12 products, 250 orders with realistic time-series pricing
test_agent.py 35 Smoke test — initializes agent, prints schema, runs a sample query

Demo Database Schema

The setup_db.py script creates a realistic e-commerce dataset:

customers (40 rows)
├── customer_id  INTEGER PRIMARY KEY
├── name         TEXT
├── email        TEXT UNIQUE
├── segment      TEXT     -- Corporate, Retail, Wholesale, Affiliate
├── country      TEXT     -- USA, UK, Germany, Canada, France, Japan, Australia, Brazil
└── signup_date  DATE

products (12 rows)
├── product_id    INTEGER PRIMARY KEY
├── product_name  TEXT     -- Quantum Laptop Pro, Neural Mouse, ErgoDesk 3000...
├── category      TEXT     -- Electronics, Furniture, Appliances
├── sub_category  TEXT     -- Computing, Peripherals, Office, Kitchen, Audio, Mobile
└── base_price    DECIMAL

orders (250 rows)
├── order_id       INTEGER PRIMARY KEY
├── customer_id    INTEGER → FK customers
├── product_id     INTEGER → FK products
├── order_date     DATE     -- Last 180 days
├── quantity       INTEGER  -- 1–5
├── unit_price     DECIMAL  -- ±10% variation for realism
├── total_amount   DECIMAL
└── shipping_status TEXT    -- Delivered, Shipped, Processing, Cancelled

🚀 Local Installation

Prerequisites

  • Python 3.10+
  • A DeepSeek API Keyget one here (uses OpenAI-compatible endpoint)

Step 1 — Clone

git clone https://github.com/Ismail-2001/SQL-Query-Agent.git
cd SQL-Query-Agent

Step 2 — Virtual Environment

python -m venv venv
source venv/bin/activate          # macOS / Linux
# venv\Scripts\Activate.ps1      # Windows (PowerShell)

Step 3 — Install Dependencies

pip install -r requirements.txt

Step 4 — Configure API Key

Create a .env file in the project root:

DEEPSEEK_API_KEY=your_deepseek_api_key_here

# Optional — for production PostgreSQL:
# DATABASE_TYPE=postgresql
# POSTGRES_DB_URL=postgresql://user:password@host:5432/dbname

Step 5 — Initialize & Run

# Create the demo e-commerce database (40 customers, 12 products, 250 orders)
python setup_db.py

# Launch the Streamlit app
streamlit run app.py

The app will open at http://localhost:8501.

Note: If the demo database doesn't exist when you launch app.py, it will be automatically created on first load — no need to run setup_db.py manually.


💡 Usage & Example Queries

Natural Language Queries

Type any of these into the query box:

Category Example Query What Happens Under the Hood
Aggregation "What is the total revenue for each country?" SELECT country, SUM(total_amount) FROM orders JOIN customers... GROUP BY country
Top-N "Who are the top 5 customers by spending?" Multi-table JOIN with ORDER BY ... DESC LIMIT 5
Trend Analysis "Show monthly revenue for the last 6 months" Date grouping + Timeline chart auto-generated
Filtering "List all cancelled orders from the USA" Filtered JOIN with WHERE shipping_status = 'Cancelled' AND country = 'USA'
Comparison "Compare average order values between Corporate and Retail segments" GROUP BY segment + Bar/Donut chart
Follow-up "Break that down by product category" Conversational memory injects previous context — no need to repeat the original question

Reading the Results

Every successful query produces a three-tab result panel:

Tab Content
📊 Analytics Executive summary (text), auto-generated Plotly chart, key metrics summary, and 👍/👎 feedback buttons
📋 Source Data The raw DataFrame rendered as an interactive Streamlit table
🧠 Neural Logic Full Chain-of-Thought — every ReAct step the agent took, including schema inspections, SQL drafts, and field observations

Running the Test Suite

python test_agent.py

This initializes the agent, prints the full database schema, and runs a sample question ("How many customers do we have in total?") — useful for validating your API key and database setup before launching the UI.


📂 Project Structure

SQL-Query-Agent/
│
├── agent.py             # SQLQueryAgent class
│                        #   • DeepSeek V3 via OpenAI-compatible endpoint
│                        #   • LangChain create_sql_agent (zero-shot-react)
│                        #   • 5-entry conversational memory (sliding window)
│                        #   • Pre-audit malicious input scanner (8 forbidden KWs)
│                        #   • Custom system prompt enforcing schema-first reasoning
│
├── tools.py             # LangChain BaseTool implementations
│                        #   • SafeSQLExecutor: read-only guard + LIMIT 100
│                        #   • QueryComplexityAnalyzer: JOIN/subquery scoring
│                        #   • SchemaInspector: table info with sample rows
│
├── app.py               # Streamlit application
│                        #   • Full CSS design system (radial gradients, glassmorphism)
│                        #   • 3-tab result panel (Analytics / Source Data / Neural Logic)
│                        #   • Sidebar: Discovery Hub, Session History, Node Status
│                        #   • @st.cache_resource agent singleton
│                        #   • @st.cache_data query caching (1hr TTL)
│                        #   • Feedback logger → evaluations.csv
│                        #   • PostgreSQL env var toggle
│
├── visualizer.py        # ResultVisualizer class
│                        #   • Auto chart selection: Timeline / Bar / Donut / Scatter
│                        #   • Futuristic 5-color palette on plotly_dark template
│                        #   • Currency-aware summary statistics
│
├── setup_db.py          # Demo database generator
│                        #   • 3 tables: customers, products, orders
│                        #   • 40 customers × 12 products × 250 orders
│                        #   • Time-series friendly order dates (last 180 days)
│                        #   • Realistic pricing with ±10% variance
│
├── test_agent.py        # Smoke test: schema dump + sample query
├── requirements.txt     # Python dependencies (13 packages)
├── evaluations.csv      # (Generated) Human feedback log
└── .env                 # Private API key configuration

🛠️ Tech Stack

Category Library Purpose
Language Python 3.10+ Core runtime
LLM Provider langchain-openai → DeepSeek V3 SQL generation and reasoning via OpenAI-compatible API (api.deepseek.com/v1)
Agent Framework LangChain create_sql_agent Zero-Shot ReAct agent with SQLDatabaseToolkit
Schema Access langchain-community SQLDatabase Schema introspection with 3 sample rows per table
Tool Validation Pydantic BaseModel Typed args_schema for all custom tools
Database SQLAlchemy + SQLite (demo) / PostgreSQL (prod) Universal database abstraction
Data Processing Pandas Query result formatting + DataFrame manipulation
Visualization Plotly Express + Graph Objects Interactive chart generation with dark theme
Frontend Streamlit App framework with @st.cache_resource and @st.cache_data
Config python-dotenv Secure .env file loading
PostgreSQL Driver psycopg2-binary Production database driver
Columnar Engine DuckDB Available for analytical query acceleration

🌐 Deployment

Streamlit Cloud (Fastest)

  1. Push your repo to GitHub
  2. Go to share.streamlit.io
  3. Select your repository → set Main file path to app.py
  4. Add Secrets in the Streamlit Cloud dashboard:
    DEEPSEEK_API_KEY = "your_key_here"
  5. Click Deploy

Render / Railway

# render.yaml (Render)
services:
  - type: web
    name: nexus-sql
    runtime: python
    buildCommand: pip install -r requirements.txt && python setup_db.py
    startCommand: streamlit run app.py --server.port $PORT --server.headless true
    envVars:
      - key: DEEPSEEK_API_KEY
        sync: false

Docker

FROM python:3.11-slim
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir -r requirements.txt
RUN python setup_db.py
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.headless=true"]

🗺️ Roadmap

✅ Phase 1 — Core Intelligence (Complete)

  • DeepSeek V3 via OpenAI-compatible endpoint
  • LangChain Zero-Shot ReAct SQL agent with SQLDatabaseToolkit
  • Two-layer SQL security (input pre-audit + execution guard)
  • Automatic LIMIT 100 injection on all queries
  • Conversational memory (5-entry sliding window, last 3 as context)
  • Auto-visualization engine (Timeline / Bar / Donut / Scatter)
  • Query Complexity Analyzer with JOIN/subquery scoring
  • Premium glassmorphism Streamlit UI with 3-tab results
  • Demo e-commerce database (3 tables, 250 orders)
  • Discovery Hub with 5 pre-loaded example queries
  • Human feedback loop → evaluations.csv
  • PostgreSQL environment toggle
  • @st.cache_resource agent singleton + @st.cache_data query caching (1hr TTL)

🔨 Phase 2 — Multi-Agent & Advanced Analytics (Next)

  • Multi-Agent Orchestration: Separate agents for SQL generation vs. data visualization strategy
  • Vector-Search Schema Pruning: Embedding-based table selection for 1000+ table databases
  • Export Engine: PDF and Excel report generation for business meetings
  • Query History Dashboard: Full session analytics with re-run, compare, and trend views

📋 Phase 3 — Enterprise Features (Planned)

  • Role-Based Access Control: Read-only vs. analyst vs. admin permission tiers
  • Preditive Anomaly Detection: Proactive alerts when revenue, orders, or metrics deviate from trend
  • Slack / Teams Integration: Query your database from a chat message
  • Scheduled Reports: Cron-based automated briefings delivered via email

🔭 Phase 4 — Platform Vision

  • Multi-Database Connections: Query across multiple databases in a single session
  • Data Lineage Visualization: Trace column dependencies and transformation chains
  • Fine-Tuned SQL Model: Train on evaluations.csv feedback data for domain-specific accuracy
  • Natural Language Dashboard Builder: Generate full Streamlit dashboards from conversation history

🤝 Contributing

Contributions are welcome! High-impact areas:

  • New visualization types in visualizer.py (e.g., heatmaps, treemaps, funnel charts)
  • Additional safety rules in tools.py SafeSQLExecutor
  • New example queries for the Discovery Hub in app.py
  • PostgreSQL test coverage with real connection handling
  • New demo datasets via setup_db.py (e.g., healthcare, logistics, SaaS metrics)

To contribute:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Commit with Conventional Commits: git commit -m "feat: add heatmap chart type to visualizer"
  4. Push and open a Pull Request against main

📄 License

Distributed under the MIT License. See LICENSE for details.


Built so anyone can talk to their data — no SQL required.

If Nexus SQL changed how you think about data access, star ⭐ the repo.

GitHub Stars

Built with ❤️ by Ismail Sajid

About

Nexus SQL is an autonomous AI Data Analyst that lets you query databases in plain English. Built with LangChain and DeepSeek-V3, it automates complex SQL generation, provides instant data visualizations, and ensures enterprise security, reducing time-to-insight by 80%.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages