💎 Nexus SQL

Autonomous AI Data Analyst — Query Databases in Plain English

"Ask questions in English. Get answers in charts."

Nexus SQL is an autonomous AI Data Analyst that transforms natural language questions into optimized SQL queries, executes them safely, and instantly visualizes the results — all through a premium Streamlit interface. Powered by DeepSeek-V3 and LangChain's ReAct agent, it reduces time-to-insight by eliminating the need to know SQL entirely.

✨ Features · 🏗️ Architecture · 🚀 Setup · 💡 Examples

📌 The Problem

Business intelligence is bottlenecked by a single skill: SQL fluency.

Analysts wait days for engineering to answer data questions
Non-technical stakeholders can't self-serve — every insight requires a ticket
Writing complex JOINs, GROUP BYs, and subqueries is error-prone even for experienced developers
Raw query results are meaningless without visualization — creating another handoff to a BI tool
No safety net — a single DROP TABLE or DELETE FROM can cause irreversible damage

Nexus SQL solves all five. Type a question in English, get a chart, a summary, and the full chain of thought — with two layers of SQL injection protection ensuring your data is never at risk.

✨ Key Features

🧠 Autonomous ReAct Reasoning

Powered by LangChain's create_sql_agent with a Zero-Shot ReAct Description agent type:

Schema Inspection First: The agent always inspects database structure before writing SQL — using exact table and column names, never guessing
Multi-Step Planning: Complex questions trigger multi-step reasoning chains — the agent plans JOINs, decides groupings, and iterates on its own queries before returning an answer
Explicit Foreign Key Logic: The system prompt enforces explicit JOINs and correct foreign key resolution (e.g., orders.product_id = products.product_id)
Business Interpretation: After returning data, the agent provides an executive summary explaining the business significance

💬 Conversational Memory

Context-aware follow-up queries without repeating yourself:

Sliding Window: The agent maintains a history of the last 5 Q&A pairs in memory
Context Injection: Follow-up questions automatically receive the last 3 exchanges as context — so "Break that down by country" works after "Show total revenue"
Session History Panel: The sidebar displays all previous queries with one-click Re-run buttons

📈 Automatic Visualization Engine

The ResultVisualizer class in visualizer.py automatically selects the optimal chart type:

Data Pattern Detected	Chart Type	When It Triggers
Date/time column + numeric column	📈 Timeline (spline line)	`date`, `time`, `month` in column name + any numeric column
Categorical + numeric (≤5 categories)	🍩 Donut Chart	e.g., revenue per country (3–5 rows)
Categorical + numeric (>5 categories)	📊 Bar Chart	e.g., product-level breakdown (6+ rows)
Two numeric columns	🔵 Scatter Plot	e.g., quantity vs. price correlation
Single numeric column	📊 Bar Chart (fallback)	Any single-metric result set

All charts use a futuristic color palette (#00f2ff, #7000ff, #ff007a, #00ffab, #ffea00) with plotly_dark template — transparent backgrounds, hidden grid lines, Inter font, and custom hover labels matching the UI.

Summary Statistics are also auto-generated: totals, averages, and currency formatting (detects columns with price, amount, total in the name).

🛡️ Two-Layer SQL Security

Layer 1 — Agent Pre-Audit (agent.py): Before any query reaches the LLM, _is_potentially_malicious() scans the raw user input for 8 forbidden keywords:

delete, drop, truncate, update, insert, alter, grant, revoke

If any are found, the request is immediately rejected — the agent never sees it.

Layer 2 — Tool-Level Execution Guard (tools.py): The SafeSQLExecutor LangChain tool enforces a second line of defense at execution time:

Only SELECT and WITH (CTE) statements are allowed
10 dangerous SQL keywords are blocked: DELETE, DROP, TRUNCATE, UPDATE, INSERT, ALTER, CREATE, EXEC, GRANT, REVOKE
All queries are automatically limited to 100 rows if no LIMIT clause is present
Results are formatted through Pandas for clean table output

🔍 Query Complexity Analyzer

The QueryComplexityAnalyzer tool (tools.py) scores any SQL query for performance risk:

Factor	Score	Warning Threshold
Each JOIN	+2	>3 JOINs: "High number of joins. May be slow."
`SELECT *` usage	+1	"Consider selecting only necessary columns."
Each subquery `(SELECT...`	+3	>1 subquery: "Multiple subqueries detected."
`GROUP BY`	+1	(no warning)

Rating: Simple (0–4) · Moderate (5–9) · Complex (10+)

🐘 PostgreSQL Ready

Zero-config database migration via environment variables:

DATABASE_TYPE=postgresql
POSTGRES_DB_URL=postgresql://user:password@host:5432/dbname

When DATABASE_TYPE=postgresql is set, the entire system — agent, tools, visualizer — automatically switches from SQLite to the production PostgreSQL connection. The Streamlit UI also updates the Node Status panel and disables the Hard Reset button (which is SQLite-only).

📝 Evaluation & Feedback Loop

Every answer includes 👍 Correct / 👎 Incorrect buttons. Feedback is logged to evaluations.csv with:

Timestamp
Original question
SQL query (if captured)
Agent answer
Human feedback

This creates a ground-truth dataset for future model fine-tuning.

🎨 Premium Streamlit UI

A fully custom design system injected via st.markdown() CSS:

Radial gradient background: #1a1c2c → #0a0b10
Glassmorphism sidebar: backdrop-filter: blur(10px) with subtle border
Gradient headings: linear-gradient(90deg, #fff 0%, #a5b4fc 100%)
Gradient buttons: linear-gradient(135deg, #7000ff 0%, #4a00ff 100%) with hover lift (translateY(-2px))
Code blocks: JetBrains Mono font on dark #1e1e2e background
Fade-in animation: Results container uses @keyframes fadeIn with translate
Three-tab result layout: 📊 Analytics, 📋 Source Data, 🧠 Neural Logic

🏢 Discovery Hub

Pre-loaded example queries in the sidebar for instant exploration:

Label	Natural Language Query
📈 Monthly Revenue Trend	"Show me total sales by month for the last 6 months"
🌐 Regional Performance	"What is the total revenue per country?"
👑 VIP Customers	"Who are the top 5 customers by total spending?"
📦 Inventory Health	"Which product categories have the highest and lowest average prices?"
🚛 Order Logistics	"List the 10 most recent orders with customer name and product details"

🏗️ Architecture

System Flow

┌──────────────────────────────────────────────────────────────────────┐
│                       Streamlit Frontend (app.py)                    │
│                                                                      │
│  ┌────────────────┐   ┌──────────────────┐   ┌──────────────────┐   │
│  │  Query Input   │   │  Discovery Hub   │   │  Session History │   │
│  │  (text_area)   │   │  (5 examples)    │   │  (Re-run btns)  │   │
│  └───────┬────────┘   └────────┬─────────┘   └──────────────────┘   │
│          │                     │                                     │
│          ▼                     ▼                                     │
│  ┌─────────────────────────────────────────────────────────────┐     │
│  │              Layer 1: Pre-Audit Security Check              │     │
│  │              _is_potentially_malicious(question)             │     │
│  │              Blocks: delete, drop, truncate, update...      │     │
│  └─────────────────────────┬───────────────────────────────────┘     │
│                            │ ✅ Clean                                │
│                            ▼                                         │
│  ┌─────────────────────────────────────────────────────────────┐     │
│  │              SQLQueryAgent (agent.py)                        │     │
│  │                                                             │     │
│  │  ┌───────────────────────────┐  ┌────────────────────────┐  │     │
│  │  │  Conversational Memory   │  │  DeepSeek V3 LLM       │  │     │
│  │  │  Last 5 Q&A pairs        │  │  model: deepseek-chat  │  │     │
│  │  │  Last 3 injected as      │  │  temperature: 0        │  │     │
│  │  │  context for follow-ups  │  │  top_p: 0.1            │  │     │
│  │  └───────────────────────────┘  └────────────────────────┘  │     │
│  │                                                             │     │
│  │  LangChain create_sql_agent (zero-shot-react-description)   │     │
│  │  SQLDatabaseToolkit → Schema + Sample Rows (3 per table)    │     │
│  └─────────────────────────┬───────────────────────────────────┘     │
│                            │ ReAct Tool Loop                         │
│                            ▼                                         │
│  ┌─────────────────────────────────────────────────────────────┐     │
│  │              Layer 2: SafeSQLExecutor (tools.py)             │     │
│  │                                                             │     │
│  │  • Validates: SELECT/WITH only                              │     │
│  │  • Blocks: DELETE, DROP, CREATE, EXEC, GRANT... (10 KWs)   │     │
│  │  • Auto-appends: LIMIT 100                                  │     │
│  │  • Executes via: SQLAlchemy + Pandas                        │     │
│  └─────────────────────────┬───────────────────────────────────┘     │
│                            │ DataFrame result                        │
│                            ▼                                         │
│  ┌─────────────────────────────────────────────────────────────┐     │
│  │              ResultVisualizer (visualizer.py)                │     │
│  │                                                             │     │
│  │  • Detects data shape (date, categorical, numeric)          │     │
│  │  • Selects chart: Timeline / Bar / Donut / Scatter          │     │
│  │  • Generates summary stats with currency formatting         │     │
│  └─────────────────────────┬───────────────────────────────────┘     │
│                            │                                         │
│                            ▼                                         │
│  ┌──────────────────────────────────────────────────────────────┐    │
│  │  Three-Tab Result Panel                                      │    │
│  │  📊 Analytics: Chart + Key Metrics + 👍/👎 Feedback         │    │
│  │  📋 Source Data: Raw DataFrame table                         │    │
│  │  🧠 Neural Logic: Full Chain-of-Thought step expansion      │    │
│  └──────────────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────────────┘

File Responsibility Map

File	Lines	Responsibility
`agent.py`	110	`SQLQueryAgent` class — DeepSeek LLM init, LangChain SQL agent setup, conversation memory (5-entry sliding window), input pre-audit, schema exposure
`tools.py`	127	`SafeSQLExecutor` — read-only SQL execution with `LIMIT 100` auto-append · `QueryComplexityAnalyzer` — JOIN/subquery scoring · `SchemaInspector` — table info dump
`app.py`	333	Full Streamlit UI — CSS design system, sidebar Discovery Hub, result tabs (Analytics / Source Data / Neural Logic), feedback logger, PostgreSQL toggle, `@st.cache_resource` agent singleton
`visualizer.py`	104	`ResultVisualizer` — auto chart selection (Timeline / Bar / Donut / Scatter), futuristic color palette, `get_summary_stats()` with currency detection
`setup_db.py`	119	Demo e-commerce database genesis — 3 tables (`customers`, `products`, `orders`), 40 customers, 12 products, 250 orders with realistic time-series pricing
`test_agent.py`	35	Smoke test — initializes agent, prints schema, runs a sample query

Demo Database Schema

The setup_db.py script creates a realistic e-commerce dataset:

customers (40 rows)
├── customer_id  INTEGER PRIMARY KEY
├── name         TEXT
├── email        TEXT UNIQUE
├── segment      TEXT     -- Corporate, Retail, Wholesale, Affiliate
├── country      TEXT     -- USA, UK, Germany, Canada, France, Japan, Australia, Brazil
└── signup_date  DATE

products (12 rows)
├── product_id    INTEGER PRIMARY KEY
├── product_name  TEXT     -- Quantum Laptop Pro, Neural Mouse, ErgoDesk 3000...
├── category      TEXT     -- Electronics, Furniture, Appliances
├── sub_category  TEXT     -- Computing, Peripherals, Office, Kitchen, Audio, Mobile
└── base_price    DECIMAL

orders (250 rows)
├── order_id       INTEGER PRIMARY KEY
├── customer_id    INTEGER → FK customers
├── product_id     INTEGER → FK products
├── order_date     DATE     -- Last 180 days
├── quantity       INTEGER  -- 1–5
├── unit_price     DECIMAL  -- ±10% variation for realism
├── total_amount   DECIMAL
└── shipping_status TEXT    -- Delivered, Shipped, Processing, Cancelled

🚀 Local Installation

Prerequisites

Python 3.10+
A DeepSeek API Key — get one here (uses OpenAI-compatible endpoint)

Step 1 — Clone

git clone https://github.com/Ismail-2001/SQL-Query-Agent.git
cd SQL-Query-Agent

Step 2 — Virtual Environment

python -m venv venv
source venv/bin/activate          # macOS / Linux
# venv\Scripts\Activate.ps1      # Windows (PowerShell)

Step 3 — Install Dependencies

pip install -r requirements.txt

Step 4 — Configure API Key

Create a .env file in the project root:

DEEPSEEK_API_KEY=your_deepseek_api_key_here

# Optional — for production PostgreSQL:
# DATABASE_TYPE=postgresql
# POSTGRES_DB_URL=postgresql://user:password@host:5432/dbname

Step 5 — Initialize & Run

# Create the demo e-commerce database (40 customers, 12 products, 250 orders)
python setup_db.py

# Launch the Streamlit app
streamlit run app.py

The app will open at http://localhost:8501.

Note: If the demo database doesn't exist when you launch app.py, it will be automatically created on first load — no need to run setup_db.py manually.

💡 Usage & Example Queries

Natural Language Queries

Type any of these into the query box:

Category	Example Query	What Happens Under the Hood
Aggregation	"What is the total revenue for each country?"	`SELECT country, SUM(total_amount) FROM orders JOIN customers... GROUP BY country`
Top-N	"Who are the top 5 customers by spending?"	Multi-table JOIN with `ORDER BY ... DESC LIMIT 5`
Trend Analysis	"Show monthly revenue for the last 6 months"	Date grouping + Timeline chart auto-generated
Filtering	"List all cancelled orders from the USA"	Filtered JOIN with `WHERE shipping_status = 'Cancelled' AND country = 'USA'`
Comparison	"Compare average order values between Corporate and Retail segments"	`GROUP BY segment` + Bar/Donut chart
Follow-up	"Break that down by product category"	Conversational memory injects previous context — no need to repeat the original question

Reading the Results

Every successful query produces a three-tab result panel:

Tab	Content
📊 Analytics	Executive summary (text), auto-generated Plotly chart, key metrics summary, and 👍/👎 feedback buttons
📋 Source Data	The raw DataFrame rendered as an interactive Streamlit table
🧠 Neural Logic	Full Chain-of-Thought — every ReAct step the agent took, including schema inspections, SQL drafts, and field observations

Running the Test Suite

python test_agent.py

This initializes the agent, prints the full database schema, and runs a sample question ("How many customers do we have in total?") — useful for validating your API key and database setup before launching the UI.

📂 Project Structure

SQL-Query-Agent/
│
├── agent.py             # SQLQueryAgent class
│                        #   • DeepSeek V3 via OpenAI-compatible endpoint
│                        #   • LangChain create_sql_agent (zero-shot-react)
│                        #   • 5-entry conversational memory (sliding window)
│                        #   • Pre-audit malicious input scanner (8 forbidden KWs)
│                        #   • Custom system prompt enforcing schema-first reasoning
│
├── tools.py             # LangChain BaseTool implementations
│                        #   • SafeSQLExecutor: read-only guard + LIMIT 100
│                        #   • QueryComplexityAnalyzer: JOIN/subquery scoring
│                        #   • SchemaInspector: table info with sample rows
│
├── app.py               # Streamlit application
│                        #   • Full CSS design system (radial gradients, glassmorphism)
│                        #   • 3-tab result panel (Analytics / Source Data / Neural Logic)
│                        #   • Sidebar: Discovery Hub, Session History, Node Status
│                        #   • @st.cache_resource agent singleton
│                        #   • @st.cache_data query caching (1hr TTL)
│                        #   • Feedback logger → evaluations.csv
│                        #   • PostgreSQL env var toggle
│
├── visualizer.py        # ResultVisualizer class
│                        #   • Auto chart selection: Timeline / Bar / Donut / Scatter
│                        #   • Futuristic 5-color palette on plotly_dark template
│                        #   • Currency-aware summary statistics
│
├── setup_db.py          # Demo database generator
│                        #   • 3 tables: customers, products, orders
│                        #   • 40 customers × 12 products × 250 orders
│                        #   • Time-series friendly order dates (last 180 days)
│                        #   • Realistic pricing with ±10% variance
│
├── test_agent.py        # Smoke test: schema dump + sample query
├── requirements.txt     # Python dependencies (13 packages)
├── evaluations.csv      # (Generated) Human feedback log
└── .env                 # Private API key configuration

🛠️ Tech Stack

Category	Library	Purpose
Language	Python `3.10+`	Core runtime
LLM Provider	`langchain-openai` → DeepSeek V3	SQL generation and reasoning via OpenAI-compatible API (`api.deepseek.com/v1`)
Agent Framework	LangChain `create_sql_agent`	Zero-Shot ReAct agent with `SQLDatabaseToolkit`
Schema Access	`langchain-community` `SQLDatabase`	Schema introspection with 3 sample rows per table
Tool Validation	Pydantic `BaseModel`	Typed `args_schema` for all custom tools
Database	SQLAlchemy + SQLite (demo) / PostgreSQL (prod)	Universal database abstraction
Data Processing	Pandas	Query result formatting + DataFrame manipulation
Visualization	Plotly Express + Graph Objects	Interactive chart generation with dark theme
Frontend	Streamlit	App framework with `@st.cache_resource` and `@st.cache_data`
Config	`python-dotenv`	Secure `.env` file loading
PostgreSQL Driver	`psycopg2-binary`	Production database driver
Columnar Engine	DuckDB	Available for analytical query acceleration

🌐 Deployment

Streamlit Cloud (Fastest)

Push your repo to GitHub
Go to share.streamlit.io
Select your repository → set Main file path to app.py
Add Secrets in the Streamlit Cloud dashboard:
```
DEEPSEEK_API_KEY = "your_key_here"
```
Click Deploy ✅

Render / Railway

# render.yaml (Render)
services:
  - type: web
    name: nexus-sql
    runtime: python
    buildCommand: pip install -r requirements.txt && python setup_db.py
    startCommand: streamlit run app.py --server.port $PORT --server.headless true
    envVars:
      - key: DEEPSEEK_API_KEY
        sync: false

Docker

FROM python:3.11-slim
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir -r requirements.txt
RUN python setup_db.py
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.headless=true"]

🗺️ Roadmap

✅ Phase 1 — Core Intelligence (Complete)

🔨 Phase 2 — Multi-Agent & Advanced Analytics (Next)

Multi-Agent Orchestration: Separate agents for SQL generation vs. data visualization strategy
Vector-Search Schema Pruning: Embedding-based table selection for 1000+ table databases
Export Engine: PDF and Excel report generation for business meetings
Query History Dashboard: Full session analytics with re-run, compare, and trend views

📋 Phase 3 — Enterprise Features (Planned)

Role-Based Access Control: Read-only vs. analyst vs. admin permission tiers
Preditive Anomaly Detection: Proactive alerts when revenue, orders, or metrics deviate from trend
Slack / Teams Integration: Query your database from a chat message
Scheduled Reports: Cron-based automated briefings delivered via email

🔭 Phase 4 — Platform Vision

Multi-Database Connections: Query across multiple databases in a single session
Data Lineage Visualization: Trace column dependencies and transformation chains
Fine-Tuned SQL Model: Train on evaluations.csv feedback data for domain-specific accuracy
Natural Language Dashboard Builder: Generate full Streamlit dashboards from conversation history

🤝 Contributing

Contributions are welcome! High-impact areas:

New visualization types in visualizer.py (e.g., heatmaps, treemaps, funnel charts)
Additional safety rules in tools.py SafeSQLExecutor
New example queries for the Discovery Hub in app.py
PostgreSQL test coverage with real connection handling
New demo datasets via setup_db.py (e.g., healthcare, logistics, SaaS metrics)

To contribute:

Fork the repository
Create a feature branch: git checkout -b feature/your-feature
Commit with Conventional Commits: git commit -m "feat: add heatmap chart type to visualizer"
Push and open a Pull Request against main

📄 License

Distributed under the MIT License. See LICENSE for details.

Built so anyone can talk to their data — no SQL required.

If Nexus SQL changed how you think about data access, star ⭐ the repo.

Built with ❤️ by Ismail Sajid

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
README.md		README.md
agent.py		agent.py
app.py		app.py
requirements.txt		requirements.txt
setup_db.py		setup_db.py
test_agent.py		test_agent.py
tools.py		tools.py
visualizer.py		visualizer.py

Folders and files

Latest commit

History

Repository files navigation

💎 Nexus SQL

Autonomous AI Data Analyst — Query Databases in Plain English

📌 The Problem

✨ Key Features

🧠 Autonomous ReAct Reasoning

💬 Conversational Memory

📈 Automatic Visualization Engine

🛡️ Two-Layer SQL Security

🔍 Query Complexity Analyzer

🐘 PostgreSQL Ready

📝 Evaluation & Feedback Loop

🎨 Premium Streamlit UI

🏢 Discovery Hub

🏗️ Architecture

System Flow

File Responsibility Map

Demo Database Schema

🚀 Local Installation

Prerequisites

Step 1 — Clone

Step 2 — Virtual Environment

Step 3 — Install Dependencies

Step 4 — Configure API Key

Step 5 — Initialize & Run

💡 Usage & Example Queries

Natural Language Queries

Reading the Results

Running the Test Suite

📂 Project Structure

🛠️ Tech Stack

🌐 Deployment

Streamlit Cloud (Fastest)

Render / Railway

Docker

🗺️ Roadmap

✅ Phase 1 — Core Intelligence (Complete)

🔨 Phase 2 — Multi-Agent & Advanced Analytics (Next)

📋 Phase 3 — Enterprise Features (Planned)

🔭 Phase 4 — Platform Vision

🤝 Contributing

📄 License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages