PR202111/RAGEngine

📄 RAG Pipeline (PDF → Chunking → Embedding → Vector Search)

A modular Retrieval-Augmented Generation (RAG) pipeline that processes PDFs, generates embeddings, and supports semantic search using FAISS (in-memory) or PGVector (PostgreSQL).


🚀 Features

  • 📚 PDF ingestion and parsing
  • ✂️ Multiple chunking strategies
  • 🧠 SentenceTransformer embeddings
  • ⚡ Fast similarity search with FAISS
  • 🗄️ Persistent storage using PGVector
  • 🔄 Hybrid-ready architecture (extendable to reranking, BM25, etc.)

📂 Project Structure

RaG_Project/
│
├── chunking.py        # Text chunking strategies
├── embedding.py       # Embedding model wrapper
├── store.py           # Vector store (FAISS + PGVector)
├── models.py          # SQLAlchemy models
├── database.py        # DB connection
├── init_db.py         # DB initialization
├── ingest.py          # PDF ingestion pipeline
├── query.py           # Query + retrieval
├── delete.py          # Delete PDFs from store
├── utils.py           # PDF reading utilities
│
├── VectorStore/       # Saved FAISS indexes
│   ├── test.pdf
│   └── test2.pdf
└── config.py

⚙️ Installation

1. Create virtual environment

python -m venv .venv
source .venv/bin/activate

2. Install dependencies

pip install -r requirements.txt

3. (Optional) Set up PostgreSQL + PGVector

  • Install PostgreSQL and the pgvector extension
  • Enable the extension in the target database:
CREATE EXTENSION IF NOT EXISTS vector;

🧩 Chunking Strategies

Defined in chunking.py:

Strategy            Description
fixed               Fixed-size chunks
semantic            Sentence similarity-based
recursive           Hierarchical splitting
sentence-chunking   Sentence grouping
hybrid-semantic     Semantic + recursive fallback
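
As an illustration of the simplest strategy, fixed-size chunking can be sketched as a sliding window. This is a minimal sketch, not the code in chunking.py: the function name `fixed_chunks`, the character-based sizing, and the `overlap` parameter are all assumptions made for the example.

```python
def fixed_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by `overlap`.

    Hypothetical helper for illustration; the project's own "fixed" strategy
    may size chunks differently (for example by tokens).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

Overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.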

🔄 Pipeline Flow

PDF → Text Extraction → Chunking → Embedding → Vector Store → Search

📥 Ingestion

Run:

python ingest.py

Example:

def ingest(pdf_path: str, chunking_stratergy: str = "recursive"):
    # Read the PDF into per-page text
    pages = open_read_pdf(pdf_path)
    # Split the pages into chunks using the selected strategy
    chunks = Chunking(chunking_stratergy, pages).chunk()

Output format:

{
    "content": "...",
    "page": 3,
    "chunk_id": 12
}
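
To show how records of this shape could be produced, here is a minimal sketch that splits each page and numbers chunks across the whole document. The helper `build_records`, the fixed character window, 1-based page numbers, and `chunk_id` starting at 0 are assumptions for the example, not details taken from ingest.py.

```python
def build_records(pages: list[str], chunk_size: int = 500) -> list[dict]:
    """Turn a list of page texts into chunk records (hypothetical helper)."""
    records = []
    chunk_id = 0  # one counter for the whole document, not reset per page
    for page_number, page_text in enumerate(pages, start=1):
        for start in range(0, len(page_text), chunk_size):
            content = page_text[start:start + chunk_size].strip()
            if not content:
                continue  # skip empty pages and whitespace-only windows
            records.append(
                {"content": content, "page": page_number, "chunk_id": chunk_id}
            )
            chunk_id += 1
    return records
```

Because the counter is never reset, a `chunk_id` identifies a chunk within its PDF without needing the page number.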

🧠 Embeddings

Using:

SentenceTransformer("all-MiniLM-L6-v2")
  • Normalized embeddings
  • Batch processing supported
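
A minimal sketch of how chunks might be embedded with this model. The function names `load_model` and `embed_chunks` are hypothetical and embedding.py may be organised differently; the `encode` arguments used here (`batch_size`, `normalize_embeddings`, `convert_to_numpy`) are part of the sentence-transformers API.

```python
def load_model(name: str = "all-MiniLM-L6-v2"):
    """Load the embedding model (hypothetical wrapper)."""
    # Imported lazily so the rest of the module works without the package installed.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(name)


def embed_chunks(model, chunks: list[dict], batch_size: int = 32):
    """Embed the "content" field of each chunk record in batches."""
    texts = [chunk["content"] for chunk in chunks]
    return model.encode(
        texts,
        batch_size=batch_size,      # texts are encoded batch by batch
        normalize_embeddings=True,  # unit-length vectors, ready for cosine search
        convert_to_numpy=True,      # one row per chunk
    )
```

all-MiniLM-L6-v2 produces 384-dimensional vectors, so a vector store built for it needs that dimension.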

🗄️ Vector Store

1. FAISS (default)

  • In-memory
  • Fast
  • Saved locally in VectorStore/

2. PGVector

  • Persistent storage
  • SQL filtering support
  • Scalable

🔍 Search

results = store.search(query_embedding, pdf_id, top_k=5)

Output:

{
    "pdf_id": 1,
    "chunk_id": 12,
    "page": 3,
    "content": "...",
    "score": 0.87
}
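
To make the call and the result shape concrete, here is a brute-force stand-in for the store. It is only a sketch: the class `InMemoryStore` and its `add` method are hypothetical, it uses neither FAISS nor PGVector, and it assumes that `pdf_id` restricts the search to a single document and that all embeddings are already normalized.

```python
class InMemoryStore:
    """Tiny exhaustive-search store, for illustration only."""

    def __init__(self):
        self._rows = []  # each row: (embedding, metadata dict, content)

    def add(self, embedding, pdf_id, chunk_id, page, content):
        meta = {"pdf_id": pdf_id, "chunk_id": chunk_id, "page": page}
        self._rows.append((list(embedding), meta, content))

    def search(self, query_embedding, pdf_id, top_k=5):
        hits = []
        for vector, meta, content in self._rows:
            if meta["pdf_id"] != pdf_id:
                continue  # only score chunks from the requested PDF
            # Dot product; equals cosine similarity for unit-length vectors
            score = sum(q * v for q, v in zip(query_embedding, vector))
            hits.append({**meta, "content": content, "score": score})
        hits.sort(key=lambda hit: hit["score"], reverse=True)
        return hits[:top_k]
```

Usage:

```python
store = InMemoryStore()
store.add([1.0, 0.0], pdf_id=1, chunk_id=0, page=1, content="alpha")
store.add([0.0, 1.0], pdf_id=1, chunk_id=1, page=2, content="beta")
results = store.search([0.6, 0.8], pdf_id=1, top_k=1)  # best match is "beta"
```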

💾 Metadata Design

Each chunk stores:

{
    "pdf_id": int,
    "chunk_id": int,
    "page": int
}
  • content stored separately
  • avoids redundancy
  • enables clean citations
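
One way to picture this separation: metadata lives in a small record, content is looked up by key, and a citation is assembled from both. This is a sketch under assumptions; `ChunkMeta`, `cite`, and the `(pdf_id, chunk_id)` lookup key are hypothetical names, not the project's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkMeta:
    pdf_id: int
    chunk_id: int
    page: int


def cite(meta: ChunkMeta, contents: dict) -> str:
    """Format a citation by joining metadata with separately stored content."""
    text = contents[(meta.pdf_id, meta.chunk_id)]  # content is fetched by key
    return f"[pdf {meta.pdf_id}, p. {meta.page}] {text}"
```

Usage:

```python
contents = {(1, 12): "Vectors are normalized."}
print(cite(ChunkMeta(pdf_id=1, chunk_id=12, page=3), contents))
```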

🧹 Delete Data

python delete.py

Supports:

  • FAISS deletion (local files)
  • PGVector deletion (DB rows)

📌 Key Design Decisions

  • ✅ Global chunk_id per document
  • ✅ Cosine similarity (normalized embeddings)
  • ✅ Backend abstraction (FAISS / PGVector)
  • ✅ Clean metadata separation
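
The link between normalization and cosine similarity can be checked directly: once both vectors have unit length, their dot product is their cosine similarity, so an inner-product search ranks results by cosine. A small self-contained check (the helper names are chosen for this example only):

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# Both expressions evaluate to 24/25 = 0.96 (up to floating-point rounding)
assert math.isclose(cosine(a, b), dot(normalize(a), normalize(b)))
```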

⚠️ Known Limitations

  • No reranking yet
  • No hybrid search (BM25 + vector)
  • Token estimation is approximate
  • No streaming ingestion

🧠 Example Use Cases

  • Chat with PDFs
  • Semantic document search
  • Knowledge base retrieval
  • LLM grounding

👨‍💻 Author

Built as a modular RAG system for experimentation.


📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

About

A RAG engine covering the full pipeline, from uploading a PDF to building a vector store or storing the embeddings in a database.
