A modular Retrieval-Augmented Generation (RAG) pipeline that processes PDFs, generates embeddings, and supports semantic search using FAISS (in-memory) or PGVector (PostgreSQL).
- 📚 PDF ingestion and parsing
- ✂️ Multiple chunking strategies
- 🧠 SentenceTransformer embeddings
- ⚡ Fast similarity search with FAISS
- 🗄️ Persistent storage using PGVector
- 🔄 Hybrid-ready architecture (extendable to reranking, BM25, etc.)
RaG_Project/
│
├── chunking.py # Text chunking strategies
├── embedding.py # Embedding model wrapper
├── store.py # Vector store (FAISS + PGVector)
├── models.py # SQLAlchemy models
├── database.py # DB connection
├── init_db.py # DB initialization
├── ingest.py # PDF ingestion pipeline
├── query.py # Query + retrieval
├── delete.py # Delete PDFs from store
├── utils.py # PDF reading utilities
│
├── VectorStore/ # Saved FAISS indexes
├── test.pdf
├── test2.pdf
└── config.py
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
- Install PostgreSQL
- Enable pgvector extension:
CREATE EXTENSION vector;
Chunking strategies are defined in chunking.py:
| Strategy | Description |
|---|---|
| fixed | Fixed-size chunks |
| semantic | Sentence similarity-based |
| recursive | Hierarchical splitting |
| sentence-chunking | Sentence grouping |
| hybrid-semantic | Semantic + recursive fallback |
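
As a rough illustration of two of the strategies in the table, here is a minimal sketch of fixed-size chunking and sentence grouping. The function names `fixed_chunks` and `sentence_chunks`, the `size` and `group` parameters, and the naive regex-based sentence split are all assumptions made for this example; the actual implementations in chunking.py may differ.

```python
import re


def fixed_chunks(text: str, size: int = 200) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def sentence_chunks(text: str, group: int = 3) -> list[str]:
    """Group consecutive sentences, `group` sentences per chunk."""
    if group <= 0:
        raise ValueError("group must be positive")
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [" ".join(sentences[i:i + group]) for i in range(0, len(sentences), group)]
```

Fixed-size chunks are cheap and predictable but can cut sentences in half, while sentence grouping keeps sentences intact at the cost of uneven chunk lengths.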
PDF → Text Extraction → Chunking → Embedding → Vector Store → Search
Run:
python ingest.py
Example:
def ingest(pdf_path: str, chunking_stratergy: str = "recursive"):
    pages = open_read_pdf(pdf_path)
    chunks = Chunking(chunking_stratergy, pages).chunk()
Each resulting chunk has the form:
{
"content": "...",
"page": 3,
"chunk_id": 12
}
Using:
SentenceTransformer("all-MiniLM-L6-v2")
- Normalized embeddings
- Batch processing supported
- In-memory
- Fast
- Saved locally in /VectorStore
- Persistent storage
- SQL filtering support
- Scalable
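
Because the embeddings are normalized, the cosine similarity between two vectors reduces to a plain dot product, so an inner-product search can act as a cosine-similarity search. The sketch below shows this with plain Python lists. The helper names `l2_normalize`, `dot`, and `cosine` are hypothetical and are not part of embedding.py or store.py.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a: list[float], b: list[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed from raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


raw_a, raw_b = [3.0, 4.0], [4.0, 3.0]
unit_a, unit_b = l2_normalize(raw_a), l2_normalize(raw_b)

# For unit vectors, the dot product equals the cosine similarity.
assert math.isclose(dot(unit_a, unit_b), cosine(raw_a, raw_b))
```

With sentence-transformers, the same normalization can be requested at encoding time by passing `normalize_embeddings=True` to `encode`, which also takes a `batch_size` argument. Whether embedding.py does it this way is an assumption.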
results = store.search(query_embedding, pdf_id, top_k=5)
{
"pdf_id": 1,
"chunk_id": 12,
"page": 3,
"content": "...",
"score": 0.87
}
Each chunk stores:
{
"pdf_id": int,
"chunk_id": int,
"page": int
}
The content field is stored separately, which:
- avoids redundancy
- enables clean citations
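
To make the metadata/content separation concrete, here is a minimal brute-force sketch that scores normalized vectors, filters by `pdf_id`, and joins the content back in only when building the result. The function `search_chunks` and the `contents` lookup keyed by `(pdf_id, chunk_id)` are assumptions for illustration, not the actual `store.search` implementation.

```python
def search_chunks(query, vectors, metadata, contents, pdf_id, top_k=5):
    """Rank chunks of one PDF by dot product with the query vector.

    vectors[i] belongs to metadata[i]; contents maps (pdf_id, chunk_id) to text.
    Vectors are assumed to be L2-normalized, so the score is cosine similarity.
    """
    scored = []
    for vec, meta in zip(vectors, metadata):
        if meta["pdf_id"] != pdf_id:
            continue  # restrict the search to the requested document
        score = sum(q * v for q, v in zip(query, vec))
        scored.append((score, meta))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Content is looked up here, not stored alongside every vector.
    return [
        {**meta, "content": contents[(meta["pdf_id"], meta["chunk_id"])], "score": score}
        for score, meta in scored[:top_k]
    ]


vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
metadata = [
    {"pdf_id": 1, "chunk_id": 0, "page": 1},
    {"pdf_id": 1, "chunk_id": 1, "page": 2},
    {"pdf_id": 2, "chunk_id": 0, "page": 1},
]
contents = {(1, 0): "alpha", (1, 1): "beta", (2, 0): "gamma"}

results = search_chunks([1.0, 0.0], vectors, metadata, contents, pdf_id=1, top_k=1)
```

Each result carries `page` and `chunk_id` alongside the text, which is what makes page-level citations straightforward.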
python delete.py
Supports:
- FAISS deletion (local files)
- PGVector deletion (DB rows)
- ✅ Global chunk_id per document
- ✅ Cosine similarity (normalized embeddings)
- ✅ Backend abstraction (FAISS / PGVector)
- ✅ Clean metadata separation
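
One way to picture the backend abstraction is a small interface that both a FAISS-backed and a PGVector-backed store could implement. The sketch below is hypothetical: the `VectorBackend` protocol, its method names, the `InMemoryBackend` stand-in, and `make_backend` are assumptions for illustration and do not reflect the actual classes in store.py.

```python
from typing import Protocol


class VectorBackend(Protocol):
    """Operations every backend is assumed to offer in this sketch."""

    def add(self, pdf_id: int, chunk_id: int, vector: list[float]) -> None: ...
    def search(self, query: list[float], pdf_id: int, top_k: int = 5) -> list[tuple[int, float]]: ...
    def delete(self, pdf_id: int) -> int: ...


class InMemoryBackend:
    """Brute-force stand-in for a real FAISS or PGVector backend."""

    def __init__(self) -> None:
        self._rows: list[tuple[int, int, list[float]]] = []

    def add(self, pdf_id: int, chunk_id: int, vector: list[float]) -> None:
        self._rows.append((pdf_id, chunk_id, vector))

    def search(self, query: list[float], pdf_id: int, top_k: int = 5) -> list[tuple[int, float]]:
        # Score only the rows of the requested PDF; returns (chunk_id, score) pairs.
        scored = [
            (cid, sum(q * v for q, v in zip(query, vec)))
            for pid, cid, vec in self._rows
            if pid == pdf_id
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]

    def delete(self, pdf_id: int) -> int:
        # Remove all rows of one PDF and report how many were dropped.
        before = len(self._rows)
        self._rows = [row for row in self._rows if row[0] != pdf_id]
        return before - len(self._rows)


def make_backend(name: str) -> VectorBackend:
    """Pick a backend by name; only the in-memory stand-in exists in this sketch."""
    if name == "memory":
        return InMemoryBackend()
    raise ValueError(f"unknown backend: {name}")


backend = make_backend("memory")
backend.add(1, 0, [1.0, 0.0])
backend.add(1, 1, [0.0, 1.0])
backend.add(2, 0, [1.0, 0.0])
```

Callers depend only on the protocol, so swapping the storage layer does not require changes to ingestion, query, or deletion code.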
- No reranking yet
- No hybrid search (BM25 + vector)
- Token estimation is approximate
- No streaming ingestion
- Chat with PDFs
- Semantic document search
- Knowledge base retrieval
- LLM grounding
Built as a modular RAG system for experimentation.
This project is licensed under the MIT License - see the LICENSE file for details.