📄 RAG Pipeline (PDF → Chunking → Embedding → Vector Search)

A modular Retrieval-Augmented Generation (RAG) pipeline that processes PDFs, generates embeddings, and supports semantic search using FAISS (in-memory) or PGVector (PostgreSQL).

🚀 Features

📚 PDF ingestion and parsing
✂️ Multiple chunking strategies
🧠 SentenceTransformer embeddings
⚡ Fast similarity search with FAISS
🗄️ Persistent storage using PGVector
🔄 Hybrid-ready architecture (extendable to reranking, BM25, etc.)

📂 Project Structure

RaG_Project/
│
├── chunking.py        # Text chunking strategies
├── embedding.py       # Embedding model wrapper
├── store.py           # Vector store (FAISS + PGVector)
├── models.py          # SQLAlchemy models
├── database.py        # DB connection
├── init_db.py         # DB initialization
├── ingest.py          # PDF ingestion pipeline
├── query.py           # Query + retrieval
├── delete.py          # Delete PDFs from store
├── utils.py           # PDF reading utilities
│
├── VectorStore/       # Saved FAISS indexes
│   ├── test.pdf
│   └── test2.pdf
└── config.py

⚙️ Installation

1. Create virtual environment

python -m venv .venv
source .venv/bin/activate

2. Install dependencies

pip install -r requirements.txt

3. (Optional) Set up PostgreSQL + PGVector

Install PostgreSQL
Enable the pgvector extension in the target database:

CREATE EXTENSION IF NOT EXISTS vector;

🧩 Chunking Strategies

Defined in chunking.py:

Strategy	Description
`fixed`	Fixed-size chunks
`semantic`	Sentence similarity-based
`recursive`	Hierarchical splitting
`sentence-chunking`	Sentence grouping
`hybrid-semantic`	Semantic + recursive fallback
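
As an illustration of the simplest strategy in the table, here is a minimal sketch of fixed-size chunking with overlap. The function name `fixed_chunks` and the `size` and `overlap` parameters are assumptions made for this example; the actual implementation in chunking.py may differ (for instance, it may count tokens rather than characters).

```python
def fixed_chunks(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into character windows of `size`, each sharing
    `overlap` characters with the previous window (hypothetical helper)."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once the current window has reached the end of the text
        if start + size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in fixed_chunks("abcdefghij", size=4, overlap=1):
        print(chunk)
```

Overlap keeps a little shared context between neighbouring chunks, at the cost of storing some text twice.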

🔄 Pipeline Flow

PDF → Text Extraction → Chunking → Embedding → Vector Store → Search

📥 Ingestion

Run:

python ingest.py

Example:

def ingest(pdf_path: str, chunking_stratergy: str = "recursive"):
    # Read the PDF and extract its text
    pages = open_read_pdf(pdf_path)
    # Split the extracted text using the selected strategy
    chunks = Chunking(chunking_stratergy, pages).chunk()

Output format:

{
    "content": "...",
    "page": 3,
    "chunk_id": 12
}
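
A minimal sketch of how records in this shape could be assembled, assuming the chunker yields a list of text chunks per page and that `chunk_id` runs continuously across the whole document. The helper name `build_records`, the page-keyed input dict, and the zero-based numbering are assumptions for illustration, not the project's confirmed behaviour.

```python
from itertools import count


def build_records(chunks_by_page: dict[int, list[str]]) -> list[dict]:
    """Turn {page_number: [chunk_text, ...]} into chunk records
    with a single chunk_id sequence for the whole document."""
    next_id = count()  # one counter shared by all pages (assumed zero-based)
    records = []
    for page in sorted(chunks_by_page):
        for content in chunks_by_page[page]:
            records.append(
                {"content": content, "page": page, "chunk_id": next(next_id)}
            )
    return records


if __name__ == "__main__":
    for record in build_records({1: ["intro", "background"], 2: ["method"]}):
        print(record)
```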

🧠 Embeddings

Using:

SentenceTransformer("all-MiniLM-L6-v2")

Normalized embeddings
Batch processing supported
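
The two points above can be sketched as follows. `l2_normalize` is a hypothetical pure-Python helper that shows what "normalized" means (each vector is scaled to unit length). `embed` is an assumed wrapper, not necessarily how embedding.py is written; it relies on the `batch_size` and `normalize_embeddings` arguments of `SentenceTransformer.encode`.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def embed(texts: list[str], batch_size: int = 32):
    """Encode texts in batches and return unit-length embeddings.

    Assumed wrapper: the library is imported lazily so that the helper
    above stays usable without sentence-transformers installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(texts, batch_size=batch_size, normalize_embeddings=True)


if __name__ == "__main__":
    print(l2_normalize([3.0, 4.0]))
```

In a real pipeline the model would be loaded once and reused, rather than on every call as in this sketch.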

🗄️ Vector Store

1. FAISS (default)

In-memory
Fast
Saved locally in VectorStore/

2. PGVector

Persistent storage
SQL filtering support
Scalable

🔍 Search

results = store.search(query_embedding, pdf_id, top_k=5)

Output:

{
    "pdf_id": 1,
    "chunk_id": 12,
    "page": 3,
    "content": "...",
    "score": 0.87
}
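
To make the result shape concrete, here is a brute-force sketch of a per-document top-k search over normalized embeddings. It is not the FAISS or PGVector code path. The free function `search`, the `records` list, and its `embedding` field are assumptions for this example, whereas the project exposes search as a method on the store.

```python
import heapq


def dot(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def search(query_embedding, records, pdf_id, top_k=5):
    """Return the top_k highest-scoring chunks that belong to pdf_id."""
    scored = [
        (dot(query_embedding, record["embedding"]), record)
        for record in records
        if record["pdf_id"] == pdf_id  # restrict the search to one document
    ]
    best = heapq.nlargest(top_k, scored, key=lambda pair: pair[0])
    return [
        {
            "pdf_id": record["pdf_id"],
            "chunk_id": record["chunk_id"],
            "page": record["page"],
            "content": record["content"],
            "score": score,
        }
        for score, record in best
    ]


if __name__ == "__main__":
    demo = [
        {"pdf_id": 1, "chunk_id": 0, "page": 1, "content": "a", "embedding": [1.0, 0.0]},
        {"pdf_id": 1, "chunk_id": 1, "page": 2, "content": "b", "embedding": [0.0, 1.0]},
    ]
    print(search([1.0, 0.0], demo, pdf_id=1, top_k=1))
```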

💾 Metadata Design

Each chunk stores:

{
    "pdf_id": int,
    "chunk_id": int,
    "page": int
}

Chunk content is stored separately from this metadata, which:
avoids redundancy
enables clean citations (by page and chunk)

🧹 Delete Data

python delete.py

Supports:

FAISS deletion (local files)
PGVector deletion (DB rows)

📌 Key Design Decisions

✅ Global chunk_id per document
✅ Cosine similarity (normalized embeddings)
✅ Backend abstraction (FAISS / PGVector)
✅ Clean metadata separation
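
One way to picture the backend abstraction is a small interface that both backends satisfy, so calling code never needs to know which one is in use. The sketch below is an assumption about the shape of such an interface, not the contents of store.py: the `VectorStore` protocol, `InMemoryStore`, and `top_chunk_ids` are hypothetical names, and the in-memory class stands in for a real FAISS or PGVector backend.

```python
from typing import Protocol


class VectorStore(Protocol):
    """Hypothetical interface shared by all backends."""

    def add(self, pdf_id: int, chunk_id: int, embedding: list[float]) -> None: ...
    def search(self, query_embedding: list[float], pdf_id: int,
               top_k: int = 5) -> list[tuple[int, float]]: ...
    def delete(self, pdf_id: int) -> int: ...


class InMemoryStore:
    """Toy backend used only to illustrate the interface."""

    def __init__(self) -> None:
        self._rows: list[tuple[int, int, list[float]]] = []

    def add(self, pdf_id: int, chunk_id: int, embedding: list[float]) -> None:
        self._rows.append((pdf_id, chunk_id, embedding))

    def search(self, query_embedding, pdf_id, top_k=5):
        # Dot product of unit vectors, i.e. cosine similarity
        scored = [
            (chunk_id, sum(q * e for q, e in zip(query_embedding, emb)))
            for pid, chunk_id, emb in self._rows
            if pid == pdf_id
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]

    def delete(self, pdf_id: int) -> int:
        """Remove every row of a document and return how many were removed."""
        before = len(self._rows)
        self._rows = [row for row in self._rows if row[0] != pdf_id]
        return before - len(self._rows)


def top_chunk_ids(store: VectorStore, query_embedding, pdf_id, top_k=5):
    """Works with any backend that satisfies the VectorStore protocol."""
    return [chunk_id for chunk_id, _ in store.search(query_embedding, pdf_id, top_k)]


if __name__ == "__main__":
    store = InMemoryStore()
    store.add(1, 0, [1.0, 0.0])
    print(top_chunk_ids(store, [1.0, 0.0], pdf_id=1))
```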

⚠️ Known Limitations

No reranking yet
No hybrid search (BM25 + vector)
Token estimation is approximate
No streaming ingestion
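
On the token estimation point: this README does not describe how tokens are estimated. As an illustration of why such estimates are only approximate, the sketch below uses a common rule of thumb for English text of roughly four characters per token. The function `estimate_tokens` and the `chars_per_token` default are assumptions, not the project's actual estimator.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count from character length (heuristic, not a tokenizer)."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


if __name__ == "__main__":
    print(estimate_tokens("Retrieval-Augmented Generation"))
```

A heuristic like this can be noticeably off for code, non-English text, or unusual vocabulary. Counting with the embedding model's own tokenizer would be exact, at the cost of extra processing.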

🧠 Example Use Cases

Chat with PDFs
Semantic document search
Knowledge base retrieval
LLM grounding

👨‍💻 Author

Built as a modular RAG system for experimentation.

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.
