Samvaad

Problem Statement
We have all been in situations where, even after reading documents, uploading them to LLMs, and querying, some topics just don't click. In those moments, we often turn to friends, colleagues, or peers, who can explain things through voice dialogue, in our own native language.
However, not everyone has access to such a support system at all times, and traditional search doesn't fill the gap.
Solution
To address this problem, I built Samvaad, a (primarily) voice-based conversational AI system that ingests documents (PDFs, Docs, etc.) and lets users hold dialogues with them in their native tongue.
What's Unique About This?
You might think: what's special about this? Just get API keys for services that handle chunking, embedding, storage, ASR, TTS, and generation, and everything is set.
But that's not the case here. I have deliberately used only lightweight open-source models that can run even on low-end hardware with no GPUs.
Architecture
- Document Parsing and Chunking: I used `Docling` to parse and chunk various document formats.
- Embedding: For generating chunk and query embeddings, I used the q4 version of the `onnx-community/embeddinggemma-300m-ONNX` model.
- Vector Storage: I used `ChromaDB` to store the chunk embeddings and perform similarity search.
- ASR: For Automatic Speech Recognition, I used the `Systran/faster-whisper-small` model.
- TTS: For Text-to-Speech, I used the q8 version of the `Systran/faster-whisper-small` model.
- LLM: For response generation, I used the Gemini API.
- Reranking: I used the `sentence-transformers/all-MiniLM-L6-v2` model to rerank retrieved chunks and improve response accuracy.
- Search Algorithm: I used BM25 in combination with semantic search to improve retrieval quality.
Note: No LangChain, LlamaIndex, or any such framework is used here. Everything is built from scratch to keep it lightweight and efficient.
Problems Faced
- During very early development, I was using `BAAI/bge-m3` for embeddings, and it took too much time to ingest documents. I then switched to `onnx-community/embeddinggemma-300m-ONNX`, which reduced ingestion time by a third.
- I have a resource-constrained setup, so running models efficiently was a challenge, which I overcame through quantization and by using ONNX versions of the models.
- For any voice interaction, latency is a key factor, and optimizing the pipeline to reduce end-to-end latency was tough. It's still not good enough to be called real-time, and I am working on it.
- Since my number one objective is to use open-source models, finding the right models that balance performance, accuracy, and latency was a challenge.
- The LLMs return answers in Markdown. Rendering Markdown through TTS while ensuring correct pronunciation of code snippets, special characters, etc. was tricky.
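One way to handle the Markdown-in-TTS problem is to normalize the answer to plain prose before synthesis. This is a minimal sketch under my own assumptions (the function name and the "code snippet omitted" placeholder are illustrative, not Samvaad's actual pipeline):

```python
import re

# Built programmatically to avoid a literal triple-backtick in this document.
FENCE = "`" * 3


def markdown_to_speech_text(md: str) -> str:
    """Strip common Markdown syntax so a TTS engine reads plain prose."""
    # Replace fenced code blocks with a placeholder the TTS can pronounce.
    fence_pattern = re.escape(FENCE) + r".*?" + re.escape(FENCE)
    text = re.sub(fence_pattern, " code snippet omitted ", md, flags=re.DOTALL)
    # Keep link text, drop the URL: [text](url) -> text.
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)
    # Drop heading markers at the start of lines.
    text = re.sub(r"^#+\s*", "", text, flags=re.MULTILINE)
    # Drop emphasis markers and inline-code backticks.
    text = re.sub(r"[*_`]", "", text)
    # Collapse whitespace for smoother speech.
    return re.sub(r"\s+", " ", text).strip()
```

A fuller version would also expand symbols the voice mispronounces (e.g. reading `->` as "arrow"), but even this level of cleanup prevents the engine from speaking raw `#` and `*` characters.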
Learnings
Before starting this project, I only knew what RAG stands for.
Since I started Samvaad, I learned:
- How to design and orchestrate a RAG system for specific problems.
- How to optimize a RAG system for high recall and accuracy.
- How to write tests (PyTest) and evals.
- How to find the right models for different tasks.
- How to work with ASR and TTS.
Future Vision
Immediate Plans
- Build a frontend and deploy it so anyone, anywhere can use it.
- Optimize the pipeline to reduce latency.
- Add support for more Indic languages.
Long-term Plans
- Build mobile apps for Android and iOS.
Some Screenshots


