Virtual Expo 2025

PULSE

Year Long Project CompSoc

Aim

To design and implement a modular voice-based assistant capable of performing automation tasks, executing context-aware responses using retrieval-augmented generation (RAG), and supporting naturalistic conversation with speech I/O, all tailored for a college environment.

Introduction

Voice-based AI assistants have transformed human-computer interaction by enabling intuitive access to services and automation. However, commercial solutions are resource-intensive and often closed-source. PULSE explores the possibility of building a decentralized, modular, and technically competent voice assistant leveraging LLMs, RAG pipelines, and speech APIs to address the productivity needs of students.

Literature survey 

Speech-to-Text and Text-to-Speech Systems

Google’s Speech-to-Text API is among the most accurate and low-latency offerings in production, based on WhisperNet-like architectures combining convolutional layers and transformer encoders with Connectionist Temporal Classification (CTC) decoders. These are covered extensively in literature, notably in “Whisper: Robust Speech Recognition via Large-Scale Weak Supervision” by Radford et al., OpenAI (2022), and “Conformer: Convolution-augmented Transformer for Speech Recognition” by Gulati et al., Google Research (2020).

Transformer Language Models

Mistral and CodeLlama represent the evolution of open-source transformer models tuned for task specialization. Mistral’s architecture follows decoder-only transformers with rotary positional embeddings and sparse attention routing, making it efficient yet capable of general understanding across domains. Thus, it builds off of “Attention is All You Need” (Vaswani et al., 2017) and subsequent derivatives like LLaMA and MPT.

CodeLlama is a finetuned variant of LLaMA-2 trained on code corpora and instruction-tuning paradigms. It is optimized for high-fidelity completions and chatting and can be traced to "Llama 2: Open Foundation and Fine-Tuned Chat Models" (Touvron, et al., 2023).

Retrieval-Augmented Generation

RAG is a paradigm that enhances LLMs by grounding them in retrieved knowledge. The approach combines vector search (typically via FAISS or other approximate nearest neighbor methods) with prompt-based augmentation. Key literature includes “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (Lewis et al., Facebook AI, 2020) and “Dense Passage Retrieval for Open-Domain Question Answering” (Karpukhin et al., 2020).

 

Technologies Used

  • WhisperNet STT Model: Real-time audio transcription.
  • Groq TTS API: Converts assistant responses to natural-sounding speech.
  • Mistral 3B: Multi-role LLM used for orchestration logic, and powering the RAG agent.
  • CodeLlama 7B: Specialized in code and bash script generation.
  • FAISS: High-speed vector similarity search used in the RAG pipeline.
  • RAG Chat: Custom-built QA system for document-based question answering.
  • Multimodal Parser: Extracts content from formats like PDFs, PPTs, Markdown, etc., for embedding.

Methodology

System Architecture

PULSE Architecture

1. User Input: Microphone captures the user's voice.

2. Audio Processing: WhisperNet STT transcribes spoken commands.

3. Orchestration Layer: Mistral interprets intent and routes to appropriate downstream agents.

4. Agent Dispatch:

  • Code Agent (CodeLlama): For writing simple code snippets.
  • Bash Agent (CodeLlama): For generating shell commands.
  • RAG Agent (Mistral): For answering queries using contextual retrieval.

 5. Output Rendering: Groq TTS synthesizes the response to audio.

Model Architectures

WhisperNet Architecture

Whisper STT Model

  • WhisperNet- usess deep neural speech recognizers
  • Architecture: Convolutional front-end + transformer-based encoder-decoder
  • Features: Real-time streaming, multilingual, noise robustness, word alignment
  • Use: Entry point for user interaction enabling continuous voice input

Mistral 3B

  • Architecture: Decoder-only dense transformer with rotary positional embeddings
  • Size: 3B parameters
  • Tokenizer: SentencePiece
  • Pretraining Data: Web, coding, instructions; fine-tuned with alignment and RLHF
  • Roles in PULSE:
    • Task classification using zero-shot prompting
    • RAG response generation using embedded documents
  • Special Note: Used in both orchestrator and generative roles (unified LLM framework)

CodeLlama 7B

  • Architecture: Decoder-only LLaMA-2 derived transformer
  • Tokenizer: Byte Pair Encoding (BPE)
  • Training: Source code (C/C++, Python, Bash); instruction-tuned
  • Roles in PULSE:
    • Code agent for programming task completion
    • Bash agent for CLI automation
  • Execution Strategy: Prompt-based generation with deterministic control tokens

RAG Pipeline (Mistral-powered)

  • Parser: Extracts structured text from PDFs, DOCX, PPT, etc.
  • Chunking: 512-token windows with overlap
  • Embedding: MiniLM via SentenceTransformers
  • Retrieval: FAISS flat cosine similarity index
  • Response Generation:
    • Mistral receives [Top-k Chunks + User Query]
    • Generates grounded answers with source awareness

Results

  • Achieved a working smart speech driven, agent powered RAG system.
  • Multiple agents to execute some basic tasks.
  • RAG to parse a directory of documents, pdfs, presentations, slides decks, etc. and can chat fairly well with them.
  • STT/TTS works fairly clearly, when using proper enunciation. Can interpret uncommon words from specific contexts as well.

Conclusions

PULSE demonstrates the feasibility of a robust, real-time AI assistant using speech pipelines, retrieval-based question answering, and open-source LLMs. With each agent modularized and tied to task-specialized models like CodeLlama and Mistral, the system achieves some accuracy and speed. The architecture reflects modern advances in transformer models and demonstrates how STT and TTS stacks can be integrated with intelligent task routing to create efficient, human-friendly systems.

Future Scope

  • Frontend Integration: Web-based interface for visual context and accessibility
  • Bash Agent Contextualization: Include Linux man pages as prompt context
  • Dockerization: Modular containers with REST APIs
  • More Agents: schedule assistant, web query handler, etc.

References

 

Mentors mentees details

Mentees

Mentors

Report Information

Explore More Projects

View All 2025 Projects