Parrotlet-a-en-5b: Releasing our purpose-built LLM for English ASR in Indian Healthcare

July 24, 2025

Medical scribe systems can dramatically accelerate India's healthcare digitization by eliminating the documentation burden that currently hinders adoption. These systems can automatically convert doctor-patient conversations into structured digital records in real-time, removing the need for manual data entry that healthcare providers find time-consuming and disruptive to patient care.

One of the fundamental challenges in automatic speech recognition (ASR) applications is handling out-of-vocabulary words, a problem that becomes particularly acute in medical contexts. While reasonably accurate ASR models exist, including open-weight alternatives, their performance on clinically rich real-world consultations remains significantly limited. The challenge intensifies in the Indian context due to the prevalence of localised medical vocabulary, such as branded drug names and colloquial expressions of medical concepts. There is a clear need for specialised ASR models that can accurately recognise medical vocabulary within India's healthcare environment.

In this blog, we introduce one of our latest developments, a Speech Large Language Model (LLM) specifically designed for transcribing medical speech in Indian English. We are open-sourcing this specialised model to advance healthcare digitisation across the country. Our architecture combines a robust speech encoder with a custom multimodal projection layer and a domain-specific language decoder, enabling accurate and context-aware transcription of clinical conversations, medical dictations, and patient interactions. This represents a crucial step toward making healthcare more accessible, efficient, and digitally integrated across India's diverse medical landscape.

Motivation for a Multimodal Approach

Domain Adaptation Without Speech Data: The multimodal approach enables training on extensive medical text corpora without requiring corresponding audio data. This is particularly valuable given the scarcity and expense of creating high-quality medical speech datasets, especially for low-resource languages in India. By leveraging text-only medical data during fine-tuning, we can incorporate vast amounts of domain knowledge that would be impractical to collect in audio format.

Pre-trained Domain Expertise: The architecture allows us to utilize existing domain-specific language models as decoders. Our implementation employs MedGemma 3 4B IT, which has been pre-trained on over 2.5 billion tokens of medical text. This pre-trained medical knowledge significantly enhances transcription accuracy for clinical terminology without starting from scratch.

Multilingual Medical Coverage: A multilingual decoder can handle multiple Indian languages while leveraging the fact that medical terminology is predominantly English-based. This unified approach provides comprehensive coverage across linguistic boundaries, making it particularly effective for India's multilingual healthcare environment, where practitioners often code-switch between local languages and English medical terms. However, in this release we train and evaluate only on English datasets.

By training only a lightweight projection layer between the frozen speech encoder (Whisper V3 Large) and the pre-trained medical decoder, Parrotlet-A achieves high accuracy with minimal computational overhead while addressing the unique challenges of Indian medical transcription.

Model Architecture

Parrotlet-a-en is engineered with a multimodal architecture comprising three core components:

  1. Speech Encoder
    The speech encoder is built upon OpenAI’s Whisper V3 Large, a state-of-the-art multilingual model renowned for its high-capacity audio processing. This encoder has been widely adopted in advanced multimodal systems, including Qwen-Audio by Alibaba, Shuka by Sarvam AI, LLaMA-Omni by ICT Lab, and Audio Flamingo by NVIDIA, owing to its robust ability to extract intricate speech features across diverse linguistic contexts, such as Indian English accents.
  2. Audio Projector
    A lightweight projection layer serves as the bridge between the speech encoder and the language decoder. Employing a hybrid architecture that combines convolutional layers with a multi-layer perceptron, this component efficiently transforms temporal speech representations into language-compatible vector embeddings, ensuring seamless cross-modal integration with minimal computational overhead. The projector has roughly 30M trainable parameters; a sketch of this design appears after this list.
  3. Decoder
    The language decoder leverages MedGemma 3 4B IT, a specialised variant of the Gemma 3 series of large language models, selected for its strong performance on medical and multilingual tasks. MedGemma 3 is built on the versatile Gemma 3 architecture, which has been successfully adapted by several companies; for example, Sarvam AI's Sarvam-Translate, a state-of-the-art translation model for Indian languages, was built by fine-tuning it, demonstrating its efficacy for domain-specific applications.
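To make the design concrete, here is a minimal PyTorch sketch of a convolution-plus-MLP projector in the spirit of the component described above. The layer widths, kernel sizes, and strides are illustrative assumptions (chosen so the parameter count lands near the ~30M figure mentioned above), not the released configuration.

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Bridges Whisper encoder frames to the decoder's embedding space.

    All hyperparameters below are illustrative assumptions, not the
    released configuration.
    """

    def __init__(self, encoder_dim=1280, decoder_dim=2560, hidden_dim=2048):
        super().__init__()
        # Strided 1D convolutions downsample the Whisper frame sequence (4x here),
        # so the decoder sees a shorter sequence of audio tokens.
        self.conv = nn.Sequential(
            nn.Conv1d(encoder_dim, hidden_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
        )
        # An MLP maps the downsampled features into the decoder embedding space.
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, decoder_dim),
        )

    def forward(self, encoder_states: torch.Tensor) -> torch.Tensor:
        # encoder_states: (batch, frames, encoder_dim) from the Whisper V3 Large encoder
        x = self.conv(encoder_states.transpose(1, 2)).transpose(1, 2)
        return self.mlp(x)  # (batch, roughly frames // 4, decoder_dim)
```

With these assumed widths the module has about 30M trainable parameters, and its outputs are consumed by the decoder alongside the text token embeddings.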

Training Methodology

Stage 1: Continual pretraining of individual components

  • Speech Encoder: We begin by fine-tuning the Whisper V3 Large encoder on 1,800 hours of labelled audio data with a specific focus on medical content. Indian English accounts for more than 1,000 of these hours.
  • Decoder: We fine-tuned the decoder component of MedGemma 3 4B IT on a corpus of more than 2.5B tokens, including anonymised and de-identified clinical notes, medical textbooks, and medical terminologies (specifically medication and generic names). Even though MedGemma 3 is already fine-tuned on medical content, it did not understand localised concepts such as medication brand names well. Our continual pretraining (CPT) process further improves its understanding and generation accuracy for clinical terminology, specifically drug and generic names.

For the CPT stage we use the parameter-efficient LoRA technique with a rank of 128 and an alpha of 32. In addition to the LoRA adapters, we also fully train the text embedding layer and the language modelling head. We use a learning rate of 1e-5 with cosine decay and bf16 precision. The LoRA adapter layers are later merged into the base model.
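For reference, below is a minimal sketch of this decoder CPT recipe using the Hugging Face transformers and peft libraries. The checkpoint ID, target modules, and trainer wiring are assumptions for illustration, not our exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

# Assumed checkpoint ID; the exact loading class for MedGemma may differ.
base = AutoModelForCausalLM.from_pretrained(
    "google/medgemma-4b-it", torch_dtype=torch.bfloat16
)

lora_cfg = LoraConfig(
    r=128,                                  # LoRA rank
    lora_alpha=32,                          # LoRA scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    modules_to_save=["embed_tokens", "lm_head"],  # train embeddings and LM head fully
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)

training_args = TrainingArguments(
    output_dir="medgemma-cpt",
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    bf16=True,
)

# ... run a Trainer over the text-only medical corpus with training_args ...

# After training, fold the LoRA adapters back into the base weights.
merged = model.merge_and_unload()
```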

Stage 2: Projector Training

  • Dataset: We trained the projector using 100+ hours of Indian-accented English medical speech, carefully transcribed and aligned at the utterance level.

  • Setup: During this stage, the encoder and decoder layers are frozen; only the projector layer is trained to bridge the modalities effectively (see the sketch below).
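As an illustration, the snippet below shows the kind of parameter-freezing setup this stage uses; the model class and attribute names (encoder, projector, decoder) are hypothetical.

```python
def freeze_all_but_projector(model) -> None:
    """Freeze the speech encoder and the merged decoder; train only the projector.

    `model.encoder`, `model.projector`, and `model.decoder` are hypothetical
    attribute names used for illustration.
    """
    for p in model.encoder.parameters():
        p.requires_grad = False
    for p in model.decoder.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable parameters: {trainable / 1e6:.1f}M")  # expected to be ~30M
```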

Evaluation & Results

Evaluation of this model and benchmarking against other state-of-the-art models were done using the KARMA evaluation toolkit. The error rates reported here are computed over aggregated word and character counts across the entire dataset, not as the mean of per-file metrics.
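The snippet below illustrates the difference between the two aggregation schemes using the jiwer library; the example utterances are made up, and the KARMA toolkit's internals may differ.

```python
import jiwer

refs = ["patient was prescribed pantoprazole forty milligrams", "follow up in two weeks"]
hyps = ["patient was prescribed pantoprazol forty milligram", "follow up in two weeks"]

# Aggregated (corpus-level) WER: total errors divided by total reference words.
corpus_wer = jiwer.wer(refs, hyps)

# Mean of per-file WERs: weights every file equally regardless of its length.
per_file_mean = sum(jiwer.wer(r, h) for r, h in zip(refs, hyps)) / len(refs)

print(f"aggregated WER: {corpus_wer:.3f}, mean of per-file WERs: {per_file_mean:.3f}")
```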

WER: word error rate, CER: character error rate, semWER: semantic WER, kwWER: medical entity keyword WER

semWER and kwWER are explained in depth in this blog.

Conclusion

The evaluation results for Parrotlet-a-en-5b, derived from the Eka Med ASR English benchmark, demonstrate impressive performance with a Word Error Rate (WER) of 0.109, Character Error Rate (CER) of 0.047, Semantic WER (semWER) of 0.072, and Medical Entity Keyword WER (kwWER) of 0.062. This outperforms established models such as GPT-4o, Gemini 2.0/2.5 Flash, and Eleven Labs. Our model effectively handles Indian English accents and complex medical terminology, including branded drug names. The model maintains low error rates for semantic content and medical keywords, ensuring accurate clinical documentation.

A key advantage is its efficient design, training only a lightweight projection layer while leveraging pretrained components, which minimizes computational demands and supports real-time application even in resource-constrained settings. By open-sourcing Parrotlet-a-en-5b, we invite the research and development community to build upon these models, developing innovative applications to improve patient care. In addition to this model, we also make our EkaScribe application available through an API.