Beyond traditional WER: The critical need for semantic WER in ASR for Indian healthcare

July 24, 2025

Automatic Speech Recognition (ASR) systems have made remarkable strides in recent years, but evaluating their performance using traditional Word Error Rate (WER) often falls short in specialized domains. This is particularly evident in Indian healthcare settings, where linguistic diversity, medical terminology, and domain-specific requirements create unique challenges that conventional metrics fail to capture.

At Eka, our efforts to build ASR solutions for Indian healthcare include developing evaluation metrics that bring semantic understanding into WER calculations. We propose two modified WER metrics: Semantic WER and entity/keyword WER (Keyword WER). In this post, we'll explore why Semantic WER and Keyword WER are not just useful additions but essential metrics for building robust ASR systems in Indian medical contexts.

The limitations of traditional WER

Traditional WER treats all words equally, counting substitutions, deletions, and insertions without considering the semantic or functional importance of different words. Consider this example from a Hindi medical consultation:

Reference: "मरीज़ को 375एमजी की गोली दिन में तीन बार लेनी है"
Hypothesis: "मरीज़ को तीन सौ पचहत्तर मिलीग्राम की गोली दिन में 3 बार लेनी है"

Traditional WER would show significant errors due to:

  • "375एमजी" vs "तीन सौ पचहत्तर मिलीग्राम" (different number formats)
  • "एमजी" vs "मिलीग्राम" (abbreviation vs full form)
  • "तीन" vs "3" (word vs digit)

Despite these "errors," the semantic meaning is identical – the dosage, frequency, and medication remain perfectly understood. A traditional WER of 40-50% would severely underestimate the system's actual performance in conveying critical medical information.
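To make the figure concrete, the following is a minimal sketch of a conventional word-level WER computed with a Levenshtein alignment. Whitespace tokenization and the function names `word_errors` and `wer` are assumptions made for illustration; production toolkits may tokenize differently.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Minimum number of word substitutions, deletions and insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))          # distances for the empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word errors divided by the number of reference words."""
    return word_errors(reference, hypothesis) / len(reference.split())


reference = "मरीज़ को 375एमजी की गोली दिन में तीन बार लेनी है"
hypothesis = "मरीज़ को तीन सौ पचहत्तर मिलीग्राम की गोली दिन में 3 बार लेनी है"
print(word_errors(reference, hypothesis), round(wer(reference, hypothesis), 3))
```

With whitespace tokenization this yields 5 edits over 11 reference words, roughly 45%, even though every clinically relevant detail is preserved.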

Semantic WER: Understanding intent over form

Semantic WER addresses the core limitation of traditional metrics by evaluating meaning rather than exact word matches. Our implementation leverages traditional and advanced normalization techniques that address:

Number and unit variations

At its core, Semantic WER aims to remove mismatches that arise only because the same number or unit is written in different ways.

Indian speakers often use a mix of:

  • Devanagari numerals (३७५) and Arabic numerals (375)
  • Spelled-out forms ("तीन सौ पचहत्तर") and numeric forms ("375")
  • Abbreviations ("मि.ग्रा.", "एमजी") and full forms ("मिलीग्राम")

Example:

Reference: "BP 120/80 है"

Hypothesis: "बीपी एक सौ बीस बटा अस्सी है"

The traditional WER for this example is 60%, indicating significant surface-level mismatches. However, the Semantic WER is 0%, reflecting perfect semantic alignment.
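Part of this variation can be removed with simple canonicalization. The sketch below maps Devanagari digits to ASCII digits, splits a number from a unit glued to it, and maps unit abbreviations to one full form. The `UNIT_FORMS` table and the function name `canonicalize` are hypothetical, and converting spelled-out numbers such as "तीन सौ पचहत्तर" is deliberately left out because it needs a full number grammar.

```python
import re

# Devanagari digits mapped to their ASCII counterparts.
DEVANAGARI_DIGITS = str.maketrans("०१२३४५६७८९", "0123456789")

# Hypothetical table of unit abbreviations and their canonical full form.
UNIT_FORMS = {"एमजी": "मिलीग्राम", "मि.ग्रा.": "मिलीग्राम", "mg": "मिलीग्राम"}


def canonicalize(text: str) -> str:
    """Normalize digit script and unit spelling; leave everything else as is."""
    text = text.translate(DEVANAGARI_DIGITS)
    # Insert a space between a digit and a directly attached unit ("375एमजी"),
    # but keep "120/80" and decimals such as "14117.4" intact.
    text = re.sub(r"(\d)(?=[^\d\s./])", r"\1 ", text)
    return " ".join(UNIT_FORMS.get(token, token) for token in text.split())


print(canonicalize("375एमजी"))
print(canonicalize("३७५ मि.ग्रा."))
```

Both calls produce the same canonical string, so the two surface forms no longer count as a mismatch.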

Orthographic Normalization for Indic Text

ASR outputs in Indic scripts often vary due to optional diacritics, multiple Unicode representations, or inconsistent spellings. These inconsistencies can cause traditional WER metrics to overestimate errors, even when the intended meaning is preserved. To ensure a fair semantic comparison, we support the following normalization techniques as a pre-processing step:

  • Nukta Removal: e.g., "ज़्यादा" → "ज्यादा"

  • Bindi/Chandrabindu Normalization: treating nasal markers consistently

  • Zero-width joiner/non-joiner removal: e.g., invisible characters that affect rendering but not semantics

  • Unicode normalization (NFC): to collapse decomposed forms

This reduces false mismatches and enables accurate CER computation even when ASR systems output slightly different spellings or orthographic forms.
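The four steps above can be sketched with Python's standard `unicodedata` module. This is an illustrative sketch rather than the toolkit's own code: the function name `normalize_indic` is hypothetical, and mapping chandrabindu to anusvara is one assumed way of "treating nasal markers consistently". The text is decomposed first so that a nukta stored inside a precomposed letter can also be stripped.

```python
import unicodedata

NUKTA = "\u093c"                      # DEVANAGARI SIGN NUKTA
ZERO_WIDTH = {"\u200c", "\u200d"}     # ZWNJ and ZWJ
CHANDRABINDU, ANUSVARA = "\u0901", "\u0902"


def normalize_indic(text: str) -> str:
    """Apply nukta removal, nasal-marker unification, ZWJ/ZWNJ removal and NFC."""
    # Decompose so that precomposed nukta letters expose a separate nukta sign.
    decomposed = unicodedata.normalize("NFD", text)
    kept = []
    for ch in decomposed:
        if ch == NUKTA or ch in ZERO_WIDTH:
            continue                                  # drop nukta and joiners
        kept.append(ANUSVARA if ch == CHANDRABINDU else ch)
    # Recompose so that equivalent sequences share one canonical form.
    return unicodedata.normalize("NFC", "".join(kept))


print(normalize_indic("ज़्यादा") == normalize_indic("ज्यादा"))
```

Applying the same function to both reference and hypothesis before alignment keeps the comparison symmetric.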

Code-switching normalization

Indian medical conversations frequently mix Hindi and English:

Reference: "patient को diabetes की medicine देनी है"

Hypothesis: “पेशेंट को डायबिटीज की मेडिसिन देनी है”

Once English words are transliterated into the local script as a pre-processing step, the WER for this pair is 0%.
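As a toy illustration of this pre-processing step, the sketch below replaces Latin-script words using a lookup table. The `LEXICON` entries and the function name `transliterate_latin_words` are hypothetical; a real system would rely on a transliteration model or a much larger lexicon rather than three hand-written entries.

```python
import re

# Hypothetical lexicon mapping English words to a Devanagari rendering.
LEXICON = {
    "patient": "पेशेंट",
    "diabetes": "डायबिटीज",
    "medicine": "मेडिसिन",
}


def transliterate_latin_words(text: str) -> str:
    """Replace known Latin-script words; unknown words are left unchanged."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        return LEXICON.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", replace, text)


reference = "patient को diabetes की medicine देनी है"
hypothesis = "पेशेंट को डायबिटीज की मेडिसिन देनी है"
print(transliterate_latin_words(reference) == hypothesis)
```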

Keyword WER: Focusing on what matters most

While Semantic WER improves overall evaluation, medical ASR systems require even more targeted assessment. Critical medical information often resides in specific keywords such as drug names, dosages, vital signs, symptoms, and diagnoses. Missing or misrecognizing these keywords can have life-threatening consequences.

Drug names and dosages

Consider this prescription scenario:

Reference: “स्टारक्लेव 625mg 10 टैबलेट्स की स्ट्रिप”

Keywords: [“स्टारक्लेव 625mg”, “10 टैबलेट्स”]

Hypothesis: “स्टारक्लाव छह सौ पच्चीस एमजी टेन टैबलेट्स की स्ट्रिप”

Comparing the metrics on this example shows why evaluating both semantic equivalence and keyword accuracy is vital for clinical safety:

  • Regular WER would be 50%, corresponding to the three erroneous words from the reference: “स्टारक्लेव”, “625mg”, and “10”.
  • Semantic WER would consider “625mg” to be semantically equivalent to “छह सौ पच्चीस एमजी” and “10 टैबलेट्स” to be semantically equivalent to “टेन टैबलेट्स”, leading to an error rate of 8.3%.
  • However, 8.3% underestimates the critical error in the transcript, which is the mistranscription of the drug name “स्टारक्लेव”.
  • A keyword error rate of 33% (1 out of 3 keywords is wrong) is a fairer estimate of the error.

Vital signs and measurements

Medical consultations are rich with numerical data that traditional WER handles poorly. Consider the following example:

Reference: VITAMIN E, the value is 14117.4 mg/dL

Hypothesis: Vitamin E, the value is fourteen thousand one hundred seventeen point four mg per dL,

Both semWER and kwWER would be 0% in this case, whereas the traditional WER would be significantly higher because the spelled-out number and unit introduce substitution and insertion errors. Applying inverse text normalization (ITN) to the hypothesis would also address such cases, but it has drawbacks that are discussed later.

Implementation 

The implementation of semWER rests on three components: character error rate (CER), semantic expansion of words, and multi-span matching. CER measures the quality of a candidate alignment, while a separate variable, the “alignment score,” drives the dynamic programming search toward the best path. Depending on its CER, each aligned pair is then counted as a match or a substitution in the WER calculation.

In the semantic alignment framework, only the reference text is expanded into overlapping word spans (e.g., unigrams, bigrams, trigrams) to account for alternate phrasings, multi-word expressions, and common synonyms. The hypothesis is retained in its original form and matched against these expanded reference spans using a dynamic programming alignment based on character error rate (CER). Scores are assigned as follows:

  • Exact match (CER = 0) between a reference span and a hypothesis segment: classified as a perfect semantic match and awarded a high score (+20).
  • Low CER (0.05 to 0.5): treated as a semantic match, with scores ranging from +0.15 to +2.0 based on both CER and alignment type, including single-to-single, single-to-multi, and multi-to-multi matches. The algorithm prefers single-to-single mappings to ensure precision.
  • Moderate CER (0.05 to 0.4): treated as a substitution but still given a positive score (typically between 0.3 and 1.95).
  • CER > 0.4: penalized with lower or even negative scores (–0.5 to +0.1).

Table 1 summarizes these empirical score ranges and reflects the model’s bias toward semantically faithful transcriptions even when lexical or surface forms differ.

Table 1: CER ranges, Similarity scores, Alignment types, and corresponding Matching Scores for semWER calculations
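The interplay of span expansion, CER, and the alignment score can be illustrated with a heavily simplified sketch. Everything named here is an assumption for illustration: the `EXPANSIONS` table, the two cut-offs `MATCH_CER` and `PAIR_CER`, the gap score, and the scoring formula are placeholders and do not reproduce the values in Table 1. The sketch also counts a substituted span as one error per reference word, which is one possible convention.

```python
# Hypothetical expansion table: reference span -> equivalent spoken forms.
EXPANSIONS = {"40mg": ["forty milligram"]}
MATCH_CER = 0.05   # assumed: at or below this, a pair counts as a match
PAIR_CER = 0.4     # assumed: above this, pairing two spans is penalized
MAX_SPAN = 3       # spans of one to three words
GAP = -1.0         # assumed score for a deletion or an insertion


def edit_distance(a, b):
    """Character-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def span_cer(ref_span, hyp_span):
    """Lowest CER over the reference span and its expanded variants."""
    variants = [ref_span] + EXPANSIONS.get(ref_span, [])
    return min(edit_distance(v, hyp_span) / max(len(v), 1) for v in variants)


def pair_score(cer, ref_len, hyp_len):
    """Alignment score: exact matches dominate, multi-word spans cost slightly more."""
    if cer == 0:
        base = 20.0
    elif cer <= PAIR_CER:
        base = 2.0 - 4.0 * cer
    else:
        base = -0.5
    return base - 0.1 * (ref_len + hyp_len - 2)


def align(ref, hyp):
    """Return [(label, ref_words, hyp_words)] for the highest-scoring path."""
    best = {(0, 0): (0.0, None)}                     # state -> (score, backpointer)
    for i in range(len(ref) + 1):
        for j in range(len(hyp) + 1):
            score = best[(i, j)][0]
            moves = []
            if i < len(ref):
                moves.append((i + 1, j, GAP, "DELETION"))
            if j < len(hyp):
                moves.append((i, j + 1, GAP, "INSERTION"))
            for k in range(1, min(MAX_SPAN, len(ref) - i) + 1):
                for m in range(1, min(MAX_SPAN, len(hyp) - j) + 1):
                    c = span_cer(" ".join(ref[i:i + k]), " ".join(hyp[j:j + m]))
                    label = "MATCH" if c <= MATCH_CER else "SUBSTITUTION"
                    moves.append((i + k, j + m, pair_score(c, k, m), label))
            for ni, nj, s, label in moves:
                if (ni, nj) not in best or score + s > best[(ni, nj)][0]:
                    best[(ni, nj)] = (score + s, (i, j, label))
    ops, state = [], (len(ref), len(hyp))
    while state != (0, 0):                           # walk the backpointers
        _, (pi, pj, label) = best[state]
        ops.append((label, ref[pi:state[0]], hyp[pj:state[1]]))
        state = (pi, pj)
    return ops[::-1]


def sem_wer(ref, hyp):
    """Errors per reference word; a substituted span counts once per reference word."""
    errors = sum(len(r) if r else len(h)
                 for label, r, h in align(ref, hyp) if label != "MATCH")
    return errors / max(len(ref), 1)


for op in align("take 40mg twice daily".split(),
                "take forty milligram twice daily".split()):
    print(op)
```

In this toy case the single reference word "40mg" is matched against the two-word hypothesis span "forty milligram" through its expansion, so no error is counted, whereas a misspelled drug name with a small but non-zero CER would still be labeled a substitution.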

kwWER is computed from the alignments obtained for semWER. To measure it, an annotation field specifies the regions to be evaluated as character offsets into the reference. See the Eka Care medical evaluation dataset for the exact format.
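A minimal sketch of this idea follows, assuming keywords are given as (start, end) character offsets into the reference and that the alignment yields one label per reference word. The function name `keyword_error_rate`, the offset convention, and the rule that a keyword is wrong if any word it overlaps is not a match are all assumptions; the actual annotation format is defined by the dataset. The drug example from earlier is split into three keyword items here so that the result mirrors the 1-in-3 figure quoted for it.

```python
import re


def keyword_error_rate(reference, word_labels, keyword_spans):
    """Fraction of keyword regions that overlap at least one non-matching word."""
    # Character span of every whitespace-delimited reference word.
    words = [(m.start(), m.end()) for m in re.finditer(r"\S+", reference)]
    if len(words) != len(word_labels):
        raise ValueError("expected one alignment label per reference word")
    wrong = 0
    for kw_start, kw_end in keyword_spans:
        labels = [label for (start, end), label in zip(words, word_labels)
                  if start < kw_end and end > kw_start]   # words inside the region
        if any(label != "MATCH" for label in labels):
            wrong += 1
    return wrong / len(keyword_spans)


reference = "स्टारक्लेव 625mg 10 टैबलेट्स की स्ट्रिप"
keywords = ["स्टारक्लेव", "625mg", "10 टैबलेट्स"]
# Offsets are derived from the text here instead of being hand-annotated.
spans = [(reference.index(k), reference.index(k) + len(k)) for k in keywords]
# Per-word labels as a semantic alignment might report them: only the drug name is wrong.
labels = ["SUBSTITUTION", "MATCH", "MATCH", "MATCH", "MATCH", "MATCH"]
print(round(keyword_error_rate(reference, labels, spans), 2))
```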

Results and discussions

ASR models were evaluated using Semantic WER (semWER), and Table 2 shows the difference between traditional WER and semWER for each system. The relative difference ranges from about 30% to 50% across these systems, which shows that traditional WER can substantially misrepresent ASR accuracy and underscores the importance of semWER for fair comparison.

Table 2: WER and semWER with their relative difference, along with kwWER across ASR Systems on the Eka medical ASR evaluation dataset (all values in %).

The evaluation framework also supports integrating a GLM (Global Mapping List), which accounts for morphologically equivalent words, similar to traditional evaluation tools like SCLITE. 

Inverse Text Normalization (ITN) is often employed to map spoken forms to written text, refining WER estimates in ASR evaluations. However, it has several limitations that Semantic WER overcomes.

  • ITN converts spoken form → written form in one direction only ("forty milligram" → "40mg"). Semantic WER handles bidirectional mappings and finds optimal alignment regardless of direction ("40mg" ↔ "forty milligram" ↔ "चालीस मिलीग्राम").
  • ITN operates on the entire text without considering alignment context, while semantic WER makes alignment decisions at the token level with position awareness, enabling precise error attribution.
  • Semantic WER uses character-level edit distance with a fallback to approximate matching, handling ASR recognition errors gracefully, while ITN fails if the input doesn't exactly match the normalization rules.
  • Unlike ITN, which requires language-specific preprocessing, Semantic WER natively handles mixed Hindi-English medical consultations ("patient को 375mg twice daily") without the need for separate normalization pipelines.
  • Semantic WER allows configurable CER thresholds (0.4 for medical domains) to balance between strict accuracy and semantic equivalence, while ITN uses binary exact-match criteria that cannot be tuned for domain requirements.
  • Semantic WER provides both word-level WER and character-level weighted CER metrics, enabling medical teams to distinguish between cosmetic transcription differences and clinically significant errors for targeted quality improvement.
  • On-the-fly semantic expansion eliminates the need for extensive text preprocessing pipelines required by ITN systems, enabling real-time evaluation during live medical consultations.
  • ITN applies fixed rules regardless of context, while Semantic WER considers surrounding tokens and medical context when deciding expansion strategies (e.g., "2x" → "twice" in frequency context vs "two times" in quantity context).
  • Semantic WER provides detailed alignment traces showing exactly how each reference token was matched, with specific CER values and alignment types (MATCH/SUBSTITUTION/DELETION/INSERTION), while ITN offers no visibility into transformation decisions or confidence levels.
  • Unlike ITN's black-box normalization, Semantic WER enables precise identification of error sources through character-level analysis, allowing medical teams to trace whether errors stem from acoustic confusion, vocabulary gaps, or semantic misunderstanding for targeted model improvement.
  • Semantic WER's modular expansion framework allows easy addition of new medical abbreviations, regional terminology, and specialty-specific vocabularies without rebuilding the entire system, unlike ITN, which requires comprehensive rule rewriting for new domains.
  • The architecture seamlessly extends to new Indian languages and mixed scripts through configurable expansion dictionaries and character-level normalization, while ITN requires separate language-specific models and conversion rules.

Conclusion

Together, semWER and kwWER represent a significant advancement over traditional WER and ITN-based evaluation, especially in high-stakes domains like healthcare. By capturing semantic equivalence and checking that critical information is retained, these metrics enable more accurate and clinically meaningful evaluations of ASR systems. We encourage the ASR community to explore and contribute to KARMA, our open-source evaluation toolkit tailored for Indian healthcare. KARMA supports Hindi, English, and code-switched inputs, with plans to extend to more Indian languages and clinical domains.