Generalized Keyword Spotting using ASR embeddings

R. Kirandevraj, Vinod K. Kurmi, Vinay P. Namboodiri, C. V. Jawahar

Research output: Chapter or section in a book/report/conference proceeding > Chapter in a published conference proceeding

4 Citations (SciVal)


Keyword Spotting (KWS) detects a set of pre-defined spoken keywords. Building a KWS system for an arbitrary set requires massive training datasets. We propose to use the text transcripts from an Automatic Speech Recognition (ASR) system alongside triplets for KWS training. The intermediate representation from the ASR system trained on a speech corpus is used as acoustic word embeddings for keywords. Triplet loss is added to the Connectionist Temporal Classification (CTC) loss in the ASR while training. This method achieves an Average Precision (AP) of 0.843 over 344 words unseen by the model trained on the TIMIT dataset. In contrast, the Multi-View recurrent method that learns jointly on the text and acoustic embeddings achieves only 0.218 for out-of-vocabulary words. This method is also applied to low-resource languages such as Tamil by converting Tamil characters to English using transliteration. This is a very challenging novel task for which we provide a dataset of transcripts for the keywords. Despite our model not generalizing well, we achieve a benchmark AP of 0.321 on over 38 words unseen by the model on the MSWC Tamil keyword set. The model also produces an accuracy of 96.2% for classification tasks on the Google Speech Commands dataset.
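The abstract names two ideas that a small example may make more concrete: a triplet loss added to the CTC loss, and Average Precision as the retrieval metric. The sketch below is illustrative only and is not the authors' implementation. The cosine distance, the margin of 0.5, the weighting factor `weight`, and all function names are assumptions, since the abstract does not specify them. The CTC loss is treated as an already-computed number rather than derived from a model.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss on embeddings.

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive. The margin value is an assumption.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)


def joint_loss(ctc_loss, triplet, weight=1.0):
    """Add a (hypothetically weighted) triplet term to a given CTC loss value."""
    return ctc_loss + weight * triplet


def average_precision(scores, labels):
    """Average Precision for one ranked list.

    `scores` are similarity scores (higher means a closer match) and
    `labels` are 1 for a true keyword match, 0 otherwise.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0
```

In this sketch, an anchor and positive would be embeddings of the same spoken word and the negative an embedding of a different word. At test time, candidate segments would be ranked by similarity to a query embedding and scored with `average_precision`.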

Original language: English
Title of host publication: Proceedings Interspeech 2022
Number of pages: 5
Publication status: Published - 22 Sept 2022
Event: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Korea, Republic of
Duration: 18 Sept 2022 to 22 Sept 2022

Publication series

Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
ISSN (Print): 2308-457X


Conference: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
Country/Territory: Korea, Republic of


  • keyword spotting
  • low-resource languages
  • speech recognition

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

