Benchmarks

Speech-to-Text Accuracies

From our testing against Singaporean-accented English audio, ins8.ai achieved the best accuracy when compared to other models. It scored the lowest WER values while maintaining a small model size. Below, we provide further details such as the WER scores, processing steps, models used, and limitations.

stt-wer
stt-models

Word Error Rate (WER)

WER scores serve as a common metric for evaluating the performance of speech-to-text models. WER quantifies the disparity between a model's textual output and the ground truth text, taking into account the total number of insertions, deletions, and substitutions relative to the total number of words. Lower WER values indicate higher accuracy.
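
Concretely, with S substitutions, D deletions, I insertions, and N words in the ground truth text:

WER = (S + D + I) / N

A score of 0 indicates a perfect match, and the value can exceed 1 when the output contains many insertions.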

The accuracy of transcription outcomes is influenced by various characteristics of recorded audio, including background noise, varying accents and speaking speeds, overlapping voices, audio quality, and the use of expressions or abbreviations unique to specific local environments. Ins8.ai addresses these challenges by leveraging different components within its inference pipeline. This approach enables the model to achieve consistently low WER scores.


Test Data and Processing Steps

To ensure a fair evaluation across different speech-to-text (STT) models that may generate texts with varying formatting, it is considered best practice to implement normalization and standardization before computing WER. In our testing, we employed OpenAI's Whisper text normalization methods to address differences in punctuation, uppercase and lowercase conversions, numerical representations, British to American spelling, and other similar aspects. Following normalization, we utilized HuggingFace's evaluate library to compute WER scores.
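
As a rough illustration of this normalize-then-score step, the snippet below is a minimal sketch assuming the openai-whisper and evaluate Python packages are installed; the transcripts shown are made up for illustration and are not from our test set.

```python
import evaluate
from whisper.normalizers import EnglishTextNormalizer

# Whisper's English text normalizer handles punctuation, casing, number
# formats, British/American spelling variants, and similar surface differences.
normalizer = EnglishTextNormalizer()
wer_metric = evaluate.load("wer")

# Illustrative strings; in practice these come from the ground truth files
# and each model's transcription output.
references = ["We'll finalise the report by 4 PM, okay?"]
predictions = ["we will finalize the report by 4 pm okay"]

# Normalize both sides so formatting differences do not count as errors.
norm_references = [normalizer(text) for text in references]
norm_predictions = [normalizer(text) for text in predictions]

wer = wer_metric.compute(references=norm_references, predictions=norm_predictions)
print(f"WER: {wer:.3f}")
```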

The testing dataset comprised 30 audio clips with a combined duration of 150 minutes and 47 seconds (approximately 2.5 hours). Each clip was between 2 and 17 minutes long and covered diverse scenarios, including speeches, interviews, conversations between multiple speakers, and more. To ensure the highest quality during benchmarking, the ground truth texts were meticulously prepared through manual transcription. The code and transcription outputs are available here, and the results above can be easily reproduced.



STT Models

Below are short descriptions of the other models evaluated.


Google - Chirp

Released in May 2023, Chirp is a two-billion-parameter speech model built via self-supervised training on millions of hours of audio and 28 billion sentences of text spanning 100+ languages. It is the latest comprehensive update to Google's family of speech-to-text models; the research paper was released in March 2023, and the cloud product was announced at Google I/O in May 2023.

OpenAI - Whisper

Released in September 2022, Whisper is an open-source machine learning model for speech recognition and transcription. It is a weakly supervised deep learning acoustic model built on an encoder-decoder Transformer architecture and trained on 680,000 hours of multilingual and multitask supervised data. Release dates include September 2022 (original series), December 2022 (large-v2), and November 2023 (large-v3).

HuggingFace - Distil Whisper

Released in November 2023, Distil Whisper is a compact speech recognition model for resource-constrained environments. Pseudo-labelling and knowledge distillation techniques are applied to Whisper models as part of model compression. A total of 22k hours of pseudo-labelled audio, spanning 10 domains and over 18k speakers, was used for distillation. Model types include large-v2, medium, and small.


WER Limitations

It should be noted that WER scores have limitations. WER does not distinguish between crucial words that carry a sentence's meaning and words that do not. It also ignores the degree of difference between two words: a substitution counts the same whether the words differ by a single character or are completely unrelated. For example, consider "The quick brown fox jumps over the dog." and "The quick brown box jumped over the dog.". The errors 'fox/box' and 'jumps/jumped' are counted uniformly by WER, even though the former has a far greater impact on the meaning conveyed. Users of STT models should be aware of such limitations and test accordingly.
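
To make this concrete, a quick check with HuggingFace's evaluate library shows that either single substitution yields the same score; this is a minimal sketch using the example sentences above.

```python
import evaluate

wer_metric = evaluate.load("wer")

reference = ["The quick brown fox jumps over the dog."]

# Each hypothesis contains exactly one substitution against the reference,
# so WER scores them identically (1 error / 8 words = 0.125), even though
# "box" changes the meaning far more than "jumped" does.
for hypothesis in ["The quick brown box jumps over the dog.",
                   "The quick brown fox jumped over the dog."]:
    print(wer_metric.compute(references=reference, predictions=[hypothesis]))
```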


Filler Sounds

The default outputs from Whisper-based models excluded certain filler sounds and disfluencies such as "hmm", "mm", "mhm", "mmm", "uh" and "um"; this led to a potential bias when evaluating Whisper-based transcripts.

The ground truth text, along with predictions from Google's Chirp model and the ins8.ai model, did include such filler words. To assess the impact of these disfluencies on the evaluation, we removed these filler words from all transcripts and re-evaluated the mean WER scores. Interestingly, the WER values did not exhibit drastic changes, and ins8.ai still emerged as the best-performing model.

In addition, we conducted another round of WER comparisons in which we filtered out Singaporean expressions and filler sounds such as "lah", "la", "leh", "lo", and "lor" from all transcripts before recalculating the mean WER scores. Even after this linguistic filtering, the final result remained consistent: ins8.ai maintained the highest accuracy and the lowest WER score. These findings emphasize ins8.ai's robust performance across test conditions and suggest that the initial omission of certain filler words in Whisper-based transcripts did not significantly affect the overall evaluation.

stt-wer2
WER values with English filler sounds removed
stt-wer3
WER values with both English and Singaporean filler sounds removed
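
The filler-filtering step described above can be sketched as follows. This is a minimal sketch: the filler lists come from the description above, while strip_fillers and the sample transcripts are illustrative and not part of our released code.

```python
import evaluate

# Filler sounds excluded by default in Whisper-based outputs, plus
# Singaporean expressions filtered in the second comparison round.
ENGLISH_FILLERS = {"hmm", "mm", "mhm", "mmm", "uh", "um"}
SINGAPOREAN_FILLERS = {"lah", "la", "leh", "lo", "lor"}

def strip_fillers(text: str, fillers: set) -> str:
    """Remove filler tokens from an already-normalized transcript."""
    return " ".join(word for word in text.split() if word not in fillers)

wer_metric = evaluate.load("wer")

reference = "um the report is due tomorrow lah"
prediction = "the report is due tomorrow"

all_fillers = ENGLISH_FILLERS | SINGAPOREAN_FILLERS
filtered_reference = strip_fillers(reference, all_fillers)
filtered_prediction = strip_fillers(prediction, all_fillers)

# With both filler sets removed, these two transcripts match exactly (WER 0.0).
print(wer_metric.compute(references=[filtered_reference],
                         predictions=[filtered_prediction]))
```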


References

Google - Chirp: https://cloud.google.com/speech-to-text/v2/docs/chirp-model
OpenAI - Whisper: https://github.com/openai/whisper/blob/main/model-card.md
HuggingFace - Distil Whisper: https://github.com/huggingface/distil-whisper
HuggingFace - Evaluate (WER): https://huggingface.co/spaces/evaluate-metric/wer
OpenAI - English Text Normalizer: https://github.com/openai/whisper/tree/main/whisper/normalizers


Last Updated: 2023/12/21