Using the Web Speech API for recording transcripts

Kevin Wong
4 min read · Feb 3, 2024


Audio to Text — Bing AI Image Generator

You may have come across Machine Learning (ML) services like Amazon Transcribe, Google Cloud Speech-to-Text, and OpenAI’s Whisper. However, you might be surprised to learn that some of the latest web browsers now come equipped with built-in transcription features, eliminating the need to connect to external ML transcribing services.

The Web Speech API comprises two main parts: the Speech Recognition API and the Speech Synthesis API. In this article, we will explore the Speech Recognition API, a browser feature designed to convert audio streams into text.

Let’s look at browser support for speech recognition across the most recent modern browsers as of the time of writing. You can refer to the Speech Recognition API documentation for the latest support status.

As illustrated in the support table, this feature is available in some of the latest WebKit-based browsers, but it’s not currently supported in browsers such as Edge or Firefox.

Demo

Here’s a straightforward example illustrating how transcription functions within the browser. It’s important to note that, at the time of writing, the Web Speech Recognition API is compatible only with Chrome. Nonetheless, this example provides a glimpse of the capabilities offered by this browser feature.
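To make the demo concrete, here is a minimal sketch of the raw Speech Recognition API. It assumes Chrome’s prefixed `webkitSpeechRecognition` constructor as a fallback when the unprefixed one is absent; the `window`-like object is passed in as a parameter so the logic can be exercised outside a browser.

```javascript
// Pick whichever SpeechRecognition constructor the browser exposes.
function getRecognitionCtor(w) {
  return w.SpeechRecognition || w.webkitSpeechRecognition || null;
}

// Start continuous recognition and forward each result to a callback.
function startTranscription(w, onText) {
  const Ctor = getRecognitionCtor(w);
  if (!Ctor) throw new Error("Speech recognition is not supported in this browser");
  const rec = new Ctor();
  rec.continuous = true;      // keep listening across pauses in speech
  rec.interimResults = true;  // emit partial hypotheses while the user speaks
  rec.lang = "en-US";
  rec.onresult = (event) => {
    for (let i = event.resultIndex; i < event.results.length; i++) {
      onText(event.results[i][0].transcript, event.results[i].isFinal);
    }
  };
  rec.start();
  return rec;
}
```

In a page you would call `startTranscription(window, (text) => console.log(text))` after a user gesture, since Chrome requires user interaction before granting microphone access.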

Setting up voice recognition can involve a fair amount of boilerplate code. Fortunately, someone has likely already developed a React hook to streamline the process. One such library is speech-recognition-react, which abstracts away the complexities of implementing voice recognition.

By utilizing this library, you can significantly reduce the code needed to integrate voice recognition into your React applications. This not only simplifies the development process but also enhances the overall user experience. Let’s take a closer look at how you can leverage this React hook to seamlessly incorporate voice recognition capabilities into your projects.
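As an illustration of what such a hook manages under the hood, here is a dependency-free sketch (not the library’s actual API): it accumulates finalized results into a transcript and tracks listening state, which is essentially the state a speech-recognition hook would hold for you.

```javascript
// Dependency-free sketch of the state a speech-recognition hook manages.
// The recognizer constructor is injected so this also works with ponyfills.
function createTranscriber(RecognitionCtor) {
  const rec = new RecognitionCtor();
  rec.continuous = true;
  rec.interimResults = false; // only accumulate finalized phrases
  const state = { transcript: "", listening: false };
  rec.onresult = (event) => {
    for (let i = event.resultIndex; i < event.results.length; i++) {
      if (event.results[i].isFinal) {
        state.transcript += event.results[i][0].transcript;
      }
    }
  };
  rec.onend = () => { state.listening = false; };
  return {
    state,
    recognizer: rec, // exposed so callers can attach extra handlers
    start() { state.listening = true; rec.start(); },
    stop() { rec.stop(); },
    resetTranscript() { state.transcript = ""; },
  };
}
```

A React hook wraps exactly this kind of object in component state, re-rendering whenever the transcript changes.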

Taking voice recognition one step further, it’s essential to address compatibility across various browsers. While some solutions may not work universally, one effective approach is to explore the concept of “ponyfill” techniques.
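The distinction matters: a polyfill patches the global object, while a ponyfill exposes the implementation as an explicit import and leaves globals untouched. A minimal sketch of that selection logic (the names here are illustrative, not from any specific library):

```javascript
// Ponyfill pattern: return a SpeechRecognition constructor without
// mutating the global object; callers import and use it explicitly.
function createSpeechRecognitionPonyfill(w, FallbackCtor) {
  // Prefer the native implementation (prefixed or not); otherwise fall
  // back to the provided constructor, e.g. a cloud-backed implementation.
  return w.SpeechRecognition || w.webkitSpeechRecognition || FallbackCtor;
}
```

Because the fallback is passed in rather than hard-coded, the same application code runs unchanged in Chrome (native) and in Firefox (fallback).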

One project worth considering is react-speech-recognition, which employs a ponyfill methodology to ensure a consistent voice recognition experience across browsers. However, it's worth noting that I encountered difficulties with speechify, which has since been acquired by Roblox. Unfortunately, the acquisition has made it challenging to set up a developer account and obtain an API key.

In my exploration, I also experimented with the Microsoft Azure Cognitive Services polyfill, as documented in their guide. However, I faced issues with broken dependencies and discovered that the library was non-functional in the latest version.

In search of a more reliable voice-to-text solution that works in every browser, I explored Whisper, widely regarded in the developer community as one of the most popular server-side options. My experience has been promising, especially when employing the polyfill approach using the use-whisper project.

Utilizing the examples provided with the use-whisper project has proven easy and effective, offering a seamless integration of Whisper for voice-to-text conversion. This robust server-side solution showcases its capabilities and demonstrates how it can be used efficiently within your applications.
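Under the hood, a hook like use-whisper ultimately posts the recorded audio to a Whisper backend. As a hedged sketch, here is roughly what that server-side call looks like against OpenAI’s hosted transcription endpoint (the endpoint URL and `whisper-1` model name are based on OpenAI’s public REST API; the fetch implementation is injectable so the request construction can be tested without a network call):

```javascript
// Sketch: send recorded audio to OpenAI's hosted Whisper endpoint and
// return the transcribed text. Not the use-whisper internals verbatim.
async function transcribeWithWhisper(audioBlob, apiKey, fetchImpl = fetch) {
  const form = new FormData();
  form.append("file", audioBlob, "audio.webm");
  form.append("model", "whisper-1");
  const res = await fetchImpl("https://api.openai.com/v1/audio/transcriptions", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
    body: form,
  });
  if (!res.ok) throw new Error(`Whisper request failed: ${res.status}`);
  const data = await res.json();
  return data.text;
}
```

In the browser, the `audioBlob` would come from a `MediaRecorder` capturing the microphone stream; the trade-off versus the Speech Recognition API is that you control the model and behavior, but every transcription costs a network round trip.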

Final Thoughts

Upon thorough exploration and implementation of the Speech Recognition API, several quirks have come to light. Notably, the API lacks universal browser support, necessitating the use of a polyfill to emulate its functionality. This introduces potential inconsistencies in how the same audio stream is transcribed, depending on the specific polyfill implementation chosen.

Additionally, the reliance on an active internet connection poses a limitation, as the feature depends on network availability. Users may encounter difficulties in scenarios where internet connectivity is unreliable or unavailable.

Another challenge emerges when dealing with multiple audio input devices: selecting the correct stream and managing microphone permissions within the browser can lead to various issues. These intricacies make for a suboptimal user experience with the current Speech Recognition API.
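For the device-selection part specifically, the standard MediaDevices API can at least enumerate the available microphones. A short sketch (the `mediaDevices` object is passed in as a parameter so the logic is testable outside the browser; in a page you would pass `navigator.mediaDevices`):

```javascript
// Sketch: request microphone permission, then list audio input devices.
// Device labels are only populated after the user grants permission.
async function listMicrophones(mediaDevices) {
  await mediaDevices.getUserMedia({ audio: true });
  const devices = await mediaDevices.enumerateDevices();
  return devices.filter((d) => d.kind === "audioinput");
}
```

Even with this, the Speech Recognition API itself offers no way to choose which of the enumerated devices it records from, which is part of the problem described above.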

It’s important to note that, as of the time of writing, the Speech Recognition API is still in draft mode. At this developmental stage, it’s understandable that the feature isn’t yet ready for widespread adoption. Despite the present challenges, web developers can look forward to a future where this API is fully featured across all modern browsers and operates seamlessly in offline scenarios. For now, though, using Whisper or another capable server-side ML solution delivers the most consistent and reliable output.

Looking ahead, the promise of better and lighter Large Language Models (LLMs), such as Gemini Nano, suggests a future where client-side models can efficiently transcribe audio to text within the browser. This is showcased by applications like the Google Pixel Recorder, which demonstrates the feasibility of on-device transcription in a mobile application. The prospect of a fully featured Speech Recognition API in Chrome appears promising, contingent on hardware advances such as GPUs or SoCs with tensor cores to support its execution. As the landscape evolves, it’s worth keeping a keen eye on developments in this space.


Kevin Wong

Software Engineer and Technology Enthusiast based out of Vancouver, British Columbia. (https://www.linkedin.com/in/kevinkswong/)