ASR Archives - TranscribeMe
https://www.transcribeme.com/blog/asr/

Evaluating Automatic Speech Recognition Technology
https://www.transcribeme.com/blog/evaluating-automatic-speech-recognition-technology/
Thu, 07 Jul 2022


Let’s talk about Automatic Speech Recognition (ASR) technology: the state of the art; user and customer expectations; how ASR output measures up against those expectations; and whether there is one ASR engine that meets all requirements. Spoiler alert: the answer to that last question is no. There’s no single ASR engine that can satisfy all industry needs. Why not? We’ll dive into that answer in a bit.

Here’s another question to ask about ASR technology: why should I, the consumer, look beyond the big three (Google, Apple, Microsoft), or make that four with IBM, to meet all my ASR requirements? After all, they have the biggest R&D budgets and attract the best talent, so their technology should be the best, right?

The answer is, it depends.


ASR Technology Hits and Misses

For example, you want Google to turn on the lights: “Google! Turn on the driveway lights.” Or, “Siri! Play my ‘I’m really depressed’ mix.” Or, “Alexa, I need a vegan pizza, light pepperoni and cheese.” All of these technologies that use ASR to pick up on voice commands work pretty well.

However, there are a number of cases where these ASR technologies struggle. One simple example is the speech-to-text feature on a phone. Between autocorrect and misrecognized words, it’s definitely not perfect. What’s most frustrating is that it doesn’t learn: I have to correct my daughters’ names, as well as that of my engineering VP, every single time. This is a slightly different use case than query response, but it’s similar: typically short sentences, transcribed in real time, with errors arising because the context is free-form, so comprehensive training isn’t possible.


How TranscribeMe Uses ASR

The TranscribeMe use case for ASR is neither of these. Imagine saying, “OK Google! Listen to this one-hour audio file and transcribe it with timestamps for every speaker change.” As the colloquialism goes, “that dog don’t hunt.” Why not? Because that’s not the use case Google designed for.

Simplistically, the ASR industry breaks down into two use cases: query/response and audio-to-text transcription. TranscribeMe continually tests vendors’ speech engines, and the big three or four are never at the top of the list in terms of word error rate for our use case. That makes sense: long-form audio-to-text transcription, as opposed to short spoken commands, is not their design target.
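For readers curious what “word error rate” means in these vendor tests, here is a minimal sketch of the standard Levenshtein-based definition. This is an illustration of the general metric, not TranscribeMe’s internal benchmarking tooling:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with a standard word-level Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# "driveway" misheard as "drive way": one substitution plus one insertion
print(word_error_rate("turn on the driveway lights",
                      "turn on the drive way lights"))  # → 0.4
```

A lower WER is better, but as the rest of this post argues, WER alone doesn’t capture punctuation, capitalization, or diarization quality.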

An example of a typical TranscribeMe request might be, “Transcribe this six-hour legal deposition with five speakers using the State of Iowa output format, and include speaker IDs and speaker-change timestamps.” Truth be told, no ASR engine is going to get that right. But some will do better than others.

So that’s where ASR analysis becomes more sophisticated. We’re not simply looking at word error rates but at other factors: which engine punctuates or capitalizes best? Which handles crosstalk best? Which is stellar with a single speaker or multichannel audio, versus which can handle multiple speakers on a single channel?

Why do these qualifications matter? Because the speech engine is not going to produce final output that’s acceptable to the customer. Maybe it will produce output that’s 90% correct; that sounds pretty good, but what if your car worked 90% of the time? Pretty good, or totally unacceptable?


No ASR Engine’s Output is Perfect

With a few caveats, no ASR engine can produce output that’s acceptable to the customer as a finished product. The ASR engine produces a draft that then requires human review and correction for completion. And that human in the loop dictates which engine we use for various customers and use cases, along the distinctions mentioned above: dial up the ASR that excels at single-speaker clear audio; or the ASR that accurately timestamps speaker changes; or the engine that doesn’t insert gibberish when it doesn’t understand the audio.
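That per-file engine selection can be pictured as a simple lookup from audio profile to the engine that excels at it. The engine names and profile labels below are placeholders for illustration, not actual vendor choices:

```python
# Hypothetical roster mapping each audio profile to the engine that
# handles it best; "engine_a" etc. are placeholder names, not vendors.
ENGINE_BY_PROFILE = {
    "single_speaker_clear": "engine_a",   # excels at clean single-speaker audio
    "speaker_change_timestamps": "engine_b",  # most accurate change timestamps
    "noisy_audio": "engine_c",            # avoids inserting gibberish on noise
}

def pick_engine(audio_profile: str) -> str:
    """Dial up the ASR whose strengths match the file's audio profile,
    falling back to a general-purpose default for anything else."""
    return ENGINE_BY_PROFILE.get(audio_profile, "engine_default")

print(pick_engine("single_speaker_clear"))  # → engine_a
print(pick_engine("unlabeled_webinar"))     # → engine_default
```

In practice the profile itself would come from metadata or a quick automated probe of the file, but the routing idea is the same.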

In summary, the TranscribeMe use case requires different engines for different types and qualities of audio and for specific use cases. Since we don’t build our own ASR, we can shop around and use any vendor that fits our needs and provides the best output for human review and correction.

I mentioned a caveat: there are cases where one-pass ASR output can satisfy customer requirements. In our case, we have a customer who runs further analytics on the raw ASR output, such as keyword spotting or sentiment analysis.

As an aside, be wary of any company using its own home-grown ASR to process files. One size does not fit all, and companies that build their own ASRs continually narrow the niches in which they can compete.

Do you have examples of projects where you’ve found ASR technology challenging? We’d love to know. Are you looking for a company like TranscribeMe to help with your transcription, AI dataset, or machine learning needs?

Contact us today!

ASR Team


How TranscribeMe Strives to Build Better Structured Data for More Accurate ASR
https://www.transcribeme.com/blog/improving-structured-data-for-more-accurate-asr/
Fri, 22 Apr 2022


TranscribeMe uses automatic speech recognition (ASR) technology to auto-complete audio-to-text transcripts.

When the audio is of very high quality and the completion requirements are below 100% word accuracy, an ASR can provide a fairly accurate finished transcript in a short amount of time. This usually works when there’s either a single speaker, or a dialogue between two speakers each with a separate mic (which could be two phones). Though you might think this would be the best way to create datasets for more accurate ASR, it is not the typical case, for several reasons; the primary one is audio quality.


Audio Quality Limitations

Audio quality is not always that straightforward. Many factors come into play beyond a simple clear recording: the audio may be very clear but have multiple speakers talking over each other, the speakers may have accents that confuse the ASR, or the recording may contain significant background noise.

These examples as well as other quality issues can limit ASR usability.

ASRs also need to do more than simple word transcription. Many use cases require additional features that most ASRs just don’t have. The two most common requirements are timestamping (per word and/or per speaker change) and, in ASR terms, diarization: identifying speakers, typically not by name, but simply as speaker 1, speaker 2, speaker 3, and so on.
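To make the two requirements concrete, here is a minimal sketch of what diarized, timestamped output might look like and how speaker-change timestamps fall out of it. The segment structure and labels are illustrative, not any vendor’s actual output schema:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # diarization label such as "S1", not a real name
    start: float   # seconds from the start of the file
    end: float
    text: str

def speaker_changes(segments):
    """Yield (timestamp, new_speaker) at every point the speaker changes."""
    previous = None
    for seg in segments:
        if seg.speaker != previous:
            yield seg.start, seg.speaker
            previous = seg.speaker

# Toy deposition excerpt with two diarized speakers
transcript = [
    Segment("S1", 0.0, 4.2, "Please state your name for the record."),
    Segment("S2", 4.5, 6.1, "John Doe."),
    Segment("S2", 6.3, 9.0, "D-O-E."),
    Segment("S1", 9.4, 12.0, "Thank you."),
]
print(list(speaker_changes(transcript)))
# → [(0.0, 'S1'), (4.5, 'S2'), (9.4, 'S1')]
```

The hard part, of course, is getting the `speaker` labels right in the first place, which is exactly where engines keep falling short.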

Since TranscribeMe does not create its own ASR, it constantly tests all available options; the one common failure is diarization. TranscribeMe has yet to find a speech model that can do this consistently.

Speech Technology Design

A quick word about speech technology. TranscribeMe recently met with a potential partner who asked whether the “Bigs’” (Google, Amazon, Microsoft, IBM) technology was used. The partner assumed that these companies have the resources to provide the best technology and that smaller companies like TranscribeMe could not be competitive. The thing is, the use case matters, and these companies mostly design for a specific niche: query response, i.e., “OK Google,” “Alexa, play my tunes,” and so on. They are not trying to auto-complete six hours of legal deposition.

TranscribeMe constantly tests ASRs so it can offer the best options to customers. The “Bigs” are not at the top of the list, for the reason above. In fact, no single ASR is consistently top of the list across all requirements. There are variations in language support; in the ability to understand English in various dialects and accents; in the ability to provide a runtime that lives in our domain for customer security requirements; and in the ability to add a dictionary of expected terms for niche audio. There’s also tuning: is the ASR tuned for call centers? For business dialogue such as earnings calls? For management consultants?

TranscribeMe has yet to find a single ASR that works best in all use cases, so it employs multiple engines. That said, as alluded to before, the ASR alone, except in highly constrained cases, can’t do the job on its own; it also needs help from humans.

Why ASR Technology Still Needs Human Help

TranscribeMe calls the process of helping ASRs with human input “Blend.” You might also hear the phrase “human in the loop.” Whatever it’s called, it simply means that an audio file is first processed through an ASR and then sent to a human for correction and completion.

But wait! There’s more! Back to the quality issue: poor-quality audio processed by an ASR produces a transcript so poor that it takes longer to correct than it would take a transcriptionist to work from scratch. To limit ASR processing to good-enough audio, a confidence score is used: a snippet is run through the ASR to estimate the overall audio quality. If that estimated confidence is at or above a certain threshold, the full audio is processed through Blend; otherwise, it’s sent to a manual workflow.
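The snippet-based routing described above reduces to a simple threshold check. Here is a minimal sketch under the assumption that the ASR exposes per-word confidence scores for the snippet; the 0.85 threshold is an illustrative value, not TranscribeMe’s actual cutoff:

```python
def snippet_confidence(word_confidences):
    """Average the ASR's per-word confidence scores on a short sample
    snippet to estimate the overall audio quality of the file."""
    return sum(word_confidences) / len(word_confidences)

def route_file(word_confidences, threshold=0.85):
    """Send the file through Blend (ASR draft + human correction) only when
    the snippet suggests a usable draft; otherwise transcribe manually.
    The threshold is illustrative and would be tuned empirically."""
    if snippet_confidence(word_confidences) >= threshold:
        return "blend"
    return "manual"

print(route_file([0.93, 0.88, 0.91]))  # → blend  (clean audio, usable draft)
print(route_file([0.41, 0.72, 0.55]))  # → manual (draft would cost more to fix)
```

The point of the gate is economic: below the threshold, correcting the ASR draft is slower than typing the transcript from scratch.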

So now there’s ASR-only and Blend. In some cases, that’s still not enough to build a good enough dataset. Additional processing is required, which can include per-word and speaker-change timestamping. In cases that require per-word microsecond timestamping, not possible with any ASR, it’s accomplished through a dedicated QA UI tool built by TranscribeMe for its crowd.

Customer requirements and style guides call for further post-processing which, again, can’t be done by an ASR. A style may require numbers to be spelled out, or not; “ahs” and “ums” to be included, or not. For these styles, TranscribeMe adds per-project scripting to fine-tune the transcript before it’s returned.
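A per-project style script might look something like the following. This is a minimal illustrative sketch of the idea, with made-up rules, not TranscribeMe’s actual pipeline code:

```python
import re

# Fillers to strip when the style guide says to drop them
FILLERS = re.compile(r"\b(?:uh|um|ah)\b,?\s*", re.IGNORECASE)

# Single-digit spell-outs; a real script would handle larger numbers too
NUMBER_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three",
                "4": "four", "5": "five", "6": "six", "7": "seven",
                "8": "eight", "9": "nine"}

def apply_style(text, spell_out_numbers=True, keep_fillers=False):
    """Apply two example style-guide rules to a transcript fragment."""
    if not keep_fillers:
        text = FILLERS.sub("", text)
    if spell_out_numbers:
        text = re.sub(r"\b([0-9])\b",
                      lambda m: NUMBER_WORDS[m.group(1)], text)
    return text.strip()

print(apply_style("Um, there were 2 speakers, uh, on the call."))
# → there were two speakers, on the call.
```

Each project gets its own combination of flags, which is why this stage is scripting rather than something baked into any ASR engine.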

Why Automated Speech Recognition (ASR) is Still So Limited

The question someone might ask is: why is an ASR so limited? The answer lies in what’s required to create one. There’s a term, unsupervised learning, which is one of those nirvana terms, the ultimate goal: the ASR trains itself, just like the artificial intelligence you see in movies. (They learn on their own and eventually take over the world!)

In real life, and within each limited niche, an AI must be exhaustively taught every possible case known to humans in order to function. In the case of speech automation, annotated datasets must be created and fed into deep learning algorithms to produce an ASR; then it must be done again, iteratively, until the engine is good enough, and then some more.

TranscribeMe has been employed to build these types of structured datasets, which gives it insight into what’s required to build an accurate ASR. A dataset can be tuned to a specific niche, dialogue type, or dictionary set, or in some cases to a specific customer.

TranscribeMe has also been employed by customers who want to create their own ASR, trained specifically on their own audio, because a generic engine has serious limitations in what it can provide to users looking for specific results. But regardless of how sophisticated the engines become, for the foreseeable future humans will continue to be involved, either in creating training data or in transcription, to produce a more accurate final product.

