In a recent blog post, Google announced that its Cloud Speech API has reached General Availability (GA). The Cloud Speech API, part of Google's broader family of Cloud Machine Learning APIs that provide pre-trained models for tasks such as video, image and text analysis and dynamic translation, lets developers convert audio to text. The Cloud Speech API was launched in open beta last summer.
Cloud Speech API takes advantage of Google's neural network-based speech recognition, which has its roots in Google's own voice offerings, including Google Assistant and Google Home. The Cloud Speech API currently supports more than 80 languages and language variants. It can ingest audio in two modes:
Real-time streaming, which returns interim text results while a person is still speaking
Batch, which transcribes pre-recorded audio
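The batch mode above can be sketched as a single REST call. A minimal sketch, assuming the v1 `speech:recognize` endpoint; the helper name and the Cloud Storage URI are illustrative placeholders, not part of Google's documentation:

```python
import json

def build_recognize_request(gcs_uri, language_code="en-US"):
    """Build the JSON body for a batch (non-streaming) call to the
    Cloud Speech API's v1 `speech:recognize` endpoint."""
    return {
        "config": {
            "encoding": "LINEAR16",        # raw 16-bit PCM audio
            "sampleRateHertz": 16000,
            "languageCode": language_code,
        },
        "audio": {
            # Pre-recorded audio is referenced by a Cloud Storage URI
            # (bucket and object names here are placeholders).
            "uri": gcs_uri,
        },
    }

body = build_recognize_request("gs://example-bucket/call-recording.raw")
print(json.dumps(body, indent=2))
```

Streaming recognition follows the same configuration shape but sends audio in chunks over a gRPC stream rather than in one request body.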
The service can operate in noisy environments by filtering out background noise, and it supports word and phrase hints, which let developers add domain-specific words or phrases to a recognition dictionary.
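Phrase hints are supplied through the `speechContexts` field of the recognition configuration. A minimal sketch, assuming the v1 request shape; the helper name and the sample phrases are hypothetical:

```python
def with_phrase_hints(config, phrases):
    """Return a copy of a recognition config with word/phrase hints
    attached via `speechContexts`, biasing the recognizer toward
    domain-specific vocabulary (the phrases below are examples)."""
    hinted = dict(config)
    hinted["speechContexts"] = [{"phrases": list(phrases)}]
    return hinted

base = {"encoding": "LINEAR16", "sampleRateHertz": 16000, "languageCode": "en-US"}
hinted = with_phrase_hints(base, ["trade-in", "certified pre-owned", "APR"])
print(hinted["speechContexts"])
```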
As part of this GA launch, Google has added some new features and improved performance in the areas of:
Transcription accuracy for long-form audio
Faster processing, typically 3x faster than the prior version for batch scenarios
Expanded file format support, now including WAV, Opus and Speex
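Each supported format maps to an encoding value in the recognition configuration. A sketch of that mapping, assuming the v1 `RecognitionConfig.encoding` enum names; treat the mapping as an illustration rather than an exhaustive list:

```python
# Map common audio formats to Cloud Speech API encoding values
# (WAV containers typically carry 16-bit linear PCM).
ENCODING_FOR_FORMAT = {
    "wav": "LINEAR16",
    "opus": "OGG_OPUS",
    "speex": "SPEEX_WITH_HEADER_BYTE",
}

def encoding_for(filename):
    """Guess the encoding value from a file extension; returns None
    for formats not covered by this illustrative table."""
    ext = filename.rsplit(".", 1)[-1].lower()
    return ENCODING_FOR_FORMAT.get(ext)

print(encoding_for("call-recording.wav"))  # LINEAR16
```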
In a recent presentation at Google Cloud Next ’17, Dan Aharon, product manager at Google, described some of the use cases behind Cloud Speech API including human-computer interactions using mobile, web and IoT applications. The service can also be used to generate speech analytics for businesses in customer service scenarios.
Aharon also discussed the momentum behind speech and why it has reached an inflection point:
Voice is faster (150 words per min vs 20-40 for typing)
Easier (does not require a hierarchical UI)
More convenient (allows hands free operation)
Over 20% of all Android app searches are now done through voice
Always-listening devices (Google Home, Google Pixel, Amazon Echo) becoming mainstream
Google has showcased a couple of customer scenarios that demonstrate the capability of the Cloud Speech API. The first example is a mobile chat application called Azar, in which users communicate with each other in real time over video chat. In addition to streaming video and audio, a transcript is provided to users in the language of their choice. Thus far, Azar has made more than 15 billion discovery matches and is operating the service at scale.
Another use case that Google is showcasing focuses on customer service. Nowadays, most organizations that provide customer service over the phone play a prompt indicating the conversation may be recorded for customer-satisfaction purposes. But what do organizations do with that data? Gary Graves, CTO of InteractiveTel, indicates those conversations are usually reviewed only after a customer dispute. Graves feels that organizations, including car dealerships, are missing out on many opportunities as a result:
Not only are our car dealership customers making more sales, but it’s causing a shift in mentality because now everyone in the dealership is being held accountable. It’s one thing to have a recording or monitoring solution in place, and people know it’s there. But that’s reactive, meaning the only time that information is ever going to be leveraged is if there’s a situation that calls that into question. Whereas using Cloud Speech we are able to mine these conversations for actionable intelligence to allow us to empower dealers to be proactive and provide a higher level of customer service.
Within InteractiveTel’s offering, car dealerships get a transcription and sentiment-analysis solution. As a phone conversation takes place, InteractiveTel runs it in real time through its platform, which leverages the Google Cloud Speech API. As a result, car dealerships can deliver actionable insights to their sales force and determine customer sentiment on a per-call basis.
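The general pattern is to transcribe the call, then score the transcript. A deliberately naive sketch of per-call scoring over a finished transcript; the word lists and helper are hypothetical placeholders, not InteractiveTel's implementation, which a production system would replace with a hosted NLP service:

```python
# Toy per-call sentiment: score a transcript in [-1, 1] from counts
# of positive vs. negative keywords (placeholder lexicons).
POSITIVE = {"great", "thanks", "perfect", "interested"}
NEGATIVE = {"problem", "cancel", "frustrated", "wait"}

def call_sentiment(transcript):
    """Return (pos - neg) / (pos + neg), or 0.0 if no keywords hit."""
    words = [w.strip(".,!?").lower() for w in transcript.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

print(call_sentiment("Thanks, that price sounds great!"))  # 1.0
```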
As part of InteractiveTel’s demo at Google Cloud Next ’17, Graves demonstrated how their technology can be used to provide real-time speech-to-text transcription, keyword detection and sentiment analysis. Graves feels that even when customers are unwilling to provide their contact information, a great deal of product-demand information can still be captured, without relying on a salesperson to accurately enter it into a system.