Voice App Development
The Innovator's Guide to Voice
Siri changed how we interact with phones for good. Now, we’re on the cusp of an even greater voice revolution. But what is a voice interface, exactly? And how do you implement one? We take a look at the voice UI trend and dig into how it promises to fundamentally and permanently change how we interact with the tech around us.
If you have an Amazon Echo, Google Home, dictate speech-to-text or use your voice to tell your car to change the volume, you’re already living in the world of voice and speech recognition. We use our voices to control our favorite devices every day without thinking about it. Welcome to the world of voice.
At its most basic, “voice” is just a way of issuing commands or even dictating words. But it goes beyond this. It’s also quickly becoming an interface — the primary way to interact with some devices. In fact, voice as a user interface has its own term: VUI. It’s part of an entire class of user interfaces that are considered so natural as to be invisible: NUI, or natural user interfaces.
Voice’s vast potential makes it one of the most exciting emerging technologies. Combined with recent machine learning advances, this is an impactful area that simply can’t be ignored.
Ready to get started adding voice to your business? We’re here to help.
The different types of voice
- Voice recognition typically refers to recognizing a specific person’s voice, often for biometric security reasons.
- Speech recognition refers to recognizing speech patterns and words, and extracting meaning from those words.
- And voice refers to the overarching field, where spoken sounds control devices, apps, and experiences.
How does voice work?
Simply put, you take sound, process it and take an action or make a decision based upon that. Many neural networks will chop up sound into smaller pieces and compute probabilities on these soundbites.
Case in point: to make “Hey Siri” possible, the iPhone is always listening for its “wake phrase”. It slices voice audio into frames of roughly a hundredth of a second, transforms them and feeds them into a deep neural network (DNN), which evaluates the probability that you actually said “Hey Siri!” versus something else.
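To make the chop-and-score idea concrete, here is a minimal Python sketch. It assumes 16 kHz audio and 10 ms frames, and it swaps the deep neural network for a toy energy-based score, so the function names, the weight and the bias are all illustrative stand-ins rather than anything Apple ships.

```python
import math

def split_into_frames(samples, sample_rate=16000, frame_ms=10):
    """Chop a list of audio samples into fixed-length, non-overlapping frames."""
    frame_len = sample_rate * frame_ms // 1000
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def frame_energy(frame):
    """Mean squared amplitude: a crude stand-in for real acoustic features."""
    return sum(s * s for s in frame) / len(frame)

def wake_probability(frames, weight=40.0, bias=-4.0):
    """Toy scorer that squashes average frame energy into a 0..1 score.

    A real system would feed spectral features into a trained DNN;
    weight and bias are made-up numbers, not learned parameters.
    """
    if not frames:
        return 0.0
    avg = sum(frame_energy(f) for f in frames) / len(frames)
    return 1.0 / (1.0 + math.exp(-(weight * avg + bias)))

def should_wake(samples, threshold=0.5):
    """Wake the assistant only when the score clears the threshold."""
    return wake_probability(split_into_frames(samples)) >= threshold
```

The threshold is the interesting design knob: set it too low and the device wakes up by accident, too high and it ignores you.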
Device-side voice processing like this is becoming more common, especially with AI chips in flagship mobile devices. But, in some cases, speech recognition sends voice snippets to cloud servers for remote data processing. Transcribed or translated text is often sent back. If you’ve ever used voice-based software that required an internet or data connection, it was probably using the cloud.
Within voice, there are a few subtypes:
- Automatic speech recognition (ASR): ASR recognizes different voices and accents and converts dictated speech into written text. It doesn’t understand the meaning or context of words; it simply strives to take audio and turn it into text. One example is Google Translate’s speech-to-text mode, which transcribes audio to text without necessarily understanding meaning. Any transcription software falls into this category, too.
- Natural language understanding (NLU): NLU actually seeks to understand meaning. It extracts key phrases and makes sense of the context. Ideally, this means that search results returned by NLU should deliver exactly what you’re looking for. One example is voice-based Google Search, which actually understands meaning to return accurate results. Smart speakers are another example, as they’re backed by search and digital assistants which must parse context, key phrases and meaning to return the most relevant result (or subset of results).
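The split between the two can be sketched in a few lines of Python. This example assumes ASR has already produced a transcript string, then applies a toy NLU step that maps the text to an intent plus extracted details ("slots"). The intent names and regular expressions are hypothetical; a production NLU system would use a trained model rather than hand-written patterns.

```python
import re

# Hypothetical intent patterns; a real NLU model would learn these from data.
INTENT_PATTERNS = {
    "play_music": re.compile(r"\bplay (?P<item>.+)", re.IGNORECASE),
    "get_weather": re.compile(r"\bweather in (?P<place>[\w ]+)", re.IGNORECASE),
}

def understand(transcript):
    """Toy NLU step: map ASR text to an intent name plus extracted slots."""
    for intent, pattern in INTENT_PATTERNS.items():
        match = pattern.search(transcript)
        if match:
            return {"intent": intent, "slots": match.groupdict()}
    return {"intent": "unknown", "slots": {}}
```

ASR stops at the transcript; everything `understand` does happens after that point, which is why transcription tools can skip it entirely.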
No matter what type of voice recognition is being used, if there’s machine learning like a neural network involved, the system will ideally adapt, improve and learn over time.
Voice in today’s world
In the right cases, voice means never touching a screen. While it doesn’t always make sense to completely replace touch experiences with voice, there are many places where it creates additional ease, especially in use cases where people can’t easily access their devices. Whether you’re going for a run or your hands are full of groceries, say the word and you’ve got the news, music, an Uber or Lyft, even groceries.
We already use our other senses for device interactions: touch, hearing, sight. Voice promises to be the next big interface because its hands-free commanding stretches the possibilities beyond menus and UI. As we wrote in The Short Stack Vol #4, voice means that, in a lot of cases, the best UX will no longer be constrained by squeezing a limited number of features onto a mobile screen.
It also expands our expressiveness as consumers and device-users. As Jason Amunwa writes in The UX of Voice: The Invisible Interface, registering for website email updates once meant entering an address into a popover text field. With voice, there’s a nearly-infinite array of natural language possibilities, all of them conversational. Much easier than hunting for a button or menu.
Better yet, voice simplifies our complex digital world. Even a simple text search can return hundreds of pages of results, but an audio-vocal UI doesn’t have this luxury. It has to curate your options.
On one hand, this increases design and development complexity. It’s a completely new user experience and requires new best practices. But, done properly, it reduces user complexity. With natural, spoken language, you can express yourself like talking to a friend and spend less time searching for information.
Looking to improve your users’ experience with your technology? Talk to us.
Other benefits include:
User experience design seeks frictionless experiences. Voice is called an “invisible interface” because there are no buttons, no menus, no manuals. No UI means fewer stumbling blocks. No menus means no digging for a hard-to-find button. And, ideally, voice supports an open domain of possible inputs rather than rote command memorization.
Voice- and audio-only interfaces
Eyeballs and fingers need not apply; voice UI (VUI) is useful in a variety of situations when a touch or visual interface is impractical, unwieldy or unsafe, including driving, cooking and medical care. Voice interfaces’ rising popularity raises the question of when touch UI is actually needed.
We’ll get into this more soon. However, in a world that’s still largely designed for sighted and able-bodied people, voice offers huge accessibility benefits.
The limitations of voice
Voice is incredibly powerful as an interface. That said, today’s voice tech still has some drawbacks and special considerations, including:
Initiating voice modes with touch is annoying. So is being told, “Sorry, I don’t understand that.” Voice must avoid unnecessary physical constraints and be robust enough to support a variety of commands. And, while voice can feel incredibly natural for many interactions, shoehorning it into places where it doesn’t work is a surefire way to lose your users’ trust.
Loud surroundings are still an issue, and recognition quality varies by language. English fares better, while recognizers may struggle with tonal languages like Cantonese. Reverberation poses a different challenge. (“Reverb” is a sound reflection with no perceived gap, unlike an echo. It’s common in music and on microphones with special audio effects.) And the methods used, including noise reduction and subsequent speech restoration, still need improvement.
It’s not always acceptable to talk on (or to) your phone. Walking down the street: yes. Busy subway: no. Movie theater: definitely not. Voice app designers need to consider when and how their apps can actually be used. If users might frequently feel social pressure not to speak to their app, it could be a serious problem for market success.
Natural language processing (NLP) has made huge advances, but language is complex. Regional differences in accent, pronunciation and vocabulary cause issues, along with sentence-level ambiguity. One famous example is the Groucho Marx quote: “This morning I shot an elephant in my pajamas. How he got in my pajamas I don’t know.” Humans still often do better than algorithms with these nuances.
Context and follow-up questions
Human interactions and language rely heavily on context. Ask, “Who played Han Solo in Star Wars?” and it’s understood you’re talking about a movie. Maybe you’d follow up with, “What year was episode four released?” or “Who played Leia?” You don’t have to specify that you’re talking about Star Wars; it’s implied. In human discussions, context is generally continuous until the topic changes. In fact, we have a term for changing the subject suddenly: it’s a “non sequitur”. But some AI treats every question and interaction independently; continuous context isn’t always guaranteed.
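Here is a small Python sketch of what carrying context across turns can look like. It assumes a fixed list of known topics and simple substring matching, both of which are simplifications for illustration; the `Conversation` class is hypothetical and not taken from any assistant’s SDK.

```python
class Conversation:
    """Remembers the last-mentioned topic so follow-up questions can reuse it."""

    def __init__(self, known_topics):
        self.known_topics = known_topics
        self.topic = None  # nothing has been discussed yet

    def resolve(self, question):
        """Return (topic, question), falling back to the remembered topic."""
        lowered = question.lower()
        for topic in self.known_topics:
            if topic.lower() in lowered:
                self.topic = topic  # an explicit mention updates the context
                break
        return self.topic, question
```

A system that treats every question independently behaves like a fresh `Conversation` on each turn: “Who played Leia?” arrives with no topic attached.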
Voice Across Seven Industries
Voice has many specific (and revolutionary) applications, from the expected to the surprising. Some you may have already encountered, while others are still on the horizon.
Voice recognition has been massively successful in call centers. Just think about the last time you called your bank. Was a nice robot lady your first point of contact? There’s a good reason for that: speech recognition systems can handle up to 85% of incoming calls. This improves efficiency and lowers costs. In the future, voice user interfaces (VUIs) will get smarter and better, routing customers to self-service modules and freeing up customer service and front-desk staff for only the most critical issues and tasks.
Who’s doing it?
- IBM Watson’s speech-to-text turns customer service calls into transcripts.
- Nuance is an industry leader in customer service speech recognition, and their transcription software is powerful, too.
Smart and connected homes
Home-based voice control goes beyond controlling lights. It’s a matter of security, mobility and accessibility. Imagine locking your door remotely, lowering the blinds, closing the garage door, adjusting the thermostat or even having a single word that wakes your house up or powers it down. Today’s home voice control is mostly limited to single-room solutions, but whole-house tech isn’t far off.
Who’s doing it?
- Amazon Echo/Echo Dot and Google Home/Home Mini control a variety of smart home devices, including the smart, color-changing Philips Hue bulbs.
- The Ecobee thermostat lets you use Amazon Alexa skills to adjust room temperature.
Payments and banking
Voice recognition can biometrically identify people based on speech patterns, vocal pitch and intonation. This extra level of security is like personalized two-factor authentication. (In the future, your unique vocal “fingerprint” could even be stored as part of a unique, immutable ID on the blockchain.) Until then, ask Amazon Echo to order groceries or have a virtual assistant pay your credit card bill. One thing’s for sure: with payments, saying is easier than swiping.
Who’s doing it?
- Mobile-only bank Atom uses voice and facial recognition for identity verification.
- Starbucks, Square and others have built voice-based payments into the Alexa and Google Assistant platforms.
Virtual and augmented reality
Virtual reality (VR) headsets totally obscure your vision, replacing everything with a virtual world and body. Unwieldy controllers have been the solution for interacting with virtual UI. But imagine issuing voice commands to an in-game squad without lifting a virtual finger. Or, in a VR teleconferencing session, having all your speech automatically translated and app-related voice commands automatically muted. Augmented reality (AR) combines virtual and physical worlds by superimposing digital objects on top of what you see, often using smartphones. However, AR apps also break immersion by asking users to tap on their screens to interact with these digital objects. Voice could solve this problem by removing the need for touch, creating a more seamless command interface.
Who’s doing it?
- MondlyVR is using virtual reality plus speech recognition to teach people new languages in conversational settings.
- IBM and Ubisoft added voice commands to the Star Trek VR game.
- Google Cloud Speech Recognition adds speech support for cross-platform, Unity-based AR/VR apps.
Voice in cars is still distracting. Error-prone systems draw drivers’ attention away from roads and many systems require interacting with a phone or in-dash screen. But, in the future, voice may be one of the only controls you need. You can already ask Alexa to unlock your car. But in an autonomous, self-driving car, who needs pedals? Or any buttons at all? Suddenly the perfectly-smooth interiors of science fiction transportation pods don’t seem so far-fetched.
Who’s doing it?
- Muse integrates all 25,000 Alexa skills with a smartphone and USB, Bluetooth, or AUX.
- Garmin Speak recently put Alexa into cars’ center-dash screens (aka “head units”).
- Amazon announced deals with BMW and Ford to directly integrate Alexa in their cars.
Apple Watch, Fitbit, Microsoft Band and other wearables have been hot commodities for a while. It’s only natural that voice is the next step. In the future, imagine making voice and video calls with nothing but a wearable and wireless headphones, or even using your voice to validate a purchase at the register.
Who’s doing it?
- MIT developed a wearable that measures emotion and tone in speech.
- Fujitsu has a medical speech translation wearable for foreign patients to communicate with doctors.
Voice is making tech more accessible and navigable for everyone, but especially people with disabilities. Individuals with partial or full motor impairments are often unable to use traditional touch, keyboard and mouse-based interfaces. Likewise, sight-impaired individuals have historically used screen-readers, which can be slow and error-prone.
Who’s doing it?
- Google’s Voice Access makes Android phones voice-navigable.
- iPhone’s award-winning VoiceOver narrates almost everything you can do on iOS.
Who’s Using Voice?
Voice users are dedicated and growing quickly. Looking to reach them? Let us help.
In 2016, 20% of queries on Google’s mobile app and on Android devices were voice searches, and by 2020 this number is predicted to reach 50%. By 2021, Ovum predicts that the native voice agent base will more than double, from 3.6 billion to 7.5 billion. There will be more digital assistants in the world than people.
According to Forbes, Answer Lab and Juniper Research, 63% of smart speaker users plan to buy another and 70% use theirs daily. By 2022, 55% of American households will have one; many will have more than two.
With adoption and engagement numbers like these, it’s worth taking a look at all the players in the field.
Amazon Echo (70.6% market share) and Alexa (3.9% market share)
Echo is Amazon’s line of smart speakers, while Alexa is the cloud-powered, AI voice agent powering them.
The smart speaker listens for the “Alexa” wake word, then you provide a command. For example: “Alexa, how is my commute looking?” Alexa’s various speech-powered abilities are called “skills”, of which it has over 15,000 and growing. Non-English skills are limited, so, for now, Echo (and Alexa) mostly dominate in North America. Third-party apps for Lyft, Uber and Domino’s pizza are just a few among these thousands of skills.
Since Amazon introduced the Alexa-enabled Echo in 2014, they’ve gained over 70% of the market share. An April 2017 eMarketer study found that 35.7 million Americans used voice-enabled digital assistants at least once monthly. 22 million units were sold in 2017 and business is booming, especially thanks to holiday gift purchases.
Aside from the usual smart speaker tasks, you can, naturally, also order from Amazon. Alexa and Cortana are now friends, meaning Alexa will soon reach Microsoft’s 145-million-user B2B and B2C market. In 2018, new first-party devices and third-party integrations will make Alexa available on even more devices beyond the Echo. Alexa is even part of a Delta smart faucet, making “Alexa, turn on the hot water,” an actual possibility.
- Add skills with the Alexa Skills Kit.
- Incorporate Alexa into hardware with the Alexa Voice Service.
- Or connect internet-capable devices.
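For a feel of what a custom skill involves, here is a minimal sketch of a skill backend written as an AWS Lambda handler in Python. It builds the response JSON by hand instead of using an SDK. The `CommuteIntent` name and the canned answers are hypothetical, and a real skill would also need an interaction model configured in the Alexa developer console; check the Alexa Skills Kit reference for the full request and response schema.

```python
def build_response(text, end_session=True):
    """Wrap plain text in the JSON envelope a custom skill returns to Alexa."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }

def lambda_handler(event, context):
    """Entry point for a Lambda-hosted skill; routes on the request type."""
    request = event["request"]
    if request["type"] == "LaunchRequest":
        # The user opened the skill without asking for anything specific.
        return build_response("Welcome. Ask me about your commute.",
                              end_session=False)
    if request["type"] == "IntentRequest":
        # "CommuteIntent" is a hypothetical intent defined for this sketch.
        if request["intent"]["name"] == "CommuteIntent":
            return build_response("Your commute looks clear.")  # placeholder answer
    return build_response("Sorry, I don't understand that.")
```

Alexa handles the speech recognition and intent matching in the cloud; the skill code only sees structured JSON and returns text to be spoken.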
Google Assistant (23.3%) and Home (23.8%)
Google Home and Home Mini are the search giant’s answer to the Amazon Echo smart speaker. Introduced in 2016, Google Assistant is the smart virtual helper powering both. It’s also found on Android and Google feature phones.
You wake the device with “OK, Google,” or “Hey, Google”, followed by a command, or “action”. While Google Assistant doesn’t support as many commands as Amazon Alexa, it does have some differentiators. Actions support a variety of native and third-party integrations, including Uber, Domino’s and Quora.
Google has the second most popular smart speaker and trounces everyone else in the virtual assistant marketplace, where it has the lion’s share thanks to Android phones and the Google Assistant app.
You can order from a wide variety of e-commerce sites, including Costco and Walmart. Google also supports Chromecast integration, multiple commands in a sentence and contextual follow-up questions.
Extend Google Assistant by building conversational actions using their SDK (30-minute Google Home tutorial).
Samsung Bixby (14.5%)
Bixby, Samsung’s smart digital assistant, was introduced with the Samsung Galaxy S8 and S8+ in early 2017. It’s one of the newest voice agents on this list.
Unlike the other assistants, Bixby doesn’t have a wake word; it uses a physical button. Needless to say, this has led to many scathing critiques, including that it’s a button at all.
This is one place where percent market share (indicating the number of installed devices) obviously doesn’t speak to actual usage. The dedicated button is so reviled that people have remapped it to Google Assistant instead. Sorry, Bixby.
A physical button, a terrible user experience, near-universal hatred and a promise from Samsung that Bixby is coming soon to a refrigerator near you.
Bixby 2.0’s developer SDK is still in private beta. Information is sparse on when it’s going public.
Apple HomePod and Siri (13.1%)
HomePod, Apple’s upcoming smart speaker, incorporates Siri, their ubiquitous smart digital assistant. While the speaker has been in development since 2014, Siri was first introduced on the iPhone 4S in 2011—one of the first modern examples of speech-enabled digital assistants.
Since its release, Siri has shown up in many Apple products. Along the way, Apple has made many improvements, including using AI to make Siri sound more human.
Popularity remains to be seen for the HomePod, although marketing is playing up the music angle heavily. Meanwhile, Siri still has a solid contingent, but usage and engagement have been dropping: the platform lost 7 million users from May 2016 to May 2017.
The HomePod heavily focuses on music, including automatic adjustments for surroundings like walls. While there are many concerns about Siri’s future, one of its biggest advantages is definitely multi-language support. (Siri has 21 languages localized for 36 countries, including regional accents.)
HomePod, iOS, watchOS, and Siri all use SiriKit for third-party apps and HomeKit for smart home integration.
Microsoft Cortana (2.3%)
Cortana is Microsoft’s Halo-inspired voice agent, available across Windows 10 devices and Xbox One. A Cortana app puts the assistant on iPhone and Android, while Alexa integration means Cortana can access Alexa, and vice versa.
Cortana learns about your behavior, including preferred contacts, quiet hours, travel itineraries and calendar information. All this information is stored in an editable “Notebook”. So, unlike many digital assistants, you can tweak what’s being stored for greater accuracy.
For now, Cortana has just 2.3% of the digital assistant market, but that’s still at least 145 million users. Most importantly, Microsoft has been experimenting with Android, iPhone and Alexa integrations. And usage and engagement have dramatically increased.
Cortana is focused on OS-based productivity scenarios rather than smart home or internet of things (IoT). The digital assistant also understands context and handles follow-up questions well. On iPhone, it’s contextually smarter about many things than Siri, especially if you roam Cortana across devices.
- Add skills to Cortana (30-minute tutorial).
- Use the Devices SDK to add Cortana to hardware.
Baidu is China’s search giant. It’s also one of the world’s largest internet and AI companies. In early 2017, they acquired Raven Tech, a Y Combinator startup whose app and voice assistant integrate seamlessly with third-party service providers like Uber. Baidu is all-in on voice; in fact, an employee told Tech in Asia that, “2017 will be a year of conversational computing.”
So, what’s in store for Baidu? They recently announced three smart speakers. Baidu’s AI has shown 97% voice recognition accuracy in tests, so it won’t be surprising if these have voice capabilities. Meanwhile, DeepVoice imitates any accent with just 30 minutes of speech data.
And, as a Chinese company, Baidu has the domain knowledge needed to tackle the linguistic challenges other brands struggle with. Baidu is definitely one to watch worldwide.
Mozilla’s open source speech software, DeepSpeech, is very good, with just a 6.5% error rate (Microsoft’s is 5.5%). Open source solutions are worth considering if Amazon’s and Google’s solutions don’t work for you and you need to build your own speech recognition AI but don’t want to start from scratch.
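Error rates like these are usually reported as word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the system’s transcript into the correct one, divided by the number of words in the correct one. Assuming that is the metric behind the figures above, a minimal Python sketch of the calculation looks like this (the function name and the lowercase, whitespace-split tokenization are choices made for this example):

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Running this over your own recordings, with your own users’ accents and background noise, is a more reliable way to compare engines than published benchmark numbers.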
If you’re looking for something to run on a Raspberry Pi, look no further than Jasper. It’s an always-on voice interface. It’s open source, too. It can do everything from getting information from Wikipedia to updating your social networks. Raspberry Pi’s flexibility means you can theoretically add voice functionality to anything from a smart mirror to a beer fridge.
SoundHound’s Houndify promises conversational voice interfaces for any internet-connected device or app, while letting you keep your brand and users. It can handle multiple questions in a single sentence, plus context awareness and follow-up questions (some of which you can see in action here). By combining both speech recognition and language understanding, SoundHound differentiates itself from the competition and is one of the most accurate platforms, too.
Developing Voice Applications
First: is voice justified?
At Jakt, we’re always asking whether the solution is appropriate for the given problem. For example, is touch unwieldy? Does voice address a key issue? Does it actually simplify the interaction somehow? Voice shouldn’t just be something cool to add; it should be something that benefits people and eases their interactions.
If you’re ready to start implementing voice, shoot us a message.
From a business perspective, there also need to be enough people to justify using a particular voice platform. We might love voice, but creating a voice app for its own sake isn’t just a waste of resources, it’s potentially alienating for your users (as Samsung is learning with Bixby).
So, when should you use voice on its own versus pairing it with touch and a screen? This is a question that even user experience (UX) experts are still exploring. It’s one that isn’t likely to be perfectly answered any time soon. Nonetheless, here are some ideas to get your gears turning.
Voice-only and voice-audio interfaces are best…
- When it’s impractical or dangerous for people to look at and touch their phone, like when their hands are full, they’re rollerblading or they’re driving a car.
- In situations where you’d normally have a conversational interaction, anyway (think customer service, asking a friend to look up showtimes or even a rapid-fire trivia game).
- Where it’s more efficient to use your voice, like when numerous results are returned but it’s tedious to pore over the options. If it’s low-stakes to curate just a few top candidates for a person, voice can be a great solution.
- When it’s unnecessary to see what you’re talking about. You probably don’t need to see or touch movie showtimes to make a decision about them.
Voice-touch and voice-visual interfaces are best…
- In fast-paced environments, like a VR, AR or MR game, where you’ll be engaging all your senses and just want to supplement sight and touch (for example, imagine using your voice to issue commands to an AI-controlled game character).
- In public or quiet environments where using VUI isn’t appropriate or private enough. You could imagine optionally using voice for confirmations like “yes” and “no” and switching to a touchscreen for longer or more private interactions.
- Where you need to see what you’re talking about. For example, no matter how well Alexa describes those shoes or that tie, there are just some things you (probably) won’t buy sight-unseen.
- When you need visual cues for confidence, like self-driving cars. Smashing Magazine’s article on combined VUI-GUI interfaces covers how trust in self-driving cars is a big issue. Giving up control from the driver’s seat doesn’t come easily. In the world of autonomous cars, screens and touch-interfaces provide crucial insight into what’s happening.
Touch-only interfaces are best…
- When it’s impractical, embarrassing or socially inappropriate to use voice. Your movie theater app probably shouldn’t be voice-only, unless your users enjoy being pelted with popcorn.
- In consistently loud environments (i.e. those requiring users to raise their voices or shout for sustained periods, which causes vocal strain), or for tasks requiring significant talking and input (aside from dictation, where someone is intentionally choosing not to type).
- Where voice is a security issue. Google went through a pretty serious controversy in late 2017 when a Google Home Mini bug meant it was silently recording everything users said and sending the audio to the cloud. There’s definitely something to be said for carefully considering whether voice (especially cloud-based voice processing) is necessary and what that means for your user demographic.
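These rules of thumb can be turned into a rough first-pass checklist. The Python sketch below is one possible encoding, not a definitive rule set: the parameter names, the order in which the checks run and the fallback to a combined interface are all assumptions made for illustration.

```python
def suggest_interface(hands_busy=False, needs_visuals=False,
                      speaking_inappropriate=False, loud_environment=False,
                      sensitive_data=False):
    """Suggest 'touch-only', 'voice-touch' or 'voice-only' from yes/no answers."""
    # Blockers for speech come first: social setting, noise or privacy.
    if speaking_inappropriate or loud_environment or sensitive_data:
        return "touch-only"
    # People want to see what they're buying or what the system is doing.
    if needs_visuals:
        return "voice-touch"
    # Hands and eyes are occupied, and nothing needs to be shown.
    if hands_busy:
        return "voice-only"
    # Assumed default: offer both and let users pick.
    return "voice-touch"
```

For example, `suggest_interface(hands_busy=True)` suggests a voice-only interface, while adding `needs_visuals=True` moves the suggestion to a combined voice-touch design.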
Second: what capabilities will you need?
Before picking a platform, it’s key to consider your voice recognition needs. Every voice recognition system has its own strengths and weaknesses. What you’re looking for will determine what you choose (or what you choose to develop).
- Context: If your voice system needs to retain context and answer follow-up questions instead of treating each interaction like an isolated instance, that means you either need to choose a platform which supports context-awareness or design your own context-aware system.
As we’ve already covered, context-awareness typically requires understanding what entity is being discussed across multiple interactions (for example, Star Wars), as well as what that entity is (a movie) and potentially even how it would be classified or tagged (science fiction).
The most context-friendly platform? Google Home and Google Assistant.
- Language, accent and dialect: If your voice system is a globetrotter, language, accent, and dialect will all be factors. But does it need to speak different languages or just recognize them? Disambiguation between different regional accents, speech patterns and grammatical conventions will be a big factor, too.
The most well-localized platform? Siri (but Baidu wins in China by a mile).
- Noise cancelling and accuracy: Depending on your scenario, noise removal may be more or less important. If your voice app will be used while walking down the street, in restaurants or while driving, background noise is likely. Accuracy in these settings will be more challenging than a quiet house.
The most accurate platform? Baidu, whose 96% accuracy is quickly approaching the gold standard of 99%, which is as accurate as people. Siri also performs extremely well in noisy environments.
- Conversational complexity: Your scenarios will strongly determine your product’s complexity. Transcription doesn’t require understanding words’ meaning, but voice commanding and conversational voice interfaces do. Voice recognition won’t understand what you’re saying, but it’ll understand who’s saying it. If you need a combination of all of these, that’s the most complex of all.
The platform with the most conversational complexity? It depends on your goals. Amazon Echo and Google Home both have personalization based on voice recognition, while Google excels at speech-to-text.
Third: what platform will you use?
Do you want your voice app to be on mobile phones and tablets? Smart speakers? Wearables? VR? Are you building your own hardware? Once you’ve made this decision, you’ll know more about whether you want to develop for Amazon Alexa and Echo, Google Assistant and Home, experiment with an open-source solution or develop your own voice-recognition AI. Need help deciding between all of them? Check out the breakdowns of the voice platforms above or shoot us an email with your specific questions.
As with any platform choice, each of these choices comes with its own trade-off.
- Private platform (like Amazon Echo, Google Home, etc.): You get the documentation and support of that platform but also the inherent limitations. If it doesn’t do what you need it to, you can put in a feature or change request, but you might be out of luck until they add support.
- Independent platform (like SoundHound Houndify): Similar to a private platform, but you don’t have to hitch your wagon to the Amazon or Google brands.
- Open platform (like Mozilla DeepSpeech): You gain access to a community of like-minded developers working toward a common goal. Also, you can usually modify the source code for your needs. However, there are potential security risks or flaws.
- Custom platforms (DIY or work with an agency): Developing it yourself means full control over every aspect of design and development, but this also comes at the cost of time and money, like finding the right talent. If you work with a full-stack development agency (like Jakt), they can bring the talent and find voice-recognition experts for you.
The Future of Voice
The current trend of voice through digital assistants and smart speakers is just the beginning. Not too long ago, the greatest hurdle was speech recognition and machine learning. Now, the floodgates are open.
Moving forward, advances will include:
- Individual voice recognition: Recognizing who’s speaking for security, identity, home automation and more. Amazon and Google’s offerings both already have individual, voice-based personalization. Expect even more of this in the future, plus biometrics.
- Smarter contexts: Text-based search already uses location, time of day and personal preferences, but in many cases voice search doesn’t. Expect this to change soon.
- Natural conversations: Alexa may have 15,000 skills, but they rely on specific phrasing. No one can memorize that many commands, nor should they have to. Conversational VUI is the future. Someday, discussions with virtual assistants will be practically indistinguishable from human ones.
- Bridging ecosystems and platforms: Collaborations are already emerging, including between Alexa and Cortana. The more integration points and ecosystem crossovers, the more people can control various devices and services from a single endpoint. The same goes for various voice-enabled IoT devices and their unifying protocols (like Bluetooth, Z-Wave and ZigBee).
- Voice everywhere: From toilets and faucets to refrigerators and TVs, 2018 is the year of voice-enabled-everything.
- Security considerations: As more voice devices are added to homes, pockets and purses with the ability to recognize who’s talking, expect the conversation about security and user data to intensify.
Jakt’s Bold Vision for Voice
Today, search is the main service provider. It’s the first step in the user journey. Searching for sushi in Williamsburg means picking a service (Postmates, Caviar, Uber Eats), then searching for sushi. If you’ve ever spent twenty minutes comparing between different delivery services, you’ve gotten pulled into the same time sink. The onus of evaluating what’s cheapest, tastiest and fastest has been put on us for far too long. But, really, we should be asking for sushi, and a service layer should be the one deciding which option is the best, regardless of the delivery provider. It’s time for analysis paralysis to be offloaded from our brains to voice search.
At Jakt, our ultimate vision (and prediction) for the future is that voice will become the new layer between us and the oversaturated service economy. Navigating endless options is simply too unwieldy for audio/voice interfaces. This is why voice will commoditize services and drive prices down as service providers compete for business. Instead of asking Alexa to launch Uber Eats, you’ll ask for sushi and get a curated set of options. Instead of ordering an Uber, you’ll ask for a ride and the cheapest ride share service will show up. Most people are indifferent to which services they’re using; they just want the best option for the lowest price.
Our vision for a voice-first world is all about context, curation and healthy competition. Best of all, interactions will be simple and swift; no more sifting through endless options or analysis paralysis.
Excited about developments in voice apps and technology? Let’s get in touch. We’d love to help you bring your next voice-enabled app, and a voice-powered future, to life.