When Amazon’s first Alexa-enabled smart speaker debuted in 2014, it was a novelty: a voice-activated natural language processing interface that could perform many simple tasks.
Fast forward to today, and the platform has rapidly expanded into its own ecosystem of connected devices. With tens of thousands of Alexa-enabled devices available and hundreds of millions of units sold, Alexa has become a nearly ubiquitous virtual assistant.
Yet although Alexa is now integrated into everything from TVs to microwave ovens to headsets, Amazon's vision of ambient computing is still in its infancy. Natural language processing and artificial intelligence have made great strides toward serving a potential market of billions of users, but there remains a lot of room for improvement.
Looking ahead, Amazon hopes to eventually enable these devices to understand and support users much as a human assistant would. Getting there will require significant progress in several areas, including contextual decision-making and reasoning.
To better understand the potential of Alexa and ambient computing, I asked Rohit Prasad, senior vice president and head scientist for Alexa, about the future of the platform and Amazon's goals for an increasingly intelligent virtual assistant.
Richard Yonck: Alexa is sometimes described as "ambient computing." What are some examples or use cases of ambient AI?
Rohit Prasad: Ambient computing is technology that is there when you need it and fades into the background when you don't. It can anticipate your needs and make life easier by always being available without intruding. With Alexa, for example, you can use Routines to automate your home, such as turning on the lights at sunset, or you can use Alexa Guard to have Alexa proactively notify you when it detects sounds like breaking glass or a smoke alarm.
Yonck: In your recent CogX presentation, you mentioned Alexa performing "reasoning and autonomy on your behalf." Compared with where we are now, what are some near-future examples of this?
Prasad: Today, we have a feature called Hunches, where Alexa suggests actions based on anomalous sensor data, from reminding you that the garage door is open when you're going to sleep, to making it easy to reorder ink when your printer is running low. More recently, owners of the Ring Video Doorbell Pro can choose to have Alexa act on their behalf, greeting visitors and providing information or instructions for package deliveries.
More broadly, we are making more contextual decisions and have made early progress in reasoning and autonomy through self-learning, or Alexa's ability to improve and expand its capabilities without human intervention. Last year, we introduced a new Alexa capability that can infer a customer's latent goals. If a customer asks about the weather at the beach, Alexa may use that request and other contextual information to infer that the customer may be interested in a trip to the beach.
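To make that idea concrete, here is a toy sketch of how latent-goal inference might work at a high level. This is purely illustrative and not Amazon's implementation; the candidate goals, trigger vocabulary, and scoring rule are all invented for the example.

```python
# Toy illustration (not Amazon's implementation) of inferring a latent goal
# from an utterance plus contextual signals. Goals, vocabulary, and the
# scoring rule are hypothetical.

CANDIDATE_GOALS = {
    "plan_beach_trip": {"weather", "beach", "sunscreen", "tide", "drive"},
    "cook_dinner": {"recipe", "timer", "preheat", "oven"},
}

def infer_latent_goal(utterance: str, context: set, threshold: int = 2):
    """Return the best-scoring candidate goal, or None if evidence is weak."""
    tokens = set(utterance.lower().split()) | context
    scored = {goal: len(tokens & vocab) for goal, vocab in CANDIDATE_GOALS.items()}
    best = max(scored, key=scored.get)
    return best if scored[best] >= threshold else None

# "What's the weather at the beach?" plus a recent sunscreen purchase
print(infer_latent_goal("what's the weather at the beach", {"sunscreen"}))
# -> plan_beach_trip, so the assistant might follow up with a travel suggestion
```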
Yonck: Edge computing is an approach that performs some of the computation on or near the device rather than in the cloud. Do you see Alexa's processing eventually being done at the edge to reduce latency, support federated learning, and address privacy concerns?
Prasad: From the moment we launched Echo and Alexa in 2014, our approach has combined processing in the cloud, on the device, and at the edge. The relationship is symbiotic. Where the computation occurs depends on several factors, including connectivity, latency, and customer privacy.
For example, we learned that customers want basic functionality to keep working even if they happen to lose their network connection. So in 2018 we introduced a hybrid mode in which smart home intents (including controlling lights and switches) continue to work even when the connection is lost. The same applies to taking Alexa with you, including in cars, where connectivity may be intermittent.
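As a rough illustration of this hybrid approach, the sketch below routes a handful of smart home intents to on-device handling when the connection drops and defers everything else to the cloud. The intent names and routing logic are assumptions for illustration, not Alexa's actual architecture or API.

```python
# Minimal sketch of hybrid intent routing: handle a small set of smart home
# intents locally when connectivity is lost, defer everything else to the
# cloud. Intent names and behavior are illustrative only.

LOCAL_INTENTS = {"TurnOnLight", "TurnOffLight", "ToggleSwitch"}

def route_intent(intent: str, slots: dict, cloud_available: bool) -> str:
    if cloud_available:
        return f"cloud: {intent} {slots}"        # full NLU and skills in the cloud
    if intent in LOCAL_INTENTS:
        return f"on-device: {intent} {slots}"    # offline fallback for core intents
    return "unavailable: requires connectivity"  # e.g. web search, music streaming

print(route_intent("TurnOnLight", {"room": "kitchen"}, cloud_available=False))
```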
In recent years, we have pursued a variety of techniques to make neural networks efficient enough to run on devices, minimizing their memory and compute footprint without losing accuracy. Now, with the help of neural accelerators like our AZ1 Neural Edge processor, we are creating new experiences for customers, such as natural turn-taking. This is a feature we will bring to customers this year; it uses on-device algorithms that fuse acoustic and visual cues to infer whether the participants in a conversation are interacting with each other or with Alexa.
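One common way to combine signals like these is late fusion of per-modality confidence scores. The sketch below is a hedged illustration of that general technique, not Amazon's natural turn-taking system; the weights, threshold, and score names are invented.

```python
# Hedged sketch of device-directedness detection for turn-taking: fuse an
# acoustic confidence and a visual (gaze) confidence into one decision.
# Weights and threshold are made up for illustration.

def is_addressing_device(acoustic_score: float,
                         gaze_score: float,
                         w_acoustic: float = 0.6,
                         w_gaze: float = 0.4,
                         threshold: float = 0.5) -> bool:
    """Late fusion of per-modality confidences (each in [0, 1])."""
    fused = w_acoustic * acoustic_score + w_gaze * gaze_score
    return fused >= threshold

# Speech sounds device-directed, but the speaker is looking at another person.
print(is_addressing_device(acoustic_score=0.7, gaze_score=0.1))  # False
print(is_addressing_device(acoustic_score=0.8, gaze_score=0.9))  # True
```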
Yonck: You have described several capabilities we will need in the social bots and task bots of the future as pillars of AI. Can you share projected timelines for any of them, even broad ones?
Prasad: Open-domain, multi-turn dialogue remains an unsolved problem. However, I am thrilled to see students in academia advancing conversational AI through the Alexa Prize competition tracks. Participating teams push the state of the art by developing better natural language understanding and dialogue policies to make conversations more engaging. Some are even working on recognizing humor and generating humorous responses, or choosing jokes that are relevant to the context.
These are all AI problems that will take time to solve. While I believe we are 5 to 10 years away from achieving the goals of these challenges, one area of conversational AI that I am particularly excited about is work for which the Alexa team recently won a Best Paper Award: injecting commonsense knowledge, explicitly from knowledge graphs and implicitly from large-scale pre-trained language models, to give machines greater intelligence. Work like this will make Alexa more intuitive and smarter for our customers.
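The general recipe behind explicit knowledge injection can be illustrated with a toy example: retrieve relevant facts from a knowledge graph and include them in the input the language model sees, while the model's pre-trained weights supply the implicit commonsense. The graph contents and prompt format below are assumptions, not the award-winning system itself.

```python
# Illustrative sketch of explicit commonsense injection: look up relevant
# triples in a tiny knowledge graph and prepend them to the model input.
# Graph contents and prompt format are invented for this example.

KNOWLEDGE_GRAPH = [
    ("umbrella", "UsedFor", "staying dry in the rain"),
    ("rain", "CausesDesire", "staying indoors"),
    ("beach", "HasProperty", "sandy"),
]

def build_knowledge_augmented_input(utterance: str, max_facts: int = 2) -> str:
    tokens = set(utterance.lower().split())
    facts = [f"{h} {r} {t}" for h, r, t in KNOWLEDGE_GRAPH if h in tokens]
    header = " | ".join(facts[:max_facts])
    return f"[knowledge] {header} [dialogue] {utterance}"

# The augmented string would then be fed to a pre-trained language model,
# whose own weights contribute the implicit commonsense.
print(build_knowledge_augmented_input("should I take an umbrella today"))
```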
Yonck: For open-domain conversations, you mentioned combining transformer-based neural response generators with knowledge selection to generate more engaging responses. In simple terms, how does knowledge selection work?
Prasad: We are pushing the boundaries of open-domain dialogue, including through the inventions we continue to make for the university teams participating in the Alexa Prize SocialBot Challenge. One of those innovations is a transformer-based language generator, i.e., a neural response generator, or NRG. We have extended the NRG to produce better responses by integrating dialogue policies and grounding it in world knowledge. The policy determines the best form of response: for example, where appropriate, the next turn should acknowledge the previous turn and then ask a question. To incorporate knowledge, we index publicly available knowledge on the web and retrieve the sentences most relevant to the conversational context. The goal of the NRG is to produce the best response that both conforms to the policy decision and incorporates that knowledge.
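In outline, then, knowledge selection is a retrieval step: score indexed sentences against the dialogue context and pass the best one, along with the policy decision, to the generator. The toy sketch below illustrates that flow; the corpus, the overlap-based scorer, and the stand-in generator are simplifications of what a real transformer-based NRG would use.

```python
# Toy sketch of knowledge selection for a neural response generator (NRG):
# score indexed sentences against the dialogue context and condition the
# generator on the best one. Corpus, scorer, and generator are stand-ins.

KNOWLEDGE_INDEX = [
    "The Mariana Trench is the deepest part of the world's oceans.",
    "Blue whales are the largest animals ever known to have lived.",
    "The Amazon rainforest produces about 20 percent of Earth's oxygen.",
]

def select_knowledge(context: str) -> str:
    """Retrieve the indexed sentence with the most word overlap with the context."""
    ctx = set(context.lower().split())
    return max(KNOWLEDGE_INDEX,
               key=lambda s: len(ctx & set(s.lower().split())))

def generate_response(context: str, policy: str) -> str:
    knowledge = select_knowledge(context)
    # A real NRG would be a transformer conditioned on context, policy, and
    # knowledge; here we simply show the inputs it would receive.
    return f"[policy={policy}] [knowledge={knowledge}] [context={context}]"

print(generate_response("tell me about blue whales", policy="acknowledge_then_ask"))
```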
Yonck: For naturalness, you would presumably need substantial grounding in conversational context, learning, storing, and accessing large amounts of personal information and preferences in order to give each user uniquely personalized responses. That sounds very compute- and storage-intensive. Where is Amazon's hardware now, relative to where it needs to be to ultimately achieve this?
Prasad: This is where edge processing comes into play. To provide the best customer experience, certain processing must happen locally (such as computer vision to determine who in the room is addressing the device). This is an active area of research and invention, and our teams are working hard to make machine learning, both inference and model updates, more efficient on the device. In particular, I am excited about large pre-trained deep learning models that can be distilled for efficient processing at the edge.
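Knowledge distillation, one standard way to shrink a large pre-trained model for on-device use, trains a compact student to match the softened output distribution of a large teacher. The sketch below illustrates only the distillation loss on made-up logits; it is a generic textbook formulation, not Amazon's method.

```python
import math

# Generic distillation-loss sketch: a small "student" model is trained to
# match the temperature-softened outputs of a large "teacher" so it can run
# efficiently on-device. Logits here are invented; a real system would use a
# deep learning framework and real training data.

def softmax(logits, temperature=1.0):
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between softened teacher and student output distributions."""
    p = softmax(teacher_logits, temperature)   # teacher's soft targets
    q = softmax(student_logits, temperature)   # student's predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.2]   # large cloud model's logits for one example
student = [3.0, 1.5, 0.5]   # small on-device model's logits
print(f"distillation loss: {distillation_loss(teacher, student):.4f}")
```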
Yonck: What do you see as the biggest challenge in achieving a fully realized ambient AI?
Prasad: The biggest challenge in realizing our vision is moving from reactive responses to proactive assistance, where Alexa can detect anomalies and alert you (for example, the Hunches reminder that the garage door was left open) or anticipate your needs by completing latent goals. While an AI could be pre-programmed to provide this kind of proactive assistance, that approach won't scale given the countless use cases.
Therefore, we need to move toward more general intelligence, which is the ability of an AI to: 1) perform multiple tasks without requiring significant task-specific intelligence, 2) adapt to variability within a set of known tasks, and 3) learn entirely new tasks.
For Alexa, this means more self-learning without the need for human supervision; significantly reducing the burden on developers building conversational experiences by making it easier to integrate Alexa into new devices, along with more self-service by letting customers directly teach Alexa new concepts and personal preferences; and a better understanding of the state of the environment, so Alexa can proactively anticipate customer needs and assist them seamlessly.