Now that we’ve outlined the fundamentals, here are some of our learnings on how to approach development, along with the technical lessons that will help to ensure a robust, production-ready Skill.
Defining a development approach
1. Get focused: great Skills are highly reliable, while unexpected answers are jarring. We have found that it’s better not to overload your Skill with functionality. The more complexity you add to your Skill, the greater the margin for error. We recommend that you focus your Skill around two or three well-understood outcomes.
2. Design the voice user interface flow: building a voice interface is complicated. There are many possibilities in flight, which ‘branch’ exponentially. In order to develop Skills efficiently, invest some time up front to map out the Skill and its flow. We found one of the best ways to start this process is to sit down with some representative users and have a conversation - where you pretend to be Alexa. This is a very valuable exercise if you want to map out the potential conversational flows that a person naturally wants to use to achieve certain tasks. You can then use these flows as the basis to design your voice interface.
3. Conversations over commands: it sounds obvious for a voice-enabled speaker, but conversation really is key for the Amazon Echo (or any Alexa-enabled device). By that we mean there is a difference between a conversation and a verbal command. Using the Amazon Echo should feel natural - a user can ask a question, for example, without thinking about how to frame it, and receive a helpful, accurate response in reply.
One of the main challenges of non-graphical interfaces is precisely this issue of discoverability - helping the user find and access the information or service they want without visual elements and cues to guide them. If a Skill requires a user to memorise multiple, relatively fixed verbal commands, it puts a great deal of responsibility on the user to remember the precise phrasing and syntax needed - just to find out the latest weather, for example, or what time the next train is.
This is where conversation comes in. You should design your Skill in a way that lets the user know what they can do, and guides them gently and naturally along the way. One simple way to do this is for your Skill to remind the user, at launch, of the things they can ask it to do.
4. Prepare for failure: it’s inevitable that things will occasionally go wrong when interacting with an Alexa Skill. During our work to date, we have seen that there are a few common causes of error. Inaccurate recognition of voice commands, for example, can result from the user misunderstanding the question being asked, and this can throw a conversation off on a tangent.
Equally, the API that the Skill relies on to do its work - such as booking a ticket - may be offline. You need to design the Skill to manage these errors and to provide alternative responses that get the user back on track as quickly as possible. As a principle, responding with a reply like "Sorry, something has gone wrong." or, worse, quitting without acknowledgement is a sign of weak voice interface design.
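As a rough illustration of this principle, here is a minimal Python sketch of a handler that recovers from an offline booking service. The response dictionary follows the general shape of the Alexa Skills Kit JSON response, but the names build_response, handle_book_ticket and BookingUnavailable, and the booking function passed in, are our own assumptions for the example rather than part of any Amazon library.

```python
def build_response(speech, reprompt=None, end_session=False):
    """Build a dict in the general shape of an Alexa Skills Kit JSON response."""
    response = {
        "outputSpeech": {"type": "PlainText", "text": speech},
        "shouldEndSession": end_session,
    }
    if reprompt is not None:
        response["reprompt"] = {
            "outputSpeech": {"type": "PlainText", "text": reprompt}
        }
    return {"version": "1.0", "response": response}


class BookingUnavailable(Exception):
    """Raised by the (hypothetical) booking client when its API is offline."""


def handle_book_ticket(destination, book_ticket):
    """Try to book; on failure, say what happened and offer a way forward.

    `book_ticket` is a hypothetical callable that returns a booking
    reference or raises BookingUnavailable.
    """
    try:
        reference = book_ticket(destination)
    except BookingUnavailable:
        # Keep the session open and steer the user towards something useful.
        return build_response(
            "I can't reach the booking service right now, so I couldn't "
            f"book your ticket to {destination}. "
            "Would you like to hear the next departures instead?",
            reprompt="Shall I read out the next departures?",
        )
    return build_response(
        f"Your ticket to {destination} is booked. "
        f"Your reference is {reference}.",
        end_session=True,
    )
```

The point of the sketch is the except branch: it names the problem, keeps the session open, and proposes an alternative rather than ending the conversation.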
5. Make sure you have plenty of sample Utterances: in order for Alexa to adapt to different ways users can ask for things, you need a good collection of sample Utterances. Whilst having more than five sample Utterances per Intent is recommended, our experience also shows that there can be an upper limit where too much overlap causes collisions and confusion.
6. Testing in the real world is vital: Amazon provides a number of tools to test Skills, and the associated web service, using text commands. However, nothing beats testing with real people in order to gain a proper understanding of how they will actually interact with your Skill. This will help to refine your Utterances. Testing with representative users on real hardware will enable you to identify the most likely points for failure within the voice interface, and should allow you to identify the parts that need to be made more robust.
The technical lessons
7. Validate Slots: it's important to be aware that Slots do not act like enumerations. They do not filter user inputs: instead, a Slot represents a user input. In custom Slots, the list of example values is used to help the Alexa Voice Service (AVS) match speech to specific information, but values outside that list can still reach your Skill. For example, the built-in date Slot lets Alexa know to look out for a date in that part of an Utterance. As a result, if you need to classify or categorise user inputs (for example, knowing which item a user is ordering from a catalogue), you should use Slots to capture that input - but validate the captured value in your own code, using traditional enums to give it a type.
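To make the distinction concrete, here is a minimal Python sketch, assuming a catalogue of three made-up items. The helper get_slot_value reads the value from the intent.slots structure of an incoming request; CatalogueItem and classify_item are hypothetical names introduced purely for illustration.

```python
from enum import Enum


class CatalogueItem(Enum):
    """The items our (hypothetical) Skill actually knows how to order."""
    TEA = "tea"
    COFFEE = "coffee"
    HOT_CHOCOLATE = "hot chocolate"


def get_slot_value(request, slot_name):
    """Return the raw spoken value of a Slot, or None if it was not filled."""
    slots = request.get("intent", {}).get("slots", {})
    return slots.get(slot_name, {}).get("value")


def classify_item(slot_value):
    """Map a raw Slot value onto CatalogueItem; None means 'not recognised'."""
    if not slot_value:
        return None
    try:
        return CatalogueItem(slot_value.strip().lower())
    except ValueError:
        # The Slot captured something outside our catalogue.
        return None
```

A handler would call classify_item(get_slot_value(request, "Item")) and, if the result is None, ask the user to choose again instead of passing an unknown value further into the Skill.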
8. Simplify Intents with context: to successfully make your Skill conversational, you will need to deal with many potential branches of conversation flow. If you treat each branch as a separate Intent, the Skill will quickly swell and become difficult to manage. To get around this challenge, you need to harness the power of context within a conversation. For example, we found it useful to have a single positive and a single negative Intent to handle any time the user is replying to a question. The Intent then knows what the user is responding to based on context and state as described earlier.
9. You need a state machine: any time you give Alexa a new command using the Wake Word, you are starting a new session. The session object is exposed in the Alexa Skills Kit (ASK) and has a key-value store to cache any important session-scoped information. However, this alone is not enough to keep track of conversations in more complex Skills, or to handle generic Intents properly. We have found it vital to implement some form of finite state machine in order to track and maintain the state of your Alexa session.
For example, after Alexa gives the user an Ask Response, you can update the state machine with the question Alexa just asked. That way, when a generic ‘yes’ or ‘no’ Intent is triggered by the user answering the question, the Skill just needs to check the state machine to determine the context behind the yes or no answer.
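A minimal Python sketch of this idea follows, assuming the current state is stored as a string under a "state" key in the session's key-value store. The states, the transition table and the spoken text are invented for the example; AMAZON.YesIntent and AMAZON.NoIntent are Alexa's built-in yes and no Intents.

```python
from enum import Enum


class State(Enum):
    """Questions Alexa may be waiting on (hypothetical states)."""
    IDLE = "idle"
    CONFIRM_BOOKING = "confirm_booking"
    OFFER_DEPARTURES = "offer_departures"


# (current state, intent name) -> (speech, next state)
TRANSITIONS = {
    (State.CONFIRM_BOOKING, "AMAZON.YesIntent"):
        ("Okay, booking your ticket now.", State.IDLE),
    (State.CONFIRM_BOOKING, "AMAZON.NoIntent"):
        ("No problem. Would you like to hear the next departures instead?",
         State.OFFER_DEPARTURES),
    (State.OFFER_DEPARTURES, "AMAZON.YesIntent"):
        ("Here are the next departures.", State.IDLE),
    (State.OFFER_DEPARTURES, "AMAZON.NoIntent"):
        ("Okay. Just ask if you need anything else.", State.IDLE),
}

FALLBACK = (
    "Sorry, I'm not sure what you're answering. "
    "You can ask me to book a ticket.",
    State.IDLE,
)


def handle_reply(session_attributes, intent_name):
    """Resolve a generic yes/no Intent against the question last asked.

    Returns the speech text and the updated session attributes.
    """
    try:
        state = State(session_attributes.get("state", State.IDLE.value))
    except ValueError:
        state = State.IDLE  # unknown or corrupted state: start afresh
    speech, next_state = TRANSITIONS.get((state, intent_name), FALLBACK)
    # Copy rather than mutate, then record the new state for the next turn.
    return speech, dict(session_attributes, state=next_state.value)
```

Because every yes or no is looked up against the stored state, adding a new question means adding rows to the table rather than creating new Intents.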
10. Use a phrase map: your Skill’s web service interacts with the hardware by sending text-based responses to the AVS. Rather than having methods in your Skill own the output text and pass it straight to a user’s Echo, we recommend using some kind of phrase map that associates a token for each response with the actual text. This could be as simple as a key-value store that maps each state to the response Alexa should give.
Putting an extra layer of abstraction between the text, your Skill, and Alexa enables you to write more readable, robust code with proper separation of concerns. It also allows you to standardise all the text in one location. With a view to applying a Skill to new languages as they become available, this pattern keeps the friction of translation and per-language tweaking of your Skill to a minimum.
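As a simple illustration, a phrase map can be sketched in Python as a nested dictionary keyed by locale and then by token. The tokens, the wording and the phrase helper below are assumptions made for the example, not part of the Alexa Skills Kit.

```python
DEFAULT_LOCALE = "en-GB"

# locale -> token -> response template (all text lives in one place)
PHRASES = {
    "en-GB": {
        "welcome": "Welcome. You can ask me to book a ticket "
                   "or check departures.",
        "booking_confirmed": "Your ticket to {destination} is booked.",
    },
    "de-DE": {
        # Partially translated: missing tokens fall back to the default.
        "booking_confirmed": "Ihr Ticket nach {destination} ist gebucht.",
    },
}


def phrase(token, locale=DEFAULT_LOCALE, **values):
    """Look up the text for a token, falling back to the default locale."""
    template = PHRASES.get(locale, {}).get(token)
    if template is None:
        template = PHRASES[DEFAULT_LOCALE][token]
    return template.format(**values)
```

Handlers then refer only to tokens, for example phrase("booking_confirmed", locale="de-DE", destination="Leeds"), so the wording can be edited or translated without touching the Skill's logic.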
These lessons mark the start of our prototyping and research into Alexa, and into voice assistants in general. What is clear is that whilst it is early days for the Amazon Echo and Alexa, they do represent an exciting opportunity. The potential is enormous, and is only growing as voice-activated assistants become more sophisticated - and as Alexa’s own Skills become more complex.
If you would like to see some of our working demonstrations, or if you would like to know more about the technical opportunities and constraints of Alexa and the Amazon Echo, get in touch.