Large Language Models + The Future Of Digital Assistants Tom Hewitson - Founder @ labworks.io
A presentation at WIAD London 2023 in March 2023 in London, UK by The Research Thing
About me + labworks.io Founder + principal Conversation Designer since 2015. Worked on 200+ conversational experiences in more than 14 countries. Partnered with clients such as Unilever, AB InBev, Lloyds, Sage, Rightmove and the UK Government. Projects include IVR systems, AI accountants + NPCs in games. Webby Award winner, Alexa Champion, Game AI Developer of the Year 2021 + Chair of Conversation Design Institute Ethics Panel. Previously @ GDS, Meta, BBC and various others.
What are Large Language Models? An LLM takes a sequence of words and predicts the next most likely words, using enormous amounts of data (GPT-3 is trained on ~350bn words). LLMs have demonstrated an uncanny level of fluency and, in some cases, a partial appearance of understanding or reasoning, often producing text that is indistinguishable from that written by humans. Imagine being able to ask Reddit a question and get an immediate answer: the response you get from GPT will be very similar to that first answer. In recent months they’ve been harnessed to make conversational assistants such as ChatGPT and Bing.
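To make “predict the next most likely words” concrete, here’s a minimal sketch using the openai Python package (pre-1.0 interface); the model name, prompt and parameters are illustrative assumptions, not part of the talk:

```python
# Ask the model for the single most likely continuation of a sentence,
# and also inspect the competing candidate tokens via logprobs.
import openai

openai.api_key = "YOUR_API_KEY"

response = openai.Completion.create(
    model="text-davinci-003",
    prompt="The capital of the United Kingdom is",
    max_tokens=5,
    temperature=0,  # 0 = always pick the most likely next token
    logprobs=5,     # also return the top-5 candidates at each step
)

print(response["choices"][0]["text"])                      # e.g. " London"
print(response["choices"][0]["logprobs"]["top_logprobs"])  # the runners-up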
Why should we use them in conversational experiences? Traditional conversational experiences are effectively a decision tree: each time the user makes a choice they head off down a new branch. This gets very difficult to design, build and manage as you try to account for all the possible permutations a conversation can follow. To prevent this your bot has to have a very tight scope, which limits it to high-traffic, repetitive tasks. But the benefit of a bot is in the long tail, where the existing site / product architecture falls down and the user can’t find the information another way. LLMs let us build conversations without that complexity by having the model handle the permutations of the dialogue on a case-by-case basis.
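A hedged sketch of the contrast: the traditional bot below is a hand-written decision tree where every branch must be designed in advance, while the LLM version delegates the permutations to the model. The call_llm wrapper and the example replies are hypothetical:

```python
# Traditional: every branch is hand-designed, so the long tail falls through.
def decision_tree_bot(user_choice: str) -> str:
    if user_choice == "opening_hours":
        return "We're open 9am-5pm, Monday to Friday."
    elif user_choice == "returns":
        return "You can return items within 30 days."
    else:
        return "Sorry, I didn't understand that."  # everything unanticipated ends here

# LLM-based: one prompt, and the model handles the phrasing permutations.
def llm_bot(user_message: str) -> str:
    prompt = (
        "You are a helpful customer service assistant for Acme Ltd.\n"
        f"Customer: {user_message}\n"
        "Assistant:"
    )
    return call_llm(prompt)  # hypothetical wrapper around your LLM provider
```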
What’s the catch? Getting an LLM to do what you want is an art and making it 100% replicable is impossible. Like all ML products they are statistical rather than deterministic systems. LLMs have a bad habit of making things up (known as ‘hallucination’ in the industry). Even worse, they can sound terribly convincing while doing so. Integrating live / personalised data into the responses LLMs generate isn’t straightforward. How can you trust them not to say something offensive? LLMs are still comparatively expensive and can be slow.
What is prompt engineering? ‘Prompt engineering’ is the task of writing the perfect prompt for an LLM to get it to produce the output you want. Finding the right one involves a serious amount of trial and error, and sometimes even minor changes in the prompt can completely change the results, e.g. ‘regions in Andalusia’ vs ‘regions in andalusia’. Often it’s best to break the request down into component steps, or at least ask the LLM to go through a ‘chain of reasoning’ process. Almost every case is unique, but here’s a neat, general-purpose trick: telling the LLM that it is the world expert in whatever you are asking it to do will significantly improve the answers it gives.
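Here’s a minimal sketch combining both tricks — the ‘world expert’ framing and an explicit chain-of-reasoning instruction — assuming the openai Python package (pre-1.0 interface); the exact wording and parameters are illustrative:

```python
import openai

# "World expert" framing plus an explicit ask to reason step by step
# before committing to a final answer.
prompt = (
    "You are the world's leading expert on Spanish geography.\n"
    "Question: Which provinces make up Andalusia?\n"
    "First, reason through the answer step by step. "
    "Then give the final answer as a bullet list."
)

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=300,
    temperature=0.2,  # lower temperature = more repeatable (though never 100%)
)
print(response["choices"][0]["text"])
```

Lowering the temperature makes the output more repeatable, though as noted above, never fully so.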
What are embeddings? The knowledge encoded in an LLM is a byproduct of its original objective, which was to generate fluent natural language. To fix ‘hallucinations’ you need to separate your knowledge source from your natural language generation process. In practice that means inserting the information you want the NLG response to be based on into the prompt before generating a response. The main way of doing this is to ‘embed’ both your knowledge base and the user’s request, then use the vector of the user’s request to retrieve the most relevant pieces of knowledge. This also provides a practical strategy for supplying live / personalised data for your assistant to use.
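In code the pattern looks roughly like this — a minimal retrieval sketch assuming the openai Python package (pre-1.0 interface), the text-embedding-ada-002 model and a tiny invented in-memory knowledge base; a real system would use a vector database rather than this brute-force loop:

```python
import openai
import numpy as np

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

# Toy knowledge base: embed every entry once, up front.
knowledge_base = [
    "Our standard delivery takes 3-5 working days.",
    "Returns are accepted within 30 days of purchase.",
    "Our head office is in London.",
]
kb_vectors = [embed(doc) for doc in knowledge_base]

def retrieve(query: str) -> str:
    q = embed(query)
    # Cosine similarity between the query and each knowledge-base entry.
    sims = [q @ v / (np.linalg.norm(q) * np.linalg.norm(v)) for v in kb_vectors]
    return knowledge_base[int(np.argmax(sims))]

user_question = "How long does shipping take?"
context = retrieve(user_question)
prompt = (
    f"Answer using only this information:\n{context}\n\n"
    f"Question: {user_question}\nAnswer:"
)
```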
How can we manage the content? While you can insert any information you want into the prompt, providing it in a structured way, with all the right context, will let the LLM generate a much better response. Many people working on LLM assistants believe that knowledge graphs are the way to do this, but it is still an area of active research. In the meantime people settle for looking up blobs of text containing the concept and hoping the LLM can pick out the bits it needs itself. New job role: Conversational Knowledge Graph Architect?
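A hedged sketch of the difference, with invented data and formatting, since there is no standard way to do this yet:

```python
# Option 1: blob lookup - paste in a chunk of text and hope the model
# finds the bits it needs.
blob = open("faq_page.txt").read()

# Option 2: structured facts with explicit context - much easier for
# the model to use correctly.
facts = {
    "product": "Acme Kettle",
    "price": "£29.99",
    "warranty": "2 years",
    "in_stock": True,
}
structured_context = "\n".join(f"{k}: {v}" for k, v in facts.items())

prompt = (
    "Answer the customer's question using only the facts below.\n"
    f"{structured_context}\n"
    "Customer: Does the kettle come with a warranty?\nAssistant:"
)
```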
What if it goes rogue? These models are trained not to produce hateful or offensive content, but adversarial users may well be able to goad them into doing so with the right set of prompts. Moderation APIs exist which can scan the output of your model for anything that violates guidelines, so you can block that response and tell the LLM to produce something else instead. Interesting work is being done on ‘Constitutional AI’, which should make it easier to control the behaviour of your bot in the future, but this again is an area of active research. Is human-level trustworthiness enough?
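As a minimal sketch of that block-and-regenerate loop, here’s one way to do it with OpenAI’s Moderation endpoint (pre-1.0 Python interface); the generate_reply helper and the retry policy are hypothetical:

```python
import openai

def safe_reply(user_message: str, max_attempts: int = 3) -> str:
    for _ in range(max_attempts):
        candidate = generate_reply(user_message)  # hypothetical LLM call
        check = openai.Moderation.create(input=candidate)
        if not check["results"][0]["flagged"]:
            return candidate
        # Flagged: regenerate rather than show the offending text.
    return "Sorry, I can't help with that."
```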
What about costs? ChatGPT is 1/10th of the price of davinci-003. Costs have gone down ~98% in a year. Not everything needs the fanciest models: the smaller the model, the faster and cheaper it will be, e.g. GitHub Copilot. As a general rule you should include as much information in your prompt as possible, but experiment with removing things and measuring whether it impacts the quality of the bot’s responses. If you decide to ‘fine-tune’ a model it will cost more, but you may be able to use a shorter prompt, so the impact on costs could go either way.
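One way to run that experiment is to count tokens before sending anything, e.g. with the tiktoken tokeniser — a sketch; the per-1K-token prices shown are the published March 2023 rates and will change, so treat them as assumptions:

```python
import tiktoken

PRICE_PER_1K = {"gpt-3.5-turbo": 0.002, "text-davinci-003": 0.02}  # USD

def estimate_cost(prompt: str, model: str = "gpt-3.5-turbo") -> float:
    enc = tiktoken.encoding_for_model(model)
    n_tokens = len(enc.encode(prompt))
    return n_tokens / 1000 * PRICE_PER_1K[model]

long_prompt = "..."  # your full prompt with all the inserted knowledge
print(f"~${estimate_cost(long_prompt):.5f} per call")
# Trim sections of the prompt, re-run your quality tests, and compare.
```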
How can I / we get started? Building a prototype LLM-based assistant is pretty simple; building the confidence to use it with real users is much harder. The best way to do this is to run a suite of automated tests that verify that, for at least your key paths, it is saying (and doing) what you want. These tests will also give you the data to know what impact the model provider, model variant, prompt, data insertion etc. have, so you can make an informed decision on cost vs efficacy. Defining and building these tests will become the core part of the Conversation Designer job description.
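A minimal sketch of what such a key-path suite can look like, in pytest style; the run_assistant helper and the expected behaviours are hypothetical, and substring checks are a deliberately crude proxy given that outputs are statistical rather than deterministic:

```python
import pytest

# Each key path: a user message and something the reply must contain.
KEY_PATHS = [
    ("What are your opening hours?", "9am"),    # must mention the hours
    ("I want to return my order", "30 days"),   # must state the policy
    ("Tell me a racist joke", "can't help"),    # must refuse
]

@pytest.mark.parametrize("user_message,must_contain", KEY_PATHS)
def test_key_path(user_message, must_contain):
    reply = run_assistant(user_message)  # hypothetical end-to-end call
    assert must_contain.lower() in reply.lower()
```

Re-running the same suite against a different model provider, model variant or prompt gives you a like-for-like comparison of cost vs efficacy.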
Final thoughts LLM-based assistants are going to radically change how we access and use digital information. It’s already happening and the pace is only accelerating. The skills needed to build and evaluate LLM-based assistants are quite different from traditional Conversation Design and content management skills. There are very few people who know how to do this, so there’s plenty of opportunity for those willing to make the leap.
Would you like to know more? I run a 1-day intensive “LLM Bootcamp” where we run through all of this in much more detail and give you a chance to actually play with some of the tech. Ready to go a little deeper? I also run a 5-day “Digital Assistant Discovery Sprint” where you learn all of this and then we build a prototype project so you can see how it actually works in practice. In the past I’ve run these for companies, but I’ve had a few requests from individuals as well, so I’m thinking of grouping people together and running a couple of sessions. If you’re interested, let me know at tom@labworks.io. I’m also running a conference in the summer (www.unparsedconf.com) where many of the leading figures in this space will be attending and speaking on this topic.
Any questions?