The conversational AI market, and specifically the natural language understanding (NLU) market, has changed significantly over the last 5 years. Before that, Nuance’s proprietary NLU dominated the market, and there were few real alternatives. Loquendo, based in Europe, seemed a distant second. All deployments required specialized coding.
Within the last 5 years, low-code cloud-based NLU options such as Google Dialogflow, IBM Watson, Microsoft LUIS, Amazon Lex, and Facebook wit.ai began democratizing NLU. Even more recently, open-source options have emerged, with RASA leading the pack. There are also countless other proprietary platforms in the mix. As a result, conversational artificial intelligence (AI) has become the hottest buzzword of late.
Many IT and business leaders now find themselves asking “which is the best AI?” and “how do I choose?” The answer: it depends. This article analyzes research done in the last 4 years to help IT and business leaders approach this decision.
Faculty at the Technical University of Munich published a quantitative study of the accuracy of several NLU platforms, focused on two critical NLU tasks – intent and entity recognition. Several intents were included in the areas of public transit and use of web and computer applications, with the number of training examples varying between 1 and 100. The Munich study concluded 1:
- Domain or use case may influence the best performers.
- NLU services change over time and should improve.
- LUIS performed the best overall.
- RASA finished second.
- Watson and api.ai (now Google Dialogflow) lagged. api.ai, in particular, struggled most with entity recognition.
- Lex and wit.ai were excluded from the study because they did not support batch import/export of training data.
A different study published by proprietary provider Snips found that the number of training examples has a great impact on accuracy. When the number of training phrases for each of 7 intents increased from 70 to 2000, the Snips F1 score (the harmonic mean of precision and recall, a common measure of classification quality) went from 79% to 93% 2. So training data is important, not only for the validity of a study, but for your own deployments.
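For readers who want to run this kind of comparison on their own test utterances, the F1 score cited above can be computed in a few lines. What follows is a minimal sketch, not the method used in any of the cited studies: the function names, the intent labels, and the toy data are all hypothetical, and macro-averaging across intents is just one common way to summarize the per-intent scores.

```python
from collections import Counter


def f1_per_intent(gold, predicted):
    """Return {intent: (precision, recall, f1)} for two parallel label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one labeled utterance is required")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # intent p was predicted, but incorrectly
            fn[g] += 1  # the true intent g was missed
    scores = {}
    for intent in sorted(set(gold) | set(predicted)):
        predicted_count = tp[intent] + fp[intent]
        actual_count = tp[intent] + fn[intent]
        precision = tp[intent] / predicted_count if predicted_count else 0.0
        recall = tp[intent] / actual_count if actual_count else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[intent] = (precision, recall, f1)
    return scores


def macro_f1(gold, predicted):
    """Unweighted mean of the per-intent F1 scores."""
    scores = f1_per_intent(gold, predicted)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Hypothetical toy labels, not data from any of the cited studies.
gold = ["play_music", "play_music", "set_alarm", "set_alarm", "get_weather"]
predicted = ["play_music", "set_alarm", "set_alarm", "set_alarm", "get_weather"]

scores = f1_per_intent(gold, predicted)
overall = macro_f1(gold, predicted)
accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print(f"macro F1 = {overall:.2f}, accuracy = {accuracy:.2f}")
```

On this toy data the macro F1 is about 0.82 while plain accuracy is 0.80, which shows that the two measures are related but not interchangeable.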
A faculty group from Heriot-Watt University completed what is probably the most comprehensive quantitative study to date. This study involved intents to be used by a personal assistant, much like Alexa or Siri, ranging from alarms, music, and search, to movie recommendations. It covered more intents and more entities, and it used significantly more training phrases. Just one year after the Munich study, the Heriot-Watt study concluded 3:
- Watson performed best at intent recognition.
- Behind Watson – RASA, Dialogflow, and LUIS were all similar.
- LUIS performed best at entity recognition.
- Watson lagged all others in entity recognition. Note: Watson has since added a contextual entity feature which may have addressed this gap.
- None of the NLUs supported capturing multiple intents in a single utterance.
Both the Munich and Heriot-Watt studies demonstrated how NLUs may rank differently on intent recognition vs. entity recognition. It is therefore important to consider how critical entities are to your likely use cases.
Two Italian university professors performed a comparison of NLU platforms, more focused on usability and features. They concluded 4:
- Usability and Features – Dialogflow was the most complete NLU platform due to its unique automatic (rather than programmable) features including default fallback intents, automatic context handling, and linkable intents.
- Accuracy – Watson performed the best, but their experiment was quite limited with test intents only using 5 training phrases, essentially the minimum.
HFS Research ranked several NLU providers based on their qualitative analysis of the vendors’ execution and innovation. In the Developer Tools and Platforms category, the rankings were as follows with key notes on each 5.
- Dialogflow – #1 for functionality and ease of use with context management out of the box; #1 for languages
- LUIS – highly customizable and well suited for complex conversations; no intuitive graphical interface
- Lex – long learning curve; lack of languages
- Watson – supports advanced conversational features like digression; good interface/UI, but some things are tricky to set up and require a longer learning curve
A few conversational AI industry consultants also published comparison papers.
Cobus Greyling highlighted what he sees as key cross-industry trends emerging in conversational AI and NLU platforms 6:
- The merging of intents and entities
- Contextual entities – detected by their context within a user utterance
- Deprecation of the State Machine – towards a more conversational-like interface
- Complex entities – introducing entities with properties, groups, roles etc.
With this context in mind, Greyling analyzed and ranked several of the most common platforms. Rankings on NLU Features alone (ignoring other categories like Cost and Ecosystem Maturity) are as follows with key notes on each 6.
- LUIS 100% – most advanced entity detection options; contextual linking between intents and entities; machine learning for entities; data structure – entities that can be decomposed into entities, sub-entities, etc.
- Dialogflow 90% – serves well as an NLU API, especially for larger implementations; entities detected automatically and contextually; context can be defined within each intent; well suited for voice
- RASA 90% – highly configurable; strong machine learning; contextual and compound entities
- Watson 80% – contextual entities which can be annotated; intent conflict resolution built-in
- Lex 70% – not as feature-rich as LUIS or Watson; more suited to voice and not for multi-turn interactions
Aravind Mohanoor agreed that NLU services will change over time and improve. In his comparison of Dialogflow and RASA, he points out that Dialogflow is 5 years ahead of RASA in R&D time. However, the best suited NLU depends on your requirements. Dialogflow is the better closed-source, hosted bot framework. RASA is the better open-source, self-hosted bot framework. Here are some strengths and weaknesses of each 7.
- Dialogflow – easier for technical non-programmers who want to participate fully in the bot-building process.
- Dialogflow – good for multi-turn dialogs and speech/voice apps due to the ease of handling explicit contexts and built-in machine learning.
- RASA – allows for on-premise, highly secure deployments.
- RASA – being open source, it is a lot more customizable than Dialogflow but also much harder for technical non-programmers to use; NLU expertise is needed to enable contexts.
Van Baker in Gartner’s 2020 “Cool Vendors” analysis of RASA agreed that the RASA platform makes it possible to build HIPAA-compliant assistants that can be deployed both on-premises and in the cloud.
However, the platform is very much centered on conversational AI developers. It does not offer low-code capabilities equal to those of other conversational AI platform vendors 8.
So how do you choose?
We’ve seen that unbiased scholarly studies produce different results over time, across domains and use cases, and with differing amounts of training data.
Features continue to differ as some platforms are better suited to business users and others better suited for developers. Some are closed-source in the cloud and some are open-source and self-hosted.
Use case and modality are important to consider in evaluating the importance of entities, context, and speech. There is also varying coverage for languages.
The truth is that the winner in 2021 may be different from the winner in 2020.
The best decision: Hedge your bets and use conversational middleware
Conversational AI middleware most commonly uses the conversational platforms that have been reviewed so far – Google Dialogflow, Amazon Lex, Microsoft LUIS, IBM Watson Assistant or RASA Open Source – as the foundation for NLU and basic dialogue management, then enables additional, specialized functionality on top. Such functionality can include the fulfillment of intents, integrations to back-end systems, rich UI, and more advanced dialogue management 9.
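To make the layering concrete, here is a minimal sketch of how such middleware might separate the NLU foundation from the functionality built on top of it. Everything in it is an assumption made for illustration: `NLUResult`, `NLUBackend`, `KeywordBackend`, and `Middleware` are hypothetical names, and the keyword matcher merely stands in for an adapter that would call a real platform such as Dialogflow, Lex, LUIS, Watson Assistant, or RASA. It does not reflect any vendor's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Protocol


@dataclass
class NLUResult:
    """Vendor-neutral result that the rest of the application depends on."""
    intent: str
    confidence: float
    entities: Dict[str, str] = field(default_factory=dict)


class NLUBackend(Protocol):
    """Interface every platform adapter is assumed to implement."""
    def parse(self, text: str) -> NLUResult: ...


class KeywordBackend:
    """Illustrative stand-in; a real adapter would call a vendor service here."""

    def __init__(self, keywords: Dict[str, str]):
        self.keywords = keywords  # keyword -> intent name

    def parse(self, text: str) -> NLUResult:
        lowered = text.lower()
        for keyword, intent in self.keywords.items():
            if keyword in lowered:
                return NLUResult(intent=intent, confidence=1.0)
        return NLUResult(intent="fallback", confidence=0.0)


class Middleware:
    """Owns intent fulfillment; delegates understanding to a swappable backend."""

    def __init__(self, backend: NLUBackend,
                 handlers: Dict[str, Callable[[NLUResult], str]]):
        if "fallback" not in handlers:
            raise ValueError("a 'fallback' handler is required")
        self.backend = backend
        self.handlers = handlers

    def handle(self, text: str) -> str:
        result = self.backend.parse(text)
        handler = self.handlers.get(result.intent, self.handlers["fallback"])
        return handler(result)


handlers = {
    "check_balance": lambda result: "Looking up your balance.",
    "fallback": lambda result: "Sorry, I did not understand that.",
}
bot = Middleware(KeywordBackend({"balance": "check_balance"}), handlers)
print(bot.handle("What is my balance?"))

# Changing NLU platforms later means replacing only the backend object.
bot.backend = KeywordBackend({"funds": "check_balance"})
print(bot.handle("How much funds do I have?"))
```

Because the handlers only ever see `NLUResult`, fulfillment logic and back-end integrations written against it would not need to change when the underlying NLU does.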
Conversational middleware can be helpful for enterprises 9:
- Where no single platform is capable of delivering all the necessary languages and language variants that are required.
- That wish to keep siloed chatbots and virtual assistants that were built on different platforms for different use cases (e.g., customer service, tech support, help desk, sales, agent/employee assistant, etc.) while centralizing and standardizing development moving forward.
- Where there is a desire to support different domains and use cases using the optimal NLU for each.
- That are unsure which NLU will win in the long term and may want or need to change down the road.
- To enable scale and flexibility while minimizing any re-work.
- Where there is a high future need for integration both toward messaging channels and back-end systems. Integration of multiple chatbots to all back-end systems is costly, and a solution for better integration into the current architecture is being sought.
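Several of the points above come down to routing: picking a different NLU per use case or language while keeping the option to change that choice later. The sketch below shows one way this might look. It is an assumption-laden illustration rather than a description of any product: `NLURouter` and the three stand-in backends are hypothetical, and each backend is modeled as a plain function that maps an utterance to an intent name.

```python
from typing import Callable, Dict, Tuple

# A backend is modeled as any callable that maps an utterance to an intent name.
Backend = Callable[[str], str]


class NLURouter:
    """Selects a backend by (use case, language), with a default as a safety net."""

    def __init__(self, default: Backend):
        self._default = default
        self._routes: Dict[Tuple[str, str], Backend] = {}

    def register(self, use_case: str, language: str, backend: Backend) -> None:
        # Registering the same key again replaces the earlier choice,
        # which is how a platform would be swapped out down the road.
        self._routes[(use_case, language)] = backend

    def parse(self, text: str, use_case: str, language: str) -> str:
        backend = self._routes.get((use_case, language), self._default)
        return backend(text)


# Hypothetical stand-ins for adapters around real NLU platforms.
def support_backend_en(text: str) -> str:
    return "open_ticket" if "ticket" in text.lower() else "fallback"


def sales_backend_de(text: str) -> str:
    return "request_quote" if "angebot" in text.lower() else "fallback"


def default_backend(text: str) -> str:
    return "fallback"


router = NLURouter(default=default_backend)
router.register("tech_support", "en", support_backend_en)
router.register("sales", "de", sales_backend_de)

print(router.parse("Please open a ticket", "tech_support", "en"))
print(router.parse("Ich brauche ein Angebot", "sales", "de"))
print(router.parse("Bonjour", "sales", "fr"))
```

Keeping the routing table in one place means existing siloed assistants can stay on their current platforms while new use cases are added behind the same entry point.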
Gartner recommends that enterprises “buy or build conversational AI middleware, with the possibility of multiple conversational platform back ends, as an alternative to standardizing a platform too early.” 9
Instead of making a more permanent decision based on an impression of the current state of the NLU market, the optimal approach is to build chatbots and virtual assistants for all use cases on a common conversational AI middleware, even if different NLUs are used and may change over time.
1 Daniel Braun, Adrian Hernandez Mendez, Florian Matthes and Manfred Langen (2017) Evaluating Natural Language Understanding Services for Conversational Question Answering Systems. In: Proceedings of SIGDIAL 2017, 174–185.
2 Alice Coucke, Adrien Ball, Clément Delpuech, Clément Doumouro, Sylvain Raybaud, Thibault Gisselbrecht and Joseph Dureau (2017) Benchmarking Natural Language Understanding Systems: Google, Facebook, Microsoft, Amazon, and Snips. https://medium.com/snipsai/benchmarking-natural-language-understanding-systems-google-facebook-microsoft-andsnips-2b8ddcf9fb19
3 Xingkun Liu, Arash Eshghi, Pawel Swietojanski and Verena Rieser (2019) Benchmarking Natural Language Understanding Services for building Conversational Agents. In arXiv:1903.05566v3
4 Massimo Canonico and Luigi De Russis (2018) A Comparison and Critique of Natural Language Understanding Tools. In: Proceedings of CLOUD COMPUTING 2018.
5 Melissa O’Brien and Emily Coates (2020) HFS Top 10 Digital Associate Products. https://www.hfsresearch.com/research/hfs-top-10-digital-associate-products/
6 Cobus Greyling (2020) Updated: A Comparison Of Eight Chatbot Environments. https://cobusgreyling.medium.com/updated-a-comparison-of-eight-chatbot-environments-7f57d4e2dc09
7 Aravind Mohanoor (2020) Dialogflow vs. RASA NLU. https://miningbusinessdata.com/dialogflow-vs-rasa-nlu/
8 Van Baker, et al. (2020) Cool Vendors in Conversational AI Platforms. https://www.gartner.com/en/documents/3991856
9 Magnus Revang (2019) Using Conversational AI Middleware to Build Chatbots and Virtual Assistants. https://www.gartner.com/en/documents/3970980/using-conversational-ai-middleware-to-build-chatbots-and-virtual-assistants