A previous post, A New AI Frontier: Understanding Emotion, explored the market drivers for sentiment and emotion analysis. This post details the market barriers. But first, a short review.
The promise of AI is to make work and life more productive. But to do so, AI needs to better understand humans, who are the most complex organisms on earth. A significant limitation of AI to date is understanding humans and, more specifically, human emotion.
Accelerated access to data (primarily social media feeds and digital video), cheaper compute power, and evolving deep learning combined with natural language processing and computer vision are enabling technologists to watch and listen to humans with the intention of analyzing their sentiments and emotions. With a better understanding of emotion, AI technology will create more empathetic customer and healthcare experiences, drive our cars, enhance teaching methods, and figure out ways to build better products that meet our needs.
Emotion and sentiment analysis is complex because emotion is complex and not very well understood. Emotion can be deceptive and expressed in multiple ways: in our speech intonation, the text of the words we say or write, and our facial expressions, body posture, and gestures. These factors create variables in emotion analysis confidence scoring that must be overcome for most sentiment and emotion analysis use cases to come into full bloom.
Despite these challenges, the market for sentiment and emotion analysis use cases has begun to expand. Tractica has identified seven use case categories where significant direct software revenue will be generated through 2025, including:
- Customer Service
- Product/Market Research
- Customer Experience
There are several market challenges facing the advancement of sentiment and emotion analysis, with accuracy being the chief issue.
Emotions are difficult to interpret. Both humans and technology make errors in interpreting emotions, which has consequences within the use cases in which emotion and sentiment analysis are and will be leveraged. In human face-to-face communication, emotion is communicated through speech intonation, the words spoken, facial expressions, body posture, and gestures.
To date, input data for AI-driven emotion and sentiment analysis is typically derived from only one source – facial expression or written or spoken text – and is rarely multi-modal.
Facial Recognition Challenges
A paper by University of California research scientists Elaine Sedenberg and John Chuang, “Smile for the Camera: Privacy and Policy Implications of Emotion AI,” illustrates some of the challenges:
Our interpretation of others’ emotional states in real life is flawed: colored by context and interpersonal history, our own emotional position, and even our own mental state… Other studies have shown that the intensity of emotions expressed and accompanying facial movements such as eye activity vary by culture, which would have bearing on the analysis and expectations that particular reactions correlate to particular ends. Further, our social exchanges, while nuanced, often involve complicated interdependencies with mimicry to reflect our conversation partner’s mood in order to promote bonding and exhibit prosocial behavior.
Our microexpressions lie underneath outward facades or in fleeting looks, and betray concealed or subconscious emotional states that are most often too brief or subtle to notice in natural interactions. Microexpressions from others may leave an impression after an interaction, but lack the certainty or explicit labeling of algorithmic verification—a shift that makes the subtle a virtual roar. Deception is a notoriously difficult expression and social cue to pick up on, but studies have shown that individuals who are able to acutely pick up on microexpressions were able to identify lying behaviors.
Emotion analysis vendor Kairos noted the struggle to interpret the correct emotions from facial expressions. A blog post from March 2015 stated:
Internal research by Kairos, and the company that initially developed its Emotion Analysis API (IMRSV), discovered that not all of Ekman’s universal emotions provide consistently distinctive facial expressions. For example, most people struggle to differentiate between an angry person and a disgusted person, simply based on the reactions showing on the person’s face. The traditional view of an angry face is that the eyebrows lower, lips press firmly and eyes bulge. The traditional view of a disgusted face is that the upper lip raises, the nose bridge wrinkles and the cheeks raise. The problem, in the real world, is that different people react in different ways to these two emotions, and the facial reactions overlap. This makes it virtually impossible to provide a definitive answer as to which of these two emotions the person in a picture is actually feeling.
Based on interviews for this report and other data points, it is clear that when it comes to facial recognition, care must be taken in interpreting emotion across culture and gender.
Accuracy Impact: Emotions or Thoughts?
Emotions and thoughts: which comes first? Do emotions guide thoughts or is it the other way around? Do speech and written text reflect thoughts or emotions? In November 2017, market research firm Bold Delve found that what people said did not always match up to what they meant:
We conducted a project where we had respondents try a new product in their home for a month. We first invited them to an initial interview and had them smell the product and discuss their reaction to it. We analyzed their immediate emotional reactions, their top of mind reactions to it, as well as their numerical ratings of appeal. We then sent them home with the product and had them log into a discussion forum several times a week and tell us how they liked actually using it.
We discovered that the initial emotional reaction measured using facial recognition and emotion detection software was not always directionally consistent with what they verbally articulated about the product. It was, however, highly predictive of product acceptance. People claimed that they liked the product, often gave it high ratings, and claimed it smelled good to them. But several respondents showed signs of disgust, surprise, or sadness on their faces. Over the course of the study, the initial negative reaction of these participants was eventually expressed in their verbal feedback. These individuals eventually reported that they disliked the scent. And those who showed sadness eventually explained that the scent was not strong enough. If we had just analyzed their initial emotional response, we would not have been able to accurately predict product usage.
In other studies, we’ve observed similar findings. Although respondents claimed they had a preference for a specific aisle of a grocery store in a control versus test situation, their nonverbal behavior didn’t indicate that at all. And in fact, their purchases and emotional responses were the same for the preferred and non-preferred aisles. Thus, the emotional responses that people have to products and situations may be telling us the true story. And their verbal feedback may be irrelevant or even misleading.
Emotional expressions provide a great deal more information than we currently obtain from just self-reported behavior. And more importantly, emotions may be driving decision making in ways that we as human beings and as market researchers are not fully realizing.
Barrier: Which Inputs for Understanding Emotion
As there are multiple human cues used to express emotion, it is fair to assume that the level of accuracy in interpreting emotion would depend on accessing as many of those cues as possible, or in the absence of such access, determining which inputs will be most effective for a given emotion analysis use case. There is sufficient evidence to show there is no magic bullet input. There is evidence indicating that using only one modality will lead to inadequate analysis for some use cases. There is evidence that there might be some interdependencies between voice and facial expression to form emotion. As sentiment and emotion use cases emerge, use case proponents will have to navigate the pros and cons of single modalities and/or the cost-benefit analysis of multi-modal sentiment and emotion analysis.
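One common way to combine modalities, and a minimal sketch of what multi-modal analysis could look like, is late fusion: each modality produces its own probability distribution over emotion labels, and the distributions are merged with per-modality weights. The emotion labels, weights, and score values below are illustrative assumptions, not any vendor's actual API.

```python
# Hypothetical late-fusion sketch: combine per-modality emotion scores.
# Labels, weights, and example scores are illustrative assumptions.

EMOTIONS = ["anger", "disgust", "happiness", "sadness", "neutral"]

def fuse_scores(modality_scores, weights):
    """Weighted average of per-modality probability distributions."""
    fused = {e: 0.0 for e in EMOTIONS}
    total = sum(weights[m] for m in modality_scores)
    for modality, scores in modality_scores.items():
        w = weights[modality] / total  # normalize over modalities present
        for emotion, p in scores.items():
            fused[emotion] += w * p
    return fused

# Face and voice disagree: the face suggests anger/disgust, the voice neutral.
face = {"anger": 0.40, "disgust": 0.35, "happiness": 0.05, "sadness": 0.10, "neutral": 0.10}
voice = {"anger": 0.15, "disgust": 0.10, "happiness": 0.05, "sadness": 0.20, "neutral": 0.50}
fused = fuse_scores({"face": face, "voice": voice}, {"face": 0.6, "voice": 0.4})
print(max(fused, key=fused.get))  # prints "anger"
```

The design trade-off mirrors the one described above: a second modality can either reinforce a reading or flatten the distribution when the cues conflict, which is exactly why fused confidence scores need careful interpretation.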
Speech intonation—the pitch, volume, and tone of how you speak—is an important indicator of emotion. Technologists are using AI to identify emotions in speech and then applying it to emotion analysis use cases.
In our modeling approach, the lexical content of the words used in the spoken utterance is not used for model training; rather, the models are trained on paralinguistics: the tone, loudness, speed, and voice quality that distinguish the display of different emotions. However, when you give people the task of labeling emotions, language matters. What if I say something very positive but in a very low, negative intonation? What do you do? It’s very hard for people to separate out the meaning of your words from the tone of your voice. Thus, the lexical content can bias the human labelers’ judgement regarding the emotion displayed. Culture of the displayer and the perceiver can pose similar challenges in acquiring human annotations.
Spectrograms of the Phrase “Kids are talking by the door” Pronounced with Different Emotions
(Source: The Higher School of Economics)
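To make the paralinguistic features described above concrete, here is a minimal sketch of two of the simplest ones: frame-level RMS energy (a loudness proxy) and zero-crossing rate (a crude proxy for pitch/brightness). This is an illustrative toy on synthetic sine waves, not any vendor's actual feature pipeline; real systems use richer features such as pitch contours and spectral statistics, as the spectrograms suggest.

```python
import numpy as np

def frame_features(signal, sr=16000, frame_ms=25):
    """Split a signal into frames; return per-frame RMS energy and
    zero-crossing rate (illustrative loudness and brightness proxies)."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))                    # loudness
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)   # brightness
    return rms, zcr

sr = 16000
t = np.arange(sr) / sr
soft_low = 0.1 * np.sin(2 * np.pi * 120 * t)   # quiet, low-pitched tone
loud_high = 0.8 * np.sin(2 * np.pi * 440 * t)  # loud, higher-pitched tone
rms_soft, zcr_soft = frame_features(soft_low, sr)
rms_loud, zcr_loud = frame_features(loud_high, sr)
print(rms_loud.mean() > rms_soft.mean())  # louder tone has higher energy
print(zcr_loud.mean() > zcr_soft.mean())  # higher pitch crosses zero more often
```

An emotion classifier trained on paralinguistics would consume statistics of features like these rather than the words themselves, which is why such models can, in principle, ignore lexical content in the way the passage describes.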
Beyond Verbal is a pioneer in the space of emotion analysis based on voice intonation. According to the company, your voice does not lie. Certain emotion insights can be gained from voice intonation with more confidence than other methodologies. Beyond Verbal claims it can measure the following:
- Valence: A variable ranging from negativity to positivity. When listening to a person talk, it is possible to understand how “positive” or “negative” they feel about the subject, object, or event under discussion.
- Arousal: A variable that ranges from tranquility or boredom to alertness and excitement. It corresponds to similar concepts such as level of activation and stimulation.
- Temper: An emotional measure that covers a speaker’s entire mood range. Low temper describes depressive and gloomy moods. Medium temper describes friendly, warm, and embracive moods. High temper values describe confrontational, domineering, and aggressive moods.
- Mood Groups: A key indicator of a speaker’s emotional state during the analyzed voice section. The application programming interface (API) produces a total of 11 mood groups, ranging from anger, loneliness, and self-control to happiness and excitement.
- Combined Emotions: A combination of various basic emotions, as expressed by the user’s voice during an analyzed voice section.
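A sketch of how outputs like these could be consumed downstream: treat valence, arousal, and temper as continuous scores and map them to a coarse mood label. The thresholds and label names here are illustrative assumptions for the sake of the example; they are not Beyond Verbal's actual API or mood groups.

```python
# Hypothetical mapping from (valence, arousal, temper) scores to a coarse
# mood label. Thresholds and labels are illustrative assumptions only.

def mood_label(valence, arousal, temper):
    """Each input is assumed to lie in [-1, 1]."""
    if temper > 0.5:                 # high temper: confrontational moods
        return "confrontational"
    if valence >= 0:                 # positive feeling about the subject
        return "excited" if arousal >= 0 else "content"
    return "angry" if arousal >= 0 else "gloomy"

print(mood_label(valence=0.7, arousal=0.6, temper=0.0))    # prints "excited"
print(mood_label(valence=-0.4, arousal=-0.5, temper=-0.2)) # prints "gloomy"
```

The design point is that dimensional scores (valence/arousal) and categorical moods are complementary views of the same signal; a real API would derive its 11 mood groups from a far richer model than these two-way splits.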
Diminishing Market Barriers
Tractica believes the market barriers of accuracy and input choices will diminish over the next few years as innovators in the sentiment and emotion analysis marketplace learn from their early experiences. Perhaps more significant in driving improvements, Tractica believes multi-modal emotion analysis will take root and become increasingly prevalent within the next 5 to 7 years.