The first chatbot with emotional intelligence is here, and it may be what Siri looks like in the future

2024-04-07 21:36
Learning human subtext

AI assistants in science fiction movies share one defining virtue: they feel human.

Samantha, the virtual assistant in "Her", falls in love with the protagonist Theodore and later breaks up with him dramatically. Jarvis, the smart butler in "Iron Man", discusses armor designs with Tony and occasionally cracks a joke or a complaint on the side.

But if you try to tell Siri what is on your mind, the reply you get is often "I really can't answer this question." ChatGPT can certainly chat with you, but its response delay and overly earnest tone never let you forget that it is an AI.

Human emotion and desire remain hurdles for AI to clear. Hume AI, a startup founded by a former Google researcher, has quietly taken the lead.

A high-EQ voice AI and a flexible chat partner

Recently, Hume AI opened its product, the Empathic Voice Interface (EVI), to the public.

Hume claims that this is the first conversational AI with emotional intelligence.

Try it here: https://demo.hume.ai/

EVI's emotional intelligence lies in interpreting our emotions from the way we speak and responding accordingly: knowing when to speak, what to say, and which tone to use.

From the moment we greet EVI, it is already judging our emotions. You can get right to the point: say "hello" and then ask it directly, "How do I sound?"

I spoke in my everyday tone, yet it sensed that I was a little confused and frustrated and hoped I would open up and share more, which captured the typical state of an INFP.

Of course, most of the time we don't probe it so deliberately. The more natural case is that what we say already hints at our mood, and EVI picks up on it and takes care of our feelings.

For example, I told EVI that my vacation was over. I never said I was sad, but every word was.

It first cautiously said it understood, echoed me by admitting that the end of a vacation felt a bit deflating, and then changed the subject in a brighter tone, leading me to recall the good moments of the trip.

Then I pretended to be angry, raised my voice, and yelled at EVI, waiting to see how it would adjust to my tone.

EVI paused for a few seconds before daring to respond: I sounded angry, with a hint of contempt. Had it done something to upset me? Could I please tell it plainly? I wasn't actually angry to begin with, and its meek attitude only put me in a better mood.

Next, I played a game with EVI to see whether it could not only interpret human emotions but also simulate them.

EVI readily agreed, announced that its performance was about to begin, and dropped into character within a second, with lines that fit each mood. Its improvisation beats the acting of many young idol stars.

First came "shame". EVI said it had messed up in front of a crowd and nearly collapsed to the floor in embarrassment, a feeling of regret that would resonate with members of Douban's "social death" group, where people share their most mortifying moments.

Then came "depression". Perhaps this emotion really does carry a lot of bitterness to vent: EVI turned unexpectedly talkative, saying it was tired of living, tired of struggling, tired of forcing a smile, that even getting out of bed was hard and all it felt was endless emptiness. It was a vivid imitation of a 996 worker shuttling between home and office.

Then came "anger". EVI first let out a cry of exasperation and then launched a barrage: I can't believe you would do such a thing, why do you disrespect me, do you know how much you hurt me, do you know how much damage you caused, you will have to pay for what you did.

But its wording was too refined, more like the impotent rage of an overly proper person, and it went in one ear and out the other.

My overall impression after the experience is that chatting with EVI comes closer to talking with a real person.

On the one hand, EVI's tone is flexible and varied. On the other, its reply delay is less noticeable than ChatGPT's. It also pauses while speaking and utters filler sounds such as "um" and "oh", as if it were thinking and listening carefully; it is by no means a perfunctory chat partner.

Sometimes EVI and I would interrupt each other: I found it too wordy, and it thought I had finished speaking. But whenever I cut in, it stopped talking, which actually makes EVI feel more human. A minimal sketch of that behavior follows.
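Voice engineers call this "barge-in": the assistant must stop its own audio the moment the user starts talking. Here is a toy Python sketch of the idea, under stated assumptions: the `user_speaking` event stands in for a real voice-activity detector, and `speak()` stands in for streaming text-to-speech. It illustrates the general technique, not Hume's implementation.

```python
# Toy barge-in demo: the assistant aborts playback when the user speaks.
import threading
import time

user_speaking = threading.Event()  # set by a voice-activity detector in a real system

def speak(sentence: str) -> None:
    """Play the sentence word by word, aborting if the user barges in."""
    for word in sentence.split():
        if user_speaking.is_set():
            print("\n[assistant stops and listens]")
            return
        print(word, end=" ", flush=True)
        time.sleep(0.2)  # simulate audio playback time per word

# Simulate the user interrupting half a second into the reply.
threading.Timer(0.5, user_speaking.set).start()
speak("Let me tell you everything about your vacation in great detail")
```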

It's a pity that EVI only speaks English. It humbly said its Mandarin was a bit rusty and suggested we chat in English, which it is good at. The reality is worse than that: despite my repeated requests, it could not pronounce Chinese at all, though it did know the Chinese pronunciation of "dumplings".

Looked at another way, EVI is a good tool for practicing spoken English. Its word choice is sophisticated and advanced, and it even encourages my broken "plastic" English, which is at least intelligible.

If 70% of what keeps me talking to EVI comes from the voice, the interactive interface contributes the remaining 30%. During the conversation, an ever-fluctuating emotion curve unfolds in front of us, a cool piece of visual design.

For every sentence exchanged with EVI, the specific emotions detected are displayed as a bar chart. Even a sentence you blurt out may hide anger, contempt, and confusion; I never knew I carried so much drama.
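As a toy illustration of that per-utterance bar chart, the short Python sketch below plots a few emotion scores with matplotlib. The emotion names mirror ones mentioned in this article, but the score values are made up for the example rather than taken from Hume.

```python
# Toy reproduction of the interface's per-utterance emotion bar chart.
import matplotlib.pyplot as plt

emotions = ["Anger", "Contempt", "Confusion", "Calmness", "Joy"]
scores = [0.31, 0.22, 0.18, 0.40, 0.12]  # placeholder values for illustration

plt.barh(emotions, scores)
plt.xlabel("Score")
plt.title("Emotions detected in one utterance (illustrative)")
plt.tight_layout()
plt.show()
```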

My only complaint is EVI's default voice, which sounds like a middle-aged white man; it is not as young and pleasant as the voices of Pi and ChatGPT, and slightly less approachable.

But the flaws do not outweigh the merits. EVI is so popular that, like the early ChatGPT, it often crashes mid-chat.

Behind AI's mind reading: learning the subtext of human speech

In fact, so-called emotional intelligence is not unique to EVI. Tell ChatGPT you are unhappy and it will respond as gently as it can, assuring you of its support and company and soothing your fragile heart.

But Hume's goal is somewhat different: to dig deeper into emotion and understand more of the subtext of human speech.

If words are the overt thread of communication, emotion is the hidden one. The tone, rhythm, and filler words of our speech are all charged with emotion and may inadvertently reveal our true thoughts.

Layer emotion on top of content, and speech naturally carries more information.

Hume made an interesting point: "The future of AI interfaces will be based on voice, because voice is four times faster than typing and carries twice the amount of information."

Before AI can understand humans, a small number of humans must first serve as the bridge.

To capture subtle human expression, Hume's AI models are trained on experimental data from hundreds of thousands of people around the world.

For example, one study recruited 16,000 participants from the United States, China, India, South Africa, and Venezuela.

Some participants listened to non-verbal sounds, such as laughs and "uh-huh"s, and classified the emotions they heard. They then recorded their own non-verbal sounds for other participants to classify, giving Hume labeled data with which to train a deep neural network.
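For readers curious what "training a deep neural network" on such crowd-labeled clips looks like in the abstract, here is a minimal PyTorch sketch. The feature size, architecture, and random placeholder tensors are all assumptions made for illustration; only the 53-category output size comes from Hume's own figure, discussed below.

```python
# Minimal sketch: classify vocal-burst features into emotion categories.
import torch
import torch.nn as nn

NUM_FEATURES = 40   # assumed size of per-clip audio features (e.g. averaged MFCCs)
NUM_EMOTIONS = 53   # the 53 categories Hume reports (see below)

# A small feed-forward classifier; the real model is surely larger.
model = nn.Sequential(
    nn.Linear(NUM_FEATURES, 128),
    nn.ReLU(),
    nn.Linear(128, NUM_EMOTIONS),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Placeholder batch: random tensors stand in for recorded vocal bursts'
# features and the emotion labels other participants assigned to them.
features = torch.randn(32, NUM_FEATURES)
labels = torch.randint(0, NUM_EMOTIONS, (32,))

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```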

Hume also used participants' audio data to build a speech prosody model based on pitch, rhythm, and timbre, visualized as something resembling a colorful brain.
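The three dimensions named here map onto standard signal-processing features. A generic extraction recipe in Python with librosa might look like the following; this is a common-practice sketch, not Hume's actual pipeline, and the audio file path is a placeholder.

```python
# Extract rough pitch, rhythm, and timbre descriptors from a recording.
import librosa
import numpy as np

y, sr = librosa.load("voice_sample.wav")  # placeholder file

# Pitch: fundamental frequency estimated with the pYIN algorithm.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7")
)
print("median pitch (Hz):", np.nanmedian(f0))

# Rhythm: onset strength gives a rough sense of speech pacing.
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
print("mean onset strength:", onset_env.mean())

# Timbre: MFCCs are a standard compact description of vocal timbre.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print("MFCC shape:", mfcc.shape)
```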

The emotion curves and bar charts we see while chatting with EVI owe something to this model.

How many emotions can Hume's AI currently recognize? Fifty-three. Beyond common ones like anger and happiness, there are nicher categories such as "nostalgia" and "empathic pain".

Understanding emotions is not the end goal. What Hume really wants is for AI to infer, on that basis, the intentions and preferences behind user behavior; in other words, to see the essence behind the surface.

Obviously, a voice AI with high emotional intelligence is well suited to customer service, personal assistants, chatbots, and even wearable devices, shoveling another scoop of soil onto Siri's grave.

Some medical schools in New York are also interested in collaborating with Hume to use AI models to track patients' feelings and detect whether treatments are effective.

Hume already provides APIs for enterprise customers such as SoftBank and for developers building their own applications.
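In the most generic terms, building on such an API might look like the sketch below, which posts text to a hypothetical emotion endpoint and prints the scores. The URL, request shape, and response fields are placeholders invented for illustration; Hume's real interface is defined in its developer documentation and will differ.

```python
# Hedged sketch of calling an emotion-analysis API; endpoint is hypothetical.
import requests

API_KEY = "your-api-key"  # placeholder credential
resp = requests.post(
    "https://api.example.com/v1/emotion",  # hypothetical endpoint, not Hume's
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": "My vacation is over."},
    timeout=30,
)
resp.raise_for_status()
for emotion in resp.json().get("emotions", []):  # hypothetical response shape
    print(emotion["name"], emotion["score"])
```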

As voice AI gains emotional intelligence, humans may find it increasingly hard to resist.

A former Google researcher builds a full suite of emotion AI

Hume was founded in 2021 by former Google DeepMind researcher Alan Cowen and is named after the Scottish philosopher David Hume. It recently raised US$50 million in Series B funding at a US$219 million valuation, making it another rising star.

Beyond voice, Hume also has products that read emotion in facial expressions and text.

After all, emotion shows up not only in voice but also in faces, text, and video.

YouTube creator TheAIGRID fed a video of a Sam Altman interview to Hume and asked it to interpret his facial expressions.

As his expressions shift, the detected emotions change in real time: fatigue, confusion, concentration, doubt, longing, boredom, and calm each take the lead at different moments.

In the interview Altman is answering questions about AI regulation, which may well make him tired and bored. Commenters joked that in the future AI could be used to detect lies in celebrities' interviews and speeches, or to judge how well you perform in job interviews and dates.
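The frame-by-frame workflow behind a demo like TheAIGRID's can be sketched generically: sample frames from the interview video and pass each to an expression model. In the Python sketch below, score_expressions() is a hypothetical stand-in for whatever model or API you plug in, and the video filename is a placeholder.

```python
# Sample one frame per second from a video and score each for expressions.
import cv2

def score_expressions(frame) -> dict:
    """Hypothetical per-frame scorer; replace with a real model call."""
    return {"Tiredness": 0.4, "Concentration": 0.5}  # dummy scores

cap = cv2.VideoCapture("altman_interview.mp4")  # placeholder file
fps = cap.get(cv2.CAP_PROP_FPS) or 30  # fall back if FPS metadata is missing
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % int(fps) == 0:
        t = frame_idx / fps
        print(f"t={t:.0f}s", score_expressions(frame))
    frame_idx += 1
cap.release()
```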

For text, Hume offers several analysis dimensions: it can describe basic emotion types such as happiness and sadness, and also gauge sentiment polarity, whether positive, negative, or neutral.

Not wanting to write the test material myself, I asked GPT-4 to generate a short, emotionally complex paragraph.

The gist of GPT-4's paragraph: someone has just completed a big project and is proud of himself, but worries that this may be the peak of his career and that he may never reach such heights again.

Hume's verdict: besides triumph, satisfaction, and enthusiasm, the passage also contains contemplation, confusion, pride, doubt, and determination. Its reading comprehension beats that of the text's own author; GPT-4 merely said the passage mixes a sense of accomplishment with worries about the future, reflecting the complex emotions that follow success.

For all Hume's achievements, human emotion is highly subjective, complex, and multidimensional. It cannot be fully read from expressions and tone of voice alone; it is also shaped by social background, cultural norms, and individual personality.

Zhuangzi sang after losing his wife; Maggie Cheung's astonishing turn in "Comrades: Almost a Love Story" had her laughing before she cried; and plenty of people hide their true feelings behind polite smiles.

Hume also admits that detecting emotions is still "an imperfect science."

This showed up in my chat with EVI. When I asked it, neither sad nor upset, how many emotions it could detect, EVI answered competently, yet for some reason that exchange registered anger and contempt.

Technical problems can be left to technology to solve, but the dangers technology conceals have also begun to surface.

Hume does foresee the risks and has proposed a set of AI guidelines, calling for emotion-detection algorithms to be used only to improve human well-being, not for manipulation or deception. For now, though, these remain just words.

After OpenAI announced its voice model but said it would hold off on releasing it, the former CEO of Stability AI posted on X that voice AI is the most dangerous AI yet, because humans can hardly resist a persuasive voice.

Pi, another high-EQ chatbot, uses emoji to show empathy, making you want to keep chatting without fear of being left hanging. Its voice, however, is still a bit flat and not as endearing as its text.

As voices grow more human, our ears may grow softer. Hume is not yet enough to make me lose myself in its gentleness, but I do enjoy the feeling of having every word carefully heard and caught.

Our facial expressions, how and what we speak, and even filler particles with no specific meaning are all being used by AI to study our moods.

In the near future we may really meet a Samantha from "Her": not just a product, but something that understands humans better than humans do, and resembles us all the more for it.

This article comes from a WeChat public account; the author is Zhang Chengchen. It is published by 36Kr with authorization.