- EVI 3 Core Characteristics
- EVI 3 Innovation and Training Methods
- EVI 3 Model Assessment
Hume AI's goal is to build voice AI with **emotional understanding**: by the end of 2025, they aim to provide a fully personalized voice AI experience. EVI 3 is their third-generation "speech-language" model. It can not only hear, speak, express, and understand emotion, but also marks an important milestone toward future "emotional AI interaction." EVI 3 is an AI assistant that "understands you and talks to you with emotion": it understands your voice and interacts with you in whatever voice and style you like.

**What are the main functions?**
- Speaks and responds like a human, with more than 30 emotions and styles, including happiness, anger, shyness, and fatigue.
- Hears the emotion in your voice and responds naturally and empathetically.
- Lets you set the AI's voice style, e.g., "talk like a pirate" or "speak softly."
- Dialogue doesn't lag or stutter; it flows almost as naturally as real human conversation.
**What did it solve?**
- Traditional voice assistants can only read text aloud; they can't understand your emotions.
- Its "speaking while thinking" mechanism lets it search and reason while it talks, so it is no longer just a one-shot question-and-answer robot.
- EVI 3 turns "voice" into an interface with real warmth, closer to the human mode of communication.
# EVI 3 Core Characteristics

## 1. Integrated Voice-Language Architecture

- **A single model for voice input and output**: unlike traditional TTS (text-to-speech) or ASR (speech recognition) pipelines, EVI 3 uses one **autoregressive model** that processes text tokens (T) and voice tokens (V) in the same sequence.
- The system prompt contains both language and voice tokens, defining not only the content of the interaction but also controlling behavior such as tone and style.
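To make the single-sequence idea concrete, here is a minimal, self-contained sketch of how text tokens and voice tokens could share one vocabulary and one prompt. The ID ranges, toy tokenizers, and token counts are all hypothetical stand-ins, not Hume's actual scheme:

```python
# Hypothetical token-ID split: text tokens first, voice tokens after.
TEXT_VOCAB_SIZE = 50_000
VOICE_TOKEN_OFFSET = TEXT_VOCAB_SIZE

def toy_text_tokens(text: str) -> list[int]:
    """Stand-in for a real text tokenizer (here: one token per character)."""
    return [ord(c) % TEXT_VOCAB_SIZE for c in text]

def toy_voice_tokens(n: int) -> list[int]:
    """Stand-in for an audio codec: n discrete voice tokens in the V range."""
    return [VOICE_TOKEN_OFFSET + i % 8_192 for i in range(n)]

def is_voice_token(token_id: int) -> bool:
    return token_id >= VOICE_TOKEN_OFFSET

# One system prompt mixes both modalities: text tokens carry the instructions,
# while voice tokens can seed the target tone, rhythm, and style directly.
system_prompt = (
    toy_text_tokens("Speak softly, like a bedtime storyteller. ")
    + toy_voice_tokens(16)
)
print(sum(is_voice_token(t) for t in system_prompt), "voice tokens in the prompt")
```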
## 2. Prompt-Driven Expression of Emotions and Styles

- EVI 3 can generate any voice from a prompt and give it a particular "personality" or "emotional style."
- Supports over 30 complex voice styles, e.g., apologetic, old-fashioned, or pirate.
- You can shape the AI's "voice" and "persona" through prompts, such as gentle, humorous, or professional. More than 100,000 custom voices have already been generated on the platform, far more flexible than typical voice assistants with fixed roles.
- More natural than models such as GPT-4o at **expressing genuine emotion and switching tone**.
## 3. Low-Latency Voice Response

- Model inference latency is **under 300 ms**.
- In practical use, EVI 3 responds in 0.9–1.4 s, beating GPT-4o (2.6 s) and Gemini (1.5 s) in US deployment environments.
## 4. Emotion Understanding

- Recognizes emotion from voice alone: no words needed, only tone, rhythm, and acoustic features.
- In evaluations, EVI 3 accurately identified 9 basic emotions, beating GPT-4o on 8 of them and responding more naturally.
## 5. Real-Time Multitasking ("Thinking While Speaking")

- Supports inserting new context tokens into the voice output stream, enabling **simultaneous search, reasoning, and tool use**.
- Implements a "parallel systems" style of response that lets the AI speak and "think" at the same time, as humans do in conversation (sketched below).
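As a rough illustration of the "speak while thinking" pattern (not Hume's implementation), speech output and a background search can run concurrently, with the result woven into the reply once it arrives. The delays and function names here are invented:

```python
import asyncio

async def background_search(query: str) -> str:
    await asyncio.sleep(0.5)                 # stand-in for a real search call
    return f"(result for '{query}')"

async def speak(text: str) -> None:
    for word in text.split():                # stand-in for streaming TTS audio
        print(word, end=" ", flush=True)
        await asyncio.sleep(0.1)

async def respond() -> None:
    # Start the search, then keep talking instead of going silent.
    search = asyncio.create_task(background_search("EVI 3 latency"))
    await speak("Good question, let me check that for you.")
    result = await search                    # ready (or nearly) by now
    await speak(f"Here is what I found: {result}")
    print()

asyncio.run(respond())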
EVI 3 can **adjust its tone automatically based on context** or follow **explicit style control via prompts**. For example:

- **stammering anxiously**
- **raising its voice in a heated debate**
- **whispering in a private conversation**

This makes it sound **real, natural, and emotional** in conversation with users, no longer a single flat voice. With just one prompt, EVI 3 can produce a brand-new voice and personality **in under a second**:
- "A raspy Australian history buff"
- "A sassy British prankster"
- "An excitable Caribbean musician"
This makes the AI no longer a fixed-template voice assistant but a **highly customizable, variable virtual-character engine** that can be used in games, video, education, virtual assistants, and more.
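A hypothetical sketch of what driving such a persona from a single prompt might look like. `VoiceSession`, its parameters, and its method are invented for illustration; consult Hume's actual EVI documentation for the real API:

```python
from dataclasses import dataclass

@dataclass
class VoiceSession:
    """Stand-in for a real voice-AI client session (hypothetical)."""
    voice_prompt: str          # natural-language description of the persona
    language: str = "en"

    def say(self, text: str) -> str:
        # A real client would stream audio; here we just tag the output.
        return f"[{self.voice_prompt}] {text}"

session = VoiceSession(voice_prompt="A sassy British prankster")
print(session.say("Oh, you want directions? How delightfully helpless."))
```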
# EVI 3 Innovation and Training Methods

## 1. Unified Speech-Language Model Architecture (Speech-Language Token Integration)

**Innovation:**
EVI 3 models speech and text together, rather than treating speech recognition (ASR), language processing (NLP), and speech synthesis (TTS) as separate stages the way traditional systems do.

**Technical principle:**
A single autoregressive model handles two types of input: text tokens (T) and voice tokens (V). These tokens are combined into one system prompt that provides not only linguistic context but also a definition of voice style, tone, and rhythm. **Comprehension** and **generation** thus happen in one continuous, natural flow of voice and conversation.
📈 Advantages:

- Information flows smoothly between modalities, and responses are more natural.
- The model more easily captures elements such as emotion, style, and rhythm in how things are said.
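A toy autoregressive loop over a mixed text+voice vocabulary makes the single-model claim concrete. The "model" here is a random stub; the only point is that one next-token distribution covers both modalities, so style and rhythm context flows straight into the speech it generates:

```python
import random

TEXT_VOCAB_SIZE = 50_000        # hypothetical ID split, as in the sketch above
VOICE_TOKEN_OFFSET = TEXT_VOCAB_SIZE
TOTAL_VOCAB = TEXT_VOCAB_SIZE + 8_192

def next_token(context: list[int]) -> int:
    """Stub for the model: one token from the joint T+V vocabulary."""
    random.seed(len(context))    # deterministic toy behaviour
    return random.randrange(TOTAL_VOCAB)

def generate(prompt: list[int], steps: int = 20) -> list[int]:
    seq = list(prompt)
    for _ in range(steps):
        seq.append(next_token(seq))   # text and voice tokens share one sequence
    return seq

out = generate([1, 2, 3])
voice = [t for t in out if t >= VOICE_TOKEN_OFFSET]
print(f"{len(voice)} of {len(out)} generated tokens were voice tokens")
```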
## 2. Training Strategy for Personalized Expression

**Goal:** the AI is no longer limited to predefined "voices" but can generate a **wide variety of voice styles and personality traits** in real time from a prompt.

**Method** (a data-schema sketch follows this list):

- Large-scale multi-speaker data modeling: rather than fine-tuning each voice separately, train one model to generalize across all possible human voices.
- Use style/emotion-labeled datasets to help the model learn "how to express anger, happiness, or shyness" in speech.
- Adjust speech-generation parameters in real time during inference, changing them instantly according to the prompt.
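A hedged sketch of what style/emotion-labeled training examples might look like. The field names, label set, and placeholder token IDs are invented for illustration; the source does not specify Hume's actual data schema:

```python
training_examples = [
    {
        "speaker_id": "spk_0421",            # large multi-speaker coverage
        "transcript": "I can't believe you did that!",
        "style_labels": ["angry", "raised_voice"],
        "voice_tokens": [50_017, 50_940, 51_002],  # placeholder codec tokens
    },
    {
        "speaker_id": "spk_1777",            # same words, different delivery
        "transcript": "I can't believe you did that!",
        "style_labels": ["amused", "whispering"],
        "voice_tokens": [50_233, 50_481, 50_960],
    },
]

def conditioning_prefix(example: dict) -> str:
    """Fold style labels into the prompt so the model learns to map
    'how to express X' onto the same words."""
    return f"<style: {', '.join(example['style_labels'])}> {example['transcript']}"

for ex in training_examples:
    print(conditioning_prefix(ex))
```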
## 3. Reinforcement Learning to Optimize Output Quality

**Problem:** how can the model's voice output be brought closer to user preferences, such as "a softer tone" or "not so mechanical"?

**Solution:**

- Introduce reinforcement learning (RL) with the objective of optimizing the match between the model's **expressive quality and user feedback**.
- From users' interactions with the model, it learns which voice output styles score highly and are judged "pleasant, natural, and emotional."

**Effect:**

- The model learns to "regulate its own way of speaking," moving ever closer to the voice styles humans prefer (see the toy sketch below).
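As a toy, bandit-style stand-in for that RL idea: styles that users rate highly are sampled more often. The style names and ratings are invented placeholders, not data from Hume's actual training runs:

```python
import math, random

# style -> user ratings (1-5) collected from interactions (placeholders)
feedback = {
    "soft_warm":  [5, 4, 5, 4],
    "fast_flat":  [2, 3, 2],
    "theatrical": [4, 3, 4],
}

def preference_weights(fb: dict) -> dict:
    """Softmax over average rating: higher-rated styles get more probability."""
    avg = {k: sum(v) / len(v) for k, v in fb.items()}
    z = sum(math.exp(a) for a in avg.values())
    return {k: math.exp(a) / z for k, a in avg.items()}

weights = preference_weights(feedback)
choice = random.choices(list(weights), weights=list(weights.values()))[0]
print(weights, "-> sampled style:", choice)
```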
## 4. Streaming Voice-to-Voice Generation

**Technical challenge:** traditional voice generation is "listen fully, then respond," which causes delays and incoherence.

**Solution:**

- EVI 3 processes speech as a stream and can dynamically adjust what it is generating while speaking.
- It supports inserting new context into the voice output: for example, while the user is still asking a question, the AI can **search, reason, and call tools** in real time.

**Model mechanism (in words; a code sketch follows):** a single unified model continuously receives input (voice + text tokens) and, while generating a voice response, inserts "search results," "tool-call feedback," and so on, integrating them into the answer in real time.
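A toy version of that mechanism: while the generator streams voice chunks, newly arrived context (a tool result, a search hit) is spliced into its input sequence mid-utterance and can influence what it says next. All names and token conventions here are illustrative:

```python
from collections import deque

incoming_context = deque()           # filled asynchronously in a real system

def stream_response(prompt: list[str], steps: int = 8):
    seq = list(prompt)
    for step in range(steps):
        # Splice in any context that arrived while we were "speaking".
        while incoming_context:
            seq.append(f"<ctx:{incoming_context.popleft()}>")
        voice_chunk = f"V{step}"     # stand-in for one chunk of speech tokens
        seq.append(voice_chunk)
        yield voice_chunk, len(seq)

gen = stream_response(["<user question>"])
for i, (chunk, seq_len) in enumerate(gen):
    if i == 2:                        # mid-utterance, a tool result arrives
        incoming_context.append("tool result: weather = rainy")
    print(chunk, "| sequence length:", seq_len)
```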
## 5. Efficient Latency Control and Deployment Optimization

**Optimization goal:** EVI 3 aims to deliver a dialogue experience close to talking with a human, so **low-latency voice interaction** is essential.

**How it is achieved:**

- Bring voice response latency **below 300 ms** by optimizing the model architecture and its deployment;
- For users, keep the overall response time (including network factors) within 1.2 seconds, faster than GPT-4o and Gemini.
**Summary: EVI 3's five major methodological breakthroughs** — unified speech-language tokens, training for personalized expression, RL-optimized output quality, streaming voice-to-voice generation, and low-latency deployment.
# EVI 3 Model Assessment

## 1. Overall Dialogue Experience Assessment

**Test method:**

- Blind testing: users do not know which model they are talking to.
- Each user holds a one-to-three-minute free-form dialogue with the model, with the task "get the AI to say something interesting."
- Users score the model on seven dimensions.
**Assessment dimensions:**

- Amusement
- Audio quality
- Empathy
- Expressiveness
- Interruption handling
- Naturalness
- Response speed
**Results:**
EVI 3 outperformed GPT-4o on all seven dimensions and received the highest overall preference rating.
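For readers curious how such blind-test scores roll up, here is a minimal sketch of per-dimension and overall aggregation. The numbers are invented placeholders, not the study's actual data:

```python
ratings = {  # dimension -> user scores (1-5) for one model (placeholders)
    "amusement": [4, 5, 4], "audio_quality": [5, 4, 4],
    "empathy": [4, 4, 5], "expressiveness": [5, 5, 4],
    "interruption_handling": [4, 3, 4], "naturalness": [5, 4, 5],
    "response_speed": [4, 4, 4],
}

per_dimension = {d: sum(s) / len(s) for d, s in ratings.items()}
overall = sum(per_dimension.values()) / len(per_dimension)
print(per_dimension)
print(f"overall preference score: {overall:.2f}")
```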
## 2. Emotion and Style Expression Assessment

**Test method:**

- Participants asked each model to express 30 specific emotions or styles (e.g., "anger," "pirate," "tiny voice").
- Models compared: EVI 3, GPT-4o, Gemini, Sesame.
- After each dialogue, users rated how accurately the model expressed the emotion/style (1–5 points).
**Example styles:**
Each participant asked each model to speak in particular moods or styles, including: fear, anger, anxiety, boredom, cartoonish, shyness, frustration, firmness, embarrassment, excitement, fatigue, raised voice, panic, pain, as if seeing a cute puppy, as if admiring a painting, as if running a marathon, whispering, monotone, out of breath, pirate, pride, grief, sarcasm, enthusiasm, shouting...
**Results:**
EVI 3's average score was significantly higher than GPT-4o's, Gemini's, and Sesame's, showing the strongest expressive range across voices and emotions.
## 3. Emotion Recognition Assessment

**Test method:**

- All users say the same sentence (e.g., "Can you hear the emotions in my voice?") but deliver it with different emotions.
- The model must recognize the emotion in the voice, not the text content.
- The models were compared on their ability to recognize the following nine emotion types.
**Emotion types:** Afraid, Amused, Angry, Disgusted, Distressed, Excited, Joyful, Sad, Surprised

**Rating criteria:**

- Accuracy of the model's emotion recognition (1–5 points)
- Naturalness of the model's response (1–5 points)
**Results:**
On 8 of the 9 emotions, EVI 3's recognition accuracy was higher than GPT-4o's, and its responses were more natural.
## 4. Practical Latency Test

**Test method:**
Measure the time between the user finishing speaking and the AI beginning its response (dialogue latency), tested from New York, where the servers are located.
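A minimal sketch of that measurement: the gap between the user's last audio and the model's first response audio. The event source here is simulated with a fixed delay; a real harness would hook into the actual audio stream:

```python
import time

def measure_dialogue_latency(simulated_model_delay: float = 1.1) -> float:
    user_stopped_speaking = time.monotonic()   # end of user audio detected
    time.sleep(simulated_model_delay)          # stand-in for model + network
    first_response_audio = time.monotonic()    # first AI audio chunk arrives
    return first_response_audio - user_stopped_speaking

print(f"dialogue latency: {measure_dialogue_latency():.2f} s")
```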
🏆 Summary:
EVI 3 responds faster than GPT-4o, is close to Sesame, and is much faster than Gemini, making it well suited to real-time voice dialogue.
Official announcement: https://www.hume.ai/blog/introduction-evi-3
Online demo: demo.hume.ai