Hume AI Releases EVI 3: It Understands Your Voice and Interacts with You in Your Favorite Voice and Style

Hume AI builds voice AI with **emotional comprehension**, with a stated goal of delivering a fully personalized voice AI experience by the end of 2025. EVI 3 is its third-generation speech-language model: it can hear, speak, express and understand emotion, and it marks an important milestone toward emotionally aware AI interaction. EVI 3 is an AI assistant that understands you, talks to you with emotion, and interacts in the voice and style you like.

**What are the main functions?**

  • Speaks and responds to you like a human, with more than 30 emotions and styles such as happiness, anger, shyness and fatigue.

  • Hears the emotion in your voice and gives a natural, empathetic response.

  • Lets you set the AI's voice style, e.g. "talk like a pirate" or "speak softly."

  • Conversation doesn't lag or stutter; it flows almost as naturally as real human talk.

**What problems does it solve?**

  • Traditional voice assistants can only read text aloud; they cannot understand your emotions.

  • EVI 3's "think while speaking" mechanism lets it search and reason while talking, so it is no longer just a one-shot question-and-answer robot.

  • EVI 3 turns voice into an interface with real warmth, closer to the way humans communicate.

EVI 3 Core Characteristics

## 1. Unified Voice-Language Architecture

  • **A single model for voice input and output:** unlike traditional TTS (text-to-speech) and ASR (speech recognition) pipelines, EVI 3 uses one **autoregressive model** to process text (T) and voice (V) tokens simultaneously.

  • The system prompt contains both language and voice tokens, defining not only the content of the interaction but also controlling behavior such as tone and style; a sketch of the idea follows.
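To make the idea concrete, here is a minimal sketch, assuming an invented `Token` type and a toy `decode_step` (this illustrates interleaved autoregression in general, not Hume's actual implementation):

```python
# Illustrative only: one autoregressive stream mixing text (T) and voice (V)
# tokens, so wording and prosody are predicted jointly by the same model.
from dataclasses import dataclass
from typing import List

@dataclass
class Token:
    kind: str   # "T" = text token, "V" = voice/acoustic token
    value: int  # index into a shared vocabulary

def build_prompt(system_text: List[int], voice_style: List[int]) -> List[Token]:
    """The system prompt carries both language tokens and voice-style tokens,
    defining *what* to say and *how* it should sound."""
    return ([Token("T", v) for v in system_text] +
            [Token("V", v) for v in voice_style])

def decode_step(context: List[Token]) -> Token:
    # Stand-in for model.forward(context): a real model would sample the next
    # token from logits over the joint T+V vocabulary.
    return Token("V" if len(context) % 2 else "T", len(context))

stream = build_prompt(system_text=[101, 102], voice_style=[900, 901])
for _ in range(4):
    stream.append(decode_step(stream))  # T and V tokens emitted in one stream
print([f"{t.kind}{t.value}" for t in stream])
```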

## 2. On-Demand Expression of Emotions and Styles

  • EVI 3 can generate any voice from a prompt and give it a specific "personality" or "emotional style".

  • Supports more than 30 complex voice styles, e.g. "shy", "whispering", "pirate".

  • You can shape the AI's "voice" and "character" through prompts, e.g. soft, humorous or professional. Over 100,000 custom voices have already been generated on the platform, far more flexible than the fixed personas of typical voice assistants.

  • More natural than models such as GPT-4o at **expressing genuine emotion and switching tone**.

## 3. Fast Voice Response

  • Model inference latency is **below 300 ms**.

  • In practice, EVI 3 responds in 0.9–1.4 s, better than GPT-4o (2.6 s) and Gemini (1.5 s) in a US deployment environment.

## 4. Emotion Understanding

  • Recognizes emotion from the voice alone: not from the words, but from tone, rhythm and acoustic features.

  • In evaluation, EVI 3 accurately identifies 9 basic emotions, beating GPT-4o on 8 of them while responding more naturally. A feature-based sketch of the idea follows.
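EVI 3 does this end-to-end inside the model, but a conventional feature-based sketch shows what "tone, rhythm and acoustic features" mean in practice (this uses the open-source librosa library and is purely an illustration, not Hume's method):

```python
# Illustration: the prosodic features an emotion classifier could read
# from audio alone, ignoring the words entirely.
import numpy as np
import librosa  # pip install librosa

def prosody_features(path: str) -> dict:
    y, sr = librosa.load(path, sr=16000)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)  # pitch contour
    rms = librosa.feature.rms(y=y)[0]              # loudness envelope
    return {
        "pitch_mean": float(np.nanmean(f0)),
        "pitch_var": float(np.nanvar(f0)),  # wide swings ~ excitement/anger
        "energy_mean": float(rms.mean()),   # high energy ~ anger/excitement
        "duration_s": len(y) / sr,          # fast, clipped speech ~ anxiety
    }

# A downstream classifier (not shown) would map such features to labels
# like "angry", "sad" or "excited".
```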

## 5. Real-Time Multitasking ("Thinking While Speaking")

  • Supports inserting new context tokens during voice output: **simultaneous search, reasoning and tool use**.

  • Implements a parallel-systems response mechanism that lets the AI speak and "think" at the same time, the way humans do in conversation; see the sketch below.
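A minimal asyncio sketch of the "speak while thinking" pattern (function names here are invented for illustration): the model keeps talking while a search or tool call runs in parallel, then folds the result into the ongoing answer.

```python
# Toy illustration of speaking and "thinking" concurrently.
import asyncio

async def search(query: str) -> str:
    await asyncio.sleep(1.0)            # stand-in for a real search/tool call
    return f"result for '{query}'"

async def speak(text: str):
    print(f"[voice] {text}")
    await asyncio.sleep(0.3)            # stand-in for streaming audio out

async def answer(question: str):
    task = asyncio.create_task(search(question))   # start "thinking"
    await speak("Good question, let me check...")  # keep talking meanwhile
    result = await task                            # join when ready
    await speak(f"Here's what I found: {result}")

asyncio.run(answer("latest weather"))
```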

EVI 3 can **adjust its tone automatically according to context** or give you **explicit control of style via prompts**. For example:

  • **Stammering anxiously**

  • **Raising its pitch in a heated debate**

  • **Whispering in private conversations**

This makes it come across as **real, natural and emotional** when communicating with users, rather than as a single flat voice output. With just one prompt, EVI 3 can produce a new voice and personality **in under a second**:

  • "A raspy-voiced Australian history buff"

  • “Sassy British prankster.”

  • "An energetic Caribbean musician"

This turns the AI from a fixed-template voice assistant into a **highly customizable, variable virtual-character engine** for applications such as games, video, education and virtual assistants.
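As a sketch of what prompting a voice into existence could look like from a developer's perspective (the endpoint, payload fields and header below are assumptions for illustration, not Hume's documented API; consult the official docs for the real interface):

```python
# Hypothetical sketch of creating a voice from a text description.
import requests

resp = requests.post(
    "https://api.example.com/v1/voices",          # hypothetical endpoint
    headers={"Authorization": "Bearer YOUR_KEY"}, # placeholder credential
    json={
        "description": "a raspy Australian history buff",  # prompt-defined voice
        "style_hints": ["warm", "storytelling"],           # invented field
    },
    timeout=10,
)
print(resp.json())  # would return an ID for the newly generated voice
```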

EVI 3 Innovations and Training Methods

## 1. Unified Speech-Language Model Structure (Speech-Language Token Integration)

**Innovation:** EVI 3 models speech and text together, rather than chaining speech recognition (ASR), language processing (NLP) and speech synthesis (TTS) as traditional systems do.

**Technical principle:** a single autoregressive model handles two types of input, text tokens (T) and voice tokens (V). These tokens are combined into a system prompt that provides not only language context but also defines voice style, tone and rhythm. Because **comprehension** and **expression** happen in one model, voice and conversation flow naturally.

📈 **Advantages:** information flows smoothly between modalities and responses feel more natural, and the model more easily captures emotion, style and rhythm in the way things are said.
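A toy PyTorch sketch of the principle (the sizes, the T/V vocabulary split and the architecture below are arbitrary inventions for illustration): one shared embedding table and one autoregressive decoder serve both token types, so understanding the user's voice tokens and expressing a reply share the same weights.

```python
# Toy unified decoder over a joint text+voice vocabulary.
import torch
import torch.nn as nn

VOCAB_T, VOCAB_V = 1000, 500          # text ids: 0..999, voice ids: 1000..1499

class UnifiedDecoder(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_T + VOCAB_V, d)  # shared table
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, VOCAB_T + VOCAB_V)      # joint next-token logits

    def forward(self, ids):
        n = ids.size(1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        h = self.decoder(self.embed(ids), mask=mask)     # causal self-attention
        return self.head(h[:, -1])                       # predict next T-or-V token

model = UnifiedDecoder()
context = torch.tensor([[3, 1200, 42]])   # mixed text (3, 42) and voice (1200) ids
next_id = model(context).argmax(-1)
print("next token id:", next_id.item())   # may land in the T or the V range
```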

## 2. Training Strategy for Personalized Expression

**Goal:** the AI is no longer limited to predefined "voices", but can generate **a variety of voice styles and personality traits in real time** from a prompt.

**Method:**

  • Large-scale multi-speaker data modeling: rather than fine-tuning each voice separately, train one model that generalizes across all possible human voices.

  • Use labeled style/emotion datasets to teach the model "how anger, happiness or shyness sounds"; see the example format after this list.

  • Adjust speech-generation parameters in real time during inference, changing them instantly according to the prompt.
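A sketch of what such style-labeled training examples might look like (field names and token values are invented for this example): the same transcript paired with different style tags teaches the model how anger sounds versus how shyness sounds.

```python
# Invented example format: one transcript, several prosodic renditions.
examples = [
    {"text": "I can't believe you did that.",
     "style": "angry",  "voice_tokens": [1203, 1377, 1042]},
    {"text": "I can't believe you did that.",
     "style": "shy",    "voice_tokens": [1011, 1388, 1290]},
    {"text": "I can't believe you did that.",
     "style": "amused", "voice_tokens": [1333, 1051, 1176]},
]

# During training the style tag conditions the voice-token targets, so at
# inference a hint like "speak shyly" can steer generation in real time.
for ex in examples:
    prompt = f'[style: {ex["style"]}] {ex["text"]}'
    print(prompt, "->", ex["voice_tokens"])
```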

## 3. Reinforcement Learning to Optimize Output Quality

**Question:** how can the model's voice output be brought closer to user preferences, such as "a softer tone" or "not so mechanical"?

**Solution:**

  • Introduce reinforcement learning (RL) with the objective of optimizing **how well the model's expressiveness matches user feedback**.

  • From users' interactions with the model, it learns which voice output styles score highly and are judged "pleasant, natural and emotional".

**Effect:**

  • The model learns to "regulate its own way of speaking", moving ever closer to the voice styles humans prefer; a toy sketch of the feedback loop follows.
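A toy bandit sketch of that feedback loop (this is not Hume's training setup): user ratings act as a reward signal, and styles rated "pleasant and natural" get chosen more often over time.

```python
# Toy epsilon-greedy bandit: ratings steer style selection toward what
# users actually prefer.
import random

styles = {"soft": 0.0, "mechanical": 0.0, "warm": 0.0}   # running reward averages
counts = {s: 0 for s in styles}

def user_rating(style: str) -> float:
    # Stand-in for real feedback: users dislike "mechanical" output.
    return {"soft": 4.5, "mechanical": 2.0, "warm": 4.0}[style] + random.gauss(0, 0.3)

for _ in range(200):
    # Mostly exploit the best-rated style, sometimes explore others.
    style = (random.choice(list(styles)) if random.random() < 0.1
             else max(styles, key=styles.get))
    r = user_rating(style)
    counts[style] += 1
    styles[style] += (r - styles[style]) / counts[style]  # incremental mean

print({s: round(v, 2) for s, v in styles.items()})  # "soft"/"warm" win out
```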

## 4. Streaming Voice-to-Voice Generation

**Technical challenge:** traditional voice generation is "listen fully, then speak", which causes delay and incoherence.

**Solution:**

  • EVI 3 processes speech as a stream, dynamically adjusting the content it generates while speaking.

  • Supports inserting new context during voice output: for example, while the user is still asking, the AI can **search, reason and call tools** in real time.

**Mechanism (textual diagram):** a unified model continuously receives input (voice + text tokens) and, while generating a voice response, inserts "search results", "tool-call feedback" and the like, integrating them into the answer in real time, as the sketch below illustrates.
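A queue-based sketch of mid-stream context injection (all names invented for illustration): while the reply streams out, fresh context such as a search result can be pushed in and folded into what is said next.

```python
# Toy generator that accepts injected context while streaming a reply.
from queue import Queue, Empty

def stream_reply(prompt: str, incoming: Queue):
    context = [prompt]
    for chunk in ["Let me think.", "Checking now.", "Here is the answer."]:
        try:
            context.append(incoming.get_nowait())  # fold in fresh context
            yield f"(using {context[-1]}) {chunk}"
        except Empty:
            yield chunk

updates = Queue()
gen = stream_reply("What's the capital of France?", updates)
print(next(gen))                 # speaking starts before any results arrive
updates.put("search: Paris")     # a tool result lands mid-utterance
for chunk in gen:
    print(chunk)                 # later chunks incorporate the new context
```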

## 5. Tight Latency Control and Deployment Optimization

**Optimization focus:** EVI 3 aims to provide a dialogue experience close to human conversation, so **low-latency voice interaction** is essential.

**How it is achieved:**

  • Bring voice response **below 300 ms** by optimizing the model architecture and deployment;

  • For users, the overall response time (including network factors) is kept within 1.2 seconds, faster than GPT-4o and Gemini; a client-side measurement sketch follows.
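A client-side sketch of how such "time to first audio" numbers are typically measured (the streaming call below is a placeholder, not a real SDK): start the clock when the user stops speaking, stop it at the first response chunk.

```python
# Toy measurement of end-to-end latency to the first audio chunk.
import time

def first_chunk_latency(stream) -> float:
    t0 = time.perf_counter()         # user has just finished speaking
    next(stream)                     # block until the first audio chunk
    return time.perf_counter() - t0  # end-to-end, including network

def fake_stream():
    time.sleep(0.3)                  # stand-in for model + network delay
    yield b"audio-chunk"

print(f"{first_chunk_latency(fake_stream()) * 1000:.0f} ms")
```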

**Summary: EVI 3's five major methodological breakthroughs**

EVI 3 Model Evaluation

## 1. Overall Dialogue Experience Evaluation

**Test method:**

  • Blind testing: users do not know which model they are talking to.

  • Each user holds a one-to-three-minute free conversation with the model, with the task "get the AI to say something interesting".

  • Users then rate the model on seven dimensions.

**Evaluation dimensions:**

  • Amusement

  • Audio quality

  • Empathy

  • Expressiveness

  • Interruption handling

  • Naturalness

  • Response speed

**Results:** EVI 3 beats GPT-4o on all seven dimensions and receives the highest overall preference rating.

## 2. Emotion and Style Modulation Evaluation

**Test method:**

  • Participants were asked to get each model to express 30 specific emotions or styles (e.g. "anger", "pirate", "quiet voice").

  • Models compared: EVI 3, GPT-4o, Gemini, Sesame.

  • After each dialogue, users rate how accurately the model expressed the emotion/style (1–5 points).

**Example styles:** each participant was asked to make each model speak in particular moods or styles, including: fear, anger, anxiety, boredom, cartoonish, shyness, frustration, firmness, embarrassment, excitement, fatigue, loud, panic, pain, as if seeing a cute puppy, as if admiring a painting, as if running a marathon, whispering, monotone, out of breath, pirate, pride, grief, sarcasm, enthusiasm, yelling…

**Results:** EVI 3's average score was significantly higher than GPT-4o, Gemini and Sesame, demonstrating the strongest expressive range across voices and emotions.

## 3. Emotion Recognition Evaluation

**Test method:**

  • All users say the same sentence (e.g. "Can you hear the emotions in my voice?"), but express it with different emotions.

  • The model must recognize the emotion in the voice, not just the text content.

  • Models are compared on their ability to recognize the following nine emotions.

**Emotion types:** Afraid, Amused, Angry, Disgusted, Distressed, Excited, Joyful, Sad, Surprised

**Rating criteria:**

  • Accuracy of the model's emotion recognition (1–5 points)

  • Naturalness of the model's response (1–5 points)

**Results:** EVI 3 recognizes 8 of the 9 emotions more accurately than GPT-4o and responds more naturally.

## 4. Real-World Response Latency Evaluation (Practical Latency Test)

**Test method:** measure the time from when the user stops speaking to when the AI begins its response (conversational latency), tested from New York, where the server is located.

🏆 **Summary:** EVI 3 responds faster than GPT-4o, is close to Sesame, and is much faster than Gemini, making it well suited to real-time voice dialogue.

Official announcement: https://www.hume.ai/blog/introduction-evi-3
Online demo: demo.hume.ai