Bland AI has released a new TTS engine that uses a large language model to generate speech directly and needs only a short audio sample to clone any human voice.

Bland AI has released the brand-new **Bland TTS**, calling it **the first product to cross the “uncanny valley”**.

  • Uncanny valley: the unease people feel when an AI voice or face is close to human but not quite perfect. Bland TTS claims to have crossed it, making its AI voice **almost impossible to distinguish from a real person**.

Bland TTS **needs only a short audio sample** to:

  • Clone any human voice.

  • Or “blend” the styles of other cloned voices (e.g. tone, rhythm, pronunciation).

At its core, the system uses a large language model (LLM) to generate speech directly, rather than relying on a traditional multi-stage pipeline. Bland describes it as offering stronger emotional expression, style control, multi-speaker understanding, and non-verbal sound generation, and as achieving more realistic, controllable, and context-aware speech synthesis through its audio token system (SNAC).
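
The idea of generating audio tokens directly can be sketched as an autoregressive loop. This is a toy illustration only, not Bland's implementation: `generate_audio_tokens`, `toy_predict_next`, `toy_decode`, and the `EOS` marker are all hypothetical names, and the "model" and "decoder" here are trivial stand-ins for a real LLM and a real audio codec.

```python
from typing import Callable, List, Sequence

EOS = -1  # hypothetical end-of-audio marker


def generate_audio_tokens(
    text_tokens: Sequence[int],
    predict_next: Callable[[Sequence[int], Sequence[int]], int],
    max_tokens: int = 64,
) -> List[int]:
    """Predict audio tokens one at a time from the text and the audio so far."""
    audio: List[int] = []
    for _ in range(max_tokens):
        token = predict_next(text_tokens, audio)
        if token == EOS:
            break
        audio.append(token)
    return audio


def toy_predict_next(text_tokens: Sequence[int], audio_so_far: Sequence[int]) -> int:
    """Stand-in for the LLM: emits two audio tokens per text token, then stops."""
    i = len(audio_so_far)
    if i >= 2 * len(text_tokens):
        return EOS
    return text_tokens[i // 2] * 10 + i % 2


def toy_decode(audio_tokens: Sequence[int]) -> List[float]:
    """Stand-in for the codec decoder: maps each token to one fake sample."""
    return [(t % 100) / 100 for t in audio_tokens]


text_tokens = [3, 1, 4]  # pretend output of a text tokenizer
audio_tokens = generate_audio_tokens(text_tokens, toy_predict_next)
waveform = toy_decode(audio_tokens)
print(audio_tokens)
```

The point of the sketch is the shape of the loop: there are no separate phoneme or prosody stages, only one predictor that sees the text and everything it has already generated.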

# Feature highlights

## 1. Style Transfer

  • Through **“in-context learning”**, the model works out on its own what an “excited tone” or a “calm tone” is;

  • Control tags can also be added manually, for example on a line such as “This is a big breakthrough!”;
  • With 3 to 6 speech examples, the system can synthesize new content in the same style.
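
A few-shot style prompt of this kind could be assembled roughly as follows. This is a minimal sketch under stated assumptions: `StyleExample`, `build_style_prompt`, the dictionary layout, the clip paths, and the angle-bracket tag syntax are all hypothetical and are not taken from Bland's API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StyleExample:
    """One reference utterance: its transcript and the path to its audio clip."""
    transcript: str
    audio_path: str


def build_style_prompt(
    examples: List[StyleExample],
    target_text: str,
    style_tag: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Arrange 3 to 6 reference utterances ahead of the text to be spoken."""
    if not 3 <= len(examples) <= 6:
        raise ValueError("expected between 3 and 6 style examples")

    prompt = [
        {"role": "example", "text": ex.transcript, "audio": ex.audio_path}
        for ex in examples
    ]
    # Hypothetical tag syntax: an optional manual label in angle brackets.
    text = f"<{style_tag}> {target_text}" if style_tag else target_text
    prompt.append({"role": "target", "text": text})
    return prompt


examples = [
    StyleExample("We did it!", "clips/excited_1.wav"),
    StyleExample("I can't believe this worked!", "clips/excited_2.wav"),
    StyleExample("What a day!", "clips/excited_3.wav"),
]
prompt = build_style_prompt(examples, "This is a big breakthrough!", style_tag="excited")
```

The examples carry the style implicitly, while the optional tag states it explicitly; either can be used alone in this sketch.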

## 2. Sound Effect Generation

The model can synthesize not only speech but also **sound effects**, such as:

  • Laughter;
  • A dog barking.

As long as you provide annotated text together with audio examples, the model learns the correspondence.

## 3. Voice Blending

Given multiple voice examples, the system automatically “blends” them into a new voice that preserves traits of each speaker’s identity while keeping a consistent tone. This is useful for:

  • Brand voice design;
  • Consistent multilingual output;
  • Virtual character creation.

## 4. Context-Aware Tone

The system no longer reads word by word; it actually changes its tone to fit the context:

  • Technical content sounds more measured;
  • Comforting content sounds warmer;
  • Question-and-answer exchanges sound more natural.

# Core technology: reshaping the traditional TTS process

**The pain points of traditional TTS**

In the past, TTS followed a pipeline approach:

Text → Phonemes → Prosody → Waveform → Synthesized audio

Each step can introduce errors, and the end result often lacks emotion and sounds disjointed. This is because traditional methods **first analyze the content and then “assemble” the voice**, which makes it hard to convey tone and emotion naturally.

**Bland’s approach: integrated modeling**

Bland AI’s new technology collapses the entire process into one step, using a **large language model to predict sound directly**:

Text input → the model outputs “audio tokens” directly → the tokens are decoded back into real sound

It is as if you tell the model what to say and it produces a voice from its own understanding of tone and emotion, instead of “translating” step by step through a chain of stages.

## A breakthrough at the data level: a thousand times larger

The foundation of any generative system is data quality. The Bland team believes that public voice data is not enough, especially for modeling real dialogue.
They built an **industry-scale** voice dataset with the following characteristics:

![](https://assets-v2.circle.so/b3zzwwxqduz6nfhx1fgvjfxgmxrq)

# Core of the technical architecture: from text LLM to voice LLM

## How an LLM normally works

The traditional LLM approach is:

  • Split the text into tokens;
  • Learn to predict the next token;
  • Restore the tokens to a full sentence.

Bland’s method:

  • Split the text into tokens;
  • Predict the corresponding “audio tokens”;
  • Restore them to a speech waveform.

Here, an **audio token** is a discrete representation produced by SNAC encoding that captures both:

  • Macro-level rhythm (e.g. speaking rate, pauses);
  • Micro-level detail (e.g. pronunciation, individual sounds).

This lets the model master “content plus expression” at the same time: saying the right thing in the right way.

# **Use cases and target users**

## 1. Creators

  • Turn text into **realistic AI voices or sound effects**
  • Support for **fine-grained control of style and emotion**
  • Suited to content such as podcasts, audio programs, audiobooks, and film

## 2. Developers

  • Integrate it into your application via the API
  • Build products with custom voice features (e.g. voice assistants, educational products, announcement systems)

## 3. Enterprise users

  • Build commercial voice services such as **AI customer service systems and phone assistants**
  • The voice sounds natural enough that customers may even save it as a contact
  • **You can try a conversation with the AI directly on the website**

Official announcement: https://www.bland.ai/blogs/new-tts-announcement

Quick Start links:

  • Developer portal: https://t.co/qBpGkJh2Gp
  • Enterprise portal: https://t.co/Szf9KNwfHs