ChatTTS: Transform Your Text into Speech

Maigari David 27 Aug, 2024
8 min read

Introduction

Imagine you’re creating a podcast or crafting a virtual assistant that sounds as natural as a real conversation. That’s where ChatTTS comes in. This cutting-edge text-to-speech tool turns your written words into lifelike audio, capturing nuances and emotions with incredible precision. Picture this: you type out a script, and ChatTTS brings it to life with a voice that feels genuine and expressive. Whether you’re developing engaging content or enhancing user interactions, ChatTTS offers a glimpse into the future of seamless, natural-sounding dialogues. Dive in to see how this tool can transform your projects and make your voice heard in a whole new way.

Learning Outcomes

  • Learn about the unique capabilities and advantages of ChatTTS in text-to-speech technology.
  • Identify key differences and benefits of ChatTTS compared to other text-to-speech models like Bark and Vall-E.
  • Gain insight into how text pre-processing and output fine-tuning enhance the customizability and expressiveness of generated speech.
  • Discover how to integrate ChatTTS with large language models for advanced text-to-speech applications.
  • Understand practical applications of ChatTTS in creating audio content and virtual assistants.

This article was published as a part of the Data Science Blogathon.

Overview of ChatTTS

ChatTTS is a voice generation tool built for conversational scenarios. As demand for voice generation grows alongside text generation and LLMs, ChatTTS makes spoken dialogue more accessible, and its large-scale pretraining helps the generated speech sound natural.

ChatTTS is one of the stronger open-source text-to-speech models available and suits many applications. It supports both English and Chinese. Trained on over 100,000 hours of data, it can produce dialogue that sounds natural in both languages.

What are the Features of ChatTTS?

ChatTTS stands out from other text-to-speech models, which can sound generic and lack expressiveness. Trained on over 100,000 hours of English and Chinese data, it is a notable step forward for AI voice generation. Other text-to-audio models, such as Bark and Vall-E, offer similar features, but ChatTTS edges them out in some respects.

For example, when comparing ChatTTS with Bark, there is a notable difference in how they handle long-form input.

Bark's output is usually no longer than 13 seconds because of its GPT-style architecture. Bark's inference can also be slow on older GPUs, default Colab instances, or CPUs, although it performs well on enterprise GPUs with a recent PyTorch build.

ChatTTS, on the other hand, has good inference speed: it can generate audio corresponding to around seven semantic tokens per second. Its emotion control also gives it an edge over Vall-E.

Let’s delve into some of the unique features that make ChatTTS a valuable tool for AI voice generation: 

Conversational TTS

This model is trained for dialogue tasks and delivers them expressively. It reproduces natural speech patterns and supports speech synthesis for multiple speakers, which makes it a good fit for users with conversational voice synthesis needs.

Control and Security

The ChatTTS team has taken steps to address the safety and ethical concerns around this tool. There is an understandable concern about abuse of the model, and measures such as deliberately reducing the audio quality and ongoing work on an open-source tool to detect artificial speech are good examples of ethical AI development.

Integration with LLMs

Integration with large language models extends what the model can do, and the team pairs it with further safeguards. The ChatTTS team has shown its commitment to reliability: adding watermarks alongside LLM integration is a visible sign that it is addressing the safety and reliability concerns that may arise.

This model has a few more standout qualities. One vital feature is that users can control the output and certain speech variations. The next section explains this better. 

Text Pre-processing: Special Tokens For More Control

The level of control this model gives users is what makes it unique. When writing the input text, you can include special tokens. These tokens act as embedded commands that control oral characteristics of the speech, including pauses and laughter.

Token control works at two levels: sentence-level control and word-level control. Sentence-level control uses tokens such as [laugh_0] to [laugh_2] for laughter and [break_N] for pauses. Word-level control places tokens such as [uv_break] around specific words to make the sentence more expressive.
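As a minimal sketch of word-level control, the snippet below inserts a pause token after chosen words before the text is handed to the model. The helper `add_breaks_after` is hypothetical (it is not part of ChatTTS); only the `[uv_break]` token itself comes from the examples in this article.

```python
UV_BREAK = "[uv_break]"  # word-level pause token used in this article's examples


def add_breaks_after(text: str, words: set[str], token: str = UV_BREAK) -> str:
    """Hypothetical helper: insert `token` after each occurrence of the given words."""
    out = []
    for piece in text.split():
        out.append(piece)
        # Compare without surrounding punctuation so "goals," still matches "goals".
        if piece.strip(".,;:!?\"'").lower() in words:
            out.append(token)
    return " ".join(out)


marked = add_breaks_after(
    "So we found being competitive and collaborative was a huge way of staying motivated.",
    {"collaborative", "staying"},
)
print(marked)
```

Text prepared this way already carries its control tokens, so it would typically be passed to the model with text refinement skipped.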

ChatTTS: Fine-tuning the Output

Using some parameters, you can refine the output during audio generation. This is another crucial feature that makes this model more controllable. 

This is similar to sentence-level control: users can set properties such as speaker identity, speech variation, and decoding strategy.

Generally, text pre-processing and output fine-tuning are two critical features that give ChatTTS its high level of customization and ability to generate expressive voice conversations.

params_infer_code = {'prompt':'[speed_5]', 'temperature':.3}
params_refine_text = {'prompt':'[oral_2][laugh_0][break_6]'}
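Writing these prompt strings by hand makes it easy to mistype a level. A small builder, sketched below, can assemble and validate the string first. The function `build_refine_prompt` is hypothetical, and the ranges for `oral` (0-9) and `break` (0-7) are assumptions; only the 0-2 range for `laugh` is stated in this article.

```python
# Assumed valid levels per token; only the laugh range (0-2) is given in the article.
TOKEN_RANGES = {"oral": range(0, 10), "laugh": range(0, 3), "break": range(0, 8)}


def build_refine_prompt(oral: int, laugh: int, break_: int) -> str:
    """Hypothetical helper: build a sentence-level control prompt, checking each level."""
    levels = {"oral": oral, "laugh": laugh, "break": break_}
    for name, level in levels.items():
        if level not in TOKEN_RANGES[name]:
            allowed = TOKEN_RANGES[name]
            raise ValueError(f"{name} must be in {allowed.start}..{allowed.stop - 1}, got {level}")
    return "".join(f"[{name}_{level}]" for name, level in levels.items())


params_refine_text = {"prompt": build_refine_prompt(oral=2, laugh=0, break_=6)}
print(params_refine_text)  # {'prompt': '[oral_2][laugh_0][break_6]'}
```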

Open Source Plans and Community Involvement

ChatTTS has strong potential, with fine-tuning capabilities and seamless integration with LLMs. The team plans to open-source a trained base model to support further development and to recruit more researchers and developers to improve it.

There have also been talks of releasing a version of this model with multiple emotion controls, along with LoRA training code. This could considerably reduce the difficulty of training, since ChatTTS has LLM integration.

This model also supports a web user interface where you can input text, change parameters, and generate audio interactively. This is possible with the webui.py script. 

python webui.py --server_name 0.0.0.0 --server_port 8080 --local_path /path/to/local/models

How to Use ChatTTS

This section walks through the steps to run the model, from downloading the code to fine-tuning the output.

Downloading the Code and Installing Dependencies

!rm -rf /content/ChatTTS
!git clone https://github.com/2noise/ChatTTS.git
!pip install -r /content/ChatTTS/requirements.txt
!pip install nemo_text_processing WeTextProcessing
!ldconfig /usr/lib64-nvidia

These commands set up the environment in a notebook such as Google Colab. Cloning the repository from GitHub fetches the project's latest version. The remaining lines install the necessary dependencies and make sure the system libraries are correctly configured for NVIDIA GPUs.

Importing Required Libraries

The next step is to import the libraries your script needs: torch, ChatTTS, and Audio from IPython.display. In an .ipynb notebook, you can listen to the audio directly. Alternatively, you can save the audio as a '.wav' file using a third-party library such as SoundFile, or a tool such as FFmpeg.

The code should look like the block below: 

import torch
torch._dynamo.config.cache_size_limit = 64
torch._dynamo.config.suppress_errors = True
torch.set_float32_matmul_precision('high')


from ChatTTS import ChatTTS
from IPython.display import Audio

Initializing ChatTTS

This step creates an instance of the Chat class, named 'chat', and then loads the pre-trained ChatTTS weights.

chat = ChatTTS.Chat()

# force_redownload=True fetches the weights again; use it if the published weights have been updated.
chat.load_models(force_redownload=True)

# Alternatively, if you downloaded the weights manually, set source='local' and point local_path to your directory.

# chat.load_models(source='local', local_path='YOUR LOCAL PATH')

Batch Inference with ChatTTS

texts = ["So we found being competitive and collaborative was a huge way of staying motivated towards our goals, so one person to call when you fall off, one person who gets you back on then one person to actually do the activity with.",]*3 \
       + ["我觉得像我们这些写程序的人,他,我觉得多多少少可能会对开源有一种情怀在吧我觉得开源是一个很好的形式。现在其实最先进的技术掌握在一些公司的手里的话,就他们并不会轻易的开放给所有的人用。"]*3


wavs = chat.infer(texts)

The model performs batch inference when you pass it a list of texts. The Audio class from IPython.display plays the generated audio.

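# Play the first English and the first Chinese result.
# In a notebook, run each Audio(...) line in its own cell; only a cell's last expression is displayed.
Audio(wavs[0], rate=24_000, autoplay=True)
Audio(wavs[3], rate=24_000, autoplay=True)

# Generate speech with the custom parameters defined earlier.
wav = chat.infer('四川美食可多了,有麻辣火锅、宫保鸡丁、麻婆豆腐、担担面、回锅肉、夫妻肺片等,每样都让人垂涎三尺。',
   params_refine_text=params_refine_text, params_infer_code=params_infer_code)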

The example above passes params_refine_text and params_infer_code to chat.infer, which is how the parameters for speed, variability, and specific speech characteristics are applied.

Audio(wav[0], rate=24_000, autoplay=True)
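Because batch inference takes a list of texts, one way to handle a long script is to split it into sentence-based chunks first. The sketch below is only an illustration: `split_into_batches` is a hypothetical helper, the 200-character limit is an arbitrary assumption, and the naive punctuation-based splitting will also break on abbreviations and decimals.

```python
import re


def split_into_batches(script: str, max_chars: int = 200) -> list[str]:
    """Hypothetical helper: group sentences into chunks of roughly `max_chars` characters."""
    # Split after sentence-ending punctuation (Latin and CJK), keeping the punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?。!?])\s*", script) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


batches = split_into_batches("One. Two! Three?", max_chars=10)
print(batches)  # ['One. Two!', 'Three?']

# With a loaded model, the chunks could then be passed as a batch:
# wavs = chat.infer(batches)
```

A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence.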

Using Random Speakers

Speaker sampling is another useful customization feature. The sample_random_speaker method returns a random speaker embedding, which you pass to chat.infer through the 'spk_emb' parameter to generate audio in that voice.

You can listen to the generated audio using an ipynb file or save it as a .wav file using a third-party library. 

rand_spk = chat.sample_random_speaker()
params_infer_code = {'spk_emb' : rand_spk, }


wav = chat.infer('四川美食确实以辣闻名,但也有不辣的选择。比如甜水面、赖汤圆、蛋烘糕、叶儿粑等,这些小吃口味温和,甜而不腻,也很受欢迎。', \
   params_refine_text=params_refine_text, params_infer_code=params_infer_code)
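If you would rather not install a third-party library, the standard-library `wave` module can also write the result to a .wav file. The sketch below assumes the audio is a mono sequence of floats in the range -1.0 to 1.0 at 24 kHz (the rate used with `Audio` in this article); `save_wav` is a hypothetical helper, and a synthetic tone stands in for model output so the example runs on its own.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(samples, path: str, sample_rate: int = 24_000) -> float:
    """Hypothetical helper: write mono float samples as 16-bit PCM; return duration in seconds."""
    frames = bytearray()
    count = 0
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against values outside [-1, 1]
        frames += struct.pack("<h", int(clipped * 32767))
        count += 1
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))
    return count / sample_rate


# Stand-in for model output: one second of a 440 Hz tone.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 24_000) for n in range(24_000)]
out_path = os.path.join(tempfile.gettempdir(), "chattts_demo.wav")
duration = save_wav(tone, out_path)
print(f"{duration:.2f} s written to {out_path}")

# Assuming wav[0] is a NumPy array of floats, model output could be saved as:
# save_wav(wav[0].flatten(), "output.wav")
```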

How to Run Two-stage Control with ChatTTS

Two-stage control lets you perform text refinement and audio generation separately, using the 'refine_text_only' and 'skip_refine_text' parameters.

The first stage, shown in the code block below, refines the text only and then generates audio from the refined result:

text = "So we found being competitive and collaborative was a huge way of staying motivated towards our goals, so one person to call when you fall off, one person who gets you back on then one person to actually do the activity with."

refined_text = chat.infer(text, refine_text_only=True)
refined_text
wav = chat.infer(refined_text)
Audio(wav[0], rate=24_000, autoplay=True)

In the second stage, you skip refinement and supply text that already contains the tokens marking breaks and pauses for audio generation.

text = 'so we found being competitive and collaborative [uv_break] was a huge way of staying [uv_break] motivated towards our goals, [uv_break] so [uv_break] one person to call [uv_break] when you fall off, [uv_break] one person who [uv_break] gets you back [uv_break] on then [uv_break] one person [uv_break] to actually do the activity with.'
wav = chat.infer(text, skip_refine_text=True)
Audio(wav[0], rate=24_000, autoplay=True)
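Between the two stages, it can help to inspect which control tokens the refined text contains before generating audio. The snippet below is a rough sketch: `split_tokens` is a hypothetical helper, and the regular expression assumes tokens look like `[uv_break]` or `[laugh_0]`, as in this article's examples.

```python
import re

# Assumed token shape: lowercase letters/underscores with an optional trailing number.
TOKEN_PATTERN = re.compile(r"\[[a-z_]+\d*\]")


def split_tokens(refined: str) -> tuple[list[str], str]:
    """Hypothetical helper: return (control tokens found, text with tokens removed)."""
    tokens = TOKEN_PATTERN.findall(refined)
    # Replace tokens with a space, then collapse repeated whitespace.
    plain = re.sub(r"\s+", " ", TOKEN_PATTERN.sub(" ", refined)).strip()
    return tokens, plain


tokens, plain = split_tokens("so we found [uv_break] a way, [laugh] right [break_6]")
print(tokens)  # ['[uv_break]', '[laugh]', '[break_6]']
print(plain)   # so we found a way, right
```

If the refinement stage returns a list of strings rather than a single string, apply the helper to each item.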

Integrating ChatTTS with LLMs

Integrating ChatTTS with LLMs means it can take an LLM's answer to a user's question, refine the text, and generate audio from it. The steps below break down this process.

Importing Necessary Module

from ChatTTS.experimental.llm import llm_api

This statement imports 'llm_api', which is used to create the API client. This example uses DeepSeek as the LLM provider. To get an API key, go to the DeepSeek API page, choose the 'Access API' option, sign up for an account, and create a new key.

Creating API Client

API_KEY = ''
client = llm_api(api_key=API_KEY,
       base_url="https://api.deepseek.com",
       model="deepseek-chat")


user_question = '四川有哪些好吃的美食呢?'
text = client.call(user_question, prompt_version = 'deepseek')
print(text)
text = client.call(text, prompt_version = 'deepseek_TN')
print(text)

You can then generate audio from the text the LLM produced:

params_infer_code = {'spk_emb' : rand_spk, 'temperature':.3}
wav = chat.infer(text, params_infer_code=params_infer_code)
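The steps above can be wrapped in one function so the LLM and the speech model are easy to swap or test. This is a sketch under assumptions: `question_to_speech` is a hypothetical helper, the second LLM call is presumed to normalize the answer for speech (based on the 'deepseek_TN' prompt name), and simple stand-in functions replace the real client and model so the example runs without an API key.

```python
from typing import Callable, Sequence


def question_to_speech(
    question: str,
    ask_llm: Callable[[str, str], str],
    synthesize: Callable[[str], Sequence],
) -> tuple[str, Sequence]:
    """Hypothetical helper: answer a question, prepare the answer for speech, synthesize it."""
    answer = ask_llm(question, "deepseek")        # first call: get the answer
    normalized = ask_llm(answer, "deepseek_TN")   # second call: presumably text normalization
    return normalized, synthesize(normalized)


# Stand-ins so the sketch runs without network access or model weights.
def fake_llm(text: str, prompt_version: str) -> str:
    return f"{prompt_version}:{text}"


def fake_tts(text: str) -> list:
    return [[0.0] * len(text)]  # one silent "waveform" per input


spoken_text, wav_out = question_to_speech("hi", fake_llm, fake_tts)
print(spoken_text)

# With the real objects from this article, the wiring might look like:
# question_to_speech(
#     user_question,
#     lambda t, v: client.call(t, prompt_version=v),
#     lambda t: chat.infer(t, params_infer_code=params_infer_code),
# )
```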

Application of ChatTTS

A tool that converts text into natural-sounding audio is valuable today. The rise of AI chatbots and virtual assistants, and the adoption of automated voices across many industries, make ChatTTS highly relevant. Here are some real-life applications of this model.

  • Creating audio versions of text-based content: Whether for research papers or academic articles, ChatTTS can efficiently convert text content into audio. This alternative way of consuming material can support a more direct form of learning.
  • Speech generation for virtual assistants and chatbots: Virtual assistants and chatbots have become very popular, helped by the spread of automated systems. ChatTTS can generate speech from the text these assistants produce.
  • Exploring text-to-speech technology: There are different ways to explore this model, some of which the ChatTTS community is already pursuing. One important application is studying the model's speech synthesis for research purposes.

Conclusion

ChatTTS marks a significant step forward in AI voice generation, producing natural and smooth conversations in both English and Chinese. The best part of this model is its controllability, which lets users customize the output and bring expressiveness to the speech. As the ChatTTS community continues to develop and refine this model, its potential for advancing text-to-speech technology is considerable.

Key Takeaways

  • ChatTTS excels in generating natural and expressive voice dialogues.
  • The model allows for precise control over speech patterns and characteristics.
  • ChatTTS supports seamless integration with large language models for improved functionality.
  • The model includes mechanisms to ensure responsible and secure use of text-to-speech technology.
  • Ongoing community contributions and future enhancements promise continued advancement and versatility.
  • The team behind this open-source model prioritizes safety and ethical considerations. Measures such as added high-frequency noise and compressed audio quality help limit misuse.
  • Customization features let users fine-tune the output with parameters that introduce pauses, laughter, and other oral characteristics in the speech.

Frequently Asked Questions

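Q1. How can developers integrate this model into their applications?

A. Developers can integrate ChatTTS into their applications through its Python API, for example by creating a ChatTTS.Chat instance and calling chat.infer as shown in this article.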

Q2. What languages does ChatTTS support for text-to-speech conversion?

A. With over 100,000 hours of data training, this model can efficiently perform tasks of voice generation in English and Chinese. 

Q3. Is ChatTTS suitable for commercial use?

A. No, ChatTTS is intended for research and academic applications only. It should not be used for commercial or legal purposes. The model’s development includes ethical considerations to ensure safe and responsible use.

Q4. What can ChatTTS be used for?

A. This model is valuable in various applications. One of its most prominent uses is as a conversational tool for large language model assistants. ChatTTS can also generate dialogue speech for video introductions, educational training, and other applications that require text-to-speech content.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Hey there! I'm David Maigari, a dynamic professional with a passion for technical writing, web development, and the AI world. I'm also an enthusiast of data science and AI innovations.
