Step 4 TTS Workflow
Notebook 4: TTS Workflow
Our podcast transcript is now ready, so we can generate the audio for the podcast.
In this notebook, we will first learn how to generate audio using both the suno/bark and parler-tts/parler-tts-mini-v1 models.
After that, we will use the output from Notebook 3 to generate our complete podcast.
Note: Please feel free to extend this notebook with newer models. The above two were chosen after some tests using a sample prompt.
⚠️ Warning: This notebook works best with transformers version 4.43.3 or earlier, so we will downgrade our environment to make sure things run smoothly.
Credit: This Colab was used for starter code
We can install these packages for speedups
Let's import the necessary frameworks
Flash attention 2 is not installed
Testing the Audio Generation
Let's try generating audio with both models to understand how they work.
Note the subtle differences in prompting:
- Parler: Takes in a description prompt that can be used to set the speaker profile and generation speeds
- Suno: Takes in expression words like [sigh], [laughs] etc.
You can find more notes on the experiments that were run for this notebook in the TTS_Notes.md file.
Please set device = "cuda" below if you're using a single GPU node.
Parler Model
Let's try using the Parler Model first and generate a short segment with speaker Laura's voice
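As a rough sketch, a Parler generation call might look like the following. It follows the usage pattern documented for the parler-tts package; the function name generate_parler_audio and the wording of LAURA_DESCRIPTION are assumptions made for illustration, not the notebook's exact code. The heavy imports sit inside the function so the definition itself loads without downloading anything.

```python
def generate_parler_audio(prompt, description,
                          model_name="parler-tts/parler-tts-mini-v1",
                          device="cpu"):
    """Return (audio_array, sampling_rate) for `prompt` spoken as described."""
    # Imported lazily: both packages are heavy and pull in torch.
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer

    model = ParlerTTSForConditionalGeneration.from_pretrained(model_name).to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # The description conditions the voice; the prompt is the text to speak.
    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    generation = model.generate(input_ids=input_ids,
                                prompt_input_ids=prompt_input_ids)
    audio_arr = generation.cpu().numpy().squeeze()
    return audio_arr, model.config.sampling_rate


# Hypothetical speaker description; the exact wording is an assumption.
LAURA_DESCRIPTION = (
    "Laura's voice is expressive and clear, speaking at a moderately fast "
    "pace with very little background noise."
)
```

Calling generate_parler_audio("Hello there!", LAURA_DESCRIPTION, device="cuda") would then return the waveform and its sampling rate, assuming the model weights can be downloaded.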
Bark Model
Amazing, let's try the same with Bark now:
- We will set the voice_preset to our favorite speaker
- This time we can include expression prompts inside our generation prompt
- Note you can CAPITALISE words to make the model emphasise them
- You can add hyphens to make the model pause on certain words
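The prompting tricks above can be sketched as follows. The Bark calls follow the Hugging Face Transformers usage pattern for BarkModel; the helper names emphasise and generate_bark_audio, the voice preset "v2/en_speaker_6", and the sample sentence are assumptions chosen for illustration and may differ from what this notebook actually uses.

```python
import re


def emphasise(text, words):
    """Upper-case whole-word matches of `words` so Bark stresses them."""
    for word in words:
        pattern = rf"\b{re.escape(word)}\b"
        text = re.sub(pattern, word.upper(), text, flags=re.IGNORECASE)
    return text


def generate_bark_audio(text, voice_preset="v2/en_speaker_6",
                        model_name="suno/bark", device="cpu"):
    """Return (audio_array, sampling_rate) for `text` in the given voice."""
    # Imported lazily so the helpers above work without transformers.
    from transformers import AutoProcessor, BarkModel

    processor = AutoProcessor.from_pretrained(model_name)
    model = BarkModel.from_pretrained(model_name).to(device)

    inputs = processor(text, voice_preset=voice_preset).to(device)
    speech = model.generate(**inputs)
    return speech.cpu().numpy().squeeze(), model.generation_config.sample_rate


# Expression tag up front, hyphens for pauses, capitals for emphasis.
prompt = emphasise("[sigh] I am not sure this is - really - a good idea.",
                   ["really"])
print(prompt)
```

The prompt string is plain text, so it is easy to experiment with different tags and emphasis words before spending GPU time on generation.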
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
Bringing it together: Making the Podcast
Okay, now that we understand how both models work, we can use the complete pipeline to generate the entire podcast.
Let's load in our pickle file from earlier and proceed:
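A minimal sketch of the loading step, assuming the previous notebook pickled the transcript as a single string. The helper name load_transcript and the file name demo_transcript.pkl are hypothetical; the demo writes a throwaway file first so the round trip can be run anywhere.

```python
import os
import pickle
import tempfile


def load_transcript(path):
    """Read a pickled transcript from `path`."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Round-trip demo with a temporary file standing in for the real one.
sample = '[("Speaker 1", "Hello!"), ("Speaker 2", "Hi there.")]'
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_transcript.pkl")
    with open(demo_path, "wb") as fh:
        pickle.dump(sample, fh)
    loaded = load_transcript(demo_path)

print(type(loaded).__name__)
```

Only unpickle files you created yourself or otherwise trust, since unpickling can execute arbitrary code.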
Let's load the Bark model and set its hyperparameters for the discussion
/home/sanyambhutani/.conda/envs/final-checking-meta/lib/python3.11/site-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
WeightNorm.apply(module, name, dim)
/home/sanyambhutani/.conda/envs/final-checking-meta/lib/python3.11/site-packages/transformers/models/encodec/modeling_encodec.py:120: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
self.register_buffer("padding_total", torch.tensor(kernel_size - stride, dtype=torch.int64), persistent=False)
Now for the Parler model:
We will collect the generated audio segments together with their respective sampling rates, since we need both to assemble the final audio
Function to generate audio for Speaker 1
Function to generate audio for Speaker 2
Helper function to convert the numpy output from the models into audio
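One way such a conversion can work is sketched below. Note that this swaps in Python's standard-library wave module rather than whatever audio library the notebook's helper uses, and the name floats_to_wav_bytes is hypothetical. It assumes mono float samples in the range [-1.0, 1.0]; a 1-D numpy array can be passed in directly because the function only iterates over the samples.

```python
import array
import io
import sys
import wave


def floats_to_wav_bytes(samples, sampling_rate):
    """Encode mono float samples in [-1.0, 1.0] as 16-bit PCM WAV bytes."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = array.array(
        "h",
        (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples),
    )
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 2 bytes = 16 bits per sample
        wav.setframerate(sampling_rate)
        wav.writeframes(pcm.tobytes())
    return buffer.getvalue()


# Four samples, one of them out of range to exercise the clipping.
wav_bytes = floats_to_wav_bytes([0.0, 0.5, -0.5, 1.5], 24000)
```

Keeping the sampling rate alongside each segment matters here: the two models may not produce audio at the same rate, and a WAV header written with the wrong rate plays back at the wrong speed.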
'[\n ("Speaker 1", "Welcome to this week\'s episode of AI Insights, where we explore the latest developments in the field of artificial intelligence. Today, we\'re going to dive into the fascinating world of knowledge distillation, a methodology that transfers advanced capabilities from leading proprietary Large Language Models, or LLMs, to their open-source counterparts. Joining me on this journey is my co-host, who\'s new to the topic, and I\'ll be guiding them through the ins and outs of knowledge distillation. So, let\'s get started!"),\n ("Speaker 2", "Sounds exciting! I\'ve heard of knowledge distillation, but I\'m not entirely sure what it\'s all about. Can you give me a brief overview?"),\n ("Speaker 1", "Of course! Knowledge distillation is a technique that enables the transfer of knowledge from a large, complex model, like GPT-4 or Gemini, to a smaller, more efficient model, like LLaMA or Mistral. This process allows the smaller model to learn from the teacher model\'s output, enabling it to acquire similar capabilities. Think of it like a master chef teaching their apprentice the art of cooking – the apprentice doesn\'t need to start from scratch."),\n ("Speaker 2", "Hmm, that sounds interesting. So, it\'s like a teacher-student relationship, where the teacher model guides the student model to learn from its output... Umm, can you explain this process in more detail?"),\n ("Speaker 1", "The distillation process involves several stages, including knowledge elicitation, knowledge storage, knowledge inference, and knowledge application. The teacher model shares its knowledge with the student model, which then learns to emulate the teacher\'s output behavior."),\n ("Speaker 2", "That makes sense, I think. So, it\'s like the teacher model is saying, \'Hey, student model, learn from my output, and try to produce similar results.\' But what about the different approaches to knowledge distillation? 
I\'ve heard of supervised fine-tuning, divergence and similarity, reinforcement learning, and rank optimization."),\n ("Speaker 1", "Ah, yes! Those are all valid approaches to knowledge distillation. Supervised fine-tuning involves training the student model on a smaller dataset, while divergence and similarity focus on aligning the hidden states or features of the student model with those of the teacher model. Reinforcement learning and rank optimization are more advanced methods that involve feedback from the teacher model to train the student model. Imagine you\'re trying to tune a piano – you need to adjust the keys to produce the perfect sound."),\n ("Speaker 2", "[laughs] Okay, I think I\'m starting to get it. But can you give me some examples of how these approaches are used in real-world applications? I\'m thinking of something like a language model that can generate human-like text..."),\n ("Speaker 1", "Of course! For instance, the Vicuna model uses supervised fine-tuning to distill knowledge from the teacher model, while the UltraChat model employs a combination of knowledge distillation and reinforcement learning to create a powerful chat model."),\n ("Speaker 2", "Wow, that\'s fascinating! I\'m starting to see how knowledge distillation can be applied to various domains, like natural language processing, computer vision, and even multimodal tasks... Umm, can we talk more about multimodal tasks? That sounds really interesting."),\n ("Speaker 1", "Exactly! Knowledge distillation has far-reaching implications for AI research and applications. It enables the transfer of knowledge across different models, architectures, and domains, making it a powerful tool for building more efficient and effective AI systems."),\n ("Speaker 2", "[sigh] I\'m starting to see the bigger picture now. 
Knowledge distillation is not just a technique; it\'s a way to democratize access to advanced AI capabilities and foster innovation across a broader spectrum of applications and users... Hmm, that\'s a pretty big deal."),\n ("Speaker 1", "That\'s right! And as we continue to explore the frontiers of AI, knowledge distillation will play an increasingly important role in shaping the future of artificial intelligence."),\n ("Speaker 2", "Well, I\'m excited to learn more about knowledge distillation and its applications. Thanks for guiding me through this journey, and I\'m looking forward to our next episode!"),\n ("Speaker 1", "Thank you for joining me on this episode of AI Insights! If you want to learn more about knowledge distillation and its applications, be sure to check out our resources section, where we\'ve curated a list of papers, articles, and tutorials to help you get started."),\n ("Speaker 2", "And if you\'re interested in building your own AI model using knowledge distillation, maybe we can even do a follow-up episode on how to get started... Umm, let\'s discuss that further next time."),\n]'
Most of the time we argue that data structures aren't very useful in practice. This time, however, that knowledge comes in handy.
We will take the string from the pickle file and load it in as a list of (speaker, text) tuples with the help of ast.literal_eval()
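A small sketch of that parsing step, using a shortened, made-up transcript string in place of the real one. ast.literal_eval() only evaluates Python literals (strings, tuples, lists and so on), which makes it a safer choice than eval() for model-generated text.

```python
import ast

# Shortened stand-in for the transcript string (contents are illustrative).
raw = (
    '[\n ("Speaker 1", "Welcome to the show!"),'
    '\n ("Speaker 2", "[laughs] Glad to be here."),\n]'
)

segments = ast.literal_eval(raw)

for speaker, text in segments:
    print(f"{speaker}: {text}")
```

If the string is not a valid Python literal, ast.literal_eval() raises an exception (typically ValueError or SyntaxError), which is a useful early signal that the transcript from the previous notebook needs another pass.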
[('Speaker 1',
  "Welcome to this week's episode of AI Insights, where we explore the latest developments in the field of artificial intelligence. Today, we're going to dive into the fascinating world of knowledge distillation, a methodology that transfers advanced capabilities from leading proprietary Large Language Models, or LLMs, to their open-source counterparts. Joining me on this journey is my co-host, who's new to the topic, and I'll be guiding them through the ins and outs of knowledge distillation. So, let's get started!"),
 ('Speaker 2',
  "Sounds exciting! I've heard of knowledge distillation, but I'm not entirely sure what it's all about. Can you give me a brief overview?"),
 ('Speaker 1',
  "Of course! Knowledge distillation is a technique that enables the transfer of knowledge from a large, complex model, like GPT-4 or Gemini, to a smaller, more efficient model, like LLaMA or Mistral. This process allows the smaller model to learn from the teacher model's output, enabling it to acquire similar capabilities. Think of it like a master chef teaching their apprentice the art of cooking – the apprentice doesn't need to start from scratch."),
 ('Speaker 2',
  "Hmm, that sounds interesting. So, it's like a teacher-student relationship, where the teacher model guides the student model to learn from its output... Umm, can you explain this process in more detail?"),
 ('Speaker 1',
  "The distillation process involves several stages, including knowledge elicitation, knowledge storage, knowledge inference, and knowledge application. The teacher model shares its knowledge with the student model, which then learns to emulate the teacher's output behavior."),
 ('Speaker 2',
  "That makes sense, I think. So, it's like the teacher model is saying, 'Hey, student model, learn from my output, and try to produce similar results.' But what about the different approaches to knowledge distillation? I've heard of supervised fine-tuning, divergence and similarity, reinforcement learning, and rank optimization."),
 ('Speaker 1',
  "Ah, yes! Those are all valid approaches to knowledge distillation. Supervised fine-tuning involves training the student model on a smaller dataset, while divergence and similarity focus on aligning the hidden states or features of the student model with those of the teacher model. Reinforcement learning and rank optimization are more advanced methods that involve feedback from the teacher model to train the student model. Imagine you're trying to tune a piano – you need to adjust the keys to produce the perfect sound."),
 ('Speaker 2',
  "[laughs] Okay, I think I'm starting to get it. But can you give me some examples of how these approaches are used in real-world applications? I'm thinking of something like a language model that can generate human-like text..."),
 ('Speaker 1',
  'Of course! For instance, the Vicuna model uses supervised fine-tuning to distill knowledge from the teacher model, while the UltraChat model employs a combination of knowledge distillation and reinforcement learning to create a powerful chat model.'),
 ('Speaker 2',
  "Wow, that's fascinating! I'm starting to see how knowledge distillation can be applied to various domains, like natural language processing, computer vision, and even multimodal tasks... Umm, can we talk more about multimodal tasks? That sounds really interesting."),
 ('Speaker 1',
  'Exactly! Knowledge distillation has far-reaching implications for AI research and applications. It enables the transfer of knowledge across different models, architectures, and domains, making it a powerful tool for building more efficient and effective AI systems.'),
 ('Speaker 2',
  "[sigh] I'm starting to see the bigger picture now. Knowledge distillation is not just a technique; it's a way to democratize access to advanced AI capabilities and foster innovation across a broader spectrum of applications and users... Hmm, that's a pretty big deal."),
 ('Speaker 1',
  "That's right! And as we continue to explore the frontiers of AI, knowledge distillation will play an increasingly important role in shaping the future of artificial intelligence."),
 ('Speaker 2',
  "Well, I'm excited to learn more about knowledge distillation and its applications. Thanks for guiding me through this journey, and I'm looking forward to our next episode!"),
 ('Speaker 1',
  "Thank you for joining me on this episode of AI Insights! If you want to learn more about knowledge distillation and its applications, be sure to check out our resources section, where we've curated a list of papers, articles, and tutorials to help you get started."),
 ('Speaker 2',
  "And if you're interested in building your own AI model using knowledge distillation, maybe we can even do a follow-up episode on how to get started... Umm, let's discuss that further next time.")]
Generating the Final Podcast
Finally, we can loop over the list of tuples and use our helper functions to generate the audio
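The loop can be sketched roughly like this. The name build_podcast and the two stub generators are hypothetical stand-ins: in the real pipeline the two callables would be the per-speaker generation functions defined earlier, each assumed here to return an (audio, sampling_rate) pair.

```python
def build_podcast(segments, speaker1_fn, speaker2_fn):
    """Route each (speaker, text) pair to the matching generator.

    Returns two parallel lists: the audio segments and their sampling rates.
    """
    audio_segments, sampling_rates = [], []
    for speaker, text in segments:
        generate = speaker1_fn if speaker == "Speaker 1" else speaker2_fn
        audio, rate = generate(text)
        audio_segments.append(audio)
        sampling_rates.append(rate)
    return audio_segments, sampling_rates


# Stub generators so the sketch runs without any TTS model loaded.
def fake_speaker1(text):
    return [0.0] * len(text), 44100


def fake_speaker2(text):
    return [0.1] * len(text), 24000


demo = [("Speaker 1", "Hi"), ("Speaker 2", "Hey!")]
audio_segments, sampling_rates = build_podcast(demo, fake_speaker1, fake_speaker2)
```

Passing the generators in as arguments keeps the loop independent of any particular model, which also makes it easier to extend the workflow beyond two speakers later.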
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
Generating podcast segments: 100%|█████████████████████████████████████████████████████████████| 16/16 [05:13<00:00, 19.57s/segment]
Output the Podcast
We can now save this as an MP3 file
<_io.BufferedRandom name='_podcast.mp3'>
Suggested Next Steps:
- Experiment with the prompts: Please feel free to experiment with the SYSTEM_PROMPT in the notebooks
- Extend workflow beyond two speakers
- Test other TTS Models
- Experiment with Speech Enhancer models as a step 5.