pretrained.tacotron2
Defines a pre-trained Tacotron2 model.
This combines a Tacotron2 model with a HiFiGAN vocoder to produce an end-to-end TTS model, adapted to be fine-tunable.
from pretrained.tacotron2 import pretrained_tacotron2_tts
tts = pretrained_tacotron2_tts()
audio, states = tts.generate("Hello, world!")
write_audio([audio])
You can also interact with this model directly through the command line:
python -m pretrained.tacotron2 'Hello, world!'
The two parts of the model can be trained separately, including using LoRA fine-tuning.
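For example, LoRA adapters can be injected when loading the Tacotron half (a minimal sketch; the rank and alpha values below are illustrative, not recommendations):
from pretrained.tacotron2 import pretrained_tacotron2

# Load the pretrained weights with low-rank adapters added, so fine-tuning
# only needs to update the LoRA weights.
tacotron = pretrained_tacotron2(pretrained=True, lora_rank=8, lora_alpha=16.0)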
Using this model requires the following additional dependencies:
- inflect
- ftfy
Additionally, to generate STFTs for training the model, you will need to install librosa. If you want to play audio for the demo, you should also install sounddevice.
- class pretrained.tacotron2.LinearNorm(in_dim: int, out_dim: int, bias: bool = True, w_init_gain: str = 'linear', lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0)[source]
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(x: Tensor) Tensor [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class pretrained.tacotron2.ConvNorm(in_channels: int, out_channels: int, kernel_size: int = 1, stride: int = 1, padding: int | None = None, dilation: int = 1, bias: bool = True, w_init_gain: str = 'linear', lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0)[source]
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(signal: Tensor) Tensor [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class pretrained.tacotron2.LocationLayer(attention_n_filters: int, attention_kernel_size: int, attention_dim: int, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0)[source]
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(attention_weights_cat: Tensor) Tensor [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class pretrained.tacotron2.Attention(attention_rnn_dim: int, embedding_dim: int, attention_dim: int, attention_location_n_filters: int, attention_location_kernel_size: int, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0)[source]
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- get_alignment_energies(query: Tensor, processed_memory: Tensor, attention_weights_cat: Tensor) Tensor [source]
- forward(attn_hid_state: Tensor, memory: Tensor, proc_memory: Tensor, attn_weights_cat: Tensor, mask: Tensor | None) tuple[torch.Tensor, torch.Tensor] [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class pretrained.tacotron2.Prenet(in_dim: int = 80, sizes: list[int] = [256, 256], dropout: float = 0.5, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0, dropout_always_on: bool = True)[source]
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(x: Tensor) Tensor [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class pretrained.tacotron2.PostnetConfig(n_mel_channels: int = 80, emb_dim: int = 512, kernel_size: int = 5, n_convolutions: int = 5, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0)[source]
Bases: object
- n_mel_channels: int = 80
- emb_dim: int = 512
- kernel_size: int = 5
- n_convolutions: int = 5
- lora_rank: int | None = None
- lora_alpha: float = 1.0
- lora_dropout: float = 0.0
- class pretrained.tacotron2.Postnet(config: PostnetConfig)[source]
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(x: Tensor) Tensor [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class pretrained.tacotron2.EncoderConfig(emb_dim: int = 512, kernel_size: int = 5, n_convolutions: int = 3, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0, freeze_bn: bool = False, speaker_emb_dim: int | None = None)[source]
Bases: object
- emb_dim: int = 512
- kernel_size: int = 5
- n_convolutions: int = 3
- lora_rank: int | None = None
- lora_alpha: float = 1.0
- lora_dropout: float = 0.0
- freeze_bn: bool = False
- speaker_emb_dim: int | None = None
- class pretrained.tacotron2.Encoder(config: EncoderConfig)[source]
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(x: Tensor, input_lengths: Tensor, speaker_emb: Tensor | None = None) Tensor [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class pretrained.tacotron2.DecoderConfig(n_mel_channels: int = 80, n_frames_per_step: int = 1, encoder_emb_dim: int = 512, attention_dim: int = 128, attention_location_n_filters: int = 32, attention_location_kernel_size: int = 31, attention_rnn_dim: int = 1024, decoder_rnn_dim: int = 1024, prenet_dim: int = 256, prenet_dropout: bool = 0.5, max_decoder_steps: int = 1000, gate_threshold: float = 0.5, p_attention_dropout: float = 0.1, p_decoder_dropout: float = 0.1, prenet_dropout_always_on: bool = True, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0)[source]
Bases: object
- n_mel_channels: int = 80
- n_frames_per_step: int = 1
- encoder_emb_dim: int = 512
- attention_dim: int = 128
- attention_location_n_filters: int = 32
- attention_location_kernel_size: int = 31
- attention_rnn_dim: int = 1024
- decoder_rnn_dim: int = 1024
- prenet_dim: int = 256
- prenet_dropout: bool = 0.5
- max_decoder_steps: int = 1000
- gate_threshold: float = 0.5
- p_attention_dropout: float = 0.1
- p_decoder_dropout: float = 0.1
- prenet_dropout_always_on: bool = True
- lora_rank: int | None = None
- lora_alpha: float = 1.0
- lora_dropout: float = 0.0
- class pretrained.tacotron2.DecoderStates(attn_h, attn_c, dec_h, dec_c, attn_weights, attn_weights_cum, attn_ctx, memory, processed_memory, mask)[source]
Bases: NamedTuple
Create new instance of DecoderStates(attn_h, attn_c, dec_h, dec_c, attn_weights, attn_weights_cum, attn_ctx, memory, processed_memory, mask)
- attn_h: Tensor
Alias for field number 0
- attn_c: Tensor
Alias for field number 1
- dec_h: Tensor
Alias for field number 2
- dec_c: Tensor
Alias for field number 3
- attn_weights: Tensor
Alias for field number 4
- attn_weights_cum: Tensor
Alias for field number 5
- attn_ctx: Tensor
Alias for field number 6
- memory: Tensor
Alias for field number 7
- processed_memory: Tensor
Alias for field number 8
- mask: Tensor | None
Alias for field number 9
- class pretrained.tacotron2.Decoder(config: DecoderConfig)[source]
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- initialize_decoder_states(memory: Tensor, mask: Tensor | None) DecoderStates [source]
- parse_decoder_outputs(mel_outputs: list[torch.Tensor], gate_outputs: list[torch.Tensor], alignments: list[torch.Tensor], states: DecoderStates) tuple[torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates] [source]
- decode(decoder_input: Tensor, states: DecoderStates) tuple[torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates] [source]
- forward(memory: Tensor, dec_ins: Tensor, memory_lengths: Tensor, states: DecoderStates | None = None) tuple[torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates] [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- infer(memory: Tensor, memory_lengths: Tensor, states: DecoderStates | None = None) tuple[torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates] [source]
- pretrained.tacotron2.window_sumsquare(window: str | float, n_frames: int, hop_length: int = 200, win_length: int = 800, n_fft: int = 800, dtype: type = <class 'numpy.float32'>, norm: float | None = None) ndarray [source]
- pretrained.tacotron2.griffin_lim(magnitudes: Tensor, stft_fn: STFT, n_iters: int = 30) Tensor [source]
- pretrained.tacotron2.dynamic_range_compression(x: Tensor, c: int | float = 1, clip_val: float = 1e-05) Tensor [source]
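A hedged sketch of vocoder-free reconstruction with griffin_lim; it assumes the STFT module's forward pass yields the magnitude spectrogram that griffin_lim expects, which the signatures above do not spell out:
import torch

from pretrained.tacotron2 import STFT, griffin_lim

stft = STFT(filter_length=800, hop_length=200, win_length=800, window="hann")
waveform = torch.randn(1, 16000)  # placeholder audio in place of a real clip

# Assumption: forward() returns magnitudes compatible with griffin_lim; check
# the source before relying on this.
magnitudes = stft(waveform)
audio = griffin_lim(magnitudes, stft_fn=stft, n_iters=30)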
- class pretrained.tacotron2.STFT(filter_length: int = 800, hop_length: int = 200, win_length: int = 800, window: str = 'hann')[source]
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward_basis: Tensor
- inverse_basis: Tensor
- forward(input_data: Tensor) Tensor [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class pretrained.tacotron2.TacotronSTFT(filter_length: int = 1024, hop_length: int = 256, win_length: int = 1024, n_mel_channels: int = 80, sampling_rate: int = 16000, mel_fmin: float = 0.0, mel_fmax: float = 8000.0)[source]
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- mel_basis: Tensor
- class pretrained.tacotron2.TacotronConfig(name: str = '???', mask_padding: bool = False, n_mel_channels: int = 80, n_symbols: int = 148, symbols_emb_dim: int = 512, n_frames_per_step: int = 1, symbols_emb_dropout: float = 0.1, encoder: pretrained.tacotron2.EncoderConfig = <factory>, decoder: pretrained.tacotron2.DecoderConfig = <factory>, postnet: pretrained.tacotron2.PostnetConfig = <factory>)[source]
Bases: BaseModelConfig
- mask_padding: bool = False
- n_mel_channels: int = 80
- n_symbols: int = 148
- symbols_emb_dim: int = 512
- n_frames_per_step: int = 1
- symbols_emb_dropout: float = 0.1
- encoder: EncoderConfig
- decoder: DecoderConfig
- postnet: PostnetConfig
- class pretrained.tacotron2.Tacotron(config: TacotronConfig)[source]
Bases: BaseModel
- parse_output(outputs: tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates], output_lengths: Tensor | None = None) tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates] [source]
- forward(inputs: tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor], states: DecoderStates | None = None, speaker_emb: Tensor | None = None) tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates] [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- infer(inputs: Tensor, input_lengths: Tensor, states: DecoderStates | None = None, speaker_emb: Tensor | None = None) tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates] [source]
- pretrained.tacotron2.pretrained_tacotron2(*, pretrained: bool = True, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0, lora_encoder: bool = True, lora_decoder: bool = True, lora_postnet: bool = True, device: device | None = None, prenet_dropout: bool = True, num_tokens: int | None = None) Tacotron [source]
Loads the pretrained Tacotron2 model.
- Parameters:
pretrained – Whether to load the pretrained weights.
lora_rank – The LoRA rank to use, if LoRA is desired.
lora_alpha – The LoRA alpha to use, if LoRA is desired.
lora_dropout – The LoRA dropout to use, if LoRA is desired.
lora_encoder – Whether to use LoRA in the encoder.
lora_decoder – Whether to use LoRA in the decoder.
lora_postnet – Whether to use LoRA in the postnet.
device – The device to load the weights onto.
prenet_dropout – Whether to always apply dropout in the PreNet.
num_tokens – The number of tokens in the vocabulary.
- Returns:
The pretrained Tacotron model.
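A hedged sketch of a fine-tuning setup built on these parameters, applying LoRA to the decoder only; the name-based filter assumes the injected adapter parameters contain "lora" in their names, which is a common convention rather than something documented here:
import torch

from pretrained.tacotron2 import pretrained_tacotron2

model = pretrained_tacotron2(
    pretrained=True,
    lora_rank=4,         # illustrative rank
    lora_encoder=False,  # leave the encoder untouched
    lora_decoder=True,   # inject adapters into the decoder
    lora_postnet=False,  # leave the postnet untouched
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

# Assumption: adapter parameter names contain "lora"; verify against the model
# before relying on this filter.
lora_params = [p for n, p in model.named_parameters() if "lora" in n.lower()]
optimizer = torch.optim.AdamW(lora_params, lr=1e-4)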
- pretrained.tacotron2.tacotron_stft(filter_length: int = 1024, hop_length: int = 256, win_length: int = 1024, n_mel_channels: int = 80, sampling_rate: int = 16000, mel_fmin: float = 0.0, mel_fmax: float = 8000.0) TacotronSTFT [source]
Returns an STFT module for training the Tacotron model.
- Parameters:
filter_length – The length of the filters used for the STFT.
hop_length – The hop length of the STFT.
win_length – The window length of the STFT.
n_mel_channels – The number of mel channels.
sampling_rate – The sampling rate of the audio.
mel_fmin – The minimum frequency of the mel filterbank.
mel_fmax – The maximum frequency of the mel filterbank.
- Returns:
The STFT module.
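A short sketch of constructing this module with the pretrained model's defaults when preparing training targets; how waveforms are then converted to mel spectrograms (for example via a mel_spectrogram method, as in the reference NVIDIA implementation) is an assumption to verify against the source:
from pretrained.tacotron2 import tacotron_stft

# Mel front end matching the pretrained model's defaults (16 kHz audio, 80 mel
# channels); requires librosa, as noted at the top of this page.
stft = tacotron_stft(
    filter_length=1024,
    hop_length=256,
    win_length=1024,
    n_mel_channels=80,
    sampling_rate=16000,
)
The returned TacotronSTFT keeps the mel filterbank in its mel_basis attribute.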
- class pretrained.tacotron2.TTS(tacotron: Tacotron, vocoder: HiFiGAN | WaveGlow, *, device: base_device | None = None)[source]
Bases: object
Provides an API for doing text-to-speech.
Note that this module is not an nn.Module, so you can use it in your own module without worrying about accidentally storing all the weights.
- Parameters:
tacotron – The Tacotron model.
vocoder – The vocoder model.
device – The device to load the weights onto.
- generate_mels(text: str | list[str], postnet: bool = True, states: DecoderStates | None = None) tuple[torch.Tensor, pretrained.tacotron2.DecoderStates] [source]
- generate(text: str | list[str], postnet: bool = True, states: DecoderStates | None = None) tuple[torch.Tensor, pretrained.tacotron2.DecoderStates] [source]
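A sketch of end-to-end usage that carries the decoder states between calls and plays the result with sounddevice; the 16 kHz playback rate mirrors the TacotronSTFT default above and is an assumption about the vocoder's output rate:
import sounddevice as sd

from pretrained.tacotron2 import pretrained_tacotron2_tts

tts = pretrained_tacotron2_tts()

# The DecoderStates returned by one call can be fed back in, so the second
# call starts from the first call's decoder and attention state.
audio, states = tts.generate("Hello, world!")
audio, states = tts.generate("Nice to meet you.", states=states)

sd.play(audio.detach().flatten().cpu().numpy(), samplerate=16000)
sd.wait()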