pretrained.tacotron2

Defines a pre-trained Tacotron2 model.

This combines a Tacotron2 model with a HiFiGAN vocoder to produce an end-to-end TTS model, adapted to be fine-tunable.

from pretrained.tacotron2 import pretrained_tacotron2_tts

tts = pretrained_tacotron2_tts()
audio, states = tts.generate("Hello, world!")
write_audio([audio])

You can also interact with this model directly through the command line:

python -m pretrained.tacotron2 'Hello, world!'

The two parts of the model can be trained separately, including using LoRA fine-tuning.
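
For example, the Tacotron half can be loaded with LoRA adapters and only the adapter weights trained. A minimal sketch, assuming the adapter parameters can be picked out by name (the "lora" name filter and the optimizer choice are assumptions, not part of this module's API):

import torch

from pretrained.tacotron2 import pretrained_tacotron2

# Load the acoustic model with rank-8 LoRA adapters.
model = pretrained_tacotron2(lora_rank=8, lora_alpha=16.0, lora_dropout=0.1)

# Hypothetical fine-tuning setup: freeze everything except the LoRA parameters.
for name, param in model.named_parameters():
    param.requires_grad = "lora" in name

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)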

Using this model requires the following additional dependencies:

  • inflect

  • ftfy

Additionally, to generate STFTs for training the model, you will need to install librosa. If you want to play audio for the demo, you should also install sounddevice.

class pretrained.tacotron2.Normalizer[source]

Bases: object

pretrained.tacotron2.text_clean_func(lower: bool = True) Callable[[str], str][source]
pretrained.tacotron2.get_mask_from_lengths(lengths: Tensor) Tensor[source]
class pretrained.tacotron2.LinearNorm(in_dim: int, out_dim: int, bias: bool = True, w_init_gain: str = 'linear', lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0)[source]

Bases: Module

forward(x: Tensor) Tensor[source]

class pretrained.tacotron2.ConvNorm(in_channels: int, out_channels: int, kernel_size: int = 1, stride: int = 1, padding: int | None = None, dilation: int = 1, bias: bool = True, w_init_gain: str = 'linear', lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0)[source]

Bases: Module

forward(signal: Tensor) Tensor[source]

class pretrained.tacotron2.LocationLayer(attention_n_filters: int, attention_kernel_size: int, attention_dim: int, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0)[source]

Bases: Module

forward(attention_weights_cat: Tensor) Tensor[source]

class pretrained.tacotron2.Attention(attention_rnn_dim: int, embedding_dim: int, attention_dim: int, attention_location_n_filters: int, attention_location_kernel_size: int, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0)[source]

Bases: Module

get_alignment_energies(query: Tensor, processed_memory: Tensor, attention_weights_cat: Tensor) Tensor[source]
forward(attn_hid_state: Tensor, memory: Tensor, proc_memory: Tensor, attn_weights_cat: Tensor, mask: Tensor | None) tuple[torch.Tensor, torch.Tensor][source]

class pretrained.tacotron2.Prenet(in_dim: int = 80, sizes: list[int] = [256, 256], dropout: float = 0.5, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0, dropout_always_on: bool = True)[source]

Bases: Module

forward(x: Tensor) Tensor[source]

class pretrained.tacotron2.PostnetConfig(n_mel_channels: int = 80, emb_dim: int = 512, kernel_size: int = 5, n_convolutions: int = 5, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0)[source]

Bases: object

n_mel_channels: int = 80
emb_dim: int = 512
kernel_size: int = 5
n_convolutions: int = 5
lora_rank: int | None = None
lora_alpha: float = 1.0
lora_dropout: float = 0.0
class pretrained.tacotron2.Postnet(config: PostnetConfig)[source]

Bases: Module

forward(x: Tensor) Tensor[source]

class pretrained.tacotron2.EncoderConfig(emb_dim: int = 512, kernel_size: int = 5, n_convolutions: int = 3, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0, freeze_bn: bool = False, speaker_emb_dim: int | None = None)[source]

Bases: object

emb_dim: int = 512
kernel_size: int = 5
n_convolutions: int = 3
lora_rank: int | None = None
lora_alpha: float = 1.0
lora_dropout: float = 0.0
freeze_bn: bool = False
speaker_emb_dim: int | None = None
class pretrained.tacotron2.Encoder(config: EncoderConfig)[source]

Bases: Module

forward(x: Tensor, input_lengths: Tensor, speaker_emb: Tensor | None = None) Tensor[source]

infer(x: Tensor, input_lengths: Tensor, speaker_emb: Tensor | None = None) Tensor[source]
class pretrained.tacotron2.DecoderConfig(n_mel_channels: int = 80, n_frames_per_step: int = 1, encoder_emb_dim: int = 512, attention_dim: int = 128, attention_location_n_filters: int = 32, attention_location_kernel_size: int = 31, attention_rnn_dim: int = 1024, decoder_rnn_dim: int = 1024, prenet_dim: int = 256, prenet_dropout: float = 0.5, max_decoder_steps: int = 1000, gate_threshold: float = 0.5, p_attention_dropout: float = 0.1, p_decoder_dropout: float = 0.1, prenet_dropout_always_on: bool = True, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0)[source]

Bases: object

n_mel_channels: int = 80
n_frames_per_step: int = 1
encoder_emb_dim: int = 512
attention_dim: int = 128
attention_location_n_filters: int = 32
attention_location_kernel_size: int = 31
attention_rnn_dim: int = 1024
decoder_rnn_dim: int = 1024
prenet_dim: int = 256
prenet_dropout: float = 0.5
max_decoder_steps: int = 1000
gate_threshold: float = 0.5
p_attention_dropout: float = 0.1
p_decoder_dropout: float = 0.1
prenet_dropout_always_on: bool = True
lora_rank: int | None = None
lora_alpha: float = 1.0
lora_dropout: float = 0.0
class pretrained.tacotron2.DecoderStates(attn_h, attn_c, dec_h, dec_c, attn_weights, attn_weights_cum, attn_ctx, memory, processed_memory, mask)[source]

Bases: NamedTuple

attn_h: Tensor
attn_c: Tensor
dec_h: Tensor
dec_c: Tensor
attn_weights: Tensor
attn_weights_cum: Tensor
attn_ctx: Tensor
memory: Tensor
processed_memory: Tensor
mask: Tensor | None

class pretrained.tacotron2.Decoder(config: DecoderConfig)[source]

Bases: Module

get_go_frame(memory: Tensor) Tensor[source]
initialize_decoder_states(memory: Tensor, mask: Tensor | None) DecoderStates[source]
parse_decoder_inputs(decoder_inputs: Tensor) Tensor[source]
parse_decoder_outputs(mel_outputs: list[torch.Tensor], gate_outputs: list[torch.Tensor], alignments: list[torch.Tensor], states: DecoderStates) tuple[torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates][source]
decode(decoder_input: Tensor, states: DecoderStates) tuple[torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates][source]
forward(memory: Tensor, dec_ins: Tensor, memory_lengths: Tensor, states: DecoderStates | None = None) tuple[torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates][source]

infer(memory: Tensor, memory_lengths: Tensor, states: DecoderStates | None = None) tuple[torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates][source]
pretrained.tacotron2.window_sumsquare(window: str | float, n_frames: int, hop_length: int = 200, win_length: int = 800, n_fft: int = 800, dtype: type = numpy.float32, norm: float | None = None) ndarray[source]
pretrained.tacotron2.griffin_lim(magnitudes: Tensor, stft_fn: STFT, n_iters: int = 30) Tensor[source]
pretrained.tacotron2.dynamic_range_compression(x: Tensor, c: int | float = 1, clip_val: float = 1e-05) Tensor[source]
pretrained.tacotron2.dynamic_range_decompression(x: Tensor, c: int | float = 1) Tensor[source]
class pretrained.tacotron2.STFT(filter_length: int = 800, hop_length: int = 200, win_length: int = 800, window: str = 'hann')[source]

Bases: Module

forward_basis: Tensor
inverse_basis: Tensor
transform(input_data: Tensor) tuple[torch.Tensor, torch.Tensor][source]
inverse(magnitude: Tensor, phase: Tensor) Tensor[source]
forward(input_data: Tensor) Tensor[source]

class pretrained.tacotron2.TacotronSTFT(filter_length: int = 1024, hop_length: int = 256, win_length: int = 1024, n_mel_channels: int = 80, sampling_rate: int = 16000, mel_fmin: float = 0.0, mel_fmax: float = 8000.0)[source]

Bases: Module

mel_basis: Tensor
spectral_normalize(magnitudes: Tensor) Tensor[source]
spectral_de_normalize(magnitudes: Tensor) Tensor[source]
mel_spectrogram(y: Tensor) Tensor[source]
class pretrained.tacotron2.TacotronConfig(name: str = '???', mask_padding: bool = False, n_mel_channels: int = 80, n_symbols: int = 148, symbols_emb_dim: int = 512, n_frames_per_step: int = 1, symbols_emb_dropout: float = 0.1, encoder: pretrained.tacotron2.EncoderConfig = <factory>, decoder: pretrained.tacotron2.DecoderConfig = <factory>, postnet: pretrained.tacotron2.PostnetConfig = <factory>)[source]

Bases: BaseModelConfig

mask_padding: bool = False
n_mel_channels: int = 80
n_symbols: int = 148
symbols_emb_dim: int = 512
n_frames_per_step: int = 1
symbols_emb_dropout: float = 0.1
encoder: EncoderConfig
decoder: DecoderConfig
postnet: PostnetConfig
class pretrained.tacotron2.Tacotron(config: TacotronConfig)[source]

Bases: BaseModel

parse_output(outputs: tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates], output_lengths: Tensor | None = None) tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates][source]
forward(inputs: tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor], states: DecoderStates | None = None, speaker_emb: Tensor | None = None) tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates][source]

infer(inputs: Tensor, input_lengths: Tensor, states: DecoderStates | None = None, speaker_emb: Tensor | None = None) tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates][source]
class pretrained.tacotron2.Tokenizer[source]

Bases: object

pretrained.tacotron2.ensure_tacotron_downloaded() Path[source]
pretrained.tacotron2.pretrained_tacotron2(*, pretrained: bool = True, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0, lora_encoder: bool = True, lora_decoder: bool = True, lora_postnet: bool = True, device: device | None = None, prenet_dropout: bool = True, num_tokens: int | None = None) Tacotron[source]

Loads the pretrained Tacotron2 model.

Parameters:
  • pretrained – Whether to load the pretrained weights.

  • lora_rank – The LoRA rank to use, if LoRA is desired.

  • lora_alpha – The LoRA alpha to use, if LoRA is desired.

  • lora_dropout – The LoRA dropout to use, if LoRA is desired.

  • lora_encoder – Whether to use LoRA in the encoder.

  • lora_decoder – Whether to use LoRA in the decoder.

  • lora_postnet – Whether to use LoRA in the postnet.

  • device – The device to load the weights onto.

  • prenet_dropout – Whether to always apply dropout in the prenet.

  • num_tokens – The number of tokens in the vocabulary.

Returns:

The pretrained Tacotron model.
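
For example, a minimal sketch of loading the model for inference (the device selection here is illustrative):

import torch

from pretrained.tacotron2 import pretrained_tacotron2

# Load the published weights onto a GPU if one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tacotron = pretrained_tacotron2(pretrained=True, device=device)
tacotron.eval()

# A randomly initialized model (e.g. for training from scratch) can be
# requested by passing pretrained=False.
scratch = pretrained_tacotron2(pretrained=False)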

pretrained.tacotron2.tacotron_stft(filter_length: int = 1024, hop_length: int = 256, win_length: int = 1024, n_mel_channels: int = 80, sampling_rate: int = 16000, mel_fmin: float = 0.0, mel_fmax: float = 8000.0) TacotronSTFT[source]

Returns an STFT module for training the Tacotron model.

Parameters:
  • filter_length – The length of the filters used for the STFT.

  • hop_length – The hop length of the STFT.

  • win_length – The window length of the STFT.

  • n_mel_channels – The number of mel channels.

  • sampling_rate – The sampling rate of the audio.

  • mel_fmin – The minimum frequency of the mel filterbank.

  • mel_fmax – The maximum frequency of the mel filterbank.

Returns:

The STFT module.
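
A rough sketch of producing mel-spectrogram training targets from a waveform. The librosa loading step, the (batch, num_samples) input shape, and the placeholder path sample.wav are assumptions on my part:

import librosa
import torch

from pretrained.tacotron2 import tacotron_stft

stft = tacotron_stft(sampling_rate=16000)

# Load a mono waveform at the model's sampling rate; values are assumed
# to lie in [-1, 1], which is what librosa.load returns.
waveform, _ = librosa.load("sample.wav", sr=16000, mono=True)
y = torch.from_numpy(waveform).unsqueeze(0)  # (batch, num_samples)

# Mel spectrogram, roughly (batch, n_mel_channels, num_frames).
mels = stft.mel_spectrogram(y)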

pretrained.tacotron2.tacotron_tokenizer() Tokenizer[source]
class pretrained.tacotron2.TTS(tacotron: Tacotron, vocoder: HiFiGAN | WaveGlow, *, device: base_device | None = None)[source]

Bases: object

Provides an API for doing text-to-speech.

Note that this class is not an nn.Module, so you can keep it inside your own module without accidentally registering and storing all of its weights.

Parameters:
  • tacotron – The Tacotron model.

  • vocoder – The vocoder model.

  • device – The device to load the weights onto.

generate_mels(text: str | list[str], postnet: bool = True, states: DecoderStates | None = None) tuple[torch.Tensor, pretrained.tacotron2.DecoderStates][source]
generate_wave(mels: Tensor) Tensor[source]
generate(text: str | list[str], postnet: bool = True, states: DecoderStates | None = None) tuple[torch.Tensor, pretrained.tacotron2.DecoderStates][source]
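
For example, the two stages can be run separately. A minimal sketch; the tensor shapes and the semantics of reusing the returned states are my reading of the signatures above, not guaranteed by the API:

from pretrained.tacotron2 import pretrained_tacotron2_tts

tts = pretrained_tacotron2_tts("hifigan")

# Stage 1: text to mel spectrogram, with the postnet refinement applied.
mels, states = tts.generate_mels("The quick brown fox.", postnet=True)

# Stage 2: mel spectrogram to waveform, via the vocoder.
audio = tts.generate_wave(mels)

# generate() runs both stages; the returned decoder states can be passed
# back in on a later call (assumed continuation behavior).
audio, states = tts.generate("Jumps over the lazy dog.", states=states)
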
pretrained.tacotron2.pretrained_tacotron2_tts(vocoder_type: Literal['waveglow', 'hifigan'] = 'hifigan', *, device: base_device | None = None) TTS[source]
pretrained.tacotron2.test_tacotron_adhoc() None[source]