pretrained.tacotron2
Defines a pre-trained Tacotron2 model.
This combines a Tacotron2 model with a HiFiGAN vocoder to produce an end-to-end TTS model, adapted to be fine-tunable.
from pretrained.tacotron2 import pretrained_tacotron2_tts
tts = pretrained_tacotron2_tts()
audio, states = tts.generate("Hello, world!")
write_audio([audio])
You can also interact with this model directly through the command line:
python -m pretrained.tacotron2 'Hello, world!'
The two parts of the model can be trained separately, including using LoRA fine-tuning.
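For example, LoRA adapters can be injected when loading the Tacotron half (a minimal sketch; the rank and alpha values below are illustrative, not recommendations):
from pretrained.tacotron2 import pretrained_tacotron2

# Load the pretrained weights with low-rank adapters added, so fine-tuning
# only needs to update the LoRA weights.
tacotron = pretrained_tacotron2(pretrained=True, lora_rank=8, lora_alpha=16.0)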
Using this model requires the following additional dependencies:
- inflect
- ftfy
Additionally, to generate STFTs for training the model, you will need to install librosa. If you want to play audio for the demo, you should also install sounddevice.
- class pretrained.tacotron2.LinearNorm(in_dim: int, out_dim: int, bias: bool = True, w_init_gain: str = 'linear', lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0)[source]
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(x: Tensor) Tensor [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class pretrained.tacotron2.ConvNorm(in_channels: int, out_channels: int, kernel_size: int = 1, stride: int = 1, padding: int | None = None, dilation: int = 1, bias: bool = True, w_init_gain: str = 'linear', lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0)[source]
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(signal: Tensor) Tensor [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class pretrained.tacotron2.LocationLayer(attention_n_filters: int, attention_kernel_size: int, attention_dim: int, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0)[source]
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(attention_weights_cat: Tensor) Tensor [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class pretrained.tacotron2.Attention(attention_rnn_dim: int, embedding_dim: int, attention_dim: int, attention_location_n_filters: int, attention_location_kernel_size: int, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0)[source]
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- get_alignment_energies(query: Tensor, processed_memory: Tensor, attention_weights_cat: Tensor) Tensor [source]
- forward(attn_hid_state: Tensor, memory: Tensor, proc_memory: Tensor, attn_weights_cat: Tensor, mask: Tensor | None) tuple[torch.Tensor, torch.Tensor] [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class pretrained.tacotron2.Prenet(in_dim: int = 80, sizes: list[int] = [256, 256], dropout: float = 0.5, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0, dropout_always_on: bool = True)[source]
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(x: Tensor) Tensor [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class pretrained.tacotron2.PostnetConfig(n_mel_channels: int = 80, emb_dim: int = 512, kernel_size: int = 5, n_convolutions: int = 5, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0)[source]
Bases: object
- n_mel_channels: int = 80
- emb_dim: int = 512
- kernel_size: int = 5
- n_convolutions: int = 5
- lora_rank: int | None = None
- lora_alpha: float = 1.0
- lora_dropout: float = 0.0
- class pretrained.tacotron2.Postnet(config: PostnetConfig)[source]
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(x: Tensor) Tensor [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class pretrained.tacotron2.EncoderConfig(emb_dim: int = 512, kernel_size: int = 5, n_convolutions: int = 3, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0, freeze_bn: bool = False, speaker_emb_dim: int | None = None)[source]
Bases: object
- emb_dim: int = 512
- kernel_size: int = 5
- n_convolutions: int = 3
- lora_rank: int | None = None
- lora_alpha: float = 1.0
- lora_dropout: float = 0.0
- freeze_bn: bool = False
- speaker_emb_dim: int | None = None
- class pretrained.tacotron2.Encoder(config: EncoderConfig)[source]
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(x: Tensor, input_lengths: Tensor, speaker_emb: Tensor | None = None) Tensor [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class pretrained.tacotron2.DecoderConfig(n_mel_channels: int = 80, n_frames_per_step: int = 1, encoder_emb_dim: int = 512, attention_dim: int = 128, attention_location_n_filters: int = 32, attention_location_kernel_size: int = 31, attention_rnn_dim: int = 1024, decoder_rnn_dim: int = 1024, prenet_dim: int = 256, prenet_dropout: bool = 0.5, max_decoder_steps: int = 1000, gate_threshold: float = 0.5, p_attention_dropout: float = 0.1, p_decoder_dropout: float = 0.1, prenet_dropout_always_on: bool = True, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0)[source]
Bases: object
- n_mel_channels: int = 80
- n_frames_per_step: int = 1
- encoder_emb_dim: int = 512
- attention_dim: int = 128
- attention_location_n_filters: int = 32
- attention_location_kernel_size: int = 31
- attention_rnn_dim: int = 1024
- decoder_rnn_dim: int = 1024
- prenet_dim: int = 256
- prenet_dropout: bool = 0.5
- max_decoder_steps: int = 1000
- gate_threshold: float = 0.5
- p_attention_dropout: float = 0.1
- p_decoder_dropout: float = 0.1
- prenet_dropout_always_on: bool = True
- lora_rank: int | None = None
- lora_alpha: float = 1.0
- lora_dropout: float = 0.0
- class pretrained.tacotron2.DecoderStates(attn_h, attn_c, dec_h, dec_c, attn_weights, attn_weights_cum, attn_ctx, memory, processed_memory, mask)[source]
Bases: NamedTuple
Create new instance of DecoderStates(attn_h, attn_c, dec_h, dec_c, attn_weights, attn_weights_cum, attn_ctx, memory, processed_memory, mask)
- attn_h: Tensor
Alias for field number 0
- attn_c: Tensor
Alias for field number 1
- dec_h: Tensor
Alias for field number 2
- dec_c: Tensor
Alias for field number 3
- attn_weights: Tensor
Alias for field number 4
- attn_weights_cum: Tensor
Alias for field number 5
- attn_ctx: Tensor
Alias for field number 6
- memory: Tensor
Alias for field number 7
- processed_memory: Tensor
Alias for field number 8
- mask: Tensor | None
Alias for field number 9
- class pretrained.tacotron2.Decoder(config: DecoderConfig)[source]
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- initialize_decoder_states(memory: Tensor, mask: Tensor | None) DecoderStates [source]
- parse_decoder_outputs(mel_outputs: list[torch.Tensor], gate_outputs: list[torch.Tensor], alignments: list[torch.Tensor], states: DecoderStates) tuple[torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates] [source]
- decode(decoder_input: Tensor, states: DecoderStates) tuple[torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates] [source]
- forward(memory: Tensor, dec_ins: Tensor, memory_lengths: Tensor, states: DecoderStates | None = None) tuple[torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates] [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- infer(memory: Tensor, memory_lengths: Tensor, states: DecoderStates | None = None) tuple[torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates] [source]
- pretrained.tacotron2.window_sumsquare(window: str | float, n_frames: int, hop_length: int = 200, win_length: int = 800, n_fft: int = 800, dtype: type = <class 'numpy.float32'>, norm: float | None = None) ndarray [source]
- pretrained.tacotron2.griffin_lim(magnitudes: Tensor, stft_fn: STFT, n_iters: int = 30) Tensor [source]
- pretrained.tacotron2.dynamic_range_compression(x: Tensor, c: int | float = 1, clip_val: float = 1e-05) Tensor [source]
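A hedged sketch of vocoder-free reconstruction with griffin_lim; it assumes the STFT module's forward pass yields the magnitude spectrogram that griffin_lim expects, which the signatures above do not spell out:
import torch

from pretrained.tacotron2 import STFT, griffin_lim

stft = STFT(filter_length=800, hop_length=200, win_length=800, window="hann")
waveform = torch.randn(1, 16000)  # placeholder audio in place of a real clip

# Assumption: forward() returns magnitudes compatible with griffin_lim; check
# the source before relying on this.
magnitudes = stft(waveform)
audio = griffin_lim(magnitudes, stft_fn=stft, n_iters=30)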
- class pretrained.tacotron2.STFT(filter_length: int = 800, hop_length: int = 200, win_length: int = 800, window: str = 'hann')[source]
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward_basis: Tensor
- inverse_basis: Tensor
- forward(input_data: Tensor) Tensor [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class pretrained.tacotron2.TacotronSTFT(filter_length: int = 1024, hop_length: int = 256, win_length: int = 1024, n_mel_channels: int = 80, sampling_rate: int = 16000, mel_fmin: float = 0.0, mel_fmax: float = 8000.0)[source]
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- mel_basis: Tensor
- class pretrained.tacotron2.TacotronConfig(name: str = '???', mask_padding: bool = False, n_mel_channels: int = 80, n_symbols: int = 148, symbols_emb_dim: int = 512, n_frames_per_step: int = 1, symbols_emb_dropout: float = 0.1, encoder: pretrained.tacotron2.EncoderConfig = <factory>, decoder: pretrained.tacotron2.DecoderConfig = <factory>, postnet: pretrained.tacotron2.PostnetConfig = <factory>)[source]
Bases: BaseModelConfig
- mask_padding: bool = False
- n_mel_channels: int = 80
- n_symbols: int = 148
- symbols_emb_dim: int = 512
- n_frames_per_step: int = 1
- symbols_emb_dropout: float = 0.1
- encoder: EncoderConfig
- decoder: DecoderConfig
- postnet: PostnetConfig
- class pretrained.tacotron2.Tacotron(config: TacotronConfig)[source]
Bases: BaseModel
- parse_output(outputs: tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates], output_lengths: Tensor | None = None) tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates] [source]
- forward(inputs: tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor], states: DecoderStates | None = None, speaker_emb: Tensor | None = None) tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates] [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- infer(inputs: Tensor, input_lengths: Tensor, states: DecoderStates | None = None, speaker_emb: Tensor | None = None) tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, pretrained.tacotron2.DecoderStates] [source]
- pretrained.tacotron2.pretrained_tacotron2(*, pretrained: bool = True, lora_rank: int | None = None, lora_alpha: float = 1.0, lora_dropout: float = 0.0, lora_encoder: bool = True, lora_decoder: bool = True, lora_postnet: bool = True, device: device | None = None, prenet_dropout: bool = True, num_tokens: int | None = None) Tacotron [source]
Loads the pretrained Tacotron2 model.
- Parameters:
pretrained – Whether to load the pretrained weights.
lora_rank – The LoRA rank to use, if LoRA is desired.
lora_alpha – The LoRA alpha to use, if LoRA is desired.
lora_dropout – The LoRA dropout to use, if LoRA is desired.
lora_encoder – Whether to use LoRA in the encoder.
lora_decoder – Whether to use LoRA in the decoder.
lora_postnet – Whether to use LoRA in the postnet.
device – The device to load the weights onto.
prenet_dropout – Whether to always apply dropout in the PreNet.
num_tokens – The number of tokens in the vocabulary.
- Returns:
The pretrained Tacotron model.
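A hedged sketch of a fine-tuning setup built on these parameters, applying LoRA to the decoder only; the name-based filter assumes the injected adapter parameters contain "lora" in their names, which is a common convention rather than something documented here:
import torch

from pretrained.tacotron2 import pretrained_tacotron2

model = pretrained_tacotron2(
    pretrained=True,
    lora_rank=4,         # illustrative rank
    lora_encoder=False,  # leave the encoder untouched
    lora_decoder=True,   # inject adapters into the decoder
    lora_postnet=False,  # leave the postnet untouched
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

# Assumption: adapter parameter names contain "lora"; verify against the model
# before relying on this filter.
lora_params = [p for n, p in model.named_parameters() if "lora" in n.lower()]
optimizer = torch.optim.AdamW(lora_params, lr=1e-4)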
- pretrained.tacotron2.tacotron_stft(filter_length: int = 1024, hop_length: int = 256, win_length: int = 1024, n_mel_channels: int = 80, sampling_rate: int = 16000, mel_fmin: float = 0.0, mel_fmax: float = 8000.0) TacotronSTFT [source]
Returns an STFT module for training the Tacotron model.
- Parameters:
filter_length – The length of the filters used for the STFT.
hop_length – The hop length of the STFT.
win_length – The window length of the STFT.
n_mel_channels – The number of mel channels.
sampling_rate – The sampling rate of the audio.
mel_fmin – The minimum frequency of the mel filterbank.
mel_fmax – The maximum frequency of the mel filterbank.
- Returns:
The STFT module.
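A short sketch of constructing this module with the pretrained model's defaults when preparing training targets; how waveforms are then converted to mel spectrograms (for example via a mel_spectrogram method, as in the reference NVIDIA implementation) is an assumption to verify against the source:
from pretrained.tacotron2 import tacotron_stft

# Mel front end matching the pretrained model's defaults (16 kHz audio, 80 mel
# channels); requires librosa, as noted at the top of this page.
stft = tacotron_stft(
    filter_length=1024,
    hop_length=256,
    win_length=1024,
    n_mel_channels=80,
    sampling_rate=16000,
)
The returned TacotronSTFT keeps the mel filterbank in its mel_basis attribute.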
- class pretrained.tacotron2.TTS(tacotron: Tacotron, vocoder: HiFiGAN | WaveGlow, *, device: base_device | None = None)[source]
Bases: object
Provides an API for doing text-to-speech.
Note that this module is not an nn.Module, so you can use it in your own module without worrying about accidentally storing all the weights.
- Parameters:
tacotron – The Tacotron model.
vocoder – The vocoder model.
device – The device to load the weights onto.
- generate_mels(text: str | list[str], postnet: bool = True, states: DecoderStates | None = None) tuple[torch.Tensor, pretrained.tacotron2.DecoderStates] [source]
- generate(text: str | list[str], postnet: bool = True, states: DecoderStates | None = None) tuple[torch.Tensor, pretrained.tacotron2.DecoderStates] [source]
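A sketch of end-to-end usage that carries the decoder states between calls and plays the result with sounddevice; the 16 kHz playback rate mirrors the TacotronSTFT default above and is an assumption about the vocoder's output rate:
import sounddevice as sd

from pretrained.tacotron2 import pretrained_tacotron2_tts

tts = pretrained_tacotron2_tts()

# The DecoderStates returned by one call can be fed back in, so the second
# call starts from the first call's decoder and attention state.
audio, states = tts.generate("Hello, world!")
audio, states = tts.generate("Nice to meet you.", states=states)

sd.play(audio.detach().flatten().cpu().numpy(), samplerate=16000)
sd.wait()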