pretrained.mel_codec

Defines a simple API for an audio quantizer model that operates on Mel spectrograms.

from pretrained.mel_codec import pretrained_mel_codec

model = pretrained_mel_codec("librivox")
quantizer, dequantizer = model.quantizer(), model.dequantizer()

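# Convert a waveform of shape (B, T) to a quantized representation.
# The quantizer also returns the updated encoder state.
quantized, encoder_state = quantizer(audio)

# Convert the quantized representation back to audio.
# The dequantizer also returns the updated decoder state.
audio, decoder_state = dequantizer(quantized)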

pretrained.mel_codec.cast_pretrained_mel_codec_type(s: str) → Literal['base']

class pretrained.mel_codec.CBR(in_channels: int, out_channels: int, kernel_size: int)

Bases: Module

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, state: tuple[torch.Tensor, int] | None = None) → tuple[torch.Tensor, tuple[torch.Tensor, int]]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
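
The optional state argument and the updated state in the return value suggest that this layer can process a sequence chunk by chunk. The source does not document what the state holds, so the following is only a minimal sketch of one common pattern, a causal 1-D convolution that carries its left context between calls. Plain Python lists stand in for tensors, and causal_conv_step, its kernel argument, and the meaning given to the integer (a running count of samples seen) are all assumptions made for illustration, not part of pretrained.mel_codec.

```python
from typing import List, Optional, Tuple

# Hypothetical state layout: (left-context samples, number of samples seen so far).
State = Tuple[List[float], int]


def causal_conv_step(
    x: List[float],
    kernel: List[float],
    state: Optional[State] = None,
) -> Tuple[List[float], State]:
    """Applies a causal 1-D convolution to one chunk, carrying left context."""
    k = len(kernel)
    if state is None:
        # No history yet: use zeros as left context so output length equals input length.
        context, seen = [0.0] * (k - 1), 0
    else:
        context, seen = state
    padded = context + x
    out = [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(x))
    ]
    # Keep the last k - 1 samples as left context for the next chunk.
    new_context = padded[len(padded) - (k - 1):]
    return out, (new_context, seen + len(x))


kernel = [0.5, 0.25, 0.25]
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# One call over the whole sequence.
full, _ = causal_conv_step(signal, kernel)

# Two calls over consecutive chunks, threading the state through.
first, state = causal_conv_step(signal[:2], kernel)
second, state = causal_conv_step(signal[2:], kernel, state)
```

Splitting the input into two chunks and threading the state through gives the same output as a single call over the whole sequence, which is the property a streaming layer needs.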

class pretrained.mel_codec.Encoder(num_mels: int, d_model: int)

Bases: Module

Defines the encoder module.

This module takes the Mel spectrogram as an input and outputs the latent representation to quantize.

Parameters:

num_mels – Number of input Mel spectrogram bins.
d_model – The hidden dimension of the model.

Inputs:

mels – The input Mel spectrogram, with shape (B, T, C).
state – The previous state of the encoder, if any.

Outputs:

The latent representation, with shape (B, T, C), along with the updated state.

forward(mels: Tensor, state: list[tuple[torch.Tensor, int]] | None = None) → tuple[torch.Tensor, list[tuple[torch.Tensor, int]]]

Encodes the input Mel spectrogram into the latent representation, returning it along with the updated state.

As with any PyTorch module, call the module instance rather than forward() directly so that registered hooks run.

class pretrained.mel_codec.Decoder(num_mels: int, d_model: int, num_layers: int)

Bases: Module

Defines the decoder module.

This module takes the latent representation as input and outputs the reconstructed Mel spectrogram. It can be run in two modes. In inference mode, the model expects batches of codes and maintains a state over time. In training mode, the model expects the ground truth Mel spectrogram in addition to the codes.

Parameters:

num_mels – Number of input Mel spectrogram bins.
d_model – The hidden dimension of the model.
num_layers – Number of hidden layers in the decoder.

Inputs:

codes – The latent representation, with shape (B, T, C).
mels – The input Mel spectrogram, with shape (B, T, C), if in training mode.

Outputs:

The reconstructed Mel spectrogram, with shape (B, T, C), along with the updated state if in inference mode.
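
Training mode as described above resembles teacher forcing, where the decoder is fed ground truth frames instead of its own predictions. The sketch below shows one common way to build such inputs, by shifting the ground truth frames one step to the right and starting from an initial frame. This is an assumption about how the two modes might differ, not a description of this Decoder; teacher_forcing_inputs and init_frame are hypothetical names, and plain lists stand in for tensors.

```python
from typing import List

Frame = List[float]


def teacher_forcing_inputs(mels: List[Frame], init_frame: Frame) -> List[Frame]:
    """Shifts ground truth frames right by one step, starting from init_frame.

    The input at step t is the ground truth frame at step t - 1, so the
    model never sees the frame it is asked to predict.
    """
    if not mels:
        return []
    return [list(init_frame)] + [list(frame) for frame in mels[:-1]]


# Three ground truth frames with two Mel bins each (toy values).
mels = [[1.0, 1.5], [2.0, 2.5], [3.0, 3.5]]
inputs = teacher_forcing_inputs(mels, init_frame=[0.0, 0.0])

# Each (input, target) pair lines up the previous frame with the current one.
pairs = list(zip(inputs, mels))
```

At inference time no ground truth is available, so a decoder of this kind would instead feed back its own previous output, which is why a state has to be carried between calls.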

init_emb: Tensor

forward(codes: Tensor, mels: Tensor) → Tensor

Runs the decoder in training mode: takes the codes together with the ground truth Mel spectrogram and returns the reconstructed Mel spectrogram.

As with any PyTorch module, call the module instance rather than forward() directly so that registered hooks run.

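infer(codes: Tensor, state: tuple[torch.Tensor, torch.Tensor] | None = None) → tuple[torch.Tensor, tuple[torch.Tensor, torch.Tensor]]

Runs the decoder in inference mode, taking the codes and the previous state, if any, and returning the reconstructed Mel spectrogram along with the updated state.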

class pretrained.mel_codec.MelCodec(num_mels: int, d_model: int, num_layers: int, codebook_size: int, num_quantizers: int, hifigan_key: Literal['16000hz', '22050hz'])

Bases: Module

Defines an audio RNN module.

This module takes the Mel spectrogram as an input and outputs the predicted next step of the Mel spectrogram.

Parameters:

num_mels – Number of input Mel spectrogram bins.
d_model – The hidden dimension of the model.
num_layers – Number of hidden layers in the decoder.
codebook_size – Number of codebook entries.
num_quantizers – Number of quantizers to use.
hifigan_key – The key of the HiFi-GAN model to use, either '16000hz' or '22050hz'.
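
The codebook_size and num_quantizers parameters are typical of residual vector quantization, in which each quantizer encodes the residual left by the previous one. The source does not state which quantization scheme MelCodec uses, so the following is a minimal sketch of that general technique under this assumption. The functions nearest, rvq_encode, and rvq_decode and the toy codebooks are hypothetical and use plain Python lists instead of tensors.

```python
from typing import List

Vector = List[float]
Codebook = List[Vector]


def nearest(codebook: Codebook, vec: Vector) -> int:
    """Returns the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec: Vector, codebooks: List[Codebook]) -> List[int]:
    """Encodes vec as one index per quantizer, each refining the residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry; the next quantizer sees what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: List[int], codebooks: List[Codebook]) -> Vector:
    """Reconstructs a vector by summing the selected entry of each codebook."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Two quantizers (num_quantizers = 2), each with two entries (codebook_size = 2).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],     # coarse codebook
    [[0.0, 0.0], [0.25, -0.25]],  # fine codebook, applied to the residual
]
codes = rvq_encode([1.2, 0.8], codebooks)
approx = rvq_decode(codes, codebooks)
```

In this scheme each input frame yields one index per quantizer, which would be consistent with the leading N dimension in the token shape (N, B, Tq) documented for MelCodecQuantizer, although the source does not confirm that interpretation.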

forward(x: Tensor) → tuple[torch.Tensor, torch.Tensor]

Runs the forward pass of the model.

Parameters:

x – The input Mel spectrogram, with shape (B, T, C).

Returns:

The predicted next step of the Mel spectrogram, with shape (B, T, C), along with the codebook loss.

infer(x: Tensor) → Tensor

Runs the inference pass, for evaluating model quality.

This just converts the input mels to codes and then decodes them.

Parameters:

x – The input Mel spectrogram, with shape (B, T, C).

Returns:

The predicted next step of the Mel spectrogram, with shape (B, T, C).

property hifigan: HiFiGAN

quantizer() → MelCodecQuantizer

dequantizer() → MelCodecDequantizer

class pretrained.mel_codec.MelCodecQuantizer(codec: MelCodec, hifigan: HiFiGAN)

Bases: Module

encode(audio: Tensor, state: list[tuple[torch.Tensor, int]] | None = None) → tuple[torch.Tensor, list[tuple[torch.Tensor, int]]]

Converts a waveform to a set of tokens.

Parameters:

audio – The single-channel input waveform, with shape (B, T). This should be at 22050 Hz.
state – The encoder state from the previous step, if any.

Returns:

The quantized tokens, with shape (N, B, Tq), along with the updated encoder state.

forward(audio: Tensor, state: list[tuple[torch.Tensor, int]] | None = None) → tuple[torch.Tensor, list[tuple[torch.Tensor, int]]]

Has the same signature as encode(): takes a waveform and an optional encoder state, and returns the quantized tokens along with the updated encoder state.

As with any PyTorch module, call the module instance rather than forward() directly so that registered hooks run.

class pretrained.mel_codec.MelCodecDequantizer(codec: MelCodec, hifigan: HiFiGAN)

Bases: Module

decode(tokens: Tensor, state: tuple[torch.Tensor, torch.Tensor] | None = None) → tuple[torch.Tensor, tuple[torch.Tensor, torch.Tensor]]

Converts a set of tokens to a waveform.

Parameters:

tokens – The single-channel input tokens, with shape (N, B, Tq), at 22050 Hz.
state – The decoder state from the previous step, if any.

Returns:

The decoded waveform, with shape (B, T), along with the updated decoder state.

forward(tokens: Tensor, state: tuple[torch.Tensor, torch.Tensor] | None = None) → tuple[torch.Tensor, tuple[torch.Tensor, torch.Tensor]]

Has the same signature as decode(): takes the tokens and an optional decoder state, and returns the decoded waveform along with the updated decoder state.

As with any PyTorch module, call the module instance rather than forward() directly so that registered hooks run.
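
Because both forward methods accept an optional state and return an updated one, audio can in principle be processed chunk by chunk. The following sketch shows how the two states might be threaded through a loop. stream_roundtrip is a hypothetical helper, and toy_quantizer and toy_dequantizer are stand-ins that only mimic the (input, state) → (output, state) calling convention so the example runs without the pretrained package.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional, Tuple

# Any callable that maps (input, previous state) to (output, updated state).
Step = Callable[[Any, Optional[Any]], Tuple[Any, Any]]


def stream_roundtrip(
    quantizer: Step,
    dequantizer: Step,
    chunks: Iterable[Any],
) -> Iterator[Any]:
    """Encodes and decodes chunks in order, carrying both states forward."""
    encoder_state = None  # None on the first call, as in the documented signatures
    decoder_state = None
    for chunk in chunks:
        tokens, encoder_state = quantizer(chunk, encoder_state)
        audio, decoder_state = dequantizer(tokens, decoder_state)
        yield audio


# Stand-ins for illustration only; they are not the real modules.
def toy_quantizer(chunk: List[float], state: Optional[int]) -> Tuple[List[int], int]:
    calls = 0 if state is None else state
    return [round(x) for x in chunk], calls + 1


def toy_dequantizer(tokens: List[int], state: Optional[int]) -> Tuple[List[float], int]:
    calls = 0 if state is None else state
    return [float(t) for t in tokens], calls + 1


decoded = list(stream_roundtrip(toy_quantizer, toy_dequantizer, [[0.2, 1.7], [2.6]]))
```

With the real modules, model.quantizer() and model.dequantizer() would take the place of the stand-ins. Whether chunked output matches a single full-length call depends on the implementation and is not confirmed by the source.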

pretrained.mel_codec.pretrained_mel_codec(key: str | Literal['base'], load_weights: bool = True) → MelCodec

pretrained.mel_codec.test_codec_adhoc() → None