PyTorch ASR Phoneme Extraction
This notebook demonstrates how to extract phoneme representations from speech using PyTorch and pre-trained ASR models.
Setup
First, let's install the required libraries if they're not already installed.
# Run this cell if you need to install the packages
!pip install torch torchaudio transformers matplotlib numpy soundfile librosa
Requirement already satisfied: torch, torchaudio, transformers, matplotlib, numpy, soundfile, librosa, and their dependencies (pip output truncated).
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import matplotlib.pyplot as plt
import numpy as np
import IPython.display as ipd
import os
import subprocess
# Try to install required packages for audio processing
try:
    import librosa
    import soundfile as sf
except ImportError:
    print("Installing librosa and soundfile for audio processing...")
    !pip install librosa soundfile
    import librosa
    import soundfile as sf
# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
Using device: cuda
Download Sample Audio
Let's download a sample audio file to work with.
# Download and extract a sample audio file from LibriSpeech
import os
import tarfile
import tempfile
from urllib.request import urlretrieve
import shutil

sample_dir = "sample_data"
os.makedirs(sample_dir, exist_ok=True)

# Target audio file paths - we'll create both FLAC and WAV versions
flac_path = os.path.join(sample_dir, "sample_audio.flac")
wav_path = os.path.join(sample_dir, "sample_audio.wav")

# Check which files exist and set the audio path accordingly
flac_exists = os.path.exists(flac_path)
wav_exists = os.path.exists(wav_path)

# Prefer WAV if it exists, otherwise use FLAC if it exists
if wav_exists:
    audio_path = wav_path
    print(f"Using existing WAV file: {wav_path}")
elif flac_exists:
    audio_path = flac_path
    print(f"Using existing FLAC file: {flac_path}")
    # Try to convert to WAV if FLAC exists but WAV doesn't
    print("Converting FLAC to WAV format for better compatibility...")
    try:
        # Load the audio file with librosa
        audio_data, sample_rate = librosa.load(flac_path, sr=None)
        # Save as WAV using soundfile
        sf.write(wav_path, audio_data, sample_rate)
        print(f"Converted audio saved to {wav_path}")
        audio_path = wav_path  # Use the newly created WAV file
    except Exception as e:
        print(f"Error converting audio: {e}")
        print("Using original FLAC file instead.")
else:
    # Neither file exists, need to download
    print("Sample audio not found. Downloading and extracting from archive...")

    # Download the tarball
    tarball_url = "https://openslr.elda.org/resources/12/dev-clean.tar.gz"
    with tempfile.NamedTemporaryFile(suffix=".tar.gz", delete=False) as temp_file:
        print(f"Downloading archive from {tarball_url}...")
        urlretrieve(tarball_url, temp_file.name)
        tarball_path = temp_file.name

    # Extract the specific file we need
    target_file_path = "LibriSpeech/dev-clean/84/121123/84-121123-0001.flac"
    with tempfile.TemporaryDirectory() as temp_dir:
        print("Extracting archive...")
        with tarfile.open(tarball_path, "r:gz") as tar:
            # Extract only the file we need
            member = tar.getmember(target_file_path)
            tar.extract(member, path=temp_dir)
        # Move the extracted file to our sample directory
        extracted_file = os.path.join(temp_dir, target_file_path)
        shutil.copy(extracted_file, flac_path)

    # Clean up the tarball
    os.unlink(tarball_path)
    print(f"Sample audio extracted to {flac_path}")

    # Convert FLAC to WAV using librosa
    print("Converting FLAC to WAV format...")
    try:
        # Load the audio file with librosa
        audio_data, sample_rate = librosa.load(flac_path, sr=None)
        # Save as WAV using soundfile
        sf.write(wav_path, audio_data, sample_rate)
        print(f"Converted audio saved to {wav_path}")
        audio_path = wav_path  # Use the WAV file
    except Exception as e:
        print(f"Error converting audio: {e}")
        print("Using original FLAC file instead.")
        audio_path = flac_path

print(f"Using audio file: {audio_path}")
Using existing WAV file: sample_data/sample_audio.wav
Using audio file: sample_data/sample_audio.wav
Loading and Processing Audio
We define a small helper that loads the audio, resamples it to 16 kHz (the rate Wav2Vec 2.0 expects), and mixes multi-channel audio down to mono, falling back to librosa if torchaudio cannot read the file.
def process_audio(file_path):
    # Load audio using an alternative method if torchaudio fails
    try:
        # Try torchaudio first
        waveform, sample_rate = torchaudio.load(file_path)
    except RuntimeError:
        # Fall back to using librosa
        print(f"torchaudio failed to load {file_path}, trying librosa instead...")
        # Load with librosa (automatically handles various formats including FLAC)
        audio_data, sample_rate = librosa.load(file_path, sr=None)
        waveform = torch.from_numpy(audio_data).unsqueeze(0).float()
        print("Successfully loaded audio with librosa")

    # Resample to 16 kHz if needed
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        waveform = resampler(waveform)
        sample_rate = 16000

    # Convert to mono if needed
    if waveform.shape[0] > 1:
        waveform = torch.mean(waveform, dim=0, keepdim=True)

    return waveform.squeeze(), sample_rate
# Load and process the audio
waveform, sample_rate = process_audio(audio_path)
# Display audio information
print(f"Sample rate: {sample_rate} Hz")
print(f"Waveform shape: {waveform.shape}")
print(f"Audio duration: {waveform.shape[0]/sample_rate:.2f} seconds")
# Play the audio
ipd.Audio(waveform.numpy(), rate=sample_rate)
Sample rate: 16000 Hz
Waveform shape: torch.Size([63840])
Audio duration: 3.99 seconds
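As a quick sanity check before running the model, we can plot the waveform we just loaded. This is a small optional sketch that only uses matplotlib and numpy, which are already imported above.
# Plot the loaded waveform against a time axis in seconds
time_axis = np.arange(waveform.shape[0]) / sample_rate
plt.figure(figsize=(12, 3))
plt.plot(time_axis, waveform.numpy())
plt.xlabel("Time (seconds)")
plt.ylabel("Amplitude")
plt.title("Loaded waveform")
plt.tight_layout()
plt.show()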
Load Pre-trained ASR Model
We'll use the Wav2Vec 2.0 model from Facebook (facebook/wav2vec2-base-960h), which was pre-trained on unlabeled speech and then fine-tuned with a character-level CTC head on 960 hours of LibriSpeech.
# Load pre-trained model and processor
model_name = "facebook/wav2vec2-base-960h"
print(f"Loading model: {model_name}")
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
print("Model loaded successfully!")
Loading model: facebook/wav2vec2-base-960h
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Model loaded successfully!
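Before running inference it can help to look at the CTC head's output vocabulary. The sketch below uses the tokenizer's get_vocab() method; for this checkpoint the entries are characters, the word delimiter '|', and special tokens such as <pad>, which also serves as the CTC blank.
# Inspect the model's output vocabulary (token -> id, then inverted to id -> token)
vocab = processor.tokenizer.get_vocab()
id_to_token = {i: t for t, i in vocab.items()}
print(f"Vocabulary size: {len(id_to_token)}")
print([id_to_token[i] for i in sorted(id_to_token)])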
Extracting Phoneme Probabilities
Now we'll take the model's logits and convert them with a softmax into per-frame probabilities over the output vocabulary. Note that for facebook/wav2vec2-base-960h this vocabulary is character-level (letters, the word delimiter '|', and special tokens such as the CTC blank), so these activations are really grapheme probabilities; a model that predicts true phonemes is used later in the notebook.
def extract_phoneme_probs(waveform, sample_rate=16000):
    # Process audio for model input
    input_values = processor(waveform, sampling_rate=sample_rate, return_tensors="pt").input_values
    input_values = input_values.to(device)

    # Get model outputs (without gradient calculation)
    with torch.no_grad():
        outputs = model(input_values)
        logits = outputs.logits

    # Convert logits to probabilities
    probs = torch.nn.functional.softmax(logits, dim=-1)

    return probs.cpu().squeeze(), processor.tokenizer.decoder
# Get phoneme probabilities
phoneme_probs, decoder = extract_phoneme_probs(waveform)
print(f"Shape of phoneme probabilities: {phoneme_probs.shape}")
print(f"Number of time steps: {phoneme_probs.shape[0]}")
print(f"Number of phoneme classes: {phoneme_probs.shape[1]}")
Shape of phoneme probabilities: torch.Size([199, 32])
Number of time steps: 199
Number of phoneme classes: 32
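The Wav2Vec 2.0 feature encoder emits one frame roughly every 20 ms of audio (about 50 frames per second). We can check this empirically by dividing the audio duration by the number of output frames; this estimate is what the plotting code below assumes.
# Estimate the effective frame stride from the audio length and the number of output frames
num_frames = phoneme_probs.shape[0]
duration_s = waveform.shape[0] / sample_rate
frame_stride_s = duration_s / num_frames
print(f"Frames: {num_frames}, duration: {duration_s:.2f} s")
print(f"Approximate stride: {frame_stride_s * 1000:.1f} ms per frame (~{1 / frame_stride_s:.0f} frames per second)")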
Visualizing Phoneme Activations
Let's visualize the top phoneme activations over time.
def plot_phoneme_activations(probs, decoder, top_k=5):
    # Get top-k phonemes at each time step
    top_probs, top_indices = torch.topk(probs, k=top_k, dim=1)

    # Convert to numpy for plotting
    top_probs = top_probs.numpy()
    top_indices = top_indices.numpy()

    # Create a time axis (assuming ~50 frames per second for Wav2Vec 2.0)
    time_steps = np.arange(top_probs.shape[0]) / 50

    # Plot
    plt.figure(figsize=(15, 8))

    # Plot a subset of time steps for clarity
    start_idx = 0
    end_idx = min(200, len(time_steps))  # Show the first 4 seconds or less

    for i in range(top_k):
        # decoder maps class IDs directly to token strings; label each line with
        # the i-th ranked class at the first frame (the ranked class can change over time)
        class_id = int(top_indices[0, i])
        plt.plot(time_steps[start_idx:end_idx],
                 top_probs[start_idx:end_idx, i],
                 label=f"Class {class_id} ({decoder.get(class_id, '')})")

    plt.xlabel("Time (seconds)")
    plt.ylabel("Probability")
    plt.title("Top Phoneme Activations Over Time")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
# Visualize phoneme activations
plot_phoneme_activations(phoneme_probs, decoder)
Decoding to Phonemes and Text
Now let's decode the model outputs to both phonemes and text.
def decode_outputs(probs, decoder):
    # Get the most likely phoneme at each time step
    pred_ids = torch.argmax(probs, dim=-1)

    # Decode to phonemes (keeping all predictions)
    phoneme_sequence = [decoder.get(id.item(), f"[{id.item()}]") for id in pred_ids]

    # Apply CTC decoding logic (collapse repeated tokens and remove blanks)
    collapsed_phonemes = []
    prev_id = -1
    for id in pred_ids:
        if id != prev_id and id != 0:  # 0 is usually the blank token in CTC
            collapsed_phonemes.append(decoder.get(id.item(), f"[{id.item()}]"))
        prev_id = id

    # Join phonemes to get the text
    text = ''.join(collapsed_phonemes).replace('|', ' ')

    return phoneme_sequence, collapsed_phonemes, text
# Decode outputs
phoneme_sequence, collapsed_phonemes, text = decode_outputs(phoneme_probs, decoder)
print("Full phoneme sequence (first 50 frames):")
print(phoneme_sequence[:50])
print("\nCollapsed phoneme sequence:")
print(collapsed_phonemes)
print("\nDecoded text:")
print(text)
Full phoneme sequence (first 50 frames):
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'B', '<pad>', 'U', 'T', '<pad>', '|', '<pad>', 'I', 'N', '|', '|', '|', 'L', '<pad>', 'E', 'S', '<pad>', '<pad>', 'S', '|', '|', 'T', 'H', '<pad>', 'A', 'N', '<pad>', '|', '|', '<pad>', 'F', '<pad>', '<pad>', '<pad>']

Collapsed phoneme sequence:
['B', 'U', 'T', '|', 'I', 'N', '|', 'L', 'E', 'S', 'S', '|', 'T', 'H', 'A', 'N', '|', 'F', 'I', 'V', 'E', '|', 'M', 'I', 'N', 'U', 'T', 'E', 'S', '|', 'T', 'H', 'E', '|', 'S', 'T', 'A', 'I', 'R', 'C', 'A', 'S', 'E', '|', 'G', 'R', 'O', 'A', 'N', 'E', 'D', '|', 'B', 'E', 'N', 'E', 'A', 'T', 'H', '|', 'A', 'N', '|', 'E', 'X', 'T', 'R', 'A', 'O', 'R', 'D', 'I', 'N', 'A', 'R', 'Y', '|', 'W', 'E', 'I', 'G', 'H', 'T', '|']

Decoded text:
BUT IN LESS THAN FIVE MINUTES THE STAIRCASE GROANED BENEATH AN EXTRAORDINARY WEIGHT
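Because each frame covers a roughly fixed stride (about 20 ms, as estimated earlier), we can also attach rough start times to the collapsed tokens. The sketch below is only an approximation based on the frame at which each emitted token first appears, not a forced alignment; it reuses phoneme_probs, waveform, and decoder from the cells above and looks up the blank id from the tokenizer instead of hard-coding 0.
# Attach approximate start times (in seconds) to the CTC-collapsed tokens
frame_stride_s = (waveform.shape[0] / sample_rate) / phoneme_probs.shape[0]
blank_id = processor.tokenizer.pad_token_id  # the pad token doubles as the CTC blank
pred_ids = torch.argmax(phoneme_probs, dim=-1).tolist()
timed_tokens = []
prev_id = None
for frame_idx, token_id in enumerate(pred_ids):
    if token_id != prev_id and token_id != blank_id:
        timed_tokens.append((frame_idx * frame_stride_s, decoder.get(token_id, f"[{token_id}]")))
    prev_id = token_id
for start_s, token in timed_tokens[:15]:
    print(f"{start_s:5.2f}s  {token}")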
Using a Model Fine-tuned for Phoneme Recognition
For a more direct approach, we can use a checkpoint fine-tuned specifically for phoneme recognition, which outputs IPA phoneme symbols instead of characters.
# Load a model fine-tuned for phoneme recognition
# Note: This will download a different model
phoneme_model_name = "facebook/wav2vec2-lv-60-espeak-cv-ft"
print(f"Loading phoneme model: {phoneme_model_name}")
try:
    # Import the tokenizer and feature extractor classes for this model
    from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor

    # Load the model components separately
    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(phoneme_model_name)
    tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(phoneme_model_name)
    phoneme_processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
    phoneme_model = Wav2Vec2ForCTC.from_pretrained(phoneme_model_name).to(device)
    print("Phoneme model loaded successfully!")

    def transcribe_to_phonemes(waveform, sample_rate=16000):
        # Process audio for model input
        input_values = phoneme_processor(waveform, sampling_rate=sample_rate, return_tensors="pt").input_values
        input_values = input_values.to(device)

        # Get model predictions
        with torch.no_grad():
            logits = phoneme_model(input_values).logits

        # Decode phonemes
        predicted_ids = torch.argmax(logits, dim=-1)
        phoneme_string = phoneme_processor.batch_decode(predicted_ids)[0]
        return phoneme_string

    # Get phoneme transcription
    phoneme_transcription = transcribe_to_phonemes(waveform)
    print("\nPhoneme transcription:")
    print(phoneme_transcription)

except Exception as e:
    print(f"Error loading phoneme model: {e}")
    print("Skipping phoneme-specific model demonstration.")
Loading phoneme model: facebook/wav2vec2-lv-60-espeak-cv-ft
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'Wav2Vec2PhonemeCTCTokenizer'. The class this function is called from is 'Wav2Vec2CTCTokenizer'.
Phoneme model loaded successfully!
Phoneme transcription:
bʌɾɪnlɛsðənfaɪvmɪnɪtsðəstɛɹkeɪsɡɹoʊndbɪniːθɐnɛkstɹɔːɹdnɛɹiweɪt
Analyzing Phoneme Distributions
Let's analyze the distribution of decoded tokens in our sample (note that for the base model these are characters and the word delimiter '|', not true phonemes).
# Count phoneme occurrences
from collections import Counter
# Count non-blank phonemes
phoneme_counts = Counter([p for p in collapsed_phonemes if p != ''])
# Plot top 15 phonemes
top_phonemes = phoneme_counts.most_common(15)
phonemes, counts = zip(*top_phonemes)
plt.figure(figsize=(12, 6))
plt.bar(phonemes, counts)
plt.title('Top 15 Phonemes in Sample')
plt.xlabel('Phoneme')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
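If the specialized phoneme model loaded successfully above, we can run the same kind of frequency analysis on its IPA output. Note that this sketch splits the transcription into individual Unicode characters, so multi-character phonemes such as 'oʊ' or 'iː' are counted as separate symbols.
# Count IPA symbols from the specialized model's transcription, if it is available
try:
    ipa_counts = Counter(ch for ch in phoneme_transcription if ch.strip())
    ipa_symbols, ipa_freqs = zip(*ipa_counts.most_common(15))
    plt.figure(figsize=(12, 6))
    plt.bar(ipa_symbols, ipa_freqs)
    plt.title('Top 15 IPA Symbols (Specialized Phoneme Model)')
    plt.xlabel('Symbol')
    plt.ylabel('Count')
    plt.tight_layout()
    plt.show()
except NameError:
    print("Specialized phoneme transcription not available; skipping IPA analysis.")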
Comparing Phoneme Transcriptions
Let's compare the different phoneme transcriptions side by side to see the differences between methods.
# Create a comparison of the different phoneme transcriptions
import pandas as pd
from IPython.display import display, HTML
# Store the transcriptions in variables for comparison
# Note: These will be populated when the cells above are run
ctc_text = text # From the CTC decoding section
try:
    specialized_text = phoneme_transcription  # From the specialized model section
except NameError:
    specialized_text = "[Model failed to load]"

# Create a DataFrame for comparison
comparison_df = pd.DataFrame({
    'Method': ['Wav2Vec2 Base with CTC Decoding', 'Specialized Phoneme Model'],
    'Transcription': [ctc_text, specialized_text]
})
# Display the comparison
display(HTML(comparison_df.to_html(index=False)))
# Also show a more detailed comparison of the phoneme sequences
print("\nDetailed Phoneme Sequence Comparison:")
print("\nWav2Vec2 Base with CTC Decoding:")
print(' '.join(collapsed_phonemes[:30]) + "...")
try:
    # For the specialized model, we might need to split the string into individual phonemes
    if isinstance(specialized_text, str):
        specialized_phonemes = list(specialized_text.replace(" ", "|SPACE|"))
        print("\nSpecialized Phoneme Model:")
        print(' '.join(specialized_phonemes[:30]) + "...")
except Exception as e:
    print(f"\nCould not process specialized phonemes: {e}")
| Method | Transcription |
|---|---|
| Wav2Vec2 Base with CTC Decoding | BUT IN LESS THAN FIVE MINUTES THE STAIRCASE GROANED BENEATH AN EXTRAORDINARY WEIGHT |
| Specialized Phoneme Model | bʌɾɪnlɛsðənfaɪvmɪnɪtsðəstɛɹkeɪsɡɹoʊndbɪniːθɐnɛkstɹɔːɹdnɛɹiweɪt |

Detailed Phoneme Sequence Comparison:

Wav2Vec2 Base with CTC Decoding:
B U T | I N | L E S S | T H A N | F I V E | M I N U T E S |...

Specialized Phoneme Model:
b ʌ ɾ ɪ n l ɛ s ð ə n f a ɪ v m ɪ n ɪ t s ð ə s t ɛ ɹ k e ɪ...
Conclusion
In this notebook, we've demonstrated how to:
- Load and process audio files for ASR
- Extract phoneme probabilities from Wav2Vec 2.0 models
- Visualize phoneme activations over time
- Decode phoneme sequences to text
- Use models specifically fine-tuned for phoneme recognition
- Compare different phoneme transcription methods
These techniques can be applied to various applications such as:
- Studying pronunciation patterns
- Developing language learning tools
- Creating more interpretable ASR systems
- Analyzing speech disorders