PyTorch ASR Phoneme Extraction
This notebook demonstrates how to extract phoneme representations from speech using PyTorch and pre-trained ASR models.
Setup
First, let's install the required libraries if they're not already installed.
# Run this cell if you need to install the packages
!pip install torch torchaudio transformers matplotlib numpy soundfile librosa
Requirement already satisfied: torch, torchaudio, transformers, matplotlib, numpy, soundfile, librosa, and their dependencies (pip output truncated).
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import matplotlib.pyplot as plt
import numpy as np
import IPython.display as ipd
import os
import subprocess
# Try to install required packages for audio processing
try:
    import librosa
    import soundfile as sf
except ImportError:
    print("Installing librosa and soundfile for audio processing...")
    !pip install librosa soundfile
    import librosa
    import soundfile as sf
# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
Using device: cuda
Download Sample Audio
Let's download a sample audio file to work with.
# Download and extract a sample audio file from LibriSpeech
import os
import tarfile
import tempfile
from urllib.request import urlretrieve
import shutil

sample_dir = "sample_data"
os.makedirs(sample_dir, exist_ok=True)

# Target audio file paths - we'll create both FLAC and WAV versions
flac_path = os.path.join(sample_dir, "sample_audio.flac")
wav_path = os.path.join(sample_dir, "sample_audio.wav")

# Check which files exist and set the audio path accordingly
flac_exists = os.path.exists(flac_path)
wav_exists = os.path.exists(wav_path)

# Prefer WAV if it exists, otherwise use FLAC if it exists
if wav_exists:
    audio_path = wav_path
    print(f"Using existing WAV file: {wav_path}")
elif flac_exists:
    audio_path = flac_path
    print(f"Using existing FLAC file: {flac_path}")
    # Try to convert to WAV if FLAC exists but WAV doesn't
    print("Converting FLAC to WAV format for better compatibility...")
    try:
        # Load the audio file with librosa
        audio_data, sample_rate = librosa.load(flac_path, sr=None)
        # Save as WAV using soundfile
        sf.write(wav_path, audio_data, sample_rate)
        print(f"Converted audio saved to {wav_path}")
        audio_path = wav_path  # Use the newly created WAV file
    except Exception as e:
        print(f"Error converting audio: {e}")
        print("Using original FLAC file instead.")
else:
    # Neither file exists, need to download
    print("Sample audio not found. Downloading and extracting from archive...")

    # Download the tarball
    tarball_url = "https://openslr.elda.org/resources/12/dev-clean.tar.gz"
    with tempfile.NamedTemporaryFile(suffix=".tar.gz", delete=False) as temp_file:
        print(f"Downloading archive from {tarball_url}...")
        urlretrieve(tarball_url, temp_file.name)
        tarball_path = temp_file.name

    # Extract the specific file we need
    target_file_path = "LibriSpeech/dev-clean/84/121123/84-121123-0001.flac"
    with tempfile.TemporaryDirectory() as temp_dir:
        print("Extracting archive...")
        with tarfile.open(tarball_path, "r:gz") as tar:
            # Extract only the file we need
            member = tar.getmember(target_file_path)
            tar.extract(member, path=temp_dir)
        # Move the extracted file to our sample directory
        extracted_file = os.path.join(temp_dir, target_file_path)
        shutil.copy(extracted_file, flac_path)

    # Clean up the tarball
    os.unlink(tarball_path)
    print(f"Sample audio extracted to {flac_path}")

    # Convert FLAC to WAV using librosa
    print("Converting FLAC to WAV format...")
    try:
        # Load the audio file with librosa
        audio_data, sample_rate = librosa.load(flac_path, sr=None)
        # Save as WAV using soundfile
        sf.write(wav_path, audio_data, sample_rate)
        print(f"Converted audio saved to {wav_path}")
        audio_path = wav_path  # Use the WAV file
    except Exception as e:
        print(f"Error converting audio: {e}")
        print("Using original FLAC file instead.")
        audio_path = flac_path

print(f"Using audio file: {audio_path}")
Using existing WAV file: sample_data/sample_audio.wav
Using audio file: sample_data/sample_audio.wav
Loading and Processing Audio
We define a small helper that loads the audio, resamples it to 16 kHz (the rate Wav2Vec 2.0 expects), and mixes multi-channel audio down to mono, falling back to librosa if torchaudio cannot read the file.
def process_audio(file_path):
    # Load audio using an alternative method if torchaudio fails
    try:
        # Try torchaudio first
        waveform, sample_rate = torchaudio.load(file_path)
    except RuntimeError:
        # Fall back to using librosa
        print(f"torchaudio failed to load {file_path}, trying librosa instead...")
        # Load with librosa (automatically handles various formats including FLAC)
        audio_data, sample_rate = librosa.load(file_path, sr=None)
        waveform = torch.from_numpy(audio_data).unsqueeze(0).float()
        print("Successfully loaded audio with librosa")

    # Resample to 16 kHz if needed
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        waveform = resampler(waveform)
        sample_rate = 16000

    # Convert to mono if needed
    if waveform.shape[0] > 1:
        waveform = torch.mean(waveform, dim=0, keepdim=True)

    return waveform.squeeze(), sample_rate
# Load and process the audio
waveform, sample_rate = process_audio(audio_path)
# Display audio information
print(f"Sample rate: {sample_rate} Hz")
print(f"Waveform shape: {waveform.shape}")
print(f"Audio duration: {waveform.shape[0]/sample_rate:.2f} seconds")
# Play the audio
ipd.Audio(waveform.numpy(), rate=sample_rate)
Sample rate: 16000 Hz
Waveform shape: torch.Size([63840])
Audio duration: 3.99 seconds
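As a quick sanity check before running the model, we can plot the waveform we just loaded. This is a small optional sketch that only uses matplotlib and numpy, which are already imported above.
# Plot the loaded waveform against a time axis in seconds
time_axis = np.arange(waveform.shape[0]) / sample_rate
plt.figure(figsize=(12, 3))
plt.plot(time_axis, waveform.numpy())
plt.xlabel("Time (seconds)")
plt.ylabel("Amplitude")
plt.title("Loaded waveform")
plt.tight_layout()
plt.show()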
Load Pre-trained ASR Model
We'll use the Wav2Vec 2.0 model from Facebook (facebook/wav2vec2-base-960h), which was pre-trained on unlabeled speech and then fine-tuned with a character-level CTC head on 960 hours of LibriSpeech.
# Load pre-trained model and processor
model_name = "facebook/wav2vec2-base-960h"
print(f"Loading model: {model_name}")
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
print("Model loaded successfully!")
Loading model: facebook/wav2vec2-base-960h
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Model loaded successfully!
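Before running inference it can help to look at the CTC head's output vocabulary. The sketch below uses the tokenizer's get_vocab() method; for this checkpoint the entries are characters, the word delimiter '|', and special tokens such as <pad>, which also serves as the CTC blank.
# Inspect the model's output vocabulary (token -> id, then inverted to id -> token)
vocab = processor.tokenizer.get_vocab()
id_to_token = {i: t for t, i in vocab.items()}
print(f"Vocabulary size: {len(id_to_token)}")
print([id_to_token[i] for i in sorted(id_to_token)])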
Extracting Phoneme Probabilities
Now we'll take the model's logits and convert them with a softmax into per-frame probabilities over the output vocabulary. Note that for facebook/wav2vec2-base-960h this vocabulary is character-level (letters, the word delimiter '|', and special tokens such as the CTC blank), so these activations are really grapheme probabilities; a model that predicts true phonemes is used later in the notebook.
def extract_phoneme_probs(waveform, sample_rate=16000):
    # Process audio for model input
    input_values = processor(waveform, sampling_rate=sample_rate, return_tensors="pt").input_values
    input_values = input_values.to(device)

    # Get model outputs (without gradient calculation)
    with torch.no_grad():
        outputs = model(input_values)
        logits = outputs.logits

    # Convert logits to probabilities
    probs = torch.nn.functional.softmax(logits, dim=-1)

    return probs.cpu().squeeze(), processor.tokenizer.decoder
# Get phoneme probabilities
phoneme_probs, decoder = extract_phoneme_probs(waveform)
print(f"Shape of phoneme probabilities: {phoneme_probs.shape}")
print(f"Number of time steps: {phoneme_probs.shape[0]}")
print(f"Number of phoneme classes: {phoneme_probs.shape[1]}")
Shape of phoneme probabilities: torch.Size([199, 32])
Number of time steps: 199
Number of phoneme classes: 32
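The Wav2Vec 2.0 feature encoder emits one frame roughly every 20 ms of audio (about 50 frames per second). We can check this empirically by dividing the audio duration by the number of output frames; this estimate is what the plotting code below assumes.
# Estimate the effective frame stride from the audio length and the number of output frames
num_frames = phoneme_probs.shape[0]
duration_s = waveform.shape[0] / sample_rate
frame_stride_s = duration_s / num_frames
print(f"Frames: {num_frames}, duration: {duration_s:.2f} s")
print(f"Approximate stride: {frame_stride_s * 1000:.1f} ms per frame (~{1 / frame_stride_s:.0f} frames per second)")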
Visualizing Phoneme Activations
Let's visualize the top phoneme activations over time.
def plot_phoneme_activations(probs, decoder, top_k=5):
    # Get top-k phonemes at each time step
    top_probs, top_indices = torch.topk(probs, k=top_k, dim=1)

    # Convert to numpy for plotting
    top_probs = top_probs.numpy()
    top_indices = top_indices.numpy()

    # Create a time axis (assuming ~50 frames per second for Wav2Vec 2.0)
    time_steps = np.arange(top_probs.shape[0]) / 50

    # Plot
    plt.figure(figsize=(15, 8))

    # Plot a subset of time steps for clarity
    start_idx = 0
    end_idx = min(200, len(time_steps))  # Show the first 4 seconds or less

    for i in range(top_k):
        # decoder maps class IDs directly to token strings; label each line with
        # the i-th ranked class at the first frame (the ranked class can change over time)
        class_id = int(top_indices[0, i])
        plt.plot(time_steps[start_idx:end_idx],
                 top_probs[start_idx:end_idx, i],
                 label=f"Class {class_id} ({decoder.get(class_id, '')})")

    plt.xlabel("Time (seconds)")
    plt.ylabel("Probability")
    plt.title("Top Phoneme Activations Over Time")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
# Visualize phoneme activations
plot_phoneme_activations(phoneme_probs, decoder)
Decoding to Phonemes and Text
Now let's decode the model outputs to both phonemes and text.
def decode_outputs(probs, decoder):
    # Get the most likely phoneme at each time step
    pred_ids = torch.argmax(probs, dim=-1)

    # Decode to phonemes (keeping all predictions)
    phoneme_sequence = [decoder.get(id.item(), f"[{id.item()}]") for id in pred_ids]

    # Apply CTC decoding logic (collapse repeated tokens and remove blanks)
    collapsed_phonemes = []
    prev_id = -1
    for id in pred_ids:
        if id != prev_id and id != 0:  # 0 is usually the blank token in CTC
            collapsed_phonemes.append(decoder.get(id.item(), f"[{id.item()}]"))
        prev_id = id

    # Join phonemes to get the text
    text = ''.join(collapsed_phonemes).replace('|', ' ')

    return phoneme_sequence, collapsed_phonemes, text
# Decode outputs
phoneme_sequence, collapsed_phonemes, text = decode_outputs(phoneme_probs, decoder)
print("Full phoneme sequence (first 50 frames):")
print(phoneme_sequence[:50])
print("\nCollapsed phoneme sequence:")
print(collapsed_phonemes)
print("\nDecoded text:")
print(text)
Full phoneme sequence (first 50 frames):
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'B', '<pad>', 'U', 'T', '<pad>', '|', '<pad>', 'I', 'N', '|', '|', '|', 'L', '<pad>', 'E', 'S', '<pad>', '<pad>', 'S', '|', '|', 'T', 'H', '<pad>', 'A', 'N', '<pad>', '|', '|', '<pad>', 'F', '<pad>', '<pad>', '<pad>']

Collapsed phoneme sequence:
['B', 'U', 'T', '|', 'I', 'N', '|', 'L', 'E', 'S', 'S', '|', 'T', 'H', 'A', 'N', '|', 'F', 'I', 'V', 'E', '|', 'M', 'I', 'N', 'U', 'T', 'E', 'S', '|', 'T', 'H', 'E', '|', 'S', 'T', 'A', 'I', 'R', 'C', 'A', 'S', 'E', '|', 'G', 'R', 'O', 'A', 'N', 'E', 'D', '|', 'B', 'E', 'N', 'E', 'A', 'T', 'H', '|', 'A', 'N', '|', 'E', 'X', 'T', 'R', 'A', 'O', 'R', 'D', 'I', 'N', 'A', 'R', 'Y', '|', 'W', 'E', 'I', 'G', 'H', 'T', '|']

Decoded text:
BUT IN LESS THAN FIVE MINUTES THE STAIRCASE GROANED BENEATH AN EXTRAORDINARY WEIGHT
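Because each frame covers a roughly fixed stride (about 20 ms, as estimated earlier), we can also attach rough start times to the collapsed tokens. The sketch below is only an approximation based on the frame at which each emitted token first appears, not a forced alignment; it reuses phoneme_probs, waveform, and decoder from the cells above and looks up the blank id from the tokenizer instead of hard-coding 0.
# Attach approximate start times (in seconds) to the CTC-collapsed tokens
frame_stride_s = (waveform.shape[0] / sample_rate) / phoneme_probs.shape[0]
blank_id = processor.tokenizer.pad_token_id  # the pad token doubles as the CTC blank
pred_ids = torch.argmax(phoneme_probs, dim=-1).tolist()
timed_tokens = []
prev_id = None
for frame_idx, token_id in enumerate(pred_ids):
    if token_id != prev_id and token_id != blank_id:
        timed_tokens.append((frame_idx * frame_stride_s, decoder.get(token_id, f"[{token_id}]")))
    prev_id = token_id
for start_s, token in timed_tokens[:15]:
    print(f"{start_s:5.2f}s  {token}")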
Using a Model Fine-tuned for Phoneme Recognition
For a more direct approach, we can use a checkpoint fine-tuned specifically for phoneme recognition, which outputs IPA phoneme symbols instead of characters.
# Load a model fine-tuned for phoneme recognition
# Note: This will download a different model
phoneme_model_name = "facebook/wav2vec2-lv-60-espeak-cv-ft"
print(f"Loading phoneme model: {phoneme_model_name}")
try:
    # Import the tokenizer and feature extractor classes for this model
    from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor

    # Load the model components separately
    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(phoneme_model_name)
    tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(phoneme_model_name)
    phoneme_processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
    phoneme_model = Wav2Vec2ForCTC.from_pretrained(phoneme_model_name).to(device)
    print("Phoneme model loaded successfully!")

    def transcribe_to_phonemes(waveform, sample_rate=16000):
        # Process audio for model input
        input_values = phoneme_processor(waveform, sampling_rate=sample_rate, return_tensors="pt").input_values
        input_values = input_values.to(device)

        # Get model predictions
        with torch.no_grad():
            logits = phoneme_model(input_values).logits

        # Decode phonemes
        predicted_ids = torch.argmax(logits, dim=-1)
        phoneme_string = phoneme_processor.batch_decode(predicted_ids)[0]
        return phoneme_string

    # Get phoneme transcription
    phoneme_transcription = transcribe_to_phonemes(waveform)
    print("\nPhoneme transcription:")
    print(phoneme_transcription)

except Exception as e:
    print(f"Error loading phoneme model: {e}")
    print("Skipping phoneme-specific model demonstration.")
Loading phoneme model: facebook/wav2vec2-lv-60-espeak-cv-ft
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'Wav2Vec2PhonemeCTCTokenizer'. The class this function is called from is 'Wav2Vec2CTCTokenizer'.
Phoneme model loaded successfully!
Phoneme transcription:
bʌɾɪnlɛsðənfaɪvmɪnɪtsðəstɛɹkeɪsɡɹoʊndbɪniːθɐnɛkstɹɔːɹdnɛɹiweɪt
Analyzing Phoneme Distributions
Let's analyze the distribution of decoded tokens in our sample (note that for the base model these are characters and the word delimiter '|', not true phonemes).
# Count phoneme occurrences
from collections import Counter
# Count non-blank phonemes
phoneme_counts = Counter([p for p in collapsed_phonemes if p != ''])
# Plot top 15 phonemes
top_phonemes = phoneme_counts.most_common(15)
phonemes, counts = zip(*top_phonemes)
plt.figure(figsize=(12, 6))
plt.bar(phonemes, counts)
plt.title('Top 15 Phonemes in Sample')
plt.xlabel('Phoneme')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
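If the specialized phoneme model loaded successfully above, we can run the same kind of frequency analysis on its IPA output. Note that this sketch splits the transcription into individual Unicode characters, so multi-character phonemes such as 'oʊ' or 'iː' are counted as separate symbols.
# Count IPA symbols from the specialized model's transcription, if it is available
try:
    ipa_counts = Counter(ch for ch in phoneme_transcription if ch.strip())
    ipa_symbols, ipa_freqs = zip(*ipa_counts.most_common(15))
    plt.figure(figsize=(12, 6))
    plt.bar(ipa_symbols, ipa_freqs)
    plt.title('Top 15 IPA Symbols (Specialized Phoneme Model)')
    plt.xlabel('Symbol')
    plt.ylabel('Count')
    plt.tight_layout()
    plt.show()
except NameError:
    print("Specialized phoneme transcription not available; skipping IPA analysis.")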
Comparing Phoneme Transcriptions
Let's compare the different phoneme transcriptions side by side to see the differences between methods.
# Create a comparison of the different phoneme transcriptions
import pandas as pd
from IPython.display import display, HTML
# Store the transcriptions in variables for comparison
# Note: These will be populated when the cells above are run
ctc_text = text # From the CTC decoding section
try:
    specialized_text = phoneme_transcription  # From the specialized model section
except NameError:
    specialized_text = "[Model failed to load]"

# Create a DataFrame for comparison
comparison_df = pd.DataFrame({
    'Method': ['Wav2Vec2 Base with CTC Decoding', 'Specialized Phoneme Model'],
    'Transcription': [ctc_text, specialized_text]
})
# Display the comparison
display(HTML(comparison_df.to_html(index=False)))
# Also show a more detailed comparison of the phoneme sequences
print("\nDetailed Phoneme Sequence Comparison:")
print("\nWav2Vec2 Base with CTC Decoding:")
print(' '.join(collapsed_phonemes[:30]) + "...")
try:
    # For the specialized model, we might need to split the string into individual phonemes
    if isinstance(specialized_text, str):
        specialized_phonemes = list(specialized_text.replace(" ", "|SPACE|"))
        print("\nSpecialized Phoneme Model:")
        print(' '.join(specialized_phonemes[:30]) + "...")
except Exception as e:
    print(f"\nCould not process specialized phonemes: {e}")
| Method | Transcription |
|---|---|
| Wav2Vec2 Base with CTC Decoding | BUT IN LESS THAN FIVE MINUTES THE STAIRCASE GROANED BENEATH AN EXTRAORDINARY WEIGHT |
| Specialized Phoneme Model | bʌɾɪnlɛsðənfaɪvmɪnɪtsðəstɛɹkeɪsɡɹoʊndbɪniːθɐnɛkstɹɔːɹdnɛɹiweɪt |

Detailed Phoneme Sequence Comparison:

Wav2Vec2 Base with CTC Decoding:
B U T | I N | L E S S | T H A N | F I V E | M I N U T E S |...

Specialized Phoneme Model:
b ʌ ɾ ɪ n l ɛ s ð ə n f a ɪ v m ɪ n ɪ t s ð ə s t ɛ ɹ k e ɪ...
Conclusion
In this notebook, we've demonstrated how to:
- Load and process audio files for ASR
- Extract phoneme probabilities from Wav2Vec 2.0 models
- Visualize phoneme activations over time
- Decode phoneme sequences to text
- Use models specifically fine-tuned for phoneme recognition
- Compare different phoneme transcription methods
These techniques can be applied to various applications such as:
- Studying pronunciation patterns
- Developing language learning tools
- Creating more interpretable ASR systems
- Analyzing speech disorders