Simpler Distil-Whisper

Migrated from whisper_and_distil_whisper and asr evaluation on 2024-05-30.

Distil-Whisper

Simply put, Whisper is too large to deploy in many production environments; model distillation is one way to deal with this.

The paper "Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling" proposed a robust and straightforward approach to shrinking full-size Whisper. Unfortunately, the original implementation has the following problems that make it painful to use in real work:

  • It is unnecessarily coupled with many HuggingFace libraries that are not particularly easy to work with, such as datasets and huggingface_hub.
  • It contains a lot of rarely used logic, such as uploading your data or uploading your trained model... seriously, WHY WOULD I DO THAT...??
  • It is tied to data hosted on the HuggingFace Datasets platform, while I believe most MLEs/researchers who need to distill or fine-tune Whisper have their own internal datasets.

So the goal of this project is to solve the problems of the original implementation and make MLEs'/researchers' lives easier when they need to distill Whisper with their own datasets.

Design

Distilling Whisper basically involves three steps:

  • Generate a pseudo-labelled dataset from the original training dataset.
  • Initialize a distilled/pruned (student) Whisper from the full-sized model.
  • Model distillation.

Each of the above steps corresponds to a single program to run.

Config

Compared with the original implementation, we make parameters much clearer by putting all of them into a single JSON config, so:

  • There is no long list of command-line parameters; one JSON config holds everything you need.
  • Even if you don't know or care about some rarely used default parameters, you can still see that they exist, in case you need to change them in the future.
  • After each task, the JSON config is copied into the output folder, so you can always reproduce the task later.

Demo configs can be found in the root directory of this project.
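As a rough illustration of this workflow (the key names below are hypothetical placeholders; see the demo configs for the real ones), each program just loads its JSON config and copies it next to its outputs:

import json
import os
import shutil
import sys

# Hypothetical usage: python some_task.py some_task.json
config_path = sys.argv[1]
with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

# "output_dir" is a placeholder key name; the real key may differ.
output_dir = config["output_dir"]
os.makedirs(output_dir, exist_ok=True)

# Keep a copy of the config with the outputs so the task can be reproduced later.
shutil.copy(config_path, os.path.join(output_dir, os.path.basename(config_path)))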

Data

All audio datasets are just JSON Lines files.

  • Original Training Dataset: contains at least two fields, one for the text/transcript and one for the audio file path.
  • Pseudo-Labelled Training Dataset: contains at least three fields: the text/transcript generated by the original full-sized Whisper (the pseudo label), the audio file path, and the CER/WER between the pseudo label and the original text.
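For illustration only (the field names below are placeholders, not necessarily the ones the configs expect), a record in each dataset could look like this when serialized as a JSON line:

import json

# Hypothetical record of the original training dataset.
original_record = {
    "text": "today is a good day",        # reference transcript
    "audio": "/data/audio/utt_0001.wav",  # path to the audio file
}

# Hypothetical record of the pseudo-labelled training dataset.
pseudo_record = {
    "text": "today is a good day",        # pseudo label produced by the full-sized Whisper
    "audio": "/data/audio/utt_0001.wav",  # path to the audio file
    "cer": 0.0,                           # CER/WER between pseudo label and original text
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(original_record, ensure_ascii=False) + "\n")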

Step-By-Step

Preparation

  • A directory containing a pre-trained or fine-tuned HuggingFace Whisper model.
  • JSON Lines audio training, dev, and test datasets.
  • Build Python environment:
conda create -p ./_venv python=3.10
conda activate ./_venv
# conda deactivate

# or 

python3 -m venv ./_venv --copies
source ./_venv/bin/activate
# deactivate


pip install -r ./requirements.txt

Pseudo Labelling

python ./run_pseudo_labelling.py ./run_pseudo_labelling.json
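Conceptually, this step runs the full-sized (teacher) Whisper over every training audio file and stores the generated transcript as the pseudo label. A minimal sketch with the HuggingFace transformers pipeline (paths and field names are placeholders, not the repo's actual code):

import json
from transformers import pipeline

# "./whisper-small" is a placeholder for your pre-trained/fine-tuned teacher model directory.
asr = pipeline("automatic-speech-recognition", model="./whisper-small")

with open("train.jsonl") as fin, open("train_pseudo.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        record = json.loads(line)
        # Decoding an audio file path with the pipeline requires ffmpeg.
        record["pseudo_text"] = asr(record["audio"])["text"]
        fout.write(json.dumps(record, ensure_ascii=False) + "\n")

The CER/WER between the pseudo label and the original transcript (the third field in the pseudo-labelled dataset) can then be used to filter out low-quality pseudo labels before distillation.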

Model Pruning

python run_student_model_init.py run_student_model_init.json
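This step builds the student model. Following the Distil-Whisper paper, the student keeps the teacher's full encoder and gets a much shallower decoder initialized from maximally spaced teacher decoder layers. A rough sketch with transformers (paths and the number of kept layers are illustrative):

import copy
import torch
from transformers import WhisperForConditionalGeneration

# Placeholder paths for the teacher model and the initialized student.
teacher = WhisperForConditionalGeneration.from_pretrained("./whisper-small")
student = copy.deepcopy(teacher)

# Keep the full encoder; shrink the decoder to two layers taken from the
# first and last decoder layers of the teacher.
keep = [0, teacher.config.decoder_layers - 1]
student.model.decoder.layers = torch.nn.ModuleList(
    [copy.deepcopy(teacher.model.decoder.layers[i]) for i in keep]
)
student.config.decoder_layers = len(keep)
student.save_pretrained("./distil-whisper-small-init")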

Model Distillation

python ./run_distillation.py ./run_distillation.json
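During distillation the student is trained on the pseudo labels while also matching the teacher's output distribution. A sketch of the kind of loss used (the weights and temperature below are illustrative, not the repo's defaults):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      ce_weight=0.8, kl_weight=1.0, temperature=2.0):
    # Cross-entropy of the student against the pseudo labels (padding masked with -100).
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    # KL divergence between the student and teacher token distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return ce_weight * ce + kl_weight * kl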

Offline Inference

python ./offline_inference.py ./offline_inference.json 
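A minimal sketch of what offline inference with the distilled model looks like, using the transformers processor and generate API (paths are placeholders; librosa is only used here to load 16 kHz audio):

import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_dir = "./distil-whisper-small"  # placeholder for the distilled model directory
processor = WhisperProcessor.from_pretrained(model_dir)
model = WhisperForConditionalGeneration.from_pretrained(model_dir)

# Whisper expects 16 kHz audio.
audio, _ = librosa.load("/data/audio/utt_0001.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(inputs.input_features)

print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])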

Evaluation

python eval.py eval.json
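Evaluation boils down to computing CER/WER between the references and the model's transcripts. A sketch with jiwer (file and field names are placeholders):

import json
import jiwer

references, hypotheses = [], []
with open("eval_results.jsonl") as f:
    for line in f:
        record = json.loads(line)
        references.append(record["text"])       # reference transcript
        hypotheses.append(record["pred_text"])  # model transcript

# Corpus-level character error rate over the whole test set.
print("CER:", jiwer.cer(references, hypotheses))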

Notes

I did this on internal datasets that only contain Mandarin and Hokkien. So far the performance can be understood as reproducing the original paper:

  • Full-sized Whisper (Small) CER on Mixed Full Test Data: 0.3033
  • Distil-Whisper (Small) CER on Mixed Full Test Data: 0.3039