# Whisper
Speech recognition with Whisper in MLX. Whisper is a set of open source speech
recognition models from OpenAI, ranging from 39 million to 1.5 billion
parameters.[^1]
### Setup
First, install the dependencies:
```
pip install -r requirements.txt
```
Install [`ffmpeg`](https://ffmpeg.org/):
```
# on macOS using Homebrew (https://brew.sh/)
2023-11-30 00:17:26 +08:00
brew install ffmpeg
```
> [!TIP]
> Skip the conversion step by using pre-converted checkpoints from the Hugging
> Face Hub. There are a few available in the [MLX
> Community](https://huggingface.co/mlx-community) organization.

To convert a model, first download the Whisper PyTorch checkpoint and convert
the weights to the MLX format. For example, to convert the `tiny` model use:
```
python convert.py --torch-name-or-path tiny --mlx-path mlx_models/tiny
```
Note that you can also convert a local PyTorch checkpoint in the original
OpenAI format. To generate a 4-bit quantized model, use `-q`. For a full list
of options:
```
python convert.py --help
```
By default, the conversion script will make the directory `mlx_models`
and save the converted `weights.npz` and `config.json` there.
Each time it is run, `convert.py` will overwrite any model in the provided
path. To save different models, make sure to set `--mlx-path` to a unique
directory for each converted model. For example:
```bash
model="tiny"
python convert.py --torch-name-or-path ${model} --mlx-path mlx_models/${model}_fp16
python convert.py --torch-name-or-path ${model} --dtype float32 --mlx-path mlx_models/${model}_fp32
python convert.py --torch-name-or-path ${model} -q --q_bits 4 --mlx-path mlx_models/${model}_quantized_4bits
```
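To sanity-check a conversion, you can load the saved arrays directly with MLX.
This is just a quick sketch; the paths assume the default output layout
described above:

```python
# Quick sketch: inspect a converted checkpoint. Paths assume the
# `weights.npz` / `config.json` layout produced by convert.py above.
import json

import mlx.core as mx

weights = mx.load("mlx_models/tiny_fp16/weights.npz")
print(f"{len(weights)} arrays in the checkpoint")

with open("mlx_models/tiny_fp16/config.json") as f:
    print(json.load(f))
```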
### Run
Transcribe audio with:
2023-11-30 00:17:26 +08:00
2024-01-08 02:01:29 +08:00
```python
import whisper

speech_file = "audio.mp3"  # placeholder: path to your audio file

text = whisper.transcribe(speech_file)["text"]
```
Choose the model by setting `path_or_hf_repo`. For example:
```python
result = whisper.transcribe(speech_file, path_or_hf_repo="models/large")
```
This will load the model contained in `models/large`. The `path_or_hf_repo`
can also point to an MLX-style Whisper model on the Hugging Face Hub. In this
case, the model will be automatically downloaded.
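For example, to use one of the pre-converted checkpoints mentioned in the tip
above (the repo name below is illustrative; substitute any MLX-format Whisper
checkpoint from the Hub):

```python
# "mlx-community/whisper-tiny" is an assumed repo name for illustration; the
# model is downloaded from the Hub automatically on first use.
result = whisper.transcribe(speech_file, path_or_hf_repo="mlx-community/whisper-tiny")
```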
The `transcribe` function also supports word-level timestamps. You can generate
these with:
```python
output = whisper.transcribe(speech_file, word_timestamps=True)
print(output["segments"][0]["words"])
```
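Each entry in `words` carries the word text along with its timing. A minimal
sketch for printing them, assuming the entries use OpenAI Whisper's
`word`/`start`/`end` key names:

```python
# A sketch: print each word with its time span. Key names assume OpenAI
# Whisper's word-timestamp format.
for word in output["segments"][0]["words"]:
    print(f"{word['word']}  [{word['start']:.2f}s - {word['end']:.2f}s]")
```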
To see more transcription options use:
```
>>> help(whisper.transcribe)
```
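As one illustration of those options (the keyword arguments below follow
OpenAI's Whisper API, which this port mirrors; check the `help` output for
the authoritative list):

```python
# These keyword arguments follow OpenAI's Whisper API, which this port
# mirrors; verify against help(whisper.transcribe).
result = whisper.transcribe(
    speech_file,
    language="en",     # skip automatic language detection
    temperature=0.0,   # greedy decoding
)
```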
[^1]: Refer to the [arXiv paper](https://arxiv.org/abs/2212.04356), [blog post](https://openai.com/research/whisper), and [code](https://github.com/openai/whisper) for more details.