# Whisper

Speech recognition with Whisper in MLX. Whisper is a set of open-source speech
recognition models from OpenAI, ranging from 39 million to 1.5 billion
parameters.[^1]

### Setup

First, install the dependencies:

```
pip install -r requirements.txt
```

Install [`ffmpeg`](https://ffmpeg.org/):

```
# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg
```

> [!TIP]
> Skip the conversion step by using pre-converted checkpoints from the Hugging
> Face Hub. There are a few available in the [MLX
> Community](https://huggingface.co/mlx-community) organization.

To convert a model, first download the Whisper PyTorch checkpoint and convert
the weights to the MLX format. For example, to convert the `tiny` model use:

```
python convert.py --torch-name-or-path tiny --mlx-path mlx_models/tiny
```

You can also convert a local PyTorch checkpoint that is in the original OpenAI format.

To generate a 4-bit quantized model, use `-q`. For a full list of options, run:

```
python convert.py --help
```

By default, the conversion script creates the directory `mlx_models/tiny`
and saves the converted `weights.npz` and `config.json` there.
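
To confirm that a conversion finished, you can check the output directory for
the two files named above. The snippet below is a minimal sketch, not part of
this example's code: `missing_model_files` is a hypothetical helper, and the
file names are assumed to match the description above.

```python
from pathlib import Path

# File names taken from the description above; adjust them if your version
# of convert.py writes different files.
EXPECTED_FILES = ("weights.npz", "config.json")


def missing_model_files(mlx_path):
    """Return the expected files that are absent from `mlx_path`."""
    model_dir = Path(mlx_path)
    return [name for name in EXPECTED_FILES if not (model_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("mlx_models/tiny")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Conversion output looks complete.")
```

An empty list means both files are present.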

### Run

Transcribe audio with:

```python
import whisper

# Placeholder path: point this at your own audio file
speech_file = "path/to/audio.wav"
text = whisper.transcribe(speech_file)["text"]
```

Choose the model by setting `path_or_hf_repo`. For example:

```python
result = whisper.transcribe(speech_file, path_or_hf_repo="models/large")
```

This loads the model contained in `models/large`. The `path_or_hf_repo`
argument can also point to an MLX-style Whisper model on the Hugging Face Hub.
In this case, the model is downloaded automatically.
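
Because the same argument accepts either a local path or a Hub repo, a script
can prefer a locally converted model and fall back to the Hub otherwise. The
following is a rough sketch under stated assumptions: `choose_model` is a
hypothetical helper, and both the directory and the repo name are placeholders
to replace with your own values.

```python
from pathlib import Path


def choose_model(local_dir, hub_repo):
    """Prefer a local converted model; otherwise fall back to a Hub repo ID."""
    if Path(local_dir).is_dir():
        return str(local_dir)
    return hub_repo


# Both values are placeholders: use your own converted directory and a repo
# name taken from the MLX Community organization.
model = choose_model("mlx_models/tiny", "mlx-community/whisper-tiny")

# The chosen value can then be passed to transcribe, for example:
# result = whisper.transcribe(speech_file, path_or_hf_repo=model)
```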

The `transcribe` function also supports word-level timestamps. You can generate
these with:

```python
output = whisper.transcribe(speech_file, word_timestamps=True)
print(output["segments"][0]["words"])
```
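
To print every word with its timing, you can walk all segments rather than
only the first one. This sketch assumes each word entry is a dictionary with
`word`, `start`, and `end` keys, as in the OpenAI reference implementation;
check the actual output if your version differs. `format_words` and the sample
data are illustrative only.

```python
def format_words(result):
    """Build one 'start - end: word' line per word in a transcription result."""
    lines = []
    for segment in result["segments"]:
        # "words" is only expected when word_timestamps=True was requested.
        for word in segment.get("words", []):
            text = word["word"].strip()
            lines.append(f"{word['start']:6.2f} - {word['end']:6.2f}: {text}")
    return lines


# Hand-written sample with the assumed layout, so the sketch runs without audio.
sample = {
    "segments": [
        {
            "words": [
                {"word": " Hello", "start": 0.0, "end": 0.42},
                {"word": " world", "start": 0.42, "end": 0.9},
            ]
        }
    ]
}
print("\n".join(format_words(sample)))
```

With a real transcription, pass the `output` dictionary from the call above in
place of `sample`.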

To see more transcription options, use:

```
>>> help(whisper.transcribe)
```

[^1]: Refer to the [arXiv paper](https://arxiv.org/abs/2212.04356), [blog post](https://openai.com/research/whisper), and [code](https://github.com/openai/whisper) for more details.