
# Whisper

Speech recognition with Whisper in MLX. Whisper is a set of open-source speech
recognition models from OpenAI, ranging from 39 million to 1.5 billion
parameters.[^1]

### Setup

First, install the dependencies:

```
pip install -r requirements.txt
```

Install [`ffmpeg`](https://ffmpeg.org/):

```
# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg
```
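
If you are unsure whether `ffmpeg` ended up on your `PATH`, a quick check like the following can help. This is only a sketch using the Python standard library; the helper name `ffmpeg_available` is hypothetical and not part of this example's code.

```python
import shutil


def ffmpeg_available(executable: str = "ffmpeg") -> bool:
    """Return True if `executable` can be found on the current PATH."""
    # shutil.which returns the full path to the executable, or None if absent.
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found")
    else:
        print("ffmpeg not found; install it before transcribing audio")
```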

Next, download the Whisper PyTorch checkpoint and convert the weights to the
MLX format. For example, to convert the `tiny` model use:

```
python convert.py --torch-name-or-path tiny --mlx-path mlx_models/tiny
```

You can also convert a local PyTorch checkpoint, provided it is in the original OpenAI format.

To generate a 4-bit quantized model, pass `-q`. For a full list of options, run:

```
python convert.py --help
```

By default, the conversion script creates the directory `mlx_models/tiny` and saves
the converted `weights.npz` and `config.json` there.
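
To confirm that a conversion produced the two files named above, a small check along these lines may be useful. It is a sketch that relies only on the standard library; the helper names `missing_files` and `load_config` are hypothetical, and the contents of `config.json` are not assumed beyond it being valid JSON.

```python
import json
from pathlib import Path

# File names taken from the description of the conversion output above.
EXPECTED_FILES = ("weights.npz", "config.json")


def missing_files(mlx_path):
    """Return the expected file names that are absent from `mlx_path`."""
    root = Path(mlx_path)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def load_config(mlx_path):
    """Parse and return `config.json` from a converted model directory."""
    with open(Path(mlx_path) / "config.json") as f:
        return json.load(f)


if __name__ == "__main__":
    path = "mlx_models/tiny"
    missing = missing_files(path)
    if missing:
        print(f"{path} is missing: {', '.join(missing)}")
    else:
        print(load_config(path))
```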

> [!TIP]
> Alternatively, you can download a few converted checkpoints from the
> [MLX Community](https://huggingface.co/mlx-community) organization on Hugging
> Face and skip the conversion step.

### Run

Transcribe audio with:

```python
import whisper

text = whisper.transcribe(speech_file)["text"]
```

The `transcribe` function also supports word-level timestamps. You can generate
these with:

```python
output = whisper.transcribe(speech_file, word_timestamps=True)
print(output["segments"][0]["words"])
```
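
The word entries can then be turned into readable lines. The sketch below assumes each entry is a dict with `word`, `start`, and `end` keys, with times in seconds; that layout is an assumption borrowed from the original OpenAI implementation, so inspect the printed output above before relying on it. The helper name `format_words` and the sample data are made up for illustration.

```python
def format_words(words):
    """Format word entries as '[start -> end] word' lines.

    Assumes each entry has 'word', 'start', and 'end' keys (an assumption,
    not confirmed by this README).
    """
    lines = []
    for entry in words:
        text = entry["word"].strip()  # drop any leading or trailing whitespace
        lines.append(f"[{entry['start']:.2f} -> {entry['end']:.2f}] {text}")
    return lines


if __name__ == "__main__":
    # Made-up sample data standing in for output["segments"][0]["words"].
    sample = [
        {"word": " Hello", "start": 0.0, "end": 0.5},
        {"word": " world", "start": 0.5, "end": 1.25},
    ]
    for line in format_words(sample):
        print(line)
```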

To see more transcription options, use:

```
>>> help(whisper.transcribe)
```

[^1]: Refer to the [arXiv paper](https://arxiv.org/abs/2212.04356), [blog post](https://openai.com/research/whisper), and [code](https://github.com/openai/whisper) for more details.