mirror of https://github.com/ml-explore/mlx-examples.git (synced 2025-09-01 04:14:38 +08:00)

Whisper: Add pip distribution configuration to support pip installations. (#739)

* Whisper: rename whisper to mlx_whisper
* Whisper: add setup.py config for publish
* Whisper: add assets data to setup config
* Whisper: pre-commit for setup.py
* Whisper: Update README.md
* nits
* fix package data
* nit in readme

Co-authored-by: Awni Hannun <awni@apple.com>
…parameters.[^1]

### Setup

Install [`ffmpeg`](https://ffmpeg.org/):

```
brew install ffmpeg
```

Install the `mlx-whisper` package with:

```
pip install mlx-whisper
```
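
A quick way to sanity-check the install is to import the package from Python
(a minimal sketch; this check is an assumption, not a step from the original
instructions):

```python
# If the pip install succeeded, this import completes without error.
import mlx_whisper
```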

### Run

Transcribe audio with:

```python
import mlx_whisper

text = mlx_whisper.transcribe(speech_file)["text"]
```

The default model is "mlx-community/whisper-tiny". Choose the model by setting
`path_or_hf_repo`. For example:

```python
result = mlx_whisper.transcribe(speech_file, path_or_hf_repo="models/large")
```

This will load the model contained in `models/large`. The `path_or_hf_repo` can
also point to an MLX-style Whisper model on the Hugging Face Hub. In this case,
the model will be automatically downloaded. A [collection of pre-converted
Whisper models](https://huggingface.co/collections/mlx-community/whisper-663256f9964fbb1177db93dc)
is available in the Hugging Face MLX Community.
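
For instance, pointing `path_or_hf_repo` at a repo id from that collection
downloads the converted weights on first use. A minimal sketch, assuming a
placeholder audio path (`audio.mp3`) and reusing the default repo id noted
above:

```python
import mlx_whisper

speech_file = "audio.mp3"  # placeholder; any audio file ffmpeg can decode

# "mlx-community/whisper-tiny" is the default repo id; any MLX-style Whisper
# repo on the Hugging Face Hub can be passed the same way and is downloaded
# automatically on first use.
result = mlx_whisper.transcribe(
    speech_file, path_or_hf_repo="mlx-community/whisper-tiny"
)
print(result["text"])
```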

The `transcribe` function also supports word-level timestamps. You can generate
these with:

```python
output = mlx_whisper.transcribe(speech_file, word_timestamps=True)
print(output["segments"][0]["words"])
```
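
Each entry in that `words` list is a per-word dict. A sketch of printing a
simple aligned transcript from it, assuming the upstream Whisper schema of
`"word"`, `"start"`, and `"end"` keys (an assumption, with times in seconds):

```python
output = mlx_whisper.transcribe(speech_file, word_timestamps=True)

# Walk every segment and print each word with its start/end times.
for segment in output["segments"]:
    for w in segment["words"]:
        print(f"{w['start']:7.2f}s {w['end']:7.2f}s  {w['word']}")
```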

To see more transcription options use:

```
>>> help(mlx_whisper.transcribe)
```

### Converting models

> [!TIP]
> Skip the conversion step by using pre-converted checkpoints from the Hugging
> Face Hub. There are a few available in the [MLX
> Community](https://huggingface.co/mlx-community) organization.

To convert a model, first clone the MLX Examples repo:

```
git clone https://github.com/ml-explore/mlx-examples.git
```

Then run `convert.py` from `mlx-examples/whisper`. For example, to convert the
`tiny` model use:

```
python convert.py --torch-name-or-path tiny --mlx-path mlx_models/tiny
```
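
Once converted, the output directory can be passed to `path_or_hf_repo` just
like a Hub repo id. A minimal sketch, assuming the `mlx_models/tiny` path
written by the command above and a placeholder audio file:

```python
import mlx_whisper

# Load the locally converted model instead of downloading from the Hub.
result = mlx_whisper.transcribe("audio.mp3", path_or_hf_repo="mlx_models/tiny")
print(result["text"])
```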

Note you can also convert a local PyTorch checkpoint which is in the original
OpenAI format.

To generate a 4-bit quantized model, use `-q`. For a full list of options, run
`python convert.py --help`. For example:

```
python convert.py --torch-name-or-path ${model} -q --q_bits 4 --mlx-path mlx_models/${model}_quantized_4bits
```

[^1]: Refer to the [arXiv paper](https://arxiv.org/abs/2212.04356), [blog post](https://openai.com/research/whisper), and [code](https://github.com/openai/whisper) for more details.