# Training a Keyword Spotting Transformer on Speech Commands

An example of training a Keyword Spotting Transformer[^1] on the Speech
Commands dataset[^2] with MLX. All supervised-only configurations from the
paper are available. The example also illustrates how to use [MLX
Data](https://github.com/ml-explore/mlx-data) to load and process an audio
dataset.

## Pre-requisites

Follow the [installation
instructions](https://ml-explore.github.io/mlx-data/build/html/install.html)
for MLX Data.

Install the remaining Python requirements:

```
pip install -r requirements.txt
```

## Running the example

Run the example with:

```
python main.py
```

By default the example runs on the GPU. To run it on the CPU, use:

```
python main.py --cpu
```

For all available options, run:

```
python main.py --help
```

## Results

After training with the `kwt1` architecture for 10 epochs, you
should see the following results:

```
Epoch: 9 | avg. Train loss 0.519 | avg. Train acc 0.857 | Throughput: 661.28 samples/sec
Epoch: 9 | Val acc 0.861 | Throughput: 2976.54 samples/sec
Testing best model from epoch 9
Test acc -> 0.841
```

For the `kwt2` model, you should see:

```
Epoch: 9 | avg. Train loss 0.374 | avg. Train acc 0.895 | Throughput: 395.26 samples/sec
Epoch: 9 | Val acc 0.879 | Throughput: 1542.44 samples/sec
Testing best model from epoch 9
Test acc -> 0.861
```

These results were obtained on an M1 MacBook Pro with 16 GB of RAM.

At the time of writing, `mlx` doesn't have a built-in `cosine` learning rate
schedule, which the official implementation uses together with the AdamW
optimizer. We intend to update this example once these features are added,
and to add appropriate data augmentations.
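
Until such a schedule is built in, a cosine decay can be computed by hand. The
snippet below is a minimal sketch in plain Python and is not part of this
example's code: the function name `cosine_lr` and its parameters (`total_steps`,
`base_lr`, `min_lr`) are hypothetical, and it omits any warmup phase.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Return the cosine-annealed learning rate for a 0-indexed step.

    Hypothetical helper for illustration only; not an MLX API.
    """
    # Clamp so that steps outside [0, total_steps] stay at the end points.
    step = min(max(step, 0), total_steps)
    # The factor goes smoothly from 1 at step 0 to 0 at total_steps.
    factor = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * factor


if __name__ == "__main__":
    # Print the rate at the start, middle, and end of a 1000-step run.
    for s in (0, 500, 1000):
        print(s, cosine_lr(s, total_steps=1000, base_lr=1e-3))
```

In a training loop, the returned value would be assigned to the optimizer's
learning rate before each update. How that assignment is done depends on the
MLX version in use.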

[^1]: Based on the paper [Keyword Transformer: A Self-Attention Model for Keyword Spotting](https://www.isca-speech.org/archive/interspeech_2021/berg21_interspeech.html)
[^2]: We use version 0.02. See the [paper](https://arxiv.org/abs/1804.03209) for more details.