Added Keyword Spotting Transformer + SpeechCommands example (#123)
* Added Keyword Transformer + SpeechCommands
* minor fixes in README
* some updates / simplifications
* nits
* fixed kwt skip connections
* readme + format
* updated acknowledgements

---------

Co-authored-by: Awni Hannun <awni@apple.com>
69
speechcommands/README.md
Normal file
@@ -0,0 +1,69 @@
# Train a Keyword Spotting Transformer on Speech Commands

An example of training a Keyword Spotting Transformer[^1] on the Speech
Commands dataset[^2] with MLX. All supervised-only configurations from the
paper are available. The example also illustrates how to use [MLX
Data](https://github.com/ml-explore/mlx-data) to load and process an audio
dataset.
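
As a rough illustration of the MLX Data side, a pipeline over a list of
audio files could look like the sketch below. This is not the exact
pipeline in `main.py`; the file paths, sample keys, and batch/prefetch
parameters are illustrative.

```
import mlx.data as dx

# Illustrative file list: audio paths (as bytes) plus integer keyword labels.
# The real example builds this list from the Speech Commands archive.
files = [
    {"audio": b"speech_commands/yes/0a7c2a8d_nohash_0.wav", "label": 0},
    {"audio": b"speech_commands/no/0b40aa8e_nohash_1.wav", "label": 1},
]

stream = (
    dx.buffer_from_vector(files)
    .load_audio("audio")   # decode each wav file into a waveform array
    .shuffle()
    .to_stream()
    .batch(32)             # clips should be padded/trimmed to one length first
    .prefetch(8, 4)        # prefetch 8 batches using 4 threads
)

for batch in stream:
    waveforms, labels = batch["audio"], batch["label"]
    # compute features and run a training step here
```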

## Pre-requisites

Follow the [installation
instructions](https://ml-explore.github.io/mlx-data/build/html/install.html)
for MLX Data.

Install the remaining Python requirements:

```
pip install -r requirements.txt
```

## Running the example

Run the example with:

```
python main.py
```

By default the example runs on the GPU. To run it on the CPU, use:

```
python main.py --cpu
```
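
A `--cpu` switch in an MLX script typically just changes the default device
before any computation runs. As a minimal sketch (assumed, not necessarily
how `main.py` implements it):

```
import argparse

import mlx.core as mx

parser = argparse.ArgumentParser()
parser.add_argument("--cpu", action="store_true", help="Run on the CPU")
args = parser.parse_args()

# All subsequent MLX operations use this default device.
mx.set_default_device(mx.cpu if args.cpu else mx.gpu)
```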

For all available options, run:

```
python main.py --help
```

## Results

After training with the `kwt1` architecture for 10 epochs, you
should see the following results:

```
Epoch: 9 | avg. Train loss 0.519 | avg. Train acc 0.857 | Throughput: 661.28 samples/sec
Epoch: 9 | Val acc 0.861 | Throughput: 2976.54 samples/sec
Testing best model from epoch 9
Test acc -> 0.841
```

For the `kwt2` model, you should see:

```
Epoch: 9 | avg. Train loss 0.374 | avg. Train acc 0.895 | Throughput: 395.26 samples/sec
Epoch: 9 | Val acc 0.879 | Throughput: 1542.44 samples/sec
Testing best model from epoch 9
Test acc -> 0.861
```

Note that these results were obtained on an M1 MacBook Pro with 16GB RAM.

At the time of writing, `mlx` doesn't have a built-in cosine learning rate
schedule, which the official implementation uses along with the AdamW
optimizer. We intend to update this example once that feature is added, and
to add appropriate data augmentations.
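
In the meantime, a hand-rolled cosine schedule can be paired with AdamW by
updating the optimizer's learning rate each epoch. The snippet below is only
a sketch: the base learning rate and epoch count are placeholders, and
assigning `optimizer.learning_rate` directly is assumed to work.

```
import math

import mlx.optimizers as optim

base_lr, num_epochs = 3e-4, 10
optimizer = optim.AdamW(learning_rate=base_lr)

def cosine_lr(epoch):
    # Decay from base_lr down to 0 over num_epochs.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * epoch / num_epochs))

for epoch in range(num_epochs):
    optimizer.learning_rate = cosine_lr(epoch)
    # ... run one training epoch with this learning rate ...
```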

[^1]: Based on the paper [Keyword Transformer: A Self-Attention Model for Keyword Spotting](https://www.isca-speech.org/archive/interspeech_2021/berg21_interspeech.html)
[^2]: We use version 0.02. See the [paper](https://arxiv.org/abs/1804.03209) for more details.