At the time of writing, `mlx` doesn't have a built-in `cosine` learning rate
schedule, which is used along with the AdamW optimizer in the official
implementation. We intend to update this example once these features are added,
and to add appropriate data augmentations.
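In the meantime, one possible workaround is to compute the cosine-decayed rate
manually and assign it to the optimizer each step. Below is a minimal sketch of
this idea; the hyperparameters (`base_lr`, `total_steps`, `warmup_steps`) are
illustrative assumptions, not the official training settings.

```python
import math

import mlx.optimizers as optim

# Illustrative hyperparameters -- adjust to match your training run.
base_lr = 1e-3
total_steps = 10_000
warmup_steps = 500  # optional linear warmup (an assumption)


def cosine_lr(step):
    """Cosine decay from base_lr to 0, with a short linear warmup."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


optimizer = optim.AdamW(learning_rate=base_lr)

# Inside the training loop, set the rate before each parameter update:
#     optimizer.learning_rate = cosine_lr(step)
#     optimizer.update(model, grads)
```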
[^1]: Based on the paper [Keyword Transformer: A Self-Attention Model for Keyword Spotting](https://www.isca-speech.org/archive/interspeech_2021/berg21_interspeech.html)
[^2]: We use version 0.02. See the [paper](https://arxiv.org/abs/1804.03209) for more details.