Make attention faster for some models (#574)

* make attention faster for a couple models

* remove unused generation flags

* add comment on lora

* include text files as well
Awni Hannun
2024-03-14 21:35:54 -07:00
committed by GitHub
parent 3f3741d229
commit e4b19bb9e1
6 changed files with 35 additions and 56 deletions


@@ -167,6 +167,12 @@ of memory. Here are some tips to reduce memory use should you need to do so:
you can do is break your examples into smaller
sequences when making the `{train, valid, test}.jsonl` files.
5. Gradient checkpointing lets you trade off memory use (less) for computation
(more) by recomputing intermediate values needed by the backward pass instead
of storing them. You can enable gradient checkpointing by passing the
`--grad-checkpoint` flag. It is most helpful with larger batch sizes or
sequence lengths and with smaller or quantized models.
For example, for a machine with 32 GB the following should run reasonably fast:
```
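# The original example command is truncated here. This is a minimal sketch of
# the kind of invocation the text refers to; the model name and flag values
# below are illustrative assumptions, not the original example.
python lora.py \
    --model mistralai/Mistral-7B-v0.1 \
    --train \
    --batch-size 1 \
    --lora-layers 4 \
    --grad-checkpoint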