* Pad mask with zeros for non-square attention matrices
The current implementation of the mask assumes the attention matrix is square, which is true if there is no cache. However, if one wishes to produce multiple tokens at a time, such as in speculative decoding implementations, a rectangular mask is necessary.
This change pads the bottom of the mask with zeros so multi-token decoding with a cache works correctly.
* Directly create mask instead of padding
* Update llama.py
* use nn.RMSNorm, use sdpa, cleanup
* bump mlx versions
* minor update
* use fast layer norm
* version bump
* update requirement for whisper
* update requirement for gguf
* lazy model import in mlx_lm
* change lora loading
* fix olmo lora
* remove a bunch of unused stuff from plamo
* move phixtral to mlx-lm and out of llms/