Mirror of https://github.com/ml-explore/mlx-examples.git (synced 2025-06-24 17:31:18 +08:00)
* Pad mask with zeros for non-square attention matrices

  The current implementation of the mask assumes the attention matrix is square, which holds only when there is no cache. However, to produce multiple tokens at a time, as in speculative decoding implementations, a rectangular mask is necessary. This change pads the bottom of the mask with zeros so that multi-token decoding with a cache works correctly.

* Directly create mask instead of padding
* Update llama.py
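The idea in the commit above can be sketched as follows. This is a minimal illustration in NumPy (the real implementation operates on MLX arrays inside `llama.py`; the function name and signature here are hypothetical): when a KV cache already holds `offset` tokens and `n_queries` new tokens are decoded at once, the additive causal mask is rectangular, of shape `(n_queries, offset + n_queries)`, rather than square.

```python
import numpy as np

def create_causal_mask(n_queries: int, offset: int) -> np.ndarray:
    """Build a rectangular additive causal mask.

    Rows index the n_queries new tokens; columns index all
    offset + n_queries positions (cached tokens plus new ones).
    Cached positions are always visible (0.0); new token i may
    attend only to positions <= offset + i, others get -inf.
    """
    total = offset + n_queries
    cols = np.arange(total)[None, :]      # shape (1, total)
    rows = np.arange(n_queries)[:, None]  # shape (n_queries, 1)
    # Mask out any column strictly after the query's own position.
    return np.where(cols > rows + offset, -np.inf, 0.0)
```

For example, with 3 cached tokens and 2 new tokens, the mask has shape `(2, 5)`: the first row masks only the final column, and the second row masks nothing, so both new tokens see the full cache plus their causal prefix.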
Package contents:

- examples/
- models/
- tuner/
- __init__.py
- convert.py
- fuse.py
- generate.py
- gguf.py
- LORA.md
- lora.py
- MANAGE.md
- manage.py
- MERGE.md
- merge.py
- py.typed
- README.md
- requirements.txt
- sample_utils.py
- SERVER.md
- server.py
- tokenizer_utils.py
- UPLOAD.md
- utils.py
- version.py
Generate Text with MLX and 🤗 Hugging Face
This is an example of large language model text generation that can pull models from the Hugging Face Hub.
For more information on this example, see the README in the parent directory.
This package also supports fine-tuning with LoRA or QLoRA. For more information, see the LoRA documentation.