mlx-examples/llms/issue.txt
2024-10-09 15:13:12 -04:00


## Steps to reproduce
Run the following twice: once as-is, and once with `prefill_step_size=2` commented out:
```py
import mlx_lm
model, tokenizer = mlx_lm.load('/Users/llwu/models/mlx/Meta-Llama-3.1-8B-4bit')
mlx_lm.generate(
    model,
    tokenizer,
    prompt="69 + 420= ",
    verbose=True,
    max_tokens=10,
    max_kv_size=5,
    prefill_step_size=2,
)
```
The outputs differ. I also noticed that the `RotatingKVCache` ends up with length 5 when prefilling in steps and length 7 without.
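For what it's worth, here is a toy model of one eviction policy that would reproduce the 5-vs-7 lengths. This is a guess at the mechanism, not the actual `mlx_lm` implementation: the assumption is that a single oversized update (a one-shot prompt prefill) permanently enlarges the cache past `max_kv_size`, while chunked prefill never lets it grow beyond the limit.

```python
# Toy rotating cache -- a sketch of a hypothesized mechanism, NOT
# mlx_lm's RotatingKVCache.
class ToyRotatingCache:
    def __init__(self, max_size):
        self.capacity = max_size
        self.tokens = []

    def update(self, chunk):
        # Assumption: an oversized single chunk enlarges the buffer for good.
        self.capacity = max(self.capacity, len(chunk))
        self.tokens.extend(chunk)
        # Evict the oldest tokens past capacity. (The real RotatingKVCache
        # also preserves the first `keep` tokens; omitted for simplicity.)
        overflow = len(self.tokens) - self.capacity
        if overflow > 0:
            del self.tokens[:overflow]


def final_len(prompt_len, step, max_size, gen_tokens):
    cache = ToyRotatingCache(max_size)
    prompt = list(range(prompt_len))
    for i in range(0, prompt_len, step):  # prefill in chunks of `step`
        cache.update(prompt[i : i + step])
    for t in range(prompt_len, prompt_len + gen_tokens):
        cache.update([t])  # one token per decode step
    return len(cache.tokens)


print(final_len(7, step=2, max_size=5, gen_tokens=10))  # chunked prefill -> 5
print(final_len(7, step=7, max_size=5, gen_tokens=10))  # one-shot prefill -> 7
```

If something like this is happening, the two runs attend over different windows of the prompt, which would explain the diverging generations.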