This commit is contained in:
Awni Hannun
2023-12-28 15:18:40 -08:00
parent ef773beab6
commit 253cc31815


@@ -49,15 +49,16 @@ are accepted by the larger model. That's more likely to happen if the models
 are trained on similar data.
 
-One way to increase the chance of accepting a draft token is with the parameter
-`--delta`. This parameter can be in the range `[0, 1]`. If it is `1` then all
-the draft tokens will be accepted by the model. If it is `0`, then only draft
-tokens which match the original acceptance criterion kept.[^1] Values closer to
-`1` increase the chance that a draft token is accepted.
+One way to increase the chance of accepting a draft token is with the parameter
+`--delta`. This parameter can be in the range $[0, 1]$. If it is $1$ then all
+the draft tokens will be accepted by the model. If it is $0$, then only draft
+tokens which match the original acceptance criterion are kept.[^1] Values
+closer to $1$ increase the chance that a draft token is accepted.
 
-Conversely, the fewer draft tokens accepted by the model, the more expensive
-speculative decoding is. You can use `--draft` to tune the number of draft
-tokens per model evaluation in order to reduce the number of discarded draft
-tokens.
+Conversely, the fewer draft tokens accepted by the main model, the more
+expensive speculative decoding is. You can use `--num-draft` to tune the number
+of draft tokens per model evaluation in order to reduce the number of discarded
+draft tokens. Decreasing `--num-draft` will decrease the number of discarded
+draft tokens at the expense of more large model evaluations.
 
 [^1]: See the paper [Fast Inference from Transformers via Speculative
 Decoding](https://arxiv.org/abs/2211.17192)