BUG FIX: Decoding results in garbled text when multiple tokens represent a single character (e.g., Chinese). (#398)

ZHAOKAI WANG 2024-02-01 11:27:29 +08:00 committed by GitHub
parent 94358219cf
commit 0340113e02
GPG Key ID: B5690EEEBB952194

@@ -292,8 +292,9 @@ def generate(model, prompt, tokenizer, args):
         tokens.append(token.item())
         s = tokenizer.decode(tokens)
-        print(s[skip:], end="", flush=True)
-        skip = len(s)
+        if len(s) - skip > 1:
+            print(s[skip:-1], end="", flush=True)
+            skip = len(s) - 1
     print(tokenizer.decode(tokens)[skip:], flush=True)
     print("=" * 10)
     if len(tokens) == 0:
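
The effect of the change can be reproduced without a real model. The following is a minimal sketch, assuming a hypothetical byte-level tokenizer in which every token is a single UTF-8 byte; `toy_decode`, `stream_naive`, and `stream_holdback` are illustrative names that do not appear in the repository. It compares the old streaming loop with the hold-back-one-character loop from this commit.

```python
def toy_decode(tokens):
    """Hypothetical decoder (assumption): each token is one UTF-8 byte.

    Incomplete multi-byte sequences decode to U+FFFD, which mimics what a
    real tokenizer returns when a character is split across tokens.
    """
    return bytes(tokens).decode("utf-8", errors="replace")


def stream_naive(tokens):
    """Old behavior: emit everything decoded so far after every token."""
    out, seen, skip = [], [], 0
    for tok in tokens:
        seen.append(tok)
        s = toy_decode(seen)
        out.append(s[skip:])  # may emit a partial character as U+FFFD
        skip = len(s)
    out.append(toy_decode(seen)[skip:])
    return "".join(out)


def stream_holdback(tokens):
    """New behavior: always hold back the last decoded character."""
    out, seen, skip = [], [], 0
    for tok in tokens:
        seen.append(tok)
        s = toy_decode(seen)
        if len(s) - skip > 1:
            # Emit all but the last character, which may still be incomplete.
            out.append(s[skip:-1])
            skip = len(s) - 1
    # Flush whatever was held back once generation has finished.
    out.append(toy_decode(seen)[skip:])
    return "".join(out)


tokens = list("你好".encode("utf-8"))  # 6 byte-tokens for 2 characters
print(repr(stream_naive(tokens)))
print(repr(stream_holdback(tokens)))
```

With this toy decoder, the naive loop emits replacement characters (U+FFFD) in place of the text, because each partially decoded character is printed and then skipped. The hold-back loop reproduces the original string, since a character is only printed once a later character has appeared after it or generation has ended.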