gguf-tools

zhangyiss/gguf-tools

mirror of https://github.com/antirez/gguf-tools.git synced 2025-09-17 02:28:07 +08:00

Commit Graph

Author	SHA1	Message	Date
Justine Tunney	918234ce80	Remove flush to zero from bf16. After closely analyzing Google Brain codebases, we decided that flushing to zero was the wrong thing to do. Intel and AMD probably designed their microprocessors to always flush to zero for the wrong reasons. It should have been made conditional on FTZ being set in MXCSR like other opcodes. See ggerganov/llama.cpp#7843	2024-07-03 05:39:16 -07:00
Justine Tunney	ede59bb742	Add BF16 support and fix warnings. This change updates the data type definitions to be the same as the latest source code. Support for the bfloat16 data type is available however it can't interpret the IQ quantization formats yet. Cleanup of compiler warnings and other nits have been fixed, but behavioral changes have been avoided, and no new features are as of yet added.	2024-05-25 22:58:50 -07:00


2 Commits
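
For readers unfamiliar with what "flush to zero" means in the bf16 commit listed above, here is a minimal sketch in C of a float32/bfloat16 conversion that keeps subnormal values instead of flushing them to zero. The function names `f32_to_bf16_sketch` and `bf16_to_f32_sketch` are hypothetical and chosen for illustration; this is not the code from either commit, only one plausible shape such a conversion can take (round to nearest even, with NaN forced quiet).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the repository): widen a bfloat16 bit
 * pattern to float32. bfloat16 is the upper 16 bits of an IEEE 754
 * binary32 value, so widening is a shift. */
static inline float bf16_to_f32_sketch(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f); /* type-pun without undefined behavior */
    return f;
}

/* Hypothetical helper (not from the repository): narrow float32 to
 * bfloat16 with round-to-nearest-even. Subnormal inputs are rounded
 * like any other value; there is no flush-to-zero branch. */
static inline uint16_t f32_to_bf16_sketch(float s) {
    uint32_t bits;
    memcpy(&bits, &s, sizeof bits);

    /* NaN: keep the sign and upper payload bits, force the quiet bit. */
    if ((bits & 0x7fffffffu) > 0x7f800000u) {
        return (uint16_t)((bits >> 16) | 64);
    }

    /* A flush-to-zero variant would return only the sign bit here
     * whenever the exponent field (bits & 0x7f800000) is zero.
     * This sketch deliberately omits that branch. */

    /* Round to nearest, ties to even: add 0x7fff plus the lowest
     * bit that survives truncation, then drop the low 16 bits. */
    bits += 0x7fffu + ((bits >> 16) & 1u);
    return (uint16_t)(bits >> 16);
}
```

With this sketch, a subnormal input such as `bf16_to_f32_sketch(0x0001)` converts back to the bit pattern `0x0001` rather than collapsing to zero, which is the behavioral difference the commit message describes.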