Commit Graph

2 Commits

Author SHA1 Message Date
Justine Tunney
918234ce80 Remove flush to zero from bf16
After closely analyzing Google Brain codebases, we decided that flushing
to zero was the wrong thing to do. Intel and AMD probably designed their
microprocessors to always flush to zero for the wrong reasons. It should
have been made conditional on FTZ being set in MXCSR like other opcodes.

See ggerganov/llama.cpp#7843
2024-07-03 05:39:16 -07:00
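
The difference this commit describes can be illustrated with a small sketch. This is not the code from the commit. The function names `f32_to_bf16` and `f32_to_bf16_ftz` and the two bit-casting helpers are hypothetical, and the rounding scheme (round to nearest even on the upper 16 bits) is assumed here as a common way to narrow fp32 to bf16. The first function keeps subnormal inputs; the second adds a flush-to-zero branch of the kind the message argues against.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bits as an integer (memcpy avoids aliasing issues). */
uint32_t f32_to_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Build a float from raw bits; useful for constructing subnormal inputs. */
float f32_from_bits(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Hypothetical helper: fp32 -> bf16 with round-to-nearest-even.
 * Subnormal inputs are rounded like any other value, not flushed. */
uint16_t f32_to_bf16(float s) {
    uint32_t i = f32_to_bits(s);
    if ((i & 0x7fffffffu) > 0x7f800000u) {
        /* NaN: keep the top bits and force the quiet bit. */
        return (uint16_t)((i >> 16) | 64u);
    }
    /* Add 0x7fff plus the lowest kept bit, so ties round to even. */
    return (uint16_t)((i + (0x7fffu + ((i >> 16) & 1u))) >> 16);
}

/* Hypothetical variant with the flush-to-zero branch, for comparison. */
uint16_t f32_to_bf16_ftz(float s) {
    uint32_t i = f32_to_bits(s);
    if (!(i & 0x7f800000u)) {
        /* Exponent field is zero (zero or subnormal): keep only the sign. */
        return (uint16_t)((i & 0x80000000u) >> 16);
    }
    return f32_to_bf16(s);
}
```

For a subnormal input with bit pattern `0x00400000`, `f32_to_bf16` returns `0x0040`, while `f32_to_bf16_ftz` returns `0x0000`. Normal values such as `1.0f` convert to the same result (`0x3f80`) in both.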
Justine Tunney
ede59bb742 Add BF16 support and fix warnings
This change updates the data type definitions to match the latest
source code. Support for the bfloat16 data type is available;
however, the IQ quantization formats can't be interpreted yet.
Compiler warnings and other nits have been cleaned up, but
behavioral changes have been avoided, and no new features have
been added yet.
2024-05-25 22:58:50 -07:00
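
As background for the bfloat16 support mentioned above: a bf16 value is the upper 16 bits of an IEEE 754 binary32 float, so reading one back is a shift and a bit cast. The sketch below assumes that layout. The name `bf16_to_f32` is hypothetical and is not taken from the changed code.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widen a bf16 bit pattern to fp32.
 * The 16 bf16 bits become the high half of the float; the low half is zero. */
float bf16_to_f32(uint16_t h) {
    uint32_t u = (uint32_t)h << 16;
    float f;
    memcpy(&f, &u, sizeof f);  /* bit cast without aliasing problems */
    return f;
}
```

For example, `bf16_to_f32(0x3f80)` yields `1.0f` and `bf16_to_f32(0x4049)` yields `3.140625f`, the nearest value to pi that keeps only the bits bf16 stores.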