Fast RMS Norm (#862)

* fast rmsnorm

* no rms gpu

* kernel

* fix shared mem

* looped rms and donation in softmax

* Do the squaring in float32 to avoid underflow

* Fix the default StreamOrDevice for rope and rms_norm in fast

* nits

---------

Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
Awni Hannun
2024-03-21 07:20:54 -07:00
committed by GitHub
parent 4650d94d98
commit a54f06b16f
17 changed files with 493 additions and 41 deletions
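
The commit message's note about squaring in float32 to avoid underflow can be illustrated with a small sketch. This is not the Metal kernel added by this commit. It is a stdlib-only Python model, and the helper names (`round_to`, `mean_square`, `rms_norm`), the input values, and the deliberately tiny `eps` are all assumptions chosen for illustration. It emulates float16 and float32 rounding with the `struct` module and compares the mean of squares computed in each precision.

```python
import math
import struct


def round_to(fmt: str, x: float) -> float:
    """Round a Python float to a narrower IEEE format.

    fmt is a struct format code: "e" for float16, "f" for float32.
    """
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def mean_square(x, fmt):
    """Mean of squares with every product and partial sum rounded to fmt."""
    acc = 0.0
    for v in x:
        acc = round_to(fmt, acc + round_to(fmt, v * v))
    return round_to(fmt, acc / len(x))


def rms_norm(x, weight, eps, fmt):
    """Hypothetical reference RMS norm: x * weight / sqrt(mean(x^2) + eps).

    Only the precision of the squaring and accumulation (fmt) varies;
    the output is always rounded back to float16.
    """
    inv_rms = 1.0 / math.sqrt(mean_square(x, fmt) + eps)
    return [round_to("e", v * inv_rms * w) for v, w in zip(x, weight)]


# Small float16 inputs: each square is about 1e-8, which is below the
# smallest float16 subnormal (2**-24, roughly 6e-8).
x = [round_to("e", 1e-4)] * 8
w = [1.0] * 8
eps = 1e-9  # assumed, kept tiny so that eps does not hide the effect

ms_f16 = mean_square(x, "e")  # every square rounds to zero
ms_f32 = mean_square(x, "f")  # squares are preserved

out_f16 = rms_norm(x, w, eps, "e")
out_f32 = rms_norm(x, w, eps, "f")

print("mean square, float16 squaring:", ms_f16)
print("mean square, float32 squaring:", ms_f32)
print("first output, float16 squaring:", out_f16[0])
print("first output, float32 squaring:", out_f32[0])
```

With float16 squaring the mean of squares collapses to zero, so the normalization depends on `eps` alone and the output is scaled incorrectly. With float32 squaring the outputs stay close to 1, which is what RMS normalization of a constant vector should give.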

@@ -23,6 +23,7 @@ set(
"gemv"
"quantized"
"random"
"rms_norm"
"rope"
"scan"
"scaled_dot_product_attention"