* Use async cuda malloc managed with cuda 13
* add pool threshold
* refactor for regular cuda malloc
* load eval gpu for cuda
* remove use of cuda pool, use cuda free async
* fix
* fix
* fix
* fix
* fix + comment
* Add quantize/dequantize slow path for mxfp8 and nvfp4
* fast cuda kernel for mx/nv quantization
* fallback for cuda < 12.8 (#2697)
* format (#2700)
* fix (#2701)
* metal kernels
* docs
* fix jit
* add default bits and group sizes
* improve quant docs
* fix output type of mxfp4 matmuls