* Use async CUDA malloc managed with CUDA 13
* Add pool threshold
* Refactor for regular CUDA malloc
* Load eval gpu for CUDA
* Remove use of CUDA pool; use CUDA free async
* fix
* fix
* fix
* fix
* fix + comment
* Make CudaEvent a CudaHandle
* Add caching for CudaEvent
* Make sure CUDA events are destroyed last
* Fix headers
* SharedEvent => AtomicEvent
* RawCudaEvent => CudaEventHandle, CudaEventWrapper => CopyableCudaEvent
* Remove unneeded asserts