* redesign for faster CPU/GPU sync
* load + more async CPU
* use command encoder API and move more ops to use it
* make fence back-end generic + CPU only fence
* faster build
* fix async eval
* fixes + handle temporaries
* fix / improve cpu conv
* remove unused status, fix siblings
* fix extensions
* fix
* fix no cpu build
* format
* comments
* fix perf regression, remove unnecessary abort
* fix events, task limit cpu
* fix waiting
* fix donation / temporaries in normalization
* export and import functions
* refactor + works for few primitives
* nit
* allow primitives with state
* nit
* nit
* simplify serialize / deserialize
* fix for constants
* python bindings
* maybe fix serialize failure case
* add example
* more primitives, training kind of works
* same result for python and c++
* some fixes
* fix export
* template it up
* some simplification
* rebase
* allow kwargs and multiple functions
* exporter
* more primitives for exporting
* deal with endianness
* handle invalid stream
* add docstring
* add synchronize function
* fix linux
* fix linux
* fix and fix docs
* fix test
* try synchronize in stream destroy
* synchronize works for both cpu and gpu
* defer all ObjC (via metal-cpp) interaction until after static initializers have run
- metal-cpp relies on static initializers to cache class and selector pointers
- code in mlx was using metal-cpp to set up NSAutoreleasePools during its own static init time
- but this code was silently failing as the class and selector pointers from metal-cpp were still nil
- defer the creation of NSAutoreleasePools until after static init time
- ensure that we have coverage where autorelease pools are needed
* Update device.cpp
remove commented code
* Update device.cpp
remove commented out code
* Update scheduler.h
update comment
* per discussion, use the pool inside task() -- this is Metal only and not needed for the CPU
* Update allocator.cpp
move pool to release/alloc area
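
For reference, a minimal sketch of the deferral idea from the static-initializer item above: the pool is created inside the task body at run time rather than in a global object's constructor. `ScopedPool` and `with_pool` are hypothetical names for illustration only. `ScopedPool` stands in for metal-cpp's `NS::AutoreleasePool` and just counts live pools so the sketch builds without Metal. This is not the code in device.cpp or scheduler.h.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stand-in for NS::AutoreleasePool; it only counts live pools.
struct ScopedPool {
  static int& live() {
    static int count = 0;  // function-local static: initialized on first use
    return count;
  }
  ScopedPool() { ++live(); }
  ~ScopedPool() { --live(); }
  ScopedPool(const ScopedPool&) = delete;
  ScopedPool& operator=(const ScopedPool&) = delete;
};

// Hypothetical helper: wrap a task so the pool is created when the task
// runs, not when the task (or any global object) is constructed.
inline std::function<void()> with_pool(std::function<void()> body) {
  assert(body && "task body must be callable");
  return [body = std::move(body)]() {
    ScopedPool pool;  // created at run time, after static initialization
    body();
  };  // pool is released when the task returns
}
```

Wrapping a task does not create a pool by itself. The pool exists only while the wrapped task is running, which is the property the deferral relies on.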