* start to clean up/unify accelerate and common back-ends
* more progress
* simplify
* add half type and allow infs in simd exp
* unify softmax + quantized, more dispatches to simd quantized mm
* add sin/cos, use simd in vector-scalar ops
* faster CPU vectorize quant
* faster erf/erfinv
* faster binary and cleaner copy
* use recursive template for other ops
* more cleanup
* fix from cleanup
* more cleanup
* fix binary
* use contiguous iterator
* add 3d
* nits
* fix
* fix?
* fix
* fix rebase
* refactor cpu preamble
* fix include order
* fix some issues
* fixes for linux
* try to fix includes
* add back warning suppression
* more linux fixes
* buffer donation
* fix to move shared pointer
* format
* gpu in place for copy and binary
* revert ops test
* cpu in place
* a little cleanup
* remove useless bench