Awni Hannun
1c0c118f7c
Fp64 on the CPU ( #1843 )
...
* add fp64 data type
* clean build
* update docs
* fix bug
2025-02-07 15:52:22 -08:00
Jagrit Digani
b6c6552d20
Add missing #pragma once ( #1838 )
2025-02-06 11:11:22 -08:00
Awni Hannun
af1b725fda
Fix a couple of slicing bugs ( #1827 )
...
* fix a few bugs
* fix conv grad
* speedup test
* comment
2025-02-05 19:50:08 -08:00
Awni Hannun
9174606d4c
fix sort ( #1835 )
2025-02-05 17:16:27 -08:00
Awni Hannun
fe5987b81d
faster sort ( #1831 )
2025-02-05 06:10:22 -08:00
Awni Hannun
a229c8cef0
don't duplicate malloc with custom kernel init ( #1830 )
2025-02-04 13:20:57 -08:00
Awni Hannun
1156c84e86
Refactor common into cpu specific and truly common ( #1817 )
...
* refactor
* fix extension example
* fix no-cpu
2025-02-03 15:58:02 -08:00
Jesper Stemann Andersen
2d8e667400
MinGW support ( #1806 )
...
* Changed /bin/bash to bash for generating compiling preamble
* Fix wrt jit_compiler mingw like msvc wrt. WEXITSTATUS
* Solved ambiguity wrt. bernoulli test shape
* Disabled distributed/ring on Windows
* Fixed jit_compiler command wrt. MinGW
* Extended jit_compiler patch wrt. WEXITSTATUS to FreeBSD
2025-02-01 12:40:06 -08:00
Angelos Katharopoulos
f5cc1eea72
Allow different value dimensions in sdpa_vector ( #1811 )
2025-01-31 20:58:59 -08:00
Awni Hannun
b7c9f1d38f
scatter axis + gather axis primitives ( #1813 )
...
* scatter axis + gather axis primitives
* add transforms
* comment
2025-01-31 20:48:08 -08:00
Angelos Katharopoulos
1f4c127fb9
Move some kernels to get_template_definition ( #1782 )
2025-01-21 08:59:44 -08:00
Awni Hannun
a4667da1eb
Faster synchronization Fence primitive ( #1773 )
...
* try faster synchronization
move event
fixes
update bench
fix
fix
* non-functioning kernel
* try alternative fence
* cleanup barrier
* get rid of event_fence
* update benchmarks
* doc string in metal fence
2025-01-17 18:42:19 -08:00
Awni Hannun
f288db8d34
Fix synchronization bug for in stream async works ( #1768 )
2025-01-15 06:07:34 -08:00
Awni Hannun
252e423e81
fix and cleanup event signal/wait for metal ( #1765 )
2025-01-10 18:37:26 -08:00
Alex Barron
c7b0300af5
Fix batched qmv bug ( #1758 )
2025-01-09 11:45:57 -08:00
Awni Hannun
1ccaf80575
Dynamic broadcasting for shapeless compile/export ( #1722 )
...
* working towards dynamic broadcast
* shapeless broadcast
* fix build + nits
* use broadcast arrays in quantize matmul
* some cleanup / consistency
* mend
* some comments
* add vjp, jvp for broadcast axes
2025-01-09 11:04:24 -08:00
Cheng
b8f76f717a
Print exceptions in eval_cpu/eval_gpu and abort ( #1754 )
2025-01-08 06:31:09 -08:00
Awni Hannun
d1766f2c70
Add boolean mask support in vector SDPA ( #1757 )
2025-01-07 20:24:53 -08:00
Awni Hannun
516ded618b
Dynamic slicing ( #1741 )
...
* dynamic slice and slice update
* python bindings + tests + fix set item
* fix compile issue
* comment
* fix jit
2025-01-07 14:02:16 -08:00
Awni Hannun
058d6ce683
mpi send use input as output ( #1750 )
...
* mpi send use input as output
* move earlier
2025-01-06 06:08:43 -08:00
Awni Hannun
259025100e
Fix nd ternary on GPU ( #1746 )
2025-01-03 11:52:17 -08:00
Awni Hannun
6fa0501387
Fix concatenate/slice_update vjp + reduce binary size ( #1735 )
...
* fix concatenate vjp + reduce binary size
* also cast in slice update
2025-01-02 16:36:33 -08:00
Valentin Roussellet
88f993da38
Explicit parentheses around some logical operators ( #1732 )
...
* fix some warnings
* format
2024-12-24 07:02:20 -08:00
Awni Hannun
ebfe64b92d
shapeless slice update and broadcast when possible ( #1727 )
2024-12-23 11:25:15 -08:00
Awni Hannun
0308e9af71
Allow offset to be an mx.array for mx.fast.rope ( #1724 )
...
* allow offset for rope
* comment
2024-12-19 15:51:44 -08:00
Awni Hannun
e03f0372b1
More shape type ( #1705 )
...
* more shape type
* fix
2024-12-19 08:08:20 -08:00
Awni Hannun
7480059306
track resource limit and throw if exceeded ( #1718 )
2024-12-18 18:45:58 -08:00
Awni Hannun
9111999af3
Fix small sort with metal validation ( #1695 )
2024-12-12 09:21:45 -08:00
Awni Hannun
6bd28d246e
Allow no copy negative strides in as_strided and slice ( #1688 )
...
* allow no copy negative strides in as_strided and slice
* fix jit
* fix jit
2024-12-12 08:59:45 -08:00
Awni Hannun
4e1e9520e1
Flatten and unflatten ( #1692 )
...
* flatten and unflatten
* fix grad
* fix shape infer
* use squeeze + unsqueeze in get_item
2024-12-11 21:51:37 -08:00
Awni Hannun
f76a49e555
ExpandDims primitive ( #1687 )
...
* add squeeze primitive
* simplify squeeze, use in gather
* fix
* fix
* fix
* fix
* fix no cpu
* use squeeze in matmul and friends
* expand dims primitive
* comment
2024-12-10 16:39:07 -08:00
Awni Hannun
40c62c1321
Use int64 stride everywhere ( #1671 )
...
* use int64 stride everywhere
* fix ext
* fix ext
* more shape + cleanup
* one more
* few more
2024-12-09 11:09:02 -08:00
Alex Barron
95c4a2e3af
add back conditionaltype ( #1655 )
2024-12-06 11:12:01 -08:00
Jagrit Digani
9d40e521d7
Stop matrix copies with new attention kernel ( #1639 )
2024-12-02 14:12:38 -08:00
Jesper Stemann Andersen
e4eeb4e910
Added missing unordered_map includes ( #1635 )
...
* Added missing includes in mlx/io.h and mlx/backend/metal/metal.h
* Added additional missing unordered_map includes that fixes build on FreeBSD
2024-12-02 07:03:03 -08:00
Ikko Eltociear Ashimine
9bc2183a31
docs: update device.cpp ( #1632 )
...
unecessary -> unnecessary
2024-11-27 20:58:26 -08:00
Awni Hannun
d4b222b6d3
Fix some leaks and races ( #1629 )
...
* fix leak and fix potential race
* more leak fixes
* fix one more
2024-11-27 20:01:20 -08:00
Awni Hannun
211411faf2
fix large ops ( #1620 )
2024-11-24 09:17:10 -08:00
Alex Barron
6f7986d592
Cleaner qmv/qvm ( #1616 )
2024-11-22 11:14:08 -08:00
Jagrit Digani
02bec0bb6d
Matrix Attention kernel ( #1610 )
...
* Rough INIT
* [WIP]: Loading and Matmuls added
* [WIP]: Reductions and min working aligned kernel at headdim = 64
* [WIP] Added headdim 80 for testing
* [WIP] Update dispatch params for testing
* [WIP] Add support for unaligned seq lengths - still looks messy
* Update sdpa_benchmarks
* Update sdpa_benchmarks
* Update sdpa_benchmarks
* Enable gqa support
* Update benchmark and switch off 128 headdim
* Update headdim 128 tuning
* Remove older fast attention code. Write out O strided
* Disable hd=128 until further optimizations
* Enable bf16
* Fix data size bug
* Enable attn build outside of jit
2024-11-22 10:34:05 -08:00
Alex Barron
c79f6a4a8c
3 and 6 bit quantization ( #1613 )
...
* Support 3 and 6 bit quantization
2024-11-22 10:22:13 -08:00
Awni Hannun
0c5eea226b
Reduce specializations ( #1607 )
...
* start of reduce specializations
* fix all reduce
* fix many dims
* fix
* non-jit tests clear
* cleanup instantiations
* cpu merges
* change dim specializations
* optimize
* fix jit
* fix jit
* use higher precision for integer sum+prod
* fixes
2024-11-21 19:53:00 -08:00
Awni Hannun
dcca0d7477
contiguous op / prim ( #1612 )
2024-11-21 19:51:49 -08:00
Awni Hannun
61d787726a
Fix view scalar bug segfault ( #1603 )
...
* fix view scalar bug
* fix view scalar bug
* one more fix
2024-11-19 10:54:05 -08:00
Awni Hannun
2419edd5b2
Faster indexing math in a few kernels ( #1589 )
...
* wip: faster compiled kernels
* faster general unary with uint specialization
* index type in compiled, unary, binary, ternary, copy
* fix jit
* jit fix
* specialize gather + scatter
* nit in docs
2024-11-18 19:52:00 -08:00
Awni Hannun
9d7fa6b8e6
Use osx deployment target to pick Metal version ( #1595 )
...
* choose metal based on deployment target rather than system version
* nit
* unused compile def
2024-11-18 19:16:49 -08:00
Angelos Katharopoulos
073076ac7d
2-Pass Sdpa Inference Kernel ( #1597 )
2024-11-18 17:31:53 -08:00
Awni Hannun
9bd03dd9b4
More buffer donation with no-ops ( #1591 )
...
* more donation
* fix test
* fix build
2024-11-18 08:35:41 -08:00
Awni Hannun
6931f84412
fix dispatch threads for a few kernels ( #1594 )
2024-11-18 08:35:25 -08:00
Awni Hannun
610af352d4
Dispatch bf16 at run time when using the JIT ( #1584 )
...
* Dispatch bf16 at run time when using the JIT
* fix extension
* fix extension build
* fix extension build
* Update utils.h
2024-11-15 16:54:36 -08:00
Awni Hannun
b35f1e3c9c
fix donation in sdpa ( #1587 )
2024-11-13 17:21:13 -08:00
Alex Barron
a4c47b0276
OOB QMV fix ( #1579 )
...
* fix oob access in qmv
* skip more
* fix small case
2024-11-08 17:59:45 -08:00
Alex Barron
111fefd5e9
Fix OOB access in qmv ( #1577 )
...
* fix oob access in qmv
* skip more
2024-11-08 15:41:30 -08:00
Awni Hannun
c1fe1ef081
Bfs width limit ( #1568 )
...
* width limit
* fix
* large limit
* put env vars in env namespace
2024-11-08 15:00:46 -08:00
Awni Hannun
9f0d5c12fc
Fully wrap the command encoder ( #1572 )
...
* fully wrap the command encoder
* use consistent style + fix extensions
2024-11-08 11:50:21 -08:00
Awni Hannun
9a3842a2d9
fix ( #1566 )
2024-11-06 17:10:33 -08:00
Alex Barron
26be608470
Add split_k qvm for long context ( #1564 )
...
* Add splitk qvm
* configurable splitk
* tuning
* remove extra instantiation
* remove refactor
* separate test
* cpu tolerance
2024-11-05 11:25:19 -08:00
Angelos Katharopoulos
248431eb3c
Reductions update ( #1351 )
2024-11-04 22:25:16 -08:00
Awni Hannun
f1951d6cce
Use fewer barriers ( #1561 )
...
* use fewer barriers
* comment
2024-11-04 10:26:49 -08:00
Angelos Katharopoulos
62f297b51d
Sdpa fix ( #1558 )
2024-11-02 21:25:46 -07:00
Awni Hannun
4f72c66911
improvements to scatter / gather ( #1541 )
2024-10-30 19:30:54 -07:00
Jagrit Digani
960e3f0f05
Gemm update ( #1518 )
2024-10-30 19:30:28 -07:00
Awni Hannun
884af42da2
Fix thread group for large arrays ( #1543 )
...
* fix thread group for large arrays
* comment
* one more
2024-10-30 16:25:12 -07:00
Carlo Cabrera
1a992e31e8
Skip using Residency sets in VMs ( #1537 )
...
* Skip using Residency sets in VMs
Attempting to use residency sets in a VM throws[^1]
libc++abi: terminating due to uncaught exception of type std::runtime_error: [metal::Device] Unable to construct residency set.
Not quite sure if this is the best fix, but it does make the error go
away.
Note that it was previously possible to run simple programs that used
mlx in a VM prior to 0eb56d5be0. See related discussion at
Homebrew/homebrew-core#195627.
[^1]: https://github.com/Homebrew/homebrew-core/actions/runs/11525831492/job/32105148462#step:3:56
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* change residency check
---------
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
Co-authored-by: Awni Hannun <awni@apple.com>
2024-10-29 19:37:23 -07:00
Awni Hannun
015c247393
change wino dispatch condition ( #1534 )
2024-10-28 11:13:44 -07:00
Awni Hannun
d3cd26820e
Faster bits and bernoulli ( #1535 )
...
* faster bits and bernoulli
* fix bernoulli
2024-10-28 11:11:00 -07:00
Awni Hannun
0eb56d5be0
Wired ( #1510 )
...
* expose residency sets as wire/unwire
* returns wired size
* fix
* runtime support check
* fix os check
* fix test
* fix no metal build
* docs
* nit
* nits in docs
* nits
2024-10-25 09:35:33 -07:00
Awni Hannun
dad1b00b13
fix ( #1523 )
2024-10-24 19:17:46 -07:00
Angelos Katharopoulos
c9b41d460f
Working 64-bit scans ( #1506 )
2024-10-24 11:05:46 -07:00
xnorai
32972a5924
C++20 compatibility for fmt ( #1519 )
...
* C++20 compatibility for fmt
* Address review feedback
* Remove stray string
* Add newlines back
2024-10-24 08:54:51 -07:00
Dhruv Govil
f6afb9c09b
Remove use of vector<const T> ( #1514 )
2024-10-22 16:31:52 -07:00
Kashif Rasul
3ddc07e936
Eigenvalues and eigenvectors ( #1334 )
...
* initial eigvalsh
* add compute_vectors
* add compute_vectors_
* return a pair
* add eigh to return only eigenvectors
* fixed typo
* merge Eighvalsh and Eigh into a single primitive
* use the same primitive with the flag
* fix primitives
* use MULTI
* fix eval_gpu
* fix declaration
* rename EighPrimitive to Eigh
* tests
* tests
* fix rebase and format
* cleanup lapack
* format
* add cblas.h
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2024-10-22 12:18:48 -07:00
Awni Hannun
c26208f67d
Remove Hazard tracking with Fences ( #1509 )
...
* remove hazard tracking
* with fence map
* no hazard tracking with fences
* nits
* fix fence retain
* cleanup
* fix quantized rebase
2024-10-21 19:33:32 -07:00
Alex Barron
d15fa13daf
Batched Quantized Matmul + Fast Small QMV ( #1503 )
...
* add fast qmv for small dims
* fix test
* batched cpu
* add batched template param
* refactor metal quantized.cpp
2024-10-21 16:23:17 -07:00
Awni Hannun
92d7cb71f8
Fix compile ( #1501 )
...
* fix compile
* fix space
2024-10-18 11:06:40 -07:00
Angelos Katharopoulos
50d8bed468
Fused attention for single query ( #1497 )
2024-10-18 00:58:52 -07:00
Awni Hannun
3f86399922
Real and Imag ( #1490 )
...
* real and imag
* fix
* fix
2024-10-15 16:23:15 -07:00
Awni Hannun
881615b072
Faster metal compiled kernels + some fixes ( #1486 )
...
* bump mac tests to use py39
* work per thread for compiled kernels
* fix for large arrays
* fix
2024-10-14 12:45:38 -07:00
Awni Hannun
bf6ec92216
Make the GPU device more thread safe ( #1478 )
...
* gpu stream safety
* comment
* fix
2024-10-12 17:49:15 -07:00
Awni Hannun
1fa0d20a30
consistently handle all -inf in softmax ( #1470 )
2024-10-08 09:54:02 -07:00
Awni Hannun
3274c6a087
Fix array is_available race cases ( #1468 )
2024-10-07 19:13:50 -07:00
Awni Hannun
95d04805b3
Fix complex power on Metal ( #1460 )
2024-10-06 19:58:30 -07:00
Awni Hannun
e4534dac17
Conv grad with groups + bugfix ( #1449 )
...
* fix bug in flipped conv with groups, start of grad for groups
* fix
* fix
* fix + test
2024-10-06 07:08:53 -07:00
Angelos Katharopoulos
d878015228
Fix normalization check_input ( #1452 )
2024-10-03 13:26:56 -07:00
Angelos Katharopoulos
bacced53d3
Fix row reduce with very few rows ( #1447 )
2024-09-29 20:00:35 -07:00
Awni Hannun
11354d5bff
Avoid io timeout for large arrays ( #1442 )
2024-09-27 13:32:14 -07:00
Awni Hannun
5b6f38df2b
Faster cpu ops ( #1434 )
...
* faster binary and cleaner copy
* use recursive template for other ops
* more cleanup
* fix from cleanup
* more clean
* fix binary
* use contiguous iterator
* add 3d
* nits
* fix
* fix?
* fix
* fix rebase
2024-09-26 09:19:13 -07:00
Awni Hannun
0b4a58699e
Some overhead reductions in mx.fast.metal_kernel ( #1437 )
...
* some overhead reductions
* fix
* use +=
* use more +=
2024-09-25 17:25:21 -07:00
Awni Hannun
4f9f9ebb6f
Faster Metal unary and binary for general case ( #1431 )
...
* faster unary and binary for general case
* update ternary + jit fix
* fix jit
* unary work per thread
2024-09-25 12:07:43 -07:00
Awni Hannun
67b6bf530d
Optimization for general ND copies ( #1421 )
2024-09-17 17:59:51 -07:00
Awni Hannun
4f46e9c997
More fixes for arrays with large sizes ( #1405 )
...
* compile works for big arrays when contiguous
* style
* nits in docs
* a bunch more stuff
* update jit
* update jit
* use constant for shapes and strides and remove elem_to_loc overload
* use kernel instantiation
* docs nits
* update binary and ternary
* comments
2024-09-17 12:46:31 -07:00
Nripesh Niketan
669c27140d
Chore: add pre-commit hook for cmake ( #1362 )
...
* reset and lint
* format
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2024-09-16 12:53:01 -07:00
Awni Hannun
b3f52c9fbe
ensure io/comm streams are active before eval ( #1412 )
2024-09-14 06:17:36 -07:00
Angelos Katharopoulos
881f09b2e2
Allow querying the allocator for the buffer size ( #1404 )
2024-09-11 21:02:16 -07:00
Awni Hannun
02efb310ca
Xcode 160 ( #1384 )
...
* xcode 16.0 with debug tests
* limit nproc for builds
* vmap bug
* assert bug
* run python tests in debug mode
* fix view, bool copies preserve bits
* actual view fix
2024-09-10 15:15:17 -07:00
Awni Hannun
e7e59c6f05
Fix copying scalars by adding fill_gpu ( #1402 )
...
* fix copying scalars by adding fill_gpu
* Another copy scalar changed to fill
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
2024-09-09 15:54:08 -07:00
Max-Heinrich Laves
efeb9c0f02
Transposed Convolution ( #1245 )
...
* initial implementation for conv_transpose
ran pre-commit
implemented conv_transpose
updated conv_general docstring
updated conv_general docstring
updated code comments
removed commented run_conv_checks
updated acknowledgments
added missing entry to ops.rst
added op to nn.layers
resolved merge conflicts
* removed ConvolutionTranspose primitive as suggested by reviewer
removed ConvolutionTranspose primitive as suggested by reviewer
* remove transpose flag, add another test
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2024-09-06 19:52:38 -07:00
Awni Hannun
7cca1727af
Fix slice data size ( #1394 )
...
* fix slice data size and add tests
* fix contiguous flag
* simplify stride and perform copy for non-contiguous arrays
* fix cpu
* comment
2024-09-04 19:10:43 -07:00
Awni Hannun
41c603d48a
fix jit reduce ( #1395 )
2024-09-04 14:03:10 -07:00
Angelos Katharopoulos
58dca7d846
Fix copy in the sort primitive ( #1383 )
2024-08-31 08:32:14 -07:00
Alex Barron
da691257ec
Fix overflow in quantize/dequantize ( #1379 )
...
* add 2d indices to prevent overflow
* use nthreads not out size
2024-08-30 13:32:41 -07:00
Awni Hannun
dba2bd1105
Even Even Faster IO ( #1374 )
...
* even more faster io
* make reader pool static
* make python reader thread safe
* one more optimization
2024-08-29 16:05:40 -07:00
Alex Barron
28be4de7c2
Fix JIT reductions ( #1373 )
2024-08-28 16:39:11 -07:00
Awni Hannun
a6c3b38fba
Async load ( #1372 )
...
* async load
* async load
2024-08-28 14:21:55 -07:00
Angelos Katharopoulos
cdb59faea6
Adds send/recv ops in distributed ( #1366 )
2024-08-26 23:01:37 -07:00
Awni Hannun
5f7d19d1f5
MPI ops in GPU stream for faster comms ( #1356 )
2024-08-26 15:12:50 -07:00
Awni Hannun
2fdf9eb535
Fix ternary for large arrays ( #1359 )
...
* fix ternary for large arrays
* fix
2024-08-26 11:22:27 -07:00
Awni Hannun
860d3a50d7
fix extension metal library finding ( #1361 )
2024-08-26 09:18:50 -07:00
Angelos Katharopoulos
8081df79be
Fix boolean all reduce bug ( #1355 )
2024-08-24 10:09:32 -07:00
Nripesh Niketan
64bec4fad7
Chore: update pre-commit hooks ( #1353 )
...
* Chore: update pre-commit refs
* run pre-commit
2024-08-24 06:46:36 -07:00
Alex Barron
b96e105244
Add grid_sample example to metal_kernel docs ( #1352 )
...
* Add `zero_outputs` and `atomic_outputs` options to `metal_kernel`
* add grid sample to docs
* zero_outputs -> init_value
* add missing header for linux
2024-08-23 18:24:16 -07:00
Angelos Katharopoulos
b57a52813b
Further reduction tuning ( #1349 )
...
* More reduction tuning
* Forgotten pdb
* Small column long row specialization
2024-08-23 10:35:25 -07:00
Awni Hannun
98b6ce3460
Refactor reductions and fix scatter atomics for large sizes ( #1300 )
...
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
2024-08-22 16:03:31 -07:00
Alex Barron
0fd2a1f4b0
Custom Metal Kernels from Python ( #1325 )
...
* start
* simple kernels working
* restructure
* inverse example working
* docs + fixes
* missing file
* fix imports
* address comments
* add docs + fix test
* Review comments + refactor to a single function
* update docs
* remove hashing
* fix contig bug in test
* back to a class
* trailing whitespace
* fix tests
* match c++ and python apis
* add link + make args kw_only
2024-08-22 13:46:29 -07:00
Awni Hannun
df3233454d
2d gather specialization ( #1339 )
2024-08-22 10:48:24 -07:00
Awni Hannun
d40e76809f
Fix rope ( #1340 )
...
* add test
* fix rope
* fix test
2024-08-20 17:37:52 -07:00
Awni Hannun
bb1b76d9dc
RoPE with frequencies as optional input ( #1337 )
...
* start rope with freq input
* rope with frequencies
* nits
* fix bug
* fix bug + test
* cleanup
* optional base
2024-08-19 18:30:50 -07:00
Angelos Katharopoulos
9d26441224
Fix contiguity check ( #1336 )
...
Co-authored-by: Alex Barron <abarron22@apple.com>
2024-08-19 16:05:06 -07:00
Awni Hannun
f12f24a77c
fix compiling with space in paths ( #1332 )
2024-08-15 16:39:24 -07:00
Alex Barron
99bb7d3a58
GPU mx.sign for complex64 ( #1326 )
2024-08-14 07:54:53 -07:00
Awni Hannun
30bbea2f08
Add gemv masked to JIT plus some fixes ( #1310 )
...
* add gemv masked to JIT plus some fixes
* some cleanup
* add utils
* fix
* fix 2
* more cleaning
* fix
* remove unused mps matmul support
* one more nit
* revert
2024-08-07 13:38:07 -07:00
Awni Hannun
58d0e199e1
add bfloat conv for winograd ( #1306 )
...
* add bfloat conv for winograd
* accumulate in fp32
* accumulate in fp32
* accumulate in bf16
2024-08-05 15:51:13 -07:00
Awni Hannun
43ffdab172
fix rope and random ( #1301 )
...
* fix rope and random
* comment
2024-07-31 16:18:25 -07:00
Awni Hannun
40b6d67333
Fixes for large arrays with a few ops ( #1299 )
...
* fixes for large arrays with a few ops
* fix bug
* fix all of copy
2024-07-30 17:18:39 -07:00
Alex Barron
c52d1600f0
Fused Affine Quantize/Dequantize ops ( #1282 )
...
* Add fast affine dequantize
* add full quantize kernel
* fused kernel with scale/bias computation
* fix docstring
* fix no jit error
* fix test
* test fix
* reduce fast api to only affine_quantize
2024-07-29 15:11:38 -07:00
Jagrit Digani
7f914365fd
Fix GPU sort for large arrays ( #1285 )
...
* Fix GPU sort for large arrays
2024-07-24 14:37:10 -07:00
Alex Barron
c34a5ae7f7
Fix bfloat16 Hadamard ( #1283 )
...
* fix bfloat16 hadamard
* add scale
* review comments
---------
Co-authored-by: Alex Barron <abarron22@apple.com>
2024-07-23 14:54:43 -07:00
Awni Hannun
e2aa6ec8ae
some fixes ( #1281 )
2024-07-23 11:49:05 -07:00
Tim Gymnich
6307d166eb
Fix overflow / underflow handling for expm1f ( #1278 )
...
* Fix overflow / underflow handling for expm1f
* update tests
2024-07-23 07:29:06 -07:00
Awni Hannun
1fba87b0df
Fix leak with multi-output primitives ( #1274 )
...
* fix leak with multi-output primitives
* hopefully an actual fix
2024-07-23 06:34:18 -07:00
Cheng
2f83d6e4b7
Do not release buffers on exit ( #1142 )
2024-07-15 15:12:24 -07:00
Angelos Katharopoulos
5c1fa64fb0
Custom transforms ( #1246 )
2024-07-10 18:00:01 -07:00
Alex Barron
a3c287354f
Fast Hadamard Transform ( #1249 )
...
* Working hadamard for powers of 2
* working for m*2^k
* add scale and check contiguity
* add size check
* clean up
* fix test
* add grads + vmap
* gpu only
* skip on linux
* test typo
* add cpu impl
* remove gpu only tests
* fix linux build + add is_equivalent
2024-07-09 20:39:01 -07:00
Angelos Katharopoulos
03cf033f82
Fix reshape copy bug ( #1253 )
2024-07-07 21:37:00 -07:00
Alex Barron
2615660e62
Fix strided sort bug ( #1236 )
...
* Use output strides in sort kernel
* fix zero strides bug
2024-06-26 14:32:11 -07:00
Awni Hannun
56c8a33439
Get metal version from xcode ( #1228 )
...
* get metal version from xcode
* typo
* fix
2024-06-26 07:02:11 -07:00
Jagrit Digani
2d6cd47713
Masked gemv ( #1211 )
2024-06-14 09:52:26 -07:00
Awni Hannun
31e134be35
Build for macOS 15 ( #1208 )
...
* Build for macos 15
* metal32 as well
* comment
---------
Co-authored-by: Awni Hannun <Awni Hannun>
2024-06-13 13:31:44 -07:00
Fangjun Kuang
f20e97b092
minor fixes ( #1194 )
...
* minor fixes
* fix build errors
2024-06-12 22:06:49 -07:00
Alex Barron
934683088e
Refactor JIT for unary/binary/ternary ops ( #1206 )
...
* refactor unary/binary/ternary ops
* get_primitive_string util
---------
2024-06-12 14:22:12 -07:00
Awni Hannun
de2b9e7d0a
Fix kernel deps to reduce build times ( #1205 )
2024-06-12 11:17:39 -07:00
Alex Barron
dd7d8e5e29
Add Quantized Ops to the JIT ( #1204 )
...
* JIT for quantized ops
* remove unused imports
* address comments
* fix imports
* second attempt to fix imports
---------
Co-authored-by: Alex Barron <abarron22@apple.com>
2024-06-12 09:47:12 -07:00
Awni Hannun
df964132fb
fix scatter + test ( #1202 )
...
* fix scatter + test
* fix test warnings
* fix metal validation
2024-06-11 14:35:12 -07:00
Alex Barron
27d70c7d9d
Feature complete Metal FFT ( #1102 )
...
* feature complete metal fft
* fix contiguity bug
* jit fft
* simplify rader/bluestein constant computation
* remove kernel/utils.h dep
* remove bf16.h dep
* format
---------
Co-authored-by: Alex Barron <abarron22@apple.com>
2024-06-06 12:57:25 -07:00
Awni Hannun
578842954c
fix jit scan when output doesn't have primitive ( #1190 )
2024-06-06 07:24:58 -07:00
Awni Hannun
496315fe1d
Fix scan ( #1188 )
...
* fix scan
* improve grid size
* fix cpu cummax
2024-06-05 14:21:58 -07:00
Awni Hannun
83b11bc58d
Fix Metal API validation for empty concat ( #1183 )
2024-06-04 13:17:08 -07:00
Alex Barron
375a8bbdcc
Add some internal GPU apis ( #1177 )
...
* Add unary/binary/ternary/slice/concat internal GPU ops
* add pad internal op
* formatting + no_cpu fix
2024-06-04 09:24:26 -07:00
Awni Hannun
ea9090bbc4
Add view op ( #1179 )
...
* add view primitive
* nit
* fix view
2024-06-04 08:05:27 -07:00
Alex Barron
4d485fca24
Add defines include ( #1176 )
...
Co-authored-by: Alex Barron <abarron22@apple.com>
2024-06-03 09:50:10 -07:00