Awni Hannun
c4230747a1
redesign for faster cpu/gpu synch ( #1869 )
...
* redesign for faster cpu/gpu synch
* load + more async CPU
* use command encoder API and move more ops to use it
* make fence back-end generic + CPU only fence
* faster build
* fix async eval
* fixes + handle temporaries
* fix / improve cpu conv
* remove unused status, fix siblings
* fix extensions
* fix
* fix no cpu build
* format
* comments
* fix perf regression, remove unecessary abort
* fix events, task limit cpu
* fix waiting
* fix donation / temporaries in normalization
2025-03-06 19:23:38 -08:00
Alex Barron
fd0d63ba5b
Affine quant always in fp32 ( #1925 )
...
* do affine quant in fp32
* static cast
2025-03-04 17:50:19 -08:00
Abe Leininger
3835a428c5
Adds nuclear norm support ( #1894 )
...
* adjust norm unit test tolerance
2025-03-04 13:26:02 -08:00
Awni Hannun
e613d0eaf0
SDPA support for small batch (over sequence) queries ( #1922 )
...
* batch query sdpa
* batch sdpa for query
2025-03-04 10:59:04 -08:00
Awni Hannun
6bcd6bcf70
fix donation in scan ( #1917 )
2025-03-03 11:30:59 -08:00
Awni Hannun
ba12e4999a
Use a heap for small sizes ( #1911 )
...
* use a heap for small sizes
* check if VM
2025-03-03 06:50:57 -08:00
Awni Hannun
4e7cd31d12
Fix slice data size ( #1913 )
...
* fix slice data size
* add test
2025-03-02 21:50:42 -08:00
Angelos Katharopoulos
5e6c130d93
RMS norm without scaling ( #1915 )
2025-02-28 20:26:57 -08:00
Jagrit Digani
89d327075f
Enabling fused attention for head dim 128 ( #1899 )
...
* Share KV smem
* Fix bfloat error
* Unroll O = S @ V loop
* Perf upgrade
* Remove commented out function
* Add -Wno-c++17-extensions flag to metal flags
* Add -Wno-c++17-extensions flag to metal extension flags
2025-02-26 10:02:06 -08:00
Awni Hannun
7d042f17fe
Double for lapack ( #1904 )
...
* double for lapack ops
* add double support for lapack ops
2025-02-25 11:39:36 -08:00
Awni Hannun
7face5d9fd
fix cpu compile ( #1897 )
2025-02-24 14:10:30 -08:00
Awni Hannun
a44dc4bdb0
fix leaking objc ( #1898 )
2025-02-24 13:57:59 -08:00
Awni Hannun
2d0f384b6f
fix simd erf_inv ( #1896 )
2025-02-24 13:57:47 -08:00
Awni Hannun
8ff84b5c43
fix version and expose command queue getter ( #1892 )
2025-02-20 15:25:15 -08:00
Jesper Stemann Andersen
0ebc8a3d25
Fixed issue where Clang on FreeBSD failed to compile mlx/backend/cpu/quantized.cpp ( #1890 )
2025-02-20 12:02:12 -08:00
Awni Hannun
bbda0fdbdb
Allow non-square lu ( #1889 )
2025-02-20 08:13:23 -08:00
Abe Leininger
344a29506e
Enforce triangular matrix form in tri_inv
( #1876 )
...
* fix tri_inv bug
* Revert "fix tri_inv bug"
This reverts commit b74b290201
.
* Make sure that tri_inv returns a triangular matrix
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
2025-02-19 12:42:33 -08:00
Angelos Katharopoulos
71de73a668
Fix convs by reverting #1803 ( #1882 )
2025-02-18 14:36:34 -08:00
Awni Hannun
5274c3c43f
compiler warnings are errors ( #1870 )
2025-02-17 00:07:49 -08:00
Angelos Katharopoulos
1762793989
Remove unused uniform ( #1867 )
2025-02-14 15:51:41 -08:00
Jagrit Digani
2dc307f2e6
Winograd Update for Small batches ( #1803 )
...
* Build in padding to Winograd kernels
* Add new fused Winograd kernel
* Enable weight flipping in Winograd kernels
2025-02-14 13:08:13 -08:00
Awni Hannun
7aea5b1895
Allow dynamic ops per buffer based on dispatches and memory ( #1864 )
...
* Allow dynamic ops per buffer based on dispatches and memory
* add initial arch values
2025-02-13 19:18:22 -08:00
Awni Hannun
428f589364
Revert "More buffer donation in some cases ( #1858 )" ( #1863 )
...
This reverts commit d274ae77f2
.
2025-02-13 14:21:44 -08:00
Alex Barron
5cd97f7ffe
Bitwise Inverse ( #1862 )
...
* add bitwise inverse
* add vmap + fix nojit
* inverse -> invert
* add to compile + remove unused
2025-02-13 08:44:14 -08:00
Awni Hannun
e425dc00c0
Faster small batch qmv ( #1861 )
...
* faster small batch qmv
* swap batch and block dims for qvm and qmv regular
2025-02-12 22:02:36 -08:00
Awni Hannun
d274ae77f2
More buffer donation in some cases ( #1858 )
...
* more donation
* fix
* add test
2025-02-12 19:41:37 -08:00
Angelos Katharopoulos
0145911bea
Fixes output donation for IO ops on the GPU ( #1857 )
2025-02-12 10:52:30 -08:00
Cheng
142b77751d
Fix compilation error on Windows ( #1844 )
2025-02-10 19:53:05 -08:00
Abe Leininger
a5ededf1c3
CPU LU factorization and linear solvers ( #1451 )
...
* linalg solve backend
* nits
* more nits + fix
* luf primitive and lu, solve, and solve_triangular backends
* changes / nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2025-02-10 12:32:24 -08:00
Awni Hannun
1c0c118f7c
Fp64 on the CPU ( #1843 )
...
* add fp64 data type
* clean build
* update docs
* fix bug
2025-02-07 15:52:22 -08:00
Jagrit Digani
b6c6552d20
Add missing #pragma once ( #1838 )
2025-02-06 11:11:22 -08:00
Awni Hannun
af1b725fda
Fix a couple of slicing bugs ( #1827 )
...
* fix a few bugs
* fix conv grad
* speedup test
* comment
2025-02-05 19:50:08 -08:00
Awni Hannun
9174606d4c
fix sort ( #1835 )
2025-02-05 17:16:27 -08:00
Awni Hannun
fe5987b81d
faster sort ( #1831 )
2025-02-05 06:10:22 -08:00
Awni Hannun
a229c8cef0
don't duplicate malloc with custom kernel init ( #1830 )
2025-02-04 13:20:57 -08:00
Awni Hannun
1156c84e86
Refactor common into cpu specific and truly common ( #1817 )
...
* refactor
* fix extension example
* fix no-cpu
2025-02-03 15:58:02 -08:00
Jesper Stemann Andersen
2d8e667400
MinGW support ( #1806 )
...
* Changed /bin/bash to bash for generating compiling preamble
* Fix wrt jit_compiler mingw like msvc wrt. WEXITSTATUS
* Solved ambiguity wrt. bernoulli test shape
* Disabled distributed/ring on Windows
* Fixed jit_compiler command wrt. MinGW
* Extended jit_compiler patch wrt. WEXITSTATUS to FreeBSD
2025-02-01 12:40:06 -08:00
Awni Hannun
80c863b972
Remove accelerate/ ( #1816 )
...
* remove accelerate
* comments
* neon reduction
2025-02-01 07:18:26 -08:00
Angelos Katharopoulos
f5cc1eea72
Allow different value dimensions in sdpa_vector ( #1811 )
2025-01-31 20:58:59 -08:00
Awni Hannun
b7c9f1d38f
scatter axis + gather axis primitives ( #1813 )
...
* scatter axis + gather axis primitives
* add transforms
* comment
2025-01-31 20:48:08 -08:00
Awni Hannun
c6fc07f1f4
Unify CPU matmuls, remove unused accelerate conv ( #1814 )
...
* unify matmuls
* Update mlx/backend/common/matmul.cpp
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
2025-01-31 14:43:37 -08:00
Awni Hannun
4758c8baa1
Start to cleanup/unify accelerate and common back-ends (Part 1/N) ( #1777 )
...
* start to cleanup/unify accelerate and common back-ends
* more progress
* simplify
* add half type and allow infs in simd exp
* unify softmax + quantized, more dispatches to simd quantized mm
* add sin/cos, use simd in vector-scalar ops
* faster CPU vectorize quant
* faster erf/erfinv
2025-01-29 14:34:49 -08:00
Awni Hannun
e6a7ab9675
non square qr ( #1783 )
2025-01-21 14:07:47 -08:00
Angelos Katharopoulos
1f4c127fb9
Move some kernels to get_template_definition
( #1782 )
2025-01-21 08:59:44 -08:00
Awni Hannun
a4667da1eb
Faster synchronization Fence
primitive ( #1773 )
...
* try faster synchronization
move event
fixes
update bench
fix
fix
* non-functioning kernel
* try alternative fence
* cleanup barrier
* get rid of event_fence
* update benchmarks
* doc string in metal fence
2025-01-17 18:42:19 -08:00
Awni Hannun
f288db8d34
Fix synchronization bug for in stream async works ( #1768 )
2025-01-15 06:07:34 -08:00
Awni Hannun
252e423e81
fix and cleanup event signal/wait for metal ( #1765 )
2025-01-10 18:37:26 -08:00
Alex Barron
c7b0300af5
Fix batched qmv bug ( #1758 )
2025-01-09 11:45:57 -08:00
Awni Hannun
da8c885784
Simplify removes no-ops from the tape ( #1759 )
...
* simplify removes no-ops from the tape
* comment
2025-01-09 11:23:19 -08:00
Awni Hannun
1ccaf80575
Dynamic broadcasting for shapeless compile/export ( #1722 )
...
* working towards dynamic broadcast
* shapeless broadcast
* fix build + nits
* use broadcast arrays in quantize matmul
* some cleanup / consistency
* mend
* some comments
* add vjp, jvp for broadcast axes
2025-01-09 11:04:24 -08:00
Cheng
ec36bfa317
Include command stdout in error message ( #1756 )
...
* Include command stdout in error message
* On Windows pclose returns the exit code
2025-01-08 07:17:03 -08:00
Cheng
b8f76f717a
Print exceptions in eval_cpu/eval_gpu and abort ( #1754 )
2025-01-08 06:31:09 -08:00
Awni Hannun
d1766f2c70
Add boolean mask support in vector SDPA ( #1757 )
2025-01-07 20:24:53 -08:00
Awni Hannun
516ded618b
Dynamic slicing ( #1741 )
...
* dynamic slice and slice update
* python bindings + tests + fix set item
* fix compile issue
* comment
* fix jit
2025-01-07 14:02:16 -08:00
Awni Hannun
d5ec172c95
Allow boolean mask in sdpa ( #1753 )
...
* allow boolean mask in sdpa
* more permissive donation in ternary
2025-01-06 16:57:07 -08:00
Awni Hannun
058d6ce683
mpi send use input as output ( #1750 )
...
* mpi send use input as output
* move earlier
2025-01-06 06:08:43 -08:00
Awni Hannun
259025100e
Fix nd ternary on GPU ( #1746 )
2025-01-03 11:52:17 -08:00
Awni Hannun
6fa0501387
Fix concatenate/slice_update vjp + reduce binary size ( #1735 )
...
* fix concatenate vjp + reduce binary size
* also cast in slice update
2025-01-02 16:36:33 -08:00
Cheng
935c8c4bb1
Make mx.compile work on Windows ( #1697 )
...
* Invoke MSVC on Windows in mx.compile
* Export kernel symbol on MSVC
* Remove unused template
* Parse env pairs in a robust way
* No need of cassert
* Remove unnecessary helpers
* Fix right trim
* Move command building to a separate file
* Missing header
* Do not pollute cwd with cl.exe
* Simplify str concat
* Pass output dir
* Fix styling
2024-12-24 07:02:33 -08:00
Valentin Roussellet
88f993da38
Explicit parentheses around some logical operators ( #1732 )
...
* fix some warnings
* format
2024-12-24 07:02:20 -08:00
Awni Hannun
ebfe64b92d
shapeless slice update and broadcast when possible ( #1727 )
2024-12-23 11:25:15 -08:00
Awni Hannun
0308e9af71
Allow offset to be an mx.array for mx.fast.rope
( #1724 )
...
* allow offset for rope
* comment
2024-12-19 15:51:44 -08:00
Awni Hannun
e03f0372b1
More shape type ( #1705 )
...
* more shape type
* fix
2024-12-19 08:08:20 -08:00
Awni Hannun
7480059306
track resource limit and throw if exceeded ( #1718 )
2024-12-18 18:45:58 -08:00
Cheng
070bd433ab
Shorter kernel name for Windows ( #1701 )
...
* Shorter kernel name for Windows
* Only hash the clipped part
2024-12-17 18:51:38 -08:00
Awni Hannun
9111999af3
Fix small sort with metal validation ( #1695 )
2024-12-12 09:21:45 -08:00
Awni Hannun
6bd28d246e
Allow no copy negative strides in as_strided and slice ( #1688 )
...
* allow no copy negative strides in as_strided and slice
* fix jit
* fix jit
2024-12-12 08:59:45 -08:00
Cheng
4d595a2a39
Make compiled preamble work in MSVC ( #1675 )
...
* Make compiled preamble work in MSVC
* Remove logging
* Only use powershell for MSVC
2024-12-12 08:55:49 -08:00
Awni Hannun
4e1e9520e1
Flatten and unflatten ( #1692 )
...
* flatten and unflatten
* fix grad
* fix shape infer
* use squeeze + unsqueeze in get_item
2024-12-11 21:51:37 -08:00
Awni Hannun
f3dfa36a3a
Fix x86 tests ( #1691 )
...
* fix x86 tests
* comment
2024-12-11 07:47:18 -08:00
Awni Hannun
f76a49e555
ExpandDims
primitive (#1687 )
...
* add squeeze primitive
* simplify squeeze, use in gather
* fix
* fix
* fix
* fix
* fix no cpu
* use squeeze in matmul and friends
* expand dims primitive
* comment
2024-12-10 16:39:07 -08:00
Awni Hannun
40c62c1321
Use int64 stride everywhere ( #1671 )
...
* use int64 stride everywhere
* fix ext
* fix ext
* more shape + cleanup
* one more
* few more
2024-12-09 11:09:02 -08:00
Cheng
6ae5423b4a
Do not pass integers to isnan ( #1664 )
2024-12-07 18:26:23 -08:00
Cheng
3ceb341a75
Use correct complex type for MSVC ( #1660 )
2024-12-07 18:25:22 -08:00
Alex Barron
95c4a2e3af
add back conditionaltype ( #1655 )
2024-12-06 11:12:01 -08:00
Jagrit Digani
9d40e521d7
Stop matrix copies with new attention kernel ( #1639 )
2024-12-02 14:12:38 -08:00
Jesper Stemann Andersen
e4eeb4e910
Added missing unordered_map includes ( #1635 )
...
* Added missing includes in mlx/io.h and mlx/backend/metal/metal.h
* Added additional missing unordered_map includes that fixes build on FreeBSD
2024-12-02 07:03:03 -08:00
Ikko Eltociear Ashimine
9bc2183a31
docs: update device.cpp ( #1632 )
...
unecessary -> unnecessary
2024-11-27 20:58:26 -08:00
Awni Hannun
d4b222b6d3
Fix some leaks and races ( #1629 )
...
* fix leak and fix potential race
* more leak fixes
* fix one more
2024-11-27 20:01:20 -08:00
Jesper Stemann Andersen
af2af818a6
Enables build for *-linux-musl ( #1627 )
...
Also contributes to being able to build for *-w64-mingw32.
Cf. https://github.com/JuliaPackaging/Yggdrasil/pull/9761
2024-11-27 13:14:24 -08:00
Awni Hannun
211411faf2
fix large ops ( #1620 )
2024-11-24 09:17:10 -08:00
Alex Barron
6f7986d592
Cleaner qmv
/qvm
( #1616 )
2024-11-22 11:14:08 -08:00
Jagrit Digani
02bec0bb6d
Matrix Attention kernel ( #1610 )
...
* Rough INIT
* [WIP]: Loading and Matmuls added
* [WIP]: Reductions and min working aligned kernel at headdim = 64
* [WIP] Added headdim 80 for testing
* [WIP] Update dispatch params for testing
* [WIP] Add support for unaligned seq lengths - still looks messy
* Update sdpa_benchmarks
* Update sdpa_benchmarks
* Update sdpa_benchmarks
* Enable gqa support
* Update benchmark and switch off 128 headdim
* Update headdim 128 tuning
* Remove older fast attention code. Write out O strided
* Disable hd=128 until further optimizations
* Enable bf16
* Fix data size bug
* Enable attn build outside of jit
2024-11-22 10:34:05 -08:00
Alex Barron
c79f6a4a8c
3 and 6 bit quantization ( #1613 )
...
* Support 3 and 6 bit quantization
2024-11-22 10:22:13 -08:00
Awni Hannun
0c5eea226b
Reduce specializations ( #1607 )
...
* start of reduce specializations
* fix all reduce
* fix many dims
* fix
* non-jit tests clear
* cleanup instantiations
* cpu merges
* change dim specializations
* optimize
* fix jit
* fix jit
* use higher precision for integer sum+prod
* fixes
2024-11-21 19:53:00 -08:00
Awni Hannun
dcca0d7477
contiguous op / prim ( #1612 )
2024-11-21 19:51:49 -08:00
Awni Hannun
61d787726a
Fix view scalar bug segfault ( #1603 )
...
* fix view scalar bug
* fix view scalar bug
* one more fix
2024-11-19 10:54:05 -08:00
Awni Hannun
2419edd5b2
Faster indexing math in a few kernels ( #1589 )
...
* wip: faster compiled kernels
* faster general unary with uint specialization
* index type in compiled, unary, binary, ternary, copy
* fix jit
* jit fix
* specialize gather + scatter
* nit in docs
2024-11-18 19:52:00 -08:00
Awni Hannun
9d7fa6b8e6
Use osx deployment target to pick Metal version ( #1595 )
...
* choose metal based on deployment target rather than system version
* nit
* unused compile def
2024-11-18 19:16:49 -08:00
Angelos Katharopoulos
073076ac7d
2-Pass Sdpa Inference Kernel ( #1597 )
2024-11-18 17:31:53 -08:00
Awni Hannun
9bd03dd9b4
More buffer donation with no-ops ( #1591 )
...
* more donation
* fix test
* fix build
2024-11-18 08:35:41 -08:00
Awni Hannun
6931f84412
fix dispatch threads for a few kernels ( #1594 )
2024-11-18 08:35:25 -08:00
Awni Hannun
610af352d4
Dispatch bf16 at run time when using the JIT ( #1584 )
...
* Dispatch bf16 at run time when using the JIT
* fix extension
* fix extension build
* fix extension build
* Update utils.h
2024-11-15 16:54:36 -08:00
Awni Hannun
b35f1e3c9c
fix donation in sdpa ( #1587 )
2024-11-13 17:21:13 -08:00
Awni Hannun
dfa0b9aab4
Cpu fast quantize ( #1578 )
...
* cpu quantize
* fix
2024-11-08 20:10:39 -08:00
Alex Barron
a4c47b0276
OOB QMV fix ( #1579 )
...
* fix oob access in qmv
* skip more
* fix small case
2024-11-08 17:59:45 -08:00
Alex Barron
111fefd5e9
Fix OOB access in qmv ( #1577 )
...
* fix oob access in qmv
* skip more
2024-11-08 15:41:30 -08:00
Awni Hannun
c1fe1ef081
Bfs width limit ( #1568 )
...
* width limit
* fix
* large limit
* put env vars in env namespace
2024-11-08 15:00:46 -08:00
Awni Hannun
9f0d5c12fc
Fully wrap the command encoder ( #1572 )
...
* fully wrap the command encoder
* use consistent style + fix extensions
2024-11-08 11:50:21 -08:00
Awni Hannun
9a3842a2d9
fix ( #1566 )
2024-11-06 17:10:33 -08:00