ATurker
a7fae8a176
fix: conv_general differences between gpu, cpu ( #2070 )
...
* fix general_conv padding
* fix bugs
* add test
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2025-05-09 10:26:52 -07:00
Cheng
0cae0bdac8
CUDA backend: backbone ( #2075 )
2025-05-06 21:26:46 -07:00
Awni Hannun
5a1a5d5ed1
fix input coherent kernel launch ( #2153 )
2025-05-05 17:30:50 -07:00
Cheng
1683975acf
Move common gpu primitives to backend/gpu ( #2145 )
2025-05-05 13:45:29 -07:00
Awni Hannun
af705590ac
fix batched vector sdpa ( #2152 )
2025-05-05 13:13:03 -07:00
Awni Hannun
825124af8f
fix bw for elementwise ops ( #2151 )
...
* fix bw for elementwise ops
* add compile
* fix
* fix
* fix
* fix
2025-05-05 06:15:04 -07:00
Awni Hannun
9c5e7da507
fix compile merging ( #2150 )
2025-05-02 15:08:50 -07:00
Angelos Katharopoulos
481349495b
GPU Hadamard for large N ( #1879 )
2025-05-01 17:19:17 -07:00
Awni Hannun
9daa6b003f
fix shapeless export ( #2148 )
2025-05-01 15:02:02 -07:00
Awni Hannun
e496c5a4b4
fix integer overflow in qmm ( #2143 )
2025-04-30 09:28:56 -07:00
Awni Hannun
f1606486d2
Generalize gpu backend ( #2138 )
...
* generalize gpu backend
* fix no_gpu build
* fix no_gpu build
* generalize gpu backend
2025-04-30 09:08:17 -07:00
Aashiq Dheeraj
bb6565ef14
add fftshift and ifftshift fft helpers ( #2135 )
...
* add fftshift and ifftshift fft helpers
* address comments
* axes have to be iterable
* fix fp error in roll + add test
---------
Co-authored-by: Aashiq Dheeraj <aashiq@aashiq-mbp-m4.local>
2025-04-29 22:13:45 -07:00
Awni Hannun
7bb063bcb3
Enable vjp for quantized scale and bias ( #2129 )
...
* Enable vjp for quantized scale and bias
* higher tol
2025-04-29 13:03:09 -07:00
Alex Chi Z.
b36dd472bb
return library if it is successfully loaded ( #2131 )
2025-04-29 07:30:36 -07:00
hdeng-apple
167b759a38
Fix typos ( #2136 )
2025-04-29 07:26:05 -07:00
1ndig0
6b2d5448f2
Fix the error message in mx.right_shift
and mx.left_shift
( #2121 )
...
* update right_shift and lef_shift
* simplify
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2025-04-25 09:14:28 -07:00
Awni Hannun
eaf709b83e
patch ( #2119 )
2025-04-24 16:11:07 -07:00
Angelos Katharopoulos
f0e70afff0
Fix swift pm load ( #2117 )
2025-04-24 10:58:29 -07:00
hdeng-apple
86984cad68
Remove static initializers ( #2059 )
...
* Remove static initializers in device.cpp, load.cpp, pocketfft.h
* Remove static initializer InTracing::trace_stack
* Remove static initializer of CompilerCache cache
* Revert changes in pocketfft.h
* Remove duplicate private section of thread_pool()
2025-04-24 06:14:49 -07:00
Awni Hannun
fbc89e3ced
fix pinv ( #2110 )
2025-04-23 13:08:28 -07:00
hdeng-apple
38c1e720c2
Search mlx.metallib in macOS framework "Resources" dir ( #2061 )
...
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
2025-04-23 09:53:13 -07:00
Param Thakkar
600e87e03c
Added output_padding parameters in conv_transpose ( #2092 )
2025-04-23 09:26:33 -07:00
Yury Popov
1d2c9d6a07
Complex scan ( #2094 )
2025-04-22 18:56:28 -07:00
Awni Hannun
e8ac6bd2f5
irfft throws instead of segfaults on scalars ( #2109 )
2025-04-22 10:25:55 -07:00
Awni Hannun
fdadc4f22c
Add more complex unary ops ( #2101 )
2025-04-21 13:04:54 -07:00
Awni Hannun
79b527f45f
conv vmap ( #2102 )
2025-04-21 13:04:39 -07:00
Awni Hannun
dc4eada7f0
Use unordered map for kwargs in export/import ( #2087 )
...
* use unordered map for kwargs in export/import
* comment
2025-04-21 07:17:22 -07:00
Cheng
70ebc3b598
Return const ref in array::data_shared_ptr ( #2100 )
2025-04-21 07:17:09 -07:00
Cheng
b13f2aed16
Introduce macros for dispatching dynamic dtypes as static types ( #2073 )
2025-04-19 06:16:30 -07:00
Param Thakkar
5f04c0f818
Fixed shift operations issue ( #2080 )
...
* Fixed shift operations issue
* Added tests and fixes
* Fixed loop syntax error
* Added tests for bool
* Fixed typo
2025-04-18 14:28:33 -07:00
Awni Hannun
b529515eb1
minor bump ( #2081 )
2025-04-17 14:57:11 -07:00
Angelos Katharopoulos
3cde719eb7
Route to gather qmm only for many tokens per expert ( #2082 )
2025-04-17 14:53:08 -07:00
Angelos Katharopoulos
5de6d94a90
Gather qmm batched kernel and refactoring of quantized ( #2078 )
2025-04-17 13:53:11 -07:00
Angelos Katharopoulos
99eefd2ec0
Gather mm new kernel and small refactoring ( #2040 )
2025-04-14 16:37:36 -07:00
Yury Popov
e9e268336b
LogCumSumExp ( #2069 )
2025-04-13 01:27:29 -07:00
Angelos Katharopoulos
c4189a38e4
Add float mask to sdpa vector ( #2068 )
2025-04-11 17:29:40 -07:00
Awni Hannun
ef7ece9851
fix fft bug ( #2062 )
2025-04-10 19:41:27 -07:00
Angelos Katharopoulos
ddaa4b7dcb
Fix the test and add custom min/max reductions for uncommon MPI types ( #2060 )
2025-04-10 17:01:17 -07:00
Cheng
dfae2c6989
Fix MSVC build due to use of M_LN2 ( #2058 )
2025-04-10 07:41:41 -07:00
Anastasiia Filippova
515f104926
Min / max reductions ( #2041 )
2025-04-09 23:22:20 -07:00
Angelos Katharopoulos
9ecefd56db
Do not load the default lib if another is requested ( #2055 )
2025-04-09 13:31:38 -07:00
Awni Hannun
e5d35aa187
no sdpa in grad ( #2054 )
2025-04-08 19:13:54 -07:00
Awni Hannun
00794c42bc
Fix causal mask sdpa vec ( #2053 )
...
* fix sdpa vector causal mask
* test
2025-04-08 09:11:23 -07:00
Cheng
08a1bf3f10
Remove Event::Signal() ( #2052 )
2025-04-08 06:20:27 -07:00
Awni Hannun
60c4154346
Only request residency once ( #2051 )
2025-04-07 10:47:51 -07:00
Awni Hannun
f2c85308c1
add a half simd gemm fallback ( #2046 )
...
* add a half simd gemm fallback
* nit
2025-04-07 09:31:29 -07:00
Awni Hannun
1a28b69ee2
only add to residency set once ( #2049 )
2025-04-06 17:38:25 -07:00
Cheng
6cf48872b7
wait_for_one should wait for task to finish ( #2047 )
2025-04-05 20:05:16 -07:00
Awni Hannun
86389bf970
patch bump ( #2043 )
2025-04-03 13:15:18 -07:00
Jagrit Digani
3290bfa690
Add new sdpa function overload ( #2035 )
...
* Add new sdpa function overload
* Address comments
* Remove std::varaint from cpp sdpa function
2025-04-03 11:58:28 -07:00
Jagrit Digani
8777fd104f
Depthwise Conv2D optimization ( #2036 )
...
- Add new specialized kernel for small kernel (kernels size <= 7), small strides (strides <= 2) depthwise 2d convolutions
- Add related tests
2025-04-03 09:42:04 -07:00
Awni Hannun
c41f7565ed
fix softmax / logsumexp ( #2042 )
2025-04-03 08:32:59 -07:00
Awni Hannun
9ba81e3da4
tune quant dispatch ( #2031 )
2025-04-02 20:05:54 -07:00
Awni Hannun
c23888acd7
Fix build warning ( #2033 )
2025-04-01 14:42:27 -07:00
Awni Hannun
f98ce25ab9
fix residency set for real ( #2032 )
2025-04-01 12:59:48 -07:00
Awni Hannun
de5f38fd48
Custom logsumexp ( #2028 )
...
* initial custom logsumexp
* more tests
* comments + fix
2025-03-31 07:36:55 -07:00
Angelos Katharopoulos
ec2854b13a
Swap -inf for finite_minimum value ( #2029 )
2025-03-30 21:55:04 -07:00
Jesper Stemann Andersen
5f5770e3a2
Fix CPU sign for unsigned ints ( #2024 )
...
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
2025-03-30 17:56:59 -07:00
Awni Hannun
28f39e9038
Log for complex numbers in Metal ( #2025 )
...
* Log for complex numbers in Metal
* fix log2
2025-03-30 17:04:38 -07:00
Awni Hannun
b2d2b37888
fix residency set clearing ( #2027 )
2025-03-30 16:27:26 -07:00
Awni Hannun
13b26775f1
use minimum deployment target ( #2016 )
2025-03-28 14:31:53 -07:00
Awni Hannun
05d7118561
causal vector sdpa ( #2018 )
...
* causal vector sdpa
* get rid of memory threshold
2025-03-28 12:36:13 -07:00
Awni Hannun
98b901ad66
enable complex gemm ( #2017 )
2025-03-28 10:45:13 -07:00
Awni Hannun
5580b47291
iinfo and scalar overflow detection ( #2009 )
2025-03-27 19:54:56 -07:00
Awni Hannun
bc62932984
sdpa specialization for head dim 256 ( #2007 )
2025-03-27 19:31:25 -07:00
Awni Hannun
916fd273ea
wire cache ( #2006 )
2025-03-25 18:54:01 -07:00
Cheng
eda7a7b43e
Do not join threads during process exit on Windows ( #1738 )
2025-03-25 06:33:08 -07:00
Awni Hannun
aba899cef8
patch bump ( #2000 )
2025-03-24 12:47:05 -07:00
Jagrit Digani
6a40e1c176
Fix looping limit in causal attention ( #1999 )
2025-03-24 12:28:00 -07:00
Jesper Stemann Andersen
9307b2ab8b
Fixed 32-bit platform support for distributed/ring implementation ( #1996 )
...
Replaced unsigned long integer literals with size_t literals in ring implementation, e.g., 1UL with size_t(1).
2025-03-24 08:08:40 -07:00
Jesper Stemann Andersen
522d8d3917
Added missing netinet/in.h include that fixes build on FreeBSD ( #1997 )
...
Defines IPPROTO_TCP.
2025-03-24 08:07:34 -07:00
Awni Hannun
a84cc0123f
promote mask when needed ( #1998 )
2025-03-23 19:58:28 -07:00
Andrey Velichkevich
f018e248cd
fix(backend): Include algorithm library in Allocator ( #1992 )
...
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2025-03-22 21:27:51 -07:00
Angelos Katharopoulos
4eef8102c9
Distributed layers ( #1270 )
2025-03-21 13:52:17 -07:00
Angelos Katharopoulos
69e4dd506b
Add a ring all gather ( #1985 )
2025-03-21 13:36:51 -07:00
Angelos Katharopoulos
25814a9458
Disable mpi on version mismatch ( #1989 )
2025-03-21 13:36:26 -07:00
Awni Hannun
2a980a76ce
Add stats and limit to common allocator and enable tests ( #1988 )
...
* add stats to common allocator and enable tests
* linux memory and default
* fix
2025-03-21 12:28:36 -07:00
Angelos Katharopoulos
d343782c8b
Cross platform libmpi loading ( #1975 )
2025-03-21 11:23:10 -07:00
Awni Hannun
4e1994e9d7
move memory APIs into top level mlx.core ( #1982 )
2025-03-21 07:25:12 -07:00
Awni Hannun
7b7e2352cd
fix malloc or wait deadlock ( #1976 )
2025-03-20 16:48:43 -07:00
Awni Hannun
1177d28395
patch bump ( #1981 )
2025-03-20 15:12:22 -07:00
Awni Hannun
005e7efa64
fix mask in sdpa ( #1980 )
...
* fix mask in sdpa
* fix attention mask
* Re-enable routing for array mask
---------
Co-authored-by: Jagrit Digani <digani@apple.com>
2025-03-20 14:53:12 -07:00
Jagrit Digani
b42d13ec84
Update attention tests to show diff, disable array masks ( #1978 )
2025-03-20 14:25:38 -07:00
Jagrit Digani
9adcd1a650
Support fused masking in Attention ( #1924 )
...
* Update API to allow mask='causal' in fast::sdpa
* Add fallback
* Update steel::AttnParams
* Fix typo
* WIP, basic causal
* Update tests
* Update benchmarking
* Update masking loop limits
* Add bool masking and update tests
* Update additive mask
* Update benchmarks
* Update benchmarks
* Update tests
* Update for bfloat error
* Update early exit
* Add random seed to tests
2025-03-20 11:01:32 -07:00
Awni Hannun
3c164fca8c
Fix multistream GPU deadlock ( #1969 )
...
* fix multistream GPU deadlock
* comments
2025-03-20 07:19:47 -07:00
Awni Hannun
f90206ad74
Guard nullptr dereference ( #1972 )
...
* guard nullptr dereference
* comment
2025-03-19 16:24:10 -07:00
Cheng
0a9777aa5c
Do not define MLX_VERSION globally ( #1966 )
2025-03-18 07:12:40 -07:00
Awni Hannun
c6ea2ba329
Use same accumulation precision in gemv as gemm ( #1962 )
...
* use same accumulation precision in gemv as gemm
* faster
* fix compile
2025-03-16 07:13:24 -07:00
Awni Hannun
32da94507a
fix vmap for flatten ( #1955 )
2025-03-11 10:42:22 -07:00
Awni Hannun
736a340478
reduce binary size ( #1952 )
2025-03-11 06:30:44 -07:00
Awni Hannun
117e1355a2
fix copy for large arrays ( #1953 )
2025-03-10 15:04:25 -07:00
Awni Hannun
3c3e558c60
Support transposed head/seq for kv ( #1950 )
...
* support transposed head/seq for kv
* fix flaky test
* nit
2025-03-10 10:53:45 -07:00
Awni Hannun
c4230747a1
redesign for faster cpu/gpu synch ( #1869 )
...
* redesign for faster cpu/gpu synch
* load + more async CPU
* use command encoder API and move more ops to use it
* make fence back-end generic + CPU only fence
* faster build
* fix async eval
* fixes + handle temporaries
* fix / improve cpu conv
* remove unused status, fix siblings
* fix extensions
* fix
* fix no cpu build
* format
* comments
* fix perf regression, remove unecessary abort
* fix events, task limit cpu
* fix waiting
* fix donation / temporaries in normalization
2025-03-06 19:23:38 -08:00
Awni Hannun
5245f12a46
always use json ( #1938 )
2025-03-06 15:35:56 -08:00
Awni Hannun
f599c11bc8
bump ( #1931 )
2025-03-05 13:16:53 -08:00
Angelos Katharopoulos
0792ff02ff
Only fail when 10 consecutive socket errors occur ( #1928 )
2025-03-05 13:16:19 -08:00
Alex Barron
fd0d63ba5b
Affine quant always in fp32 ( #1925 )
...
* do affine quant in fp32
* static cast
2025-03-04 17:50:19 -08:00
Abe Leininger
3835a428c5
Adds nuclear norm support ( #1894 )
...
* adjust norm unit test tolerance
2025-03-04 13:26:02 -08:00
Awni Hannun
e613d0eaf0
SDPA support for small batch (over sequence) queries ( #1922 )
...
* batch query sdpa
* batch sdpa for query
2025-03-04 10:59:04 -08:00
Awni Hannun
6bcd6bcf70
fix donation in scan ( #1917 )
2025-03-03 11:30:59 -08:00