Commit Graph

  • 6b6e1d3ac4 fix custom kernel test Awni Hannun 2025-08-17 06:11:46 -0700
  • f433c9a421 Revert back to old rms norm kernel Cheng 2025-08-13 04:13:26 -0700
  • 3a2b90fc1a Implement forward rms_norm with cuDNN Cheng 2025-08-12 04:42:58 -0700
  • 290b45eba3 Add RAII managed CudaGraph class Cheng 2025-08-11 18:20:00 -0700
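Commit 290b45eba3 ties a CUDA graph's lifetime to an owning object so the graph is destroyed exactly once when it goes out of scope (RAII). A rough Python analogue of that ownership pattern, using a context manager with stand-in create/destroy callables (none of these names come from MLX):

```python
class ManagedGraph:
    """RAII-style owner (hypothetical sketch): the handle is released
    exactly once when the scope exits, even if an exception is raised."""
    def __init__(self, create, destroy):
        self._destroy = destroy
        self.handle = create()
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        if self.handle is not None:
            self._destroy(self.handle)
            self.handle = None
        return False

released = []
with ManagedGraph(lambda: "graph#1", released.append) as g:
    pass                      # use g.handle while the graph is alive
print(released)  # ['graph#1']
```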
  • c422050ca7 Update cuDNN Frontend to v1.14 (#2505) Cheng 2025-08-17 19:13:01 +0900
  • 753081052e Ensure no oob read in gemv_masked Angelos Katharopoulos 2025-08-17 00:40:32 -0700
  • b33457ae4d Ensure small sort doesn't use indices if not argsort Angelos Katharopoulos 2025-08-16 21:43:24 -0700
  • ebfc6777dd Update cuDNN Frontend to v1.14 Cheng 2025-08-16 17:18:24 -0700
  • 708ca546b2 use defaults for scalar ops Awni Hannun 2025-08-16 06:36:46 -0700
  • 1ba18ff7d9 [CUDA] Fix conv grads with groups (#2495) Cheng 2025-08-16 10:09:18 +0900
  • 37b440faa8 Clean up code handling both std::vector and SmallVector (#2493) Cheng 2025-08-16 09:01:10 +0900
  • 888b13ed63 Remove the hack around SmallVector in cpu compile (#2494) Cheng 2025-08-16 08:17:24 +0900
  • 4abb218d21 The naive_conv_2d is no longer used (#2496) Cheng 2025-08-16 07:57:30 +0900
  • 6441c21a94 Faster general unary op (#2472) Awni Hannun 2025-08-15 15:04:12 -0700
  • bd9977acbb copy general Awni Hannun 2025-08-15 13:38:56 -0700
  • 102f3ba579 binary two Awni Hannun 2025-08-15 12:36:44 -0700
  • 5e542d98e0 fix + comment Awni Hannun 2025-08-15 12:11:23 -0700
  • f403ea1764 faster general ops + reorg Awni Hannun 2025-08-15 10:59:34 -0700
  • 1034009b82 faster general unary op Awni Hannun 2025-08-06 19:57:37 -0700
  • f852acdeed The naive_conv_2d is no longer used Cheng 2025-08-15 10:29:53 +0900
  • 400f8457ea Experimenting with a gemm based on the cuda steel utils jagrit06/cuda-gemm-experiment Jagrit Digani 2025-08-14 11:27:50 -0700
  • adb64ea409 Put the reshape utils in gpu/copy.h Cheng 2025-08-14 03:31:12 -0700
  • 5b05aaad95 [CUDA] Fix conv grads with groups Cheng 2025-08-14 03:21:00 -0700
  • 8bf8034ffd Put reshape utils in one file Cheng 2025-08-14 16:59:38 +0900
  • d00175807f Remove the hack around SmallVector in cpu compile Cheng 2025-08-14 10:58:08 +0900
  • e0ed742ef2 Clean up code handling both std::vector and SmallVector Cheng 2025-08-14 09:07:45 +0900
  • dfb5022eab Rename cu::Matmul to CublasGemm (#2488) Cheng 2025-08-13 09:37:40 +0900
  • ac207ce7aa make code blocks copyable (#2480) Daniel Yeh 2025-08-12 21:29:02 +0200
  • 3ebc047168 Add RAII managed CudaGraph class Cheng 2025-08-11 18:20:00 -0700
  • 3ec3eadf5d Rename cu::Matmul to CublasGemm Cheng 2025-08-11 08:49:38 +0900
  • fce53b61d6 Fix reduce sum/prod overflow (#2477) Abe Leininger 2025-08-12 02:05:33 -0500
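The overflow fixed in #2477 is the classic reduction pitfall: accumulating a sum or product in the same narrow type as the inputs. A hedged sketch (plain Python simulating int32 wraparound, not MLX's actual kernel) of why widening the accumulator matters:

```python
def wrap_i32(x):
    # Simulate two's-complement int32 wraparound.
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

def sum_i32(xs):
    # Narrow accumulator: every partial sum wraps at 32 bits.
    acc = 0
    for x in xs:
        acc = wrap_i32(acc + x)
    return acc

def sum_i64(xs):
    # Wide accumulator: Python ints don't overflow, standing in for int64.
    return sum(xs)

vals = [2**30] * 4            # true sum is 2**32, outside int32 range
print(sum_i32(vals))          # wraps all the way back to 0
print(sum_i64(vals))          # 4294967296
```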
  • 8ae4a76308 Use CMake <4.1 to avoid the nvpl error (#2489) Angelos Katharopoulos 2025-08-12 00:03:42 -0700
  • f2073a0d46 Use CMake <4.1 to avoid the nvpl error Angelos Katharopoulos 2025-08-11 23:15:05 -0700
  • c9e8cc2856 Add acknowledgment for adaptive max pooling contribution Vincent Amato 2025-08-11 23:49:37 -0400
  • 68c7be55bb Add AdaptiveMaxPool1d, AdaptiveMaxPool2d, and AdaptiveMaxPool3d layers Vincent Amato 2025-08-11 23:45:52 -0400
  • c59b46a488 Restyled pooling Vincent Amato 2025-08-11 23:15:38 -0400
  • 652a143b64 Update ACKNOWLEDGMENTS.md to include AdaptiveAvgPool1d Vincent Amato 2025-08-11 23:13:38 -0400
  • e4530007ae Add AdaptiveAvgPool1d layer Vincent Amato 2025-08-11 23:10:56 -0400
  • c320e72635 Add AdaptiveAvgPool2d and AdaptiveAvgPool3d to docs Vincent Amato 2025-08-11 22:34:58 -0400
  • 89b3f69a56 Refactor adaptive pooling for style consistency Vincent Amato 2025-08-11 22:12:14 -0400
  • 634ce07a3e Add AdaptiveAvgPool2d and AdaptiveAvgPool3d Vincent Amato 2025-08-11 21:17:49 -0400
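Adaptive pooling layers like the ones added in this run of commits map any input length onto a fixed output length by choosing bin boundaries per output element. A minimal 1-D sketch using the usual floor/ceil bin convention (an assumption about the layer's semantics, not code taken from MLX):

```python
import math

def adaptive_avg_pool1d(xs, out_size):
    """Minimal adaptive average pooling sketch: output bin i averages the
    input slice [floor(i*L/out), ceil((i+1)*L/out)), so any input length
    maps onto exactly `out_size` values."""
    L = len(xs)
    out = []
    for i in range(out_size):
        start = (i * L) // out_size
        end = math.ceil((i + 1) * L / out_size)
        window = xs[start:end]
        out.append(sum(window) / len(window))
    return out

print(adaptive_avg_pool1d([1, 2, 3, 4, 5, 6], 3))  # [1.5, 3.5, 5.5]
```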
  • e9272203a2 Add Vincent Amato to ACKNOWLEDGMENTS.md Vincent Amato 2025-08-11 20:48:26 -0400
  • 1d9ce9d744 Add cyclic_lr scheduler Vincent Amato 2025-08-11 20:40:33 -0400
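"cyclic_lr" usually refers to the triangular cyclical learning-rate policy (Smith, 2017): the rate ramps linearly between a base and a peak value and back, repeating every cycle. A minimal sketch of that standard formulation (MLX's exact signature may differ):

```python
def cyclic_lr(step, base_lr, max_lr, step_size):
    """Triangular cyclical learning rate: ramp from base_lr to max_lr over
    step_size steps, then back down, and repeat."""
    cycle = step // (2 * step_size)
    x = abs(step / step_size - 2 * cycle - 1)   # goes 1 -> 0 -> 1 per cycle
    return base_lr + (max_lr - base_lr) * (1 - x)

# One full cycle with step_size=2: base, midpoint, peak, midpoint, base.
print([cyclic_lr(s, 0.001, 0.01, 2) for s in range(5)])
```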
  • b7e5e64b13 Add Vincent Amato to ACKNOWLEDGMENTS.md Vincent Amato 2025-08-11 19:42:01 -0400
  • d74a34d2fc Add cosine_annealing_warm_restarts to scheduler documentation Vincent Amato 2025-08-11 19:25:04 -0400
  • 84ef89f548 Add CosineAnnealingWarmRestarts scheduler Vincent Amato 2025-08-11 19:17:47 -0400
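CosineAnnealingWarmRestarts follows the SGDR schedule (Loshchilov & Hutter): cosine-anneal the learning rate from a maximum down toward a minimum over a period, then restart, optionally growing the period each time. A minimal sketch of the standard formula (not MLX's actual implementation):

```python
import math

def cosine_annealing_warm_restarts(t, eta_min, eta_max, T_0, T_mult=1):
    """SGDR schedule sketch: anneal eta_max -> eta_min over the current
    period, restart at the period boundary; T_mult > 1 grows each period."""
    T_i = T_0
    while t >= T_i:                 # locate t within the current period
        t -= T_i
        T_i *= T_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T_i))

print(cosine_annealing_warm_restarts(0, 0.0, 0.1, T_0=4))  # 0.1 at the start
print(cosine_annealing_warm_restarts(4, 0.0, 0.1, T_0=4))  # back to 0.1: restart
```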
  • 02d6079626 make code blocks copyable Chen-Chen Yeh 2025-08-09 18:39:47 +0200
  • 984cefb14d CUDA_VISIBLE_DEVICES to local rank Anastasiia Filippova 2025-08-09 01:43:14 +0200
  • 7fde1b6a1e Fix logsumexp/softmax not fused for some cases (#2474) Cheng 2025-08-09 06:07:17 +0900
  • c28cff7763 fix reduce sum/prod overflow aleinin 2025-08-08 03:40:25 -0500
  • 79f46f1146 Fix logsumexp/softmax not fused for some cases Cheng 2025-08-07 15:55:22 +0900
  • aa7b47481a [CUDA] Optimize set_mm_device_pointers for small ndim (#2473) Cheng 2025-08-08 15:23:30 +0900
  • dadf8d9c93 repeat host -> proc per node Anastasiia Filippova 2025-08-07 15:09:46 +0200
  • 389276e2b8 typo Anastasiia Filippova 2025-08-07 14:16:34 +0200
  • 2e255c8eb4 fixed typo Anastasiia Filippova 2025-08-07 14:02:38 +0200
  • 062aa80b84 minor change to mlx.launch Anastasiia Filippova 2025-08-07 13:20:55 +0200
  • f540b1d612 nccl backend Anastasiia Filippova 2025-08-07 13:11:56 +0200
  • 56be773610 version (#2470) v0.28.0 Awni Hannun 2025-08-07 00:36:04 -0700
  • a9bdd67baa Add CUDA sdpa vector (#2468) Jagrit Digani 2025-08-06 21:40:26 -0700
  • 7fa520e955 Remove batch sdpa Angelos Katharopoulos 2025-08-06 20:26:01 -0700
  • a22d0bf273 Add stricter condition to matrix sdpa sdpav-backup Angelos Katharopoulos 2025-08-06 19:51:14 -0700
  • b18f80164b [CUDA] Optimize set_mm_device_pointers for small ndim Cheng 2025-08-06 19:23:58 -0700
  • f2adb5638d Fix typo in metal command encoder (#2471) Angelos Katharopoulos 2025-08-06 16:58:23 -0700
  • b684d76e8e Fix typo in metal command encoder Angelos Katharopoulos 2025-08-06 16:30:02 -0700
  • f3787b84ba version Awni Hannun 2025-08-06 15:49:03 -0700
  • 728d4db582 Support destination arg in tree flatten/unflatten (#2450) Luca Vivona 2025-08-06 18:34:59 -0400
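A destination argument lets a tree-flatten call accumulate several trees into one flat mapping instead of allocating a new container per call. A hypothetical sketch of the idea using dot-separated keys (illustrative only; the names and the dict-based return here are assumptions, not MLX's real `tree_flatten` API):

```python
def tree_flatten(tree, prefix="", destination=None):
    """Flatten a nested dict into dot-separated keys. The `destination`
    argument (the feature the commit adds, sketched hypothetically) lets
    callers merge several trees into one flat dict."""
    flat = destination if destination is not None else {}
    for k, v in tree.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            tree_flatten(v, prefix=f"{key}.", destination=flat)
        else:
            flat[key] = v
    return flat

flat = tree_flatten({"layers": {"0": {"w": 1.0}}})
tree_flatten({"bias": 2.0}, destination=flat)   # merged into the same dict
print(flat)  # {'layers.0.w': 1.0, 'bias': 2.0}
```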
  • 99d8de8445 Fix cudnn routing Jagrit Digani 2025-08-06 15:05:58 -0700
  • c66b76a8c8 Update routing Jagrit Digani 2025-08-06 15:01:15 -0700
  • f81edd184f Complete 2 pass sdpav Jagrit Digani 2025-08-06 13:57:40 -0700
  • 7f8ba2a003 [WIP] 2 pass sdpav Jagrit Digani 2025-08-06 09:54:41 -0700
  • c28249b81a Add more nvtx range for debug Jagrit Digani 2025-08-01 12:49:24 -0700
  • e74bcdc5e3 Add sdpa file Jagrit Digani 2025-07-25 12:30:50 -0700
  • d8ed6c1aa3 Add base cudnn attention support Jagrit Digani 2025-07-25 12:30:22 -0700
  • db5c7efcf6 revert default cuda install (#2465) Awni Hannun 2025-08-06 06:19:12 -0700
  • 7bb96e4249 fix cublas on h100 (#2466) Awni Hannun 2025-08-06 06:18:58 -0700
  • 193dcb8553 fix cublas on h100 Awni Hannun 2025-08-05 20:05:13 -0700
  • d491cc9d8a revert default cuda install Awni Hannun 2025-08-05 15:53:49 -0700
  • 985b50619b revert default cuda install Awni Hannun 2025-08-05 15:52:36 -0700
  • fa89f0b150 faster gather qmm sorted test (#2463) Awni Hannun 2025-08-05 06:27:40 -0700
  • 8ff54a9595 Simplify the utils a bit Angelos Katharopoulos 2025-08-04 23:01:36 -0700
  • ca973d1e83 fix install tags (#2464) Awni Hannun 2025-08-04 20:01:23 -0700
  • 87cefe0905 fix install tags Awni Hannun 2025-08-04 19:56:51 -0700
  • 696c66d3d3 Merge bc6f00c00e into 828c5f1137 Anastasiia Filippova 2025-08-04 20:07:46 -0600
  • 828c5f1137 Use SmallVector for shapes and strides (#2454) Cheng 2025-08-05 09:41:03 +0900
  • 296860d2fa faster gather qmm sorted test Awni Hannun 2025-08-04 17:05:36 -0700
  • bc6f00c00e Changed nccl reduction to be a part of cuda graph Anastasiia Filippova 2025-08-05 02:00:52 +0200
  • 58eed7e0b5 Merge branch 'main' into nccl_backend Anastasiia Filippova 2025-08-05 01:51:14 +0200
  • 7d86a5c108 Feat: add USE_SYSTEM_FMT CMake option (#2219) Gaétan Lepage 2025-08-05 01:36:11 +0200
  • 0b807893a7 fix wraps compile (#2461) Awni Hannun 2025-08-04 16:14:18 -0700
  • 6ad0889c8a default install cuda on linux (#2462) Awni Hannun 2025-08-04 15:33:05 -0700
  • 434ed933c5 default install cuda on linux Awni Hannun 2025-08-04 14:52:22 -0700
  • 737dd6d1ac Add missing <algorithm> header to jit_compiler.cpp (#2460) Zamderax 2025-08-04 14:00:46 -0700
  • d6d994a385 fix wraps compile Awni Hannun 2025-08-04 13:47:29 -0700
  • 372bffcf1b Add missing <algorithm> header to jit_compiler.cpp Zamderax 2025-08-04 11:35:54 -0700
  • 8d1181d2b9 Convert SmallVector to tuple Cheng 2025-08-04 19:29:25 +0900
  • df9ac5f2f9 Add top-level namespace access for gradient control functions Yannick Muller 2025-08-03 00:46:27 -0400
  • 323a8e958d Format code with pre-commit hooks Yannick Müller 2025-08-02 23:33:00 -0400
  • 1e4bd653db Implement no_grad functionality following PyTorch API Yannick Müller 2025-08-02 23:27:37 -0400
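A no_grad that follows the PyTorch API is a re-entrant context manager: it flips a thread-local "gradients enabled" flag on entry and restores the previous state on exit. A minimal sketch of that contract (the flag and names are stand-ins, not the commit's implementation):

```python
import threading

_state = threading.local()

def grad_enabled():
    return getattr(_state, "grad", True)

class no_grad:
    """PyTorch-style context manager sketch: disables gradient tracking
    inside the block and restores the previous state on exit, so nested
    uses compose correctly."""
    def __enter__(self):
        self._prev = grad_enabled()
        _state.grad = False
    def __exit__(self, *exc):
        _state.grad = self._prev
        return False

with no_grad():
    print(grad_enabled())  # False
print(grad_enabled())      # True again after the block
```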
  • 68e9c60d22 Use SmallVector for shapes and strides Cheng 2025-08-01 21:38:10 +0900
  • aaf78f4c6b Use LRU cache for cuda graph (#2448) Cheng 2025-08-02 21:28:57 +0900
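An LRU cache bounds how many captured graphs stay alive: on overflow the least-recently-used entry is evicted, at which point its graph can be destroyed. A minimal Python sketch of the policy (the `on_evict` hook is a hypothetical stand-in for graph destruction; MLX's cache is C++):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU sketch: reads refresh recency; inserting past capacity
    evicts the least-recently-used entry via the on_evict hook."""
    def __init__(self, capacity, on_evict=None):
        self.capacity = capacity
        self.on_evict = on_evict
        self._data = OrderedDict()
    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)      # mark as most recently used
        return self._data[key]
    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            old_key, old_val = self._data.popitem(last=False)
            if self.on_evict:
                self.on_evict(old_key, old_val)

evicted = []
cache = LRUCache(2, on_evict=lambda k, v: evicted.append(k))
cache.put("a", 1); cache.put("b", 2)
cache.get("a")                 # "a" is now most recently used
cache.put("c", 3)              # evicts "b"
print(evicted)  # ['b']
```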