Commit Graph

  • ece087892d Merge c093fa72c8 into 3dcb286baf Awni Hannun 2025-08-26 08:03:38 -0700
  • c093fa72c8 increase cache size Awni Hannun 2025-08-26 07:49:09 -0700
  • b6b8347b00 Merge cb4dc59a9e into 3dcb286baf Arkar Min Aung 2025-08-26 11:11:19 +0200
  • 92d2cbc2fa Merge c7cdd51f50 into 3dcb286baf TianyiZhao1437 2025-08-26 11:10:56 +0200
  • f20b56497b Merge 7c99acb799 into 3dcb286baf Awni Hannun 2025-08-26 08:57:10 +0000
  • 5a33628db9 Merge bfb34cee6b into 3dcb286baf Cheng 2025-08-26 16:29:53 +0900
  • 218dc1c5cf Merge eeaf1fa463 into 3dcb286baf Param Thakkar 2025-08-26 08:52:08 +0530
  • bfb34cee6b Run CPP tests for CUDA build in CI cuda-cpp-tests Cheng 2025-08-24 01:34:26 -0700
  • 4987e7615a Improve the cutlass gemm simple-gemm Angelos Katharopoulos 2025-08-25 18:18:19 -0700
  • f83af0c5d5 Merge a9716cd34c into 3dcb286baf Awni Hannun 2025-08-25 16:05:58 -0700
  • 2e37f68a99 Merge c2511fd83a into 3dcb286baf Awni Hannun 2025-08-25 22:57:29 +0000
  • c2511fd83a add load with path tests test_load_with_path Awni Hannun 2025-08-25 15:18:09 -0700
  • 3dcb286baf Remove stream from average grads so it uses default (#2532) main Awni Hannun 2025-08-25 15:56:29 -0700
  • 4822c3dbe9 [CUDA] Implement DynamicSlice/DynamicSliceUpdate (#2533) Cheng 2025-08-26 07:31:39 +0900
  • 2ca75bb529 Remove nccl install in release (#2542) Awni Hannun 2025-08-25 15:20:18 -0700
  • a9e136a6c1 Remove nccl install in release Awni Hannun 2025-08-25 15:15:50 -0700
  • 3bb7b671a9 comment Awni Hannun 2025-08-25 15:03:12 -0700
  • db14e29a0b allow pathlib.Path to save/load functions (#2541) Awni Hannun 2025-08-25 14:58:49 -0700
  • d2f540f4e0 Use nccl header only when nccl is not present (#2539) Awni Hannun 2025-08-25 14:17:25 -0700
  • 066d77244c allow pathlib.Path to save/load functions Awni Hannun 2025-08-25 13:52:12 -0700
  • 4ba4544549 enable cuda graph toggle Awni Hannun 2025-08-25 12:56:56 -0700
  • 87faf7c5e5 larger machine for cuda build Awni Hannun 2025-08-25 09:52:09 -0700
  • 6e2c392815 use nccl header only when nccl is not present Awni Hannun 2025-08-25 09:14:43 -0700
  • e527c6040a Merge cc4de6a607 into 333ffea273 Nripesh Niketan 2025-08-25 15:50:54 +0200
  • 333ffea273 [CUDA] Remove thrust in arange (#2535) Cheng 2025-08-24 16:22:36 +0900
  • f55b6f1f2f Enable COMPILE_WARNING_AS_ERROR for linux builds in CI (#2534) Cheng 2025-08-24 15:33:08 +0900
  • 7d1f157f8d [CUDA] Remove thrust in arange Cheng 2025-08-23 21:16:58 -0700
  • 4f5f4d85bd Enable COMPILE_WARNING_AS_ERROR for linux builds in CI Cheng 2025-08-23 19:01:53 -0700
  • 57b2b8817a Implement compute_dynamic_offset in CUDA Cheng 2025-08-23 18:36:58 -0700
  • 5746c0c658 Move DynamicSlice to gpu/primitives Cheng 2025-08-24 09:13:45 +0900
  • d08fa4bef8 Remove stream from average grads so it uses default Awni Hannun 2025-08-23 06:09:42 -0700
  • b04d6c224c [CUDA] Use ConcurrentContext in concatenate_gpu Cheng 2025-08-22 18:51:59 -0700
  • 30561229c7 Fix allocation bug in NCCL (#2530) Awni Hannun 2025-08-22 14:39:43 -0700
  • ef123f46e9 Fix allocation bug in NCCL Awni Hannun 2025-08-22 14:11:52 -0700
  • a9716cd34c refactor quantize_mode Awni Hannun 2025-08-22 13:15:32 -0700
  • 27e31ab249 fix Awni Hannun 2025-08-21 15:54:30 -0700
  • 068a4612e9 nccl default for backend=any (#2528) Awni Hannun 2025-08-22 12:24:27 -0700
  • 2afdf380b1 comment Awni Hannun 2025-08-22 09:42:46 -0700
  • 51505c2d5a check num gpus + ensure row contiguous for all reduce Awni Hannun 2025-08-22 09:39:36 -0700
  • 1eb589cd77 nccl default for backend=any Awni Hannun 2025-08-22 07:06:54 -0700
  • 5722c147de [CUDA] Update calls to cudaMemAdvise and cudaGraphAddDependencies for CUDA 13 (#2525) Andrey Portnoy 2025-08-21 22:57:20 -0400
  • f6819a1f26 Fix warning 186-D from nvcc (#2527) Cheng 2025-08-22 10:29:55 +0900
  • f93f87c802 nccl dep + default for cuda (#2526) Awni Hannun 2025-08-21 17:57:49 -0700
  • f3d58b3aaf Fix warning 186-D from nvcc Cheng 2025-08-21 17:55:07 -0700
  • af98676871 nccl dep + default for cuda Awni Hannun 2025-08-21 17:06:29 -0700
  • 80868ee4fb fix test tol Awni Hannun 2025-08-21 14:45:35 -0700
  • 861a5ab060 Merge 688e421184 into 9392fc3f88 Awni Hannun 2025-08-21 17:28:37 -0400
  • 3ce23755ea fix Awni Hannun 2025-08-21 08:58:36 -0700
  • 4e2d25c4e8 Mention NVIDIA in ACKNOWLEDGMENTS.md Andrey Portnoy 2025-08-21 15:29:49 -0400
  • 9c3259cb5c [CUDA] Update cudaMemAdvise and cudaGraphAddDependencies for CUDA 13 Andrey Portnoy 2025-08-21 15:09:24 -0400
  • 9392fc3f88 NCCL backend (#2476) Anastasiia Filippova 2025-08-21 20:56:15 +0200
  • 8da1c64fe9 cpu mxfp4 Awni Hannun 2025-08-20 17:18:47 -0700
  • 51449428dd speedup Awni Hannun 2025-08-20 14:05:35 -0700
  • 6295e53216 mxfp4 works Awni Hannun 2025-08-19 07:49:56 -0700
  • 4cf90c9762 mxfp4 quantize/dequantize + start of optional biases Awni Hannun 2025-08-18 12:59:03 -0700
  • 8ec8d44ee6 add mode parameter for quantization Awni Hannun 2025-08-15 17:36:55 -0700
  • e843c4d8d5 fix power (#2523) Awni Hannun 2025-08-21 06:46:01 -0700
  • e1303f6160 Reset cutlass gemm to working state again Angelos Katharopoulos 2025-08-21 01:29:43 -0700
  • cf5eef095d tmp Angelos Katharopoulos 2025-08-14 12:29:53 -0700
  • 395d582719 Add a cutlass gemm Angelos Katharopoulos 2025-08-09 22:47:14 -0700
  • 05583bcd10 More pipelining for the sm_80 gemm Angelos Katharopoulos 2025-08-09 22:46:31 -0700
  • 6fce01593a Improve gemm Angelos Katharopoulos 2025-08-07 16:13:18 -0700
  • 97afe40b7b Remove duplicate register tile Angelos Katharopoulos 2025-08-07 00:55:08 -0700
  • f70c62d69c Simple gemm example Angelos Katharopoulos 2025-07-29 18:23:40 -0700
  • a51dc30cd3 Add the NCCL library in CI Angelos Katharopoulos 2025-08-20 22:57:45 -0700
  • 034e078c2c Remove iostream Angelos Katharopoulos 2025-08-20 18:28:04 -0700
  • eeb5a0d63f Put the decision of the comm stream to the group Angelos Katharopoulos 2025-08-20 18:21:07 -0700
  • 0c5fc63a36 Fix docs omission (#2524) Angelos Katharopoulos 2025-08-20 17:56:06 -0700
  • cd234474c4 Fix docs omission Angelos Katharopoulos 2025-08-20 17:41:10 -0700
  • e397177f6e Custom cuda kernel (#2517) Angelos Katharopoulos 2025-08-20 17:20:22 -0700
  • 6f608857db Address more comments Angelos Katharopoulos 2025-08-20 17:19:36 -0700
  • f5bbcf14cf fix power Awni Hannun 2025-08-20 17:08:32 -0700
  • f4c8888cbe [CUDA] Fix stride of singleton dims before passing to cuDNN (#2521) Cheng 2025-08-21 08:55:26 +0900
  • 3bb6b1d44a added get_device to do reductions on the cpu if metal Anastasiia Filippova 2025-08-20 18:00:16 +0200
  • 25c1e03205 Fix overflow in large filter small channels (#2520) Angelos Katharopoulos 2025-08-20 08:03:29 -0700
  • 4ee0d0bb55 removed nproc-per-node Anastasiia Filippova 2025-08-20 15:49:32 +0200
  • cd53eb1ae3 dispatch types with dtype_utils Anastasiia Filippova 2025-08-20 15:09:41 +0200
  • f7c11b965e Merge branch 'main' into nccl_backend Anastasiia Filippova 2025-08-20 13:37:18 +0200
  • 8cc93eea23 [CUDA] Fix stride of singleton dims before passing to cuDNN Cheng 2025-08-19 21:20:41 -0700
  • 512281781c Remove state return from function example in compile documentation (#2518) russellizadi 2025-08-20 03:45:05 -0400
  • d6b204b528 comments Angelos Katharopoulos 2025-08-20 00:28:28 -0700
  • fa56bf2feb Remove completion handler from custom kernel Angelos Katharopoulos 2025-08-19 14:18:21 -0700
  • 39dbd92df5 Make threadgroup size less or equal to grid size Angelos Katharopoulos 2025-08-19 01:13:20 -0700
  • 432c02dabc Typo in test Angelos Katharopoulos 2025-08-19 00:02:27 -0700
  • fa555c536a Remove regex Angelos Katharopoulos 2025-08-18 23:52:59 -0700
  • 169476deb8 Remove iostream Angelos Katharopoulos 2025-08-18 23:50:38 -0700
  • bffadc2cb9 Add all tests except the custom caching Angelos Katharopoulos 2025-08-18 23:45:13 -0700
  • 14efd9c35a Fix compilation Angelos Katharopoulos 2025-08-18 22:46:21 -0700
  • d2ae81b413 A bit of refactoring Angelos Katharopoulos 2025-08-18 19:18:41 -0700
  • 3938aaaf24 tmp Angelos Katharopoulos 2025-08-18 01:21:53 -0700
  • 055c1ca929 tmp Angelos Katharopoulos 2025-08-15 14:08:03 -0700
  • 3b94e37270 Working custom kernels jointly Angelos Katharopoulos 2025-08-12 14:30:29 -0700
  • 0b309e8edc Add custom kernel for CUDA Angelos Katharopoulos 2025-08-10 01:55:06 -0700
  • 9b226a929e Add the test Angelos Katharopoulos 2025-08-19 23:52:27 -0700
  • 2581a9ab85 Fix the input loading for small channels large filters Angelos Katharopoulos 2025-08-19 23:41:48 -0700
  • ac85ddfdb7 [CUDA] Add GEMM-based fallback convolution kernels (#2511) Cheng 2025-08-20 10:06:22 +0900
  • 849fee90f3 Add gemm_grouped_conv Cheng 2025-08-17 17:23:23 -0700
  • c81aeedec5 Add gemm_conv Cheng 2025-08-16 03:34:58 -0700
  • 65d0d40232 Split cuDNN helpers into a separate header (#2491) Cheng 2025-08-20 09:29:28 +0900
  • 8ecdbcadb8 Remove state return from function example in compile documentation russellizadi 2025-08-19 18:08:06 -0400