Commit Graph

  • 8bd4bf2393 Fixes for transpositions and expands cuda-reduce Angelos Katharopoulos 2025-06-23 05:49:49 -0700
  • fd1d0821d2 Make sure softmax doesn't change the actual max Angelos Katharopoulos 2025-06-22 23:34:32 -0700
  • 818e8e663e Add an init reduce Angelos Katharopoulos 2025-06-22 21:28:41 -0700
  • 9d86a4d5ba Merge b3c1aaafd2 into 5adf185f86 John Mai 2025-06-22 14:39:55 +0800
  • cc4b995723 Working col reduce Angelos Katharopoulos 2025-06-21 23:39:40 -0700
  • 664d8e42b8 Add comments and clean up Angelos Katharopoulos 2025-06-21 12:44:26 -0700
  • abdb21f27c Add helpers and atomic kernel Angelos Katharopoulos 2025-06-21 12:37:35 -0700
  • 323cc645ab Merge 992eac905a into 5adf185f86 acsweet 2025-06-21 11:11:23 +0200
  • 880751a084 Remove segmented reduce and fix row reduce Angelos Katharopoulos 2025-06-19 02:53:41 -0700
  • cd523ffd9f Working row reduce looped Angelos Katharopoulos 2025-06-19 02:42:15 -0700
  • 4d2b682a13 Simple row reduce Angelos Katharopoulos 2025-06-18 23:17:16 -0700
  • b70a964cde Optimize all reduce a bit Angelos Katharopoulos 2025-06-18 14:33:27 -0700
  • 9cf7ef1068 Add all reduce and atomic updates Angelos Katharopoulos 2025-06-17 23:58:51 -0700
  • ab7c310914 Adapt the torch benchmark to run in CUDA Angelos Katharopoulos 2025-06-17 15:56:33 -0700
  • 382270d8b5 Merge 4d68bd3250 into 5adf185f86 Eric Buehler 2025-06-21 00:37:59 -0400
  • c4b30485f2 Merge cc4de6a607 into 5adf185f86 Nripesh Niketan 2025-06-21 02:23:45 +0100
  • 0d371aaffc Merge 043c37cccd into 5adf185f86 Anastasiia Filippova 2025-06-20 19:19:51 -0600
  • 5feed6cb77 Merge cb4dc59a9e into 5adf185f86 Arkar Min Aung 2025-06-21 10:45:06 +1000
  • 5adf185f86 Fix update_modules() when providing a subset (#2308) main Angelos Katharopoulos 2025-06-20 17:19:46 -0700
  • 8ec7713893 Fix style Angelos Katharopoulos 2025-06-20 16:45:00 -0700
  • c9a9180584 Cuda perf tuning (#2307) Awni Hannun 2025-06-20 14:50:57 -0700
  • 90072445e7 Fix the update_modules subset Angelos Katharopoulos 2025-06-20 14:43:11 -0700
  • de190bfe82 fix Awni Hannun 2025-06-20 13:24:24 -0700
  • 6bb0b254fd format Awni Hannun 2025-06-20 13:01:27 -0700
  • 1a0e884036 fix adding inputs arrays in matmul / srot Awni Hannun 2025-06-20 12:56:40 -0700
  • 72e21b7d51 perf tuning Awni Hannun 2025-06-18 16:42:39 -0700
  • 043c37cccd Use last cuda stream instead of new one Anastasiia Filippova 2025-06-20 16:07:41 +0200
  • 755fb4f970 Merge 7c99acb799 into 76831ed83d Awni Hannun 2025-06-20 09:17:34 +0800
  • 76831ed83d Build CUDA release in Circle (#2306) Awni Hannun 2025-06-19 15:26:36 -0700
  • 4749c57bdb add license Awni Hannun 2025-06-19 06:22:42 -0700
  • 12322095a8 cuda release Awni Hannun 2025-06-18 08:18:11 -0700
  • 64af1f8920 Merge d2e0b0465c into b3d7b85376 Gaétan Lepage 2025-06-19 01:06:09 +0100
  • cc4de6a607 Increment 2: Implement major ops and add structure similar to cuda Nripesh Niketan 2025-06-19 00:50:06 +0100
  • ac5adfa963 increment 1: few ops and jit update Nripesh Niketan 2025-06-19 00:33:57 +0100
  • 709e3aa875 Merge 9a5d162ebf into b3d7b85376 DavitGrigoryan132 2025-06-18 13:36:09 +0100
  • b3d7b85376 Make ptx cache settable by environment variable (#2304) Angelos Katharopoulos 2025-06-17 23:55:56 -0700
  • 445478c98b Merge eeaf1fa463 into cad5c0241c Param Thakkar 2025-06-18 10:49:45 +0800
  • fa0615d39b Make ptx cache settable by environment variable Angelos Katharopoulos 2025-06-17 15:45:03 -0700
  • cad5c0241c [CUDA] synch properly waits for all tasks to finish and clear (#2303) Awni Hannun 2025-06-17 12:03:25 -0700
  • 873cfa292e fix copy Awni Hannun 2025-06-17 10:51:09 -0700
  • 3d94859ea2 cuda synch properly waits for all tasks to finish and clear Awni Hannun 2025-06-17 07:20:05 -0700
  • e6ae350999 Deleted comments, renamed the function Anastasiia Filippova 2025-06-17 08:55:02 +0200
  • b8022c578a divmod, partition, sort fixes (#2302) Awni Hannun 2025-06-16 18:49:32 -0700
  • 870208eff5 Start sdpa vector cuda-sdpa-vector Angelos Katharopoulos 2025-06-15 21:58:34 -0700
  • 3e276d6890 divmod, partition, sort fixes Awni Hannun 2025-06-16 17:12:12 -0700
  • 8bb8b76ae4 [Experiment] ROCM backend initial push Nripesh Niketan 2025-06-16 22:42:56 +0100
  • 1fba0176e1 Merge 688e421184 into bc53f8293f Awni Hannun 2025-06-16 14:14:03 -0700
  • 74d6ebd4bd update ackn. Goekdeniz-Guelmez 2025-06-16 22:49:53 +0200
  • a315af8981 format Goekdeniz-Guelmez 2025-06-16 22:45:09 +0200
  • 3713832e5e adding test for silu and clipped silu Goekdeniz-Guelmez 2025-06-16 22:44:13 +0200
  • 9cb6df5960 adding to __init__.py Goekdeniz-Guelmez 2025-06-16 22:36:35 +0200
  • a426880baf format Goekdeniz-Guelmez 2025-06-16 22:36:04 +0200
  • 60cd4a5a6f initial commit Goekdeniz-Guelmez 2025-06-16 22:33:24 +0200
  • bc53f8293f Cuda bug fixes 2 (#2298) Awni Hannun 2025-06-16 13:14:46 -0700
  • abcf62ee55 format Awni Hannun 2025-06-16 12:35:26 -0700
  • ff1f9ca5e8 more bug fixes Awni Hannun 2025-06-16 12:28:50 -0700
  • 70f2baf39f Removed commented nogpu for all_reduce Anastasiia Filippova 2025-06-16 19:11:28 +0200
  • 71a47bc10d Deleted useless import Anastasiia Filippova 2025-06-16 19:08:38 +0200
  • 7429613f76 more bug fixes Awni Hannun 2025-06-16 09:35:58 -0700
  • e9fbdd20fb Helper function to parse types Anastasiia Filippova 2025-06-16 18:35:49 +0200
  • c552ff2451 [CUDA] Fix back-end bugs and enable corresponding tests (#2296) Awni Hannun 2025-06-16 08:45:40 -0700
  • 91817a165b format Awni Hannun 2025-06-16 07:46:40 -0700
  • 14531cb14f enable more tests Awni Hannun 2025-06-16 07:45:01 -0700
  • f15a127900 nccl backend (all reduce + init) Anastasiia Filippova 2025-06-16 14:28:53 +0200
  • 85869fda0c more fixes Awni Hannun 2025-06-15 20:44:32 -0700
  • b13c7ef8f8 Fix some cuda back-end bugs and enable corresponding tests Awni Hannun 2025-06-15 13:09:06 -0700
  • 4fda5fbdf9 add python testing for cuda with ability to skip list of tests (#2295) Awni Hannun 2025-06-15 10:56:48 -0700
  • 5971bf3506 add python testing for cuda with ability to skip list of tests Awni Hannun 2025-06-15 08:28:51 -0700
  • 580776559b RoPE for CUDA (#2293) Angelos Katharopoulos 2025-06-15 06:08:07 -0700
  • b3c1aaafd2 update: format code John Mai 2025-06-15 17:35:33 +0800
  • 989e8bab66 feat: Add benchmarking for ReLUSquared activation function John Mai 2025-06-15 17:34:10 +0800
  • fe0672a9d2 docs: Update documentation to include ReLUSquared activation function John Mai 2025-06-15 17:33:58 +0800
  • cbd353bf73 test: Add unit test for ReLUSquared activation function John Mai 2025-06-15 17:07:33 +0800
  • 940f64fe6a feat: Add ReLUSquared activation function John Mai 2025-06-15 17:07:22 +0800
  • cb4dc59a9e feat(benchmarks): add comprehensive SVD performance benchmarks Arkar Min Aung 2025-06-15 17:51:45 +1000
  • e5c8773371 feat(metal): implement complete Metal SVD with Jacobi algorithm Arkar Min Aung 2025-06-15 17:44:38 +1000
  • 229e3a29a6 Fix random Angelos Katharopoulos 2025-06-14 23:53:03 -0700
  • bfe105990b First working CUDA rope Angelos Katharopoulos 2025-06-14 15:10:40 -0700
  • a14aaa7c9d Fix cuda arg reduce (#2291) Awni Hannun 2025-06-14 17:54:00 -0700
  • 3110982b0e fp16 matmul fix + tf32 env var Awni Hannun 2025-06-14 07:17:04 -0700
  • c353af5998 fix cuda arg reduce Awni Hannun 2025-06-14 06:16:09 -0700
  • a6d780154f fix cuda gemm for bf16 (#2288) Awni Hannun 2025-06-13 22:10:46 -0700
  • ffef01cf68 fix cuda gemm for bf16 Awni Hannun 2025-06-13 20:04:44 -0700
  • 6871e2eeb7 fix cuda jit (#2287) Awni Hannun 2025-06-13 19:21:46 -0700
  • f2d0ea0607 fix cuda jit Awni Hannun 2025-06-13 15:01:16 -0700
  • 8402a2acf4 Fix complex power and print (#2286) Awni Hannun 2025-06-13 11:13:00 -0700
  • fddb6933e1 Collection of refactors (#2274) Jagrit Digani 2025-06-13 10:44:56 -0700
  • 628e36f7d9 fix complex matmul shape Awni Hannun 2025-06-13 07:42:12 -0700
  • ea451af9a0 Update no copy condition in normalization to account for axis size 1 Jagrit Digani 2025-06-11 09:58:15 -0700
  • 53fa981caf Add architecture gen to device Jagrit Digani 2025-06-11 09:56:01 -0700
  • b1d95a3880 Some cleanup Jagrit Digani 2025-06-11 09:43:34 -0700
  • 4b02d3e738 Comments and format Jagrit Digani 2025-06-11 09:35:52 -0700
  • dd5e833068 Update addmm Jagrit Digani 2025-06-11 09:30:49 -0700
  • b3013042ca Redirect steel_gemm Jagrit Digani 2025-06-11 09:26:07 -0700
  • fc2f6bc51c Refactor AddMM step 1 Jagrit Digani 2025-06-11 09:01:45 -0700
  • 9dbaa35be3 Add axpby routing to steel_matmul_regular Jagrit Digani 2025-06-11 08:54:42 -0700
  • 13eccfa887 Redirect steel_gemm_regular Jagrit Digani 2025-06-11 08:49:07 -0700
  • 96a7017442 Rearrange steel_gemm_regular Jagrit Digani 2025-06-11 08:38:52 -0700
  • c2f1c2a338 Refactor split k axpby Jagrit Digani 2025-06-11 07:47:56 -0700
  • 9fd8eb357c Refactor splitk step 1 Jagrit Digani 2025-06-11 07:29:36 -0700