mlx-examples/llms
5adbd358b5 Add DeciLM/Nemotron-NAS architecture support for MLX
This commit introduces native MLX support for DeciLM models, including NVIDIA's
Nemotron series, which uses Neural Architecture Search (NAS) optimizations.

Key features:
- Support for dummy layers (no-op attention/FFN components); see the sketch after this list
- FFN fusion for improved efficiency
- Variable Grouped Query Attention (VGQA) with a different number of KV heads per layer
- Block configuration handling for NAS architectures
- Full conversion pipeline from HuggingFace to MLX format
- Comprehensive test suite
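
Below is a minimal, hypothetical sketch of how dummy layers and per-layer KV head
counts can be modeled in MLX. The names (`BlockConfig`, `VGQAAttention`,
`DeciLMBlock`), the simplified FFN, and the omitted causal mask are illustrative
assumptions, not the actual identifiers or structure introduced in this commit.

```python
from dataclasses import dataclass
from typing import Optional

import mlx.core as mx
import mlx.nn as nn


@dataclass
class BlockConfig:
    n_heads: int = 32
    n_kv_heads: Optional[int] = 8    # None => dummy (no-op) attention
    ffn_dims: Optional[int] = 14336  # None => dummy (no-op) FFN


class VGQAAttention(nn.Module):
    """Grouped-query attention whose KV head count can differ per layer."""

    def __init__(self, dims: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        head_dim = dims // n_heads
        self.scale = head_dim**-0.5
        self.q_proj = nn.Linear(dims, n_heads * head_dim, bias=False)
        self.k_proj = nn.Linear(dims, n_kv_heads * head_dim, bias=False)
        self.v_proj = nn.Linear(dims, n_kv_heads * head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * head_dim, dims, bias=False)

    def __call__(self, x: mx.array) -> mx.array:
        B, L, _ = x.shape
        q = self.q_proj(x).reshape(B, L, self.n_heads, -1).transpose(0, 2, 1, 3)
        k = self.k_proj(x).reshape(B, L, self.n_kv_heads, -1).transpose(0, 2, 1, 3)
        v = self.v_proj(x).reshape(B, L, self.n_kv_heads, -1).transpose(0, 2, 1, 3)
        # MLX's fast SDPA broadcasts a smaller KV head count across query heads,
        # so each layer can carry its own n_kv_heads (causal mask omitted here).
        o = mx.fast.scaled_dot_product_attention(q, k, v, scale=self.scale)
        return self.o_proj(o.transpose(0, 2, 1, 3).reshape(B, L, -1))


class DeciLMBlock(nn.Module):
    """One transformer block; NAS may replace either sublayer with a no-op."""

    def __init__(self, dims: int, cfg: BlockConfig):
        super().__init__()
        if cfg.n_kv_heads is not None:
            self.attn_norm = nn.RMSNorm(dims)
            self.attn = VGQAAttention(dims, cfg.n_heads, cfg.n_kv_heads)
        else:
            self.attn = None  # dummy attention: block carries no attention weights
        if cfg.ffn_dims is not None:
            self.ffn_norm = nn.RMSNorm(dims)
            self.ffn = nn.Sequential(
                nn.Linear(dims, cfg.ffn_dims, bias=False),
                nn.SiLU(),
                nn.Linear(cfg.ffn_dims, dims, bias=False),
            )
        else:
            self.ffn = None  # dummy FFN: block carries no FFN weights

    def __call__(self, x: mx.array) -> mx.array:
        if self.attn is not None:
            x = x + self.attn(self.attn_norm(x))
        if self.ffn is not None:
            x = x + self.ffn(self.ffn_norm(x))
        return x


# Example: a tiny 3-layer stack mixing standard, attention-free, and FFN-free blocks
configs = [
    BlockConfig(n_heads=32, n_kv_heads=8, ffn_dims=14336),     # standard block
    BlockConfig(n_heads=32, n_kv_heads=None, ffn_dims=14336),  # dummy attention
    BlockConfig(n_heads=32, n_kv_heads=8, ffn_dims=None),      # dummy FFN
]
layers = [DeciLMBlock(4096, cfg) for cfg in configs]
x = mx.zeros((1, 16, 4096))
for layer in layers:
    x = layer(x)
```

Under this framing, a model like Nemotron-Ultra would instantiate one BlockConfig
per layer from its NAS block table, so weight loading and quantization only touch
the sublayers that actually exist.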

Tested with:
- nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 (Q5: 3.86 tokens/sec on M3 Ultra)
- Memory usage: ~175 GB peak for the 253B model

This enables running massive NAS-optimized models on Apple Silicon; such models
were previously incompatible with MLX due to their heterogeneous per-layer architecture.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-02 05:59:09 +02:00
Name                  Last commit message                                    Last commit date
decilm                Add DeciLM/Nemotron-NAS architecture support for MLX  2025-07-02 05:59:09 +02:00
gguf_llm              Made llama and mistral files mypy compatible (#1359)  2025-04-23 14:23:46 -07:00
llama                 Made llama and mistral files mypy compatible (#1359)  2025-04-23 14:23:46 -07:00
mistral               Quantize embedding / Update quantize API (#680)       2024-04-18 18:16:10 -07:00
mixtral               Made llama and mistral files mypy compatible (#1359)  2025-04-23 14:23:46 -07:00
speculative_decoding  Made llama and mistral files mypy compatible (#1359)  2025-04-23 14:23:46 -07:00
README.md             remove mlx lm (#1353)                                 2025-03-18 18:47:55 -07:00

MOVE NOTICE

The mlx-lm package has moved to a new repo.

The package has been removed from the MLX Examples repo. Please send new contributions and issues to the MLX LM repo.