This commit introduces native MLX support for DeciLM models, including NVIDIA's
Nemotron series, which uses Neural Architecture Search (NAS) optimizations.
Key features:
- Support for dummy layers (no-op attention/FFN components; see the decoder-layer sketch after this list)
- FFN fusion for improved efficiency
- Variable Grouped Query Attention (VGQA) with a different number of KV heads per layer (see the attention sketch after this list)
- Block configuration handling for NAS architectures
- Full conversion pipeline from Hugging Face to MLX format (usage example after the test results below)
- Comprehensive test suite
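
To illustrate VGQA, here is a minimal MLX sketch of an attention module whose KV
head count is a per-layer parameter. The class and argument names are
illustrative rather than the identifiers used in this commit, and RoPE and KV
caching are omitted for brevity:

```python
import mlx.core as mx
import mlx.nn as nn


class Attention(nn.Module):
    """GQA attention whose KV head count is chosen per layer (VGQA)."""

    def __init__(self, dims: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.n_kv_heads = n_kv_heads
        self.head_dim = dims // n_heads
        self.scale = self.head_dim**-0.5
        self.q_proj = nn.Linear(dims, n_heads * self.head_dim, bias=False)
        # The K/V projections shrink with the layer's NAS-chosen KV head count.
        self.k_proj = nn.Linear(dims, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dims, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dims, bias=False)

    def __call__(self, x: mx.array, mask=None) -> mx.array:
        B, L, _ = x.shape
        q = self.q_proj(x).reshape(B, L, self.n_heads, -1).transpose(0, 2, 1, 3)
        k = self.k_proj(x).reshape(B, L, self.n_kv_heads, -1).transpose(0, 2, 1, 3)
        v = self.v_proj(x).reshape(B, L, self.n_kv_heads, -1).transpose(0, 2, 1, 3)
        # The fused kernel broadcasts the smaller KV head set across query
        # head groups, so no manual repeat_kv step is needed.
        out = mx.fast.scaled_dot_product_attention(q, k, v, scale=self.scale, mask=mask)
        return self.o_proj(out.transpose(0, 2, 1, 3).reshape(B, L, -1))
```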
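And, continuing from the sketch above, a hedged sketch of how the per-layer
block configuration might dispatch to dummy (no-op) sublayers. The `block_cfg`
keys (`attn_no_op`, `ffn_no_op`, `n_kv_heads`, `ffn_hidden_dims`) are
hypothetical names for illustration, not the fields in the actual
implementation:

```python
class NoOp(nn.Module):
    """Dummy sublayer: returns zeros so that x + sublayer(x) == x and the
    residual stream passes through the pruned component unchanged."""

    def __call__(self, x: mx.array, *args, **kwargs) -> mx.array:
        return mx.zeros_like(x)


class MLP(nn.Module):
    def __init__(self, dims: int, hidden_dims: int):
        super().__init__()
        self.gate_proj = nn.Linear(dims, hidden_dims, bias=False)
        self.up_proj = nn.Linear(dims, hidden_dims, bias=False)
        self.down_proj = nn.Linear(hidden_dims, dims, bias=False)

    def __call__(self, x: mx.array) -> mx.array:
        return self.down_proj(nn.silu(self.gate_proj(x)) * self.up_proj(x))


class DecoderLayer(nn.Module):
    def __init__(self, dims: int, n_heads: int, block_cfg: dict):
        super().__init__()
        # NAS may prune either sublayer of a block; the per-layer block
        # config records that as a no-op component.
        self.self_attn = (
            NoOp()
            if block_cfg["attn_no_op"]
            else Attention(dims, n_heads, block_cfg["n_kv_heads"])
        )
        self.mlp = (
            NoOp()
            if block_cfg["ffn_no_op"]
            else MLP(dims, block_cfg["ffn_hidden_dims"])
        )
        self.input_layernorm = nn.RMSNorm(dims)
        self.post_attention_layernorm = nn.RMSNorm(dims)

    def __call__(self, x: mx.array, mask=None) -> mx.array:
        x = x + self.self_attn(self.input_layernorm(x), mask)
        x = x + self.mlp(self.post_attention_layernorm(x))
        return x
```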
Tested with:
- nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 (Q5 quantization: 3.86 tokens/sec on an M3 Ultra)
- Memory usage: ~175 GB peak for the 253B model
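
For reference, a sketch of the end-to-end flow, assuming the standard mlx-lm
entry points are used; the output path and prompt are illustrative, and
`q_bits=5` mirrors the Q5 run above only if your MLX build supports 5-bit
quantization:

```python
from mlx_lm import convert, load, generate

# One-time step: convert the Hugging Face checkpoint to MLX format.
# Pick a bit width your MLX version supports if 5-bit is unavailable.
convert(
    "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",
    mlx_path="nemotron-ultra-mlx",
    quantize=True,
    q_bits=5,
)

model, tokenizer = load("nemotron-ultra-mlx")
print(generate(model, tokenizer, prompt="Hello", max_tokens=64))
```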
This enables running massive NAS-optimized models on Apple Silicon that were
previously incompatible with MLX because of their heterogeneous per-layer
architecture.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>