* Add option to load customized mlx model
* Add quantization
* Apply reviews
* Separate model conversion and loading
* Update test
* Fix benchmark
* Add notes about conversion
* Improve doc
* feat: add example for deepseek coder
* chore: remove hardcoded rope_scaling_factor
* feat: add quantization support
* chore: update readme
* chore: clean up the rope scaling factor param in create cos sin theta
* feat: add repetition_penalty (see the sketch after this list)
* style/consistency changes to ease future integration
* nits in README
* one more typo
---------
Co-authored-by: Awni Hannun <awni@apple.com>
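The repetition_penalty addition above follows the common CTRL-style scheme: logits of tokens that have already been generated are scaled down before sampling. Below is a minimal MLX sketch, assuming `logits` has shape `(1, vocab_size)`; the function name and signature are illustrative, not necessarily the example's exact code.

```python
import mlx.core as mx

def apply_repetition_penalty(logits: mx.array, generated: list, penalty: float) -> mx.array:
    # CTRL-style penalty: for tokens already present in the generated
    # sequence, divide positive logits and multiply negative logits by
    # `penalty` (> 1.0 discourages repeats, 1.0 is a no-op).
    if not generated:
        return logits
    indices = mx.array(generated)
    selected = logits[:, indices]
    selected = mx.where(selected < 0, selected * penalty, selected / penalty)
    logits[:, indices] = selected
    return logits
```

Values slightly above 1.0 (commonly 1.1-1.3) are typical; much larger penalties tend to degrade fluency.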
* Large-v3 requires 128 Mel frequency bins
* extract correct model dimensions and use argparse (see the sketch after this list)
* format
* format
---------
Co-authored-by: Awni Hannun <awni@apple.com>
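The large-v3 fix above boils down to reading the Mel filterbank size from the checkpoint's dimensions rather than assuming 80 bins, since large-v3 was trained on 128-bin log-Mel spectrograms. A rough sketch of the idea; the flag names and the `dims` layout are assumptions for illustration.

```python
import argparse

def parse_args():
    # Argparse front end in the spirit of the commit above; flag names are illustrative.
    parser = argparse.ArgumentParser(description="Transcribe audio with an MLX Whisper model")
    parser.add_argument("--model", default="large-v3", help="Whisper variant to load")
    parser.add_argument("audio", help="Path to the audio file to transcribe")
    return parser.parse_args()

def num_mel_bins(dims) -> int:
    # Read the filterbank size from the model dimensions: 80 for earlier
    # Whisper variants, 128 for large-v3.
    return getattr(dims, "n_mels", 80)
```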
* Add `.DS_Store` files to `.gitignore`
* Fix variable naming of `config` in `mixtral/convert.py`
* Align CLI args and minor fixes
* standardize
* one more
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Add qwen model draft
* Add readme and requirements for qwen example
* Add model and tokenizer options (see the sketch after this list)
* Fix convert and tokenizer
* some updates / style consistency
* move to llm subdir
* readme nit
---------
Co-authored-by: Awni Hannun <awni@apple.com>
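The model and tokenizer options for the Qwen example can be pictured as a small argparse front end; the defaults below are placeholders rather than the example's actual values. Qwen's Hub tokenizer ships custom code, so `trust_remote_code=True` is needed when loading it.

```python
import argparse
from transformers import AutoTokenizer

parser = argparse.ArgumentParser(description="Generate text with the MLX Qwen example")
parser.add_argument("--model", default="weights.npz", help="Path to the converted MLX weights")
parser.add_argument("--tokenizer", default="Qwen/Qwen-1_8B", help="Hugging Face repo or local path for the tokenizer")
args = parser.parse_args()

# Load the tokenizer; trust_remote_code is required for Qwen's custom tokenizer code.
tokenizer = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=True)
```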
* Add skeleton
* Load all encoder weights
* Pass config to all modules, fix ln
* Load position bias embeddings
* Load decoder weights
* Move position biases to attention module
* translate pytorch to mx
* Fix default prompt
* Fix relative_attention_max_distance config
* No scaling, no encoder mask
* LM head
* Decode (broken after 1st token)
* Use position bias in all layers
* Utils to compare encoder output
* Fix layer norm
* Fix decoder mask
* Use position bias in decoder
* Concatenate tokens
* Remove prints
* Stop on eos
* Measure tokens/s
* with cache (see the generation sketch after this list)
* Fix bug: use bidirectional position buckets only for the encoder; add an offset to the position bias (see the bucketing sketch after this list)
* format
* Fix T5.__call__
* Stream output
* Add argument to generate float16 npz
* Load config from HF to support any model
* Uncomment bidirectional param
* Add gitignore
* Add readme.md for t5
* Fix relative position scale
* Fix --encode-only
* Run hf_t5 with any model
* Add hf generation for comparison
* Fix type for attention mask
* Increase hf max_length
* Rescale output before projecting on vocab
* readme updates
* nits
* Pass ln2 to cross attention
* Fix example
* Fix attention for 3b model
* fp16, abstract tokenizer a bit, format
* clamp for low precision
* higher clipping, remove non-helpful casts
* default to fp32 for now
* Adds support for flan-t5
* Update t5 docs on variant support
* readme flan
* nit
---------
Co-authored-by: Awni Hannun <awni@apple.com>
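Several of the T5 commits above revolve around the relative position bias: loading it, moving it into the attention module, using bidirectional buckets only in the encoder, and offsetting query positions during cached decoding. The bucketing follows the standard T5 scheme; here is a sketch translated to MLX from the usual reference formulation, not necessarily the example's exact code. It expects `relative_position` as an integer `mx.array` of key-minus-query offsets.

```python
import math
import mlx.core as mx

def relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
    # Map position offsets to a small set of learned bias buckets:
    # exact buckets for nearby offsets, log-spaced buckets for distant ones.
    relative_buckets = 0
    if bidirectional:
        # Encoder: attend both ways, split the buckets between signs.
        num_buckets //= 2
        relative_buckets = (relative_position > 0).astype(mx.int32) * num_buckets
        relative_position = mx.abs(relative_position)
    else:
        # Decoder: causal, only non-positive offsets are meaningful.
        relative_position = -mx.minimum(relative_position, 0)
    max_exact = num_buckets // 2
    is_small = relative_position < max_exact
    # Log-spaced buckets for large offsets (only used where is_small is False).
    scale = (num_buckets - max_exact) / math.log(max_distance / max_exact)
    large = max_exact + (mx.log(relative_position.astype(mx.float32) / max_exact) * scale).astype(mx.int32)
    large = mx.minimum(large, num_buckets - 1)
    return relative_buckets + mx.where(is_small, relative_position, large)
```

The "offset" mentioned in the bug-fix commit corresponds to cached decoding: when only the newest query token is fed through the decoder, its position is shifted by the number of tokens already in the cache before computing the offsets above.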
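The decoding commits (concatenate tokens, stop on eos, measure tokens/s, with cache, stream output) fit the usual incremental-generation pattern. Below is a hedged sketch of such a loop; `model.encode`, `model.decode`, the cache handling, and the tokenizer attributes are assumptions about the interface, not the example's actual API.

```python
import time
import mlx.core as mx

def generate(model, tokenizer, prompt: str, max_tokens: int = 100) -> str:
    # Greedy decode: reuse the encoder output, carry a KV cache across steps,
    # stream each token as it is produced, stop on EOS, and report tokens/s.
    memory = model.encode(mx.array([tokenizer.encode(prompt)]))
    y = mx.array([[tokenizer.decoder_start_id]])  # T5 decoding starts from a fixed token
    cache = None
    tokens = []
    tic = time.perf_counter()
    for _ in range(max_tokens):
        logits, cache = model.decode(y, memory, cache=cache)
        y = mx.argmax(logits[:, -1, :], axis=-1, keepdims=True)
        token = y.item()
        if token == tokenizer.eos_id:
            break
        tokens.append(token)
        print(tokenizer.decode([token]), end="", flush=True)
    print(f"\n{len(tokens) / (time.perf_counter() - tic):.3f} tokens/s")
    return tokenizer.decode(tokens)
```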