* fix rotating kv cache for chat use case
* reorg + fixes to caching, unify prompt caching across types and use cases for e.g. caching during a chat
* nit in chat
* fix tests
* fix tests
* fix tests
* docs
* chat command
* comments + docs
* Define meta_state on all Cache implementations
* fixes + trim_prompt_cache api
* fix default model
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
* feat: QDoRA with tests and a small bug fix for recalculation of self.m
* some simplifications and fixes
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Add logits_processor option for the generation as in huggingface transformers library
* concatenation correction
* Rename the tokens variable for clarity
* remove the logit_bias argument from generate_step method
* fix the variable name
* nits + test
* test
* add back logit bias + test
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Add 'models' endpoint to server
* Add test for new 'models' server endpoint
* Check hf_cache for mlx models
* update tests to check hf_cache for models
* simplify test
* doc
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* initial commit
* initial commit
* Adding first lines
* adding x, and dt projection layers
* adding the clamping mechanism
* First succesful inference
* last commit for today - added custom geenrate function and it works as expected, will try training and then with loading a model from the hub
* clean up
* save up
* almost
* update
* update
* fixed cache handeling
* fixed loading
* added seperate generat_step method in the model and also in the utils to automaticaly use the generate step mthod in the model class
* quick update
* still not working
* save
* still not working
* initial commit
* utils.py logits = logits[:, -1, :] TypeError: tuple indices must be integers or slices, not tuple
* update
* update
* Fixing the Batching Depfwise Comnvolution and multi token input
* fixing generate and logits outputs
* Done!
* Fixing the cache handling, generating works now trying training
* update ACKNOWLEDGEMENTS
* removing the model_type if stuff in the _step loop in generate_step and adding MambaCache in base.py for training easier generations and removing mamba in tuner/utils.
* quick clean up
* update trainer/utils for right initialisation of the layers for LoRA, but not working.
* clean up
* Forther update to trainer/utils for correct layer selection. Successfull training
* removing extra mamba-infer.py file
* clean up, reformating will come later
* reformat and big clean up, final commit
* some speedups and cleanups
* fix test
* nits
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* feature: LoRA adapter for Embeddings
* feature: wire in LoRAEmbedding into the tuner. Allow the embedding and non model.layers Linear layers to be targeted for fine tuning
* feature: DoRA adapter for Embeddings
* feature: wire in DoRAEmbedding
* bugfix: ensure self.m is recalculated when the linear layer is changed in DoRALinear.from_linear
* refactor: prefer from_base over from_linear or from_embedding. prefer fuse over to_linear or to_embedding
* cleanup: remove unused imports in test_dora.py
* refactor: remove unnecessary non_layer_modules
* cleanup: remove wrong comments for lora embedding dropout. remove uncessary parens in dora embedding dropout
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Predict stop sequence matches during streaming
Check for overlap of stop sequences and the tokens array for potential sequence matches after more tokens get generated. Generate tokens until we can confirm that the stop sequence is not met.
* fix typo
* Change sequence_overlap logic
* range isn't inclusive, add 1 to max_overlap
* Add test_server.py
Added a test for the sequence_overlap method
* nits
* eos sequence
* finalize
---------
Co-authored-by: Y4hL <43219534+Y4hL@users.noreply.github.com>
Co-authored-by: Awni Hannun <awni@apple.com>
* Added functionality to load in adapters through post-requests so you do not need to restart the server
* ran pre-commit
* nits
* fix test
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* add dynamicNTK scaling rope
* remove unused var
* fix rope base
* llama3.1 fixes
* TODO for rope eval
* vectorise llama3 base freq calculation
* removed the arbitrary 2.0 rope_scale default case
* fix slow llama3.1 generation by evaluating stateless part of DynamicNTKScalingRoPE in init
* nits + format
* use mx.pi
* fix tests and add test for 3.1
---------
Co-authored-by: Prince Canuma <prince.gdt@gmail.com>
Co-authored-by: Awni Hannun <awni@apple.com>
* Add hf_dataset configuration for using HF hub-hosted datasets for (Q)LoRA training
* Pre-commit formatting
* Fix YAML config example
* Print DS info
* Include name
* Add hf_dataset parameter default
* Remove TextHFDataset and CompletionsHFDataset and use Dataset and CompletionsDataset instead, adding a text_key constructor argument to the former (and changing it to work with a provided data structure instead of just from a JSON file), and prompt_key and completion_key arguments to the latter with defaults for backwards compatibility.
* nits
* update docs
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* GPT-2 model support
* Add test for gpt2 model
* Fix weight sanitizing for quantization
* use approx gelu
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* LoRA: Extract pre_processing_model function
* LoRA: Extract small functions(train_model,evaluate_model)
* move test case to test_tuner_utils.py
* nits
* nits
* remove extra param, validate at it 0
* version
* fix test
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Initial config handler and test
* Added means to run from CLI
* Update lora config loading and tests
* Constrain scheduler config (warmup and minimum LR) for each kind
* Update reference to moved schedule_config module
* Minor fix
* Fix typos
* Moved build_schedule and tests
* nits in schedule config
* flake
* fix path
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* use nn.RMSNorm, use sdpa, cleanup
* bump mlx versions
* minor update
* use fast layer norm
* version bump
* update requirement for whisper
* update requirement for gguf
* chore(mlx-lm): clean up the top p imp
* chore: clean up
* chore: add test
* chore: address comments
* chore: clean up docs string
* chore: clean up test
* chore(mlx-lm): fix print_trainable_parameters for quant models
* chore: clean up
* refactor: use layer type to check quant bits
* chore: address comment
* Add --lora-all-linear option to apply LoRa to all linear transfer block layers
* Moved to YAML config and added specification of rank & alpha
* nits in conifg, more tests
* nit
* run tests for prs
---------
Co-authored-by: Awni Hannun <awni@apple.com>