Commit Graph

485 Commits

Author SHA1 Message Date
tidely
df744c98e6 Predict stop sequence matches during streaming (#541)
* Predict stop sequence matches during streaming

Check for overlap of stop sequences and the tokens array for potential sequence matches after more tokens get generated. Generate tokens until we can confirm that the stop sequence is not met.

* fix typo

* Change sequence_overlap logic

* range isn't inclusive, add 1 to max_overlap

* Add test_server.py

Added a test for the sequence_overlap method

* nits

* eos sequence

* finalize

---------

Co-authored-by: Y4hL <43219534+Y4hL@users.noreply.github.com>
Co-authored-by: Awni Hannun <awni@apple.com>
2024-08-06 15:24:15 -07:00
Khush Gupta
8fa12b0058 Adapters loading (#902)
* Added functionality to load in adapters through post-requests so you do not need to restart the server

* ran pre-commit

* nits

* fix test

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-08-01 16:18:18 -07:00
madroid
85dc76f6e0 Server: support stream_options (#913)
* Server: support stream_options

see https://x.com/OpenAIDevs/status/1787573348496773423

* Server: support stream_options

* Server: check None type
2024-07-26 08:58:52 -07:00
otriscon
46da74fea2 Unify attention mask in LLMs (#911)
* Unify attention mask creation in LLMs.

Currently, each model implementation in `mlx-examples/llms/models` has ad-hoc
code to create a mask for the attention mechanism. This usually takes the form:

```
    mask = None
    if h.shape[1] > 1:
        mask = nn.MultiHeadAttention.create_additive_causal_mask(h.shape[1])
        mask = mask.astype(h.dtype)
```

This correctly creates a mask only if the input consists of more than one token.
But this code assumes the multi-token input is at the beginning of inference.
If, for example, we are evaluating multiple tokens because of speculative
decoding or prompt cache reuse, this mask will not have the correct shape and
and will cause the raising of an exception in the attention computation.

Some of the models correctly implement the mask creation with code like this:

```
    mask = None
    if h.shape[1] > 1:
        mask = create_additive_causal_mask(
            h.shape[1], cache[0].offset if cache is not None else 0
        )
        mask = mask.astype(h.dtype)
```

This commit unifies the attention mask creation for all models with a new
function `create_attention_mask`, reducing code duplication and helping all
models support inference performance enhancements like those mentioned above.

* Allow batches in LLM key-value cache

The current implementation of the LLM key-value cache assumes that
the input batch is of size 1. Input batching (evaluating multiple
alterative inputs at the same time) can be a valuable tool for
speculative sampling and other techniques.

This change removes the hard-coded batch size from the code that
resizes the key-value cache.

* Simplify causal mask creation

Use the same codepath regardless of whether there's an offset or
not. Addresses [this comment](https://github.com/ml-explore/mlx-examples/pull/911#discussion_r1691459717).

* Use old-style type annotation to avoid linter error
2024-07-25 16:45:22 -07:00
Anchen
7a3ab1620a support load model by custom get_model_classes (#899)
* feature(mlx_lm): support load model by custom get classes

* rename the param
2024-07-25 11:01:17 -07:00
Alex Cheema
cd8efc7fbc Add support for Llama-3.1 (#907)
* add dynamicNTK scaling rope

* remove unused var

* fix rope base

* llama3.1 fixes

* TODO for rope eval

* vectorise llama3 base freq calculation

* removed the arbitrary 2.0 rope_scale default case

* fix slow llama3.1 generation by evaluating stateless part of DynamicNTKScalingRoPE in init

* nits + format

* use mx.pi

* fix tests and add test for 3.1

---------

Co-authored-by: Prince Canuma <prince.gdt@gmail.com>
Co-authored-by: Awni Hannun <awni@apple.com>
2024-07-23 13:21:32 -07:00
M. Ali Bayram
47060a8130 refactor: add force_download parameter to get_model_path function (#800) 2024-07-23 13:10:20 -07:00
Prince Canuma
3f337e0f0a Add Mistral NeMo (fix) (#895)
* fix head_dim

* Update llms/mlx_lm/models/llama.py

* fix kv error

* formatting

* Delete test.py

---------

Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
2024-07-22 06:09:24 -07:00
Prince Canuma
3d365b612a Add support for InternLM-2.5 (#871)
* fix internlm-2

* formatting

* add dynamic ntk rope

* formatting

* move dynamic scaling rope to intermlm2.py

* add default max_position_embeddings
2024-07-17 16:38:22 -07:00
Anchen
561dcf5643 Add support for deepseek coder v2 lite (#882)
* feat: add support for deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct

* fix softmax + some cleanup

* more nits

* fix rope

* fix original_max_position_embeddings in rope

* fix original_max_position_embeddings in rope config

* add group greedy

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-07-17 07:23:28 -07:00
Awni Hannun
f0c6c6e226 keep the server in a valid state (#889) 2024-07-15 18:35:36 -07:00
JosefAlbers
bfc1f2763b longrope (#886) 2024-07-12 07:19:11 -07:00
Chime Ogbuji
8bf397e450 Pass use_dora parameter to linear_to_lora_layers (#885) 2024-07-11 14:34:34 -07:00
nicolov
fbe3247772 Add GPT-neox model (#863) 2024-07-11 06:13:17 -07:00
James A Capozzoli
9717307ff0 Validation with full data set, results in NaN validation score (#879)
* CLI arguments may set num_batches to -1

The CLI arguments allow you to validate with the entire dataset by passing a negative one value, but this quickly results in a division by zero `NaN` to appear as the validation score!

* Must properly assemble the mini batches when validating with entire dataset.

Tested locally, a validation of a novel took about an hour, with a loss of 0.928. Thanks @awni for the correction!

* Set up the pre-commit hooks and run them so that black may format lora.py.
2024-07-10 08:36:11 -07:00
Alex Wozniakowski
63800c8feb Example of response generation with optional arguments (#853)
* Generate response with optional arguments

* Reference response generation example

* Include transformers and sentencepiece

* Update example to run Mistral-7B-Instruct-v0.3

* Link to generation example

* Style changes from pre-commit
2024-07-09 06:49:59 -07:00
Awni Hannun
68e88d42fb Fix server for openai package (#877)
* fix

* fixes for 9b
2024-07-08 12:34:31 -07:00
Awni Hannun
20e221f7f7 Add recurrent gemma (#856)
* add recurrent gemma

* fix window cache
2024-07-07 12:10:04 -07:00
n8programs
1e05aef344 Add logit soft capping to gemma, and fix precision issues (#857)
* Add logit soft capping to gemma, and fix precision issues

Gemma was babbling nonsense - so I figured out it was due to not having logit softcapping and precision issues causing NaNs (so I implemented the softcapping and added more float32 inference). gemma-27b-it-4bit now works flawlessly (or near-flawlessly, no sliding-window attention).

* get rid of comments

* get rid of last comments (sry lol)

* nits

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-07-02 07:52:39 -07:00
Angelos Katharopoulos
f212b770d8 Server loads the model on demand from the request (#851) 2024-06-27 11:37:57 -07:00
Awni Hannun
538339b599 gemma2 (#855) 2024-06-27 10:06:28 -07:00
Awni Hannun
9f10728145 fix yi (#852) 2024-06-27 06:38:19 -07:00
Volodymyr Kyrylov
7979b84a9e transformer_lm: add --dataset enwik8 (#838)
* transformer_lm: add --dataset enwik8

* nits

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-26 11:59:01 -07:00
Chime Ogbuji
df6bc09d74 Configuration-based use of HF hub-hosted datasets for training (#701)
* Add hf_dataset configuration for using HF hub-hosted datasets for (Q)LoRA training

* Pre-commit formatting

* Fix YAML config example

* Print DS info

* Include name

* Add hf_dataset parameter default

* Remove TextHFDataset and CompletionsHFDataset and use Dataset and CompletionsDataset instead, adding a text_key constructor argument to the former (and changing it to work with a provided data structure instead of just from a JSON file), and prompt_key and completion_key arguments to the latter with defaults for backwards compatibility.

* nits

* update docs

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-26 10:20:50 -07:00
Chime Ogbuji
1d701a1831 Logprobs info to completion API (#806)
* Initial implementation

* Fix handling of return_step_logits in return

* Fixed OpenAI parameter expectations and logprob structure and datatypes

* pre-commit black formatting

* Remove unused parameter

* fix log probs

* fix colorize

* nits in server

* nits in server

* Fix top_logprobs structure (a dict) and include tokens in logprobs response

* nits

* fix types

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-23 10:35:13 -07:00
Yi Wang
a7598e9456 Fix mypy errors with models/{qwen2,qwen2_moe,startcoder2}.py (#835)
* Fix starcoder.py

* Fix qwen2

* Remvoe unnecessary assert not None
2024-06-14 09:44:50 -07:00
Awni Hannun
d8b073e3a7 Add eos token to lora fine-tunes (#818)
* add eos token to lora fine-tunes

* Comment
2024-06-12 07:44:21 -07:00
Nada Amin
3cc58e17fb Tweaks to run dspy-produced calls to the server, with gemma template. (#810)
* Tweaks to run dspy-produced calls to the server, with gemma template.

following comment https://github.com/stanfordnlp/dspy/issues/385#issuecomment-1998939936

can try it out with:
```sh
python -m server --model mlx-community/gemma-1.1-7b-it-4bit --port 1143
```
modulo patching the relative imports in server.py
```
-from .tokenizer_utils import TokenizerWrapper
-from .utils import generate_step, load
+from mlx_lm.tokenizer_utils import TokenizerWrapper
+from mlx_lm.utils import generate_step, load
```

and then, ont the dspy side:
```python
import dspy
lm = dspy.OpenAI(model_type="chat", api_base="http://localhost:11434/v1/", api_key="not_needed", max_tokens=250)
lm("hello")
```

* simpler way to validate float or int

* remove logic that works around incompatible templates, too gemma specific

* tweak messages for common denominator

* use generate.py workaround for DBXR

* put behind flag

* oops

* Solution to chat template issue: pass in a custom template!

The template should likely adhere to the OpenAI chat model.
Here is such a template for Gemma.

--chat-template "{{ bos_token }}{% set extra_system = '' %}{% for message in messages %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{% if role == 'system' %}{% set extra_system = extra_system + message['content'] %}{% else %}{% if role == 'user' and extra_system %}{% set message_system = 'System: ' + extra_system %}{% else %}{% set message_system = '' %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message_system + message['content'] | trim + '<end_of_turn>\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"

* remove convoluted solution

* Tweak for when None is provided explicitly, and must be set to [] too.

For example, the outlines library provides None explicitly.

* style

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-12 07:17:06 -07:00
Yi Wang
6da07fb1b0 make models/phi3.py and models/phi3small.py compatible with mypy (#833) 2024-06-12 06:53:55 -07:00
JosefAlbers
fda41545a6 Su-RoPE(Rotary Position Embedding) for Phi-3 (#813)
* Su-RoPE

* nits

* Update su_rope.py

* Update su_rope.py

Per GPT4: "The error TypeError: 'type' object is not subscriptable is caused by using the type hint list[float] in a version of Python that does not support it. This syntax is only available in Python 3.9 and later."

* Ran isort

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-11 06:20:04 -07:00
Yi Wang
a54dfd698e Correct the type annotation of cache in llama.py (#828)
* Update

* Fix isort
2024-06-10 15:18:34 -07:00
Yi Wang
bb8227f181 Correct type annotation of llama.ModelArgs.num_key_value_heads (#827) 2024-06-10 14:47:31 -07:00
Awni Hannun
c5da302fc4 gpu featurization (#824) 2024-06-07 08:59:44 -07:00
Robin Glauser
4872727f14 Fixing "NameError: name 'resume_adapter_file' is not defined" (#817)
args. is missing from resume_adapter_file so the name is not defined.
2024-06-05 10:07:31 -07:00
Michał Kurc
43d6deb3c1 mlx_lm: Add Streaming Capability to Generate Function (#807)
* Add streaming feature to text generation function

* separate stream and regular functions

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-03 09:04:39 -07:00
Shiyu
8353bbbf93 Segment Anything Model (#552)
* add segment anything model

* add readme

* reorg file structure

* update

* lint

* minor updates

* ack

* fix weight loading

* simplify

* fix to run notebooks

* amg in mlx

* remove torch dependency

* nit in README

* return indices in nms

* simplify

* bugfix / simplify

* fix bug'

* simplify

* fix notebook and remove output

* couple more nits

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-02 16:45:51 -07:00
Derek Lewis
89b0b75250 GPT2 Support (#798)
* GPT-2 model support

* Add test for gpt2 model

* Fix weight sanitizing for quantization

* use approx gelu

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-02 16:33:20 -07:00
madroid
c457a3f88b LoRA: Extract small function (#614)
* LoRA: Extract pre_processing_model  function

* LoRA: Extract small functions(train_model,evaluate_model)

* move test case to test_tuner_utils.py

* nits

* nits

* remove extra param, validate at it 0

* version

* fix test

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-02 06:38:42 -07:00
Awni Hannun
81318ad4a8 Port of phi3small (#794)
* start port of phi3small

* fix phi3

* use block sparsity

* compile activation

* nits in readme / mlx lm version
2024-05-31 12:54:14 -07:00
Awni Hannun
09aaeac72c fix moe conversion (#802) 2024-05-31 12:36:05 -07:00
Behnam Moh
f49c5f2829 fixed the requirements (#803) 2024-05-29 06:14:19 -07:00
Chen Xin
aac98ca6f4 support internlm2 (#797)
* support internlm2

* only attention projections

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-05-27 06:22:21 -07:00
Awni Hannun
ca7ce60c91 Rename block sparse to gather (#793)
* rename block sparse to gather

* pin mlx version
2024-05-23 19:47:35 -07:00
Prince Canuma
69700d8431 Add support for Phi-3 Medium (#790)
* update to support phi-3 medium

* fuse qkv split
2024-05-22 16:47:06 -07:00
Prince Canuma
b044ce2acf Add support for ibm granite (#758)
* add support for granite 3-8B config

* add gpt_bigcode

* add positional embedding condition.

* add support for granite 3-8B config

* add gpt_bigcode

* add positional embedding condition.

* remove unused function

* rebase fix

* move position emebedding to mask creation

* add to tuner and format

* add support for granite 3-8B config

* add gpt_bigcode

* add positional embedding condition.

* add support for granite 3-8B config

* add gpt_bigcode

* add positional embedding condition.

* rebase fix

* move position emebedding to mask creation

* add to tuner and format

* refactor mask

* remove dropout layers
2024-05-21 20:16:31 -07:00
Awni Hannun
9fc6efbd90 version bump + some fixes (#792) 2024-05-21 20:09:35 -07:00
Angelos Katharopoulos
9f671228cd Block sparse MM MoEs (#782)
- Adds SwitchLinear
- Adds QuantizedSwitchLinear
2024-05-21 15:58:08 -07:00
AtakanTekparmak
199df9e110 fix: Added dedicated error handling to load and get_model_path (#775)
* fix: Added dedicated error handling to load and get_model_path

Added proper error handling to load and get_model_path by adding a dedicated exception class, because when the local path is not right, it still throws the huggingface RepositoryNotFoundError

* fix: Changed error message and resolved lack of import

* fix: Removed redundant try-catch block

* nits in message

* nits in message

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-05-20 06:39:05 -07:00
Awni Hannun
e92de216fd rid warning (#789) 2024-05-20 06:05:33 -07:00
alexC-nonsense4k
42458914c8 support dora finetune in mlx-examples/llms/mlx_lm (#779)
* support dora finetune

* solve problems in lora.py and tuner.utils.py

* add use_dora (bool) in functions of load adapters

* delete all unsupported quantization code and fix all the calculate problems in mlx_lm/tuner/dora.py

* Using stop_gradient to prevent gradients from flowing through ‘norm’ during backpropagation

* set DEFAULT_USE_DORA in mlx_lm/generate.py

* add annotation for all the use_dora

* mlx_lm/fuse.py support fuse dora layers and fix a bug of to_linear() in mlx_lm/tuner/dora.py

* simplify code of juding type of a fused layer in mlx_lm/fuse.py

* add use_dora in mlx_lm/fuse.py when apply_lora_layers()

* style + nits

* style + nits

* more updates

---------

Co-authored-by: chenyifei08 <chenyifei08@baidu.com>
Co-authored-by: Awni Hannun <awni@apple.com>
2024-05-16 08:21:26 -07:00