LLaVA in MLX (#461)

* add: llava mlx first draft

* add: weights comparison

* add forward pass skeleton

* update: now imports weights correctly

* delete base

* latest

* adding config

* fix: use config

* add mlx config

* feat: add image processor for llava processor

* wip

* feat: llava working example

* chore: refactor generate script

* chore: clean up

* add: warn user if the prompt has no <image> token despite an image being provided

* add: __call__ to LlavaModel

* add: call to LlavaModel

* update fp

* clean up var names

* update: native GeLU

* Cleanup

* update generate and readme

* remove todo comment

* rearrange tests

* fix example code

* nits in README

* update readme

* nit in readme

* nits in README

* chore(llava): refactor image embedding merging logic

* min mlx version

* nits in readmes

* fix cli prompt, some nits

* updates, slight simplify

---------

Co-authored-by: anchen <li.anchen.au@gmail.com>
Co-authored-by: Awni Hannun <awni@apple.com>
Author: Noah Kasmanoff
Date: 2024-03-01 13:28:35 -05:00
Committed by: GitHub
Parent: 261f1280f6
Commit: a429263905
9 changed files with 994 additions and 0 deletions

llava/README.md (new file, 61 lines)
@@ -0,0 +1,61 @@
# LLaVA

An example of LLaVA: Large Language and Vision Assistant in MLX.[^1] LLaVA is
a multimodal model that can generate text given combined image and text inputs.

## Setup

Install the dependencies:

```bash
pip install -r requirements.txt
```

## Run

You can use LLaVA to ask questions about images.

For example, using the command line:

```bash
python generate.py \
  --model llava-hf/llava-1.5-7b-hf \
  --image "http://images.cocodataset.org/val2017/000000039769.jpg" \
  --prompt "USER: <image>\nWhat are these?\nASSISTANT:" \
  --max-tokens 128 \
  --temp 0
```
This uses the following image:

![alt text](http://images.cocodataset.org/val2017/000000039769.jpg)

And generates the output:

```
These are two cats lying on a pink couch.
```
You can also use LLaVA in Python:

```python
from generate import load_model, prepare_inputs, generate_text

processor, model = load_model("llava-hf/llava-1.5-7b-hf")

max_tokens, temperature = 128, 0.0

prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
image = "http://images.cocodataset.org/val2017/000000039769.jpg"

input_ids, pixel_values = prepare_inputs(processor, image, prompt)

reply = generate_text(
    input_ids, pixel_values, model, processor, max_tokens, temperature
)

print(reply)
```
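
Note that the prompt should include an `<image>` token to mark where the image
is inserted; the script warns if an image is given without one. A minimal
sketch of such a check (the helper name and warning text below are
illustrative, not the exact code in `generate.py`):

```python
# Illustrative sketch only -- the actual warning in generate.py may differ.
def warn_if_missing_image_token(prompt: str, has_image: bool) -> None:
    """Warn when an image is supplied but the prompt has no <image> placeholder."""
    if has_image and "<image>" not in prompt:
        print(
            "[WARNING] An image was provided but the prompt contains no "
            "<image> token, so the image may not be used."
        )


warn_if_missing_image_token("USER: What are these?\nASSISTANT:", has_image=True)
```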
[^1]:
    Refer to the [LLaVA project webpage](https://llava-vl.github.io/) for more
    information.