# LLaVA

An example of LLaVA: Large Language and Vision Assistant in MLX.[^1] LLaVA is
a multimodal model that can generate text given combined image and text inputs.

## Setup

Install the dependencies:

```bash
pip install -r requirements.txt
```

## Run

You can use LLaVA to ask questions about images.

For example, using the command line:

```bash
python generate.py \
  --model llava-hf/llava-1.5-7b-hf \
  --image "http://images.cocodataset.org/val2017/000000039769.jpg" \
  --prompt "USER: <image>\nWhat are these?\nASSISTANT:" \
  --max-tokens 128 \
  --temp 0
```

This uses the following image:

![Two cats lying on a couch](http://images.cocodataset.org/val2017/000000039769.jpg)

And generates the output:

```
These are two cats lying on a pink couch.
```

You can also use LLaVA in Python:

```python
from generate import load_model, prepare_inputs, generate_text

processor, model = load_model("llava-hf/llava-1.5-7b-hf")

max_tokens, temperature = 128, 0.0

prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
image = "http://images.cocodataset.org/val2017/000000039769.jpg"
input_ids, pixel_values = prepare_inputs(processor, image, prompt)

reply = generate_text(
    input_ids, pixel_values, model, processor, max_tokens, temperature
)

print(reply)
```

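Since `load_model` only needs to run once, you can keep the returned `processor` and `model` and reuse them for several questions or images. Below is a minimal sketch of that pattern using only the helpers from the example above; the `queries` list is illustrative.

```python
from generate import load_model, prepare_inputs, generate_text

# Load the processor and model once and reuse them for every query.
processor, model = load_model("llava-hf/llava-1.5-7b-hf")

max_tokens, temperature = 128, 0.0

# Illustrative (image, question) pairs.
queries = [
    ("http://images.cocodataset.org/val2017/000000039769.jpg", "What are these?"),
    ("http://images.cocodataset.org/val2017/000000039769.jpg", "What color is the couch?"),
]

for image, question in queries:
    # LLaVA 1.5 expects the USER/ASSISTANT chat format with an <image> placeholder.
    prompt = f"USER: <image>\n{question}\nASSISTANT:"
    input_ids, pixel_values = prepare_inputs(processor, image, prompt)
    reply = generate_text(
        input_ids, pixel_values, model, processor, max_tokens, temperature
    )
    print(f"{question} -> {reply}")
```
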
[^1]: Refer to the [LLaVA project webpage](https://llava-vl.github.io/) for more information.