Mirror of https://github.com/ml-explore/mlx-examples.git
LLaVA in MLX (#461)
* add: llava mlx first draft
* add: weights comparison
* add forward pass skeleton
* update: now imports weights correctly
* delete base
* latest
* adding config
* fix: use config
* add mlx config
* feat: add image processor for llava processor
* wip
* feat: llava working example
* chore: refactor generate script
* chore: clean up
* add: warning to user if no <image> token despite using one
* add: __call__ to LlavaModel
* add: call to LlavaModel
* update fp
* clean up var names
* update: native GeLU
* Cleanup
* update generate and readme
* remove todo comment
* rearrange tests
* fix example code
* nits in README
* update readme
* nit in readme
* nits in README
* chore(llava): refactor image embedding merging logic
* min mlx version
* nits in readmes
* fix cli prompt, some nits
* updates, slight simplify

---------

Co-authored-by: anchen <li.anchen.au@gmail.com>
Co-authored-by: Awni Hannun <awni@apple.com>
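The "image embedding merging logic" item above refers to the step where the projected image features are spliced into the prompt's token embeddings at the `<image>` placeholder. The sketch below illustrates that idea only; it is not the example's actual `llava.py` code, and the function name, argument names, and shapes are assumptions made for illustration.

```python
# Illustrative sketch only -- not the actual implementation in this example.
# General idea behind "image embedding merging": the projected vision-tower
# features replace the single <image> placeholder token in the prompt's
# embedding sequence before the language model runs.
import mlx.core as mx


def merge_image_embeddings(text_embeds, image_features, input_ids, image_token_id):
    # text_embeds:    (seq_len, hidden) token embeddings of the prompt
    # image_features: (num_patches, hidden) projected image patch features
    # input_ids:      (seq_len,) prompt token ids containing one <image> token
    positions = [i for i, t in enumerate(input_ids.tolist()) if t == image_token_id]
    if not positions:
        raise ValueError("Prompt contains no <image> token.")
    pos = positions[0]
    # Replace the single placeholder with the full sequence of image patches.
    return mx.concatenate(
        [text_embeds[:pos], image_features, text_embeds[pos + 1 :]], axis=0
    )
```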
llava/README.md (new file, 61 lines)
# LLaVA

An example of LLaVA: Large Language and Vision Assistant in MLX.[^1] LLaVA is
a multimodal model that can generate text given combined image and text inputs.

## Setup

Install the dependencies:

```bash
pip install -r requirements.txt
```

## Run

You can use LLaVA to ask questions about images.

For example, using the command line:

```bash
python generate.py \
  --model llava-hf/llava-1.5-7b-hf \
  --image "http://images.cocodataset.org/val2017/000000039769.jpg" \
  --prompt "USER: <image>\nWhat are these?\nASSISTANT:" \
  --max-tokens 128 \
  --temp 0
```
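
Note that the prompt should contain the `<image>` placeholder token, since the image features are inserted at that position; per the commit message above, `generate.py` warns if an image is supplied but the prompt has no `<image>` token.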

This uses the following image:

![alt text](assets/example.jpg)

And generates the output:

```
These are two cats lying on a pink couch.
```

You can also use LLaVA in Python:

```python
from generate import load_model, prepare_inputs, generate_text

processor, model = load_model("llava-hf/llava-1.5-7b-hf")

max_tokens, temperature = 128, 0.0

prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
image = "http://images.cocodataset.org/val2017/000000039769.jpg"
input_ids, pixel_values = prepare_inputs(processor, image, prompt)

reply = generate_text(
    input_ids, pixel_values, model, processor, max_tokens, temperature
)

print(reply)
```
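
Since `load_model` only needs to run once, the same processor and model can answer several prompts about an image. A small usage sketch building on the API shown above (the extra questions are illustrative):

```python
from generate import load_model, prepare_inputs, generate_text

processor, model = load_model("llava-hf/llava-1.5-7b-hf")
image = "http://images.cocodataset.org/val2017/000000039769.jpg"

# Ask several questions about the same image without reloading the weights.
for prompt in [
    "USER: <image>\nHow many cats are there?\nASSISTANT:",
    "USER: <image>\nWhat color is the couch?\nASSISTANT:",
]:
    input_ids, pixel_values = prepare_inputs(processor, image, prompt)
    print(generate_text(input_ids, pixel_values, model, processor, 128, 0.0))
```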

[^1]:
    Refer to the [LLaVA project webpage](https://llava-vl.github.io/) for more
    information.