# CLIP
An example of OpenAI's CLIP in MLX. The CLIP (contrastive language-image
pre-training) model embeds images and text in the same space.[^1]
### Setup
Install the dependencies:
```shell
pip install -r requirements.txt
```
Next, download a CLIP model from Hugging Face and convert it to MLX. The
default model is
[openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32).
```shell
python convert.py
```
By default, the script downloads the model and configuration files to the
directory ``mlx_model/``.
### Run
You can use the CLIP model to embed images and text.
```python
from PIL import Image
import clip

model, tokenizer, img_processor = clip.load("mlx_model")
inputs = {
    "input_ids": tokenizer(["a photo of a cat", "a photo of a dog"]),
    "pixel_values": img_processor(
        [Image.open("assets/cat.jpeg"), Image.open("assets/dog.jpeg")]
    ),
}
output = model(**inputs)

# Get text and image embeddings:
text_embeds = output.text_embeds
image_embeds = output.image_embeds
```
Run the above example with `python clip.py`.
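
Because the two embeddings live in the same space, you can score image-text
matches with cosine similarity. A minimal sketch, continuing from the example
above (the `normalize` helper here is not part of the example):

```python
import mlx.core as mx

def normalize(x):
    # Scale each embedding to unit length so dot products become cosines.
    return x / mx.sqrt(mx.sum(x * x, axis=-1, keepdims=True))

# (2, 2) matrix of image-vs-text cosine similarities; for the cat and dog
# inputs above, the diagonal entry should be the largest in each row.
similarity = normalize(image_embeds) @ normalize(text_embeds).T
```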
To embed only images or only text, pass only ``pixel_values`` or
``input_ids``, respectively.
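
For instance, a text-only call might look like the following sketch, which
reuses `model` and `tokenizer` from the example above (presumably leaving
`image_embeds` unset since no pixel values are given):

```python
# Embed text alone by omitting pixel_values.
text_only = model(input_ids=tokenizer(["a photo of a cat"]))
text_embeds = text_only.text_embeds
```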
This example re-implements minimal image preprocessing and tokenization to reduce
dependencies. For additional preprocessing functionality, you can use
``transformers``. The file `hf_preproc.py` has an example.
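
For reference, a sketch of that approach (the exact contents of
`hf_preproc.py` may differ; this assumes the MLX model takes NHWC images,
hence the transpose from Hugging Face's NCHW layout):

```python
import mlx.core as mx
from PIL import Image
from transformers import CLIPProcessor

import clip

model, _, _ = clip.load("mlx_model")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Let transformers handle tokenization and image preprocessing.
hf_inputs = processor(
    text=["a photo of a cat"],
    images=[Image.open("assets/cat.jpeg")],
    return_tensors="np",
)
output = model(
    input_ids=mx.array(hf_inputs["input_ids"]),
    pixel_values=mx.array(hf_inputs["pixel_values"]).transpose(0, 2, 3, 1),
)
```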
MLX CLIP has been tested and works with the following Hugging Face repos:
- [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32)
- [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)
You can run the tests with:
```shell
python test.py
```
To test new models, update the `MLX_PATH` and `HF_PATH` in `test.py`.
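
For example, to test a freshly converted large model, the two variables might
be set like this (the values here are illustrative):

```python
# In test.py: path of the converted MLX model and its source Hugging Face repo.
MLX_PATH = "mlx_model"
HF_PATH = "openai/clip-vit-large-patch14"
```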
### Attribution
- `assets/cat.jpeg` is "Cat" by London's, licensed under CC BY-SA 2.0.
- `assets/dog.jpeg` is "Happy Dog" by tedmurphy, licensed under CC BY 2.0.
[^1]: Refer to the original paper [Learning Transferable Visual Models From
Natural Language Supervision](https://arxiv.org/abs/2103.00020) or the [blog
post](https://openai.com/research/clip).