CLIP

An example of OpenAI's CLIP in MLX. The CLIP (contrastive language-image pre-training) model embeds images and text in the same space.

Setup

Install the dependencies:

pip install -r requirements.txt

Next, download a CLIP model from Hugging Face and convert it to MLX. The default model is openai/clip-vit-base-patch32.

python convert.py

By default, the script downloads the model and configuration files to the mlx_model/ directory.
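
To convert a different checkpoint, pass the source Hugging Face repo and an output path to convert.py. The flag names below are an assumption; run python convert.py --help to see the exact options.

python convert.py --hf-repo openai/clip-vit-large-patch14 --mlx-path mlx_model_large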

Run

You can use the CLIP model to embed images and text.

from PIL import Image
import clip

model, tokenizer, img_processor = clip.load("mlx_model")
inputs = {
    "input_ids": tokenizer(["a photo of a cat", "a photo of a dog"]),
    "pixel_values": img_processor(
        [Image.open("assets/cat.jpeg"), Image.open("assets/dog.jpeg")]
    ),
}
output = model(**inputs)

# Get text and image embeddings:
text_embeds = output.text_embeds
image_embeds = output.image_embeds

Run the above example with python clip.py.
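
To compare the two modalities, you can compute cosine similarities between the text and image embeddings. A minimal sketch using mlx.core (the explicit normalization is harmless if the model already returns unit-norm embeddings):

import mlx.core as mx

# Normalize so the dot product is the cosine similarity.
def normalize(x):
    return x / mx.sqrt((x * x).sum(axis=-1, keepdims=True))

# Rows are text prompts, columns are images.
similarity = normalize(text_embeds) @ normalize(image_embeds).T
print(similarity)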

To embed only images or only text, pass only the pixel_values or input_ids, respectively.
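
Continuing from the snippet above, and assuming the model treats input_ids and pixel_values as optional keyword arguments:

# Image embeddings only:
image_embeds = model(pixel_values=inputs["pixel_values"]).image_embeds

# Text embeddings only:
text_embeds = model(input_ids=inputs["input_ids"]).text_embeds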

This example re-implements minimal image preprocessing and tokenization to reduce dependencies. For additional preprocessing functionality, you can use transformers. The file hf_preproc.py has an example.
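
A rough sketch of that approach (this is not the exact contents of hf_preproc.py; it assumes the MLX model expects channels-last images, so the Hugging Face NCHW pixel values are transposed):

from PIL import Image
import mlx.core as mx
from transformers import CLIPProcessor
import clip

model, *_ = clip.load("mlx_model")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

hf_inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=[Image.open("assets/cat.jpeg"), Image.open("assets/dog.jpeg")],
    return_tensors="np",
    padding=True,
)

output = model(
    input_ids=mx.array(hf_inputs["input_ids"]),
    # Hugging Face returns NCHW images; convert to NHWC for the MLX model.
    pixel_values=mx.array(hf_inputs["pixel_values"]).transpose(0, 2, 3, 1),
)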

MLX CLIP has been tested and works with the following Hugging Face repos:

  • openai/clip-vit-base-patch32
  • openai/clip-vit-large-patch14

You can run the tests with:

python test.py

To test new models, update MLX_PATH and HF_PATH in test.py.
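
For example, testing a converted large model might look like this in test.py (the MLX path below is a hypothetical output directory from convert.py):

MLX_PATH = "mlx_model_large"
HF_PATH = "openai/clip-vit-large-patch14"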

Attribution

  • assets/cat.jpeg is "Cat" by London's, licensed under CC BY-SA 2.0.
  • assets/dog.jpeg is "Happy Dog" by tedmurphy, licensed under CC BY 2.0.