CLIP

An example of OpenAI's CLIP in MLX. The CLIP (contrastive language-image pre-training) model embeds images and text in the same space.

Setup

Install the dependencies:

pip install -r requirements.txt

Next, download a CLIP model from Hugging Face and convert it to MLX. The default model is openai/clip-vit-base-patch32.

python convert.py

By default, the script downloads the model and configuration files to the mlx_model/ directory.
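
To convert a different checkpoint, pass the source Hugging Face repo and an output path to convert.py. The flag names below are an assumption; run python convert.py --help to see the exact options.

python convert.py --hf-repo openai/clip-vit-large-patch14 --mlx-path mlx_model_large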

Run

You can use the CLIP model to embed images and text.

from PIL import Image
import clip

model, tokenizer, img_processor = clip.load("mlx_model")
inputs = {
    "input_ids": tokenizer(["a photo of a cat", "a photo of a dog"]),
    "pixel_values": img_processor(
        [Image.open("assets/cat.jpeg"), Image.open("assets/dog.jpeg")]
    ),
}
output = model(**inputs)

# Get text and image embeddings:
text_embeds = output.text_embeds
image_embeds = output.image_embeds

Run the above example with python clip.py.
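
To compare the two modalities, you can compute cosine similarities between the text and image embeddings. A minimal sketch using mlx.core (the explicit normalization is harmless if the model already returns unit-norm embeddings):

import mlx.core as mx

# Normalize so the dot product is the cosine similarity.
def normalize(x):
    return x / mx.sqrt((x * x).sum(axis=-1, keepdims=True))

# Rows are text prompts, columns are images.
similarity = normalize(text_embeds) @ normalize(image_embeds).T
print(similarity)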

To embed only images or only text, pass only the pixel_values or input_ids, respectively.
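
Continuing from the snippet above, and assuming the model treats input_ids and pixel_values as optional keyword arguments:

# Image embeddings only:
image_embeds = model(pixel_values=inputs["pixel_values"]).image_embeds

# Text embeddings only:
text_embeds = model(input_ids=inputs["input_ids"]).text_embeds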

This example re-implements minimal image preprocessing and tokenization to reduce dependencies. For additional preprocessing functionality, you can use transformers. The file hf_preproc.py has an example.
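
A rough sketch of that approach (this is not the exact contents of hf_preproc.py; it assumes the MLX model expects channels-last images, so the Hugging Face NCHW pixel values are transposed):

from PIL import Image
import mlx.core as mx
from transformers import CLIPProcessor
import clip

model, *_ = clip.load("mlx_model")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

hf_inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=[Image.open("assets/cat.jpeg"), Image.open("assets/dog.jpeg")],
    return_tensors="np",
    padding=True,
)

output = model(
    input_ids=mx.array(hf_inputs["input_ids"]),
    # Hugging Face returns NCHW images; convert to NHWC for the MLX model.
    pixel_values=mx.array(hf_inputs["pixel_values"]).transpose(0, 2, 3, 1),
)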

MLX CLIP has been tested and works with the following Hugging Face repos:

  • openai/clip-vit-base-patch32
  • openai/clip-vit-large-patch14

You can run the tests with:

python test.py

To test new models, update MLX_PATH and HF_PATH in test.py.
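
For example, testing a converted large model might look like this in test.py (the MLX path below is a hypothetical output directory from convert.py):

MLX_PATH = "mlx_model_large"
HF_PATH = "openai/clip-vit-large-patch14"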

Attribution

  • assets/cat.jpeg is "Cat" by London's, licensed under CC BY-SA 2.0.
  • assets/dog.jpeg is "Happy Dog" by tedmurphy, licensed under CC BY 2.0.