
# CLIP

An example of OpenAI's CLIP in MLX. The CLIP (contrastive language-image pre-training) model embeds images and text in the same space.[^1]
## Setup

Install the dependencies:

```shell
pip install -r requirements.txt
```
Next, download a CLIP model from Hugging Face and convert it to MLX. The default model is `openai/clip-vit-base-patch32`:

```shell
python convert.py
```

By default, the script downloads the model and configuration files and writes the converted model to the `mlx_model/` directory.
## Run

You can use the CLIP model to embed images and text:
```python
from PIL import Image
import clip

model, tokenizer, img_processor = clip.load("mlx_model")
inputs = {
    "input_ids": tokenizer(["a photo of a cat", "a photo of a dog"]),
    "pixel_values": img_processor(
        [Image.open("assets/cat.jpeg"), Image.open("assets/dog.jpeg")]
    ),
}
output = model(**inputs)

# Get text and image embeddings:
text_embeds = output.text_embeds
image_embeds = output.image_embeds
```
Run the above example with `python clip.py`.

To embed only images or only text, pass only the `pixel_values` or the `input_ids`, respectively.
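Once you have both sets of embeddings, you can compare them directly. Below is a minimal sketch, not part of the example code, that scores each image against each text prompt by cosine similarity. It assumes `text_embeds` and `image_embeds` come from the snippet above:

```python
import mlx.core as mx

# L2-normalize each embedding (rows are individual texts/images).
text_embeds = text_embeds / mx.sqrt(
    mx.sum(text_embeds * text_embeds, axis=-1, keepdims=True)
)
image_embeds = image_embeds / mx.sqrt(
    mx.sum(image_embeds * image_embeds, axis=-1, keepdims=True)
)

# Cosine similarity: similarity[i, j] scores image i against text j.
similarity = image_embeds @ text_embeds.T
print(similarity)
```

Higher values indicate a closer image/text match.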
This example re-implements minimal image preprocessing and tokenization to reduce dependencies. For additional preprocessing functionality, you can use `transformers`. The file `hf_preproc.py` has an example.
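As a rough illustration of that approach (this is a sketch, not the contents of `hf_preproc.py`, and it assumes `transformers` is installed), you could preprocess with Hugging Face's `CLIPProcessor` and feed the results to the MLX model:

```python
from PIL import Image
import mlx.core as mx
from transformers import CLIPProcessor

import clip

# Use Hugging Face's processor for tokenization and image preprocessing.
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
hf_inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=[Image.open("assets/cat.jpeg"), Image.open("assets/dog.jpeg")],
    return_tensors="np",
    padding=True,
)

model, *_ = clip.load("mlx_model")
output = model(
    input_ids=mx.array(hf_inputs["input_ids"]),
    # Assumption: the MLX vision model takes channels-last images, so the
    # NCHW output from the Hugging Face processor is transposed to NHWC.
    # Drop the transpose if your model expects NCHW.
    pixel_values=mx.transpose(mx.array(hf_inputs["pixel_values"]), (0, 2, 3, 1)),
)
```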
MLX CLIP has been tested and works with the following Hugging Face repos:

- `openai/clip-vit-base-patch32`
- `openai/clip-vit-large-patch14`

You can run the tests with:

```shell
python test.py
```

To test new models, update the `MLX_PATH` and `HF_PATH` in `test.py`.
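For example, the two variables in `test.py` might be set along these lines (the values below are illustrative, not the file's actual contents):

```python
# Converted MLX model and the Hugging Face repo to compare it against.
MLX_PATH = "mlx_model"
HF_PATH = "openai/clip-vit-base-patch32"
```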
## Attribution

- `assets/cat.jpeg` is a "Cat" by London's, licensed under CC BY-SA 2.0.
- `assets/dog.jpeg` is a "Happy Dog" by tedmurphy, licensed under CC BY 2.0.
[^1]: Refer to the original paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) or the accompanying blog post.