
# CLIP

An example of OpenAI's CLIP in MLX. The CLIP (contrastive language-image pre-training) model embeds images and text in the same space.[^1]
## Setup

Install the dependencies:

```shell
pip install -r requirements.txt
```
Next, download a CLIP model from Hugging Face and convert it to MLX. The default model is `openai/clip-vit-base-patch32`:

```shell
python convert.py
```

By default, the script downloads the model and configuration files and writes the converted model to the `mlx_model/` directory.
## Run

You can use the CLIP model to embed images and text:
```python
from PIL import Image
import clip

model, tokenizer, img_processor = clip.load("mlx_model")
inputs = {
    "input_ids": tokenizer(["a photo of a cat", "a photo of a dog"]),
    "pixel_values": img_processor(
        [Image.open("assets/cat.jpeg"), Image.open("assets/dog.jpeg")]
    ),
}
output = model(**inputs)

# Get text and image embeddings:
text_embeds = output.text_embeds
image_embeds = output.image_embeds
```
Run the above example with `python clip.py`.

To embed only images or only text, pass only the `pixel_values` or the `input_ids`, respectively.
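Once you have both sets of embeddings, you can compare them directly. Below is a minimal sketch, not part of the example code, that scores each image against each text prompt by cosine similarity. It assumes `text_embeds` and `image_embeds` come from the snippet above:

```python
import mlx.core as mx

# L2-normalize each embedding (rows are individual texts/images).
text_embeds = text_embeds / mx.sqrt(
    mx.sum(text_embeds * text_embeds, axis=-1, keepdims=True)
)
image_embeds = image_embeds / mx.sqrt(
    mx.sum(image_embeds * image_embeds, axis=-1, keepdims=True)
)

# Cosine similarity: similarity[i, j] scores image i against text j.
similarity = image_embeds @ text_embeds.T
print(similarity)
```

Higher values indicate a closer image/text match.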
This example re-implements minimal image preprocessing and tokenization to reduce dependencies. For additional preprocessing functionality, you can use `transformers`. The file `hf_preproc.py` has an example.
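As a rough illustration of that approach (this is a sketch, not the contents of `hf_preproc.py`, and it assumes `transformers` is installed), you could preprocess with Hugging Face's `CLIPProcessor` and feed the results to the MLX model:

```python
from PIL import Image
import mlx.core as mx
from transformers import CLIPProcessor

import clip

# Use Hugging Face's processor for tokenization and image preprocessing.
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
hf_inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=[Image.open("assets/cat.jpeg"), Image.open("assets/dog.jpeg")],
    return_tensors="np",
    padding=True,
)

model, *_ = clip.load("mlx_model")
output = model(
    input_ids=mx.array(hf_inputs["input_ids"]),
    # Assumption: the MLX vision model takes channels-last images, so the
    # NCHW output from the Hugging Face processor is transposed to NHWC.
    # Drop the transpose if your model expects NCHW.
    pixel_values=mx.transpose(mx.array(hf_inputs["pixel_values"]), (0, 2, 3, 1)),
)
```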
MLX CLIP has been tested and works with the following Hugging Face repos:

- `openai/clip-vit-base-patch32`
- `openai/clip-vit-large-patch14`

You can run the tests with:

```shell
python test.py
```

To test new models, update the `MLX_PATH` and `HF_PATH` in `test.py`.
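For example, the two variables in `test.py` might be set along these lines (the values below are illustrative, not the file's actual contents):

```python
# Converted MLX model and the Hugging Face repo to compare it against.
MLX_PATH = "mlx_model"
HF_PATH = "openai/clip-vit-base-patch32"
```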
## Attribution

- `assets/cat.jpeg` is a "Cat" by London's, licensed under CC BY-SA 2.0.
- `assets/dog.jpeg` is a "Happy Dog" by tedmurphy, licensed under CC BY 2.0.
[^1]: Refer to the original paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) or the accompanying blog post.