CLIP (ViT) (#315)

* probably approximately correct CLIPTextEncoder
* implemented CLIPEncoderLayer as built-in nn.TransformerEncoderLayer
* replaced embedding layer with simple matrix
* implemented ViT
* added ViT tests
* fixed tests
* added pooler_output for text
* implemented complete CLIPModel
* implemented init
* implemented convert.py and from_pretrained
* fixed some minor bugs and added the README.md
* removed tokenizer unused comments
* removed unused deps
* updated ACKNOWLEDGEMENTS.md
* Feat: Image Processor for CLIP (#1) @nkasmanoff:
* clip image processor
* added example usage
* refactored image preprocessing
* deleted unused image_config.py
* removed preprocessing port
* added dependency to mlx-data
* fixed attribution and moved photos to assets
* implemented a simple port of CLIPImageProcessor
* review changes
* PR review changes
* renamed too verbose arg
* updated README.md
* nits in readme / conversion
* simplify some stuff, remove unneeded inits
* remove more init stuff
* more simplify
* make test a unit test
* update main readme
* readme nits

---------

Co-authored-by: Noah Kasmanoff <nkasmanoff@gmail.com>
Co-authored-by: Awni Hannun <awni@apple.com>
2025-12-16 02:08:55 +08:00 · 2024-01-31 23:19:53 +01:00
parent ba3a9355d1
commit 94358219cf
14 changed files with 890 additions and 0 deletions
--- a/clip/README.md
+++ b/clip/README.md
@@ -0,0 +1,76 @@
+# CLIP
+
+An example of OpenAI's CLIP in MLX. The CLIP (contrastive language-image
+pre-training) model embeds images and text in the same space.[^1]
+
+### Setup
+
+Install the dependencies:
+
+```shell
+pip install -r requirements.txt
+```
+
+Next, download a CLIP model from Hugging Face and convert it to MLX. The
+default model is
+[openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32).
+
+```shell
+python convert.py
+```
+
+By default, the script downloads the model and configuration files to the
+directory `mlx_model/`.
+
+### Run
+
+You can use the CLIP model to embed images and text.
+
+```python
+from PIL import Image
+import clip
+
+model, tokenizer, img_processor = clip.load("mlx_model")
+inputs = {
+    "input_ids": tokenizer(["a photo of a cat", "a photo of a dog"]),
+    "pixel_values": img_processor(
+        [Image.open("assets/cat.jpeg"), Image.open("assets/dog.jpeg")]
+    ),
+}
+output = model(**inputs)
+
+# Get text and image embeddings:
+text_embeds = output.text_embeds
+image_embeds = output.image_embeds
+```
+
+Run the above example with `python clip.py`.
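A common next step with the embeddings from the example above is to score every text against every image by cosine similarity. The following is a minimal sketch of that idea in plain Python, not code from this commit. The helper names (`l2_normalize`, `cosine_similarity_matrix`, `best_match`) and the toy vectors are hypothetical; with real model output you would first convert the embeddings to nested lists (assuming the arrays expose something like `tolist()`) or do the same arithmetic with array operations.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity_matrix(text_embeds, image_embeds):
    """Return sims[i][j] = cosine similarity of text i and image j."""
    text_unit = [l2_normalize(t) for t in text_embeds]
    image_unit = [l2_normalize(v) for v in image_embeds]
    return [
        [sum(a * b for a, b in zip(t, v)) for v in image_unit]
        for t in text_unit
    ]


def best_match(similarities):
    """For each text (row), return the index of the most similar image."""
    return [max(range(len(row)), key=row.__getitem__) for row in similarities]


# Toy stand-ins for text_embeds / image_embeds (illustrative values only).
toy_text = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1]]
toy_image = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.0]]

sims = cosine_similarity_matrix(toy_text, toy_image)
matches = best_match(sims)
print(matches)  # [0, 1]
```

Normalizing inside the helper keeps the result a true cosine similarity whether or not the inputs already have unit length.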
+
+To embed only images or only text, pass only ``pixel_values`` or
+``input_ids``, respectively.
+
+This example re-implements minimal image preprocessing and tokenization to reduce
+dependencies. For additional preprocessing functionality, you can use
+``transformers``. The file `hf_preproc.py` has an example.
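To give a sense of what "minimal image preprocessing" usually involves for CLIP, here is a hedged sketch of the arithmetic only: resize the shorter side, center crop, and normalize each channel. It is not the processor added in this commit. The function names are hypothetical, the 224-pixel size is an assumption matching the default `openai/clip-vit-base-patch32` configuration, and the rounding in the resize step may differ from what a given library does. The mean and standard deviation are the per-channel values published with the OpenAI CLIP models.

```python
# Per-channel RGB statistics used by the OpenAI CLIP checkpoints.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def resize_shortest_side(width, height, target=224):
    """Return (new_width, new_height) with the shorter side scaled to target.

    Hypothetical helper; libraries may round slightly differently.
    """
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_box(width, height, crop=224):
    """Return (left, top, right, bottom) of a centered crop x crop box."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


def normalize_pixel(rgb):
    """Map an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)
    )


# Example: a 640x480 image is resized, then cropped to 224x224.
new_size = resize_shortest_side(640, 480)
box = center_crop_box(*new_size)
white = normalize_pixel((255, 255, 255))
```

With an image library such as PIL, the computed size and box would be passed to its resize and crop calls before normalizing the pixel values.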
+
+MLX CLIP has been tested and works with the following Hugging Face repos:
+
+- [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32)
+- [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)
+
+You can run the tests with:
+
+```shell
+python test.py
+```
+
+To test new models, update the `MLX_PATH` and `HF_PATH` in `test.py`.
+
+### Attribution
+
+- `assets/cat.jpeg` is a "Cat" by London's, licensed under CC BY-SA 2.0.
+- `assets/dog.jpeg` is a "Happy Dog" by tedmurphy, licensed under CC BY 2.0.
+
+[^1]: Refer to the original paper [Learning Transferable Visual Models From
+  Natural Language Supervision](https://arxiv.org/abs/2103.00020) or the [blog
+  post](https://openai.com/research/clip).