# CLIP

An example of OpenAI's CLIP in MLX. The CLIP (contrastive language-image pre-training) model embeds images and text in the same space.[^1]
## Setup

Install the dependencies:

```shell
pip install -r requirements.txt
```
Next, download a CLIP model from Hugging Face and convert it to MLX. The default model is `openai/clip-vit-base-patch32`:

```shell
python convert.py
```

By default, the script downloads the model and configuration files to the `mlx_model/` directory.
## Run

You can use the CLIP model to embed images and text:

```python
from PIL import Image
import clip

model, tokenizer, img_processor = clip.load("mlx_model")
inputs = {
    "input_ids": tokenizer(["a photo of a cat", "a photo of a dog"]),
    "pixel_values": img_processor(
        [Image.open("assets/cat.jpeg"), Image.open("assets/dog.jpeg")]
    ),
}
output = model(**inputs)

# Get text and image embeddings:
text_embeds = output.text_embeds
image_embeds = output.image_embeds
```
Run the above example with `python clip.py`.
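Because the two sets of embeddings live in the same space, they can be compared directly. Below is a minimal sketch of scoring each prompt against each image with cosine similarity, continuing from the snippet above (it assumes the embeddings are ordinary MLX arrays, which is how they are used here):

```python
import mlx.core as mx

# Normalize, then take dot products: rows are prompts, columns are images.
text_embeds = text_embeds / mx.linalg.norm(text_embeds, axis=-1, keepdims=True)
image_embeds = image_embeds / mx.linalg.norm(image_embeds, axis=-1, keepdims=True)
similarity = text_embeds @ image_embeds.T
print(similarity)  # a higher score means a better text/image match
```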
To embed only the text or only the images, pass just the `input_ids` or just the `pixel_values`, respectively.
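For example, a text-only call could look like the following, assuming the model accepts the same keyword arguments shown above and simply skips the missing modality:

```python
# Embed text only: no pixel_values are passed.
text_only = model(input_ids=tokenizer(["a photo of a cat"]))
print(text_only.text_embeds.shape)
```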
This example re-implements minimal image preprocessing and tokenization to reduce dependencies. For additional preprocessing functionality, you can use `transformers`. The file `hf_preproc.py` has an example.
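A rough sketch of that approach is below. It assumes the standard `transformers` `CLIPProcessor` API and that the MLX model expects channels-last `pixel_values`, so the transpose may not match this script exactly; see `hf_preproc.py` for the actual supported usage:

```python
from PIL import Image
import mlx.core as mx
from transformers import CLIPProcessor

import clip

model, *_ = clip.load("mlx_model")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

hf_inputs = processor(
    text=["a photo of a cat"],
    images=[Image.open("assets/cat.jpeg")],
    return_tensors="np",
    padding=True,
)
output = model(
    input_ids=mx.array(hf_inputs["input_ids"]),
    # transformers returns NCHW images; transpose to NHWC if the MLX model
    # expects channels last (an assumption here).
    pixel_values=mx.array(hf_inputs["pixel_values"]).transpose(0, 2, 3, 1),
)
```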
MLX CLIP has been tested and works with the following Hugging Face repos:

- `openai/clip-vit-base-patch32` (the default model above)
You can run the tests with:

```shell
python test.py
```

To test new models, update the `MLX_PATH` and `HF_PATH` in `test.py`.
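Illustratively, with the defaults used earlier in this README, those constants point at the converted model directory and the matching Hugging Face repo (the exact lines in `test.py` may differ):

```python
# In test.py: the converted MLX model and the matching Hugging Face repo.
MLX_PATH = "mlx_model"
HF_PATH = "openai/clip-vit-base-patch32"
```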
## Attribution

- `assets/cat.jpeg` is a "Cat" by London's, licensed under CC BY-SA 2.0.
- `assets/dog.jpeg` is a "Happy Dog" by tedmurphy, licensed under CC BY 2.0.
[^1]: Refer to the original paper, Learning Transferable Visual Models From Natural Language Supervision, or the accompanying blog post.