# CLIP

An example of OpenAI's CLIP in MLX. The CLIP (contrastive language-image pre-training) model embeds images and text in the same space.[^1]
## Setup

Install the dependencies:

```shell
pip install -r requirements.txt
```
Next, download a CLIP model from Hugging Face and convert it to MLX. The default model is `openai/clip-vit-base-patch32`:

```shell
python convert.py
```

By default, the script downloads the model and configuration files to the `mlx_model/` directory.
## Run

You can use the CLIP model to embed images and text:

```python
from PIL import Image
import clip

model, tokenizer, img_processor = clip.load("mlx_model")
inputs = {
    "input_ids": tokenizer(["a photo of a cat", "a photo of a dog"]),
    "pixel_values": img_processor(
        [Image.open("assets/cat.jpeg"), Image.open("assets/dog.jpeg")]
    ),
}
output = model(**inputs)

# Get text and image embeddings:
text_embeds = output.text_embeds
image_embeds = output.image_embeds
```
Run the above example with `python clip.py`.
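Because the two sets of embeddings live in the same space, they can be compared directly. Below is a minimal sketch of scoring each prompt against each image with cosine similarity, continuing from the snippet above (it assumes the embeddings are ordinary MLX arrays, which is how they are used here):

```python
import mlx.core as mx

# Normalize, then take dot products: rows are prompts, columns are images.
text_embeds = text_embeds / mx.linalg.norm(text_embeds, axis=-1, keepdims=True)
image_embeds = image_embeds / mx.linalg.norm(image_embeds, axis=-1, keepdims=True)
similarity = text_embeds @ image_embeds.T
print(similarity)  # a higher score means a better text/image match
```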
To embed only the text or only the images, pass just the `input_ids` or just the `pixel_values`, respectively.
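For example, a text-only call could look like the following, assuming the model accepts the same keyword arguments shown above and simply skips the missing modality:

```python
# Embed text only: no pixel_values are passed.
text_only = model(input_ids=tokenizer(["a photo of a cat"]))
print(text_only.text_embeds.shape)
```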
This example re-implements minimal image preprocessing and tokenization to reduce dependencies. For additional preprocessing functionality, you can use `transformers`. The file `hf_preproc.py` has an example.
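A rough sketch of that approach is below. It assumes the standard `transformers` `CLIPProcessor` API and that the MLX model expects channels-last `pixel_values`, so the transpose may not match this script exactly; see `hf_preproc.py` for the actual supported usage:

```python
from PIL import Image
import mlx.core as mx
from transformers import CLIPProcessor

import clip

model, *_ = clip.load("mlx_model")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

hf_inputs = processor(
    text=["a photo of a cat"],
    images=[Image.open("assets/cat.jpeg")],
    return_tensors="np",
    padding=True,
)
output = model(
    input_ids=mx.array(hf_inputs["input_ids"]),
    # transformers returns NCHW images; transpose to NHWC if the MLX model
    # expects channels last (an assumption here).
    pixel_values=mx.array(hf_inputs["pixel_values"]).transpose(0, 2, 3, 1),
)
```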
MLX CLIP has been tested and works with the following Hugging Face repos:

- `openai/clip-vit-base-patch32` (the default model above)
You can run the tests with:

```shell
python test.py
```

To test new models, update the `MLX_PATH` and `HF_PATH` in `test.py`.
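Illustratively, with the defaults used earlier in this README, those constants point at the converted model directory and the matching Hugging Face repo (the exact lines in `test.py` may differ):

```python
# In test.py: the converted MLX model and the matching Hugging Face repo.
MLX_PATH = "mlx_model"
HF_PATH = "openai/clip-vit-base-patch32"
```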
## Attribution

- `assets/cat.jpeg` is a "Cat" by London's, licensed under CC BY-SA 2.0.
- `assets/dog.jpeg` is a "Happy Dog" by tedmurphy, licensed under CC BY 2.0.
[^1]: Refer to the original paper, Learning Transferable Visual Models From Natural Language Supervision, or the accompanying blog post.