CLIP (ViT) (#315)

* probably approximatelly correct CLIPTextEncoder * implemented CLIPEncoderLayer as built-in nn.TransformerEncoderLayer * replaced embedding layer with simple matrix * implemented ViT * added ViT tests * fixed tests * added pooler_output for text * implemented complete CLIPModel * implemented init * implemented convert.py and from_pretrained * fixed some minor bugs and added the README.md * removed tokenizer unused comments * removed unused deps * updated ACKNOWLEDGEMENTS.md * Feat: Image Processor for CLIP (#1) @nkasmanoff: * clip image processor * added example usage * refactored image preprocessing * deleted unused image_config.py * removed preprocessing port * added dependency to mlx-data * fixed attribution and moved photos to assets * implemented a simple port of CLIPImageProcessor * review changes * PR review changes * renamed too verbose arg * updated README.md * nits in readme / conversion * simplify some stuff, remove unneeded inits * remove more init stuff * more simplify * make test a unit test * update main readme * readme nits --------- Co-authored-by: Noah Kasmanoff <nkasmanoff@gmail.com> Co-authored-by: Awni Hannun <awni@apple.com>
2025-12-11 23:19:06 +08:00 · 2024-01-31 23:19:53 +01:00
parent ba3a9355d1
commit 94358219cf
14 changed files with 890 additions and 0 deletions
--- a/ACKNOWLEDGMENTS.md
+++ b/ACKNOWLEDGMENTS.md
@@ -10,3 +10,4 @@ MLX Examples was developed with contributions from the following individuals:
 - Juarez Bochi: Added support for T5 models.
 - Sarthak Yadav: Added the `cifar` and `speechcommands` examples.
 - Shunta Saito: Added support for PLaMo models.
 - Gabrijel Boduljak: Implemented `CLIP`.
--- a/README.md
+++ b/README.md
@@ -26,9 +26,15 @@ Some more useful examples are listed below.
 - Speech recognition with [OpenAI's Whisper](whisper).
 ### Multimodal models
 - Joint text and image embeddings with [CLIP](clip).
 ### Other Models 
 - Semi-supervised learning on graph-structured data with [GCN](gcn).
 - Real NVP [normalizing flow](normalizing_flow) for density estimation and
  sampling.
 ### Hugging Face
--- a/clip/.gitignore
+++ b/clip/.gitignore
@@ -0,0 +1 @@
 mlx_model/
--- a/clip/README.md
+++ b/clip/README.md
@@ -0,0 +1,76 @@
 # CLIP
 An example of OpenAI's CLIP in MLX. The CLIP (contrastive language-image
 pre-training) model embeds images and text in the same space.[^1]
 ### Setup
 Install the dependencies:
 ```shell
 pip install -r requirements.txt
 ```
 Next, download a CLIP model from Hugging Face and convert it to MLX. The
 default model is
 [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32).
 ```
 python convert.py
 ```
 The script will by default download the model and configuration files to the
 directory ``mlx_model/``.
 ### Run
 You can use the CLIP model to embed images and text. 
 ```python
 from PIL import Image
 import clip
 model, tokenizer, img_processor = clip.load("mlx_model")
 inputs = {
    "input_ids": tokenizer(["a photo of a cat", "a photo of a dog"]),
    "pixel_values": img_processor(
        [Image.open("assets/cat.jpeg"), Image.open("assets/dog.jpeg")]
    ),
 }
 output = model(**inputs)
 # Get text and image embeddings:
 text_embeds = output.text_embeds
 image_embeds = output.image_embeds
 ```
 Run the above example with `python clip.py`.
 To embed only images or only the text, pass only the ``input_ids`` or
 ``pixel_values``, respectively.
 This example re-implements minimal image preprocessing and tokenization to reduce
 dependencies. For additional preprocessing functionality, you can use
 ``transformers``. The file `hf_preproc.py` has an example.
 MLX CLIP has been tested and works with the following Hugging Face repos:
 - [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32)
 - [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)
 You can run the tests with:
 ```shell
 python test.py
 ```
 To test new models, update the `MLX_PATH` and `HF_PATH` in `test.py`.
 ### Attribution
 - `assets/cat.jpeg` is a "Cat" by London's, licensed under CC BY-SA 2.0.
 - `assets/dog.jpeg` is a "Happy Dog" by tedmurphy, licensed under CC BY 2.0.
 [^1]: Refer to the original paper [Learning Transferable Visual Models From
  Natural Language Supervision ](https://arxiv.org/abs/2103.00020) or [blog
  post](https://openai.com/research/clip)
--- a/clip/assets/cat.jpeg
+++ b/clip/assets/cat.jpeg
--- a/clip/assets/dog.jpeg
+++ b/clip/assets/dog.jpeg
--- a/clip/clip.py
+++ b/clip/clip.py
@@ -0,0 +1,31 @@
 from typing import Tuple
 from image_processor import CLIPImageProcessor
 from model import CLIPModel
 from tokenizer import CLIPTokenizer
 def load(model_dir: str) -> Tuple[CLIPModel, CLIPTokenizer, CLIPImageProcessor]:
    model = CLIPModel.from_pretrained(model_dir)
    tokenizer = CLIPTokenizer.from_pretrained(model_dir)
    img_processor = CLIPImageProcessor.from_pretrained(model_dir)
    return model, tokenizer, img_processor
 if __name__ == "__main__":
    from PIL import Image
    model, tokenizer, img_processor = load("mlx_model")
    inputs = {
        "input_ids": tokenizer(["a photo of a cat", "a photo of a dog"]),
        "pixel_values": img_processor(
            [Image.open("assets/cat.jpeg"), Image.open("assets/dog.jpeg")]
        ),
    }
    output = model(**inputs)
    # Get text and image embeddings:
    text_embeds = output.text_embeds
    image_embeds = output.image_embeds
    print("Text embeddings shape:", text_embeds.shape)
    print("Image embeddings shape:", image_embeds.shape)
--- a/clip/convert.py
+++ b/clip/convert.py
@@ -0,0 +1,107 @@
 # Copyright © 2023-2024 Apple Inc.
 import argparse
 import shutil
 from pathlib import Path
 from typing import Tuple
 import mlx.core as mx
 import torch
 from huggingface_hub import snapshot_download
 def get_model_path(path_or_hf_repo: str) -> Path:
    model_path = Path(path_or_hf_repo)
    if not model_path.exists():
        model_path = Path(
            snapshot_download(
                repo_id=path_or_hf_repo,
                allow_patterns=[
                    "*.bin",
                    "*.json",
                    "*.txt",
                ],
            )
        )
    return model_path
 def torch_to_mx(a: torch.Tensor, *, dtype: str) -> mx.array:
    # bfloat16 is not numpy convertible. Upcast to float32 to avoid precision loss
    a = a.to(torch.float32) if dtype == "bfloat16" else a.to(getattr(torch, dtype))
    return mx.array(a.numpy(), getattr(mx, dtype))
 def map_weights(key: str, value: torch.Tensor) -> Tuple[str, mx.array]:
    key = key.replace("embeddings.", "")
    key = key.replace("encoder.", "")
    key = key.replace("position_embedding.weight", "position_embedding")
    # Map attention layers
    if "self_attn." in key:
        key = key.replace("self_attn.", "attention.")
    if "q_proj." in key:
        key = key.replace("q_proj.", "query_proj.")
    if "k_proj." in key:
        key = key.replace("k_proj.", "key_proj.")
    if "v_proj." in key:
        key = key.replace("v_proj.", "value_proj.")
    if "layer_norm1." in key:
        key = key.replace("layer_norm1.", "ln1.")
    if "layer_norm2." in key:
        key = key.replace("layer_norm2.", "ln2.")
    # Map ffn layers
    if "mlp.fc1" in key:
        key = key.replace("mlp.fc1", "linear1")
    if "mlp.fc2" in key:
        key = key.replace("mlp.fc2", "linear2")
    # Fix layernorm typo
    if "pre_layrnorm" in key:
        # Fix typo in weights :)
        key = key.replace("pre_layrnorm", "pre_layernorm")
    if "patch_embedding.weight" in key:
        # Initially, value: [out_channels, in_channels, kH, KW].
        # We want [out_channels, kH, KW, in_channels]
        value = value.permute(0, 2, 3, 1)
    return (key, torch_to_mx(value, dtype=str(value.dtype).replace("torch.", "")))
 def should_keep_weight(key: str):
    return not ("position_ids" in key)
 if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Download and Convert (OpenAI) CLIP weights to MLX"
    )
    parser.add_argument(
        "--hf-repo",
        type=str,
        default="openai/clip-vit-base-patch32",
        help="Hugging Face repository name.",
    )
    parser.add_argument(
        "--mlx-path",
        type=str,
        default="mlx_model",
        help="Path to save the MLX model.",
    )
    args = parser.parse_args()
    torch_path = get_model_path(args.hf_repo)
    mlx_path = Path(args.mlx_path)
    mlx_path.mkdir(parents=True, exist_ok=True)
    print("[INFO] Loading")
    torch_weights = torch.load(torch_path / "pytorch_model.bin")
    print("[INFO] Converting")
    mlx_weights = dict(map_weights(k, v) for (k, v) in torch_weights.items())
    mlx_weights = {k: v for (k, v) in mlx_weights.items() if should_keep_weight(k)}
    print("[INFO] Saving")
    mx.savez(str(mlx_path / "weights.npz"), **mlx_weights)
    for fn in ["config.json", "merges.txt", "vocab.json", "preprocessor_config.json"]:
        shutil.copyfile(
            str(torch_path / f"{fn}"),
            str(mlx_path / f"{fn}"),
        )
--- a/clip/hf_preproc.py
+++ b/clip/hf_preproc.py
@@ -0,0 +1,29 @@
 import mlx.core as mx
 import transformers
 from PIL import Image
 import clip
 hf_model = "openai/clip-vit-base-patch32"
 mlx_model = "mlx_model"
 model, *_ = clip.load(mlx_model)
 processor = transformers.CLIPProcessor.from_pretrained(hf_model)
 inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=[Image.open("assets/cat.jpeg"), Image.open("assets/dog.jpeg")],
    return_tensors="np",
 )
 out = model(
    input_ids=mx.array(inputs.input_ids),
    pixel_values=mx.array(inputs.pixel_values).transpose((0, 2, 3, 1)),
    return_loss=True,
 )
 print("text embeddings:")
 print(out.text_embeds)
 print("image embeddings:")
 print(out.image_embeds)
 print(f"CLIP loss: {out.loss.item():.3f}")
--- a/clip/image_processor.py
+++ b/clip/image_processor.py
@@ -0,0 +1,93 @@
 # Copyright © 2023-2024 Apple Inc.
 import json
 from pathlib import Path
 from typing import List, Tuple
 import mlx.core as mx
 import numpy as np
 from PIL.Image import Image
 class CLIPImageProcessor:
    """
    A simple port of
    https://github.com/huggingface/transformers/blob/main/src/transformers/models/clip/image_processing_clip.py.
    """
    def __init__(
        self,
        crop_size: int = 224,
        do_center_crop: bool = True,
        do_normalize: bool = True,
        do_resize: bool = True,
        image_mean: List[float] = [0.48145466, 0.4578275, 0.40821073],
        image_std: List[float] = [0.26862954, 0.26130258, 0.27577711],
        size: int = 224,
        **kwargs
    ) -> None:
        self.crop_size = crop_size
        self.do_center_crop = do_center_crop
        self.do_normalize = do_normalize
        self.do_resize = do_resize
        self.image_mean = mx.array(image_mean)
        self.image_std = mx.array(image_std)
        self.size = size
    def __call__(self, images: List[Image]) -> mx.array:
        return mx.concatenate(
            [self._preprocess(image)[None] for image in images], axis=0
        )
    def _preprocess(self, image: Image) -> mx.array:
        if self.do_resize:
            image = resize(image, self.size)
        if self.do_center_crop:
            image = center_crop(image, (self.crop_size, self.crop_size))
        image = mx.array(np.array(image))
        image = rescale(image)
        if self.do_normalize:
            image = normalize(image, self.image_mean, self.image_std)
        return image
    @staticmethod
    def from_pretrained(path: str):
        path = Path(path)
        with open(path / "preprocessor_config.json", encoding="utf-8") as f:
            config = json.load(f)
        return CLIPImageProcessor(**config)
 def resize(image: Image, short_size: int) -> Image:
    """
    Resize so small size to short_size
    """
    width, height = image.size
    short = min(width, height)
    long = max(width, height)
    if short == short_size:
        return image
    new_short = short_size
    new_long = int(short_size * long / short)
    new_size = (new_short, new_long) if width <= height else (new_long, new_short)
    return image.resize(new_size)
 def center_crop(image: Image, size: Tuple[int, int]) -> Image:
    if size[0] % 2 != 0 or size[1] % 2 != 0:
        raise ValueError("Only even crop sizes supported.")
    original_width, original_height = image.size
    crop_height, crop_width = size
    top = (original_height - crop_height) // 2
    bottom = top + crop_height
    left = (original_width - crop_width) // 2
    right = left + crop_width
    return image.crop((left, top, right, bottom))
 def rescale(image: mx.array) -> mx.array:
    return image.astype(mx.float32) * (1 / 255.0)
 def normalize(image: mx.array, mean: mx.array, std: mx.array) -> mx.array:
    return (image - mean) / std
--- a/clip/model.py
+++ b/clip/model.py
@@ -0,0 +1,283 @@
 # Copyright © 2023-2024 Apple Inc.
 import json
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Any, Optional
 import mlx.core as mx
 import mlx.nn as nn
 from mlx.core import linalg as LA
 from mlx.nn.losses import cross_entropy
 from mlx.utils import tree_flatten
@dataclass
 class CLIPVisionOutput:
    pooler_output: mx.array
    last_hidden_state: mx.array
@dataclass
 class CLIPTextOutput:
    pooler_output: mx.array
    last_hidden_state: mx.array
@dataclass
 class CLIPModelOutput:
    loss: Optional[mx.array]
    text_embeds: Optional[mx.array]
    image_embeds: Optional[mx.array]
    text_model_output: CLIPTextOutput
    vision_model_output: CLIPVisionOutput
@dataclass
 class CLIPTextConfig:
    num_hidden_layers: int
    hidden_size: int
    intermediate_size: int
    num_attention_heads: int
    max_position_embeddings: int
    vocab_size: int
@dataclass
 class CLIPVisionConfig:
    num_hidden_layers: int
    hidden_size: int
    intermediate_size: int
    num_attention_heads: int
    num_channels: int
    image_size: int
    patch_size: int
@dataclass
 class CLIPConfig:
    text_config: CLIPTextConfig
    vision_config: CLIPVisionConfig
    projection_dim: int
 def quick_gelu(x: mx.array) -> mx.array:
    """
    A fast GELU approximation https://github.com/hendrycks/GELUs
    """
    return x * mx.sigmoid(1.702 * x)
 def clip_loss(logits: mx.array) -> mx.array:
    N, M = logits.shape
    caption_loss = cross_entropy(logits, mx.arange(N), reduction="mean")
    image_loss = cross_entropy(logits.T, mx.arange(M), reduction="mean")
    return (caption_loss + image_loss) / 2.0
 class CLIPEncoderLayer(nn.TransformerEncoderLayer):
    """The transformer encoder layer from CLIP."""
    def __init__(self, hidden_dim: int, intermediate_dim: int, num_heads: int):
        super().__init__(
            dims=hidden_dim,
            mlp_dims=intermediate_dim,
            num_heads=num_heads,
            activation=quick_gelu,
            norm_first=True,
        )
        # Add biases to the attention projections
        self.attention = nn.MultiHeadAttention(hidden_dim, num_heads, bias=True)
 class CLIPTextModel(nn.Module):
    """Implements the text encoder transformer from CLIP."""
    def __init__(self, config: CLIPTextConfig):
        super().__init__()
        self.token_embedding = nn.Embedding(config.vocab_size, config.hidden_size)
        self.position_embedding = mx.zeros(
            (config.max_position_embeddings, config.hidden_size)
        )
        self.layers = [
            CLIPEncoderLayer(
                config.hidden_size, config.intermediate_size, config.num_attention_heads
            )
            for _ in range(config.num_hidden_layers)
        ]
        self.final_layer_norm = nn.LayerNorm(config.hidden_size)
    def _embed(self, x: mx.array) -> mx.array:
        embeddings = self.token_embedding(x)
        embeddings += self.position_embedding[: x.shape[1]]
        return embeddings
    def __call__(self, x: mx.array) -> CLIPTextOutput:
        B, N = x.shape
        eot_tokens = mx.argmax(x, axis=-1)
        x = self._embed(x)
        mask = nn.MultiHeadAttention.create_additive_causal_mask(N, x.dtype)
        for l in self.layers:
            x = l(x, mask)
        last_hidden_state = self.final_layer_norm(x)
        pooler_output = last_hidden_state[mx.arange(B), eot_tokens]
        return CLIPTextOutput(
            pooler_output=pooler_output, last_hidden_state=last_hidden_state
        )
 class CLIPVisionModel(nn.Module):
    """Implements the vision encoder transformer from CLIP."""
    def __init__(self, config: CLIPVisionConfig):
        super().__init__()
        self.class_embedding = mx.zeros((config.hidden_size,))
        self.patch_embedding = nn.Conv2d(
            in_channels=config.num_channels,
            out_channels=config.hidden_size,
            kernel_size=config.patch_size,
            stride=config.patch_size,
            bias=False,
        )
        num_patches = (config.image_size // config.patch_size) ** 2
        num_positions = num_patches + 1
        self.position_embedding = mx.zeros((num_positions, config.hidden_size))
        self.pre_layernorm = nn.LayerNorm(config.hidden_size)
        self.layers = [
            CLIPEncoderLayer(
                config.hidden_size, config.intermediate_size, config.num_attention_heads
            )
            for _ in range(config.num_hidden_layers)
        ]
        self.post_layernorm = nn.LayerNorm(config.hidden_size)
    def _embed(self, x: mx.array) -> mx.array:
        batch_size = x.shape[0]
        # Patchify using conv:
        # [batch_size, sqrt(num_patches), sqrt(num_patches), embed_dim]
        patch_embeddings = self.patch_embedding(x)
        # [batch_size, num_patches, embed_dim]
        patch_embeddings = mx.flatten(patch_embeddings, start_axis=1, end_axis=2)
        embed_dim = patch_embeddings.shape[-1]
        # Prepend <CLS> embeddings
        # [batch_size, 1, embed_dim]
        cls_embeddings = mx.broadcast_to(
            self.class_embedding, (batch_size, 1, embed_dim)
        )
        # [batch_size, num_patches + 1, embed_dim]
        embeddings = mx.concatenate((cls_embeddings, patch_embeddings), axis=1)
        # Add positional encoding
        embeddings += self.position_embedding
        return embeddings
    def __call__(self, x: mx.array) -> CLIPVisionOutput:
        x = self._embed(x)
        x = self.pre_layernorm(x)
        for l in self.layers:
            x = l(x, mask=None)
        # Extract <CLS> token embedding
        pooler_output = self.post_layernorm(x[:, 0, :])
        return CLIPVisionOutput(pooler_output=pooler_output, last_hidden_state=x)
 class CLIPModel(nn.Module):
    def __init__(self, config: CLIPConfig):
        self.text_model = CLIPTextModel(config.text_config)
        self.vision_model = CLIPVisionModel(config.vision_config)
        text_embed_dim = config.text_config.hidden_size
        vision_embed_dim = config.vision_config.hidden_size
        projection_dim = config.projection_dim
        self.visual_projection = nn.Linear(vision_embed_dim, projection_dim, bias=False)
        self.text_projection = nn.Linear(text_embed_dim, projection_dim, bias=False)
        self.logit_scale = mx.array(0.0)
    def get_text_features(self, x: mx.array) -> mx.array:
        return self.text_projection(self.text_model(x).pooler_output)
    def get_image_features(self, x: mx.array) -> mx.array:
        return self.visual_projection(self.vision_model(x).pooler_output)
    def __call__(
        self,
        input_ids: Optional[mx.array] = None,
        pixel_values: Optional[mx.array] = None,
        return_loss=False,
    ) -> CLIPModelOutput:
        if input_ids is not None:
            text_model_output = self.text_model(input_ids)
            text_embeds = self.text_projection(text_model_output.pooler_output)
            text_embeds = text_embeds / LA.norm(text_embeds, axis=-1, keepdims=True)
        else:
            text_embeds = None
            text_model_output = None
        if pixel_values is not None:
            vision_model_output = self.vision_model(pixel_values)
            image_embeds = self.visual_projection(vision_model_output.pooler_output)
            image_embeds = image_embeds / LA.norm(image_embeds, axis=-1, keepdims=True)
        else:
            image_embeds = None
            vision_model_output = None
        if return_loss and (input_ids is None or pixel_values is None):
            raise ValueError("Must provide text and image inputs to compute loss.")
        if return_loss:
            logit_scale = mx.exp(self.logit_scale)
            logits = (text_embeds @ image_embeds.T) * logit_scale
            loss = clip_loss(logits)
        else:
            loss = None
        return CLIPModelOutput(
            loss=loss,
            text_embeds=text_embeds,
            image_embeds=image_embeds,
            vision_model_output=vision_model_output,
            text_model_output=text_model_output,
        )
    @staticmethod
    def from_pretrained(path: str):
        path = Path(path)
        with open(path / "config.json", "r") as fid:
            config = json.load(fid)
        text_config = config["text_config"]
        text_config = CLIPTextConfig(
            num_hidden_layers=text_config["num_hidden_layers"],
            hidden_size=text_config["hidden_size"],
            intermediate_size=text_config["intermediate_size"],
            num_attention_heads=text_config["num_attention_heads"],
            max_position_embeddings=text_config["max_position_embeddings"],
            vocab_size=text_config["vocab_size"],
        )
        vision_config = config["vision_config"]
        vision_config = CLIPVisionConfig(
            num_hidden_layers=vision_config["num_hidden_layers"],
            hidden_size=vision_config["hidden_size"],
            intermediate_size=vision_config["intermediate_size"],
            num_attention_heads=vision_config["num_attention_heads"],
            num_channels=3,
            image_size=vision_config["image_size"],
            patch_size=vision_config["patch_size"],
        )
        config = CLIPConfig(
            text_config=text_config,
            vision_config=vision_config,
            projection_dim=config["projection_dim"],
        )
        model = CLIPModel(config)
        model.load_weights(str(path / "weights.npz"))
        return model
--- a/clip/requirements.txt
+++ b/clip/requirements.txt
@@ -0,0 +1,6 @@
 mlx
 numpy
 transformers
 torch
 huggingface_hub
 Pillow
--- a/clip/test.py
+++ b/clip/test.py
@@ -0,0 +1,136 @@
 import unittest
 import mlx.core as mx
 import model
 import numpy as np
 import torch
 import transformers
 from image_processor import CLIPImageProcessor
 from PIL import Image
 from tokenizer import CLIPTokenizer
 from transformers import AutoTokenizer
 from transformers.image_processing_utils import ChannelDimension
 MLX_PATH = "mlx_model"
 HF_PATH = "openai/clip-vit-base-patch32"
 def load_mlx_models(path):
    image_proc = CLIPImageProcessor.from_pretrained(path)
    tokenizer = CLIPTokenizer.from_pretrained(path)
    clip = model.CLIPModel.from_pretrained(path)
    return image_proc, tokenizer, clip
 def load_hf_models(path):
    image_proc = transformers.CLIPImageProcessor.from_pretrained(path)
    tokenizer = AutoTokenizer.from_pretrained(path)
    clip = transformers.CLIPModel.from_pretrained(path)
    return image_proc, tokenizer, clip
 class TestCLIP(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.mx_image_proc, cls.mx_tokenizer, cls.mx_clip = load_mlx_models(MLX_PATH)
        cls.hf_image_proc, cls.hf_tokenizer, cls.hf_clip = load_hf_models(HF_PATH)
    def test_image_processor(self):
        image = Image.open("assets/cat.jpeg")
        mx_data = self.mx_image_proc([image])
        hf_data = mx.array(
            np.array(
                self.hf_image_proc([image], data_format=ChannelDimension.LAST)[
                    "pixel_values"
                ]
            )
        )
        self.assertTrue(mx.allclose(mx_data, hf_data, atol=1e-5))
    def test_text_tokenizer(self):
        texts = ["a photo of a cat", "a photo of a dog"]
        for txt in texts:
            self.assertTrue(
                np.array_equal(
                    self.mx_tokenizer.tokenize(txt)[None, :],
                    self.hf_tokenizer(txt, return_tensors="np")["input_ids"],
                ),
            )
    def test_text_encoder(self):
        texts = ["a photo of a cat", "a photo of a dog"]
        # Tokenize
        hf_tokens = self.hf_tokenizer(texts, return_tensors="pt")
        mx_tokens = self.mx_tokenizer(texts)
        # Get expected
        with torch.inference_mode():
            expected_out = self.hf_clip.text_model(**hf_tokens)
            expected_last_hidden = expected_out.last_hidden_state.numpy()
            expected_pooler_output = expected_out.pooler_output.numpy()
        out = self.mx_clip.text_model(mx_tokens)
        self.assertTrue(
            np.allclose(out.last_hidden_state, expected_last_hidden, atol=1e-5)
        )
        self.assertTrue(
            np.allclose(out.pooler_output, expected_pooler_output, atol=1e-5)
        )
    def test_vision_encoder(self):
        # Load and process test image
        x = self.hf_image_proc(
            images=[Image.open("assets/dog.jpeg")], return_tensors="np"
        ).pixel_values
        # Infer with HuggingFace model
        with torch.inference_mode():
            # Get expected
            x_tc = torch.tensor(x)
            expected_out = self.hf_clip.vision_model(x_tc)
            expected_last_hidden = expected_out.last_hidden_state.numpy()
            expected_pooler_output = expected_out.pooler_output.numpy()
        # Test MLX vision encoder
        out = self.mx_clip.vision_model(mx.array(x.transpose(0, 2, 3, 1)))
        self.assertTrue(
            np.allclose(
                out.last_hidden_state, expected_last_hidden, rtol=1e-4, atol=1e-3
            ),
        )
        self.assertTrue(
            np.allclose(
                out.pooler_output, expected_pooler_output, rtol=1e-4, atol=1e-3
            ),
        )
    def test_clip_model(self):
        image_input = self.hf_image_proc(
            images=[Image.open("assets/cat.jpeg"), Image.open("assets/dog.jpeg")],
            return_tensors="np",
        )["pixel_values"]
        text = ["a photo of a cat", "a photo of a dog"]
        tokens = self.hf_tokenizer(text, return_tensors="np")["input_ids"]
        with torch.inference_mode():
            expected_out = self.hf_clip(
                input_ids=torch.tensor(tokens),
                pixel_values=torch.tensor(image_input),
                return_loss=True,
            )
        out = self.mx_clip(
            input_ids=mx.array(tokens),
            pixel_values=mx.array(image_input.transpose((0, 2, 3, 1))),
            return_loss=True,
        )
        self.assertTrue(
            np.allclose(out.text_embeds, expected_out.text_embeds, atol=1e-5)
        )
        self.assertTrue(
            np.allclose(out.image_embeds, expected_out.image_embeds, atol=1e-5)
        )
        self.assertTrue(np.allclose(out.loss, expected_out.loss, atol=1e-5))
 if __name__ == "__main__":
    unittest.main()
--- a/clip/tokenizer.py
+++ b/clip/tokenizer.py
@@ -0,0 +1,121 @@
 # Copyright © 2023-2024 Apple Inc.
 import json
 from pathlib import Path
 from typing import Any
 import mlx.core as mx
 import regex
 class CLIPTokenizer:
    """A simple port of CLIPTokenizer from https://github.com/huggingface/transformers/ ."""
    def __init__(self, bpe_ranks, vocab):
        self.bpe_ranks = bpe_ranks
        self.vocab = vocab
        self.pat = regex.compile(
            r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""",
            regex.IGNORECASE,
        )
        self._cache = {self.bos: self.bos, self.eos: self.eos}
    @property
    def bos(self):
        return "<|startoftext|>"
    @property
    def bos_token(self):
        return self.vocab[self.bos]
    @property
    def eos(self):
        return "<|endoftext|>"
    @property
    def eos_token(self):
        return self.vocab[self.eos]
    def bpe(self, text):
        if text in self._cache:
            return self._cache[text]
        unigrams = list(text[:-1]) + [text[-1] + "</w>"]
        unique_bigrams = set(zip(unigrams, unigrams[1:]))
        if not unique_bigrams:
            return unigrams
        # In every iteration try to merge the two most likely bigrams. If none
        # was merged we are done.
        #
        # Ported from https://github.com/huggingface/transformers/blob/main/src/transformers/models/clip/tokenization_py
        while unique_bigrams:
            bigram = min(
                unique_bigrams, key=lambda pair: self.bpe_ranks.get(pair, float("inf"))
            )
            if bigram not in self.bpe_ranks:
                break
            new_unigrams = []
            skip = False
            for a, b in zip(unigrams, unigrams[1:]):
                if skip:
                    skip = False
                    continue
                if (a, b) == bigram:
                    new_unigrams.append(a + b)
                    skip = True
                else:
                    new_unigrams.append(a)
            if not skip:
                new_unigrams.append(b)
            unigrams = new_unigrams
            unique_bigrams = set(zip(unigrams, unigrams[1:]))
        self._cache[text] = unigrams
        return unigrams
    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        return self.tokenize(*args, **kwargs)
    def tokenize(self, text, prepend_bos=True, append_eos=True) -> mx.array:
        if isinstance(text, list):
            return mx.array([self.tokenize(t, prepend_bos, append_eos) for t in text])
        # Lower case, cleanup, and split. Hugging Face does a much,
        # more thorough job here but this should suffice for 95% of
        # cases.
        clean_text = regex.sub(r"\s+", " ", text.lower())
        tokens = regex.findall(self.pat, clean_text)
        # Split the tokens according to the byte-pair merge file
        bpe_tokens = [ti for t in tokens for ti in self.bpe(t)]
        # Map to token ids and return
        tokens = []
        if prepend_bos:
            tokens.append(self.bos_token)
        tokens.extend(self.vocab[t] for t in bpe_tokens)
        if append_eos:
            tokens.append(self.eos_token)
        return mx.array(tokens)
    @staticmethod
    def from_pretrained(path: str):
        path = Path(path)
        with open(path / "vocab.json", encoding="utf-8") as f:
            vocab = json.load(f)
        with open(path / "merges.txt", encoding="utf-8") as f:
            bpe_merges = f.read().strip().split("\n")[1 : 49152 - 256 - 2 + 1]
        bpe_merges = [tuple(m.split()) for m in bpe_merges]
        bpe_ranks = dict(map(reversed, enumerate(bpe_merges)))
        return CLIPTokenizer(bpe_ranks, vocab)