mlx-examples/esm/notebooks/embeddings.ipynb

2025-08-15 23:48:57 -04:00
{
"cells": [
{
"cell_type": "markdown",
"id": "acfe011d",
"metadata": {},
"source": [
"## Exploring Protein Relationships Through ESM-2 Embeddings\n",
"\n",
"Proteins are molecular machines with unique structures that determine their functions. ESM-2 treats protein sequences as a language, learning representations that capture evolutionary and functional relationships without relying on traditional sequence alignment.\n",
"\n",
"In this notebook, we'll explore how ESM-2 embeddings reveal relationships between six human proteins:\n",
"\n",
"**Oxygen Transport & Storage:**\n",
"- **Hemoglobin Beta**: The oxygen-carrying protein in red blood cells, part of the tetrameric hemoglobin complex\n",
"- **Myoglobin**: The oxygen storage protein in muscle tissue, structurally similar to individual hemoglobin subunits\n",
"\n",
"**Defense & Immunity:**\n",
"- **Lysozyme C**: An antimicrobial enzyme that breaks down bacterial cell walls, found in tears, saliva, and mucus\n",
"- **Defensin Beta 4A**: A small antimicrobial peptide that directly kills bacteria and other pathogens\n",
"\n",
"**Structural Support:**\n",
"- **Alpha-1 Type I Collagen**: The most abundant protein in the human body, providing strength to bones, skin, and connective tissues\n",
"- **Elastin**: The protein that gives tissues their elasticity, crucial for arteries, lungs, and skin"
]
},
{
"cell_type": "markdown",
"id": "20f98f7f",
"metadata": {},
"source": [
"### Setup\n",
"\n",
"Here we import all the necessary libraries."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "2bacd1ff",
"metadata": {},
"outputs": [],
"source": [
"import mlx.core as mx\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.manifold import TSNE\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "5563c495",
"metadata": {},
"source": [
"These are our protein sequences, obtained from [UniProt](https://www.uniprot.org/)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "b8e9d6d2",
"metadata": {},
"outputs": [],
"source": [
"proteins = [\n",
" # Oxygen Transport\n",
" (\"Hemoglobin Beta\", \"MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH\"),\n",
" (\"Myoglobin\", \"MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG\"),\n",
"\n",
" # Antimicrobial Defense\n",
" (\"Lysozyme C\", \"MKALIVLGLVLLSVTVQGKVFERCELARTLKRLGMDGYRGISLANWMCLAKWESGYNTRATNYNAGDRSTDYGIFQINSRYWCNDGKTPGAVNACHLSCSALLQDNIADAVACAKRVVRDPQGIRAWVAWRNRCQNRDVRQYVQGCGV\"),\n",
" (\"Defensin Beta 4A\", \"MRVLYLLFSFLFIFLMPLPGVFGGIGDPVTCLKSGAICHPVFCPRRYKQIGTCGLPGTKCCKKP\"),\n",
"\n",
" # Structural Proteins\n",
" (\"Alpha-1 Type I Collagen\", \"MFSFVDLRLLLLLAATALLTHGQEEGQVEGQDEDIPPITCVQNGLRYHDRDVWKPEPCRICVCDNGKVLCDDVICDETKNCPGAEVPEGECCPVCPDGSESPTDQETTGVEGPKGDTGPRGPRGPAGPPGRDGIPGQPGLPGPPGPPGPPGPPGLGGNFAPQLSYGYDEKSTGGISVPGPMGPSGPRGLPGPPGAPGPQGFQGPPGEPGEPGASGPMGPRGPPGPPGKNGDDGEAGKPGRPGERGPPGPQGARGLPGTAGLPGMKGHRGFSGLDGAKGDAGPAGPKGEPGSPGENGAPGQMGPRGLPGERGRPGAPGPAGARGNDGATGAAGPPGPTGPAGPPGFPGAVGAKGEAGPQGPRGSEGPQGVRGEPGPPGPAGAAGPAGNPGADGQPGAKGANGAPGIAGAPGFPGARGPSGPQGPGGPPGPKGNSGEPGAPGSKGDTGAKGEPGPVGVQGPPGPAGEEGKRGARGEPGPTGLPGPPGERGGPGSRGFPGADGVAGPKGPAGERGSPGPAGPKGSPGEAGRPGEAGLPGAKGLTGSPGSPGPDGKTGPPGPAGQDGRPGPPGPPGARGQAGVMGFPGPKGAAGEPGKAGERGVPGPPGAVGPAGKDGEAGAQGPPGPAGPAGERGEQGPAGSPGFQGLPGPAGPPGEAGKPGEQGVPGDLGAPGPSGARGERGFPGERGVQGPPGPAGPRGANGAPGNDGAKGDAGAPGAPGSQGAPGLQGMPGERGAAGLPGPKGDRGDAGPKGADGSPGKDGVRGLTGPIGPPGPAGAPGDKGESGPSGPAGPTGARGAPGDRGEPGPPGPAGFAGPPGADGQPGAKGEPGDAGAKGDAGPPGPAGPAGPPGPIGNVGAPGAKGARGSAGPPGATGFPGAAGRVGPPGPSGNAGPPGPPGPAGKEGGKGPRGETGPAGRPGEVGPPGPPGPAGEKGSPGADGPAGAPGTPGPQGIAGQRGVVGLPGQRGERGFPGLPGPSGEPGKQGPSGASGERGPPGPMGPPGLAGPPGESGREGAPGAEGSPGRDGSPGAKGDRGETGPAGPPGAPGAPGAPGPVGPAGKSGDRGETGPAGPAGPVGPVGARGPAGPQGPRGDKGETGEQGDRGIKGHRGFSGLQGPPGPPGSPGEQGPSGASGPAGPRGPPGSAGAPGKDGLNGLPGPIGPPGPRGRTGDAGPVGPPGPPGPPGPPGPPSAGFDFSFLPQPPQEKAHDGGRYYRADDANVVRDRDLEVDTTLKSLSQQIENIRSPEGSRKNPARTCRDLKMCHSDWKSGEYWIDPNQGCNLDAIKVFCNMETGETCVYPTQPSVAQKNWYISKNPKDKRHVWFGESMTDGFQFEYGGQGSDPADVAIQLTFLRLMSTEASQNITYHCKNSVAYMDQQTGNLKKALLLQGSNEIEIRAEGNSRFTYSVTVDGCTSHTGAWGKTVIEYKTTKTSRLPIIDVAPLDVGAPDQEFGFDVGPVCFL\"),\n",
" (\"Elastin\", \"MAGLTAAAPRPGVLLLLLSILHPSRPGGVPGAIPGGVPGGVFYPGAGLGALGGGALGPGGKPLKPVPGGLAGAGLGAGLGAFPAVTFPGALVPGGVADAAAAYKAAKAGAGLGGVPGVGGLGVSAGAVVPQPGAGVKPGKVPGVGLPGVYPGGVLPGARFPGVGVLPGVPTGAGVKPKAPGVGGAFAGIPGVGPFGGPQPGVPLGYPIKAPKLPGGYGLPYTTGKLPYGYGPGGVAGAAGKAGYPTGTGVGPQAAAAAAAKAAAKFGAGAAGVLPGVGGAGVPGVPGAIPGIGGIAGVGTPAAAAAAAAAAKAAKYGAAAGLVPGGPGFGPGVVGVPGAGVPGVGVPGAGIPVVPGAGIPGAAVPGVVSPEAAAKAAAKAAKYGARPGVGVGGIPTYGVGAGGFPGFGVGVGGIPGVAGVPGVGGVPGVGGVPGVGISPEAQAAAAAKAAKYGAAGAGVLGGLVPGAPGAVPGVPGTGGVPGVGTPAAAAAKAAAKAAQFGLVPGVGVAPGVGVAPGVGVAPGVGLAPGVGVAPGVGVAPGVGVAPGIGPGGVAAAAKSAAKVAAKAQLRAAAGLGAGIPGLGVGVGVPGLGVGAGVPGLGVGAGVPGFGAGADEGVRRSLSPELREGDPSSSQHLPSTPSSPRVPGALAAAKAAKYGAAVPGVLGGLGALGGVGIPGGVVGAGPAAAAAAAKAAAKAAQFGLVGAAGLGGLGVGGLGVPGVGGLGGIPPAAAAKAAKYGAAGLGGVLGGAGQFPLGGVAARPGFGLSPIFPGGACLGKACGRKRK\"),\n",
"]"
]
},
{
"cell_type": "markdown",
"id": "c9621578",
"metadata": {},
"source": [
"### Loading the model and tokenizing a sequence\n",
"\n",
"First, load the ESM-2 model. Change the path below to point to your converted checkpoint.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "05696400",
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"sys.path.append(\"..\")\n",
"\n",
"from esm import ESM2\n",
"\n",
"esm_checkpoint = \"../checkpoints/mlx-esm2_t33_650M_UR50D\"\n",
"tokenizer, model = ESM2.from_pretrained(esm_checkpoint)"
]
},
{
"cell_type": "markdown",
"id": "2916adbb",
"metadata": {},
"source": [
"Here, we tokenize the protein sequence for human insulin and then decode the tokens back into text."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "47178dcd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sequence: MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN\n",
"Tokens: array([0, 20, 5, ..., 23, 17, 2], dtype=int32)\n",
"Decoded: <cls>MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN<eos>\n"
]
}
],
"source": [
"human_insulin_sequence = \"MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN\"\n",
"tokens = tokenizer.encode(human_insulin_sequence)\n",
"print(f\"Sequence: {human_insulin_sequence}\")\n",
"print(f\"Tokens: {tokens}\")\n",
"print(f\"Decoded: {tokenizer.decode(tokens)}\")"
]
},
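{
"cell_type": "markdown",
"id": "a7c31e01",
"metadata": {},
"source": [
"The printed IDs are consistent with a simple character-level vocabulary: `0` for `<cls>`, `20` for `M`, `5` for `A`, `23` for `C`, `17` for `N`, and `2` for `<eos>`. As a rough illustration of what `encode` and `decode` do, here is a minimal toy tokenizer. It is a sketch, not the tokenizer shipped with the checkpoint: the names `TOY_VOCAB`, `toy_encode`, and `toy_decode` are made up for this example, and the vocabulary order is an assumption chosen to agree with the IDs shown above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7c31e02",
"metadata": {},
"outputs": [],
"source": [
"# Toy character-level tokenizer (illustrative only, not the repo's implementation).\n",
"# Assumption: special tokens come first, followed by the 20 standard residues in this order.\n",
"TOY_VOCAB = ['<cls>', '<pad>', '<eos>', '<unk>'] + list('LAGVSERTIDPKQNFYMHWC')\n",
"TOY_TOKEN_TO_ID = {token: index for index, token in enumerate(TOY_VOCAB)}\n",
"\n",
"\n",
"def toy_encode(sequence):\n",
"    \"\"\"Wrap a sequence in <cls>/<eos> and map each residue to an integer ID.\"\"\"\n",
"    unknown_id = TOY_TOKEN_TO_ID['<unk>']\n",
"    ids = [TOY_TOKEN_TO_ID['<cls>']]\n",
"    ids += [TOY_TOKEN_TO_ID.get(residue, unknown_id) for residue in sequence]\n",
"    ids.append(TOY_TOKEN_TO_ID['<eos>'])\n",
"    return ids\n",
"\n",
"\n",
"def toy_decode(ids):\n",
"    \"\"\"Map integer IDs back to their token strings and join them.\"\"\"\n",
"    return ''.join(TOY_VOCAB[index] for index in ids)\n",
"\n",
"\n",
"toy_ids = toy_encode('MALWN')\n",
"print(toy_ids)              # [0, 20, 5, 4, 22, 17, 2]\n",
"print(toy_decode(toy_ids))  # <cls>MALWN<eos>"
]
},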
{
"cell_type": "markdown",
"id": "c1b73ded",
"metadata": {},
"source": [
"### Embedding sequences\n",
"\n",
"To compute the embeddings of our proteins, we pass each protein sequence through ESM-2's tokenizer to convert amino acids into token IDs, then extract the final layer representations using `get_sequence_representations()`. This process gives us a vector for each protein that captures its learned functional and evolutionary features."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "cb470957",
"metadata": {},
"outputs": [],
"source": [
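"def extract_embeddings(model, protein_list):\n",
"    \"\"\"Return a (num_proteins, hidden_dim) array of embeddings and the matching names.\"\"\"\n",
"    embeddings = []\n",
"    names = []\n",
"    for name, sequence in protein_list:\n",
"        # Tokenize one sequence and keep a leading batch dimension of size 1.\n",
"        tokens = model.tokenizer.encode(sequence, return_batch_dim=True)\n",
"        # Take the sequence-level representation from the final layer.\n",
"        embedding = model.get_sequence_representations(tokens, layer=-1)\n",
"        embeddings.append(embedding[0])\n",
"        names.append(name)\n",
"    return mx.stack(embeddings), names"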
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "38e83142",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Embedding shape: (6, 1280)\n",
"Each protein represented by 1280 features\n"
]
}
],
"source": [
"embeddings, protein_names = extract_embeddings(model, proteins)\n",
"print(f\"\\nEmbedding shape: {embeddings.shape}\")\n",
"print(f\"Each protein represented by {embeddings.shape[1]} features\")"
]
},
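{
"cell_type": "markdown",
"id": "a7c31e03",
"metadata": {},
"source": [
"The source does not spell out how `get_sequence_representations()` turns per-residue states into a single vector. One common choice for ESM-style models is to average the hidden states over residue positions, leaving out the special tokens. The sketch below shows that idea on a toy array; `toy_mean_pool` and `toy_states` are hypothetical names, and the pooling rule itself is an assumption rather than a description of this repository's implementation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7c31e04",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def toy_mean_pool(token_states):\n",
"    \"\"\"Average per-token vectors, skipping the first (<cls>) and last (<eos>) rows.\"\"\"\n",
"    residue_states = token_states[1:-1]\n",
"    return residue_states.mean(axis=0)\n",
"\n",
"\n",
"# Toy input: 5 tokens (<cls>, 3 residues, <eos>) with 4 features each.\n",
"toy_states = np.arange(20, dtype=np.float32).reshape(5, 4)\n",
"toy_vector = toy_mean_pool(toy_states)\n",
"print(toy_vector.shape)  # (4,)\n",
"print(toy_vector)        # mean of rows 1..3, i.e. [ 8.  9. 10. 11.]"
]
},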
{
"cell_type": "markdown",
"id": "fccd2a99",
"metadata": {},
"source": [
"### Protein embedding similarity matrix\n",
"\n",
"We can measure how similar the protein embeddings are by calculating a similarity matrix. We normalize each embedding to unit length and compute cosine similarities between all pairs. The result is a symmetric matrix in which values close to 1 indicate embeddings that point in nearly the same direction, and values near 0 indicate unrelated ones (cosine similarity can range from -1 to 1)."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "93d14fff",
"metadata": {},
"outputs": [],
"source": [
"def compute_similarity_matrix(embeddings):\n",
"    normalized = embeddings / mx.linalg.norm(embeddings, axis=1, keepdims=True)\n",
"    similarity_matrix = normalized @ normalized.T\n",
"    return similarity_matrix"
]
},
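{
"cell_type": "markdown",
"id": "a7c31e05",
"metadata": {},
"source": [
"To see what the function above computes, the same two steps (normalize the rows, then take all pairwise dot products) can be tried on a few hand-picked vectors in plain NumPy. This is only an illustrative sketch with made-up inputs; `toy_cosine_similarity` and `toy_vectors` are not part of the notebook's pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7c31e06",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def toy_cosine_similarity(vectors):\n",
"    \"\"\"Row-normalize, then compute all pairwise dot products.\"\"\"\n",
"    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)\n",
"    return unit @ unit.T\n",
"\n",
"\n",
"# Rows 0 and 1 are parallel, row 2 is perpendicular to row 0, row 3 is opposite to row 0.\n",
"toy_vectors = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])\n",
"toy_similarity = toy_cosine_similarity(toy_vectors)\n",
"\n",
"# Parallel vectors score 1 regardless of length, perpendicular vectors score 0,\n",
"# and opposite vectors score -1.\n",
"print(np.round(toy_similarity, 3))"
]
},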
{
"cell_type": "code",
"execution_count": 8,
"id": "3485f854",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 800x600 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"similarity_matrix = compute_similarity_matrix(embeddings)\n",
"\n",
"plt.figure(figsize=(8, 6))\n",
"similarity_np = np.array(similarity_matrix)\n",
"\n",
"sim_df = pd.DataFrame(similarity_np, \n",
" index=protein_names, \n",
" columns=protein_names)\n",
"\n",
"sns.heatmap(sim_df, annot=True, cmap='viridis', \n",
" fmt='.3f', square=True, cbar_kws={'label': 'Cosine Similarity'})\n",
"plt.title('Protein Similarity Matrix (ESM-2 Embeddings)', fontsize=14)\n",
"plt.xticks(rotation=45, ha='right')\n",
"plt.yticks(rotation=0)\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "1532dfc5",
"metadata": {},
"source": [
"The similarity matrix shows both expected and unexpected relationships in ESM-2's learned representations. We see clear functional clustering for the oxygen-binding proteins Hemoglobin Beta and Myoglobin, which have a similarity score of 0.986, and for the structural proteins Collagen and Elastin, which have a similarity score of 0.852. Interestingly, Lysozyme C shows high similarity to both Myoglobin (0.906) and Hemoglobin Beta (0.921), even though it is an antimicrobial enzyme. A plausible explanation is that ESM-2 has learned that these three proteins share the fundamental blueprint of compact globular proteins, including similar size, folding patterns, and structural stability, which goes beyond their specific biological functions. We will use PCA and t-SNE to better visualize how these proteins cluster in the high-dimensional embedding space."
]
},
{
"cell_type": "markdown",
"id": "66794597",
"metadata": {},
"source": [
"### PCA visualization\n",
"\n",
"PCA (Principal Component Analysis) reduces high-dimensional data to a lower-dimensional representation by finding the directions of maximum variance in the data. This allows us to visualize our high-dimensional protein embeddings in 2D space while preserving the most important patterns of similarity and difference between proteins."
]
},
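{
"cell_type": "markdown",
"id": "a7c31e07",
"metadata": {},
"source": [
"As a rough sketch of what happens inside PCA, the projection can be reproduced with a singular value decomposition of the mean-centered data. The example uses random toy data rather than the protein embeddings, and `toy_pca` is a hypothetical helper, not scikit-learn's implementation; its output should agree with `sklearn.decomposition.PCA` up to the sign of each axis."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7c31e08",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def toy_pca(data, n_components=2):\n",
"    \"\"\"Project data onto its top principal directions using an SVD.\"\"\"\n",
"    centered = data - data.mean(axis=0)\n",
"    # Rows of vt are the principal directions, ordered by decreasing singular value.\n",
"    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)\n",
"    projected = centered @ vt[:n_components].T\n",
"    # The variance along each direction is proportional to the squared singular value.\n",
"    explained_ratio = singular_values**2 / np.sum(singular_values**2)\n",
"    return projected, explained_ratio[:n_components]\n",
"\n",
"\n",
"toy_rng = np.random.default_rng(0)\n",
"toy_data = toy_rng.normal(size=(6, 10))  # 6 samples, 10 features\n",
"toy_projected, toy_ratio = toy_pca(toy_data)\n",
"print(toy_projected.shape)  # (6, 2)\n",
"print(toy_ratio)            # fraction of variance captured by PC1 and PC2"
]
},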
{
"cell_type": "code",
"execution_count": 9,
"id": "67afd74e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1200x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pca = PCA(n_components=2)\n",
"pca_result = pca.fit_transform(np.array(embeddings))\n",
"\n",
"plt.figure(figsize=(12, 5))\n",
"\n",
"plt.subplot(1, 2, 1)\n",
"plt.scatter(pca_result[:, 0], pca_result[:, 1], s=100, alpha=0.7)\n",
"for i, name in enumerate(protein_names):\n",
"    plt.annotate(name, (pca_result[i, 0], pca_result[i, 1]),\n",
"                 xytext=(5, 5), textcoords='offset points', fontsize=9)\n",
"plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')\n",
"plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')\n",
"plt.title('PCA of Protein Embeddings')\n",
"plt.grid(True, alpha=0.3)"
]
},
{
"cell_type": "markdown",
"id": "4dd24749",
"metadata": {},
"source": [
"The PCA projection reveals clear groupings among the proteins. Hemoglobin Beta and Myoglobin, both oxygen-binding proteins, cluster tightly in the bottom right, while Lysozyme C and Defensin Beta 4A, both antimicrobial proteins, group together in the middle. In contrast, Collagen and Elastin appear isolated and far apart. However, this apparent separation is misleading. Because PCA is a linear dimensionality reduction method that captures only the directions of maximum variance, the high similarity between Collagen and Elastin (0.852 in our matrix) may lie in dimensions that contribute little to the variance represented by the first two principal components."
]
},
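{
"cell_type": "markdown",
"id": "a7c31e09",
"metadata": {},
"source": [
"Because a 2D projection can pull apart points that are close in the full embedding space, it can help to check each protein's nearest neighbor directly in the original space. The following is a minimal sketch on toy vectors; `toy_nearest_neighbours` is a hypothetical helper, and in this notebook one could call it with `np.array(embeddings)` and `protein_names` instead of the toy inputs."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7c31e0a",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def toy_nearest_neighbours(vectors, names):\n",
"    \"\"\"For each row, return the name of the most cosine-similar other row.\"\"\"\n",
"    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)\n",
"    similarity = unit @ unit.T\n",
"    np.fill_diagonal(similarity, -np.inf)  # ignore self-matches\n",
"    best = similarity.argmax(axis=1)\n",
"    return {name: names[index] for name, index in zip(names, best)}\n",
"\n",
"\n",
"toy_names = ['a', 'b', 'c', 'd']\n",
"toy_points = np.array([[1.0, 0.1], [1.0, 0.2], [0.1, 1.0], [0.2, 1.0]])\n",
"print(toy_nearest_neighbours(toy_points, toy_names))\n",
"# {'a': 'b', 'b': 'a', 'c': 'd', 'd': 'c'}"
]
},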
{
"cell_type": "markdown",
"id": "9099082f",
"metadata": {},
"source": [
"### t-SNE visualization\n",
"\n",
"t-SNE (t-distributed Stochastic Neighbor Embedding) is a non-linear dimensionality reduction method designed to preserve local structure, so that points close in the original high-dimensional space tend to remain close in the low-dimensional representation. Unlike PCA, which captures global variance through a linear transformation, t-SNE can uncover intricate clustering patterns and subtle local relationships that linear methods may obscure."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "4f24b218",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1200x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tsne = TSNE(n_components=2, random_state=42, perplexity=3)\n",
"tsne_result = tsne.fit_transform(np.array(embeddings))\n",
"plt.figure(figsize=(12, 5))\n",
"plt.subplot(1, 2, 2)\n",
"plt.scatter(tsne_result[:, 0], tsne_result[:, 1], s=100, alpha=0.7)\n",
"for i, name in enumerate(protein_names):\n",
"    plt.annotate(name, (tsne_result[i, 0], tsne_result[i, 1]),\n",
"                 xytext=(5, 5), textcoords='offset points', fontsize=9)\n",
"plt.xlabel('t-SNE 1')\n",
"plt.ylabel('t-SNE 2')\n",
"plt.title('t-SNE of Protein Embeddings')\n",
"plt.grid(True, alpha=0.3)\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "f312fb11",
"metadata": {},
"source": [
"The t-SNE visualization uncovers more nuanced clustering than PCA, revealing relationships that were previously obscured. Notably, the structural proteins Collagen and Elastin now cluster together in the upper right, reflecting the similarity that the first two principal components failed to capture. The oxygen-binding proteins Hemoglobin Beta and Myoglobin remain grouped, here in the bottom left, consistent with the PCA result. In contrast, the antimicrobial proteins separate, with Lysozyme C positioned centrally and Defensin Beta 4A isolated in the bottom right. This separation may indicate that ESM-2 has learned to distinguish between fundamentally different antimicrobial strategies: Lysozyme, a larger enzyme (148 amino acids) that enzymatically cleaves bacterial cell walls, and Defensin, a small peptide (64 amino acids) that disrupts bacterial membranes via direct interaction. Despite their shared antimicrobial role, ESM-2 appears to encode them as distinct in both architecture and mechanism. With only six points and a perplexity of 3, the exact layout depends on the random seed, so these positions are best read qualitatively."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}