mirror of
https://github.com/ml-explore/mlx-examples.git
synced 2025-08-21 20:46:50 +08:00
335 lines
151 KiB
Plaintext
{
"cells": [
{
"cell_type": "markdown",
"id": "acfe011d",
"metadata": {},
"source": [
"## Exploring Protein Relationships Through ESM-2 Embeddings\n",
"\n",
"Proteins are molecular machines with unique structures that determine their functions. ESM-2 treats protein sequences as a language, learning representations that capture evolutionary and functional relationships without relying on traditional sequence alignment.\n",
"\n",
"In this notebook, we'll explore how ESM-2 embeddings reveal relationships between six human proteins:\n",
"\n",
"**Oxygen Transport & Storage:**\n",
"- **Hemoglobin Beta**: The oxygen-carrying protein in red blood cells, part of the tetrameric hemoglobin complex\n",
"- **Myoglobin**: The oxygen storage protein in muscle tissue, structurally similar to individual hemoglobin subunits\n",
"\n",
"**Antimicrobial Defense:**\n",
"- **Cathelicidin (LL-37)**: An antimicrobial peptide that disrupts bacterial membranes and modulates immune responses\n",
"- **Defensin Beta 4A**: A small cysteine-rich antimicrobial peptide that directly kills bacteria and other pathogens\n",
"\n",
"**Structural Support:**\n",
"- **Erythroid Alpha-Spectrin**: Forms the flexible scaffolding that gives red blood cells their shape and helps them squeeze through tiny blood vessels\n",
"- **Dystrophin**: A massive protein that connects the muscle cell's internal framework to its surroundings, preventing damage during muscle contraction"
]
},
{
"cell_type": "markdown",
"id": "20f98f7f",
"metadata": {},
"source": [
"### Setup\n",
"\n",
"Here we import all necessary libraries."
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "2bacd1ff",
"metadata": {},
"outputs": [],
"source": [
"import mlx.core as mx\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.manifold import TSNE\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "5563c495",
"metadata": {},
"source": [
"These are our protein sequences, obtained from [UniProt](https://www.uniprot.org/)."
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "b8e9d6d2",
"metadata": {},
"outputs": [],
"source": [
"proteins = [\n",
" # Oxygen Transport & Storage\n",
" (\"Hemoglobin Beta\", \"MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH\"),\n",
" (\"Myoglobin\", \"MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG\"),\n",
"\n",
" # Antimicrobial Defense\n",
" (\"Cathelicidin (LL-37)\", \"MKTQRDGHSLGRWSLVLLLLGLVMPLAIIAQVLSYKEAVLRAIDGINQRSSDANLYRLLDLDPRPTMDGDPDTPKPVSFTVKETVCPRTTQQSPEDCDFKKDGLVKRCMGTVTLNQARGSFDISCDKDNKRFALLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES\"),\n",
" (\"Defensin Beta 4A\", \"MRVLYLLFSFLFIFLMPLPGVFGGIGDPVTCLKSGAICHPVFCPRRYKQIGTCGLPGTKCCKKP\"),\n",
"\n",
" # Structural Support\n",
" (\"Erythroid Alpha-Spectrin\", \"MEQFPKETVVESSGPKVLETAEEIQERRQEVLTRYQSFKERVAERGQKLEDSYHLQVFKRDADDLGKWIMEKVNILTDKSYEDPTNIQGKYQKHQSLEAEVQTKSRLMSELEKTREERFTMGHSAHEETKAHIEELRHLWDLLLELTLEKGDQLLRALKFQQYVQECADILEWIGDKEAIATSVELGEDWERTEVLHKKFEDFQVELVAKEGRVVEVNQYANECAEENHPDLPLIQSKQNEVNAAWERLRGLALQRQKALSNAANLQRFKRDVTEAIQWIKEKEPVLTSEDYGKDLVASEGLFHSHKGLERNLAVMSDKVKELCAKAEKLTLSHPSDAPQIQEMKEDLVSSWEHIRALATSRYEKLQATYWYHRFSSDFDELSGWMNEKTAAINADELPTDVAGGEVLLDRHQQHKHEIDSYDDRFQSADETGQDLVNANHEASDEVREKMEILDNNWTALLELWDERHRQYEQCLDFHLFYRDSEQVDSWMSRQEAFLENEDLGNSLGSAEALLQKHEDFEEAFTAQEEKIITVDKTATKLIGDDHYDSENIKAIRDGLLARRDALREKAATRRRLLKESLLLQKLYEDSDDLKNWINKKKKLADDEDYKDIQNLKSRVQKQQVFEKELAVNKTQLENIQKTGQEMIEGGHYASDNVTTRLSEVASLWEELLEATKQKGTQLHEANQQLQFENNAEDLQRWLEDVEWQVTSEDYGKGLAEVQNRLRKHGLLESAVAARQDQVDILTDLAAYFEEIGHPDSKDIRARQESLVCRFEALKEPLATRKKKLLDLLHLQLICRDTEDEEAWIQETEPSATSTYLGKDLIASKKLLNRHRVILENIASHEPRIQEITERGNKMVEEGHFAAEDVASRVKSLNQNMESLRARAARRQNDLEANVQFQQYLADLHEAETWIREKEPIVDNTNYGADEEAAGALLKKHEAFLLDLNSFGDSMKALRNQANACQQQQAAPVEGVAGEQRVMALYDFQARSPREVTMKKGDVLTLLSSINKDWWKVEAADHQGIVPAVYVRRLAHDEFPMLPQRRREEPGNITQRQEQIENQYRSLLDRAEERRRRLLQRYNEFLLAYEAGDMLEWIQEKKAENTGVELDDVWELQKKFDEFQKDLNTNEPRLRDINKVADDLLFEGLLTPEGAQIRQELNSRWGSLQRLADEQRQLLGSAHAVEVFHREADDTKEQIEKKCQALSAADPGSDLFSVQALQRRHEGFERDLVPLGDKVTILGETAERLSESHPDATEDLQRQKMELNEAWEDLQGRTKDRKESLNEAQKFYLFLSKARDLQNWISSIGGMVSSQELAEDLTGIEILLERHQEHRADMEAEAPTFQALEDFSAELIDSGHHASPEIEKKLQAVKLERDDLEKAWEKRKKILDQCLELQMFQGNCDQVESWMVARENSLRSDDKSSLDSLEALMKKRDDLDKAITAQEGKITDLEHFAESLIADEHYAKEEIATRLQRVLDRWKALKAQLIDERTKLGDYANLKQFYRDLEELEEWISEMLPTACDESYKDATNIQRKYLKHQTFAHEVDGRSEQVHGVINLGNSLIECSACDGNEEAMKEQLEQLKEHWDHLLERTNDKGKKLNEASRQQRFNTSIRDFEFWLSEAETLLAMKDQARDLASAGNLLKKHQLLEREMLAREDALKDLNTLAEDLLSSGTFNVDQIVKKKDNVNKRFLNVQELAAAHHEKLKEAYALFQFFQDLDDEESWIEEKLIRVSSQDYGRDLQGVQNLLKKHKRLEGELVAHEPAIQNVLDMAEKLKDKAAVGQEEIQLRLAQFVEHWEKLKELAKARGLKLEESLEYLQFMQNAEEEEAWINEKNALAVRGDCGDTLAATQSLLMKHEALENDFAVHETRVQNVCAQGEDILNKVLQEESQNKEISSKIEALNEKTPSLAKAIAAWKLQLEDDYAFQEFNWKADVVEAWIADKETSLKTNGNGADLGDFLT
LLAKQDTLDASLQSFQQERLPEITDLKDKLISAQHNQSKAIEERYAALLKRWEQLLEASAVHRQKLLEKQLPLQKAEDLFVEFAHKASALNNWCEKMEENLSEPVHCVSLNEIRQLQKDHEDFLASLARAQADFKCLLELDQQIKALGVPSSPYTWLTVEVLERTWKHLSDIIEEREQELQKEEARQVKNFEMCQEFEQNASTFLQWILETRAYFLDGSLLKETGTLESQLEANKRKQKEIQAMKRQLTKIVDLGDNLEDALILDIKYSTIGLAQQWDQLYQLGLRMQHNLEQQIQAKDIKGVSEETLKEFSTIYKHFDENLTGRLTHKEFRSCLRGLNYYLPMVEEDEHEPKFEKFLDAVDPGRKGYVSLEDYTAFLIDKESENIKSSDEIENAFQALAEGKSYITKEDMKQALTPEQVSFCATHMQQYMDPRGRSHLSGYDYVGFTNSYFGN\"),\n",
" (\"Dystrophin\", \"MLWWEEVEDCYEREDVQKKTFTKWVNAQFSKFGKQHIENLFSDLQDGRRLLDLLEGLTGQKLPKEKGSTRVHALNNVNKALRVLQNNNVDLVNIGSTDIVDGNHKLTLGLIWNIILHWQVKNVMKNIMAGLQQTNSEKILLSWVRQSTRNYPQVNVINFTTSWSDGLALNALIHSHRPDLFDWNSVVCQQSATQRLEHAFNIARYQLGIEKLLDPEDVDTTYPDKKSILMYITSLFQVLPQQVSIEAIQEVEMLPRPPKVTKEEHFQLHHQMHYSQQITVSLAQGYERTSSPKPRFKSYAYTQAAYVTTSDPTRSPFPSQHLEAPEDKSFGSSLMESEVNLDRYQTALEEVLSWLLSAEDTLQAQGEISNDVEVVKDQFHTHEGYMMDLTAHQGRVGNILQLGSKLIGTGKLSEDEETEVQEQMNLLNSRWECLRVASMEKQSNLHRVLMDLQNQKLKELNDWLTKTEERTRKMEEEPLGPDLEDLKRQVQQHKVLQEDLEQEQVRVNSLTHMVVVVDESSGDHATAALEEQLKVLGDRWANICRWTEDRWVLLQDILLKWQRLTEEQCLFSAWLSEKEDAVNKIHTTGFKDQNEMLSSLQKLAVLKADLEKKKQSMGKLYSLKQDLLSTLKNKSVTQKTEAWLDNFARCWDNLVQKLEKSTAQISQAVTTTQPSLTQTTVMETVTTVTTREQILVKHAQEELPPPPPQKKRQITVDSEIRKRLDVDITELHSWITRSEAVLQSPEFAIFRKEGNFSDLKEKVNAIEREKAEKFRKLQDASRSAQALVEQMVNEGVNADSIKQASEQLNSRWIEFCQLLSERLNWLEYQNNIIAFYNQLQQLEQMTTTAENWLKIQPTTPSEPTAIKSQLKICKDEVNRLSDLQPQIERLKIQSIALKEKGQGPMFLDADFVAFTNHFKQVFSDVQAREKELQTIFDTLPPMRYQETMSAIRTWVQQSETKLSIPQLSVTDYEIMEQRLGELQALQSSLQEQQSGLYYLSTTVKEMSKKAPSEISRKYQSEFEEIEGRWKKLSSQLVEHCQKLEEQMNKLRKIQNHIQTLKKWMAEVDVFLKEEWPALGDSEILKKQLKQCRLLVSDIQTIQPSLNSVNEGGQKIKNEAEPEFASRLETELKELNTQWDHMCQQVYARKEALKGGLEKTVSLQKDLSEMHEWMTQAEEEYLERDFEYKTPDELQKAVEEMKRAKEEAQQKEAKVKLLTESVNSVIAQAPPVAQEALKKELETLTTNYQWLCTRLNGKCKTLEEVWACWHELLSYLEKANKWLNEVEFKLKTTENIPGGAEEISEVLDSLENLMRHSEDNPNQIRILAQTLTDGGVMDELINEELETFNSRWRELHEEAVRRQKLLEQSIQSAQETEKSLHLIQESLTFIDKQLAAYIADKVDAAQMPQEAQKIQSDLTSHEISLEEMKKHNQGKEAAQRVLSQIDVAQKKLQDVSMKFRLFQKPANFEQRLQESKMILDEVKMHLPALETKSVEQEVVQSQLNHCVNLYKSLSEVKSEVEMVIKTGRQIVQKKQTENPKELDERVTALKLHYNELGAKVTERKQQLEKCLKLSRKMRKEMNVLTEWLAATDMELTKRSAVEGMPSNLDSEVAWGKATQKEIEKQKVHLKSITEVGEALKTVLGKKETLVEDKLSLLNSNWIAVTSRAEEWLNLLLEYQKHMETFDQNVDHITKWIIQADTLLDESEKKKPQQKEDVLKRLKAELNDIRPKVDSTRDQAANLMANRGDHCRKLVEPQISELNHRFAAISHRIKTGKASIPLKELEQFNSDIQKLLEPLEAEIQQGVNLKEEDFNKDMNEDNEGTVKELLQRGDNLQQRITDERKREEIKIKQQLLQTKHNALKDLRSQRRKKALEISHQWYQYKRQADDLLKCLDDIEKKLASLPEPRDERKIKEIDRELQKKKEELNAVRRQAEGLSEDGAAMAVEPTQIQLSKRWREIESKFAQFRRLNFAQIHTVRE
ETMMVMTEDMPLEISYVPSTYLTEITHVSQALLEVEQLLNAPDLCAKDFEDLFKQEESLKNIKDSLQQSSGRIDIIHSKKTAALQSATPVERVKLQEALSQLDFQWEKVNKMYKDRQGRFDRSVEKWRRFHYDIKIFNQWLTEAEQFLRKTQIPENWEHAKYKWYLKELQDGIGQRQTVVRTLNATGEEIIQQSSKTDASILQEKLGSLNLRWQEVCKQLSDRKKRLEEQKNILSEFQRDLNEFVLWLEEADNIASIPLEPGKEQQLKEKLEQVKLLVEELPLRQGILKQLNETGGPVLVSAPISPEEQDKLENKLKQTNLQWIKVSRALPEKQGEIEAQIKDLGQLEKKLEDLEEQLNHLLLWLSPIRNQLEIYNQPNQEGPFDVKETEIAVQAKQPDVEEILSKGQHLYKEKPATQPVKRKLEDLSSEWKAVNRLLQELRAKQPDLAPGLTTIGASPTQTVTLVTQPVVTKETAISKLEMPSSLMLEVPALADFNRAWTELTDWLSLLDQVIKSQRVMVGDLEDINEMIIKQKATMQDLEQRRPQLEELITAAQNLKNKTSNQEARTIITDRIERIQNQWDEVQEHLQNRRQQLNEMLKDSTQWLEAKEEAEQVLGQARAKLESWKEGPYTVDAIQKKITETKQLAKDLRQWQTNVDVANDLALKLLRDYSADDTRKVHMITENINASWRSIHKRVSEREAALEETHRLLQQFPLDLEKFLAWLTEAETTANVLQDATRKERLLEDSKGVKELMKQWQDLQGEIEAHTDVYHNLDENSQKILRSLEGSDDAVLLQRRLDNMNFKWSELRKKSLNIRSHLEASSDQWKRLHLSLQELLVWLQLKDDELSRQAPIGGDFPAVQKQNDVHRAFKRELKTKEPVIMSTLETVRIFLTEQPLEGLEKLYQEPRELPPEERAQNVTRLLRKQAEEVNTEWEKLNLHSADWQRKIDETLERLRELQEATDELDLKLRQAEVIKGSWQPVGDLLIDSLQDHLEKVKALRGEIAPLKENVSHVNDLARQLTTLGIQLSPYNLSTLEDLNTRWKLLQVAVEDRVRQLHEAHRDFGPASQHFLSTSVQGPWERAISPNKVPYYINHETQTTCWDHPKMTELYQSLADLNNVRFSAYRTAMKLRRLQKALCLDLLSLSAACDALDQHNLKQNDQPMDILQIINCLTTIYDRLEQEHNNLVNVPLCVDMCLNWLLNVYDTGRTGRIRVLSFKTGIISLCKAHLEDKYRYLFKQVASSTGFCDQRRLGLLLHDSIQIPRQLGEVASFGGSNIEPSVRSCFQFANNKPEIEAALFLDWMRLEPQSMVWLPVLHRVAAAETAKHQAKCNICKECPIIGFRYRSLKHFNYDICQSCFFSGRVAKGHKMHYPMVEYCTPTTSGEDVRDFAKVLKNKFRTKRYFAKHPRMGYLPVQTVLEGDNMETPVTLINFWPVDSAPASSPQLSHDDTHSRIEHYASRLAEMENSNGSYLNDSISPNESIDDEHLLIQHYCQSLNQDSPLSQPRSPAQILISLESEERGELERILADLEEENRNLQAEYDRLKQQHEHKGLSPLPSPPEMMPTSPQSPRDAELIAEAKLLRQHKGRLEARMQILEDHNKQLESQLHRLRQLLEQPQAEAKVNGTTVSSPSTSLQRSDSSQPMLLRVVGSQTSDSMGEEDLLSPPQDTSTGLEEVMEQLNNSFPSSRGRNTPGKPMREDTM\"),\n",
"]"
]
},
{
"cell_type": "markdown",
"id": "c9621578",
"metadata": {},
"source": [
"### Loading the model and tokenizing a sequence\n",
"\n",
"Load the ESM-2 model. Here we will use the 650M parameter version. Change the path below to point to your converted checkpoint.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "05696400",
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"sys.path.append(\"..\")\n",
"\n",
"from esm import ESM2\n",
"\n",
"esm_checkpoint = \"../checkpoints/mlx-esm2_t33_650M_UR50D\"\n",
"tokenizer, model = ESM2.from_pretrained(esm_checkpoint)"
]
},
{
"cell_type": "markdown",
"id": "2916adbb",
"metadata": {},
"source": [
"Here, we tokenize the protein sequence for human insulin and decode the tokens back to confirm the round trip."
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "47178dcd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sequence: MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN\n",
"Tokens: [20, 5, 4, 22, 20, 10, 4, 4, 14, 4, 4, 5, 4, 4, 5, 4, 22, 6, 14, 13, 14, 5, 5, 5, 18, 7, 17, 16, 21, 4, 23, 6, 8, 21, 4, 7, 9, 5, 4, 19, 4, 7, 23, 6, 9, 10, 6, 18, 18, 19, 11, 14, 15, 11, 10, 10, 9, 5, 9, 13, 4, 16, 7, 6, 16, 7, 9, 4, 6, 6, 6, 14, 6, 5, 6, 8, 4, 16, 14, 4, 5, 4, 9, 6, 8, 4, 16, 15, 10, 6, 12, 7, 9, 16, 23, 23, 11, 8, 12, 23, 8, 4, 19, 16, 4, 9, 17, 19, 23, 17]\n",
"Decoded: MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN\n"
]
}
],
"source": [
"human_insulin_sequence = \"MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN\"\n",
"tokens = tokenizer.encode(human_insulin_sequence, add_special_tokens=False)\n",
"print(f\"Sequence: {human_insulin_sequence}\")\n",
"print(f\"Tokens: {tokens.tolist()}\")\n",
"print(f\"Decoded: {tokenizer.decode(tokens)}\")"
]
},
{
"cell_type": "markdown",
"id": "c1b73ded",
"metadata": {},
"source": [
"### Embedding sequences\n",
"\n",
"To compute the embeddings of our proteins, we pass each protein sequence through ESM-2's tokenizer to convert amino acids into token IDs, then extract the final layer representations using `get_sequence_representations()`. This process gives us a vector for each protein that captures its learned functional and evolutionary features."
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "cb470957",
"metadata": {},
"outputs": [],
"source": [
"def extract_embeddings_batch(model, protein_list):\n",
"    \"\"\"Extract embeddings by processing all sequences in a batch.\"\"\"\n",
"    sequences = [seq for _, seq in protein_list]\n",
"    names = [name for name, _ in protein_list]\n",
"\n",
"    tokens = model.tokenizer.batch_encode(sequences, add_special_tokens=True)\n",
"    embeddings = model.get_sequence_representations(tokens, layer=-1)\n",
"\n",
"    return embeddings, names"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "38e83142",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Embedding shape: (6, 1280)\n",
"Each protein represented by 1280 features\n"
]
}
],
"source": [
"embeddings, protein_names = extract_embeddings_batch(model, proteins)\n",
"print(f\"\\nEmbedding shape: {embeddings.shape}\")\n",
"print(f\"Each protein represented by {embeddings.shape[1]} features\")"
]
},
{
"cell_type": "markdown",
"id": "fccd2a99",
"metadata": {},
"source": [
"### Protein embedding similarity matrix\n",
"\n",
"We can measure how similar the protein embeddings are by calculating a similarity matrix. We normalize each embedding to unit length, so the dot product of any two normalized embeddings equals their cosine similarity, and compute this for all pairs. Cosine similarity ranges from -1 to 1; in the resulting matrix, values close to 1 indicate highly similar proteins and values close to 0 indicate dissimilar ones."
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "93d14fff",
"metadata": {},
"outputs": [],
"source": [
"def compute_similarity_matrix(embeddings):\n",
"    \"\"\"Compute cosine similarity matrix for embeddings.\"\"\"\n",
"    normalized = embeddings / mx.linalg.norm(embeddings, axis=1, keepdims=True)\n",
"    similarity_matrix = normalized @ normalized.T\n",
"    return similarity_matrix"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "3485f854",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 800x600 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"similarity_matrix = compute_similarity_matrix(embeddings)\n",
"\n",
"plt.figure(figsize=(8, 6))\n",
"similarity_np = np.array(similarity_matrix)\n",
"\n",
"sim_df = pd.DataFrame(similarity_np,\n",
"                      index=protein_names,\n",
"                      columns=protein_names)\n",
"\n",
"sns.heatmap(sim_df, annot=True, cmap='viridis',\n",
"            fmt='.3f', square=True, cbar_kws={'label': 'Cosine Similarity'})\n",
"plt.title('Protein Similarity Matrix (ESM-2 Embeddings)', fontsize=14)\n",
"plt.xticks(rotation=45, ha='right')\n",
"plt.yticks(rotation=0)\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "1532dfc5",
"metadata": {},
"source": [
"The similarity matrix highlights clear expected relationships. Erythroid Alpha-Spectrin and Dystrophin are most similar (0.982), reflecting their shared role in cytoskeletal support. Hemoglobin Beta and Myoglobin also show high similarity (0.950), consistent with their oxygen-binding functions. Antimicrobial peptides Cathelicidin and Defensin Beta 4A likewise cluster closely (0.955), reflecting their common role in immunity. Alongside these expected patterns, some unexpected similarities appear, such as between Cathelicidin and Dystrophin (0.948) or Hemoglobin Beta and Alpha-Spectrin (0.931). These likely reflect sequence-level motifs or general structural tendencies that the embeddings capture, rather than direct functional relationships."
]
},
{
"cell_type": "markdown",
"id": "66794597",
"metadata": {},
"source": [
"### PCA visualization\n",
"\n",
"PCA (Principal Component Analysis) reduces high-dimensional data to a lower-dimensional representation by finding the directions of maximum variance in the data. This allows us to visualize our high-dimensional protein embeddings in 2D space while preserving the most important patterns of similarity and difference between proteins."
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "67afd74e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1200x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pca = PCA(n_components=2)\n",
"pca_result = pca.fit_transform(np.array(embeddings))\n",
"\n",
"plt.figure(figsize=(12, 5))\n",
"\n",
"plt.subplot(1, 2, 1)\n",
"plt.scatter(pca_result[:, 0], pca_result[:, 1], s=100, alpha=0.7)\n",
"for i, name in enumerate(protein_names):\n",
"    plt.annotate(name, (pca_result[i, 0], pca_result[i, 1]),\n",
"                 xytext=(5, 5), textcoords='offset points', fontsize=9)\n",
"plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')\n",
"plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')\n",
"plt.title('PCA of Protein Embeddings')\n",
"plt.grid(True, alpha=0.3)"
]
},
{
"cell_type": "markdown",
"id": "4dd24749",
"metadata": {},
"source": [
"The PCA analysis reveals clear groupings among the proteins, with the first two components capturing the majority of the variance in the embeddings. Hemoglobin Beta and Myoglobin cluster tightly in the upper right, reflecting their shared evolutionary history and common role in oxygen transport. Erythroid Alpha-Spectrin and Dystrophin separate into the lower left quadrant, consistent with their related cytoskeletal functions and structural importance in maintaining cell integrity. Cathelicidin and Defensin Beta 4A form another distinct cluster, aligning with their antimicrobial roles in innate immunity. The spatial separation of these groups highlights how the embeddings capture broad functional and structural distinctions, while also showing within-group proximity that reflects biological similarity."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}