mlx-examples/esm/notebooks/mutation_effect_prediction.ipynb

663 lines
522 KiB
Plaintext
Raw Normal View History

2025-08-16 11:48:57 +08:00
{
"cells": [
{
"cell_type": "markdown",
"id": "d28ac072",
"metadata": {},
"source": [
"## Zero-Shot Mutation Effect Prediction with ESM-2\n",
"\n",
"Protein function depends on its amino acid sequence, and even one change can alter activity, stability, or binding. Traditional methods to study these effects are costly and slow. ESM-2 offers a faster alternative, predicting mutation impacts from sequence alone. Trained on millions of proteins with masked language modeling, it learns which substitutions preserve function and stability from evolutionary patterns.\n",
"\n",
"In this notebook, we'll be scoring mutations, meaning we will quantify how much each amino acid change is predicted to affect the proteins function or stability.\n",
"\n",
"To score a mutation, we:\n",
"1. **Mask** the position of interest in the protein sequence\n",
"2. **Predict** probabilities for all 20 amino acids at that position \n",
"3. **Compare** the probabilities of the wild-type vs. mutant amino acid\n",
"4. **Calculate** a log-likelihood ratio (LLR) score\n",
"\n",
"**Negative scores** indicate deleterious mutations (model assigns lower probability to the mutant), while **positive scores** suggest beneficial or neutral changes.\n",
"\n",
"We'll demonstrate mutation effect prediction using **β-lactamase TEM from *E. coli***, a clinically important antibiotic resistance enzyme. This protein hydrolyzes β-lactam antibiotics and is one of the most common resistance mechanisms in clinical isolates. TEM-1 was extensively characterized through deep mutational scanning by [Stiffler et al. (2015)](https://www.cell.com/fulltext/S0092-8674%2815%2900078-1), who measured the fitness effects of nearly 5,000 single amino acid mutations. We'll use their experimental data to validate our ESM-2 predictions and demonstrate how well the model captures functional constraints.\n"
]
},
{
"cell_type": "markdown",
"id": "a98816a3",
"metadata": {},
"source": [
"### Setup\n",
"\n",
"Here we import all neccessary libraries."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "24652854",
"metadata": {},
"outputs": [],
"source": [
"import mlx.core as mx\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from matplotlib.patches import Rectangle\n",
"from tqdm.auto import tqdm\n",
"import pandas as pd\n",
"from scipy.stats import spearmanr"
]
},
{
"cell_type": "markdown",
"id": "fb407955",
"metadata": {},
"source": [
"Download the experimental deep mutational scanning dataset from Stiffler et al. (2015) for validation of our predictions."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "bd229ffc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"100 1693k 100 1693k 0 0 7621k 0 --:--:-- --:--:-- --:--:-- 7630k\n"
]
}
],
"source": [
"!mkdir -p data\n",
"!curl -o data/BLAT_ECOLX_Ranganathan2015.csv https://raw.githubusercontent.com/facebookresearch/esm/refs/heads/main/examples/variant-prediction/data/BLAT_ECOLX_Ranganathan2015.csv"
]
},
{
"cell_type": "markdown",
"id": "52c8a805",
"metadata": {},
"source": [
"Load the ESM-2 model. Here we will use the 650M parameter version. Change the path below to point to your converted checkpoint."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "5e694b81",
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"sys.path.append(\"..\")\n",
"\n",
"from esm import ESM2\n",
"\n",
"esm_checkpoint = \"../checkpoints/mlx-esm2_t12_35M_UR50D\"\n",
"tokenizer, model = ESM2.from_pretrained(esm_checkpoint)"
]
},
{
"cell_type": "markdown",
"id": "b69be9f9",
"metadata": {},
"source": [
"Here we define the mature TEM-1 β-lactamase sequence (263 amino acids, signal peptide removed) and coordinate conversion constants for mapping between UniProt numbering and our sequence indices."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b1b30eae",
"metadata": {},
"outputs": [],
"source": [
"UNIPROT_START_OF_MATURE = 24\n",
"\n",
"sequence = (\n",
" \"HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRVDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW\"\n",
")\n",
"\n",
"amino_acids = list(\"ACDEFGHIKLMNPQRSTVWY\")\n"
]
},
{
"cell_type": "markdown",
"id": "bd0f037c",
"metadata": {},
"source": [
"### Function Definitions"
]
},
{
"cell_type": "markdown",
"id": "4c42edf4",
"metadata": {},
"source": [
"These functions handle coordinate conversion between UniProt numbering (which includes the signal peptide) and our mature sequence indices, enabling us to map experimental data positions to the correct sequence locations."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "6c619e5c",
"metadata": {},
"outputs": [],
"source": [
"def u2m0(u_pos):\n",
" \"\"\"Convert UniProt position (1-based) to 0-based index into mature sequence.\"\"\"\n",
" m0 = u_pos - UNIPROT_START_OF_MATURE\n",
" return m0 if 0 <= m0 else None\n",
"\n",
"def urange_to_m0_span(u_start_incl, u_end_incl):\n",
" \"\"\"Convert UniProt range to [start, end) mature 0-based slice.\"\"\"\n",
" s0 = u2m0(u_start_incl)\n",
" e0 = u2m0(u_end_incl)\n",
" if s0 is None: s0 = 0\n",
" if e0 is None: e0 = -1\n",
" s0 = max(0, s0)\n",
" e0 = min(len(sequence) - 1, e0)\n",
" if s0 > e0:\n",
" return None\n",
" return s0, e0 + 1\n"
]
},
{
"cell_type": "markdown",
"id": "8ac3e9e4",
"metadata": {},
"source": [
"These are the core mutation scoring functions that implement ESM-2's masked language modeling approach. They mask a position in the sequence, predict amino acid probabilities using the model, and calculate log-likelihood ratios (LLR) comparing mutant versus wild-type amino acids."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "5c880801",
"metadata": {},
"outputs": [],
"source": [
"def get_site_log_probs(m0_pos):\n",
" \"\"\"Get log probabilities for all amino acids at a masked position.\"\"\"\n",
" tokens = tokenizer.encode(sequence)\n",
" input_ids = mx.array(tokens).reshape(1, -1)\n",
" masked = mx.array(input_ids)\n",
" masked[0, m0_pos + 1] = tokenizer.mask_id\n",
" logits = model(masked)[\"logits\"]\n",
" return mx.log(mx.softmax(logits[0, m0_pos + 1], axis=-1))\n",
"\n",
"def score_mutation_llr(u_pos, wt_aa, mut_aa):\n",
" \"\"\"Calculate log-likelihood ratio for a specific mutation.\"\"\"\n",
" m0 = u2m0(u_pos)\n",
" if m0 is None or m0 >= len(sequence):\n",
" return np.nan\n",
" \n",
" log_probs = get_site_log_probs(m0)\n",
" wt_tok = tokenizer.token_to_id.get(wt_aa, tokenizer.unk_id)\n",
" mt_tok = tokenizer.token_to_id.get(mut_aa, tokenizer.unk_id)\n",
" return float(log_probs[mt_tok].item() - log_probs[wt_tok].item())\n",
"\n",
"def score_position_llrs(m0_pos):\n",
" \"\"\"Get LLR scores for all 20 amino acids at a given position.\"\"\"\n",
" log_probs = get_site_log_probs(m0_pos)\n",
" wt = sequence[m0_pos]\n",
" wt_token = tokenizer.token_to_id.get(wt, tokenizer.unk_id)\n",
" wt_lp = float(log_probs[wt_token].item())\n",
" \n",
" scores = np.zeros(len(amino_acids), dtype=float)\n",
" for i, aa in enumerate(amino_acids):\n",
" mt_tok = tokenizer.token_to_id.get(aa, tokenizer.unk_id)\n",
" scores[i] = float(log_probs[mt_tok].item()) - wt_lp\n",
" return scores\n"
]
},
{
"cell_type": "markdown",
"id": "b2579edb",
"metadata": {},
"source": [
"This function generates a mutation effect heatmap for a specified protein region. It calculates LLR scores for all 20 amino acids at each position within the given UniProt range, creating a matrix that visualizes the predicted functional impact of every possible single amino acid substitution."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "b83d7f7d",
"metadata": {},
"outputs": [],
"source": [
"def generate_heatmap(u_start_incl, u_end_incl):\n",
" \"\"\"Generate mutation effect heatmap for a UniProt range.\"\"\"\n",
" span = urange_to_m0_span(u_start_incl, u_end_incl)\n",
" if span is None:\n",
" raise ValueError(\"Requested UniProt window lies outside the mature sequence.\")\n",
" \n",
" s0, e0 = span\n",
" n_cols = e0 - s0\n",
" heatmap = np.zeros((len(amino_acids), n_cols), dtype=float)\n",
" \n",
" print(f\"Calculating mutation effects for UniProt {u_start_incl}-{u_end_incl} \")\n",
" \n",
" for j, m0 in enumerate(tqdm(range(s0, e0), desc=\"Positions\")):\n",
" heatmap[:, j] = score_position_llrs(m0)\n",
" \n",
" seq_segment = sequence[s0:e0]\n",
" u_positions = list(range(UNIPROT_START_OF_MATURE + s0, UNIPROT_START_OF_MATURE + e0))\n",
" \n",
" return heatmap, seq_segment, u_positions\n"
]
},
{
"cell_type": "markdown",
"id": "737185d5",
"metadata": {},
"source": [
"This function generates a heatmap visualizing the predicted impact of all possible amino acid substitutions at each position, highlighting regions or residues of interest and coloring them by their log-likelihood ratio scores."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "6570ef2a",
"metadata": {},
"outputs": [],
"source": [
"def plot_heatmap(heatmap, seq_seg, u_positions, title,\n",
" col_width=0.42, min_width=8.0, max_width=16.0, height=6.0, dpi=160,\n",
" robust=True, lo=2, hi=98, vmin=None, vmax=None,\n",
" highlight=None, box_color=\"black\", box_lw=3.0, tick_fs=9, aa_fs=11):\n",
" \"\"\"Plot mutation effect heatmap.\"\"\"\n",
" \n",
" n_cols = len(seq_seg)\n",
" width = np.clip(col_width * n_cols, min_width, max_width)\n",
" \n",
" if robust and (vmin is None or vmax is None):\n",
" vals = heatmap[np.isfinite(heatmap)]\n",
" if vals.size:\n",
" p_lo, p_hi = np.percentile(vals, [lo, hi])\n",
" vmin = p_lo if vmin is None else vmin\n",
" vmax = p_hi if vmax is None else vmax\n",
" if vmin is None: vmin = -10\n",
" if vmax is None: vmax = 10\n",
" \n",
" fig, ax = plt.subplots(figsize=(width, height), constrained_layout=True, dpi=dpi)\n",
" im = ax.imshow(heatmap, cmap=\"RdBu_r\", aspect=\"auto\", vmin=vmin, vmax=vmax)\n",
" \n",
" ax.set_xticks(range(n_cols))\n",
" ax.set_xticklabels([f\"{aa}{u}\" for aa, u in zip(seq_seg, u_positions)],\n",
" rotation=90, fontsize=tick_fs)\n",
" ax.set_yticks(range(len(amino_acids)))\n",
" ax.set_yticklabels(amino_acids, fontsize=aa_fs)\n",
" \n",
" ax.set_xlabel(\"Wild-type Residue and Position\", fontsize=12)\n",
" ax.set_ylabel(\"Mutant Amino Acid\", fontsize=12)\n",
" ax.set_title(title, fontsize=14)\n",
" \n",
" cbar = plt.colorbar(im, ax=ax)\n",
" cbar.ax.tick_params(labelsize=tick_fs)\n",
" cbar.set_label(\"Log-Likelihood Ratio\", fontsize=11)\n",
" \n",
" if highlight:\n",
" if not isinstance(highlight, (list, tuple, set)):\n",
" highlight = list(highlight)\n",
" u2col = {u: j for j, u in enumerate(u_positions)}\n",
" for u in highlight:\n",
" j = u2col.get(int(u), None)\n",
" if j is not None:\n",
" ax.add_patch(Rectangle((j - 0.5, -0.5), 1.0, heatmap.shape[0],\n",
" fill=False, edgecolor=box_color, linewidth=box_lw))\n",
" \n",
" plt.show()\n",
" return fig"
]
},
{
"cell_type": "markdown",
"id": "aa62650c",
"metadata": {},
"source": [
"### Generating Heatmaps"
]
},
{
"cell_type": "markdown",
"id": "82c5be68",
"metadata": {},
"source": [
"#### SxxK motif"
]
},
{
"cell_type": "markdown",
"id": "4e9500f8",
"metadata": {},
"source": [
"We start with the SxxK catalytic motif, a signature sequence found in Class A β-lactamases. The highlighted serine (S68) forms a covalent acyl-enzyme intermediate with the β-lactam substrate, while the lysine (K71) plays a crucial role in the catalytic mechanism. This region should show strong purifying selection, with most mutations being highly deleterious, making it an excellent test of whether ESM-2 captures functional constraints."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "cef435f3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Calculating mutation effects for UniProt 60-80 \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Positions: 100%|██████████| 21/21 [00:00<00:00, 72.60it/s]\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABZIAAAPSCAYAAADhnC89AAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjUsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvWftoOwAAAAlwSFlzAAAYmwAAGJsBSXWDlAAA/hRJREFUeJzs3QecE1X38PGTXVh6L9JBmjQrCApKB7GCKChVihVR7B1BsGFH1MeGLF0EARFQkbI8CiqCiihKk+LC0nvfkvdz7vOfvNklyWbZZDOZ/L5+xiyTm8nNZDJJTs491+V2u90CAAAAAAAAAIAfcf6uAAAAAAAAAABAEUgGAAAAAAAAAAREIBkAAAAAAAAAEBCBZAAAAAAAAABAQASSAQAAAAAAAAABEUgGAAAAAAAAAAREIBkAAAAAAAAAEBCBZAAAAAAAAABAQASSAQAAAAAAAAABEUgGAAAAAAAAAAREIBkAAAAAAAAAEBCBZAAAAAAAAABAQASSAQAAAAAAAAABEUgGAAAAAAAAAAREIBkAAAAAAAAAEBCBZAAIk9OnT8vHH38snTp1kgoVKkhCQoIUKVJEatasKTfddJNMnz5d3G53SO+zX79+4nK5zLJly5aQbhsAnKBGjRrmHKmX2Z3DR48eLVdeeaWUKVNG8uXL5zm/zp49W2JVUlKSZz8MHz5cYsGuXbukVKlS5jHrMRHNYvH5gzN899135riNi4uT5cuXR7o7ABCzCCQDNqfBQOsDfygWK7h4trdPTEw8o4+6Lmu7iRMnBv0Yp0yZEtT9nI0TJ07Ijz/+KO+9957cfvvtcvHFF5uAbri/ROmXzssvv1zuuOMO+eabb8y/U1NT5fjx47J582aZOXOmdO/eXVq3bi1HjhyRWPbWW2+Z54EvtMjO4sWL5a677jKvYw3s5c+fXwoXLiyVKlWS5s2bm9f4+++/L+vWrZNooueBnJz79BxSt25dz23q1asn27Ztk1h08OBBz/kjVO8b6uTJk9KmTRt54IEH5Pvvv5f9+/dLenp6yLaP6PLII4+YY61OnToyaNAgn218fdYKx+s/L3j/KJ110R9U9Pyr52E9H//3v/+VWDo3+LJ161Z58cUX5YorrpDKlStLgQIFpFy5ctKwYUO59dZbzfvS7t27g9qW/kjVrVs3Offcc6VQoUJStmxZady4sXkcO3bskLxy7Ngx+fDDD+X666+XatWqmfda/fxcvnx58ziffvpp2bhxY462uWrVKvP60fesYsWKmUX/vvfee+WXX37J9vb6o96NN95okjDuuecezskAECluALa2efNmTVkN2aLbU2d7+3Hjxp3RR12XtV3Lli2Dfoxt2rQJ6n7ORunSpQM+nmHDhrnDoV27dp77KFasmHvIkCHuSZMmuSdMmOB+5pln3BUqVPBc37Nnz5Dd72233XbGc2131atX9/QZ8EWP5VatWuXoXDVlyhR3tPB+bNmd+1avXu2uWLGip33Tpk3de/bscccq7/dI3Y85OefopT+jR4/2bLdq1aruUaNGuT/77DP3rFmzzLJjxw53rFqyZEnY30PtZOXKlW6Xy2Ue7/jx4/228/VZK9Sv/7x6/rw/SwSzXHfdde59+/a5o/3ckFNpaWnmM12BAgXO6vOzt/3797uvuuqqgNsoUaKEe9q0ae5wS0pKclepUiXbx5Q/f373s88+687IyAi4Pb3+kUceccfFxfndll73+OOPZ7ut3377zfN6HDt2bIgfOQAgGPkiFsEGEBT95X/WrFl+r9cMB80IsQRqa20vqw8++MDnel8uueSSgNdrpkpaWprJUNmwYYPJ3glk06ZNZpil921DKWu2gpaY0EwRzR4Jl2XLlsmiRYvM3yVKlDAZ0Zpx4e2hhx6SCy+8UP7991+ZOnWqvPnmm0E/B0As2b59u8l+0ktVsGBBueGGG0zGv76eNUNu7969smbNGpM5+ueff5p2TsxU0mG9mh126NAh8++rrrpKPv/8c1MyB8ELJlv0yy+/9Pz96aefmox3/P8M2lCXZbIzzbzUx6tZmT179pRYe/5uueUWk1Vr0c9pej6eN2+efPvtt2bd3Llz5dprrzXnKP0sFwu09I3uF+tzt37e02zZpk2bmixivV7PNT/88IMZTZPdCIjrrrvOU65Bs5l1hE2jRo3k8OHD5j4WLFhgzv16DGqmsr4XhMNvv/1mSrJpn1SVKlWkb9++piybfn7W0S8zZsyQX3/91Yy0GzFihGn33HPP+d3mww8/bD7nKs1q7t27t8kuVnrMTJo0yeyvUaNGSUZGhrzyyit+t6Wfna+55hpz/GmWdp8+fczoJABAHgoq3AwgajKWg3W2mTO+eGckX3/99Z6MA80syM6TTz7pyUS44YYbQp6V06dPH/eIESPcc+fOdaekpJh1moETzmyqp556yrP94cOH+22nWSxWO+1fKJCRDKfp3r275/i48MIL3Vu3bg3YftOmTea19dVXX7mjRTAZibNnz3YXLFgw00iG06dPu2NduLIOzzvvPM92T5w4EbLtIrr8+uuvnuPghRdeCNg2WjKSc/pZItDnJO2vlR2qy4cffuiOlYzk++67z7P9bt26BczIPnbsWMCRI/o51dpW3bp13du3bz+jzWuvveZpU758efehQ4fc4dChQwfP/fTu3dt96tQpn+3eeustT7uEhAS/j/+///2vp52+h3333XdntNF13u9vy5cvD9jHefPmedomJiae5SMFAJwtaiQDCCnN2OnQoYP5e/z48QEzjDVjUNuojh07StWqVUPenwkTJsjQoUNNpoxmL+aFv//+2/N3+/bt/bbTLA+LZpwAyMzKxPKup67nmEA0a2rkyJEmo8opPvnkEzNBp5UhNmTIEJPBRRZW+Fj72sqCR2x6++23zaWOfLjtttsi3R3b0VrKWs/XMm3aNIkFmkX7zjvveEaG6OMuXbq03/ZaX1izlP29z2kmrkXnGNG6/76yeq+++mrPaEQrwzeUTp065cme1veXMWPGmAxiX/R9yBqlqNnE/ia/04x+i2Yt6wijrHSdd0bzU089FbCf1iTW3q9RAEDeIZAMIOR0OJ7auXOnGXrmj15nTRxi3cYJrGHnKlDw2ntyLJ24Ji8DJDpsWyeR0g/v55xzjvmioMPja9SoYQJWkydPznGZkaVLl5rJT84//3zPRGglS5Y0k8ToRCo6LFOHLFr0vvTLuXeZEV8T++gw3Ky0bIpO0nfzzTd7Jm3R+9Mvapdeeqk8+uijpmxKdnRYZNZJjv744w/zOLQsi+4T3T/t2rWTL7744ozb66Ru9913n9SvX1+KFi1qHm/Lli3N/svr52HhwoUm0OG9P7RcivZNf9x55plngprM5qeffpLBgweb51G/GOtQVv1Sq1/cdMIg/aKZV3QiHx06q/SxNGjQIFfb0+PFer51qG529128eHHTVvelDk+26PNmbUeft5SUFL/b0URF/SHLaq/7Nic0wDBw4EBPqQ6d0EmPfd1WKCfUskoM6ePUocLWRE/6g5cOodbXd1YrV64029DXirbV170eJ1999VXQfdAh8nps6nBwHc6trwE9b+rr6KWXXpIDBw4EnIhW+2nRPvo6h2SdyNM69+ilv/NBoPOSPuac8nWu0SHk1sRT1rGmz21W+hrQ2+g5oXr16iYopecb3e8DBgwwJV2C9fvvv5tJYGvVqmWeM93nzZo1k9dee80z8av35G++6LES7IS1evxr+RUtAaDPlfZdz0/ad92Pet46m+NU95334yhVqpS0aNHCBPesc0YoJhubPn26+VtLm+gkapFiPX7rmNX3Uv0xSX+stkp2af90El9/Ab2zef6CoaWGvI+vUB33kTg3BOvll182x7ZuQ4OtuTkf62cLPdaUfhbQx+uPBpO9f1gNtX379nnea3Sf62eaQHTSV8vRo0d9fs61zk/62r/77rv9bkuv0zbW82WVs/IlLi5Ounbtav7WzzVZjzsAQJiddS4zAFuwW2mLe++91wy3LleunKfUhT9WKQttq7fR2+bF8M5wl7bwHqbqb9/q461Vq5Zpo8NCt23blmelLc4999ygJs/RMgL
"text/plain": [
"<Figure size 1411.2x960 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"hm1, seg1, upos1 = generate_heatmap(60, 80)\n",
"fig1 = plot_heatmap(\n",
" hm1, seg1, upos1,\n",
" title=\"TEM-1 β-lactamase: SxxK motif region (UniProt 6080)\",\n",
" highlight=[68, 71]\n",
")"
]
},
{
"cell_type": "markdown",
"id": "d4c9b97c",
"metadata": {},
"source": [
"The heatmap highlights ESM-2s ability to pinpoint functionally critical residues. The catalytic serine (S68) and lysine (K71) exhibit extremely negative scores, nearing -10, indicating that almost any substitution at these sites would severely impair enzyme function. In contrast, surrounding positions display a broader range of tolerance, with some accommodating conservative changes. This pattern shows that ESM-2 has captured the precise functional constraints of the active site purely from evolutionary sequence patterns."
]
},
{
"cell_type": "markdown",
"id": "d24e211b",
"metadata": {},
"source": [
"#### Binding patch"
]
},
{
"cell_type": "markdown",
"id": "7142c147",
"metadata": {},
"source": [
"We next examine the substrate-binding residues 232234, which form part of the binding patch that interacts directly with the β-lactam substrate. These residues are key to substrate recognition, making them an ideal test of ESM-2s ability to detect the functional importance of specific contact points."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "474f1fcc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Calculating mutation effects for UniProt 226-238 \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Positions: 100%|██████████| 13/13 [00:00<00:00, 78.58it/s]\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABQ8AAAPSCAYAAADVySxkAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjUsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvWftoOwAAAAlwSFlzAAAYmwAAGJsBSXWDlAAA88RJREFUeJzs3QeYE1XXwPGTXXrv0qQrRbB8IChKV8QOKiq9qCgKYu8KUuy9N2QpgoqCICBSFxXkRVAUBAFpurD0Xrfle87ViWFJWzbJTDb/n8+YkEwmN5NJNjk55x6X2+12CwAAAAAAAABkk5D9AgAAAAAAAABQBA8BAAAAAAAA+ETwEAAAAAAAAIBPBA8BAAAAAAAA+ETwEAAAAAAAAIBPBA8BAAAAAAAA+ETwEAAAAAAAAIBPBA8BAAAAAAAA+ETwEAAAAAAAAIBPBA8BAAAAAAAA+ETwEAAAAAAAAIBPBA8BAAAAAAAA+ETwEAAAAAAAAIBPBA8BAAAAAAAA+ETwEAAAAAAAAIBPBA8BxIS0tDT56KOPpEOHDlKxYkUpUKCAFC1aVGrVqiXXX3+9TJw4Udxud1jvs3fv3uJyucyyadOmsG4byEus10nr1q3FaZKSkjzj0/O+1KhRw1yvp4gfQ4YMCXpsRNrRo0eldu3aZgz33nuvxDL9O2ntT/37CSD3MjIypF69euZ1dfvtt9s9HABxjOAh4or3B9twLFZA6VRv7+vLivcXXWsZO3ZsyI9x/PjxId3PqX7JWbx4sbzzzjty6623ynnnnWeCeNb96BexSNi+fbtceOGFctttt8m3335r/p2eni5HjhyRjRs3yqRJk+TGG280gYuDBw9KPHvttdfM8xCp5wKxLTk5OeB7UkJCghQvXlzOOOMM6dy5s3k/0dcaEGnx+t71zDPPyIYNG6R06dLy1FNPBQxu66Kv4VP5Acwp+9U7YJt9SUxMNPuhYcOG0qtXL5kxY4ZkZWWJE1nHqh634ZSZmSkLFiyQwYMHmx9Lq1evLoULF5ZChQpJ5cqVpX379vLSSy/Jzp07Q9reoUOH5KuvvpJ77rlHWrRoIaeddpr53FasWDHz4+sNN9wgn3zyiRw/fvyUxjtnzhy544475KyzzpIyZcqYcZ5++unSrFkzc59635ESzsemP1DPnTtXXnjhBbnpppvM51t9HEWKFDH7X/f9JZdcIs8++6xs3bo15DHqj9pff/21dO/eXerWrSslSpSQfPnymeNc70P33Q8//BBwG7q+jkvpj+hLliwJ+f4BIKzcQBzZuHGjpqaFbdHtqVO9/ahRo04ao16Wfb2WLVuG/BjbtGkT0v2cijJlygR8PIMHD3ZHQrt27Tz3Ubx4cfegQYPc48aNc48ZM8b9xBNPuCtWrOi5vmvXrmG73169ep30XDtd9erVPWMGsps/f36O36fq1q3rXrlyZcDtWuu2atXK7TTe76n+3gut142eIn7eu/RvVrj/TuZESkqKu3Dhwub+n3766ZD2jb6GT+VvWKT+Pvv7jKX3HWyfh7JceOGF7k2bNrmdxhpfON8zFixY4C5fvnxI+0U/C3344YcBt/fyyy+7CxUqFNL2ateu7V64cGHIY12/fv0Jn80CLZEQ7sf2yy+/hHxM6mtW7z+Yv//+2xy/oWyzY8eO7gMHDgTc3v/93/+ZdVu0aJHj/QUA4ZAvvKFIwNkqVKggkydP9nv9jh07TigJCLSutb3s3n//fZ+X+/J///d/QX9t1HKF7777TtatW2eygQJZv369JyvBum24fxH3puXDBQsWlM2bN0ukLFy40PwarEqWLGkyH7V8w9t9990n55xzjvz9998yYcIEefXVV0N+DoB4Vb58efnggw9OuEyzfHbv3i1Lly41WYea2bFmzRpp06aNrFy50u/rKtxTBkQb0xLADsOGDTMZ/ToFx9133y2xTjMkc/JeoO8r3o9bP2NoZYH+zdeMMn0/+vHHH03Gl74n6WeAvEwz2qyMQj0m2rVrZ6ouqlataj7TrV271mTS6alWWWg1xuHDh2XQoEE+t6frHTt2zJyvVKmS2d75559vMvQ0027ZsmWmsmXPnj3m8+Oll15qMgn1PgNZsWKFyYDctm2b+bdmHV5zzTVy5plnmqy/ffv2yR9//CGzZ8+W3377Lez7KZKPrWbNmmY7DRo0MFmfmomv96PZwZpB+NNPP5nX7P33328yGh999FGf29G/nXp8//nnn57ns0ePHiartmzZsrJlyxaTYTpt2jTzmtHjXfehHvtaAeCL3pdWBHz//fcyc+ZMk5kKAFEVlhAkkEczE0PlfZvcZqh5Z8lcffXV7oSEBHP+4YcfDnrbRx991Kyrt7nmmmvCnlHRo0cP99ChQ93Tpk1zp6amnpRFEInMhscee8yz/SFDhvhdTzMQrfV0fOFA5iHycuZhsIwZzfapXLmyZ/1Q3oOcKJTMQ9gv3jIP9W9owYIFzX3fdtttAdeNlczDnO5zf9mJatasWe4CBQp41tXPAnk983DChAnuevXquUeOHOk+ePCgz3XS09Pdd9xxh+f+9RjSLEBfdL1LLrnE/c0337gzMjJ8rrNjxw73BRdccEKmeWZmpt8x7tu3z3366aebdfX50bFmZWX5XX/z5s3uSAj3Y9Osvw0bNgS9348++sizvfz583s+C2c3YsQIz3oNGzZ0b9++3ed6mhFZpEgRz7pff/213/vW5976m9y6deugYwWAcGPOQ8DBqlWrZn4tVaNHjw6YSai/2Os6Sn8R1rlawm3MmDHy5JNPypVXXmmyDqNBf722aPaBP/rLvOXAgQMRHxeQ12nWxYMPPuj5d07mWgMQmGb9WnOx9enTx+7hOI5+9tG57CyfffaZ5HWXX365/P7779K3b1+TweeLZiC+/fbbZr48pceQv3mxR4wYYbL/NENN55P0l4H+5Zdfmnn9lGaaa2abP5pxp1UeatSoUWasOl9loM+xkRDux6YZhpp1GMwtt9wiV111lTmv8wF/8803Ptfzvvy5557zm7XfvHlzM+9hKH9n9bnXuROt9TQDFACiieAh4HDamERpecj06dP9rqfXWZM4W7fJC/bv3+85Hyhg+ddff3nOa0lItGg5i5ay6Jeciy++2DNht5aoaAmXdoLWMqOclpBrOUv//v2lUaNG5vHkz59fSpUqJY0bN5a77rpLZs2adcJE8taE+t4l5L4mpPfVDVdL4nXSd51cXEvC9UO03l+5cuVM+Y4GkLTs51Q6l2qpqz4OLbnXfaL7R8uLpkyZctLt9YP9wIEDpX79+uaLkz7eli1bmv0X7edBy5t0wn7v/aEf/nVs+qX2iSeekJ9//jnodv73v//JgAEDzPOok8lrmb9OvK5feN57771TnqQ+WrR0y9drMafdln0dG3pMaXdZ3cf6PGlJoh7f2kBCS/FCoeVj2jxAJ+fXCeh1O3qs6fQTv/76a8iPM1i3Ze9GVlbjCX1P1h9T9LnVSfD1mNWStEceeUR27doV0v3ql08NBOhxrq83/ZKrE/1reZtO2ZC9yU1um154N95Q+h6iPzrpDzNa+qfHp/7w1LVr16CT+Ct9PWmZnT5mLdGrUqWKaZigj0PP6w9Nepzr8xRsTKfy3mVJTU01JcA6Bn196ePQRgd16tQxr32dTmTv3r0h7aNwHJehsH7s0x++LrjgArFL9mNC6XupllDq2HRf6vvp1VdfHfAzSCS6LesYvJ8XbZIW6DWpJaL690pfk/p+oNd5ByAtWsqqjS/0daefK/RvhQaa9H1E39u1pNSf7PtKj1tfx+upPH491vyVrHrTdbRJnMVfabD+zQmFvmZ0XwTbnr7O9EdkpX9n9X3CLuF+bDmh7/MWq3Q7Oy2/t2g5dyDe12u5cyD6Oc2iwVsAiKqw5zICMcxpZct33XWXOy0tzTOBtpYx+2OVKeu6ehu9bTTKsSJdtqwNGILtW328OiG2ruNyudx//fVX1MqWa9asGdJk2Oecc05Ix4aWwIQ6CflXX33ls6wt0JK9ocXo0aNDul1iYqL7hRdeyFEJ4Pvvv39C2Vn25Z577vHcVtcPtG7fvn0D3ne4nodjx46
"text/plain": [
"<Figure size 1280x960 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"hm2, seg2, upos2 = generate_heatmap(226, 238)\n",
"fig2 = plot_heatmap(\n",
" hm2, seg2, upos2,\n",
" title=\"TEM-1 β-lactamase: Binding patch (UniProt 226238)\",\n",
" highlight=[232, 233, 234]\n",
")\n"
]
},
{
"cell_type": "markdown",
"id": "8e0450d9",
"metadata": {},
"source": [
"The heatmap illustrates ESM-2s recognition of the substrate-binding residues 232234 as critical contact points. All three positions display strongly negative scores, often approaching -10, indicating that nearly any substitution would likely disrupt substrate binding. This sharp contrast with the more variable tolerance observed at neighboring sites suggests that ESM-2 has accurately learned the essential role of these residues in substrate recognition from evolutionary sequence data alone."
]
},
{
"cell_type": "markdown",
"id": "12cf53a7",
"metadata": {},
"source": [
"#### Ω-loop region"
]
},
{
"cell_type": "markdown",
"id": "ef967cd4",
"metadata": {},
"source": [
"We now focus on the Ω-loop region spanning residues 160180, which contains the catalytic glutamate at position 166. This residue functions as a proton acceptor during catalysis, playing a central role in the enzymes mechanism."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "709e747b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Calculating mutation effects for UniProt 160-180 \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Positions: 100%|██████████| 21/21 [00:00<00:00, 76.40it/s]\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABZIAAAPSCAYAAADhnC89AAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjUsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvWftoOwAAAAlwSFlzAAAYmwAAGJsBSXWDlAAA+69JREFUeJzs3QeYE1XXwPGTXXrv0kGqCCgKghSpilhBFJTeFEVR7L0golix4mvDXbqAgqLwIt0CImJBVKQXgaX3viXfc+73Tszupi2bbGYn/5/PmGVyM7kzmUySM2fOdbndbrcAAAAAAAAAAOBHnL87AAAAAAAAAABQBJIBAAAAAAAAAAERSAYAAAAAAAAABEQgGQAAAAAAAAAQEIFkAAAAAAAAAEBABJIBAAAAAAAAAAERSAYAAAAAAAAABEQgGQAAAAAAAAAQEIFkAAAAAAAAAEBABJIBAAAAAAAAAAERSAYAAAAAAAAABEQgGQAAAAAAAAAQEIFkAAAAAAAAAEBABJIBAAAAAAAAAAERSAYAAAAAAAAABEQgGQByiTNnzshHH30knTp1kvLly0u+fPmkcOHCUqNGDbnxxhtl+vTp4na7w/qc/fv3F5fLZaYtW7aEddlArBs+fLjn/bVkyZJodwc5rG3btp7XP1Y89thjZn0rV64sJ06ckNwsFl8/INIOHTokZcqUMe+rUaNGRbs7AAAfCCQDMUaDgdYPn3BMVnDxbB+fmJiYqY86L2O7CRMmhLyOkydPDul5zsbJkydl+fLl8u6778qtt94qF110kQnoWs+jgaFI2L17tzRv3lxuu+02+frrr82/k5OTzQ/xzZs3y4wZM6R79+7mh+3Ro0cllr3xxhvmdYjUawFnSUtLk9mzZ8tdd90ljRo1knPOOce8p8uWLSsNGjQw73M9SaPvNwBnb82aNfLaa6+Zv0eOHCmFChUKeHJFT2SGSk/EWI+rXr265IbvW0WKFJGqVavKNddcYz63Dhw4IHb0+eefez5Tw31COZLfqfTk+8SJE+Wmm26SWrVqme2tU82aNaVDhw5m2StWrAhpWevWrZOHHnpIGjZsKCVLljQn8XWZuo/m1EnASGwrTT748ssvpXfv3lK3bl0pVqyY5MmTx6yjLv+OO+6Q77//PkvL3LFjh+lL48aNTUC4YMGCcu6550q3bt3kiy++CPr4EiVKyNNPP+05Tmzbti3L6wUAiDA3gJiyefNmTVkN26TLU2f7+ISEhEx91HkZ27Vu3TrkdWzXrl1Iz3M2SpUqFXB9nnnmGXckdOjQwfMcRYsWdQ8bNsw9ceJE9/jx491PPvmku3z58p77e/bsGbbn7devX6bX2u6qVavm6TMQyNy5c90NGjQI6Vh17rnnuqdMmRLW59fjhbX8xYsXh3XZsL82bdrE1LHquuuuM+taq1Ytd0pKStD3hH7+hErfP9bj9DPADq9fVr9vlSxZ0v3JJ5+47cb7e0C4j1OR+k41f/58s58F2+b6GgYzevRod/78+QMuR7fR6dOn3ZEU7m31zz//uJs3bx7SvtmlSxf3kSNHgi5z8uTJ7mLFigVc1lVXXeU+ePBgwOWcOnXKXaFCBdO+T58+WVovAEDk5Yl0oBqAvZQrV05mzpzp9/49e/bI7bff7vl3oLbW8jJ6//33fc735eKLLw54v2ZGpKSkyLfffivr16+X2rVrB2y/ceNGT3aI9dhwSk1NTfdvLTGRP39+2bp1q0TK0qVLZeHChebv4sWLm4yU8847L12b+++/Xy688EL5559/ZMqUKfL666+H/BoAsei5556TZ555xlMORt/L119/vcnCKl26tLm8VjMoNVtrw4YNJvO/R48e8t1338mbb75pji9AdsRSOZMffvjBvJfUww8/LPHx8RJrr1/G71N69dBvv/0m48ePl3379snBgwelZ8+eJlP7uuuuk1gQie9U+h2ob9++nu9/7du3lyuuuEKqVKkiefPmNd9zf//9d/nvf/8bdFlvvfWW+X6lNOtXs5uvvPJK08effvpJPv74Yzl27JiMGzfOZAxPnTpVcsO20j63a9fOfLYpzbDu06ePuQpHP/80q/ibb76Rr776ynxGala6fj7qd9G4uDi/+7dmNutVPqpjx45yww03mCznP/74w5Rm27t3r9nuuqz58+eb/vui83W7axb4pEmT5JFHHpH69etneT0BABGSA8FqALlIxgyaUPnKUj5b3hnJmsEUFxdn/n7kkUeCPvaxxx4zbfUx119/fdgzkjUzYsSIEe6vvvrKnZSUlCmDKhIZyY8//rhn+cOHD/fbTjOTrXbav3AgIxlO9NJLL3n2kfj4ePfIkSPdJ0+e9Nk2NTXV/fHHH7uLFCniecyQIUPC0g8ykhErOnfubPbzEiVKuI8fP+63XW7KSA7X96l9+/a5mzRpkq7/ycnJ7ljISA73d6off/zRHNP1sZUqVXL/8MMPAdtv27bN730bNmxw58uXzyzL5XK5p02blqnNX3/95S5btqynv+G+aiVS2+r555/3PE6vytm9e7fPdkuXLnUXKlTI0/bLL7/02U4zjEuXLu1p9+qrr2Zqs2PHDnedOnU8bUaNGhWwj/v373cXKFDAtO3fv3/I6wYAiDxqJAOwNa0fqJkkSjM+AmUYa7aGtrEyITT7JNw0c+ipp54yNQ01GyQn/P33356/L7/8cr/tdPAiy5EjRyLeLyA30oxiHfBLaWbVtGnT5IknnpACBQr4bK9tBgwYYLKztL6m+s9//mOypAAEp3V1Z82aZf7WWv6+aiPHMs0Atb67KM0w1SuPYkE4v1Pp98N+/fqZ74KaYbt48WK59NJLAz4m0PfEZ5991tRZVlorWGv8ZlSvXj155513PP/Wz5JwD3ociW3lnY394osv+r2CrUWLFmbdg2Xhjx49Wvbv32/+vuqqq+SBBx7I1KZixYpmHSwvvPCCyYz2p1SpUiZz2Rr7xFo+ACD6CCQDsD0dVETt2rXLDIrlj963c+fOdI9xgsOHD3v+DvTjwXtAEv1hmlNOnTplLlm+9957pVWrVp7ByvSHnA56dOONN5qgW1bLjGjgbsiQIWZwG10fvSRVB2HRAVx0YLR58+Z5LqFU+lx66an3ZZ6+BjjSAQkz0rIpOtiRXraqZUOKFi1qnk8HirnkkkvM5ZVaNiUY74GirAEe9ZJOXQ8ty6LbRLePDvTja9CZtWvXyt13321+nGrQUte3devWIQUtw/06LFiwwPwo994e+mNT+6Ynd5588kn55Zdfgi7nxx9/lKFDh5rXUX8Y6iWr+oOyU6dO8t5778np06clJ2kQ2dpv9BL7rl27hlyGRy9z9g4YWEGGnKADa+rz60ky3X66HXV76gCBegmw7juh0n0gISFBOnfubAIpGkTXfe388883+6pesh2Mvo8yDrpqXf6sy9T+VahQwTyHXh4dLhkHVNP9RwP7evm6bhctOaLr4ouWKtH3sh5DdDBFfX/o+6RNmzbyyiuvhDxQqQ68OGbMGPPetAaTqlGjhrk0XEsRZRz8zd8gWN7bMBi91Fzfc02bNvX0XT8PtA+jRo0yZRFCHfjNOgbq/qsDd+nxQpep+4FuVx1A7M8//5Rw0eCRFVzzFYzLKb6Oz3pcv++++8xxTo+VWj5K9w8Nch0/fjzg8rLy+gWj7z0dvM2ipRfCsd/rdv/ss8/klltuMQOeaRBfj+f6eaSvsx7n/dH79Tm9g9xaDsHX56odaGkF68S7vqbBSqEFomUqvEuR+AqMWvR7gyY9qE2bNpkyLnanAzZb6tSpE7Ct9/3+Ar8a6A1lWzVr1swcb5Qeb60TTIG2rXWs4uQtANhIDmQ9A8hF7Fba4q677nKfOXPGc+mglrrwxyploW31MfrYcJe28CXSpS28B/Txt211fWvWrOm5BDPQ5ZrhLm2hg5CFMljLhRdeGNK+oZdseg8uGGj6/PPPfZa0yMrgOuPGjQvpcXq57MsvvxzyvqD73Pvvv++5NNbXdO+993oeq+0DtR04cGDA5w7X66CD3HTv3j2kZdWtW9fvcnRgnm7dugVdRo0aNdy///57wHXzbp+dS6tXrFjhWU7
"text/plain": [
"<Figure size 1411.2x960 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"hm3, seg3, upos3 = generate_heatmap(160, 180)\n",
"fig3 = plot_heatmap(\n",
" hm3, seg3, upos3,\n",
" title=\"TEM-1 β-lactamase: Ω-loop region (UniProt 160180)\",\n",
" highlight=[166]\n",
")"
]
},
{
"cell_type": "markdown",
"id": "d15136f6",
"metadata": {},
"source": [
"The heatmap shows that ESM-2 identifies glutamate 166 as an important catalytic residue, with most substitutions producing negative scores, though generally not as extreme as those seen in the previous examples. This more moderate penalty likely reflects the Ω-loops inherent flexibility, which may allow certain substitutions to retain partial function or be accommodated structurally. Additionally, E166s role as a proton acceptor might be more chemically replaceable than the covalent interactions made by S68 for example, meaning the model does not assign uniformly catastrophic scores to all substitutions."
]
},
{
"cell_type": "markdown",
"id": "c92d5f28",
"metadata": {},
"source": [
"### DMS Benchmark\n",
"\n",
"This function benchmarks ESM-2s mutation effect predictions against experimental deep mutational scanning (DMS) data from Stiffler et al. (2015). It takes the list of experimentally measured β-lactamase TEM mutations, scores each with ESM-2 using masked-marginal log-likelihood ratios, and then compares the predicted scores to experimental fitness values across antibiotic concentrations. The correlation analysis, reported as Spearman |ρ|, quantifies how well ESM-2s sequence-only predictions align with measured resistance phenotypes. This provides a direct test of the models ability to capture functional effects observed in large-scale experimental screens."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "137fe913",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading DMS mutations from data/BLAT_ECOLX_Ranganathan2015.csv...\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Scoring DMS mutations: 100%|██████████| 4996/4996 [00:50<00:00, 98.14it/s] "
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Spearman |ρ| vs. experimental fitness:\n",
" 2500 μg/mL: |ρ| = 0.556\n",
" 625 μg/mL: |ρ| = 0.459\n",
" 156 μg/mL: |ρ| = 0.393\n",
" 39 μg/mL: |ρ| = 0.284\n",
" 0 μg/mL: |ρ| = 0.164\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAzkAAAKtCAYAAAAEkqD4AAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjUsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvWftoOwAAAAlwSFlzAAAViAAAFYgBxNdAoAABAABJREFUeJzsvQecZOdV5n0qdo7TPTlHZckKliXZkpxzwDiBWGNjbGAxu7ZhwZ/BgZXBhsWGXQNL9lqAQLblhINwUs6y4kiTY0/onKurK3+//7n11tyuqaquDjPT3XMeu9TTVXXvfe97b1Wf533OeU4gl8vlxGAwGAwGg8FgMBiWCILnegAGg8FgMBgMBoPBMJ8wkmMwGAwGg8FgMBiWFIzkGAwGg8FgMBgMhiUFIzkGg8FgMBgMBoNhScFIjsFgMBgMBoPBYFhSMJJjMBgMBoPBYDAYlhSM5BgMBoPBYDAYDIYlBSM5BoPBYDAYDAaDYUnBSI7BYDAYDAaDwWBYUjCSYzAYDAaDwWAwGJYUjOQYDAaDwWAwGAyGJQUjOQaDwWAwGAwGg2FJwUiOwWAwGAwGg8FgWFIwkmMwGAwGwxnG8PCw/NZv/ZZs3LhRIpGIBAIBufnmm/W1//f//t+U3w0zwy//8i/r/D388MPneiiGM4gbb7xRamtr5fDhw+d6KIZFAiM5BkOV4I/obB4EMP5AZibbOLzvfe8rvNbS0iITExNV/dHnwR+F2eCJJ56Qv/iLv5Bf+qVfkgsuuECCwaDu7zOf+YzMBQR5xedbX18vK1eulKuvvlo+9KEPyde//nVJpVIV9+Pf/ld/9Vcrvnd8fFwaGxsL7//1X//1ku87cuSI/I//8T/kiiuukObmZolGo7Jq1Sq5/PLLdU7/4R/+Qfr6+mQ+UWo+Sj14XykQ2DG2rVu36jzW1dXJ+vXr5dprr5X/9t/+W8m5LL4XH3jggYpj/PKXvzzl/Y888siMzjGXy+k2v//7vy833XSTLF++XAP9trY2eelLXyp//ud/LvF4XJYy3va2t8lf/uVfSm9vr1x22WVyww03yKWXXjrtdnwG+cxZYFcajz/+uPzzP/+zvOlNb5LrrrtOlioymYzeP9dcc400NTXp9xP//qu/+ivJZrOz2iekerrvnbvuuqvkttV8Z01OTlY8/v79+5X479ixQxoaGvSc+FvD37vHHnvstPffeuutkkgk5Hd/93dndb6G8w/hcz0Ag2GxgKCkFB588EH9uW3bNg3eirFixYopv9fU1GgwXwnF2/gxOjoq3/jGN5R8lMLY2JgGtnMFxOGZZ56RMwX/fBGEs9L97LPPys9+9jP5+7//eyU9/AF/+9vfPu2+vva1r8n/+T//R4P8UvjqV78qsVis4j6+/e1vyy/+4i8qgeQP9Jo1a/QPLsH3nj17dGy33XabksZycz8XlLt/HCBbxYA0fO5zn1MSAWlYt26ddHR0yODgoJJUAoUvfelL0tXVJWvXri27b0gPZKMSyZkLfvrTn8qrXvWqwu8Qtg0bNui4+Pzw+Lu/+zv50Y9+VHGcixU7d+6Ue++9VwnoCy+8cBphZeGCQA9yWorkQL4JSMsR3fMZH/3oR/X+/8M//ENZqkgmk0ri+HwA7hUWYPiu5HP+H//xH/rgO2C+v3tYiKgE/pbxN60UWBgrh3/8x3+U3/zN31TS4sgN53ns2DH5yle+oos2L37xi6dswwLJy1/+cv2+Z3FnKZNawzwhZzAY5gQ+Rjy+/OUvV3wfr/O+DRs2zPgYv/zLv6zbXnjhhfrzFa94Rdn3/sM//MOU99bU1ORmg5/7uZ/Lvfvd78796Z/+ae6nP/1p7qabbtL9ffrTn87NBZx/ufmKx+O5733ve7mXvexlhXn98z//85L7ca+787ztttvKHvOlL33plPf+2q/92pTXjx49mquvr9fX3vCGN+R27do15fVUKqVz8IEPfCB355135uYTleajEhiHm4Pf/d3fzfX19U15fXR0NPfVr34197rXvS53/Pjxkvfi2rVrc42NjbmmpqZcLBYreZx9+/ZNmTseDz/88IzG+qMf/UjP8wtf+ELuxIkTU177zne+k2tra9P9Xn/99bmliDvuuEPP75prrpn1/XH33XefkbEtZjz66KM6N1deeWVuKYPPN+fZ2tqau/feewvPP/fcc7l169bpa5/4xCdmvF/3nT7T7x7gvgsOHTo04235XgoEArn29vbcv//7v+fS6fSU15955pnck08+WXLbf/3Xf9Xjvutd75rxcQ3nHyxdzWBYRHjZy14mmzdvlrvvvltXdyutur///e+f07FQi/793/9d07dYPZtt2ttMwDHe8IY3yD333CO/8Ru/oc/99m//tq5WlgOpWpXUhn379mk61kUXXXTayqDDv/3bv6mCs2zZMlXBWFX0IxwO6xyQrlaNsnQ2wFgA8/Unf/InquD4QUrLO9/5TvnBD34gq1evLrkPUkTe8Y53VFT/XOrkXO4n5p3r8LGPfew0RerNb36zqk3goYceUsVsqcGl4pVTGg2zw1//9V9P+Q5Yiujv71eVGvzpn/6p1qU4XHLJJYXvAVI+UXAXOlDs+W5H5fn+978v7373uyUUCk15D+mcL3rRi0pu/3M/93Oq/Hzzm9+U7u7uszRqw2KFkRyDYRGBNCr+oLOQhqRfDAJJUn8uvvhizdderOAPIH/YIRvkm3/2s58t+17+6EP8IEal6haqCdIPHDigP0mRIKVoMcCNuZq6jkpw81JcBwaYe1L0CEL+y3/5L7M+BkFJpVQaiJrDrl27qt4v14vPxO233172PSMjI3pNeZ+fLJMm88UvflFrl0gXY3yk7BBgkUbz5JNPylzhap+oMQCkrPlrFrhnyxkPuOfcYgYk27+t26e/toJtCHQ/8pGPaGobaUSkXX7wgx+cNiBkUYOUKFJlSYXiJ3VE9913X9lt7rzzTnnd6143pcZq+/bt8gu/8AsahBaDOb3llls0VZGxUSfHONnHF77wBf1eqxakNpG2BN761rdO+37SUSHb3IssWjBmxkkq6kLGd77zHa1tYUGi1GfwNa95jX7/QaR570IHi1EDAwO6uMFnb6bgs8w5k+J8xx13nJExGpYOjOQYDIsMzlSAgKY4KJgvFWchgEDkv/7X/6r//s///E8NSkvBBXyliJ8L0tlXpSCdwMcF2PwBnimcMcTZdMdyY4bUziQ4LKUObtmypSRJ/PGPf6x1M69//eu1RupMwW86QDBXLdw15RqXA/VYBIkoea4WjiLu1772taoSUrfU3t6uK8fMKQsFKAQE/XMFRIFaPmoeAPvnd/eAXE23rat3YNXevy1kohjUM2CaQS0bx4JAQG5Y7Wcb6vmKwecKNe/nf/7n5Xvf+57eSxwrnU4rMeCe/rM/+7PTtvvkJz+p2/HZBJBDFMOenh5VgCEtflDA/pKXvEQJKUSM8fNAQWUfv/M7v6PXpVpw3diW+xLSVAmf+MQnlLBhUsC8YiTCXDDOq666Su6//35ZqEDdBBC0cmq6U3fce2cKVFwUkle84hWq/qIMHz9+vKptWYBikeLVr361vPe979Vam0rGON/61rf0J+8fGhpSdQqSyu+Q8e9+97vTHtPV4pDRYDBUxLnOlzMYFjvOZk2OqyV55StfeVqefiaTya1ZsyYXDodz3d3d+tpcanKK8drXvvaM1+QU4+mnny7M7yOPPDLlNX99yJEjR3LBYDC3adOmXDabLbznrrvu0ve8+c1vLjmPDvfcc09hf5dddlnun//5n3M9PT1Vn5PbLznuZ6sm5zOf+UxhzG95y1v0XMfHx6va1t2LO3bs0N9vvfXWktf2Pe95jz7v6pBmW5MzHf7X//pfut9IJJIbGBioersDBw5obn8oFDqt1sfB1Xd9/vOfLzz3zW9+U5/j80L+f3H91fe///3cD37wg9x8wc13ufuj0uvV1OS42grm7zWvec2UuaC2YcWKFfr6pz71qdO2/Y3f+A197eKLL8498MADU177l3/5F61VY475jDhQ/8Wc811DfYX/MweeeOIJrQ304/LLLy/Uj01MTEx5jc/vn/z
"text/plain": [
"<Figure size 840x700 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def run_dms_benchmark(csv_path):\n",
" \"\"\"Compare ESM-2 predictions with experimental DMS data.\"\"\"\n",
" print(f\"Loading DMS mutations from {csv_path}...\")\n",
" df = pd.read_csv(csv_path)\n",
" \n",
" preds = []\n",
" for mut in tqdm(df[\"mutant\"], desc=\"Scoring DMS mutations\"):\n",
" wt = mut[0]\n",
" u_pos = int(mut[1:-1])\n",
" mt = mut[-1]\n",
" llr = score_mutation_llr(u_pos, wt, mt)\n",
" preds.append(llr)\n",
" df[\"esm2_llr\"] = preds\n",
" \n",
" exp_cols = [c for c in [\"2500\", \"625\", \"156\", \"39\", \"0\"] if c in df.columns]\n",
" if exp_cols:\n",
" print(\"\\nSpearman |ρ| vs. experimental fitness:\")\n",
" for col in exp_cols:\n",
" mask = ~(df[col].isna() | df[\"esm2_llr\"].isna())\n",
" if mask.sum() > 0:\n",
" rho = spearmanr(df.loc[mask, \"esm2_llr\"], df.loc[mask, col])[0]\n",
" print(f\" {col:5} μg/mL: |ρ| = {abs(rho):.3f}\")\n",
" \n",
" top_col = None\n",
" for candidate in [\"2500\", \"1250\", \"1000\", \"625\", \"156\", \"39\"]:\n",
" if candidate in exp_cols:\n",
" top_col = candidate\n",
" break\n",
" \n",
" if top_col:\n",
" mask = ~(df[top_col].isna() | df[\"esm2_llr\"].isna())\n",
" x = df.loc[mask, \"esm2_llr\"].values\n",
" y = df.loc[mask, top_col].values\n",
" \n",
" plt.figure(figsize=(6, 5), dpi=140)\n",
" plt.scatter(x, y, s=8, alpha=0.6)\n",
" plt.xlabel(\"ESM-2 masked-marginal LLR (mut wt)\")\n",
" plt.ylabel(f\"Experimental fitness @ {top_col} μg/mL\")\n",
" rho = spearmanr(x, y)[0]\n",
" plt.title(f\"TEM-1 DMS: ESM-2 vs. fitness (ρ = {rho:.3f})\")\n",
" plt.tight_layout()\n",
" plt.show()\n",
" else:\n",
" print(\"No experimental concentration columns found.\")\n",
" \n",
" return df\n",
"\n",
"dms_df = run_dms_benchmark(\n",
" \"data/BLAT_ECOLX_Ranganathan2015.csv\",\n",
")"
]
},
{
"cell_type": "markdown",
"id": "8c76a31f",
"metadata": {},
"source": [
"The scatter plot shows a moderate correlation (ρ = 0.556) between ESM-2 predictions and experimental fitness, but the relationship is noisy with substantial variation around the trend. In general, more negative ESM-2 scores tend to correspond to lower experimental fitness, yet many individual mutations deviate from this pattern. Some mutations with highly negative LLR scores still maintain reasonable fitness, while others with only modestly negative scores perform poorly. The increase in correlation strength with higher antibiotic concentration (from |ρ| = 0.164 to |ρ| = 0.556) suggests that ESM-2s evolutionary-based predictions become more relevant under strong selective pressure. However, even at the highest concentration, the considerable scatter indicates that predicting mutation effects remains challenging and that experimental fitness is influenced by factors beyond what ESM-2 can infer from sequence patterns alone. It's important to note that these results are based on the 35M parameter ESM-2 model, and it is likely that larger models would achieve stronger correlations and improved predictive performance."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}