contact_clustering.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "bb55010f",
   "metadata": {},
   "source": [
    "### WARIO: Weighted fAmilies of contact maps to chaRacterize conformational ensembles of (highly-)flexIble prOteins\n",
    "\n",
    "This is the Jupyter notebook implementing the pipeline that defines WARIO. The notebook takes an ensemble as input, features its conformations using contact information and perform a clustering algorithm to retrieve the ensemble characterization. The user can follow the guidelines presented below to implement WARIO step-by-step. First, the required libraries must be imported together with the experimental parameters that calibrate the contact function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "09b972d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Required imports\n",
    "import ipynb\n",
    "from ipynb.fs.full.contact_features import *\n",
    "from ipynb.fs.full.utils import *\n",
    "\n",
    "import os\n",
    "import numpy as np\n",
    "import hdbscan\n",
    "path_to_notebook = os.path.abspath(os.getcwd())\n",
    "#print(path_to_notebook) # Visualize the notebook directory\n",
    "\n",
    "th_file = \"/\".join([path_to_notebook,'contact_thresholds_range.txt']) # Load experimentally determined contact parameters"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d71c087",
   "metadata": {},
   "source": [
    "#### Set the folder containing the ensemble information\n",
    "\n",
    "The input ensemble must be given as a directory containing trajectory and topology information. The folder must contain only:\n",
    "\n",
    "* One .xtc file together with a topology file in one of the formats admitted by [MDTraj](https://www.mdtraj.org/1.9.8.dev0/api/generated/mdtraj.load.html#mdtraj.load) or,\n",
    "* One multiframe .pdb file or,\n",
    "* If the ensemble is given as a list of .pdb files (one file per conformation), one folder containing one .pdb file per conformation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "402b3b92",
   "metadata": {},
   "outputs": [],
   "source": [
    "ensemble_folder = '/directory_with_data/'\n",
    "ensemble_name = 'my_ensemble' \n",
    "\n",
    "# To run an example with real data, uncomment:\n",
    "#ensemble_folder = \"/\".join([path_to_notebook,'P113']) # Path to the folder containing trajectory files\n",
    "#ensemble_name = \"P113\" # Name the ensemble"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8029a85",
   "metadata": {},
   "source": [
    "We ask you not to choose ```ensemble_folder``` as a folder where multiple or redundant ensembles are located, but to create a specific directory per ensemble. The analysis can be launched in an interactive mode, which checks with the user whether the input data has been correctly understood. This mode is activated by setting the parameter ```interactive``` to ```True``` in the function ```contact_features``` described below. If you chose to set ```interactive``` to ```False```, the function will automatically interpret the input and launch the computation without checking with the user.\n",
    "\n",
    "All the results produced along the pipeline will be saved in subdirectories of ```ensemble_folder```. If the computation for the same ensemble has to be repeated from the beggining, we ask you to empty ```ensemble_folder``` or to create a new directory containing only the ensemble input data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3593eaf",
   "metadata": {},
   "source": [
    "### 1. Data featurization\n",
    "#### Compute contact weights and embed data into a low-dimensional UMAP space\n",
    "\n",
    "The function ```contact_features``` takes the input ensemble data and returns the $n\\times L(L-1)/2$ matrix $W$ containing the contact information, where $n$ is the number of conformations and $L$ the sequence length. The output is automatically saved in ``ensemble_folder`` as a .txt file. The arguments of ```contact_features``` are:\n",
    "\n",
    "* ```ensemble_name```: The name of the ensemble, to appear in the results files and plots,\n",
    "* ```ensemble_path```: The directory containing the ensemble information, as described above,\n",
    "* ```N_cores```: The number of threads to use in parallel computation,\n",
    "* ```interactive```: Whether to check with the user that data has been correctly understood,\n",
    "* ```thresholds```: The file containing the parameters that calibrate the contact function.\n",
    "\n",
    "The function computes and saves $W$ and its embeddings into a 2-dimensional and 10-dimensional UMAP spaces."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9a6f349",
   "metadata": {},
   "source": [
    "If the entire sequence has to be analyzed, run the following line:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4218615",
   "metadata": {},
   "outputs": [],
   "source": [
    "subseq = None"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fdb9bf41",
   "metadata": {},
   "source": [
    "Otherwise, an array specifying the residue positions to be set can be chosen:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cc4a2cd0",
   "metadata": {},
   "outputs": [],
   "source": [
    "#subseq = np.concatenate([np.arange(100, 151, 1), np.arange(170, 181, 1)]) # Example"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8ba9a7d7",
   "metadata": {},
   "source": [
    "Once the value of this last argument is chosen, the function can be run:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70f5c839",
   "metadata": {},
   "outputs": [],
   "source": [
    "contact_features(ensemble_name = ensemble_name, ensemble_path = ensemble_folder, N_cores = 1, interactive = True, thresholds = th_file, subsequence = subseq)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e98848e5",
   "metadata": {},
   "source": [
    "### 2. Contact-based clustering\n",
    "\n",
    "Once data has been featured with contact information, and embedded into a low-dimensional UMAP space, clustering can be performed to get the ensemble characterization.\n",
    "\n",
    "#### Perform HDBSCAN clustering in the low-dimensional space"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75f335b4",
   "metadata": {},
   "source": [
    "First, load the featured data frame and its embeddings into the low-dimensional UMAP spaces. These have been automatically saved by ```contact_features``` in the following directories."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80d44e80",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Directory containing results\n",
    "results_path = \"/\".join([os.path.abspath(ensemble_folder),\"_\".join(['results',ensemble_name])])\n",
    "\n",
    "# Matrix W with contact information\n",
    "wcont_data = pd.read_csv(\"/\".join([results_path, \"_\".join([ensemble_name,'wcontmatrix.txt'])]), sep = ' ', header = None)\n",
    "\n",
    "# Embedding of W into a 2-dimensional UMAP space for visualization\n",
    "embedding_2d = np.load('/'.join([results_path, \"_\".join([ensemble_name,'embedding_2d_wcont.npy'])]))\n",
    "\n",
    "# Embedding of W into a 10-dimensional UMAP space for clustering\n",
    "clusterable_embedding = np.load('/'.join([results_path, \"_\".join([ensemble_name,'clusterable_embedding_wcont.npy'])]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a53758cd",
   "metadata": {},
   "source": [
    "Now, we can perform HDBSCAN clustering in the low-dimensional UMAP space.\n",
    "\n",
    "**Minimum cluster size**\n",
    "\n",
    "The clustering is calibrated through the minimum number of conformations that can define a cluster. This quantity will indirectly define the number of retrieved clusters. The user might choose an appropiate value according to the desired level of precision in the classification and the encountered results. The default minimum cluster size is set to 1% of the total number of conformations. Large sequences might require a smaller minimum cluster size."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "76c8d967",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose minimum cluster size\n",
    "min_cluster_size = int(wcont_data.shape[0]*0.01) # Default = 1% of individuals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c26a5745",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Perform HDBSCAN clustering\n",
    "labels_umap = hdbscan.HDBSCAN(\n",
    "    min_samples = 10,\n",
    "    min_cluster_size = min_cluster_size, \n",
    ").fit_predict(clusterable_embedding)\n",
    "\n",
    "classified = np.where(labels_umap >= 0)[0]\n",
    "print(\"\".join([\"\\nThe clustering algorithm found \",str(len(np.unique(labels_umap[labels_umap >= 0]))),\" clusters and classified the \",str(np.round(100*len(classified)/len(labels_umap),2)),\"% of conformations. \\n\"]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53f0ce88",
   "metadata": {},
   "source": [
    "#### Results visualization\n",
    "\n",
    "Clustering partition visualized on the 2-dimensional UMAP space. This illustrates the repartition of conformations among clusters and their corresponding occupancy. By looking at the number of connected components in the space, the minimum cluster size might be re-calibrated. Note that unclassified points appear in gray.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "616b989a",
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_2umap(embedding_2d, labels_umap, ensemble_name, results_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc831da1",
   "metadata": {},
   "source": [
    "### 3. Cluster-specific $\\omega$-contact maps\n",
    "\n",
    "Compute the ensemble characterization defined as a weighted family of cluster-specific $\\omega$-contact maps."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e821614",
   "metadata": {},
   "source": [
    "#### 3.1 Plot cluster-specific $\\omega$-contact maps"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "396acb0c",
   "metadata": {},
   "source": [
    "Run the cell below to create and save cluster-specific $\\omega$-contact maps. Plots are saved in a new subdirectory of ```ensemble_folder```."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "96f6dcf0",
   "metadata": {},
   "outputs": [],
   "source": [
    "get_wmaps(wcont_data, labels_umap, ensemble_name, results_path, subsequence = subseq)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f9916b8d",
   "metadata": {},
   "source": [
    "#### 3.2 Create cluster-specific files (for .xtc trajectories) or folders (for .pdb folders)\n",
    "\n",
    "The clustering partition can be used to build new cluster-specific files containing the cluster conformations. The files are automatically saved in a new directory inside of ```ensemble_folder```. Note that the files will be produced in the same format as the one of the input data: one .xtc file per cluster (if data is given as a .xtc file with a topology file, or as a multiframe .pdb file) or a folder per cluster containing its conformations (if data is given as a folder with one .pdb file per conformation)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5d146a1",
   "metadata": {},
   "outputs": [],
   "source": [
    "get_cluster_files(ensemble_path = ensemble_folder, ensemble_name = ensemble_name, labels_umap = labels_umap)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "290cce9e",
   "metadata": {},
   "source": [
    "#### 3.3 Sample a representative family of conformations\n",
    "\n",
    "The ensemble characterization can be used to sample a representative family of conformations of a given size. This is done by sampling conformations from clusters with probabilites given by the cluster occupancies. In other words, if $p_1,\\ldots,p_K$ are the (normalized) occupancies of clusters $\\mathcal{C}_1,\\ldots,\\mathcal{C}_K$ respectively, sample from the distribution\n",
    "\n",
    "$$\n",
    " p_1\\mathcal{U}(\\mathcal{C}_1)+\\cdots + p_K\\mathcal{U}(\\mathcal{C}_K),\n",
    "$$\n",
    "\n",
    "where $\\mathcal{U}(\\mathcal{S})$ denotes the discrete uniform distribution on the set $\\mathcal{S}\\subset\\lbrace 1,\\ldots,n\\rbrace$. This is performed by the function ```representative_ensemble``` below, which needs to be given the ```size``` of the representative family (in number of conformations) as an argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2eb7586a",
   "metadata": {},
   "outputs": [],
   "source": [
    "representative_ensemble(size = 10, ensemble_path = ensemble_folder, ensemble_name = ensemble_name, labels_umap = labels_umap)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b89412b2",
   "metadata": {},
   "source": [
    "#### 3.4 Secondary structure propensities and average radius of gyration\n",
    "\n",
    "The function below computes the average DSSP secondary structure propensities per cluster, together with the average radius of gyration across cluster conformations. One plot per cluster is produced, and they are automatically saved in a subdirectory of ```ensemble_folder```. The DSSP categories are defined [here](https://mdtraj.org/1.9.4/api/generated/mdtraj.compute_dssp.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "68305f88",
   "metadata": {},
   "outputs": [],
   "source": [
    "cluster_descriptors(ensemble_path = ensemble_folder, ensemble_name = ensemble_name, labels_umap = labels_umap, subsequence = subseq)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}