From f80a553abfef8975d9f1913c98a39c8b788a5604 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Javier=20Gonz=C3=A1lez-Delgado?= <jgonzalezd@laas.fr>
Date: Wed, 15 May 2024 17:30:42 +0000
Subject: [PATCH] Save contact data as .h5 files + add supplementary plotting
 parameters

---
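Note for reviewers: the sampling scheme described in section 3.4 of the notebook (pick a cluster with probability equal to its normalized occupancy, then pick a conformation uniformly inside it) can be sketched as below. This is only an illustration under stated assumptions, not the code of representative_ensemble: the name sample_representative is hypothetical, negative labels are assumed to mark unclassified conformations, and draws are made with replacement.

```python
import numpy as np

def sample_representative(labels, size, seed=None):
    """Draw `size` indices from the mixture p_1 U(C_1) + ... + p_K U(C_K).

    Hypothetical helper for illustration; not part of the patched notebook.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    # Assumption: negative labels mark unclassified conformations.
    clusters, counts = np.unique(labels[labels >= 0], return_counts=True)
    if clusters.size == 0:
        raise ValueError("no classified conformations to sample from")
    occupancies = counts / counts.sum()  # normalized p_1, ..., p_K
    # Stage 1: choose a cluster for each draw, with probability p_k.
    drawn = rng.choice(clusters, size=size, p=occupancies)
    # Stage 2: choose one conformation uniformly inside each drawn cluster.
    return np.array([rng.choice(np.flatnonzero(labels == k)) for k in drawn])

# Toy labels: three clusters (0, 1, 2) and two unclassified points (-1).
labels = np.array([0, 0, 0, 1, 1, -1, 2, 0, 1, -1])
sample = sample_representative(labels, size=5, seed=0)
```

Since each p_k is proportional to the cluster size, this two-stage draw has the same distribution as a uniform draw over all classified conformations; the two stages are kept separate only to mirror the formula.
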
 wario/contact_clustering.ipynb | 97 ++++++++++++++++++++++++++--------
 1 file changed, 74 insertions(+), 23 deletions(-)

diff --git a/wario/contact_clustering.ipynb b/wario/contact_clustering.ipynb
index adf7456..6202a76 100644
--- a/wario/contact_clustering.ipynb
+++ b/wario/contact_clustering.ipynb
@@ -160,7 +160,7 @@
    "id": "75f335b4",
    "metadata": {},
    "source": [
-    "First, load the featured data frame and its embeddings into the low-dimensional UMAP spaces. These have been automatically saved by ```contact_features``` in the following directories."
+    "First, load the data embeddings into the low-dimensional UMAP spaces. These have been automatically saved by ```contact_features``` in the following directories."
    ]
   },
   {
@@ -173,9 +173,6 @@
     "# Directory containing results\n",
     "results_path = \"/\".join([os.path.abspath(ensemble_folder),\"_\".join(['results',ensemble_name])])\n",
     "\n",
-    "# Matrix W with contact information\n",
-    "wcont_data = pd.read_csv(\"/\".join([results_path, \"_\".join([ensemble_name,'wcontmatrix.txt'])]), sep = ' ', header = None)\n",
-    "\n",
     "# Embedding of W into a 2-dimensional UMAP space for visualization\n",
     "embedding_2d = np.load('/'.join([results_path, \"_\".join([ensemble_name,'embedding_2d_wcont.npy'])]))\n",
     "\n",
@@ -230,7 +227,9 @@
    "source": [
     "#### Results visualization\n",
     "\n",
-    "Clustering partition visualized on the 2-dimensional UMAP space. This illustrates the repartition of conformations among clusters and their corresponding occupancy. By looking at the number of connected components in the space, the minimum cluster size might be re-calibrated. Note that unclassified points appear in gray.\n"
+    "The clustering partition is visualized on the 2-dimensional UMAP space. This illustrates how conformations are distributed among clusters, together with the corresponding occupancies. The number of connected components in the space can guide a recalibration of the minimum cluster size. Note that unclassified points appear in gray.\n",
+    "\n",
+    "The figure produced by the function can be formatted by customizing the plotting parameters below."
    ]
   },
   {
@@ -243,6 +242,17 @@
     "plot_2umap(embedding_2d, labels_umap, ensemble_name, results_path)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "9e5b8993",
+   "metadata": {},
+   "source": [
+    "_Supplementary plotting parameters_\n",
+    "\n",
+    "* ```pdf```: Whether to save the figure in .pdf format. If ```False```, figure is saved as .png. Default is ```False```.\n",
+    "* ```dpi_png```: Resolution of the .png file. Ignored if ```pdf = True```. Default is ```dpi_png = 200```."
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "dc831da1",
@@ -266,7 +276,9 @@
    "id": "396acb0c",
    "metadata": {},
    "source": [
-    "Run the cell below to create and save cluster-specific $\\omega$-contact maps. Plots are saved in a new subdirectory of ```ensemble_folder```."
+    "Run the cell below to create and save cluster-specific $\\omega$-contact maps. Plots are saved in a new subdirectory of ```ensemble_folder```.\n",
+    "\n",
+    "The figures produced by the function can be formatted by customizing the plotting parameters below. This can be useful to adapt the representation to the sequence length."
    ]
   },
   {
@@ -276,7 +288,27 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "get_wmaps(wcont_data, labels_umap, ensemble_name, results_path, subsequence = subseq)"
+    "get_wmaps(labels_umap, ensemble_name, results_path, subsequence = subseq)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "972c177f",
+   "metadata": {},
+   "source": [
+    "_Supplementary plotting parameters_\n",
+    "\n",
+    "* ```pdf```: Whether to save the figure in .pdf format. If ```False```, the figure is saved as .png. Default is ```False```.\n",
+    "* ```dpi_png```: Resolution of the .png file. Ignored if ```pdf = True```. Default is ```dpi_png = 500```.\n",
+    "* ```fontsize_title```: Title font size. Default is ```fontsize_title = 10```.\n",
+    "* ```fontsize_axis```: Axes font size. Default is ```fontsize_axis = 5```.\n",
+    "* ```fontsize_suptitle```: Font size of the figure-level title (suptitle). Default is ```fontsize_suptitle = 12```.\n",
+    "* ```shrink_cbar```: Shrink parameter of [matplotlib's colorbar](https://matplotlib.org/stable/api/figure_api.html#matplotlib.figure.Figure.colorbar). Default is ```shrink_cbar = .5```.\n",
+    "* ```xticks_angle```: Rotation angle of the x-axis ticks. Default is ```xticks_angle = 90```.\n",
+    "\n",
+    "_Add lines to help identify contacts_\n",
+    "\n",
+    "Straight lines can be superimposed on the plot to aid a preliminary visual inspection of a particular region. To do so, pass the parameter ```marks```: an array of sequence positions at which vertical and horizontal lines are drawn (e.g. ```marks = [10, 35, 90, 100]```). Default is ```None```."
    ]
   },
   {
@@ -301,48 +333,67 @@
   },
   {
    "cell_type": "markdown",
-   "id": "290cce9e",
+   "id": "b89412b2",
    "metadata": {},
    "source": [
-    "#### 3.3 Sample a representative family of conformations\n",
-    "\n",
-    "The ensemble characterization can be used to sample a representative family of conformations of a given size. This is done by sampling conformations from clusters with probabilites given by the cluster occupancies. In other words, if $p_1,\\ldots,p_K$ are the (normalized) occupancies of clusters $\\mathcal{C}_1,\\ldots,\\mathcal{C}_K$ respectively, sample from the distribution\n",
+    "#### 3.3 Secondary structure propensities and average radius of gyration\n",
     "\n",
-    "$$\n",
-    " p_1\\mathcal{U}(\\mathcal{C}_1)+\\cdots + p_K\\mathcal{U}(\\mathcal{C}_K),\n",
-    "$$\n",
+    "The function below computes the average DSSP secondary structure propensities per cluster, together with the average radius of gyration across cluster conformations. One plot per cluster is produced and automatically saved in a subdirectory of ```ensemble_folder```. The DSSP categories are defined [here](https://mdtraj.org/1.9.4/api/generated/mdtraj.compute_dssp.html).\n",
     "\n",
-    "where $\\mathcal{U}(\\mathcal{S})$ denotes the discrete uniform distribution on the set $\\mathcal{S}\\subset\\lbrace 1,\\ldots,n\\rbrace$. This is performed by the function ```representative_ensemble``` below, which needs to be given the ```size``` of the representative family (in number of conformations) as an argument."
+    "The figures produced by the function can be formatted by customizing the plotting parameters below. This can be useful to adapt the representation to the sequence length."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "2eb7586a",
+   "id": "68305f88",
    "metadata": {},
    "outputs": [],
    "source": [
-    "representative_ensemble(size = 10, ensemble_path = ensemble_folder, ensemble_name = ensemble_name, labels_umap = labels_umap)"
+    "cluster_descriptors(ensemble_path = ensemble_folder, ensemble_name = ensemble_name, labels_umap = labels_umap, subsequence = subseq)"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "b89412b2",
+   "id": "8a5bb9cb",
    "metadata": {},
    "source": [
-    "#### 3.4 Secondary structure propensities and average radius of gyration\n",
+    "_Supplementary plotting parameters_\n",
     "\n",
-    "The function below computes the average DSSP secondary structure propensities per cluster, together with the average radius of gyration across cluster conformations. One plot per cluster is produced, and they are automatically saved in a subdirectory of ```ensemble_folder```. The DSSP categories are defined [here](https://mdtraj.org/1.9.4/api/generated/mdtraj.compute_dssp.html)."
+    "* ```pdf```: Whether to save the figure in .pdf format. If ```False```, the figure is saved as .png. Default is ```False```.\n",
+    "* ```dpi_png```: Resolution of the .png file. Ignored if ```pdf = True```. Default is ```dpi_png = 200```.\n",
+    "* ```fig_width```: Figure width. Default is ```fig_width = 10```.\n",
+    "* ```fig_height```: Figure height. Default is ```fig_height = 1.7```.\n",
+    "* ```fontsize_title```: Title font size. Default is ```fontsize_title = 8```.\n",
+    "* ```fontsize_axis```: Axes font size. Default is ```fontsize_axis = 7```.\n",
+    "* ```shrink_cbar```: Shrink parameter of [matplotlib's colorbar](https://matplotlib.org/stable/api/figure_api.html#matplotlib.figure.Figure.colorbar). Default is ```shrink_cbar = .7```.\n",
+    "* ```yticks_angle```: Rotation angle of the y-axis ticks. Default is ```yticks_angle = 0```."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "290cce9e",
+   "metadata": {},
+   "source": [
+    "#### 3.4 Sample a representative family of conformations\n",
+    "\n",
+    "The ensemble characterization can be used to sample a representative family of conformations of a given size. This is done by sampling conformations from clusters with probabilities given by the cluster occupancies. In other words, if $p_1,\\ldots,p_K$ are the (normalized) occupancies of clusters $\\mathcal{C}_1,\\ldots,\\mathcal{C}_K$ respectively, sample from the distribution\n",
+    "\n",
+    "$$\n",
+    " p_1\\mathcal{U}(\\mathcal{C}_1)+\\cdots + p_K\\mathcal{U}(\\mathcal{C}_K),\n",
+    "$$\n",
+    "\n",
+    "where $\\mathcal{U}(\\mathcal{S})$ denotes the discrete uniform distribution on the set $\\mathcal{S}\\subset\\lbrace 1,\\ldots,n\\rbrace$. This is performed by the function ```representative_ensemble``` below, which takes the ```size``` of the representative family (in number of conformations) as an argument."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "68305f88",
+   "id": "2eb7586a",
    "metadata": {},
    "outputs": [],
    "source": [
-    "cluster_descriptors(ensemble_path = ensemble_folder, ensemble_name = ensemble_name, labels_umap = labels_umap, subsequence = subseq)"
+    "representative_ensemble(size = 10, ensemble_path = ensemble_folder, ensemble_name = ensemble_name, labels_umap = labels_umap)"
    ]
   }
  ],
@@ -362,7 +413,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.9"
+   "version": "3.10.12"
   }
  },
  "nbformat": 4,
-- 
GitLab