Newer
Older
{
"cells": [
{
"cell_type": "markdown",
"id": "bb55010f",
"metadata": {},
"source": [
"### WARIO: Weighted fAmilies of contact maps to chaRacterize conformational ensembles of (highly-)flexIble prOteins\n",
"This is the Jupyter notebook implementing the pipeline that defines WARIO. The notebook takes an ensemble as input, features its conformations using contact information and perform a clustering algorithm to retrieve the ensemble characterization. The user can follow the guidelines presented below to implement WARIO step-by-step. First, the required libraries must be imported together with the experimental parameters that calibrate the contact function."
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "09b972d2",
"metadata": {},
"outputs": [],
"source": [
"# Required imports\n",
"import ipynb\n",
"from ipynb.fs.full.contact_features import *\n",
"from ipynb.fs.full.utils import *\n",
"\n",
"import os\n",
"import numpy as np\n",
"import hdbscan\n",
"path_to_notebook = os.path.abspath(os.getcwd())\n",
"#print(path_to_notebook) # Visualize the notebook directory\n",
"\n",
"th_file = \"/\".join([path_to_notebook,'contact_thresholds_range.txt']) # Load experimentally determined contact parameters"
]
},
{
"cell_type": "markdown",
"id": "0d71c087",
"metadata": {},
"source": [
"#### Set the folder containing the ensemble information\n",
"\n",
"The input ensemble must be given as a directory containing trajectory and topology information. The folder must contain only:\n",
"\n",
"* One .xtc file together with a topology file in one of the formats admitted by [MDTraj](https://www.mdtraj.org/1.9.8.dev0/api/generated/mdtraj.load.html#mdtraj.load) or,\n",
"* One multiframe .pdb file or,\n",
"* If the ensemble is given as a list of .pdb files (one file per conformation), one folder containing one .pdb file per conformation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "402b3b92",
"metadata": {},
"outputs": [],
"source": [
"ensemble_folder = '/directory_with_data/'\n",
"ensemble_name = 'my_ensemble' \n",
"\n",
"# To run an example with real data, uncomment:\n",
"#ensemble_folder = \"/\".join([path_to_notebook,'P113']) # Path to the folder containing trajectory files\n",
"#ensemble_name = \"P113\" # Name the ensemble"
]
},
{
"cell_type": "markdown",
"id": "b8029a85",
"metadata": {},
"source": [
"We ask you not to choose ```ensemble_folder``` as a folder where multiple or redundant ensembles are located, but to create a specific directory per ensemble. The analysis can be launched in an interactive mode, which checks with the user whether the input data has been correctly understood. This mode is activated by setting the parameter ```interactive``` to ```True``` in the function ```contact_features``` described below. If you chose to set ```interactive``` to ```False```, the function will automatically interpret the input and launch the computation without checking with the user.\n",
"\n",
"All the results produced along the pipeline will be saved in subdirectories of ```ensemble_folder```. If the computation for the same ensemble has to be repeated from the beggining, we ask you to empty ```ensemble_folder``` or to create a new directory containing only the ensemble input data."
]
},
{
"cell_type": "markdown",
"id": "c3593eaf",
"metadata": {},
"source": [
"### 1. Data featurization\n",
"#### Compute contact weights and embed data into a low-dimensional UMAP space\n",
"\n",
"The function ```contact_features``` takes the input ensemble data and returns the $n\\times L(L-1)/2$ matrix $W$ containing the contact information, where $n$ is the number of conformations and $L$ the sequence length. The output is automatically saved in ``ensemble_folder`` as a .txt file. The arguments of ```contact_features``` are:\n",
"\n",
"* ```ensemble_name```: The name of the ensemble, to appear in the results files and plots,\n",
"* ```ensemble_path```: The directory containing the ensemble information, as described above,\n",
"* ```N_cores```: The number of threads to use in parallel computation,\n",
"* ```interactive```: Whether to check with the user that data has been correctly understood,\n",
"* ```thresholds```: The file containing the parameters that calibrate the contact function.\n",
"\n",
"The function computes and saves $W$ and its embeddings into a 2-dimensional and 10-dimensional UMAP spaces."
]
},
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
{
"cell_type": "markdown",
"id": "c9a6f349",
"metadata": {},
"source": [
"If the entire sequence has to be analyzed, run the following line:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e4218615",
"metadata": {},
"outputs": [],
"source": [
"subseq = None"
]
},
{
"cell_type": "markdown",
"id": "fdb9bf41",
"metadata": {},
"source": [
"Otherwise, an array specifying the residue positions to be set can be chosen:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc4a2cd0",
"metadata": {},
"outputs": [],
"source": [
"#subseq = np.concatenate([np.arange(100, 151, 1), np.arange(170, 181, 1)]) # Example"
]
},
{
"cell_type": "markdown",
"id": "8ba9a7d7",
"metadata": {},
"source": [
"Once the value of this last argument is chosen, the function can be run:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "70f5c839",
"metadata": {},
"outputs": [],
"source": [
"contact_features(ensemble_name = ensemble_name, ensemble_path = ensemble_folder, N_cores = 1, interactive = True, thresholds = th_file, subsequence = subseq)"
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
]
},
{
"cell_type": "markdown",
"id": "e98848e5",
"metadata": {},
"source": [
"### 2. Contact-based clustering\n",
"\n",
"Once data has been featured with contact information, and embedded into a low-dimensional UMAP space, clustering can be performed to get the ensemble characterization.\n",
"\n",
"#### Perform HDBSCAN clustering in the low-dimensional space"
]
},
{
"cell_type": "markdown",
"id": "75f335b4",
"metadata": {},
"source": [
"First, load the featured data frame and its embeddings into the low-dimensional UMAP spaces. These have been automatically saved by ```contact_features``` in the following directories."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "80d44e80",
"metadata": {},
"outputs": [],
"source": [
"# Directory containing results\n",
"results_path = \"/\".join([os.path.abspath(ensemble_folder),\"_\".join(['results',ensemble_name])])\n",
"\n",
"# Matrix W with contact information\n",
"wcont_data = pd.read_csv(\"/\".join([results_path, \"_\".join([ensemble_name,'wcontmatrix.txt'])]), sep = ' ', header = None)\n",
"\n",
"# Embedding of W into a 2-dimensional UMAP space for visualization\n",
"embedding_2d = np.load('/'.join([results_path, \"_\".join([ensemble_name,'embedding_2d_wcont.npy'])]))\n",
"\n",
"# Embedding of W into a 10-dimensional UMAP space for clustering\n",
"clusterable_embedding = np.load('/'.join([results_path, \"_\".join([ensemble_name,'clusterable_embedding_wcont.npy'])]))"
]
},
{
"cell_type": "markdown",
"id": "a53758cd",
"metadata": {},
"source": [
"Now, we can perform HDBSCAN clustering in the low-dimensional UMAP space.\n",
"\n",
"**Minimum cluster size**\n",
"\n",
"The clustering is calibrated through the minimum number of conformations that can define a cluster. This quantity will indirectly define the number of retrieved clusters. The user might choose an appropiate value according to the desired level of precision in the classification and the encountered results. The default minimum cluster size is set to 1% of the total number of conformations. Large sequences might require a smaller minimum cluster size."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "76c8d967",
"metadata": {},
"outputs": [],
"source": [
"# Choose minimum cluster size\n",
"min_cluster_size = int(wcont_data.shape[0]*0.01) # Default = 1% of individuals"
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c26a5745",
"metadata": {},
"outputs": [],
"source": [
"# Perform HDBSCAN clustering\n",
"labels_umap = hdbscan.HDBSCAN(\n",
" min_samples = 10,\n",
" min_cluster_size = min_cluster_size, \n",
").fit_predict(clusterable_embedding)\n",
"\n",
"classified = np.where(labels_umap >= 0)[0]\n",
"print(\"\".join([\"\\nThe clustering algorithm found \",str(len(np.unique(labels_umap[labels_umap >= 0]))),\" clusters and classified the \",str(np.round(100*len(classified)/len(labels_umap),2)),\"% of conformations. \\n\"]))"
]
},
{
"cell_type": "markdown",
"id": "53f0ce88",
"metadata": {},
"source": [
"#### Results visualization\n",
"\n",
"Clustering partition visualized on the 2-dimensional UMAP space. This illustrates the repartition of conformations among clusters and their corresponding occupancy. By looking at the number of connected components in the space, the minimum cluster size might be re-calibrated. Note that unclassified points appear in gray.\n"
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "616b989a",
"metadata": {},
"outputs": [],
"source": [
"plot_2umap(embedding_2d, labels_umap, ensemble_name, results_path)"
]
},
{
"cell_type": "markdown",
"id": "dc831da1",
"metadata": {},
"source": [
"### 3. Cluster-specific $\\omega$-contact maps\n",
"\n",
"Compute the ensemble characterization defined as a weighted family of cluster-specific $\\omega$-contact maps."
]
},
{
"cell_type": "markdown",
"id": "9e821614",
"metadata": {},
"source": [
"#### 3.1 Plot cluster-specific $\\omega$-contact maps"
]
},
{
"cell_type": "markdown",
"id": "396acb0c",
"metadata": {},
"source": [
"Run the cell below to create and save cluster-specific $\\omega$-contact maps. Plots are saved in a new subdirectory of ```ensemble_folder```."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "96f6dcf0",
"metadata": {},
"outputs": [],
"source": [
"get_wmaps(wcont_data, labels_umap, ensemble_name, results_path, subsequence = subseq)"
]
},
{
"cell_type": "markdown",
"id": "f9916b8d",
"metadata": {},
"source": [
"#### 3.2 Create cluster-specific files (for .xtc trajectories) or folders (for .pdb folders)\n",
"\n",
"The clustering partition can be used to build new cluster-specific files containing the cluster conformations. The files are automatically saved in a new directory inside of ```ensemble_folder```. Note that the files will be produced in the same format as the one of the input data: one .xtc file per cluster (if data is given as a .xtc file with a topology file, or as a multiframe .pdb file) or a folder per cluster containing its conformations (if data is given as a folder with one .pdb file per conformation)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b5d146a1",
"metadata": {},
"outputs": [],
"source": [
"get_cluster_files(ensemble_path = ensemble_folder, ensemble_name = ensemble_name, labels_umap = labels_umap)"
]
},
{
"cell_type": "markdown",
"id": "290cce9e",
"metadata": {},
"source": [
"#### 3.3 Sample a representative family of conformations\n",
"\n",
"The ensemble characterization can be used to sample a representative family of conformations of a given size. This is done by sampling conformations from clusters with probabilites given by the cluster occupancies. In other words, if $p_1,\\ldots,p_K$ are the (normalized) occupancies of clusters $\\mathcal{C}_1,\\ldots,\\mathcal{C}_K$ respectively, sample from the distribution\n",
"\n",
"$$\n",
" p_1\\mathcal{U}(\\mathcal{C}_1)+\\cdots + p_K\\mathcal{U}(\\mathcal{C}_K),\n",
"$$\n",
"\n",
"where $\\mathcal{U}(\\mathcal{S})$ denotes the discrete uniform distribution on the set $\\mathcal{S}\\subset\\lbrace 1,\\ldots,n\\rbrace$. This is performed by the function ```representative_ensemble``` below, which needs to be given the ```size``` of the representative family (in number of conformations) as an argument."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2eb7586a",
"metadata": {},
"outputs": [],
"source": [
"representative_ensemble(size = 10, ensemble_path = ensemble_folder, ensemble_name = ensemble_name, labels_umap = labels_umap)"
]
},
{
"cell_type": "markdown",
"id": "b89412b2",
"metadata": {},
"source": [
"#### 3.4 Secondary structure propensities and average radius of gyration\n",
"\n",
"The function below computes the average DSSP secondary structure propensities per cluster, together with the average radius of gyration across cluster conformations. One plot per cluster is produced, and they are automatically saved in a subdirectory of ```ensemble_folder```. The DSSP categories are defined [here](https://mdtraj.org/1.9.4/api/generated/mdtraj.compute_dssp.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "68305f88",
"metadata": {},
"outputs": [],
"source": [
"cluster_descriptors(ensemble_path = ensemble_folder, ensemble_name = ensemble_name, labels_umap = labels_umap, subsequence = subseq)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}