Skip to content
Snippets Groups Projects
Commit f80a553a authored by Javier González-Delgado's avatar Javier González-Delgado
Browse files

Save contact data as .h5 files + add supplementary plotting parameters

parent dba809e9
No related branches found
No related tags found
No related merge requests found
%% Cell type:markdown id:bb55010f tags:
### WARIO: Weighted fAmilies of contact maps to chaRacterize conformational ensembles of (highly-)flexIble prOteins
This is the Jupyter notebook implementing the pipeline that defines WARIO. The notebook takes an ensemble as input, features its conformations using contact information and perform a clustering algorithm to retrieve the ensemble characterization. The user can follow the guidelines presented below to implement WARIO step-by-step. First, the required libraries must be imported together with the experimental parameters that calibrate the contact function.
%% Cell type:code id:09b972d2 tags:
``` python
# Required imports
import ipynb
from ipynb.fs.full.contact_features import *
from ipynb.fs.full.utils import *
import os
import numpy as np
import hdbscan
path_to_notebook = os.path.abspath(os.getcwd())
#print(path_to_notebook) # Visualize the notebook directory
th_file = "/".join([path_to_notebook,'contact_thresholds_range.txt']) # Load experimentally determined contact parameters
```
%% Cell type:markdown id:0d71c087 tags:
#### Set the folder containing the ensemble information
The input ensemble must be given as a directory containing trajectory and topology information. The folder must contain only:
* One .xtc file together with a topology file in one of the formats admitted by [MDTraj](https://www.mdtraj.org/1.9.8.dev0/api/generated/mdtraj.load.html#mdtraj.load) or,
* One multiframe .pdb file or,
* If the ensemble is given as a list of .pdb files (one file per conformation), one folder containing one .pdb file per conformation.
%% Cell type:code id:402b3b92 tags:
``` python
ensemble_folder = '/directory_with_data/'
ensemble_name = 'my_ensemble'
# To run an example with real data, uncomment:
#ensemble_folder = "/".join([path_to_notebook,'P113']) # Path to the folder containing trajectory files
#ensemble_name = "P113" # Name the ensemble
```
%% Cell type:markdown id:b8029a85 tags:
We ask you not to choose ```ensemble_folder``` as a folder where multiple or redundant ensembles are located, but to create a specific directory per ensemble. The analysis can be launched in an interactive mode, which checks with the user whether the input data has been correctly understood. This mode is activated by setting the parameter ```interactive``` to ```True``` in the function ```contact_features``` described below. If you chose to set ```interactive``` to ```False```, the function will automatically interpret the input and launch the computation without checking with the user.
All the results produced along the pipeline will be saved in subdirectories of ```ensemble_folder```. If the computation for the same ensemble has to be repeated from the beggining, we ask you to empty ```ensemble_folder``` or to create a new directory containing only the ensemble input data.
%% Cell type:markdown id:c3593eaf tags:
### 1. Data featurization
#### Compute contact weights and embed data into a low-dimensional UMAP space
The function ```contact_features``` takes the input ensemble data and returns the $n\times L(L-1)/2$ matrix $W$ containing the contact information, where $n$ is the number of conformations and $L$ the sequence length. The output is automatically saved in ``ensemble_folder`` as a .txt file. The arguments of ```contact_features``` are:
* ```ensemble_name```: The name of the ensemble, to appear in the results files and plots,
* ```ensemble_path```: The directory containing the ensemble information, as described above,
* ```N_cores```: The number of threads to use in parallel computation,
* ```interactive```: Whether to check with the user that data has been correctly understood,
* ```thresholds```: The file containing the parameters that calibrate the contact function.
The function computes and saves $W$ and its embeddings into a 2-dimensional and 10-dimensional UMAP spaces.
%% Cell type:markdown id:c9a6f349 tags:
If the entire sequence has to be analyzed, run the following line:
%% Cell type:code id:e4218615 tags:
``` python
subseq = None
```
%% Cell type:markdown id:fdb9bf41 tags:
Otherwise, an array specifying the residue positions to be set can be chosen:
%% Cell type:code id:cc4a2cd0 tags:
``` python
#subseq = np.concatenate([np.arange(100, 151, 1), np.arange(170, 181, 1)]) # Example
```
%% Cell type:markdown id:8ba9a7d7 tags:
Once the value of this last argument is chosen, the function can be run:
%% Cell type:code id:70f5c839 tags:
``` python
contact_features(ensemble_name = ensemble_name, ensemble_path = ensemble_folder, N_cores = 1, interactive = True, thresholds = th_file, subsequence = subseq)
```
%% Cell type:markdown id:e98848e5 tags:
### 2. Contact-based clustering
Once data has been featured with contact information, and embedded into a low-dimensional UMAP space, clustering can be performed to get the ensemble characterization.
#### Perform HDBSCAN clustering in the low-dimensional space
%% Cell type:markdown id:75f335b4 tags:
First, load the featured data frame and its embeddings into the low-dimensional UMAP spaces. These have been automatically saved by ```contact_features``` in the following directories.
First, load the data embeddings into the low-dimensional UMAP spaces. These have been automatically saved by ```contact_features``` in the following directories.
%% Cell type:code id:80d44e80 tags:
``` python
# Directory containing results
results_path = "/".join([os.path.abspath(ensemble_folder),"_".join(['results',ensemble_name])])
# Matrix W with contact information
wcont_data = pd.read_csv("/".join([results_path, "_".join([ensemble_name,'wcontmatrix.txt'])]), sep = ' ', header = None)
# Embedding of W into a 2-dimensional UMAP space for visualization
embedding_2d = np.load('/'.join([results_path, "_".join([ensemble_name,'embedding_2d_wcont.npy'])]))
# Embedding of W into a 10-dimensional UMAP space for clustering
clusterable_embedding = np.load('/'.join([results_path, "_".join([ensemble_name,'clusterable_embedding_wcont.npy'])]))
```
%% Cell type:markdown id:a53758cd tags:
Now, we can perform HDBSCAN clustering in the low-dimensional UMAP space.
**Minimum cluster size**
The clustering is calibrated through the minimum number of conformations that can define a cluster. This quantity will indirectly define the number of retrieved clusters. The user might choose an appropiate value according to the desired level of precision in the classification and the encountered results. The default minimum cluster size is set to 1% of the total number of conformations. Large sequences might require a smaller minimum cluster size.
%% Cell type:code id:76c8d967 tags:
``` python
# Choose minimum cluster size
min_cluster_size = int(wcont_data.shape[0]*0.01) # Default = 1% of individuals
```
%% Cell type:code id:c26a5745 tags:
``` python
# Perform HDBSCAN clustering
labels_umap = hdbscan.HDBSCAN(
min_samples = 10,
min_cluster_size = min_cluster_size,
).fit_predict(clusterable_embedding)
classified = np.where(labels_umap >= 0)[0]
print("".join(["\nThe clustering algorithm found ",str(len(np.unique(labels_umap[labels_umap >= 0])))," clusters and classified the ",str(np.round(100*len(classified)/len(labels_umap),2)),"% of conformations. \n"]))
```
%% Cell type:markdown id:53f0ce88 tags:
#### Results visualization
Clustering partition visualized on the 2-dimensional UMAP space. This illustrates the repartition of conformations among clusters and their corresponding occupancy. By looking at the number of connected components in the space, the minimum cluster size might be re-calibrated. Note that unclassified points appear in gray.
The figure produced by the function can be formatted by customizing the plotting parameters below.
%% Cell type:code id:616b989a tags:
``` python
plot_2umap(embedding_2d, labels_umap, ensemble_name, results_path)
```
%% Cell type:markdown id:9e5b8993 tags:
_Supplementary plotting parameters_
* ```pdf```: Whether to save the figure in .pdf format. If ```False```, figure is saved as .png. Default is ```False```.
* ```dpi_png```: Resolution of the .png file. Ignored if ```pdf = True```. Default is ```dpi_png = 200```.
%% Cell type:markdown id:dc831da1 tags:
### 3. Cluster-specific $\omega$-contact maps
Compute the ensemble characterization defined as a weighted family of cluster-specific $\omega$-contact maps.
%% Cell type:markdown id:9e821614 tags:
#### 3.1 Plot cluster-specific $\omega$-contact maps
%% Cell type:markdown id:396acb0c tags:
Run the cell below to create and save cluster-specific $\omega$-contact maps. Plots are saved in a new subdirectory of ```ensemble_folder```.
The figures produced by the function can be formatted by customizing the plotting parameters below. This can be useful to adapt the representation to the sequence length.
%% Cell type:code id:96f6dcf0 tags:
``` python
get_wmaps(wcont_data, labels_umap, ensemble_name, results_path, subsequence = subseq)
get_wmaps(labels_umap, ensemble_name, results_path, subsequence = subseq)
```
%% Cell type:markdown id:972c177f tags:
_Supplementary plotting parameters_
* ```pdf```: Whether to save the figure in .pdf format. If ```False```, figure is saved as .png. Default is ```False```,
* ```dpi_png```: Resolution of the .png file. Ignored if ```pdf = True```. Default is ```dpi_png = 500```,
* ```fontsize_title```: Title font size. Default is ```fontsize_title = 10```,
* ```fontsize_axis```: Axes font size. Default is ```fontsize_axis = 5```,
* ```fontsize_suptitle```: Subtitle font size. Default is ```fontsize_suptitle = 12```,
* ```shrink_cbar```: Shrink parameter of [matplotlib's colorbar](https://matplotlib.org/stable/api/figure_api.html#matplotlib.figure.Figure.colorbar). Default is ```shrink_cbar = .5```,
* ```xticks_angle```: xaxis ticks rotation angle. Default is ```xticks_angle = 90```.
_Add lines to help identify contacts_
Straight lines can be superimposed to the plot to help a preliminary visual inspection of a particular region. This can be done by adding the parameter ```marks```: an array of sequence positions where vertical and horizontal lines should be drawn (e.g. ```marks = [10, 35, 90, 100]```). Default is ```None```.
%% Cell type:markdown id:f9916b8d tags:
#### 3.2 Create cluster-specific files (for .xtc trajectories) or folders (for .pdb folders)
The clustering partition can be used to build new cluster-specific files containing the cluster conformations. The files are automatically saved in a new directory inside of ```ensemble_folder```. Note that the files will be produced in the same format as the one of the input data: one .xtc file per cluster (if data is given as a .xtc file with a topology file, or as a multiframe .pdb file) or a folder per cluster containing its conformations (if data is given as a folder with one .pdb file per conformation).
%% Cell type:code id:b5d146a1 tags:
``` python
get_cluster_files(ensemble_path = ensemble_folder, ensemble_name = ensemble_name, labels_umap = labels_umap)
```
%% Cell type:markdown id:b89412b2 tags:
#### 3.3 Secondary structure propensities and average radius of gyration
The function below computes the average DSSP secondary structure propensities per cluster, together with the average radius of gyration across cluster conformations. One plot per cluster is produced, and they are automatically saved in a subdirectory of ```ensemble_folder```. The DSSP categories are defined [here](https://mdtraj.org/1.9.4/api/generated/mdtraj.compute_dssp.html).
The figures produced by the function can be formatted by customizing the plotting parameters below. This can be useful to adapt the representation to the sequence length.
%% Cell type:code id:68305f88 tags:
``` python
cluster_descriptors(ensemble_path = ensemble_folder, ensemble_name = ensemble_name, labels_umap = labels_umap, subsequence = subseq)
```
%% Cell type:markdown id:8a5bb9cb tags:
_Supplementary plotting parameters_
* ```pdf```: Whether to save the figure in .pdf format. If ```False```, figure is saved as .png. Default is ```False```,
* ```dpi_png```: Resolution of the .png file. Ignored if ```pdf = True```. Default is ```dpi_png = 200```,
* ```fig_width```: Figure width. Default is ```fig_width = 10```,
* ```fig_height```: Figure height. Default is ```fig_width = 1.7```.
* ```fontsize_title```: Title font size. Default is ```fontsize_title = 8```,
* ```fontsize_axis```: Axes font size. Default is ```fontsize_axis = 7```,
* ```shrink_cbar```: Shrink parameter of [matplotlib's colorbar](https://matplotlib.org/stable/api/figure_api.html#matplotlib.figure.Figure.colorbar). Default is ```shrink_cbar = .7```,
* ```yticks_angle```: yaxis ticks rotation angle. Default is ```yticks_angle = 0```.
%% Cell type:markdown id:290cce9e tags:
#### 3.3 Sample a representative family of conformations
#### 3.4 Sample a representative family of conformations
The ensemble characterization can be used to sample a representative family of conformations of a given size. This is done by sampling conformations from clusters with probabilites given by the cluster occupancies. In other words, if $p_1,\ldots,p_K$ are the (normalized) occupancies of clusters $\mathcal{C}_1,\ldots,\mathcal{C}_K$ respectively, sample from the distribution
$$
p_1\mathcal{U}(\mathcal{C}_1)+\cdots + p_K\mathcal{U}(\mathcal{C}_K),
$$
where $\mathcal{U}(\mathcal{S})$ denotes the discrete uniform distribution on the set $\mathcal{S}\subset\lbrace 1,\ldots,n\rbrace$. This is performed by the function ```representative_ensemble``` below, which needs to be given the ```size``` of the representative family (in number of conformations) as an argument.
%% Cell type:code id:2eb7586a tags:
``` python
representative_ensemble(size = 10, ensemble_path = ensemble_folder, ensemble_name = ensemble_name, labels_umap = labels_umap)
```
%% Cell type:markdown id:b89412b2 tags:
#### 3.4 Secondary structure propensities and average radius of gyration
The function below computes the average DSSP secondary structure propensities per cluster, together with the average radius of gyration across cluster conformations. One plot per cluster is produced, and they are automatically saved in a subdirectory of ```ensemble_folder```. The DSSP categories are defined [here](https://mdtraj.org/1.9.4/api/generated/mdtraj.compute_dssp.html).
%% Cell type:code id:68305f88 tags:
``` python
cluster_descriptors(ensemble_path = ensemble_folder, ensemble_name = ensemble_name, labels_umap = labels_umap, subsequence = subseq)
```
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment