### WARIO: Weighted fAmilies of contact maps to chaRacterize conformational ensembles of (highly-)flexIble prOteins
This is the Jupyter notebook implementing the pipeline that defines WARIO. The notebook takes an ensemble as input, features its conformations using contact information and perform a clustering algorithm to retrieve the ensemble characterization. The user can follow the guidelines presented below to implement WARIO step-by-step. First, the required libraries must be imported together with the experimental parameters that calibrate the contact function.
%% Cell type:code id:09b972d2 tags:
``` python
# Required imports
importipynb
fromipynb.fs.full.contact_featuresimport*
fromipynb.fs.full.utilsimport*
importos
importnumpyasnp
importhdbscan
path_to_notebook=os.path.abspath(os.getcwd())
#print(path_to_notebook) # Visualize the notebook directory
#### Set the folder containing the ensemble information
The input ensemble must be given as a directory containing trajectory and topology information. The folder must contain only:
* One .xtc file together with a topology file in one of the formats admitted by [MDTraj](https://www.mdtraj.org/1.9.8.dev0/api/generated/mdtraj.load.html#mdtraj.load) or,
* One multiframe .pdb file or,
* If the ensemble is given as a list of .pdb files (one file per conformation), one folder containing one .pdb file per conformation.
%% Cell type:code id:402b3b92 tags:
``` python
ensemble_folder='/directory_with_data/'
ensemble_name='my_ensemble'
# To run an example with real data, uncomment:
#ensemble_folder = "/".join([path_to_notebook,'P113']) # Path to the folder containing trajectory files
#ensemble_name = "P113" # Name the ensemble
```
%% Cell type:markdown id:b8029a85 tags:
We ask you not to choose ```ensemble_folder``` as a folder where multiple or redundant ensembles are located, but to create a specific directory per ensemble. The analysis can be launched in an interactive mode, which checks with the user whether the input data has been correctly understood. This mode is activated by setting the parameter ```interactive``` to ```True``` in the function ```contact_features``` described below. If you chose to set ```interactive``` to ```False```, the function will automatically interpret the input and launch the computation without checking with the user.
All the results produced along the pipeline will be saved in subdirectories of ```ensemble_folder```. If the computation for the same ensemble has to be repeated from the beggining, we ask you to empty ```ensemble_folder``` or to create a new directory containing only the ensemble input data.
%% Cell type:markdown id:c3593eaf tags:
### 1. Data featurization
#### Compute contact weights and embed data into a low-dimensional UMAP space
The function ```contact_features``` takes the input ensemble data and returns the $n\times L(L-1)/2$ matrix $W$ containing the contact information, where $n$ is the number of conformations and $L$ the sequence length. The output is automatically saved in ``ensemble_folder`` as a .txt file. The arguments of ```contact_features``` are:
*```ensemble_name```: The name of the ensemble, to appear in the results files and plots,
*```ensemble_path```: The directory containing the ensemble information, as described above,
*```N_cores```: The number of threads to use in parallel computation,
*```interactive```: Whether to check with the user that data has been correctly understood,
*```thresholds```: The file containing the parameters that calibrate the contact function.
The function computes and saves $W$ and its embeddings into a 2-dimensional and 10-dimensional UMAP spaces.
%% Cell type:markdown id:c9a6f349 tags:
If the entire sequence has to be analyzed, run the following line:
%% Cell type:code id:e4218615 tags:
``` python
subseq=None
```
%% Cell type:markdown id:fdb9bf41 tags:
Otherwise, an array specifying the residue positions to be set can be chosen:
%% Cell type:code id:cc4a2cd0 tags:
``` python
#subseq = np.concatenate([np.arange(100, 151, 1), np.arange(170, 181, 1)]) # Example
```
%% Cell type:markdown id:8ba9a7d7 tags:
Once the value of this last argument is chosen, the function can be run:
Once data has been featured with contact information, and embedded into a low-dimensional UMAP space, clustering can be performed to get the ensemble characterization.
#### Perform HDBSCAN clustering in the low-dimensional space
%% Cell type:markdown id:75f335b4 tags:
First, load the featured data frame and its embeddings into the low-dimensional UMAP spaces. These have been automatically saved by ```contact_features``` in the following directories.
Now, we can perform HDBSCAN clustering in the low-dimensional UMAP space.
**Minimum cluster size**
The clustering is calibrated through the minimum number of conformations that can define a cluster. This quantity will indirectly define the number of retrieved clusters. The user might choose an appropiate value according to the desired level of precision in the classification and the encountered results. The default minimum cluster size is set to 1% of the total number of conformations. Large sequences might require a smaller minimum cluster size.
%% Cell type:code id:76c8d967 tags:
``` python
# Choose minimum cluster size
min_cluster_size=int(wcont_data.shape[0]*0.01)# Default = 1% of individuals
```
%% Cell type:code id:c26a5745 tags:
``` python
# Perform HDBSCAN clustering
labels_umap=hdbscan.HDBSCAN(
min_samples=10,
min_cluster_size=min_cluster_size,
).fit_predict(clusterable_embedding)
classified=np.where(labels_umap>=0)[0]
print("".join(["\nThe clustering algorithm found ",str(len(np.unique(labels_umap[labels_umap>=0])))," clusters and classified the ",str(np.round(100*len(classified)/len(labels_umap),2)),"% of conformations. \n"]))
```
%% Cell type:markdown id:53f0ce88 tags:
#### Results visualization
Clustering partition visualized on the 2-dimensional UMAP space. This illustrates the repartition of conformations among clusters and their corresponding occupancy. By looking at the number of connected components in the space, the minimum cluster size might be re-calibrated. Note that unclassified points appear in gray.
The clustering partition can be used to build new cluster-specific files containing the cluster conformations. The files are automatically saved in a new directory inside of ```ensemble_folder```. Note that the files will be produced in the same format as the one of the input data: one .xtc file per cluster (if data is given as a .xtc file with a topology file, or as a multiframe .pdb file) or a folder per cluster containing its conformations (if data is given as a folder with one .pdb file per conformation).
#### 3.3 Sample a representative family of conformations
The ensemble characterization can be used to sample a representative family of conformations of a given size. This is done by sampling conformations from clusters with probabilites given by the cluster occupancies. In other words, if $p_1,\ldots,p_K$ are the (normalized) occupancies of clusters $\mathcal{C}_1,\ldots,\mathcal{C}_K$ respectively, sample from the distribution
where $\mathcal{U}(\mathcal{S})$ denotes the discrete uniform distribution on the set $\mathcal{S}\subset\lbrace 1,\ldots,n\rbrace$. This is performed by the function ```representative_ensemble``` below, which needs to be given the ```size``` of the representative family (in number of conformations) as an argument.
#### 3.4 Secondary structure propensities and average radius of gyration
The function below computes the average DSSP secondary structure propensities per cluster, together with the average radius of gyration across cluster conformations. One plot per cluster is produced, and they are automatically saved in a subdirectory of ```ensemble_folder```. The DSSP categories are defined [here](https://mdtraj.org/1.9.4/api/generated/mdtraj.compute_dssp.html).