"An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem."
-John Tukey
Applying a Clustering algorithm to identify protein patterns
Intro
The dataset consists of the expression levels of 77 proteins that produced detectable signals in the nuclear fraction of cortex. The study includes 38 control mice and 34 trisomic mice. In the experiments, 15 measurements were registered for each mouse, giving a total of 570 measurements (38 × 15) for control mice and 510 measurements (34 × 15) for trisomic mice; for this analysis, each measurement can be considered an independent sample/mouse.
Eight classes of mice are described based on the features: genotype, behavior and treatment; all these features are condensed in the column “class”.
​
Levels and treatment
According to genotype, mice can be control or trisomic. According to behavior, some mice have been stimulated to learn (context-shock) and others have not (shock-context). Finally, in order to assess the effect of the drug memantine on recovering the ability to learn in trisomic mice, some mice have been injected with the drug and others have not.
* Note: If you want to read the entire paper, the link is in the reference section.
Figure 1: Experiment diagram [1]
The proteins involved in the study are the following:
Objective
Basically, the general objective is to identify subsets of proteins that are discriminant between classes.
Data
The data provided by the authors come from several experiments performed on different mice. The dataframe contains:
​
Mouse ID
Protein Measurements
Genotype
Treatment
Behavior
Class
Figure 2: Raw data from repository [2]
For the following analysis, the columns Treatment, Behavior, Genotype and Mouse ID are dropped from the dataframe, since all of this information is already contained in the 'class' column.
What we want is just the protein measurements and the class, in order to perform the ANOVA and, later on, the clustering process.
Figure 3: Clean data
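As a minimal sketch of this cleaning step, assuming the raw CSV and column labels match what is shown in Figure 2 (the file name, the exact column labels, and the dataframe name df are assumptions used in the later sketches too):

```python
import pandas as pd

# Load the raw data (file name is an assumption about the local setup)
df = pd.read_csv("Data_Cortex_Nuclear.csv")

# Keep only the protein measurements and the class label:
# MouseID, Genotype, Treatment and Behavior are already encoded in 'class'
df = df.drop(columns=["MouseID", "Genotype", "Treatment", "Behavior"])
```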
First Step: One-Way ANOVA analysis
The first step after cleaning the data is to determine which proteins were actually produced significantly more or less across the different levels of the experiment. For this purpose a one-way ANOVA is run with alpha = 0.001; this strict threshold was chosen because of the characteristics of the dataframe (77 proteins are tested, so a low alpha guards against false positives). We also need the script to keep every protein whose mean expression differs between groups.
For the ANOVA we are just comparing:
c-CS-m
c-CS-s
t-CS-m
t-CS-s
Figure 4: ANOVA code
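Figure 4 contains the original script; as a rough sketch of the same idea, a one-way ANOVA filter built on scipy could look like the following (the dataframe name df, the 'class' column, and the helper name significant_proteins are assumptions for illustration):

```python
from scipy import stats

ALPHA = 0.001
CLASSES = ["c-CS-m", "c-CS-s", "t-CS-m", "t-CS-s"]  # groups compared in the ANOVA

def significant_proteins(df, alpha=ALPHA, classes=CLASSES):
    """Return the proteins whose mean expression differs across the given classes."""
    subset = df[df["class"].isin(classes)]
    proteins = [col for col in subset.columns if col != "class"]
    keep = []
    for protein in proteins:
        # One group of measurements per class, ignoring missing values
        groups = [grp[protein].dropna() for _, grp in subset.groupby("class")]
        _, p_value = stats.f_oneway(*groups)
        if p_value < alpha:
            keep.append(protein)
    return keep

significant = significant_proteins(df)
```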
To visualize the differences, the script creates a boxplot for every protein that shows a significant difference between groups; here is an example:
Figure 5: Boxplot generated by Python
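The boxplots themselves are one seaborn call per protein. A sketch reusing df and significant from the previous sketches (restricting the plot to the same four classes is an assumption):

```python
import matplotlib.pyplot as plt
import seaborn as sns

compared = df[df["class"].isin(["c-CS-m", "c-CS-s", "t-CS-m", "t-CS-s"])]

# One boxplot per protein flagged by the ANOVA step
for protein in significant:
    plt.figure(figsize=(6, 4))
    sns.boxplot(data=compared, x="class", y=protein)
    plt.title(f"{protein} expression by class (ANOVA p < 0.001)")
    plt.tight_layout()
    plt.show()
```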
Second Step: Protein Clustering
* Note: everything should be scaled or standardized from now on.
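A minimal sketch of that scaling step, reusing df and significant from the earlier sketches; StandardScaler is just one reasonable choice here:

```python
from sklearn.preprocessing import StandardScaler

# Keep the significant proteins, drop incomplete rows, and standardize each
# protein to zero mean and unit variance
X = df[significant].dropna()
X_scaled = StandardScaler().fit_transform(X)
```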
For the clustering, which is the main point of this entire section, we first need to define some parameters. There are several methods to perform a clustering of elements; the one used this time is Ward's method, which is comparable to K-Means in that it is based on minimizing the variance within each cluster. There is still another parameter to choose: the threshold that actually determines how many "groups" of proteins we will end up with. To determine it, the elbow rule is applied to the following graph.
Figure 6: K-means Distortion Curve
This graph is generated with the yellowbrick library; it tells us how many clusters it is convenient to use for this dataset.
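A sketch of how such a distortion curve can be produced with yellowbrick's KElbowVisualizer, reusing X_scaled from the scaling sketch (the range of k is an arbitrary choice for illustration):

```python
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# We are clustering proteins, so transpose: one row per protein
model = KMeans(random_state=0)
visualizer = KElbowVisualizer(model, k=(2, 12), metric="distortion")
visualizer.fit(X_scaled.T)
visualizer.show()
```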
With this information we can proceed to actually plot the dendrogram, this time using the scipy library (scipy.cluster.hierarchy).
The parameters for this dendrogram are:
​
Method = ward (the same variance-minimizing criterion as K-Means)
Metric = Euclidean
​
These parameters are commonly used as general-purpose defaults, but we can always try another method or metric depending on the phenomenon we are studying.
Figure 7: Dendrogram of proteins
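A hedged sketch of how a dendrogram like Figure 7 can be drawn with scipy.cluster.hierarchy, reusing X_scaled and significant from the previous sketches:

```python
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy

# Rows must be proteins, hence the transpose of the scaled sample-by-protein matrix
linkage_matrix = hierarchy.linkage(X_scaled.T, method="ward", metric="euclidean")

plt.figure(figsize=(14, 5))
hierarchy.dendrogram(linkage_matrix, labels=significant, leaf_rotation=90)
plt.title("Protein dendrogram (Ward linkage, Euclidean metric)")
plt.tight_layout()
plt.show()
```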
So this is the final clustering produced by the Ward method; it is worth mentioning that each method will probably offer different groupings. We are not done yet: there are some extra plots that make the analysis more robust.
Third Step: Protein Clustering vs Mice type
In this part there is no processing left to do, just some simple code using the seaborn library plus a few extra steps to label it correctly. What is shown below is called a cluster map; this kind of graph allows us to visualize the protein clusters made in the last step against the mice types.
Figure 8: Cluster map Mice type vs Protein
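A sketch of such a cluster map with seaborn.clustermap, reusing df and significant; the row-color scheme and the scaling choice are illustrative assumptions, not the original notebook's exact settings:

```python
import seaborn as sns

data = df[significant].dropna()
classes = df.loc[data.index, "class"]

# Color each row (one measurement) by its class so the mice types can be
# read off next to the protein clusters
palette = dict(zip(classes.unique(), sns.color_palette("Set2", classes.nunique())))
row_colors = classes.map(palette)

sns.clustermap(
    data,
    method="ward",
    metric="euclidean",
    standard_scale=1,      # rescale each protein (column) to the [0, 1] range
    row_colors=row_colors,
    figsize=(12, 10),
)
```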
Fourth Step: Protein Clustering with a correlation analysis
Applying correlation analysis to the dataset allows us to analyze which proteins vary together with which others; this can be convenient when the variables have some sort of relation (positive or negative). The final result is shown below.
Figure 9: Correlation Matrix: Proteins
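A minimal sketch of the protein correlation matrix and its export, reusing df and significant (the colormap and output file name are arbitrary choices):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlation between the significant proteins
protein_corr = df[significant].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(protein_corr, cmap="coolwarm", center=0, square=True)
plt.title("Protein-protein correlation matrix")
plt.tight_layout()
plt.show()

# The values can also be exported for further analysis
protein_corr.to_csv("protein_correlations.csv")
```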
If you want to, you can always export the correlation values in order to perform another type of analysis; here is the CSV obtained:
Applying the same correlation analysis per class allows us to analyze which classes behave similarly to which others; again, this can be convenient when the variables have some sort of relation (positive or negative). The final result is shown below.
Figure 10: Correlation Matrix: Mice Class
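The notebook's exact construction for this matrix is not shown here, so the following is only one plausible sketch: average each protein per class and correlate the resulting class profiles (names reused from the earlier sketches are assumptions):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One plausible construction: mean protein profile per class, then correlate
# the class profiles (proteins act as observations, classes as variables)
class_profiles = df.groupby("class")[significant].mean().T
class_corr = class_profiles.corr()

plt.figure(figsize=(7, 6))
sns.heatmap(class_corr, cmap="coolwarm", center=0, annot=True, square=True)
plt.title("Correlation between class mean protein profiles")
plt.tight_layout()
plt.show()
```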
If you want to, you can always export the correlation values in order to perform another type of analysis; here is the CSV obtained:
Conclusion and final thoughts
Even though we do not know every single detail of the experimentation, all these tools allow anyone to make an exploratory analysis of the data and, beyond that, to actually extract some results by performing clustering. Ideally, a clustering should be backed up by some theoretical explanation. These tools are proof that every dataset contains more information than we ever imagined; we just need to apply the right concepts and algorithms.
REFERENCES
Data
[1] Higuera C, Gardiner KJ, Cios KJ, Mice Expression Repository, Kaggle.com: https://www.kaggle.com/ruslankl/mice-protein-expression/kernels
[2] Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down Syndrome. PLoS ONE 10(6): e0129126. https://doi.org/10.1371/journal.pone.0129126
Python Libraries
Seaborn Documentation: https://seaborn.pydata.org/generated/seaborn.clustermap.html
Scipy Documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html