"An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem."
-John Tukey
Applying a Clustering algorithm to identify protein patterns
Intro
The dataset consists of the expression levels of 77 proteins that produced detectable signals in the nuclear fraction of cortex. The study includes 38 control mice and 34 trisomic mice. In the experiments, 15 measurements were registered for each mouse, giving a total of 570 measurements (38 × 15) for control mice and 510 measurements (34 × 15) for trisomic mice; for this analysis, each measurement can be considered an independent sample/mouse.
Eight classes of mice are described based on the features: genotype, behavior and treatment; all these features are condensed in the column “class”.
​
Levels and treatment
According to genotype, mice can be control or trisomic. According to behavior, some mice have been stimulated to learn (context-shock) and others have not (shock-context). Finally, in order to assess the effect of the drug memantine on recovering the ability to learn in trisomic mice, some mice have been injected with the drug and others have not.
* Note: If you want to read the entire paper, the link is in the reference section.
Figure 1: Experiment diagram [1]
The proteins involved in the study are the following:
Objective
Basically, the general objective is to identify subsets of proteins that are discriminant between classes.
Data
The data provided by the authors come from several experiments performed on different mice. The dataframe contains:
​
Mouse ID
Protein Measurements
Genotype
Treatment
Behavior
Class
Figure 2: Raw data from repository [2]
For the following analysis, the columns Treatment, Behavior, Genotype and Mouse ID are dropped from the dataframe, since all of this information is already contained in the 'class' column.
What we want is just the protein measurements and the class, in order to perform the ANOVA and, later on, the clustering process.
Figure 3: Clean data
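As a minimal sketch of this cleaning step, assuming the raw CSV and column labels match what is shown in Figure 2 (the file name, the exact column labels, and the dataframe name df are assumptions used in the later sketches too):

```python
import pandas as pd

# Load the raw data (file name is an assumption about the local setup)
df = pd.read_csv("Data_Cortex_Nuclear.csv")

# Keep only the protein measurements and the class label:
# MouseID, Genotype, Treatment and Behavior are already encoded in 'class'
df = df.drop(columns=["MouseID", "Genotype", "Treatment", "Behavior"])
```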
First Step: One-Way ANOVA analysis
The first step after cleaning the data is to determine which proteins were actually produced significantly more or less across the different levels of the experiment. For this purpose a one-way ANOVA is run with alpha = 0.001; this strict threshold was chosen because of the characteristics of the dataframe (77 proteins are tested, so a low alpha guards against false positives). We also need the script to keep every protein whose mean expression differs between groups.
For the ANOVA we are just comparing:
c-CS-m
c-CS-s
t-CS-m
t-CS-s
Figure 4: ANOVA code
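Figure 4 contains the original script; as a rough sketch of the same idea, a one-way ANOVA filter built on scipy could look like the following (the dataframe name df, the 'class' column, and the helper name significant_proteins are assumptions for illustration):

```python
from scipy import stats

ALPHA = 0.001
CLASSES = ["c-CS-m", "c-CS-s", "t-CS-m", "t-CS-s"]  # groups compared in the ANOVA

def significant_proteins(df, alpha=ALPHA, classes=CLASSES):
    """Return the proteins whose mean expression differs across the given classes."""
    subset = df[df["class"].isin(classes)]
    proteins = [col for col in subset.columns if col != "class"]
    keep = []
    for protein in proteins:
        # One group of measurements per class, ignoring missing values
        groups = [grp[protein].dropna() for _, grp in subset.groupby("class")]
        _, p_value = stats.f_oneway(*groups)
        if p_value < alpha:
            keep.append(protein)
    return keep

significant = significant_proteins(df)
```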
To visualize the differences, the script creates a boxplot for every protein that shows a significant difference between groups; here is an example:
Figure 5: Boxplot generated by Python
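The boxplots themselves are one seaborn call per protein. A sketch reusing df and significant from the previous sketches (restricting the plot to the same four classes is an assumption):

```python
import matplotlib.pyplot as plt
import seaborn as sns

compared = df[df["class"].isin(["c-CS-m", "c-CS-s", "t-CS-m", "t-CS-s"])]

# One boxplot per protein flagged by the ANOVA step
for protein in significant:
    plt.figure(figsize=(6, 4))
    sns.boxplot(data=compared, x="class", y=protein)
    plt.title(f"{protein} expression by class (ANOVA p < 0.001)")
    plt.tight_layout()
    plt.show()
```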
Second Step: Protein Clustering
* Note: everything should be scaled or standardized from now on.
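A minimal sketch of that scaling step, reusing df and significant from the earlier sketches; StandardScaler is just one reasonable choice here:

```python
from sklearn.preprocessing import StandardScaler

# Keep the significant proteins, drop incomplete rows, and standardize each
# protein to zero mean and unit variance
X = df[significant].dropna()
X_scaled = StandardScaler().fit_transform(X)
```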
For the clustering, which is the main point of this entire section, we first need to define some parameters. There are several methods to perform a clustering of elements; the one used this time is Ward's method, which is comparable to K-Means in that it is based on minimizing the variance within each cluster. There is still another parameter to choose: the threshold that actually determines how many "groups" of proteins we will end up with. To determine it, the elbow rule is applied to the following graph.
Figure 6: K-means Distortion Curve
This graph is generated with the yellowbrick library; it tells us how many clusters it is convenient to use for this dataset.
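A sketch of how such a distortion curve can be produced with yellowbrick's KElbowVisualizer, reusing X_scaled from the scaling sketch (the range of k is an arbitrary choice for illustration):

```python
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# We are clustering proteins, so transpose: one row per protein
model = KMeans(random_state=0)
visualizer = KElbowVisualizer(model, k=(2, 12), metric="distortion")
visualizer.fit(X_scaled.T)
visualizer.show()
```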
With this information we can proceed to actually plot the dendrogram, this time using the scipy library (scipy.cluster.hierarchy).
The parameters for this dendrogram are:
​
Method = ward (the same variance-minimizing criterion as K-Means)
Metric = Euclidean
​
These parameters are commonly used as general-purpose defaults, but we can always try another method or metric depending on the phenomenon we are studying.
Figure 7: Dendrogram of proteins
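A hedged sketch of how a dendrogram like Figure 7 can be drawn with scipy.cluster.hierarchy, reusing X_scaled and significant from the previous sketches:

```python
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy

# Rows must be proteins, hence the transpose of the scaled sample-by-protein matrix
linkage_matrix = hierarchy.linkage(X_scaled.T, method="ward", metric="euclidean")

plt.figure(figsize=(14, 5))
hierarchy.dendrogram(linkage_matrix, labels=significant, leaf_rotation=90)
plt.title("Protein dendrogram (Ward linkage, Euclidean metric)")
plt.tight_layout()
plt.show()
```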
So this is the final clustering produced by the Ward method; it is worth mentioning that each method will probably offer different groupings. We are not done yet: there are some extra plots that make the analysis more robust.
Third Step: Protein Clustering vs Mice type
In this part there is no processing left to do, just some simple code using the seaborn library plus a few extra steps to label it correctly. What is shown below is called a cluster map; this kind of graph allows us to visualize the protein clusters made in the last step against the mice types.
Figure 8: Cluster map Mice type vs Protein
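A sketch of such a cluster map with seaborn.clustermap, reusing df and significant; the row-color scheme and the scaling choice are illustrative assumptions, not the original notebook's exact settings:

```python
import seaborn as sns

data = df[significant].dropna()
classes = df.loc[data.index, "class"]

# Color each row (one measurement) by its class so the mice types can be
# read off next to the protein clusters
palette = dict(zip(classes.unique(), sns.color_palette("Set2", classes.nunique())))
row_colors = classes.map(palette)

sns.clustermap(
    data,
    method="ward",
    metric="euclidean",
    standard_scale=1,      # rescale each protein (column) to the [0, 1] range
    row_colors=row_colors,
    figsize=(12, 10),
)
```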
Fourth Step: Protein Clustering with a correlation analysis
Applying correlation analysis to the dataset allows us to analyze which proteins vary together with which others; this can be convenient when the variables have some sort of relation (positive or negative). The final result is shown below.
Figure 9: Correlation Matrix: Proteins
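A minimal sketch of the protein correlation matrix and its export, reusing df and significant (the colormap and output file name are arbitrary choices):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlation between the significant proteins
protein_corr = df[significant].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(protein_corr, cmap="coolwarm", center=0, square=True)
plt.title("Protein-protein correlation matrix")
plt.tight_layout()
plt.show()

# The values can also be exported for further analysis
protein_corr.to_csv("protein_correlations.csv")
```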
If you want to, you can always export the correlation values in order to perform another type of analysis; here is the CSV obtained:
Applying the same correlation analysis per class allows us to analyze which classes behave similarly to which others; again, this can be convenient when the variables have some sort of relation (positive or negative). The final result is shown below.
Figure 10: Correlation Matrix: Mice Class
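The notebook's exact construction for this matrix is not shown here, so the following is only one plausible sketch: average each protein per class and correlate the resulting class profiles (names reused from the earlier sketches are assumptions):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One plausible construction: mean protein profile per class, then correlate
# the class profiles (proteins act as observations, classes as variables)
class_profiles = df.groupby("class")[significant].mean().T
class_corr = class_profiles.corr()

plt.figure(figsize=(7, 6))
sns.heatmap(class_corr, cmap="coolwarm", center=0, annot=True, square=True)
plt.title("Correlation between class mean protein profiles")
plt.tight_layout()
plt.show()
```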
If you want to, you can always export the correlation values in order to perform another type of analysis; here is the CSV obtained:
Conclusion and final thoughts
Even though we do not know every single detail of the experimentation, all these tools allow anyone to make an exploratory analysis of the data and, beyond that, to actually extract some results by performing clustering. Ideally, a clustering should be backed up by some theoretical explanation. These tools are proof that every dataset contains more information than we ever imagined; we just need to apply the right concepts and algorithms.
REFERENCES
Data
[1] Higuera C, Gardiner KJ, Cios KJ, Mice Expression Repository, Kaggle.com: https://www.kaggle.com/ruslankl/mice-protein-expression/kernels
[2] Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down Syndrome. PLoS ONE 10(6): e0129126. https://doi.org/10.1371/journal.pone.0129126
Python Libraries
Seaborn Documentation: https://seaborn.pydata.org/generated/seaborn.clustermap.html
Scipy Documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html