top of page

pH recognition with RGB detector

This case was presented in kaggle´s forum, (For more information the link is in the reference section)

The CSV, that is available to download, is really simple, the are just 3 columns of input data and one output:

pH.PNG

For the preliminary analysis the script prints the stats of the entire data set. After that, the script performed the plotting of the data. The plot shown is the following:

sigmoidal.PNG
sigmoi.PNG

The image shows the raw information uploaded in kaggle. The red, green and blue column are the inputs of the system and the output is the pH of the solution, in this case the column name "label"

PLOT.png

Nontheless, if a modification is done to the data set by performing what I call a sigmoidal modification that actually measure how big is one concentration of color against another, the formula looks like the following and is apply to the following relations:

​

Red/Blue     

Red/Green   

Blue/Green

​

If the sigmoidal value is added the prediction made by the for model improves approximately in a 10%. So is a process worth doing in this kind of case.

Parameters

After the data set cleaning and adding the sigmoidal modification the prediction models are applied. The conditions are the following for each model.

KNN Regression

Inputs: red, blue, green, R/B, R/G and B/G.

Output: label (pH)

k's to try: 20

Random Forest

Inputs: red, blue, green, R/B, R/G and B/G.

Output: label (pH)

Random State: 20

Linear Regression

Inputs: red, blue, green, R/B, R/G and B/G.

Output: label (pH)

​

Neural Network

Inputs: red, blue, green, R/B, R/G and B/G.

Neural structure: 

Output: label (pH)

Epoch: 1500

Layers:

1: 24 neurons relu acivation

2: 24 neurons sigmoidal activation

3: 12 neurons sigmoidal activation

4: 3 neurons selu activation

5: 1 neuron (output)

The result of the 3 predictions models are the following:

MethodComp.PNG

The figure shows several values for wach model:

LR: RMSE value, MAE value and R^2 of the prediction/test graph.

KNN: RMSE value, MAE value, best k for the model and R^2 of the prediction/test graph.

NN: RMSE value, MAE value and R^2 of the prediction/test graph.

RF: RMSE value, MAE value and R^2 of the prediction/test graph.

PTRF.png
PTNN.png
PTLR.png
PTKNN.png

The error distribution is graph below and is used to analyze the error made by every single model, the purpose of this plot is to observe if the error made has some kind of tendency.

EKNN.png
E_LR.png
E_RF.png
ENN.png

None of the distributions could be statistical consider as a normal distribution, Furthermore, there are a couple of detail in each graph, for the Neural Network graph, the tendency seems to have a tendcy predicting lower pH than it should be. For the linear regression, the error distribution has a big range of error, according to the results presented in the table before.

About the models....

KNN

For this particular model 20 k´s were used for the reduction of the cross validation value, the graph is shown on the right. The minor value of the cross validation for this particular case : k=11

NN.PNG
K.png

Neural Network

For the neural network the structure chose for the analysis is shown on the left. After that, th epoch for the model, for optimizing the fitting a patience model is added so the 1500 epoch are not run, the model stops at the epoch 250.

The learning rate for every epoch is this show in the following graphs:

MAE.png
MSE.png

Conclusion

The analysis made by this for models brings a lot of information of the behavior of the pH and its relation with the intensity of the RGB spectrum. The best two model for this analysis considering the R^2 and the distribution of the error are the Random Forest Algorithm and the K-Nearest Neighbors method. This models do not really need any sort of modification, but one of them could be modify: the Neural Network method. This method could be potentially modified in 3 ways: the method of optimization, the number of layers and neurons per layer and the activation method; this versatile model could potentially beat the other two, but several runs must be performed and different types of network.

¿More information?

If you want more information about the topic, want to know how to build the code or have any suggestion, email me!

Reference

+ 52 55 49 55 31 37

  • linkedin

©2020 por Data Science: Models and Applications Creada con Wix.com

bottom of page