
EVALUACIÓN DE MÉTODOS DE APRENDIZAJE AUTOMÁTICO APLICADOS AL PROCESO DE DETECCIÓN DE BACTERIA-BACTERIÓFAGO

JUAN FERNANDO LÓPEZ SILVA CÓDIGO 2136482

UNIVERSIDAD AUTÓNOMA DE OCCIDENTE DEPARTAMENTO DE AUTOMÁTICA Y ELECTRÓNICA

FACULTAD DE INGENIERÍA PROGRAMA DE INGENIERÍA MECATRÓNICA

SANTIAGO DE CALI 2019


EVALUACIÓN DE MÉTODOS DE APRENDIZAJE AUTOMÁTICO APLICADOS AL PROCESO DE DETECCIÓN DE BACTERIA-BACTERIÓFAGO

JUAN FERNANDO LÓPEZ SILVA

PROYECTO DE GRADO PARA OPTAR POR EL TÍTULO DE INGENIERO MECATRÓNICO

DIRECTOR JESÚS ALFONSO LÓPEZ SOTELO

MAGÍSTER EN AUTOMÁTICA DOCTOR EN INGENIERÍA

UNIVERSIDAD AUTÓNOMA DE OCCIDENTE DEPARTAMENTO DE AUTOMÁTICA Y ELECTRÓNICA

FACULTAD DE INGENIERÍA PROGRAMA DE INGENIERÍA MECATRÓNICA

SANTIAGO DE CALI 2019


Nota de aceptación:

Aprobado por el Comité de Grado en cumplimiento de los requisitos exigidos por la Universidad Autónoma de Occidente para optar al título de Ingeniero Mecatrónico.

Víctor Adolfo Romero Cano

Jurado

Juan Carlos Perafán Villota

Jurado

Santiago de Cali, 6 de febrero de 2019


CONTENT

ABSTRACT
RESUMEN
INTRODUCTION
1 PROBLEM DESCRIPTION
2 JUSTIFICATION
3 STATE OF THE ART
4 OBJECTIVES
4.1 GENERAL OBJECTIVE
4.2 SPECIFIC OBJECTIVES
5 THEORETICAL FRAMEWORK
5.1 SUPERVISED LEARNING BASICS
5.1.1 Artificial Neural Networks (ANN)
5.1.2 K-nearest neighbors (KNN)
5.1.3 Decision trees
5.1.4 Support Vector Machines (SVMs)
5.2 ONE-CLASS LEARNING
5.2.1 Replicator Neural Network (RNN)
5.2.2 Local Outlier Factor (LOF)
5.2.3 Isolation Forest (iTree)
5.2.4 One-Class Support Vector Machine (OC-SVM)
5.2.5 Elliptic Envelope (FAST-MCD)
5.3 ENSEMBLE LEARNING
5.4 GRIDSEARCH
5.5 PARETO EFFICIENCY
5.6 PERFORMANCE METRICS
5.6.1 Accuracy
5.6.2 Sensitivity (or Recall)
5.6.3 Specificity
5.6.4 Positive Predictive Value (PPV) or precision
5.6.5 Negative Predictive Value (NPV)
5.6.6 F1-Score
5.7 PHAGE-BACTERIA INTERACTION
6 METHODOLOGY
6.1 EXPLORATORY PROCEDURE
6.2 DATASETS GENERATION AND FEATURES EXTRACTION
6.3 DATASETS DESCRIPTION
6.4 HYPERPARAMETERS SELECTION
7 RESULTS AND ANALYSIS
7.1 EXPLORATORY PROCEDURE
7.2 FINAL PROCEDURE
8 CONCLUSIONS
9 FUTURE WORK
BIBLIOGRAPHY


ABSTRACT

The misuse of antibiotic drugs contributes to the emergence and rapid dissemination of antibiotic resistance worldwide, threatening medical progress. The development of innovative alternatives is necessary to fight this public health problem. A re-emerging therapy, dubbed phage therapy, might represent such an alternative. Phage therapy is based on viruses (bacteriophages) that specifically infect and kill bacteria during their life cycle. The success of phage therapy relies mainly on the exact matching between the pathogenic bacterium and the therapeutic phage. However, this matching is a time-consuming process carried out in laboratories, and time is a precious and critical resource in a clinical context. Hence, the fast identification of potential phage candidates capable of dealing with a given bacterium is essential for using phage therapy routinely. Machine learning algorithms trained on public genome databases constitute a promising approach to achieve this goal. Unfortunately, public databases contain highly imbalanced interaction data (i.e., mostly positive phage-bacterium interactions), making it harder to use classic machine learning algorithms, which need relatively balanced classes to work. To address this problem, we explore the use of One-Class learning methods, which are robust tools for dealing with imbalanced datasets. We tested five One-Class learning techniques (an odd number, which enables majority voting) merged with the ensemble-learning paradigm on real medical data, obtaining accuracy results from 75% up to 85%; this encouraged us to tailor and scale them up to our phage-bacteria data. Using gridsearch and Pareto fronts to refine the algorithms, the results on the phage-bacteria interaction datasets were promising, showing good performance and scalability. Further work could include developing new methods for One-Class classification and applying them to other types of real data, as well as trying the algorithms on experimentally verified negative interactions.


RESUMEN

El mal uso de los antibióticos ha contribuido a la aparición y rápida diseminación de la resistencia a los antibióticos en todo el mundo, amenazando el progreso médico. El desarrollo de alternativas innovadoras es una necesidad para luchar contra este problema de salud pública. Una terapia reemergente, llamada terapia de fagos (fagoterapia), podría representar esa alternativa. La terapia con fagos se basa en virus (bacteriófagos) que específicamente infectan y matan a las bacterias durante su ciclo de vida. El éxito de esta terapia se basa principalmente en coincidir una respectiva bacteria patógena con el fago terapéutico. Sin embargo, este es un proceso que consume mucho tiempo en los laboratorios y el tiempo es un recurso valioso y crítico en un contexto clínico. Por lo tanto, la identificación rápida de posibles fagos candidatos capaces de tratar con una bacteria dada es esencial para el uso habitual de la fagoterapia. Los algoritmos de aprendizaje automático entrenados con bases de datos públicas del genoma constituyen un enfoque prometedor para lograr este objetivo. Desafortunadamente, las bases de datos públicas contienen datos de interacción entre las bacterias y los fagos altamente desequilibrados (es decir, en su mayoría interacciones positivas entre fagos y bacterias), dificultando el uso de algoritmos clásicos de aprendizaje automático que requieren clases relativamente equilibradas para funcionar. Para abordar este problema, estamos explorando el uso de los métodos de aprendizaje de una clase, que son herramientas robustas para tratar con conjuntos de datos desequilibrados. Hemos probado un número impar de técnicas de aprendizaje de una clase combinadas con el paradigma de aprendizaje conjunto en datos médicos reales, que presentan resultados de precisión del 75% al 85%, fomentando la adaptación y la ampliación a nuestros datos de fagos-bacterias. Al usar gridsearch y frentes de Pareto para refinar los algoritmos, los resultados en los conjuntos de datos de interacciones fago-bacterias fueron prometedores, mostrando buen rendimiento y escalabilidad. El trabajo adicional podría incluir el desarrollo de nuevos métodos para la clasificación de una clase y su aplicación a otro tipo de datos reales, así como probar los algoritmos con interacciones negativas reales probadas.

Palabras claves: aprendizaje automático, interacción bacteria-bacteriófago, one-class learning, algoritmos, evaluación.


INTRODUCTION

Artificial intelligence (AI) is one of the branches of engineering that has had the most impact in recent years, not only technologically but also in the economic and social sectors.¹ This field is now immersed in the daily life of most people without them even realizing it and, thanks to the revolution of information and communication technologies, artificial intelligence keeps growing and gaining strength.

One of the areas of AI with the most demand nowadays is Machine Learning (ML). Its fame relies on the variety of applications it can have and on its high performance. Even though it has been used for more than 50 years, thanks to today's technological improvements and the accessibility of data and information, this sector of AI has had a great impact on the industrial, economic, social, technological and health sectors.² Its relevance lies in what can be done with it, because most situations in nature and daily tasks are represented through high-dimensional data that a human would not even be able to process, such as: the number of variables to measure in a greenhouse (e.g. temperature, humidity, pressure, percentage of oxygen, pH in the soil, etc.), the number of axes in a KUKA industrial robot (up to six axes, meaning six variables to control), and even the number of combinations of codons in DNA that lead to the generation of amino acids (64 possible combinations, meaning 64 dimensions to deal with). This is where ML plays an important role, solving a variety of high-dimensional complex problems in a very accurate and efficient way.

One of its applications is precisely in biology, consequently bringing benefits to areas of medicine. Many of the situations in these areas are high-dimensional, for example DNA sequences, the characteristics of plants and/or animals (e.g. width and height of a flower's petal, length of the plant stem, color of the plant's flower, etc.), human features (e.g. skin color, height, race, age, weight, sex, etc.) and health variables (e.g. blood pressure, body temperature, glucose level, platelet count, etc.), among others.³ Machine Learning makes it possible to classify, predict and analyze all these data in a very precise way, taking into account each of the factors involved regardless of whether the data have two, three or more dimensions, which is why it is a very efficient method in this area.

More specifically, throughout this project Machine Learning tools will be used to process, analyze and generate data on phage-bacteria pairs in order to enable efficient phage therapy. The advantage of using these tools is that the data are high-dimensional: the DNA and its respective transcription (RNA) of each of the bacteria and bacteriophages, as well as their interactions; processing them would be a highly tedious, slow and imprecise procedure without the help of computational tools.

1 MARADIAGA, Jorge Roberto. "La inteligencia artificial y su impacto en la sociedad" [en línea]. La Tribuna en línea. Tegucigalpa, Honduras. (Abril 20 de 2017). [Consultado el 14 de Febrero de 2018]. Disponible en: http://www.latribuna.hn/2017/04/20/la-inteligencia-artificial-impacto-la-sociedad/

2 HISTORY OF COMPUTING, VIDEO LECTURES. The History of Artificial Intelligence [en línea]. Washington, Estados Unidos de América: Universidad de Washington, Diciembre 1 de 2006. [Consultado el 14 de Febrero de 2018]. Disponible en: https://courses.cs.washington.edu/courses/csep590/06au/projects/history-ai.pdf

3 D'ALCHÉ-BUC, Florence y WEHENKEL, Louis. Machine Learning in Systems Biology [en línea]. De: Selected Proceedings of Machine Learning in Systems Biology, BMC Proceedings. 2008, vol. 2, supl. 4. [Consultado el 14 de Febrero de 2018]. Disponible en: https://bmcproc.biomedcentral.com/articles/10.1186/1753-6561-2-S4-S1


1 PROBLEM DESCRIPTION

The abuse and misuse of antibiotics has led to what is known as bacterial resistance, a condition in which the medicine becomes ineffective and the bacteria continue infecting the body. This issue has alarmed not only doctors and the scientific community but also organizations such as the World Health Organization, which has responded to the situation.⁴ In addition, in 2015 a global action plan on antibiotic resistance was launched, which aims to prevent and treat infectious diseases through effective and safe drugs and/or methods.

One of these alternative methods is the use of bacteriophages (also called phages), commonly known as phage therapy, which consists in introducing the phage into the infected organism so that it attacks only the bacteria.⁵ However, this therapy has several limitations, and one of them is finding the specific bacteriophage for a given bacterium: there are estimated to be trillions of bacteriophages, but each one attacks a specific bacterium. Currently, the process of finding phage-bacteria pairs is carried out in laboratories through infection tests, which can take between eight hours and two days.⁶ Therefore, when performing phage therapy, one of the most important factors is to find the correct phage for the treatment; if the infection is far advanced there would not be enough time to perform tests in the laboratory, which demands detailed information about the different bacteria-bacteriophage pairs in order to respond quickly to any situation. This is where Machine Learning begins to play a very important role in this problem, providing tools to classify pre-existing data on phage-bacteria pairs and to predict the missing pairs, making it possible to generate a detailed database accessible to the people who need this type of treatment.

4 WORLD HEALTH ORGANIZATION. Antibiotic resistance, descriptive note [en línea]. World Health Organization, Press Release. (Octubre de 2017). [Consultado el 14 de Febrero de 2018]. Disponible en: http://www.who.int/news-room/fact-sheets/detail/antibiotic-resistance

5 ORREGO, Rodrigo. "Fagoterapia: alternativa para el control de enfermedades bacterianas" [en línea]. Salmonexpert. Santiago, Chile. (Junio 18 de 2015). [Consultado el 14 de Febrero de 2018]. Disponible en: https://www.salmonexpert.cl/article/fagoterapia-alternativa-para-el-control-de-enfermedades-bacterianas/

6 MULLIN, Emily. "Virus e inteligencia artificial se unen contra las bacterias resistentes" [en línea]. MIT Technology Review. Boston, Estados Unidos de América. (Febrero 1 de 2018). [Consultado el 14 de Febrero de 2018]. Disponible en: https://www.technologyreview.es/s/9960/virus-e-inteligencia-artificial-se-unen-contra-las-bacterias-resistentes


Unfortunately, there is also another limitation to this approach. The pre-existing data on phage-bacteria pairs can be found in public databases such as GenBank and phagesdb.org, but these datasets contain highly imbalanced interaction data. This means that they contain only positive phage-bacteria interactions, whereas classic Machine Learning algorithms require relatively balanced data (i.e. approximately the same number of positive and negative phage-bacteria interactions). To address this problem, One-Class learning methods, which are robust tools for dealing with imbalanced datasets, are going to be explored, studied, implemented and validated.


2 JUSTIFICATION

This project will allow the Universidad Autónoma de Occidente to take part in an innovative field, not only in medicine but also in artificial intelligence. Although this field has been applied to medicine since the last century, it has returned with very good expectations and many investigations in progress. At the same time, it has led to applying artificial intelligence, specifically machine learning, in innovative ways, since this project uses very precise experimental data with consequences for human health. However, the most important motivation is the fact that human health is highly affected by resistance to antibiotics and an immediate and effective alternative is necessary to fight it, making this an extremely beneficial project for humanity. It also shows that laboratory procedures can be automated through Machine Learning, saving not only time, which is critical for this type of application, but also money and materials, including manpower. Finally, it should be mentioned that the health of laboratory workers is exposed to several risks in many of the procedures they perform, since they are vulnerable to biological hazards. Therefore, this project also has a humanistic foundation and an occupational safety and health approach.


3 STATE OF THE ART

There is a variety of Machine Learning applications in biology and medicine using both bioinformatic and medical data, for example in the analysis of diagnostic images⁷ (x-rays, mammography, ultrasound, etc.) and in the processing of bioinformatic data for classification and prediction⁸ (genomics, proteomics, systems biology, etc.). However, we are going to focus on examples of Machine Learning applied specifically to bacteria and their behaviors, and on the use of One-Class Learning methods.

One of the works carried out on the prediction of behavior between bacteria and phages is that of Carvalho et al.⁹ The work, entitled "Computational prediction of host-pathogen interactions through omics data analysis and machine learning", consists in the omics analysis of some bacteria and their respective phages (DNA structure, transcription, proteomics) in order to process these data through some of the most used machine learning algorithms (multilayer perceptron, decision tree, K-nearest neighbors and support vector machines). Good results and quite accurate predictions were obtained; however, there were several limitations, such as the imbalanced dataset they used and the fact that their ensemble learning was built with four algorithms (an odd number of algorithms is needed for a good majority-voting ensemble), among others.

Another related work is the one carried out by Venkatesh Vijaykumar¹⁰, which classifies different species of bacteria using machine learning and machine vision algorithms (support vector machines and deep convolutional networks respectively); it had very good results when classifying three types of bacteria, although, due to the lack of sufficient training data, the accuracy was low.

In the work entitled "Deep learning approach to bacterial colony classification"¹¹, deep convolutional networks were used to identify diagnostic images of bacterial colonies, in order to recognize some bacteria in patients without incurring recognition faults and to treat them assertively. They use classification methods such as decision trees and support vector machines, and for image recognition a deep convolutional network was used, resulting in a precision of 97%; however, they seek to extend the dataset in order to improve precision further. We also found a work done at the University of Chicago¹² which focuses on making predictions and classifications of bacterial resistance to several types of antibiotics; different types of decision trees were used, obtaining accuracies of up to 92%.

One of the works related to One-Class Learning is by Hawkins et al., entitled "Outlier detection using replicator neural networks"¹³. They describe a very robust One-Class Learning technique called the replicator neural network (RNN) and its mathematical foundation. They also tested the RNN on two datasets (network intrusion detection and Wisconsin breast cancer), turning both into imbalanced data (not including labels of the negative classes in the training process). They obtained high accuracies (between 68% and 77% confidence) on both datasets, identifying outliers successfully.

There is also the work of Irigoien, Sierra and Arenas¹⁴, which consisted in applying different One-Class Learning methods to real medical data. They selected ten different datasets from the UCI repository and obtained very good results, concluding that One-Class Learning methods are a strong approach to medical problems. Lastly, a paper entitled "A One-Class Classification Approach for Protein Sequences and Structures" by Bánhalmi et al.¹⁵ uses nine different One-Class Learning algorithms applied to two different protein sequence datasets. They compare these algorithms to conventional binary classification methods, obtaining very solid results and AUC scores above 90%.

7 BARBUZANO, Javier. "Inteligencia artificial aplicada al diagnóstico y tratamiento de enfermedades" [en línea]. World Diagnostic News. Buenos Aires, Argentina. (Mayo 14 de 2017). [Consultado el 14 de Febrero de 2018]. Disponible en: https://www.diagnosticsnews.com/empresas/26573-inteligencia-artificial-aplicada-al-diagnostico-y-al-tratamiento-de-las-enfermedades

8 RAZAVIAN, Narges. Application of Machine Learning in Computational Biology [en línea]. Universidad de Nueva York, Nueva York, Estados Unidos de América. 2004. [Consultado el 14 de Febrero de 2018]. Disponible en: http://people.csail.mit.edu/dsontag/courses/ml13/slides/lecture26.pdf

9 CARVALHO, Diogo Manuel, et al. Computational prediction of host-pathogenic interactions through omics data analysis and machine learning [en línea]. En: Springer-Verlag Berlin Heidelberg, 2011. [Consultado el 14 de Febrero de 2018]. Disponible en: https://drive.switch.ch/index.php/s/9mk6uRb1HtMNmZD?path=%2FPapers#pdfviewer

10 VIJAYKUMAR, Venkatesh. Classifying bacterial species using computer vision and machine learning [en línea]. En: International Journal of Computer Applications. Octubre de 2016, vol. 151, no. 8. Mumbai, India. [Consultado el 14 de Febrero de 2018]. Disponible en: http://www.ijcaonline.org/archives/volume151/number8/vijaykumar-2016-ijca-911851.pdf

11 ZIELINSKI, Bartosz; PLICHTA, Anna; MISZTAL, Krzysztof; SPUREK, Przemyslaw; BRZYCHCZY-WŁOCH, Monika y OCHOŃSKA, Dorota. Deep learning approach to bacterial colony classification [en línea]. En: PLoS ONE. Septiembre 14 de 2017, vol. 12, no. 9. [Consultado el 14 de Febrero de 2018]. Disponible en: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5599001/

12 SANTERRE, John; DAVIS, James; XIA, Fangfang y STEVENS, Rick. Machine learning for antimicrobial resistance [en línea]. En: arXiv:1607.01224. Julio 5 de 2016. [Consultado el 14 de Febrero de 2018]. Disponible en: https://arxiv.org/abs/1607.01224

13 HAWKINS, Simon; HE, Hongxing; WILLIAMS, Graham y BAXTER, Rohan. Outlier detection using replicator neural networks [en línea]. En: CSIRO Mathematical and Information Sciences. [Consultado el 4 de Marzo de 2018]. Disponible en: https://togaware.com/papers/dawak02.pdf

14 IRIGOIEN, Itziar; SIERRA, Basilio y ARENAS, Concepción. Towards application of one-class classification methods to medical data [en línea]. En: The Scientific World Journal, vol. 2014, Article ID 730712, 2014. [Consultado el 13 de Abril de 2018]. Disponible en: https://www.hindawi.com/journals/tswj/2014/730712/

15 BÁNHALMI, András; BUSA-FEKETE, Róbert y KÉGL, Balázs. A One-Class Classification Approach for Protein Sequences and Structures [en línea]. Springer-Verlag Berlin Heidelberg, ISBRA, LNBI 5542, pp. 310-322, 2009. [Consultado el 13 de Abril de 2018]. Disponible en: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.800.4532&rep=rep1&type=pdf


4 OBJECTIVES

4.1 GENERAL OBJECTIVE

Apply machine learning techniques to the process of detection of phage-bacteria interactions in order to reduce human intervention as much as possible.

4.2 SPECIFIC OBJECTIVES

● Investigate the process of detection of phage-bacteria interactions.
● Analyze the experimental data of phage-bacteria interactions previously obtained.
● Apply different machine learning techniques to the data.
● Evaluate the performance of the different machine learning algorithms.


5 THEORETICAL FRAMEWORK

It is important to know the basics of machine learning as well as the basics of the interaction between bacteria and their respective bacteriophages. Machine learning was born from the human need to solve highly complex problems that require a very high reasoning capacity; although computers and algorithms do not have the ability to reason, they can detect patterns, make predictions, classify and learn, sometimes even better than human beings.

There are two types of learning: supervised and unsupervised. In supervised learning the algorithm is trained from known input and output data, i.e. it is trained with both the questions and the answers to the problem (features and labels respectively) for further predictions. In unsupervised learning, on the other hand, the algorithm only knows and learns from the characteristics of the data, and it automatically classifies the data according to their similarity; that is, it groups similar data and generates labels based on the features. This type of learning focuses on data clustering or data reconstruction.

It is also important to know the phases of machine learning, which in most cases consist of a training phase and a test phase. When you have a dataset, it is advisable to divide it into two groups: one part of the set is used to train the algorithm and the other part to check that it works correctly. If the algorithm classifies the test data well, it learned correctly; otherwise there was probably some error during the training phase.
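As an illustration of the training/test division just described, here is a minimal scikit-learn sketch; the 70/30 split ratio and the example dataset (the Wisconsin Breast Cancer data used later in this document) are illustrative choices, not a prescription from this project:

```python
# Minimal sketch of the train/test protocol: part of the data trains the
# model, the held-out part checks that it generalizes.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out 30% of the samples for the test phase (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
```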

5.1 SUPERVISED LEARNING BASICS

Although no supervised learning algorithms are going to be used during this project, only One-Class Learning algorithms (which are a type of unsupervised learning), it is important to know the basics of supervised learning because of the analogy One-Class Learning methods have to them.


5.1.1 Artificial Neural Networks (ANN)

The complexity of the human brain and its ability to process data in such a precise and rapid way has motivated the scientific community to decipher its functioning. This has given way to algorithms and computational procedures based on functions performed by the brain; one of the most famous is the artificial neural network, which was one of the first branches of machine learning.

One of the most widely used and popular structures of artificial neural network today is the multilayer perceptron (MLP). It consists of several simple neurons connected to each other, as seen in Figure 1. Its applications are very broad, which is why it has been one of the most used algorithms over the years; it can perform classification, regression (prediction), among other tasks. It has its own learning method, called backpropagation, which uses gradient descent.

Figure 1. MLP graph structure

Fuente: M.U. de ARAUJO, Fabio. Figure 7 [figura]. EN: Assessment and Certification of Neonatal Incubator Sensors through an Inferential Neural Network. Sensors 2013,13, 15613-15632; doi:10.3390/s131115613. Web. Noviembre 15 de 2013. p 15621.
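A minimal sketch of an MLP classifier follows, using scikit-learn's MLPClassifier, which trains with backpropagation; the layer sizes and other values are illustrative assumptions, not the configuration used in this project:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Two hidden layers; weights are fitted by gradient-descent backpropagation.
mlp = MLPClassifier(hidden_layer_sizes=(16, 8), activation='relu',
                    solver='adam', max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print('test accuracy:', mlp.score(X_test, y_test))
```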

Page 20: EVALUACIÓN DE MÉTODOS DE APRENDIZAJE AUTOMÁTICO …

20

5.1.2 K-nearest neighbors (KNN)

Another of the most used machine learning algorithms for classification is KNN, although its accuracy is not the best. As its name indicates, this algorithm uses the closest data points to predict or classify a sample; the number of nearby points to consider depends on the factor K. As a discriminating factor it uses the Euclidean distance between data points to determine their proximity, regardless of the dimensionality of the dataset.

Figure 2. K-nearest neighbor structure example

Fuente: BRONSHTEIN, Adi. Example of K-NN Classification [Figura]. EN: A Quick Introduction to K-Nearest Neighbors Algorithm. Medium Corporation US. Web. Abril 11 de 2007
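A minimal sketch, with K = 5 and the Euclidean metric as illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Classify each test sample by majority label among its 5 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train, y_train)
print('test accuracy:', knn.score(X_test, y_test))
```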

5.1.3 Decision trees

Decision trees are widely used in machine learning because of their interpretability, as they can be represented as if-else commands. These algorithms are based on nodes connected to each other in a descending way, where each node can be split into further nodes or decisions. It can be said that the algorithm asks the dataset about its features and, depending on each answer, advances to another question or makes a decision. They can be used to predict or classify and, as the name implies, to make decisions as well.


The learning process of the tree is based on determining which is the most important feature to start with, in order to effectively discriminate a data point, then assigning value to the answer and moving on to the next question; for this process it uses different criteria such as impurity, the Gini index, information gain and entropy, among others. Figure 3 shows a very simple but illustrative example of the basic operation of this algorithm.

Figure 3. Decision tree graph example

Fuente: SANJEEVI, Madhu. Without Name [Figura]. EN: Chapter 4: Decision Trees Algorithms. Medium Corporation US. Web. Octubre 6 de 2017
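A minimal sketch using the Gini index as the splitting criterion (an illustrative choice; entropy, i.e. information gain, would work as well):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# At each node the tree picks the feature/threshold that best reduces
# Gini impurity; max_depth limits the chain of if-else questions.
tree = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=0)
tree.fit(X_train, y_train)
print('test accuracy:', tree.score(X_test, y_test))
```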

5.1.4 Support Vector Machines (SVMs)

Support vector machines are algorithms very similar to artificial neural networks because of their feedforward structure. However, this algorithm stands out for its good performance in data classification, since it uses a technique called the "kernel trick", which transforms the input data into a much simpler feature space in which an exact separation between the data can be generated, maximizing the margin between them and optimizing the classification.

The strength of the SVM lies in its capability to transform a non-linearly separable feature space into one that is linearly separable using the kernel trick; this technique transforms the original feature space using different kinds of "kernels" or functions, such as the polynomial function, the radial basis function (RBF), the sigmoid function, etc. Figure 4 shows how the support vector machine works after applying the kernel trick, by finding the separating hyperplane between two classes and maximizing its margin, thus being the most optimal among the infinity of possible separating planes that may exist.

Figure 4. Support vector machine description

Fuente: Without Author. Without Name [Figura]. EN: Introduction to Support Vector Machines. OpenCV tutorials. Web. Febrero 14 de 2014
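A minimal sketch of the kernel trick in practice, using an RBF kernel as an illustrative choice:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# The RBF kernel implicitly maps the data to a space where a maximum-margin
# hyperplane can separate the two classes (the "kernel trick").
svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_train, y_train)
print('test accuracy:', svm.score(X_test, y_test))
```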


5.2 ONE-CLASS LEARNING

Most real data in nature and in daily problems give us only a representation of one class's targets, or give us a very imbalanced dataset. This means that the classes are not represented equally and that there are many more examples of one class than of the other, which makes one class insignificant. This is a big problem for classic supervised learning algorithms, which work with binary classification and need two equally distributed classes to perform correctly; one of the most common issues is the accuracy paradox, where a model shows an excellent accuracy that only reflects the accuracy on the predominant class.

Specifically, the public databases of phage-bacteria interactions from GenBank and phagesdb.org consist only of positive interactions between them; this is because the laboratories that contributed to the databases were not expected to register negative interactions of phages and bacteria, only positive ones. This is a perfect example of One-Class classification, where only examples from one of the classes are available but there is also the need to predict the other, unknown class.

In the following sections, each of the One-Class Learning algorithms used in this project will be described, considering that they have some similarities with the supervised learning algorithms described before. Often, samples belonging to the positive or known class are referred to as inliers, while samples of the negative or unknown class are referred to as outliers.

5.2.1 Replicator Neural Network (RNN)

The RNN is a type of neural net that, as its name indicates, replicates the input data in order to learn a representation of it. The structure is very similar to an MLP, but as it only uses samples from one class, the target of the net is the same as its input. It sounds redundant, but this structure allows the neural net to learn to reconstruct the input very well, meaning that it will learn to reconstruct data of one single class; when the net tries to reconstruct data from the other class, it will not know how to, and this is where the reconstruction error plays an important role.

Figure 5. Replicator Neural Network structure

Fuente: HAWKINS, Simon. Figure 1 [Figura]. EN: Outlier Detection Using Replicator Neural Networks. Data Warehousing and Knowledge Discovery. DaWaK 2002. Lecture Notes in Computer Science, vol 2454. Springer, Berlin, Heidelberg. Web. Septiembre 2 de 2002.
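A minimal sketch of this replicate-the-input idea, written with Keras; the layer sizes, epoch count, MSE threshold and the random placeholder data are assumptions for demonstration, not this project's configuration:

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X_pos = rng.normal(size=(200, 30))   # placeholder known-class samples
X_new = rng.normal(size=(50, 30))    # placeholder samples to score

n_features = X_pos.shape[1]
model = keras.Sequential([
    keras.layers.Input(shape=(n_features,)),
    keras.layers.Dense(n_features // 2, activation='tanh'),  # hidden units as a fraction of inputs
    keras.layers.Dense(n_features, activation='linear'),
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_pos, X_pos, epochs=50, verbose=0)  # the input is also the target

# Reconstruction error per sample: a high MSE means the net never learned
# anything similar, so the sample is flagged as an outlier.
mse = np.mean((model.predict(X_new, verbose=0) - X_new) ** 2, axis=1)
threshold = 1.5                                # assumed; tuned on validation data
is_outlier = mse > threshold
```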

The learning method is backpropagation, as in an MLP, but this time the mean squared error (MSE) reduced in each iteration is also the reconstruction error. When a sample has a very high MSE it is an outlier; on the contrary, a low MSE means the sample is an inlier, yielding the correct classification. In this project we modify three important hyperparameters of this algorithm: the number of epochs the network is trained for, the threshold on the mean squared error that determines whether a sample is an outlier or an inlier, and the number of hidden units, which is given as a fraction of the number of input units.

5.2.2 Local Outlier Factor (LOF)

The idea of LOF is very similar to KNN, but the difference is that it considers the local density deviation of each sample compared to its neighbors. This means that an outlier is detected when it has a much lower local density than its K neighbors, while inliers have much higher density as they are thought to have similar characteristics. In figure 6 we can observe that O1, O2 and O3 are local outliers to C1, while from a global perspective O4 is not an outlier because of the global density between the centroids C1 and C2; this is the main difference between regular clustering methods and LOF. It can be categorized as a One-Class Learning algorithm because it determines the local density of the known class and classifies according to the local outlier factor: if a new sample lies at a high distance from the local density (a high local outlier factor), it is classified as an outlier; otherwise it is contained among the neighbors.

Figure 6. Nearest neighbor's density representation

Fuente: CHEPENKO, Daniel. Without Name [Figura]. EN: A Density-based algorithm for outlier detection. Medium Corporation US. Web. Septiembre 15 de 2011.

The two important hyperparameters modified for this algorithm are the number of neighbors considered for training and prediction, and the type of algorithm used to compute the nearest-neighbor distances.
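A minimal sketch with scikit-learn's LocalOutlierFactor in novelty mode, which allows fitting on the known class only and predicting on unseen samples; the hyperparameter values and placeholder data are illustrative:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_pos = rng.normal(size=(200, 30))   # placeholder known-class samples
X_new = rng.normal(size=(50, 30))

lof = LocalOutlierFactor(n_neighbors=20,    # neighbors considered
                         algorithm='auto',  # neighbor-search algorithm
                         novelty=True)      # enables predict() on new data
lof.fit(X_pos)
pred = lof.predict(X_new)                   # +1 = inlier, -1 = outlier
```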


5.2.3 Isolation Forest (iTree)

The iTree is an ensemble of randomly built decision trees, with the difference that they are trained with only one of the classes, trying to isolate, as the name says, the outlier samples. Isolation occurs when the decision path ends very near the root (as shown at the top of figure 7), meaning that the random partitioning of the trees produces shorter paths for anomalies. This early partitioning is due to the high difference between the features of inliers and outliers; decision trees normally converge in fewer splits when a feature value is clearly distinguishable from normal values.

Figure 7. Isolation Forest graphical explanation

Fuente: Without Author. Schema 1 [Figura]. EN: General Electric Digital, Isolation Forest Outlier Detection. Web. Junio 11 de 2016.
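A minimal sketch with scikit-learn's IsolationForest, exposing the three hyperparameters discussed in this section; the values and placeholder data are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_pos = rng.normal(size=(200, 30))   # placeholder known-class samples
X_new = rng.normal(size=(50, 30))

iso = IsolationForest(n_estimators=100,   # number of trees
                      max_samples=128,    # samples drawn per tree
                      max_features=0.8,   # fraction of features per tree
                      random_state=0)
iso.fit(X_pos)
pred = iso.predict(X_new)                 # +1 = inlier, -1 = outlier
```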

At the bottom of figure 7 we can see the isolation forest behavior: the outlier (Xo) was classified using just four partitions, while the inlier (Xi) required eleven, embracing the idea of shorter paths for anomalies. Using a 2D representation we can see that the inliers tend to accumulate, as they are considered to have similar features (the known class), while the outliers (the unknown class) tend to be "isolated" and far from the inliers, as they are considered to have anomalous or very different characteristics; this is why the outliers are easier to detect and classify than the inliers. For this algorithm, the number of estimators (number of trees used for predicting) is going to be modified, as well as the maximum number of samples used in each tree and the number of features used (given as a fraction of the total number of features).

5.2.4 One-Class Support Vector Machine (OC-SVM)

The OC-SVM is very similar to the regular SVM, as it applies the feature-space transformation and then tries to maximize a hyperplane margin in order to separate or classify the data in an optimal way; but as it only uses data points from one class, the training process is different. In the feature space, it generates a boundary from the origin of the data points and maximizes the distance between the boundary (hyperplane) and the origin, creating a separation between inliers and outliers. In figure 8 we can see, on the left, the original non-linearly separable feature space and, on the right, the transformed linearly separable one. The boundary is created from the origin of the data points up to the support vectors; all samples above the boundary are considered inliers and those below it are considered outliers, bearing the margin in mind as well.

Figure 8. Graphical representation of OC-SVM

Fuente: GUERBAY, Yasmine. Fig. 6 [Figura]. EN: The effective use of the one-class SVM classifier for handwritten signature verification based on writer-independent parameters. ELSEVIER,Vol 48, no. 1, Enero 2015, P 103-113.

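A minimal sketch with scikit-learn's OneClassSVM; the kernel and parameter values are illustrative choices, and the hyperparameters themselves are discussed right below:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_pos = rng.normal(size=(200, 30))   # placeholder known-class samples
X_new = rng.normal(size=(50, 30))

# nu bounds the fraction of support vectors; degree would only matter
# with kernel='poly'.
ocsvm = OneClassSVM(kernel='rbf', nu=0.1, gamma='scale')
ocsvm.fit(X_pos)
pred = ocsvm.predict(X_new)          # +1 = inlier, -1 = outlier
```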

This algorithm considers several hyperparameters: the type of kernel used in the SVM; the Nu parameter, which determines the fraction of support vectors used; the Degree of the polynomial function when used as a kernel; and Gamma, the kernel coefficient.

5.2.5 Elliptic Envelope (FAST-MCD)

The elliptic envelope is based on the minimum covariance determinant (MCD) method, which is a robust estimator of multivariate data. It uses the Mahalanobis distance as a discriminant factor for high-dimensional data, as well as the minimum volume ellipsoid (MVE), as shown in figure 9. The foundation of this process is fitting the known data to a Gaussian density and then calculating each data point's Mahalanobis distance in order to optimize the minimum covariance determinant.

Figure 9. Elliptic envelope graphical representation

Fuente: Without Author. Mahalanobis distances of contaminated datasets [Figura]. EN: 2.7. Novelty and Outlier Detection. scikit-learn: Machine Learning in Python. Web.
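A minimal sketch with scikit-learn's EllipticEnvelope, exposing the two hyperparameters discussed below; the values and placeholder data are illustrative:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
X_pos = rng.normal(size=(200, 30))   # placeholder known-class samples
X_new = rng.normal(size=(50, 30))

ee = EllipticEnvelope(support_fraction=0.9,   # fraction of points kept for the MCD
                      assume_centered=False)  # also estimate a robust location
ee.fit(X_pos)
pred = ee.predict(X_new)             # +1 = inlier, -1 = outlier
```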

For the Elliptic Envelope we are going to modify the Support Fraction, which is the fraction of points to be included when computing the MCD, and also the Assume Centered parameter, which determines whether or not to compute a robust location and recompute a covariance from it.

5.3 ENSEMBLE LEARNING

Ensemble models are used to improve the individual performance of each algorithm. They consist in combining the results of different algorithms in order to reduce the likelihood of the model misclassifying a sample. There are several types of ensemble learning techniques, but in this project we are going to use the simplest and most intuitive one, called the bagging technique, which can help reduce the variance and avoid overfitting of the models. The idea of bagging is simple: the individual results of each algorithm are the new inputs of the model, and by simple majority voting the final prediction of the ensemble is made, as illustrated in figure 10.

Figure 10. Ensemble learning graph
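A minimal sketch of the majority vote; the list of fitted models is hypothetical and stands for the one-class detectors described in the previous sections:

```python
import numpy as np

def majority_vote(models, X):
    """Majority vote over one-class models that each return +1 (inlier)
    or -1 (outlier); an odd number of models avoids ties."""
    votes = np.array([m.predict(X) for m in models])  # (n_models, n_samples)
    return np.sign(votes.sum(axis=0))

# Example (hypothetical fitted models from the sections above):
# final = majority_vote([lof, iso, ocsvm, ee, rnn_detector], X_new)
```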

5.4 GRIDSEARCH

Machine learning models have several variables to adjust in order for them to perform correctly; these variables are often called hyperparameters. Some models are more sensitive to changes in these parameters than others, and most of the time the results of each model are highly dependent on the correct choice of these parameters.


In order to choose the correct configuration of hyperparameters for each model, a method dubbed Gridsearch is going to be implemented, with a slight variation using Pareto efficiency. Gridsearch is a very straightforward algorithm: as its name states, it searches for the best configuration of parameters over a grid of them. In other words, it trains the algorithms with all possible combinations of parameters and then chooses the combination with the highest performance.

5.5 PARETO EFFICIENCY

As said before, the best parameter configuration is chosen by the highest performance according to a performance metric (accuracy or F1-score are commonly used), but when the models are evaluated on two or more metrics, the Gridsearch process becomes more complicated, as it is not known which performance metric should prevail over the others. For One-Class learning algorithms it is more convenient to use different types of metrics (not only accuracy or F1-score); that is why in this project seven different metrics are going to be used to validate and test each algorithm. Here is where Pareto efficiency takes an important role, because Gridsearch is no longer based on only one performance metric but on seven different ones, which creates a more complex situation at the time of choosing one hyperparameter configuration.

Figure 11. Pareto efficiency graph

Fuente: PETTINGER, Tejvan. Without Name [Figura]. EN: Pareto Efficiency. Economics.Help. 2017.


Pareto efficiency comes from economics and is defined by the operating frontier where it is not possible to increase the output of item "Y" without reducing the output of item "X". In a two-dimensional space, a hyperparameter configuration is Pareto efficient if no other configuration is better in one or both metrics without being worse in the other. In our case the space is seven-dimensional, and the algorithm will try to choose the best configuration based on seven different performance metrics.
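A minimal sketch of this multi-metric selection: enumerate the grid, score each configuration on several metrics, and keep only the non-dominated configurations; the grid values and the random scores are placeholders for the real cross-validation results:

```python
import itertools
import numpy as np

# Exhaustive grid over assumed hyperparameter values.
grid = {'nu': [0.05, 0.1, 0.2], 'gamma': [0.01, 0.1, 1.0]}
configs = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]

def dominated(p, q):
    """True if metric vector p is dominated by q (q >= p everywhere, > somewhere)."""
    p, q = np.asarray(p), np.asarray(q)
    return bool(np.all(q >= p) and np.any(q > p))

def pareto_front(score_vectors):
    """Indices of the configurations not dominated by any other one."""
    return [i for i, p in enumerate(score_vectors)
            if not any(dominated(p, q) for j, q in enumerate(score_vectors) if j != i)]

# score_vectors[i] would hold the seven cross-validated metrics of configs[i];
# random numbers stand in for them here.
score_vectors = np.random.default_rng(0).random((len(configs), 7))
best = [configs[i] for i in pareto_front(score_vectors)]
```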

In figure 11 we can observe four different points, of which three are Pareto efficient and one is not. Points A, B and C are Pareto efficient because each of them is high on one or both axes without making the other axis worse. Focusing only on points A, B and C: point A is the highest on the "Goods" axis but the lowest on the "Service" axis, point C is the highest on the "Service" axis but the lowest on the "Goods" one, and B is in between; point A is Pareto efficient because it is better than B and C on the "Goods" axis without dominating them, since B is better than A on the "Service" axis and C even more so. On the other hand, point D is the lowest on both axes compared with points A and B; even though point D is higher than point C on the "Goods" axis, it is still worse than A and B, making it Pareto inefficient. There can be more than one Pareto efficient point, and when this happens they form a frontier of non-dominated points called the Pareto front. In this project we are going to use the Pareto front to determine which configurations are the most efficient.

5.6 PERFORMANCE METRICS

In order to test a specific machine learning algorithm, whether supervised or unsupervised, several performance metrics can be used to measure its quality and certainty. First, we need to define several terms associated with each metric: true positives (TP) are the cases where a positive sample is correctly classified as such, and true negatives (TN) are the cases where a negative sample is classified as such. We also have false positives (FP), the cases in which a negative sample is classified as positive, and false negatives (FN), the opposite, when a positive sample is classified as negative. These terms can be summarized in the following figure, commonly known as the confusion matrix.


Figure 12. Confusion matrix

Fuente: RASCHKA, Sebastian. Confusion Matrix [Figura]. EN: CONFUSION MATRIX. Github.

5.6.1 Accuracy

Accuracy is one of the most used performance metrics in machine learning. It measures the overall performance of an algorithm as the ratio of correct predictions to all predictions made. It should only be used with a nearly balanced dataset, as otherwise it falls into the accuracy paradox. It can be read as the percentage of times the algorithm makes a correct prediction.
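In terms of the confusion-matrix counts defined above:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]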

5.6.2 Sensitivity (or Recall)

Sensitivity tells us the proportion of samples that are positive and were classified as such. In other words, it measures the ability of the algorithm to correctly classify the positive class. A highly sensitive algorithm produces few false negative results, i.e. fewer positive samples are misclassified.
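In terms of the confusion-matrix counts:

\[ \text{Sensitivity} = \frac{TP}{TP + FN} \]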


5.6.3 Specificity

Specificity, on the other hand, is the counterpart of sensitivity, as it measures the ability of an algorithm to correctly classify the negative class. A highly specific algorithm produces few false positives, and therefore fewer negative samples are misclassified.
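In terms of the confusion-matrix counts:

\[ \text{Specificity} = \frac{TN}{TN + FP} \]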

5.6.4 Positive Predictive Value (PPV) or precision

PPV, or precision, measures the proportion of samples classified as positive that are actually positive. In other words, PPV answers the question: if a sample is predicted as positive, what are the chances of it really being positive?
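In terms of the confusion-matrix counts:

\[ \text{PPV} = \frac{TP}{TP + FP} \]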

5.6.5 Negative Predictive Value (NPV)

NPV, on the other hand, is the counterpart of PPV, as it measures the percentage of samples classified as negative that are truly negative.
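In terms of the confusion-matrix counts:

\[ \text{NPV} = \frac{TN}{TN + FN} \]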


5.6.6 F1-Score

The F1-Score is also one of the most used performance metrics because it combines two of the previous metrics: PPV and sensitivity. It is the harmonic mean of PPV and sensitivity, and it is mostly used when we want to seek a balance between these two metrics; it can be interpreted as predicting the positive classes as such while being careful about including negative ones.
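As the harmonic mean of the two metrics it combines:

\[ \text{F1-Score} = 2 \cdot \frac{\text{PPV} \cdot \text{Sensitivity}}{\text{PPV} + \text{Sensitivity}} \]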

5.7 PHAGE-BACTERIA INTERACTION

The bioinformatics area, on the other hand, provides the knowledge necessary to analyze the behavior of bacteria according to their omic state; that is, their genomics and transcriptomics (DNA and RNA sequences respectively) are used to determine patterns in organisms. These are the data that each of the machine learning methods will use.

Bacteriophages, or just phages, are the most abundant viruses on Earth and are even more common than bacteria. They are very important in the evolution of bacteria, as they regulate their population growth. They also have a high impact on the global carbon cycle and therefore on the global climate. Phages were formally discovered in the 20th century; at first they were thought to be viral but, due to their ability to reduce bacterial infections, they were for a time considered non-viral and were described as "bacteria eaters".

The phage's primary task is to infect the bacterium: after attaching to it, the phage produces new phages and then releases them onto other bacterial cells. This task regularly ends in lysis, which consists in destroying the outer layer of the bacterium and releasing the new phage particles that were formed inside the infected cell; these phage virions then start infecting new bacterial cells. The lytic cycle is graphically explained in figure 13: the first step is the attachment of the phage to the bacterium (bearing in mind that the receptor-binding proteins play an important role in this step); then the phage DNA is introduced while the bacterium's DNA is degraded. Once inside the bacterium, the phage starts to synthesize proteins in order to assemble new phages and break the bacterium from the inside, releasing the new phages towards other bacterial cells.

Figure 13. Phage therapy microscopic process

Fuente: SEHIC, Emina. Lytic Cell Summary [Figura]. EN: BACTERIOPHAGE T4. Google Sites.

The receptor-binding proteins (RBP) of phages are what allow them to bind to the receptors of a bacterium. There are different types of RBP and some of them are very specific, meaning that not every phage can bind to every bacterium. The infectivity between phages and bacteria can vary drastically, even across bacterial strains of the same species. Therefore, as said before, it is very important for phage therapy to find the exact match between the therapeutic phage and the host bacterium.

Here is where protein-protein interactions (PPI) take place, because the relation between a phage and its host is mainly due to the interaction of their encoded proteins. Bacteria can encode approximately 3,000 proteins, while phages express on average 74 proteins, giving approximately 220,000 PPIs for each phage-bacteria interaction; since each pair can have a different number of PPIs, a post-processing of the feature extraction must be done. But in order to analyze the PPIs of each phage-bacteria pair it is necessary to analyze a functional subunit of the proteins: the domain. This is because a PPI occurs when one or more bindings between pairs of their domains take place. This other type of interaction is called a domain-domain interaction (DDI).


6 METHODOLOGY

For a better understanding of the methodology explained below, Figure 14 illustrates the process as a flow diagram.

Figure 14. Summary of the methodology


6.1 EXPLORATORY PROCEDURE

The exploratory process consists in the search for suitable and effective One-Class Learning methods in order to apply them later; very precise information about each method is needed, as well as studying them in detail. An odd number of different algorithms will be selected for further implementation (an odd count allows majority voting later on). Each algorithm is then implemented using the Python programming language and different machine learning libraries such as scikit-learn, Keras and TensorFlow.

For exploratory purposes, and to better understand each algorithm's functioning, two public datasets are going to be used: the Wisconsin Breast Cancer and the MILE Leukemia datasets. Even though these are not imbalanced datasets, they are pre-processed so as to pose a One-Class classification problem in each case (i.e. training each model with only one of the classes of the dataset and leaving the other class apart for validation purposes). Wisconsin Breast Cancer has 30 features and 569 samples, of which 357 are positive class (benign) and 212 negative class (malignant). The MILE Leukemia dataset has 210 features and 2096 samples, of which 750 are positive class (non-leukemia) and 1346 negative class (leukemia). Each model was trained using only samples from the positive class.

Having implemented and tried each algorithm on real medical data, it is convenient to try them on real phage-bacteria interaction data and observe their functioning. Three datasets provided by HEIG-VD and the Inphinity project of the research group CI4CB are going to be used, containing samples of different phage-bacteria interactions described by protein-protein interactions (PPI) and domain-domain interactions (DDI). All three datasets have 2130 normalized samples, differing in feature size: 5, 10 and 15 features respectively. On top of the individual algorithms, an ensemble method (a voting classifier) is added, which improves the overall performance of the process. After this exploratory process, the final datasets of phage-bacteria interactions are used to train, validate and refine each of the algorithms, as well as to make the respective predictions.
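The protocol just described can be sketched as follows; the OC-SVM and its settings are an illustrative stand-in for any of the five methods:

```python
# Fit on the positive (benign) class only, then validate on both classes.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.svm import OneClassSVM

X, y = load_breast_cancer(return_X_y=True)
X_pos = X[y == 1]                    # the 357 benign samples = known class

model = OneClassSVM(kernel='rbf', nu=0.1, gamma='scale').fit(X_pos)
pred = (model.predict(X) + 1) // 2   # map +1/-1 to 1/0 to match the labels
print(confusion_matrix(y, pred))     # TP, TN, FP, FN over both classes
```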


6.2 DATASETS GENERATION AND FEATURES EXTRACTION

For the final part, several datasets of phage-bacteria interactions containing PPIs and DDIs are going to be used, divided into different numbers of features (e.g. 5, 10, 15 or 20 features); an additional dataset presenting the chemical composition of each RBP in the interaction is also used. The collection of interactions was made from two databases (PhagesDB and GenBank) at the species level, but the training process was done at the strain level. The negative interactions for the validation and test process were assumed from the fact that phages are very specific to each bacterial strain: if a phage is known to interact with one bacterial species, it is assumed not to interact with the others. Class equilibrium is achieved by generating the same number of negative interactions as positive ones.

Figure 15. Taxonomy distribution of the bacteria's family name used

As said before, domains are structural subunits of proteins, and interactions between phage proteins and bacteria proteins often involve interactions between domains. A database called DOMINE was used to extract each DDI, which is described by a score for each interaction. A PPI score is calculated as the sum of all its DDI scores. Different datasets were made from these interactions by dividing the frequencies of scores into different numbers of bins or different sizes of bins, and by considering or not the PPIs with score zero.

There was another type of dataset based on the chemical composition of the PPIs. Twenty-one features of this dataset consist of the relative abundance of amino acids in the protein sequence, while another five features consist of the abundance of some important elements such as oxygen, carbon, hydrogen, nitrogen, etc. One more feature was extracted from the molecular weight of the protein and, as each interaction consists of two proteins, there were 54 features in total. But as each phage-bacteria interaction has approximately 250 thousand PPIs, the dimensionality of this dataset was higher than expected, so a dimensionality reduction technique (dubbed Principal Component Analysis) was applied, resulting in only 108 features for this last dataset.

6.3 DATASETS DESCRIPTION

DS_CH corresponds to the last dataset explained, while DS_ZBX are the datasets using score zero in PPIs, X being the number of bins; DS_BX are the datasets that do not consider score zero in the PPIs, X being the number of bins; and DS_SX are datasets that do not consider score zero in PPIs, X being the size of the bins. Notice that there are no datasets that vary the size of the bins while considering score zero for PPIs; this is for practicality and faster results. Each dataset contains exactly 4594 samples, which are going to be divided as shown in Figure 16. It is important to consider that 10-fold cross-validation is going to be implemented, using only positive interactions in the training process but both positive and negative interactions for the validation process. Metrics such as sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV) and F1-score are going to be extracted from this process. Notice that in Figure 16 we have the same number of positive and negative samples, even though the negative samples were artificially generated under the substantiated assumption explained before; we must remember that the original dataset was highly imbalanced, but in order to validate the algorithms' performance the need for a negative class prevailed (clarifying that all the algorithms in this project were trained only with the positive class but validated and tested with both).
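A hypothetical sketch of the binning step just described, turning the variable-length list of PPI scores of one phage-bacterium pair into a fixed-size feature vector; the function name, bin count and score range are assumptions, not this project's exact pipeline:

```python
import numpy as np

def bin_ppi_scores(ppi_scores, n_bins=10, keep_zero=False, score_range=(0.0, 1.0)):
    """Histogram of PPI scores as normalized bin frequencies."""
    scores = np.asarray(ppi_scores, dtype=float)
    if not keep_zero:
        scores = scores[scores != 0.0]   # DS_B*/DS_S*-style: drop zero-score PPIs
    hist, _ = np.histogram(scores, bins=n_bins, range=score_range)
    return hist / max(hist.sum(), 1)     # normalized frequency per bin

features = bin_ppi_scores([0.0, 0.2, 0.21, 0.7, 0.9, 0.9], n_bins=5)
```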


Figure 16. Dataset partitioning for cross-validation process
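To make this validation scheme concrete, the following is a minimal sketch in Python with scikit-learn; the array names (X_pos, X_neg), the random data and the fold sizes are hypothetical stand-ins for the real feature matrices, not the project's actual code.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import OneClassSVM
from sklearn.metrics import confusion_matrix

# Hypothetical feature matrices: X_pos holds the positive interactions,
# X_neg the artificially generated negative ones (same amount of each).
X_pos = np.random.rand(2297, 54)
X_neg = np.random.rand(2297, 54)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X_pos):
    model = OneClassSVM(kernel="rbf", nu=0.1, gamma=0.5)
    model.fit(X_pos[train_idx])            # train on the positive class only

    # Validate on held-out positives plus an equal share of negatives.
    X_val = np.vstack([X_pos[val_idx], X_neg[val_idx]])
    y_val = np.r_[np.ones(len(val_idx)), -np.ones(len(val_idx))]
    y_hat = model.predict(X_val)           # +1 = interaction, -1 = outlier

    tn, fp, fn, tp = confusion_matrix(y_val, y_hat, labels=[-1, 1]).ravel()
    sensitivity, specificity = tp / (tp + fn), tn / (tn + fp)
    print(f"sensitivity={sensitivity:.3f}  specificity={specificity:.3f}")
```

The remaining metrics (accuracy, PPV, NPV, F1-score) follow from the same confusion matrix.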

6.4 HYPERPARAMETERS SELECTION

Gridsearch is also going to be implemented for each algorithm in order to select the parameters that yield the best performance on each of the metrics during the cross-validation process, refining the search afterwards with Pareto optimality. Heatmaps are also going to be used for a better understanding and visualization of the gridsearch process. Once the best parameters and the best-performing datasets are identified, it is appropriate to run the process on the test dataset and analyze the metrics again in order to predict interactions and draw conclusions from the results.
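Because the models are trained on the positive class only, a standard labeled grid search does not apply directly; a manual loop over the parameter grid suffices. The sketch below is hedged: cross_validate_one_class is a hypothetical helper standing in for the 10-fold scheme of section 6.3, and the grid shown is a small illustrative subset of Table 3.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid
from sklearn.svm import OneClassSVM

def cross_validate_one_class(model):
    # Placeholder for the positive-only 10-fold procedure of section 6.3;
    # it should return the averaged metrics for this configuration.
    return {"accuracy": np.random.rand(), "f1_score": np.random.rand()}

grid = ParameterGrid({"gamma": [0.1, 0.5, 1.0],
                      "nu": [0.1, 0.3, 0.5],
                      "kernel": ["poly", "rbf"]})

results = []
for i, params in enumerate(grid):          # one iteration per combination
    metrics = cross_validate_one_class(OneClassSVM(**params))
    results.append({"iteration": i, **params, **metrics})
# `results` can then be exported to a spreadsheet and drawn as a heatmap.
```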


7 RESULTS AND ANALYSIS

7.1 EXPLORATORY PROCEDURE

The results of the exploratory process are shown below. Good performance can be observed, meaning that the algorithms are suitable for this type of imbalanced dataset. These results encouraged us to use the algorithms on real phage-bacteria interaction datasets. It should be mentioned that the algorithms used during the exploratory process were manually tuned for better performance, meaning that the hyperparameters were chosen heuristically without the use of any refinement technique.

Table 1. Exploratory results for Cancer dataset

Algorithm Accuracy F1_Score Sensitivity Specificity PPV NPV

RNN 0.904 0.753 0.953 0.896 0.622 0.99

LOF 0.612 0.752 0.886 0.074 0.652 0.25

ELL 0.796 0.758 0.944 0.721 0.633 0.962

ISO 0.843 0.807 0.972 0.778 0.690 0.982

OSVM 0.875 0.903 0.882 0.861 0.925 0.788

Table 2. Exploratory results for MILES Leukemia dataset

Algorithm Accuracy F1_Score Sensitivity Specificity PPV NPV

RNN 0.903 0.689 0.751 0.928 0.637 0.957

LOF 0.741 0.833 0.993 0.274 0.717 0.956

ELL 0.840 0.630 0.951 0.821 0.471 0.990

ISO 0.938 0.812 0.924 0.941 0.724 0.986

OSVM 0.940 0.770 0.693 0.982 0.866 0.950

We can observe in both tables that very high performance scores were obtained overall, giving us confidence in the proposed algorithms, even though the LOF algorithm tends to overfit the positive data, showing very high sensitivity but very low specificity in both cases.

It is noteworthy that the previous results are merely exploratory; their purpose was to observe how the One-Class Learning algorithms behave and perform on real medical datasets. These results also helped us confirm that the algorithms are reliable, because in these datasets we knew both classes with certainty and the algorithms were capable of correctly classifying both even though they were trained with only one class. Therefore, the results encouraged us to follow the methodology and try the algorithms on the final datasets, whose results follow next.

7.2 FINAL PROCEDURE

We are going to show the training results of each model and their gridsearch performance on 15 different datasets (DS_CH, DS_ZB5, DS_ZB50, DS_ZB108, DS_B10, DS_B15, DS_B27, DS_B54, DS_ZB10, DS_ZB15, DS_ZB27, DS_ZB54, DS_S10, DS_S15, DS_S27). For simplicity we are only showing the F1-score and accuracy performance, presented through heatmaps. These were made with the results of the cross-validation process, where each cell represents one combination of hyperparameters on one corresponding dataset (the y-axis lists the different datasets used and the x-axis the index of each combination of hyperparameters, i.e. the iterations). Black represents a performance of 1 (darker colors mean higher performance) and white represents a performance of 0 (lighter colors mean worse performance); fuchsia represents a performance of approximately 0.5. The heatmaps are a way of understanding the behavior of the gridsearch visually, but the exact selection of the hyperparameter combinations is made in the next section.
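As a hint of how such heatmaps can be produced, the following is a minimal matplotlib sketch; the scores matrix and the dataset list are hypothetical placeholders, and a greyscale map only approximates (it does not reproduce) the black = 1, white = 0, fuchsia = 0.5 convention of the figures below.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical (datasets x combinations) matrix of cross-validated F1-scores.
datasets = ["DS_CH", "DS_ZB108", "DS_B54", "DS_S27"]   # illustrative subset
scores = np.random.rand(len(datasets), 800)

fig, ax = plt.subplots(figsize=(10, 2.5))
im = ax.imshow(scores, aspect="auto", cmap="Greys", vmin=0, vmax=1)
ax.set_yticks(range(len(datasets)))
ax.set_yticklabels(datasets)
ax.set_xlabel("Combinations")
ax.set_ylabel("Datasets")
fig.colorbar(im, label="F1-score")
plt.show()
```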


Table 3. One-Class Learning algorithms configurations

One-Class SVM (800 iterations):
  Gamma: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
  Degree: [1, 2, 3, 4]
  Nu: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
  Kernel: ['poly', 'rbf']

Isolation Forest (798 iterations):
  N_estimators: [1, 5, 10, 20, 50, 70, 100, 120, 150, 180, 200, 220, 230, 250, 280, 300, 350, 400, 500]
  Max_samples: [1, 5, 10, 50, 70, 100, 150, 180, 200, 230, 250, 280, 300, 1500]
  Max_features: [0.5, 0.8, 1]

RNN (80 iterations):
  Layer_div: [2, 3, 4, 5]
  Thresh: [0.1, 0.2, 0.3, 0.4, 0.5]
  Epoch: [50, 100, 150, 200]

Elliptic Envelope (30 iterations):
  Support_fraction: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.1, 1.2, 1.3, 1.4, 1.5]
  Assume_centered: [True, False]

LOF (48 iterations):
  N_neighbors: [1, 5, 10, 20, 40, 50, 80, 100, 200, 350, 400, 500, 750, 800, 900, 1000]
  Algorithm: ['ball_tree', 'kd_tree', 'brute']

The different configurations of parameters are shown above, each one representing one iteration of the gridsearch process. Notice that the incremental steps differ between parameters because each one has a different range. Every iteration of every algorithm was saved in an Excel sheet, letting us easily look up the iteration number corresponding to each combination of hyperparameters.


Figure 17. Heatmap for RNN F1_score

Figure 18. Heatmap for RNN accuracy

In figure 17 we can note that the RNN F1-score on the DS_ZB108 dataset was consistently high (approximately 0.75) for every combination of parameters, while very low F1-scores were obtained on DS_CH. We were expecting similar behavior in the accuracy heatmap, but in figure 18 we observe that every combination performed approximately the same across all datasets, with the best accuracy obtained on the DS_CH dataset for the first combinations and the last ones.

Figure 19. Heatmap for OSVM f1-Score

Figure 20. Heatmap for OSVM accuracy


In both OSVM heatmaps we can observe a repeating pattern, going from low F1-score and accuracy to higher performance roughly every 17 combinations. When looking up the corresponding hyperparameter combinations for these iterations (in the Excel spreadsheet), we realize that the fraction of support vectors ('nu') plays an important role in the algorithm: when it goes above 0.8, both metrics can drop all the way to 0. Therefore, we keep this hyperparameter below 0.8, which can drastically change the behavior of this algorithm.
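This behavior matches the meaning of nu in the one-class SVM, where it upper-bounds the fraction of training samples treated as margin errors. A small hedged experiment (on hypothetical random features) makes the effect visible:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# nu upper-bounds the fraction of training samples rejected as errors,
# so a large nu discards most of the (only) positive class and drives
# sensitivity and F1-score toward zero.
X = np.random.rand(500, 10)      # hypothetical positive-class features
for nu in (0.1, 0.5, 0.9):
    pred = OneClassSVM(kernel="rbf", nu=nu, gamma=0.5).fit(X).predict(X)
    print(f"nu={nu}: {np.mean(pred == -1):.2f} of the training set rejected")
```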

Figure 21. Heatmap for LOF F1_score


Figure 22. Heatmap for LOF accuracy

From both LOF heatmaps there is a noticeable observation: almost all the combinations on every single dataset behave the same or vary very little, with an F1-score of approximately 0.7 and an accuracy of 0.5, leading us to believe that this algorithm tends to overfit or to classify the data essentially at random.

Figure 23. Heatmap for ISO F1-Score


Figure 24. Heatmap for ISO accuracy

From figure 23 we can also observe a pattern-like behavior in the heatmap, repeating approximately every 120 combinations, in which the F1-score drops drastically on some of the datasets; this happens when the number of max samples is 1, hence the number of max samples for these datasets must be higher (see the sketch below). On the other hand, with DS_S27 we see another pattern corresponding to the combination of max features and max samples: the F1-score is low when max features lies outside the range between 0.4 and 0.7 and max samples is either very small or close to 300. For the accuracy, we can see that almost all results were above 60%.
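The max_samples = 1 failure mode has a simple explanation: each isolation tree is then grown from a single sample and cannot rank points by isolation depth. A small hedged check (on hypothetical data) shows the anomaly scores collapsing:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# With max_samples=1 every tree is a single leaf, so all samples share the
# same path length and the scores carry no information; with a reasonable
# subsample size the scores spread out again.
X = np.random.rand(1000, 54)     # hypothetical interaction features
for max_samples in (1, 256):
    iso = IsolationForest(n_estimators=100, max_samples=max_samples,
                          random_state=0).fit(X)
    print(f"max_samples={max_samples}: score std = "
          f"{iso.score_samples(X).std():.4f}")
```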

Figure 25. Heatmap for ELL F1-Score


Figure 26. Heatmap for ELL accuracy

From the last two heatmaps, for the Elliptic Envelope algorithm, we can observe that the best performance on both metrics was obtained with the DS_CH dataset, reaching accuracies and F1-scores above 0.8. On the other datasets accuracy did not do as well, except on DS_ZB50 and DS_ZB108, where the best results came from the hyperparameter combinations (assumed centered = "FALSE", support fraction = 0.1) and (assumed centered = "TRUE", support fraction = 0.6), respectively.

As we said earlier, the heatmap process alone is not enough to choose suitable hyperparameters for each algorithm, since we have an imbalanced dataset and some of the previous metrics are not very representative on their own. Hence, we next apply Pareto fronts to optimize the hyperparameters over the seven metrics used in the project and find a balance between optimal performance and a single combination of hyperparameters for the final training/test phase. Below, the Pareto fronts of each pair of metrics (the red line) are shown for the One-Class SVM trained on the chemical composition dataset (CH); even though we are only showing one algorithm's fronts, the process was applied to each algorithm to determine which combination of parameters suits each dataset best. For visualization purposes the following fronts are shown in two dimensions, meaning that each metric is paired with every other metric to calculate the front of each pair; in practice, however, the fronts are calculated in seven dimensions (as there are seven metrics) and the best combination of parameters (if there is more than one on a front) is chosen by the highest accuracy. The blue dots represent the individual combinations of hyperparameters.


Figure 27. Pareto fronts of the OC-SVM (from top-down and left-right: accuracy, PPV, NPV and F1-score)


Figure 28. Pareto fronts of the OC-SVM (from top-down: accuracy, PPV, NPV and F1-score and left to right: train sensitivity, test sensitivity and specificity)

Figure 29. Pareto fronts of the OC-SVM (from top-down: train sensitivity, test sensitivity and specificity and left to right: accuracy, PPV, NPV and F1-score)


Figure 30. Pareto fronts of the OC-SVM (from top-down and left to right: train sensitivity, test sensitivity and specificity)

We can observe that in figure 27, in the PPV vs. NPV subplot, we obtained a Pareto front of approximately nine points; each of these points is Pareto-efficient for the mentioned metrics because it has a higher PPV without being worse in NPV, and vice versa. In this case a choosing criterion is needed, such as picking the point with the highest NPV or the highest PPV (as we can only choose one combination of hyperparameters from the Pareto front to train each algorithm). Furthermore, in the NPV vs. F1-score subplot we can observe that only one point belongs to the Pareto front, meaning that only that point is Pareto-efficient for those metrics and no choosing criterion is needed. In practice the Pareto fronts are seven-dimensional and cannot be observed graphically, but the process is the same as in 2D: given each metric's results on each dataset, the Pareto front over the seven metrics is constituted by the points that have the highest result in one or more metrics without being worse in the rest. Then, if the front yields only one point, that point is the final combination of hyperparameters for the given algorithm on the given dataset; if there is more than one point, the one with the highest accuracy is selected.
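A minimal sketch of this selection rule, assuming the gridsearch metrics have been collected into a matrix (one row per hyperparameter combination, one column per metric, higher being better; the data below are hypothetical):

```python
import numpy as np

def pareto_front(points):
    """Return the row indices of the non-dominated points (maximization).

    A point is Pareto-efficient if no other point is at least as good on
    every metric and strictly better on at least one.
    """
    efficient = np.ones(points.shape[0], dtype=bool)
    for i in range(points.shape[0]):
        dominated = np.all(points >= points[i], axis=1) & \
                    np.any(points > points[i], axis=1)
        if dominated.any():
            efficient[i] = False
    return np.flatnonzero(efficient)

# Hypothetical (combinations x 7 metrics) matrix; column 0 is accuracy.
metrics = np.random.rand(800, 7)
front = pareto_front(metrics)
best = front[np.argmax(metrics[front, 0])]   # tie-break by highest accuracy
print(f"{len(front)} Pareto-efficient combinations; chosen iteration: {best}")
```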


For example, with the Elliptic Envelope algorithm and the DS_ZB108 dataset, the best combination of hyperparameters according to the heatmap process was "TRUE" for "assumed centered" and 0.6 for the "support fraction". But after applying Pareto fronts to the algorithm, we found that the optimal combination, taking into account all seven metrics rather than only F1-score and accuracy (as we did with the heatmaps), was "FALSE" and 0.6 respectively. We also kept in mind that the Pareto front process can return more than one combination of hyperparameters, in which case the one with the highest accuracy is selected. This decision was made because the dataset used for the validation phase is equally distributed (equal positive and negative samples, even though the negative ones were assumed), making accuracy a suitable metric. The hyperparameters used for each algorithm, found through the gridsearch process and refined by Pareto fronts, are shown in the next table; these hyperparameters correspond to the DS_CH dataset. Later, the hyperparameters for other datasets, obtained with the same process, are shown as well.

Table 4. Hyperparameters final configuration for DS_CH

One-Class SVM: Kernel = "poly", Degree = 3, Nu = 0.1, Gamma = 1
Isolation Forest: N_estimators = 1, Max_samples = 1500, Max_features = 0.3
Elliptic Envelope: Assume_centered = True, Support_fraction = 1.5
LOF: N_neighbors = 100, Algorithm = "brute"
RNN: Layer_division = 3, Threshold = 0.1, Epochs = 100

Having selected the optimal hyperparameters for each algorithm, the results are shown in the next tables for the chemical composition dataset, which was the dataset that presented the best average results.
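For reference, a hedged sketch of how these DS_CH configurations map onto the scikit-learn estimators (the RNN is a custom network and is omitted; Table 4's support fraction of 1.5 lies outside scikit-learn's accepted range, which suggests a custom or rescaled implementation, so the library default is used in this sketch):

```python
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope
from sklearn.neighbors import LocalOutlierFactor

# Table 4 configurations for DS_CH (sketch; see caveats in the text above).
models = {
    "OC-SVM": OneClassSVM(kernel="poly", degree=3, nu=0.1, gamma=1),
    "iForest": IsolationForest(n_estimators=1, max_samples=1500,
                               max_features=0.3),
    # support_fraction left at its default instead of the out-of-range 1.5
    "ELLEnv": EllipticEnvelope(assume_centered=True),
    # novelty=True lets LOF predict on unseen samples after fitting
    "LOF": LocalOutlierFactor(n_neighbors=100, algorithm="brute",
                              novelty=True),
}
```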


The highest results for each metric are highlighted in bold font, and the algorithms in bold font are the three best ones for this dataset. Next come the ensemble learning results with all five algorithms, and then the ensemble learning results using only the three best algorithms. Scores in bold font in tables 6 and 7 are the scores that improve on those of the individual algorithms.

Table 5. Results for Chemical Composition dataset

Algorithm Accuracy F1-Score Sensitivity Specificity PPV NPV

ELLEnv 79.7% 80.7% 84.9% 74.4% 76.9% 83.1%

iForest 68.0% 73.3% 87.4% 48.5% 63.0% 79.3%

RNN 75.6% 78.1% 86.8% 64.2% 70.9% 69.3%

LOF 53.9% 67.0% 93.8% 14.1% 52.2% 69.3%

OC-SVM 72.2% 73.7% 77.6% 66.9% 70.2% 74.8%

Table 6. Results for ensemble learning with all algorithms

Accuracy F1-Score Sensitivity Specificity PPV NPV

72.7% 76.7% 89.6% 55.8% 67.1% 84.2%

Table 7. Results for ensemble learning with the three-best algorithms (ELLEnv, RNN and OC-SVM)

Accuracy F1-Score Sensitivity Specificity PPV NPV

82.3% 83.2% 87.5% 77.1% 79.3% 86.1%

From the above results we can observe that using all the algorithms for the ensemble learning only one metric was improved, but using the three best algorithms all metrics were improved, showing the best results for predictive purposes.
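The exact combination rule is not restated here; as one plausible reading of the ensemble-learning paradigm described earlier, a simple majority vote over the individual one-class predictions can be sketched as follows:

```python
import numpy as np

def majority_vote(predictions):
    """Combine one-class predictions (+1 / -1) from several models.

    `predictions` is a (n_models, n_samples) array; the ensemble outputs
    +1 when more than half of the models predict a positive interaction.
    """
    votes = np.sum(np.asarray(predictions) > 0, axis=0)
    return np.where(2 * votes > len(predictions), 1, -1)

# e.g. with the five (or three best) fitted models sketched above:
# y_ens = majority_vote([m.predict(X_test) for m in models.values()])
```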


Also, we can observe that overall the algorithms show better sensitivity than specificity. This is expected because of the way the One-Class Learning algorithms are trained, only with the positive class, meaning that they are more likely to detect positive interactions than negative ones.

Even though they are trained only with the positive class, the NPV values are often higher than the PPV, meaning the algorithms are more likely to be right when they predict a negative interaction than when they predict a positive one. This is also expected: since sensitivity is high, the algorithms rarely label a true positive as negative, so the interactions they do label as negative tend to be correct.

Next, tables 8 and 9 show the configuration and results for DS_ZB108, which has the same number of features as DS_CH but is built from 108 bins that take the zero score value of the PPIs into account.

Table 8. Hyperparameters final configuration for DS_ZB108

One-Class SVM: Kernel = "poly", Degree = 4, Nu = 0.2, Gamma = 1
Isolation Forest: N_estimators = 150, Max_samples = 1500, Max_features = 1
Elliptic Envelope: Assume_centered = False, Support_fraction = 0.6
LOF: N_neighbors = 100, Algorithm = "brute"
RNN: Layer_division = 5, Threshold = 0.1, Epochs = 50


Table 9. Results for DS_ZB108 dataset

Algorithm Accuracy F1-Score Sensitivity Specificity PPV NPV

ELLEnv 56.8% 68.1% 92.3% 21.3% 54.0% 73.5%

iForest 59.6% 69.1% 90.6% 28.5% 55.9% 75.2%

RNN 62.8% 68.1% 79.3% 46.3% 59.6% 69.1%

LOF 51.4% 65.2% 91.3% 11.4% 50.8% 56.8%

OC-SVM 54.7% 60.2% 68.6% 40.8% 53.7% 56.5%

Table 10. Results for ensemble learning with all algorithms

Accuracy F1-Score Sensitivity Specificity PPV NPV

56.9% 69.1% 96.5% 17.2% 53.8% 83.2%

Table 11. Results for ensemble learning with the three-best algorithms (ELLEnv, iForest and RNN)

Accuracy F1-Score Sensitivity Specificity PPV NPV

60.5% 69.5% 90.0% 31.0% 56.6% 75.6%

These latest results were not as good as the previous ones, even though we can observe the same sensitivity/specificity and PPV/NPV behavior. Also, the ensemble with all algorithms was biased by the very low specificity of the LOF algorithm, which tends to overfit toward the positive class, and the three-best ensemble showed an improvement of only about 0.4%, which is marginal. Even though both datasets have the same number of features, the results vary drastically because of the features themselves: DS_CH takes into account different chemical composition scores as well as physical ones, while DS_ZB108 only takes into account the overall PPI scores and divides them into 108 bins.


Next, for comparison purposes, we try two other datasets: one that does not consider PPIs with score zero and is separated by number of bins (DS_B54), and another separated by bin size, also without taking the zero score into account (DS_S27).

Table 12. Hyperparameters final configuration for DS_B54

One-Class SVM: Kernel = "poly", Degree = 3, Nu = 0.1, Gamma = 1
Isolation Forest: N_estimators = 50, Max_samples = 1500, Max_features = 1
Elliptic Envelope: Assume_centered = False, Support_fraction = 0.9
LOF: N_neighbors = 500, Algorithm = "kd_tree"
RNN: Layer_division = 2, Threshold = 0.1, Epochs = 50

Table 13. Hyperparameters final configuration for DS_S27

One-Class SVM: Kernel = "poly", Degree = 3, Nu = 0.4, Gamma = 0.4
Isolation Forest: N_estimators = 1, Max_samples = 1500, Max_features = 0.8
Elliptic Envelope: Assume_centered = True, Support_fraction = 0.7
LOF: N_neighbors = 400, Algorithm = "brute"
RNN: Layer_division = 2, Threshold = 0.3, Epochs = 50


Table 14. Results for DS_B54 dataset

Algorithm Accuracy F1-Score Sensitivity Specificity PPV NPV

ELLEnv 54.9% 67.0% 91.5% 18.4% 52.8% 68.3%

iForest 60.1% 69.4% 90.4% 29.8% 56.3% 75.7%

RNN 59.0% 65.0% 76.4% 41.5% 56.7% 63.8%

LOF 56.4% 68.9% 96.4% 16.5% 53.6% 82.0%

OC-SVM 56.9% 64.0% 76.7% 37.0% 54.9% 61.4%

Table 15. Results for ensemble learning with all algorithms for DS_B54

Accuracy F1-Score Sensitivity Specificity PPV NPV

58.1% 69.3% 94.9% 21.3% 54.7% 80.8%

Table 16. Results for ensemble learning with the three-best algorithms for DS_B54 (iForest, RNN and OC-SVM)

Accuracy F1-Score Sensitivity Specificity PPV NPV

55.4% 67.3% 91.8% 19.0% 53.1% 69.7%

Table 17. Results for DS_S27 dataset

Algorithm Accuracy F1-Score Sensitivity Specificity PPV NPV

ELLEnv 54.6% 66.9% 91.9% 17.4% 52.7% 68.2%

iForest 54.3% 67.1% 93.1% 15.6% 52.4% 69.2%

RNN 54.5% 65.2% 85.4% 23.6% 52.8% 61.7%

LOF 55.0% 67.8% 95.0% 15.1% 52.8% 74.8%

OC-SVM 50.0% 50.1% 51.5% 48.0% 49.7% 49.7%


Table 18. Results for ensemble learning with all algorithms for DS_S27

Accuracy F1-Score Sensitivity Specificity PPV NPV

52.6% 66.5% 93.9% 11.3% 51.4% 65.0%

Table 19. Results for ensemble learning with the three-best algorithms for DS_S27 (ELLEnv, RNN and LOF)

Accuracy F1-Score Sensitivity Specificity PPV NPV

54.4% 67.2% 93.3% 15.5% 52.5% 69.9%

For these last two datasets the results were not as good as before, showing accuracy scores below 60% overall, meaning that the algorithms are not capable of classifying the interactions in these datasets very well. This may be due to the number of features used, as both datasets have far fewer features than DS_CH and DS_ZB108. We can also observe that neither dataset improved with the ensemble of all algorithms nor with the ensemble of the three best, letting us infer that the individual results are overfitted to the positive class, as commonly happens given the structure of the One-Class Learning techniques.

These cases in which the results were worse than expected occurred on some of the datasets, and the results also varied depending on the chosen optimal hyperparameters, meaning these results are not absolute: they can be improved, but they can also worsen.

Also, for a better understanding of the results, note that almost all sensitivity scores are very high (90% or more). As said before, due to their training process the algorithms are very likely to correctly identify positive interactions, but they also tend to overfit, producing specificity below 50%, meaning that negative interactions are misclassified as positive ones. Interactions classified as negative, however, tend to be classified correctly, judging by the relatively high NPV scores: samples predicted as negative are likely to be truly negative interactions.


8 CONCLUSIONS

First, we would like to mention that the main contribution of this work was achieved by successfully investigating, implementing and applying one-class learning algorithms to phage-bacteria interactions using their omics features, a methodology that had not been implemented before with the tools and techniques used here. One aspect of the project worth highlighting for its novelty is the refinement of the methodology through several techniques: gridsearch to enumerate the combinations of hyperparameters, Pareto efficiency to find the optimal combination for each algorithm and each dataset, and ensemble learning of two types (all algorithms and the three best) to try to improve the individual results of each algorithm.

Throughout this project we can conclude that in-silico automation of several processes is nowadays a reality, since some of these procedures rely on human intervention that can take several days or more; for this, machine learning techniques are the state of the art in this type of automation. We showed that One-Class Learning algorithms are a very successful approach for dealing with imbalanced datasets, specifically our phage-bacteria interaction datasets, showing very promising results (i.e. high overall performance metrics using the ensemble learning approach) and leading the way to more types of applications.

The general objective of the project, as well as the specific ones, was fully covered: first by implementing the machine learning algorithms and testing them on dummy datasets; then by pre-processing the final datasets, using them to train the algorithms only with the positive class, and evaluating them in order to predict phage-bacteria interactions, aiming to automate this time-consuming process.

The correct selection of the dataset (number of features, types of features, number of samples) is a critical step in this type of process, as we saw that a slight change in the dataset can alter the results of each algorithm, as evidenced by DS_CH, which obtained very high results in contrast to DS_S27. From this, we can say that the best type of dataset for this problem is one based on the physicochemical composition of the interactions.


We want to highlight the performance of the Replicator Neural Network, which was the best of the five algorithms we tried, showing very promising results in each test as well as always being among the three best algorithms in every three-best ensemble learning approach.


9 FUTURE WORK

Foremost, the next step will be to use datasets of real negative interactions tested in labs in order to rerun the previous process and validate the results. The process can be used as a filter, to confirm real negative interactions, or as a predictor of both negative and positive interactions. Another direction is trying other One-Class Learning algorithms on the same datasets to observe differences in performance and simplicity. Another important task is analyzing the structure and behavior of the datasets (distribution, prevalence, density) to know which types of algorithms are most suitable and how to refine them better. Last but not least, a more efficient and reliable way to select the appropriate hyperparameters should be implemented, because the complexity of the Pareto fronts and the final selection based on accuracy might affect the overall performance.
