# 3. Supervised Learning


Machine learning problems can generally be divided into three categories according to their training data: supervised, semi-supervised, and unsupervised learning. In a supervised learning problem, the training data come with ground truth, i.e., the target we expect the model to produce. In an unsupervised learning problem, the data have no corresponding labels or targets, and it is the model's objective to discover the hidden relationships in the dataset. Semi-supervised learning sits between the previous two: a portion of the data is labelled but the majority is not, and the model is expected to learn features from the labelled data and connect those features to the unlabelled data. In this section, we focus on supervised learning: we describe its problem formulation and typical training process, and introduce an evaluation method called k-fold cross-validation.

In terms of the target type, supervised learning can be further divided into classification and regression problems. For example, in a supervised image classification problem, each sample is an image and its target is a label that can be encoded as a discrete number; in the dog-and-cat classification example, we use $0$ for dog and $1$ for cat. In a supervised regression problem, such as house price prediction, each sample could be a vector of different properties of a house, and the target is the actual price, which is a continuous number.
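The two target types above can be illustrated with a toy sketch (the dataset values below are illustrative, not from the text):

```python
# Classification: each sample's target is a discrete class code.
label_encoding = {"dog": 0, "cat": 1}
labels = [label_encoding[name] for name in ["dog", "cat", "cat"]]
assert labels == [0, 1, 1]

# Regression: each sample is a feature vector, its target a continuous value.
house_features = [120.0, 3, 1995]  # e.g. area (m^2), bedrooms, year built
house_price = 284_500.0            # continuous target, not a class code
```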

### 3.1 Problem Formulation

We start with the notation used to describe the dataset. The dataset is a collection of many samples and their targets. We use $x_i$ and $y_i$ to represent the $i$-th sample and its target, and use $X$ and $Y$ for the whole set of samples and targets. The dataset consists of both, i.e., $\mathcal{D} = (X, Y)$. In supervised learning, we usually use cross-validation to evaluate the performance of training. Cross-validation splits the dataset into a training set and a validation set, which we denote as $\mathcal{D}_{train}$ and $\mathcal{D}_{val}$, respectively. The model is first trained on $\mathcal{D}_{train}$ and then tested on $\mathcal{D}_{val}$. According to the model's performance on $\mathcal{D}_{val}$, we can decide whether or not to modify the model. In practice, both training and validating are part of the model development process. After development is finished, the model is tested on a testing set $\mathcal{D}_{test}$, which will likely be real-world data. Therefore, to better estimate the model's performance on the testing set, we should select the validation set so that it has a distribution similar to that of the testing set. However, most of the time we cannot predict the distribution of $\mathcal{D}_{test}$, so we simply select a random portion of the whole dataset as $\mathcal{D}_{val}$.
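The random split described above can be sketched as follows (a minimal Python sketch; the function name and the 20% validation fraction are illustrative choices, not from the text):

```python
import random

def train_val_split(X, Y, val_fraction=0.2, seed=0):
    """Randomly hold out a fraction of (X, Y) as the validation set;
    the remaining samples become the training set."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_val = int(len(X) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    X_train = [X[i] for i in train_idx]
    Y_train = [Y[i] for i in train_idx]
    X_val = [X[i] for i in val_idx]
    Y_val = [Y[i] for i in val_idx]
    return (X_train, Y_train), (X_val, Y_val)
```

Fixing the random seed makes the split reproducible, so repeated validation runs compare models on the same held-out data.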

### 3.2 Cross-Validation

To mitigate the gap between the validation and testing datasets, an alternative to a single random split is k-fold cross-validation, which divides the dataset into $k$ folds. As shown in Fig. 1, the dataset is divided into four folds. In the first training, shown in the first row, the model is trained on the second to fourth folds and tested on the first fold, and its performance on the first fold is recorded as $p_1$. Likewise, we perform the next three trainings and testings on the corresponding folds, obtaining $p_2$, $p_3$, and $p_4$. The final performance of the model is recorded as the average of the $p_i$, i.e., $p = \frac{1}{k}\sum_{i=1}^{k} p_i$. K-fold cross-validation reflects the performance of the model more reliably but also requires longer training time, although the $k$ trainings can be performed in parallel to save time.
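The procedure above can be sketched as follows (a minimal Python sketch; `train_and_score` is a hypothetical placeholder for the user's own training routine, which returns a performance score $p_i$ for each round):

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    fold_size = n_samples // k
    folds = [idx[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    # Distribute any leftover samples across the first folds.
    for j, extra in enumerate(idx[k * fold_size:]):
        folds[j].append(extra)
    return folds

def k_fold_cross_validate(samples, targets, k, train_and_score):
    """Train k times; in round i, fold i is held out for validation
    and the other k-1 folds form the training set."""
    folds = k_fold_indices(len(samples), k)
    scores = []
    for i in range(k):
        val_idx = set(folds[i])
        train_idx = [j for j in range(len(samples)) if j not in val_idx]
        p_i = train_and_score(
            [samples[j] for j in train_idx], [targets[j] for j in train_idx],
            [samples[j] for j in folds[i]], [targets[j] for j in folds[i]])
        scores.append(p_i)
    return sum(scores) / k  # final performance: average of p_1 .. p_k
```

Because each of the $k$ rounds is independent, the loop body can be dispatched to separate workers, which is the parallelization opportunity mentioned above.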

### 3.3 Overfitting & Underfitting

Two of the most well-known issues in the machine learning community are overfitting and underfitting. Before giving their formal definitions, we first discuss bias error and variance error. Bias error is the systematic difference between targets and predictions, while variance error measures how much the error changes when there is a small change in the targets. We define the total error $E_{total}$ as the sum of the bias error $E_{bias}$ and the variance error $E_{var}$, see Eq. 1:

$$E_{total} = E_{bias} + E_{var} \qquad (1)$$

As shown in Fig. 2, the columns are low- and high-variance examples, and the rows are low- and high-bias examples. The red dot in the middle is the target the model tries to reach, and the purple dots are predictions made by the model. In the top left of Fig. 2, the model is optimal, as it has both low variance and low bias: the error between the current target and the predictions is small, and even if the target shifts a little, the error will not increase dramatically. For the high-variance, low-bias example in the top right of Fig. 2, the error between the target and the predictions is small, since the mean of the predictions is close to the target, but the error will grow substantially if the target moves even slightly. On the contrary, the high-bias, low-variance example in the bottom left does not achieve low error, because the predictions are off-target; it could achieve lower error if the target moved toward the predictions, but this rarely happens in practice because of the high dimensionality of real-world problems. The worst case is in the bottom right of Fig. 2, which has both high variance and high bias.
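The decomposition in Eq. 1 can be illustrated numerically. The sketch below (a Python illustration under assumed definitions, with a hypothetical noisy estimator) takes the squared bias of the mean prediction as the bias error and the spread of the predictions as the variance error; the mean squared error then splits exactly into these two terms:

```python
import random

def bias_variance_of_estimator(true_value, estimate_fn, n_trials=10000, seed=0):
    """Monte-Carlo sketch: repeatedly form a noisy estimate (the 'prediction')
    and decompose its mean squared error into bias^2 + variance,
    mirroring E_total = E_bias + E_var in Eq. 1."""
    rng = random.Random(seed)
    estimates = [estimate_fn(rng) for _ in range(n_trials)]
    mean_est = sum(estimates) / n_trials
    bias_sq = (mean_est - true_value) ** 2
    variance = sum((e - mean_est) ** 2 for e in estimates) / n_trials
    mse = sum((e - true_value) ** 2 for e in estimates) / n_trials
    return bias_sq, variance, mse

# Hypothetical estimator: target 1.0, with a systematic offset of 0.5
# (the bias source) plus Gaussian noise (the variance source).
bias_sq, var, mse = bias_variance_of_estimator(
    1.0, lambda rng: 1.5 + rng.gauss(0.0, 0.1))
assert abs(mse - (bias_sq + var)) < 1e-9  # MSE decomposes exactly
```

The high-bias, low-variance estimator above corresponds to the bottom-left panel of Fig. 2: its predictions cluster tightly but around the wrong point.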