Machine learning problems can generally be divided into three categories according to their training data: supervised, semi-supervised and unsupervised learning. In a supervised learning problem, the training data come with ground truth, which is the target we expect the model to produce. In an unsupervised learning problem, the data do not have corresponding labels or targets, and it is the model’s objective to discover the hidden relationships in the dataset. Semi-supervised learning sits between the previous two: a portion of the data is labelled but the majority is not. The model is supposed to discover the hidden features using the labelled data and connect those features with the unlabelled data. In this section, we will focus on supervised learning by describing its problem formulation and regular training process, and by introducing an evaluation method called k-fold cross validation.
In terms of the target type, supervised learning can be further divided into classification and regression problems. For example, in a supervised image classification problem, each sample is an image and its target is its label, which can be encoded as a discrete number. In the dog and cat classification example, we use one integer for dog and another for cat, for example 0 and 1. For a supervised regression problem, like house price prediction, each sample could be a vector of different properties of the house and the target is the real price, which is a continuous number.
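The difference between the two target types can be sketched in a few lines. The class-to-integer mapping and the house features below are assumptions made purely for illustration; which integer is assigned to which class is arbitrary.

```python
# Assumed encoding for illustration: the integer given to each class is arbitrary.
CLASS_TO_INDEX = {"dog": 0, "cat": 1}


def encode_labels(labels):
    """Map class names to the discrete numbers used as classification targets."""
    return [CLASS_TO_INDEX[name] for name in labels]


# Classification: every target is one of a finite set of integers.
classification_targets = encode_labels(["dog", "cat", "cat", "dog"])

# Regression: the target is a continuous number.
# Hypothetical sample: (area in square metres, bedrooms, age in years).
house_features = [120.0, 3.0, 15.0]
house_price = 350000.0

print(classification_targets)
print(house_features, house_price)
```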
3.1 Problem Formulation
We will start with the notation used to describe the dataset. Specifically, the dataset is a collection of many samples and their targets. We use $x_i$ and $y_i$ to represent the $i$-th sample and its target in the dataset, and use $X$ and $Y$ for the whole set of samples and targets. The dataset $D$ consists of $X$ and $Y$, i.e., $D = \{X, Y\}$. In supervised learning, we usually use cross-validation to evaluate the trained model. Cross-validation splits the dataset into a training set and a validation set, which we denote as $D_{train}$ and $D_{val}$, respectively. Our model will first be trained on $D_{train}$ and then tested on $D_{val}$. According to the model’s performance on $D_{val}$, we can decide whether or not to modify the model. In practice, both training and validating are part of the model development process. After the model has been developed, it will be tested on the testing set $D_{test}$, which will likely be real-world data. Therefore, to better estimate the model’s performance on the testing set, we should select the validation set such that it has a similar distribution to the testing set. However, most of the time we cannot predict the distribution of $D_{test}$, so we just randomly select a portion of the whole dataset as $D_{val}$.
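The random split described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `train_val_split`, the 20% validation fraction and the toy dataset are assumptions, not part of the original text.

```python
import random


def train_val_split(samples, targets, val_fraction=0.2, seed=0):
    """Randomly split paired samples and targets into training and validation sets."""
    if len(samples) != len(targets):
        raise ValueError("samples and targets must have the same length")
    indices = list(range(len(samples)))
    # Shuffle the indices rather than the data so each sample stays paired with its target.
    random.Random(seed).shuffle(indices)
    n_val = int(len(indices) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    train = ([samples[i] for i in train_idx], [targets[i] for i in train_idx])
    val = ([samples[i] for i in val_idx], [targets[i] for i in val_idx])
    return train, val


# Toy dataset (hypothetical): 10 samples with one feature each; the target is twice the feature.
X = [[float(i)] for i in range(10)]
Y = [2.0 * i for i in range(10)]

(train_X, train_Y), (val_X, val_Y) = train_val_split(X, Y, val_fraction=0.2)
print(len(train_X), len(val_X))
```

Fixing the seed makes the split reproducible, which helps when comparing different models on the same validation set.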
3.2 Cross Validation
To mitigate the gap between the validation and testing datasets, an alternative to a single random split is k-fold cross validation, which divides the dataset into $k$ folds. As shown in Fig.1, the dataset is divided into four folds. In the first training, shown in the first row, the model is trained on the second to the fourth folds and tested on the first fold. Its performance on the first fold is then recorded as $E_1$. Likewise, we perform the next three trainings and testings on the corresponding folds. The final performance of the model is recorded as the average of $E_1, \dots, E_4$, i.e., $E = \frac{1}{4}\sum_{i=1}^{4} E_i$. K-fold cross validation reflects the performance of the model more generally but requires a longer training time, although the trainings can be performed in parallel to save time.
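The k-fold procedure can be sketched as below, assuming four folds as in Fig.1. The helper names, the toy dataset and the stand-in model (which always predicts the mean training target) are hypothetical and chosen only to keep the example self-contained; any model with a fit step and a scoring step could be plugged in.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold j takes every k-th shuffled index starting at j, so sizes differ by at most one.
    return [indices[j::k] for j in range(k)]


def cross_validate(samples, targets, k, fit, score, seed=0):
    """Train on k-1 folds, evaluate on the held-out fold, and average the k scores."""
    folds = k_fold_indices(len(samples), k, seed)
    scores = []
    for j, val_idx in enumerate(folds):
        train_idx = [i for m, fold in enumerate(folds) if m != j for i in fold]
        model = fit([samples[i] for i in train_idx], [targets[i] for i in train_idx])
        scores.append(score(model, [samples[i] for i in val_idx], [targets[i] for i in val_idx]))
    return scores, mean(scores)


# Hypothetical stand-in model: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return mean(train_y)


def mean_abs_error(model, val_x, val_y):
    return mean(abs(model - y) for y in val_y)


X = list(range(20))
Y = [2.0 * x + 1.0 for x in X]

fold_errors, avg_error = cross_validate(X, Y, k=4, fit=fit_mean, score=mean_abs_error)
print(fold_errors, avg_error)
```

Because each iteration of the loop is independent of the others, the k trainings could be dispatched to separate workers, which is the parallelism mentioned above.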
3.3 Overfitting & Underfitting
Two of the most well-known issues in the machine learning community are overfitting and underfitting. Before giving the formal definitions of overfitting and underfitting, we will first discuss bias and variance error. Bias error is the difference between targets and predictions, while variance error is the additional error that occurs when there is a small change in the targets. We define the total error ($E_{total}$) as the sum of the bias error ($E_{bias}$) and the variance error ($E_{variance}$); see Eq. 1.
$$E_{total} = E_{bias} + E_{variance} \qquad (1)$$
As shown in Fig.2, the columns are low and high variance examples, and the rows are low and high bias examples. The red dot in the middle is the target the model tries to reach, and the purple dots are predictions made by the model. In the top left of Fig.2, the model is optimal as it has both low variance and low bias, i.e., the error between the current target and the predictions is small, and even if the target shifts a little, the error will not increase dramatically. In the high variance, low bias example shown on the top right of Fig.2, the error between target and predictions is small since the mean of the predictions is close to the target, but the error will grow a lot if the target moves a little. In contrast, the high bias, low variance example on the bottom left does not achieve low error because the predictions are off-target. It has the potential to achieve lower error if the target moves towards the north, but this rarely happens in practice because of the high dimensionality of real-world problems. The worst case is in the bottom right of Fig.2, which has both high variance and high bias.
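The four panels can be imitated numerically. The sketch below assumes a common convention that may differ from the exact construction of Fig.2: bias is measured as the squared distance between the mean prediction and the target, and variance as the mean squared spread of the predictions around their own mean. The offsets, spreads and sample count are illustrative assumptions. With these definitions the mean squared error to the target equals squared bias plus variance, which mirrors the additive form of Eq. 1.

```python
import random
from statistics import mean


def simulate_predictions(offset, spread, n=500, seed=0):
    """Draw n 2-D predictions scattered around `offset` with standard deviation `spread`."""
    rng = random.Random(seed)
    return [(offset[0] + rng.gauss(0.0, spread), offset[1] + rng.gauss(0.0, spread))
            for _ in range(n)]


def decompose(predictions, target=(0.0, 0.0)):
    """Return (squared bias, variance, mean squared error to the target)."""
    cx = mean(p[0] for p in predictions)
    cy = mean(p[1] for p in predictions)
    bias_sq = (cx - target[0]) ** 2 + (cy - target[1]) ** 2
    variance = mean((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in predictions)
    total = mean((p[0] - target[0]) ** 2 + (p[1] - target[1]) ** 2 for p in predictions)
    return bias_sq, variance, total


# Assumed settings: (offset of the prediction cloud from the target, spread of the cloud).
panels = {
    "low bias, low variance": ((0.0, 0.0), 0.1),
    "low bias, high variance": ((0.0, 0.0), 1.0),
    "high bias, low variance": ((0.0, 2.0), 0.1),
    "high bias, high variance": ((0.0, 2.0), 1.0),
}

results = {name: decompose(simulate_predictions(offset, spread))
           for name, (offset, spread) in panels.items()}

for name, (bias_sq, variance, total) in results.items():
    print(f"{name}: bias^2={bias_sq:.3f}, variance={variance:.3f}, total={total:.3f}")
```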
To connect the idea of high variance and bias with the supervised learning problem, we can treat the bias as the error (loss) the model has during training, and the variance as the error that occurs when the training and testing datasets do not have identical distributions. In the machine learning community, we call the high-variance scenario overfitting and the high-bias scenario underfitting. Overfitting occurs when the model captures the noise in the dataset. Intuitively, the model fits the training data too well and therefore shows a gap between its training and testing performance. On the other hand, underfitting occurs when the model does not capture the hidden trend in the dataset. Intuitively, the model does not fit the training data well enough and therefore does not perform well on the testing set either. Overfitting usually corresponds to an over-complicated model, while underfitting is the result of an over-simplified model.
These two issues are demonstrated in Fig.3, where the numbers are the percentages of incorrectly classified samples. In Fig.3 (A), a simple straight boundary line is drawn which does not separate the two classes well, so the model does not perform well on either training or validation. In Fig.3 (C), even though the zigzag boundary separates the two classes perfectly, the accuracy will drop dramatically if the validation set is slightly different from the training set. The optimal boundary line is shown in Fig.3 (B), where the training error is not as low as in the overfitting case, but the boundary line is robust to changes in the dataset. Thus, it achieves similar performance on the validation set and outperforms its underfitting and overfitting counterparts.
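A similar comparison of training and validation error can be reproduced on synthetic data. The sketch below is not the model behind Fig.3: it swaps in a k-nearest-neighbour classifier as a stand-in, because its single parameter k controls how complicated the decision boundary is. The two overlapping Gaussian classes, the dataset sizes and the values of k are all assumptions. With k equal to the training set size the classifier always predicts the same class (over-simplified), while with k = 1 it memorises the training set (over-complicated).

```python
import random


def make_data(n_per_class, seed):
    """Two overlapping 2-D Gaussian classes centred at (-1, -1) and (1, 1)."""
    rng = random.Random(seed)
    data = []
    for label, center in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return data


def knn_predict(train, point, k):
    """Majority vote among the k training samples closest to `point`."""
    nearest = sorted(
        train,
        key=lambda item: (item[0][0] - point[0]) ** 2 + (item[0][1] - point[1]) ** 2,
    )[:k]
    votes_for_one = sum(label for _, label in nearest)
    return 1 if 2 * votes_for_one > k else 0  # ties go to class 0


def error_rate(train, data, k):
    """Fraction of samples in `data` misclassified by k-NN fitted on `train`."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in data)
    return wrong / len(data)


train = make_data(100, seed=0)
val = make_data(50, seed=1)

# From over-simplified (k = all training samples) to over-complicated (k = 1).
for k in (len(train), 15, 1):
    print(f"k={k}: train error={error_rate(train, train, k):.2f}, "
          f"validation error={error_rate(train, val, k):.2f}")
```

With k = 1 each training sample is its own nearest neighbour, so the training error is zero by construction; any error that remains on the validation set is the gap that signals overfitting.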
The overfitting problem can be mitigated by using a simpler model, so that the model tries to capture the useful hidden information in the dataset rather than the noise. An alternative solution is to collect more data to train the model. The intuition behind this is that as more data are used, the hidden features shared among the data become more dominant than the characteristics of the noise, and the model pays more attention to learning the dominant features. Other possible solutions include regularization, dropout layers and initialization. We will discuss them in detail in the following sections.
In terms of underfitting, since the model is not capable of learning the hidden features, we can increase its capacity by adding more layers or nodes to the model. Another possible cause of underfitting is limited training time, which can be solved by simply extending the training time. If data collection and long training are not feasible, we can also explore architectures that are more suitable for the specific underfitting problem. Last but not least, hyperparameter search can also mitigate the underfitting problem. We will discuss hyperparameter search in the following section as well. A summary of solutions to overfitting and underfitting is shown in Table.1.