Machine Learning / Model Complexity Graph

By Marcelo Fernandes, Dec 9, 2017

The model complexity graph compares the **training errors** and the
**cross-validation errors** in order to measure whether a certain model
overfits or underfits the dataset it has been exposed to.

In order to understand how the complexity graph works, we are going to build up an example that tries to create a model that best classifies a certain dataset.

So, let's start with the dataset. We selected a dataset with only two classes,
**positives** and **negatives**. We also split our data into training data
and testing data. It looks like this:
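As a concrete stand-in, we can generate a small two-class dataset and split it. The `make_moons` dataset and the 80/20 split below are assumptions for illustration, not the exact data shown in the figures:

```python
# Hypothetical two-class dataset (labels 0/1 stand in for negatives/positives);
# make_moons is an assumption, not the original data from the figures.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=100, noise=0.25, random_state=42)

# Hold out 20% of the points as testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)
```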

Let's suppose that we are trying to create a model that fits this data very well. We would start with a simple linear model, like this:

Well, that was pretty lame. If you look at the graph, you will see that we got:

- 10 Training errors.
- 6 Testing errors.

And it is pretty visible that we didn't get anywhere close to a good model.
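The linear attempt can be sketched with scikit-learn's `LogisticRegression`, which learns a straight decision boundary. The dataset here is a `make_moons` stand-in rather than the original data, so the error counts will differ from the figures:

```python
# Sketch of the linear model: a logistic regression learns a straight
# decision boundary between the two classes. The dataset is a stand-in.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=100, noise=0.25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression().fit(X_train, y_train)

# An "error" here is simply a misclassified point.
train_errors = (model.predict(X_train) != y_train).sum()
test_errors = (model.predict(X_test) != y_test).sum()
print(train_errors, test_errors)
```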

Let's give a next try with a **polynomial of degree 4**:

It looks way better now; our model seems to fit the data pretty well. We got:

- 2 Training errors.
- 0 Testing errors.
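One way to sketch the degree-4 model is a pipeline that expands the two input features into all polynomial terms up to degree 4 and then fits a linear classifier on top. As before, `make_moons` is a stand-in dataset, so the exact counts will differ:

```python
# Degree-4 polynomial model as a pipeline: polynomial feature expansion
# followed by a linear classifier. The dataset is a stand-in.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=100, noise=0.25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = make_pipeline(
    PolynomialFeatures(degree=4),       # x1, x2 -> all terms up to degree 4
    LogisticRegression(max_iter=1000),  # linear classifier on the expanded features
)
model.fit(X_train, y_train)

train_errors = (model.predict(X_train) != y_train).sum()
test_errors = (model.predict(X_test) != y_test).sum()
print(train_errors, test_errors)
```

Swapping `degree=4` for `degree=14` in the same pipeline gives the overly complex model discussed next.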

It looks like a good fit, but let's take a look at a final model, using
a **polynomial of degree 14**.

This model looks like overkill. Although it correctly classifies all the training data, it fails on some of the testing data:

- 0 Training errors.
- 3 Testing errors.

It looks like the linear model is **underfitting** our dataset, since it
simplifies the data way too much. The polynomial of degree 4 seems to be a **good fit**,
since it generalizes well enough to the testing data. Finally, the polynomial of degree 14
is **overfitting** the data: it uses a very complex model that is very specific to our
training data, but it does not generalize well, as we end up with more testing errors.

In practice, you will often be faced with tons of different models, so it can be very handy to have a tool to check whether or not your model is doing well. That's where the model complexity graph comes in.

Building our graph is very easy: we just plot the errors (training and testing) for each of the different models that we have. You will end up with something like this:
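The graph itself can be sketched by sweeping over model complexity (polynomial degree, a hypothetical range here) and recording the training and testing error counts for each model, again on a stand-in dataset:

```python
# Sketch of the model complexity graph: one error count per model, where
# "complexity" is the polynomial degree. Dataset and degree range are
# illustrative assumptions.
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=100, noise=0.25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

degrees = range(1, 15)  # from the linear model up to degree 14
train_errors, test_errors = [], []
for d in degrees:
    model = make_pipeline(
        PolynomialFeatures(degree=d),
        LogisticRegression(max_iter=5000),
    )
    model.fit(X_train, y_train)
    train_errors.append((model.predict(X_train) != y_train).sum())
    test_errors.append((model.predict(X_test) != y_test).sum())

plt.plot(degrees, train_errors, marker="o", label="training errors")
plt.plot(degrees, test_errors, marker="o", label="testing errors")
plt.xlabel("model complexity (polynomial degree)")
plt.ylabel("number of errors")
plt.legend()
plt.show()
```

Underfitting shows up on the left of such a plot (both curves high), overfitting on the right (training errors near zero while testing errors climb).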

So we can draw some insights about our data directly from this graph.

- Our linear model is very simple, and both its training and testing errors are high, which shows that it is underfitting our data.
- Our polynomial of degree 4 seems to be a good fit, since it generalizes well to the testing data, even though it misses a few training samples.
- Our polynomial of degree 14 seems to be overfitting our data, since it does not miss any of the training data but performs poorly on the testing data.

With that in mind, it would be pretty easy to figure out the best approach for this dataset, right?