Today, a friend generously helped me out with a topic-relevant blog post. He decided to discuss a very technical aspect of AI, a topic that is super interesting and intrinsic to machine learning. The post goes over a different style of cost function reduction that can drastically impact computational times (for the better). Big shout-out to Gunnar Ensarro for the content! Enjoy!
In the field of deep learning, there is an assortment of problems that still need to be overcome. One problem that I looked at today is the continuous learning problem (often called continual learning), and in this blog post I will simplify a known learning algorithm for it: EWC (elastic weight consolidation).
Now let’s start off: what is continuous learning? Continuous learning is the process of learning from data as it is provided. It’s kinda like the drive-through of neural networks: data comes into the machine learning model’s drive-through, then the model prepares the weights for the corresponding data. Once the model has been fully adjusted, it can evaluate and represent the function of that data. A little abstract, but all fun and games aside, continuous learning is the process of learning on the fly. Now, how is the challenge being looked at in modern research? If you haven’t heard of it before, you will hear about it many times in this blog: elastic weight consolidation.
Elastic weight consolidation was first proposed in the research paper “Overcoming catastrophic forgetting in neural networks”. Now, you might be asking a lot of questions about what this is and how it works. To start off, this is a learning algorithm that helps with the weight optimization process. Basically, the algorithm scans over what the neural network has already learned and controls how the weights are updated afterwards, so that multiple datasets can be trained on one neural network without the newer ones overwriting the older ones. This does not hit the end goal of ‘learning on the fly’, but it’s a start in the general direction. Now that we understand what it does, how does it work?
So, when weights are updated in a neural network, you typically use a method called gradient descent. This process works by looking at the error of the neural network over a set of data. Then you compute the gradient of the error with respect to each weight through partial derivatives. The gradient points in the direction of steepest increase, so stepping the weights in the opposite direction (the direction of steepest descent) reduces the error. Now, how does EWC work in cooperation with this to train on multiple datasets? Well, let’s break the problem down. First off, we have dataset 1 and dataset 2. Our goal is to train the neural network on datasets 1 and 2 at different times, and have it understand both datasets to the point where it has fundamentally learned them.
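To make the gradient descent step above a bit more concrete, here is a minimal sketch in Python with NumPy. It is not code from the EWC paper; it just fits the simplest possible model, a straight line with a slope m and an offset b, to some made-up data. The function name, the learning rate, the step count, and the toy data are all assumptions picked for illustration.

```python
import numpy as np

def gradient_descent(x, y, lr=0.1, steps=500):
    """Fit y ~ m*x + b by plain gradient descent (illustrative sketch)."""
    m, b = 0.0, 0.0  # start with both weights at zero
    for _ in range(steps):
        error = (m * x + b) - y          # prediction error on every sample
        grad_m = np.mean(error * x)      # partial derivative of 0.5*mean(error^2) w.r.t. m
        grad_b = np.mean(error)          # partial derivative of 0.5*mean(error^2) w.r.t. b
        m -= lr * grad_m                 # step against the gradient to reduce the error
        b -= lr * grad_b
    return m, b

# Made-up data that follows the line y = 3x + 0.5 exactly.
x = np.linspace(-1.0, 1.0, 50)
y = 3.0 * x + 0.5

m, b = gradient_descent(x, y)
print("learned m:", m, "learned b:", b)
```

The learned values should land very close to the slope and offset used to generate the data. A real neural network does the same thing, just with many more weights and a gradient computed by backpropagation.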
To do this, we first need to ask ourselves: which of the weights matter the most to the dataset? Well, the weights are the coefficients of the function. Coefficients are the numbers in an equation. For example, in ‘Y=mx+b’, m is the coefficient of x. But what is that? My best bet is that it is the pattern between Y and x.
So what are they in neural networks? Well, they are the pattern between your vector input and your vector output. AMAZING! Now what’s even cooler is how this works with EWC. After we learn dataset 1, the neural net holds the weights of the pattern that dataset 1 holds. Now we must look at the weights and determine the most important ones, that is, how much each weight contributes to the vector output. Then, from there, we adjust the rate at which we update each weight in gradient descent while learning dataset 2: the more important a weight was for dataset 1, the more slowly it is allowed to change. The math behind this gets much more difficult than meets the eye, but the concept is quite solid. To help you out, below is a visual abstraction of the process when updating the weights theta learned on dataset A. Alongside this, there is a chart showing the power of EWC.
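If you would rather see the idea in code, here is a rough sketch of the importance-weighting step described above, again in Python with NumPy. In the paper, EWC adds a penalty of the form (lambda / 2) * sum over i of F_i * (theta_i - theta_A_i)^2 to the loss on the new dataset, where F_i is the diagonal Fisher information for weight i and theta_A are the weights learned on the first dataset. Everything else here is an assumption made to keep the example tiny: a two-weight linear model instead of a neural network, two made-up datasets, the helper names, the value of lambda, and the fact that for a linear model with unit-variance Gaussian noise the diagonal Fisher information works out to the mean of each squared input feature.

```python
import numpy as np

def grad_mse(w, X, y):
    """Gradient of 0.5 * mean squared error for a linear model y ~ X @ w."""
    return X.T @ (X @ w - y) / len(y)

def mse(w, X, y):
    """Mean squared error of the linear model on a dataset."""
    return float(np.mean((X @ w - y) ** 2))

def train(w_init, X, y, lr=0.1, steps=2000, lam=0.0, fisher=None, w_old=None):
    """Gradient descent, optionally with the EWC quadratic penalty (sketch)."""
    w = w_init.copy()
    for _ in range(steps):
        grad = grad_mse(w, X, y)
        if fisher is not None:
            # Gradient of (lam / 2) * sum(fisher * (w - w_old)**2):
            # important weights are pulled back toward their old values.
            grad = grad + lam * fisher * (w - w_old)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)

# Dataset A (made up): the output depends on feature 0, feature 1 is nearly silent.
X_a = rng.normal(size=(200, 2)) * np.array([1.0, 0.05])
y_a = X_a @ np.array([2.0, 0.0])

# Dataset B (made up): mostly driven by feature 1, and it disagrees with A on weight 0.
X_b = rng.normal(size=(200, 2)) * np.array([0.1, 1.0])
y_b = X_b @ np.array([-2.0, 3.0])

# Step 1: learn dataset A.
w_a = train(np.zeros(2), X_a, y_a)

# Step 2: estimate how important each weight was for A.
# Assumption: linear model with unit-variance Gaussian noise, so the
# diagonal Fisher information is the mean of each squared feature.
fisher = np.mean(X_a ** 2, axis=0)

# Step 3: learn dataset B, once without and once with the EWC penalty.
w_plain = train(w_a, X_b, y_b)
w_ewc = train(w_a, X_b, y_b, lam=5.0, fisher=fisher, w_old=w_a)

print("importance per weight:", fisher)
print("error on A, plain training on B:", mse(w_plain, X_a, y_a))
print("error on A, EWC training on B:  ", mse(w_ewc, X_a, y_a))
print("error on B, plain training on B:", mse(w_plain, X_b, y_b))
print("error on B, EWC training on B:  ", mse(w_ewc, X_b, y_b))
```

In this toy setup, the weight that mattered for dataset A gets a large importance value and barely moves while dataset B is learned, so the error on A stays low. The price is a somewhat higher error on B than plain training reaches, and lambda is the knob that sets how that trade-off is made.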