Neural Network Zoo
Based on the "Neural Network Zoo" by Fjodor van Veen & The Asimov Institute http://www.asimovinstitute.org/neural-network-zoo/ [In Progress]
Terms in this set (57)
Backfed Input Cell
Input Cell
Noisy Input Cell
Hidden Cell
Probabilistic Hidden Cell
Spiking Hidden Cell
Output Cell
Match Input Output Cell
Recurrent Cell
Memory Cell
Different Memory Cell
Kernel
Convolution or Pool
Perceptron (P)
Feeds information from the green input cell to the red output cell.
Feed Forward (FF)
Feeds information from the yellow input cells, through a layer of green hidden cells, to the red output cell.
This is the simplest possible NN, with two input cells and one output cell, which can be used to model logic gates. In general, a layer never has connections within itself, and two adjacent layers are fully connected.
Trained via back-propagation, where the error fed back is some variation of the difference between the target and the output (like MSE or just the linear difference).
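The training loop described above can be sketched with the classic perceptron rule, using the linear difference between target and output as the error signal (pure Python; the learning rate and epoch count are hypothetical choices):

```python
def train_perceptron(data, epochs=20, lr=0.1):
    # two weights plus a bias, all started at zero
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in data:
            out = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - out          # the "linear difference" error
            w[0] += lr * err * x1       # nudge weights toward the target
            w[1] += lr * err * x2
            b += lr * err
    return w, b

# modeling a logic gate: two input cells, one output cell (AND)
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND)

def predict(x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
```

After training, `predict` reproduces the AND truth table, since AND is linearly separable.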
Radial Basis Network (RBF)
Feeds information from the yellow input cells, through a layer of green hidden cells, to the red output cell.
Same as FFNNs, but activate with radial basis functions instead of sigmoidal functions.
Deep Feed Forward (DFF)
Feeds information from the left to the input cell, through layers of hidden cells, to the output cell.
Given enough hidden neurons, it can in theory always model the relationship between the input and output. In practice, FFs are often combined with other networks.
Recurrent Neural Network (RNN)
Feeds information from the yellow input cells through layers of blue recurrent cells to the red output cells.
FFNNs with a time twist: they are not stateless; they have connections between passes, connections through time.
Neurons are fed information not just from the previous layer but also from themselves from the previous pass, effectively giving them a memory of earlier activations.
This means that the order in which you feed the input and train the network matters: feeding it "milk" and then "cookies" may yield different results compared to feeding it "cookies" and then "milk".
One big problem with this network is the vanishing (or exploding) gradient problem where, depending on the activation functions used, information rapidly gets lost over time, just like very deep FFNNs lose information in depth.
Intuitively this wouldn't be much of a problem because these are just weights and not neuron states, but the weights through time are actually where the information from the past is stored; if the weight reaches a value of 0 or 10^N, the previous state won't be very informative.
RNNs can in principle be used in many fields where the data doesn't have a timeline (i.e. unlike sound or video) but can be represented as a sequence (a picture or string fed in one pixel or character at a time).
Here, the time (state) dependent weights are used for what came before in the sequence, not actually from what happened x seconds before. Therefore, these networks are generally a good choice for advancing or completing information, such as autocompletion.
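A toy single-neuron recurrent step (hypothetical scalar weights) shows why order matters: each step folds the previous hidden state back in, so the same items fed in a different order leave a different final state:

```python
import math

def rnn_step(x, h, w_x=0.5, w_h=0.9):
    # the new hidden state mixes the current input with the previous state
    return math.tanh(w_x * x + w_h * h)

def run(seq):
    h = 0.0                    # initial state
    for x in seq:
        h = rnn_step(x, h)     # state carried across passes
    return h

# "milk" then "cookies" vs "cookies" then "milk"
a = run([1.0, 2.0])
b = run([2.0, 1.0])
```

Here `a` and `b` differ even though the same inputs were presented, which a stateless FFNN could not distinguish.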
Long / Short Term Memory (LSTM)
Feeds information from the yellow input cells through layers of blue memory cells to the red output cells.
Try to combat the vanishing / exploding gradient problem of RNNs by introducing gates and an explicitly defined memory cell, inspired more by circuitry than biology.
Each neuron has a memory cell and three gates: input, output and forget. The function of these gates is to safeguard the information by stopping or allowing the flow of it.
The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate takes the job on the other end and determines how much of the next layer gets to know about the state of this cell.
The forget gate seems like an odd inclusion at first but sometimes it's good to forget: if it's learning a book and a new chapter begins, it may be necessary for the network to forget some characters from the previous chapter.
These NNs have been shown to be able to learn complex sequences, such as writing like Shakespeare or composing primitive music.
Note that each of these gates has a weight to a cell in the previous neuron, so they typically require more resources to run.
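A single-unit sketch of the three gates (Python; the `(w_x, w_h, bias)` triples are hypothetical parameters): the forget gate scales the old cell state, the input gate admits the candidate, and the output gate decides what leaves the cell:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def lstm_step(x, h, c, p):
    # p maps each gate name to a (w_x, w_h, bias) triple
    i = sigmoid(p['i'][0] * x + p['i'][1] * h + p['i'][2])    # input gate
    f = sigmoid(p['f'][0] * x + p['f'][1] * h + p['f'][2])    # forget gate
    o = sigmoid(p['o'][0] * x + p['o'][1] * h + p['o'][2])    # output gate
    g = math.tanh(p['g'][0] * x + p['g'][1] * h + p['g'][2])  # candidate
    c = f * c + i * g          # memory cell: forget some old, admit some new
    h = o * math.tanh(c)       # output gate safeguards what flows onward
    return h, c
```

With the forget and input gates clamped shut (large negative bias), even a large stored value is wiped, which is the "new chapter begins" behaviour described above.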
Gated Recurrent Unit (GRU)
Feeds information from the yellow input cells through layers of blue different memory cells to the red output cells.
A slight variation on LSTMs. They have one fewer gate and are wired with update and reset gates.
The update gate determines both how much information to keep from the last state and how much information to let in from the previous layer.
The reset gate functions much like the forget gate of an LSTM, but it always sends out its full state, so no separate output gate is needed.
In most cases, they function very similarly to LSTMs, with the biggest difference being that GRUs are slightly faster and easier to run (but also slightly less expressive). You often need a bigger network to regain the lost expressiveness, which can cancel out the performance benefits; but in cases where the extra expressiveness is not needed, GRUs can outperform LSTMs.
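A single-unit sketch of one GRU step (Python; the `(w_x, w_h, bias)` triples are hypothetical): the update gate `z` alone blends the old state with the new candidate, doing the work of the LSTM's input and output gates:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def gru_step(x, h, p):
    # p maps 'z' (update), 'r' (reset) and 'g' (candidate) to (w_x, w_h, bias)
    z = sigmoid(p['z'][0] * x + p['z'][1] * h + p['z'][2])      # update gate
    r = sigmoid(p['r'][0] * x + p['r'][1] * h + p['r'][2])      # reset gate
    g = math.tanh(p['g'][0] * x + p['g'][1] * (r * h) + p['g'][2])
    return (1 - z) * h + z * g   # one gate blends old state and candidate
```

With the update gate clamped shut (large negative bias), the old state passes through untouched no matter what the input is.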
Auto Encoder (AE)
Feeds information from the yellow input cells through layers of green hidden cells to the red matching input output cells.
Similar to FFNNs in direct architecture, but applied differently.
The idea is to encode information (as in compress, not encrypt) automatically, hence the name.
The entire network always resembles an hourglass-like shape, with smaller hidden layers than the input and output layers.
They are always symmetrical around the middle layer(s) (one or two, depending on an even or odd number of layers). The smallest layer(s) is (are) almost always in the middle, the place where the information is most compressed (the chokepoint of the network).
Everything up to the middle is called the encoding part, everything after the middle the decoding and the middle (surprise) the code.
Trained via backpropagation by feeding input and setting the error to be the difference between the input and what came out. Can be built symmetrically with regard to weights as well, so the encoding weights are the same as the decoding weights.
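A tied-weight linear autoencoder in miniature (pure Python, numerical gradients for simplicity; all values are hypothetical): 2-D points that really live on a 1-D line are squeezed through a single-number code, and the error, input minus output, is driven down by training:

```python
import random
random.seed(0)

# 2-D points that really live on a 1-D line: y = 2x
data = [(x, 2 * x) for x in (0.2, -0.5, 1.0, 0.7, -0.3)]

# one shared ("tied") weight vector: encoding and decoding use the same w
w = [random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1)]

def loss(w):
    total = 0.0
    for x in data:
        code = w[0] * x[0] + w[1] * x[1]      # encode: 2 -> 1, the chokepoint
        recon = (w[0] * code, w[1] * code)    # decode with the SAME weights
        total += (x[0] - recon[0]) ** 2 + (x[1] - recon[1]) ** 2
    return total / len(data)

lr, eps = 0.05, 1e-6
first = loss(w)
for _ in range(300):
    # numerical gradient keeps the sketch dependency-free
    base = loss(w)
    grad = [(loss([w[k] + eps if k == j else w[k] for k in range(2)]) - base) / eps
            for j in range(2)]
    w = [w[j] - lr * grad[j] for j in range(2)]
final = loss(w)
```

Because the data is genuinely one-dimensional, the reconstruction error falls essentially to zero: the code layer has learned the line's direction.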
Variational Autoencoder (VAE)
Feeds information from the yellow input cells through layers of probabilistic hidden cells to the red matching input output cells.
Have the same architecture as AEs but are "taught" something else: an approximated probability distribution of the input samples.
They are a bit more closely related to BMs and RBMs, but they rely on Bayesian mathematics regarding probabilistic inference and independence, as well as a re-parametrisation trick to achieve this different representation.
The inference and independence parts make sense intuitively, but they rely on somewhat complex mathematics. The basics come down to this: take influence into account. If one thing happens in one place and something else happens somewhere else, they are not necessarily related. If they are not related, then the error propagation should consider that.
This is a useful approach because neural networks are large graphs (in a way), so it helps if you can rule out influence from some nodes to other nodes as you dive into deeper layers.
Denoising Autoencoder (DAE)
Feeds information from the yellow noisy input cells through layers of hidden cells to the red matching input output cells.
We don't feed just the input data; we feed the input data with noise (like making an image more grainy).
We compute the error the same way though, so the output of the network is compared to the original input without noise. This encourages the network not to learn details but broader features, as learning smaller features often turns out to be "wrong" due to it constantly changing with noise. These help generalize the input to 'search through the noise'.
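A sketch of that training signal (Python; the noise level is a hypothetical choice): even a network that perfectly reproduces its input scores a nonzero error, because the output is compared against the clean original, which pushes it toward broader features instead:

```python
import random
random.seed(1)

clean = [0.0, 1.0, 0.0, 1.0]
# corrupt the input (like making an image more grainy)
noisy = [v + random.uniform(-0.3, 0.3) for v in clean]

def network(x):
    # stand-in for the autoencoder's forward pass; a pure identity
    # mapping would just reproduce the noise
    return x

output = network(noisy)
# the error is measured against the ORIGINAL, noise-free input
error = sum((o - c) ** 2 for o, c in zip(output, clean)) / len(clean)
```

The identity "solution" that plain AEs can fall into is penalised here, since copying the noisy input can never match the clean target.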
Sparse Autoencoder (SAE)
Feeds information from the yellow input cells through layers of hidden cells to the red matching input output cells.
In a way, the opposite of their family. Instead of teaching a network to represent a bunch of information in less "space" or nodes, we try to encode information in more space.
So instead of the network converging in the middle and then expanding back to the input size, we blow up the middle. These types of networks can be used to extract many small features from a dataset.
If you were to train this network the same way as an AE, you would in almost all cases end up with a pretty useless identity network (what comes in is what comes out, without any transformation or decomposition). To prevent this, instead of feeding back the input, we feed back the input plus a sparsity driver.
This sparsity driver can take the form of a threshold filter, where only a certain error is passed back and trained, the other error will be "irrelevant" for that pass and set to zero. In a way this resembles spiking neural networks, where not all neurons fire all the time (and points are scored for biological plausibility).
Markov Chain or Discrete Time Markov Chain (MC or DTMC)
An interconnected series of hidden probabilistic cells.
The predecessors to BMs and HNs. They can be understood as follows: from this node where I am now, what are the odds of me going to any of my neighbouring nodes?
They are memoryless (i.e. the Markov Property), which means that every state you end up in depends completely on the previous state. While not really neural networks (alongside BMs, RBMs and HNs), they do resemble neural networks and form the theoretical basis for BMs and HNs.
They aren't always fully connected either.
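The "from this node, what are the odds of moving to each neighbour" view in a sketch (Python, with a hypothetical two-state weather chain):

```python
import random
random.seed(0)

# transition probabilities: from each state, odds of moving to each neighbour
T = {
    'sunny': {'sunny': 0.8, 'rainy': 0.2},
    'rainy': {'sunny': 0.4, 'rainy': 0.6},
}

def step(state):
    # sample the next state from the current state's transition row
    r, acc = random.random(), 0.0
    for nxt, p in T[state].items():
        acc += p
        if r < acc:
            return nxt
    return nxt

def walk(state, n):
    seq = [state]
    for _ in range(n):
        state = step(state)   # memoryless: depends only on the current state
        seq.append(state)
    return seq

chain = walk('sunny', 10)
```

Note the chain never looks further back than one step; that is the Markov Property in action.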
Hopfield Network (HN)
An interconnected series of yellow backfed input cells.
A network where every neuron is connected to every other neuron; it is a completely entangled plate of spaghetti where all the nodes function as everything.
Each node is input before training, then hidden during training and output afterwards. The networks are trained by setting the value of the neurons to the desired pattern after which the weights can be computed.
The weights do not change after this. Once trained for one or more patterns, the network will always converge to one of the learned patterns because the network is only stable in those states. Note that it does not always conform to the desired state (it's not a magic black box sadly). It stabilises in part due to the total "energy" or "temperature" of the network being reduced incrementally during training. Each neuron has an activation threshold which scales to this temperature, which if surpassed by summing the input causes the neuron to take the form of one of two states (usually -1 or 1, sometimes 0 or 1).
Updating the network can be done synchronously or more commonly one by one. If updated one by one, a fair random sequence is created to organise which cells update in what order (fair random being all options (n) occurring exactly once every n items). This is so you can tell when the network is stable (done converging), once every cell has been updated and none of them changed, the network is stable (annealed).
These networks are often called associative memory because they converge to the state most similar to the input; if humans see half a table we can imagine the other half, and this network will converge to a table if presented with half noise and half a table.
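A minimal sketch of that associative recall (Python, one stored pattern via the Hebbian rule): training sets the weights from the desired pattern, then one-by-one threshold updates pull a corrupted state back to the stored one:

```python
# store one pattern, then recover it from a corrupted copy
pattern = [1, -1, 1, -1, 1, -1]
n = len(pattern)

# Hebbian rule: w_ij = x_i * x_j, with no self-connections;
# the weights do not change after this
W = [[pattern[i] * pattern[j] if i != j else 0 for j in range(n)]
     for i in range(n)]

def recall(state, sweeps=3):
    state = list(state)
    for _ in range(sweeps):
        for i in range(n):          # update cells one by one
            s = sum(W[i][j] * state[j] for j in range(n))
            state[i] = 1 if s >= 0 else -1   # two states: -1 or 1
    return state

corrupted = [1, -1, -1, -1, 1, 1]   # two cells flipped
```

Feeding the half-wrong state back through the network converges it to the learned pattern, the "half a table" behaviour from the card.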
Boltzmann Machine (BM)
An interconnected series of yellow backfed input cells and green hidden probabilistic cells.
Like HNs, but: some neurons are marked as input neurons and others remain "hidden". The input neurons become output neurons at the end of a full network update.
It starts with random weights and learns through back-propagation, or more recently through contrastive divergence (a Markov chain is used to determine the gradients between two informational gains).
Compared to a HN, the neurons mostly have binary activation patterns. As hinted by being trained by MCs, BMs are stochastic networks.
The training and running process of a BM is fairly similar to a HN: one sets the input neurons to certain clamped values after which the network is set free. While free the cells can get any value and we repetitively go back and forth between the input and hidden neurons. The activation is controlled by a global temperature value, which if lowered lowers the energy of the cells. This lower energy causes their activation patterns to stabilise. The network reaches an equilibrium given the right temperature.
Restricted BM (RBM)
Feeds information from the yellow backfed input cells through layers of hidden probabilistic cells.
Remarkably similar to BMs (surprise) and therefore also similar to HNs. The biggest difference between BMs and RBMs is that RBMs are more usable in practice because they are more restricted. They don't trigger-happily connect every neuron to every other neuron but only connect every different group of neurons to every other group, so no input neurons are directly connected to other input neurons and no hidden to hidden connections are made either.
RBMs can be trained like FFNNs with a twist: instead of passing data forward and then back-propagating, you forward pass the data and then backward pass the data (back to the first layer). After that you train with forward-and-back-propagation.
Deep Belief Network (DBN)
Feeds information from the yellow backfed input cells through layers of green mixed hidden cells and hidden probabilistic cells, with red matching input output cells.
The name given to stacked architectures of mostly RBMs or VAEs, but effectively trainable stack by stack, where each AE or RBM only has to learn to encode the previous network.
This technique is also known as greedy training, where greedy means making locally optimal solutions to get to a decent but possibly not optimal answer.
Can be trained through contrastive divergence or back-propagation and learn to represent the data as a probabilistic model, just like regular RBMs or VAEs. Once trained or converged to a (more) stable state through unsupervised learning, the model can be used to generate new data. If trained with contrastive divergence, it can even classify existing data because the neurons have been taught to look for different features.
Deep Convolutional Network (DCN OR DCNN)
Feeds information from the pink kernels and convolutions/pools, and again through layers of green hidden cells until the red output cell layer.
These are quite different from most other networks. They are primarily used for image processing but can also be used for other types of input such as audio.
A typical use case is image classification; feeding the network images and the network classifies the data, e.g. it outputs "cat" if you give it a cat picture and "dog" when you give it a dog picture.
These tend to start with an input "scanner", which is not intended to parse all the training data at once. For example, to input an image of 200 x 200 pixels, you wouldn't want a layer with 40 000 nodes. Rather, you create a scanning input layer of say 20 x 20, to which you feed the first 20 x 20 pixels of the image (usually starting in the upper left corner). Once you've passed that input (and possibly used it for training) you feed it the next 20 x 20 pixels: you move the scanner one pixel to the right. Note that you wouldn't move the input 20 pixels (or whatever the scanner width is) over; you're not dissecting the image into blocks of 20 x 20, but rather crawling over it.
This input data is then fed through convolutional layers instead of normal layers, where not all nodes are connected to all nodes. Each node only concerns itself with close neighbouring cells (how close depends on the implementation, but usually not more than a few). These convolutional layers also tend to shrink as they become deeper, mostly by easily divisible factors of the input (so 20 would probably go to a layer of 10 followed by a layer of 5). Powers of two are very commonly used here, as they can be divided cleanly and completely by definition: 32, 16, 8, 4, 2, 1.
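The crawling scanner can be sketched as a stride-1 convolution (pure Python; the tiny image and the 2 x 2 vertical-edge kernel are hypothetical):

```python
def convolve2d(img, kernel):
    # slide the kernel one pixel at a time (stride 1, no padding)
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(img) - kh + 1):
        row = []
        for j in range(len(img[0]) - kw + 1):
            row.append(sum(kernel[a][b] * img[i + a][j + b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

# a bright region on the right; the kernel responds only at the boundary
img = [[0, 0, 9, 9],
       [0, 0, 9, 9],
       [0, 0, 9, 9]]
kernel = [[1, -1],
          [1, -1]]
edges = convolve2d(img, kernel)
```

Each output node only "sees" its close neighbourhood (here 2 x 2), which is exactly the limited connectivity described above.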
Besides these convolutional layers, they also often feature pooling layers. Pooling is a way to filter out details: a commonly found pooling technique is max pooling, where we take say 2 x 2 pixels and pass on the pixel with the most amount of red.
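Max pooling over 2 x 2 blocks, as described, keeps only the strongest response in each block and shrinks the layer by half in each direction (Python sketch with hypothetical values):

```python
def max_pool_2x2(img):
    # keep the strongest response in each non-overlapping 2x2 block
    h, w = len(img), len(img[0])
    return [[max(img[i][j], img[i][j + 1], img[i + 1][j], img[i + 1][j + 1])
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

img = [
    [1, 3, 0, 2],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(img)   # 4x4 shrinks to 2x2
```

Details inside each block are discarded; only the peak survives, which is the "filter out details" behaviour.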
For audio, you basically feed in the input audio waves and inch over the length of the clip, segment by segment. Real world implementations often glue an FFNN to the end to further process the data, which allows for highly non-linear abstractions. These networks are abbreviated both DCN and DCNN, and the two names are used interchangeably.
Deconvolutional Network (DN)
Feeds information from the yellow input cells, through pink convolutions/pools, again through layers of kernels until the red output cell layer.
Also called inverse graphics networks (IGNs), these are reversed convolutional neural networks.
Imagine feeding a network the word "cat" and training it to produce cat-like pictures, by comparing what it generates to real pictures of cats. DNs can be combined with FFNNs just like regular CNNs, but this is about the point where the line is drawn on coming up with new abbreviations.
Note that in most applications one wouldn't actually feed text-like input to the network, more likely a binary classification input vector. Think <0, 1> being cat, <1, 0> being dog and <1, 1> being cat and dog. The pooling layers commonly found in CNNs are often replaced with similar inverse operations, mainly interpolation and extrapolation with biased assumptions (if a pooling layer uses max pooling, you can invent exclusively lower new data when reversing it).
Deep Convolutional Inverse Graphics Network (DCIGN)
Feeds information from the yellow input cells, through a layer of pink kernels until pink convolutions/pools, converging to a layer of green hidden probabilistic cells, and then out again through layers of kernels and convolutions until the red output cell layer.
These have a somewhat misleading name, as they are actually VAEs but with CNNs and DNNs for the respective encoders and decoders.
These networks attempt to model "features" in the encoding as probabilities, so that it can learn to produce a picture with a cat and a dog together, having only ever seen one of the two in separate pictures. Similarly, you could feed it a picture of a cat with your neighbours' annoying dog on it, and ask it to remove the dog, without ever having done such an operation.
Demos have shown that these networks can also learn to model complex transformations on images, such as changing the source of light or the rotation of a 3D object. These networks tend to be trained with back-propagation.
Generative Adversarial Network (GAN)
Feeds information from the yellow backfed input cells, through a layer of green hidden cells to red matching input output cells, which are fed again through layers of green hidden cells until a red layer of matching input output cells.
These are a different breed of networks, they are twins: two networks working together.
They consist of any two networks (although often a combination of FFs and CNNs), one tasked with generating content and the other with judging it.
The discriminating network receives either training data or generated content from the generative network. How well the discriminating network was able to correctly predict the data source is then used as part of the error for the generating network.
This creates a form of competition where the discriminator is getting better at distinguishing real data from generated data and the generator is learning to become less predictable to the discriminator. This works well in part because even quite complex noise-like patterns are eventually predictable but generated content similar in features to the input data is harder to learn to distinguish.
Can be quite difficult to train, as you don't just have to train two networks (either of which can pose its own problems) but their dynamics need to be balanced as well. If prediction or generation becomes too good compared to the other, a GAN won't converge, as there is intrinsic divergence.
Liquid State Machine (LSM)
Feeds information from the yellow input cells, through a layer of green spiking hidden cells to red output cells.
Look a lot like ESNs, with the real difference being the spiking behavior: sigmoid activations are replaced with threshold functions and each neuron is also an accumulating memory cell.
So when updating a neuron, the value is not set to the sum of the neighbours, but rather added to itself. Once the threshold is reached, it releases its energy to other neurons. This creates a spiking-like pattern, where nothing happens for a while until a threshold is suddenly reached.
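The accumulate-then-fire behaviour in a sketch (Python, single integrate-and-fire neuron; the threshold and input train are hypothetical):

```python
def integrate_and_fire(inputs, threshold=1.0):
    # the neuron accumulates incoming values; on crossing the threshold
    # it fires (emits 1) and releases its stored energy
    v, spikes = 0.0, []
    for x in inputs:
        v += x                  # added to itself, not overwritten
        if v >= threshold:
            spikes.append(1)    # spike!
            v = 0.0             # energy released
        else:
            spikes.append(0)    # quiet while charging
    return spikes

spikes = integrate_and_fire([0.3, 0.3, 0.5, 0.2, 0.9])
```

Nothing happens for a while, then a spike, exactly the bursty pattern described above.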
Extreme Learning Machine (ELM)
Feeds information from the yellow input cells, through layers of green hidden cells to red output cells.
Basically FFNNs with random connections. They look very similar to LSMs and ESNs, but they are used more like FFNNs. This is not just because they are not recurrent nor spiking, but also because they are trained without backpropagation: the hidden weights are left random, and only the output weights are fitted (in a single step, via a least-squares fit).
Echo State Network (ESN)
Feeds information from the yellow input cells, through random connections of blue memory cells to red output cells.
Yet another type of recurrent network. This one sets itself apart from others by having random connections between the neurons (i.e. not organised into neat sets of layers), and they are trained differently.
Instead of feeding input and back-propagating the error, we feed the input, forward it and update the neurons for a while, and observe the output over time.
The input and the output layers have a slightly unconventional role as the input layer is used to prime the network and the output layer acts as an observer of the activation patterns that unfold over time. During training, only the connections between the observer and the (soup of) hidden units are changed.
Deep Residual Network (DRN)
Feeds information from the yellow input cells, through interlinked connections of hidden memory cells, sometimes jumping layers, to red output cells.
Very deep FFNNs with extra connections passing input from one layer to a later layer (often 2 to 5 layers on) as well as the next layer. Instead of trying to find a solution for mapping some input to some output across say 5 layers, the network is forced to learn to map some input to some output + some input. Basically, it adds an identity to the solution, carrying the older input over and serving it freshly to a later layer.
It has been shown that these networks are very effective at learning patterns up to 150 layers deep, much more than the regular 2 to 5 layers one could expect to train.
However, it has been proven that these networks are in essence just RNNs without the explicit time based construction and they're often compared to LSTMs without gates.
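The "output + some input" idea in a minimal sketch (Python, with hypothetical tanh layers): when the layers contribute nothing, the identity still carries the input through unchanged, which is what keeps very deep stacks trainable:

```python
import math

def layer(x, w):
    # a single hypothetical tanh layer with one scalar weight
    return math.tanh(w * x)

def residual_block(x, w1, w2):
    # the two layers learn a correction; the input is carried over
    # and added back ("output + some input")
    return layer(layer(x, w1), w2) + x
```

With both weights at zero the block is a pure identity; with nonzero weights it learns a deviation from the identity rather than the whole mapping.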
Kohonen Network (KN)
Feeds information from the yellow input cells, exclusively to green hidden cells.
Also called self-organising (feature) maps (SOM, SOFM). They utilise competitive learning to classify data without supervision.
Input is presented to the network, after which the network assesses which of its neurons most closely match that input. These neurons are then adjusted to match the input even better, dragging along their neighbours in the process. How much the neighbours are moved depends on the distance of the neighbours to the best matching units. KNs are sometimes not considered neural networks either.
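One competitive-learning update for a tiny 1-D map (Python; the learning rates and unit positions are hypothetical): find the best matching unit, then drag it, and its immediate neighbours by a lesser amount, toward the input:

```python
def som_update(units, x, lr=0.5, neighbour_lr=0.25):
    # best matching unit = the one closest to the input
    bmu = min(range(len(units)), key=lambda i: abs(units[i] - x))
    units = list(units)
    units[bmu] += lr * (x - units[bmu])
    for j in (bmu - 1, bmu + 1):        # neighbours are dragged along, less
        if 0 <= j < len(units):
            units[j] += neighbour_lr * (x - units[j])
    return units

units = som_update([0.0, 0.5, 1.0], x=0.9)
```

The unit at 1.0 wins and moves to 0.95; its neighbour at 0.5 is dragged to 0.6; the far unit doesn't move at all.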
Support Vector Machine (SVM)
Feeds information from the yellow input cells, through layers of green hidden cells to a red output cell.
These find optimal solutions for classification problems.
Classically they were only capable of categorising linearly separable data; say finding which images are of Garfield and which of Snoopy, with any other outcome not being possible.
During training, SVMs can be thought of as plotting all the data (Garfields and Snoopys) on a graph (2D) and figuring out how to draw a line between the data points. This line would separate the data, so that all Snoopys are on one side and the Garfields on the other. This line moves to an optimal line in such a way that the margins between the data points and the line are maximised on both sides. Classifying new data would be done by plotting a point on this graph and simply looking on which side of the line it is (Snoopy side or Garfield side).
Using the kernel trick, they can be taught to classify n-dimensional data. This entails plotting points in a 3D plot, allowing it to distinguish between Snoopy, Garfield AND Simon's cat, or even higher dimensions distinguishing even more cartoon characters. They are not always considered neural networks.
Neural Turing Machine (NTM)
Feeds information from the yellow input cells, through a layer of green spiking hidden cells which are supported by memory cells, to a red output cell.
Can be understood as an abstraction of LSTMs and an attempt to un-black-box neural networks (and give us some insight in what is going on in there); instead of coding a memory cell directly into a neuron, the memory is separated.
It's an attempt to combine the efficiency and permanency of regular digital storage with the efficiency and expressive power of neural networks.
The idea is to have a content-addressable memory bank and a neural network that can read and write from it. The "Turing" bit comes from them being Turing complete: the ability to read, write and change state based on what it reads means it can represent anything a Universal Turing Machine can represent.
Neural Networks
Models that have layers, each layer consisting of input, hidden or output cells in parallel. A layer alone never has connections within itself, and in general two adjacent layers are fully connected (every neuron from one layer to every neuron in the adjacent layer). The simplest somewhat practical network has two input cells and one output cell, which can be used to model logic gates.
Supervised Learning
Learning via defined inputs and outputs.
Unsupervised Learning
Let the network fill in the blanks.
Backpropagation
...
Forward-and-back-propagation
...
Feedforward
...
Activation Function
The function which determines if a neuron is on/off.
Radial Basis Function
Activation Function:
Activates when x is within radial distance.
A common kernel function for SVMs. Akin to producing a cross-section of higher dimensional data, but inappropriate for too many inputs or very noisy data.
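A Gaussian radial basis function is a common concrete choice (Python sketch; the center and width parameters are hypothetical): it peaks at the center and falls off with radial distance:

```python
import math

def rbf(x, center=0.0, width=1.0):
    # Gaussian radial basis: maximal response at the center,
    # decaying with the (squared) radial distance from it
    return math.exp(-((x - center) ** 2) / (2 * width ** 2))
```

Unlike a sigmoid, the response is not monotone in x: moving away from the center in either direction reduces the activation.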
Sigmoidal Function
Activation Function:
Activates proportionally to x, squashing the output to between 0 and 1.
The traditional 'S-curve', or logistic function, is also one of the most organic curves in nature. It also has the benefit of having finite limits (generally 0 to 1). 'Hard' versions (no exponent calculation) are often used to improve performance, since they are faster to compute.
Threshold
Activation Function:
Activates when x is greater than a specified value.
A stricter version of the 'S-curve'.
Rectified Linear Unit
Activation Function:
Activates proportionally when x is larger than 0.
Also known as `ReLU` or the Ramp function. This activation function has been argued to be more biologically plausible than the widely used logistic sigmoid and the hyperbolic tangent. As of 2015, the most popular activation function for deep neural networks, due to its many benefits:
- Biological plausibility: one-sided, compared to the antisymmetry of tanh.
- Sparse activation: in a randomly initialized network, only about 50% of hidden units are actually activated.
- Efficient gradient propagation (no vanishing or exploding gradient problems).
- Efficient computation: Only comparison, addition and multiplication.
- Scale-invariant: Since it's a max, features do not have to be normalized or transformed.
Leaky Rectified Linear Unit
Activation Function:
Activates when x is greater than a percentage of itself.
A variant of the ReLU that seeks to avoid the 'dead neuron' problem, where a ReLU stuck at 0 has no ability to recover (turn on again) due to the nature of stochastic gradient training (a large learning rate amplifies this problem). The fix is to modify the flat side to have a small gradient, so it can recover from such a position.
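The fix in code (Python sketch; the slope `alpha` is a hypothetical small constant): the negative side keeps a small nonzero output, so the gradient there is `alpha` instead of zero and a dead unit can climb back out:

```python
def relu(x):
    # flat at zero for all negative inputs -- no gradient to recover with
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # small slope on the negative side keeps the unit trainable
    return x if x > 0 else alpha * x
```

For positive inputs the two are identical; they differ only where the plain ReLU would have been stuck flat.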
Softplus
Activation Function:
Activates when x is approximately larger than 0.
This function is a smooth approximation for the ReLU function, while also being easy to compute. Its derivative also happens to be the sigmoidal function.
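The claim about the derivative can be checked numerically (Python): a central-difference slope of softplus lands on the sigmoid at the same point:

```python
import math

def softplus(x):
    return math.log(1 + math.exp(x))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# numerical derivative of softplus at an arbitrary point
eps = 1e-6
x = 0.7
num_deriv = (softplus(x + eps) - softplus(x - eps)) / (2 * eps)
```

The two values agree to many decimal places, confirming d/dx softplus(x) = sigmoid(x).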
Arctan
Activation Function:
Activates proportionally, with output between -π/2 and π/2.
Often conflated with tanh (the hyperbolic tangent), which is a rescaling of the logistic sigmoid such that its outputs range from -1 to 1 (there's horizontal stretching as well); arctan is a similar S-shaped curve whose outputs range from -π/2 to π/2. Both are very old, and I am still researching their merits.
Weights and Activations
The yellow input cells feed their values, scaled by weights, into the green hidden cells, and again to the red output cell. This framework of weights and biases becomes the 'solution' of a traditional NN.
This is the means by which a neural network maps input values to output values; after training, the same weights and biases are applied to unknown input values to 'predict' output values.
In the case of the upper neuron, the input vector is [1, 0.5] and the weight vector into that neuron is [2.1, -5.8]. Multiplying these element-wise gives 2.1 and -2.9, which sum to the final parameter -0.8.
This is taken into one of the hidden green cells, which activates here via the sigmoidal (S-curve) function to get the result 0.3, which becomes the activation of this neuron.
In the case of the lower neuron, the input vector is still [1, 0.5] but the weight vector into this neuron is [1.2, 0.2]. Multiplying and summing gives 1.3, which is put into the sigmoidal function to 'activate' the neuron (rounded here to 1.0).
For the final red output cell, the same weight-and-sum process happens again, this time with the activations of the green hidden layer [0.3, 1.0] and weights [1.2, 0.2], which is solved and summed again to get 0.6.
This is how the input values of 1.0 and 0.5 are associated to the 0.6 output value.
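The walkthrough above in code (Python; the weight vector into the upper neuron is taken as [2.1, -5.8] so that the element-wise products come out as 2.1 and -2.9; note the exact sigmoid outputs are about 0.31 and 0.79, which the card rounds to 0.3 and 1.0):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

inputs  = [1.0, 0.5]
w_upper = [2.1, -5.8]   # weights into the upper hidden neuron
w_lower = [1.2, 0.2]    # weights into the lower hidden neuron
w_out   = [1.2, 0.2]    # weights into the output cell

# weighted sums ("final parameters") for each hidden neuron
pre_upper = sum(i * w for i, w in zip(inputs, w_upper))   # -> -0.8
pre_lower = sum(i * w for i, w in zip(inputs, w_lower))   # -> 1.3

# sigmoid activations of the hidden layer
hidden = [sigmoid(pre_upper), sigmoid(pre_lower)]

# the output cell repeats the weight-and-sum step on the activations
output = sum(h * w for h, w in zip(hidden, w_out))
```

Run exactly, the output lands near 0.53 rather than 0.6, because the card rounds the hidden activations before the final sum.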
In training, this NN will probably categorize values near {0.5, 1} to be near the 0.6 output. Every NN is different and will use its weights and biases differently (even arbitrarily, without training).
However, it is these weights and biases that describe a network, and they are what is retained after the network is 'taught' via backpropagation (the transformation of these weights and biases to map inputs to outputs via error minimization).
Therefore, the best NN is the one whose weights and biases, trained via sample data, have the lowest error mapping inputs to outputs, especially given actual population data.
Overfitting
When a system of weights and biases for an NN generates output values that are too specific to the sample data.
When an NN is overtrained, it will provide solutions that are too specific to the training data.
This typically means it will show high variance in the face of population data, since it may overvalue sample features in the population; if the NN has never been trained on a given input value, it may generate output values that are 'too-specific', or even arbitrary, which is extremely problematic.
Underfitting
When a system of weights and biases for an NN generates output values that are too general to the sample data; high bias.
When an NN is too simplified, it is too biased to its initial weights and biases; it has not been sufficiently trained with sample data, and will typically ignore sample and population features, generally outputting the same (often arbitrary) values.