i am trask

Is the Universe Random?

Mon, 19 Jun 2017 12:00:00 +0000

TLDR: I've recently wondered about whether the Universe is truly random, and I thought I'd write down a few thoughts on the subject. As a heads up, this post is more about sharing a personal journey I'm on than teaching a skill or tool (unlike many other blogposts). Feel free to chat with me about these ideas on Twitter as I'm still working through them myself and I'd love to hear your perspective.

I typically tweet out new blogposts when they're complete @iamtrask. Thank you for your time and attention, and I hope you enjoy the post.

The Oxford English Dictionary defines randomness as "a lack of pattern or predictability in events". I like this definition as it reveals that "randomness" is more about the predictive ability of the observer than about an event itself. Consider the following consequence of this definition:

Consequence: Two people can accurately describe the same event as having different degrees of randomness.

Consider two meteorologists trying to predict the probability that it will rain today. One person (we will call them the "Ignorant Meteorologist") is only allowed a record of how often it has rained in the region between the years 1900 and 2000. The second person (the "Smart Meteorologist") is also allowed this information, but this second individual is also allowed to know today's date. These two individuals would consider the likelihood of rain to be very different. The Ignorant Meteorologist would simply say, "it rains 40% of the time in the region, thus there is a 40% chance of rain today." What else can he/she say? Given the information provided, the degree of randomness is very high. However, the Smart Meteorologist is more informed. He/she might be able to say "it is the dry season, thus there is only a 5% chance of rain today".

If we asked each of these meteorologists to continue to make predictions over time, they would each make predictions with different degrees of randomness (one higher and one lower) in accordance with the availability of information. However, the event is no more or less random in and of itself. It's only more or less random in relation to an individual and the predictive information available to them.
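One way to put numbers on this difference is Shannon entropy, which measures how uncertain a yes/no forecast is. The sketch below is only an illustration under stated assumptions: the 40% overall figure and the 5% dry-season figure come from the example above, while the two equally long seasons and the 75% wet-season figure are hypothetical values chosen so that the yearly average works out to 40%.

```python
import math

def binary_entropy(p):
    """Uncertainty (in bits) of a yes/no event that happens with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Assumed (hypothetical) climate: two equally long seasons.
P_RAIN_DRY = 0.05    # from the example above
P_RAIN_WET = 0.75    # assumption, chosen so the yearly average is 40%
P_DRY_SEASON = 0.5   # assumption

# Ignorant Meteorologist: only knows the long-run rate.
p_rain_overall = P_DRY_SEASON * P_RAIN_DRY + (1 - P_DRY_SEASON) * P_RAIN_WET
ignorant_bits = binary_entropy(p_rain_overall)

# Smart Meteorologist: knows the date, and therefore the season.
smart_bits = (P_DRY_SEASON * binary_entropy(P_RAIN_DRY)
              + (1 - P_DRY_SEASON) * binary_entropy(P_RAIN_WET))

print("overall chance of rain:", round(p_rain_overall, 2))
print("ignorant uncertainty (bits):", round(ignorant_bits, 3))
print("smart uncertainty (bits):", round(smart_bits, 3))
```

The weather is the same in both cases; only the information available to the forecaster changes the amount of uncertainty.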

Perhaps this makes you feel that randomness is no longer real, that it is only in the eye of the beholder. However, I believe that the context of Machine Learning provides a much more precise definition of randomness. In this context, one can think of randomness as a measure of compatibility between 3 datasets: "input", "target", and "model".

Input: is data that is readily known. It is what we're going to try to use to predict something else. All potential "causes" are contained within the input.

Target: is the data that we wish to predict. This is the thing that we say is either random or not random.

Model: this "dataset" is simply a set of instructions for transforming the input into the target. When a human is making a prediction, this is the human's intelligence. In a computer, it's a large series of 1s and 0s which operate gates and represent a particular transformation.

Thus, we can view randomness as merely a degree of compatibility between these three datasets. If there is a high degree of compatibility (input -> model -> target is very reliable), then there is a low degree of randomness.
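As a rough sketch of this idea, randomness can be scored as the fraction of targets that a model fails to reproduce from the inputs. The function name `randomness`, the toy even/odd data, and the two models are all hypothetical and exist only for illustration.

```python
def randomness(inputs, targets, model):
    """Fraction of targets the model fails to reproduce (0 = fully predictable)."""
    misses = sum(1 for x, y in zip(inputs, targets) if model(x) != y)
    return misses / len(inputs)

# Toy data: the target is whether the input number is even.
inputs = list(range(100))
targets = [x % 2 == 0 for x in inputs]

def informed_model(x):
    # A model that has captured the true relationship.
    return x % 2 == 0

def uninformed_model(x):
    # A model that ignores the input entirely.
    return True

print(randomness(inputs, targets, informed_model))    # 0.0
print(randomness(inputs, targets, uninformed_model))  # 0.5
```

The input and target are identical in both calls; swapping the model alone moves the score from 0.0 to 0.5.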

Now that we've identified what I believe to be the most practical and commonly used definition of randomness, I'd like to introduce the bigger version of randomness, which we'll call Randomness (with a capital R). This Randomness refers to whether, given infinite knowledge and intelligence, a future event could be predicted before it happens (a "model" dataset could exist for that event). This Randomness also implies whether or not something is caused by another thing. If it is NOT caused but simply IS of its own accord, it is (capital "R") Random.

Is the Universe Random?

The simple answer is that, from our perspective, it has a degree of randomness that is rapidly decreasing. We are consistently able to predict future events with greater accuracy. Furthermore, predicting outcomes given our interaction allows us a certain degree of control over the future (because we can choose interactions which we predict will lead to the desired outcome). This plays out in the advancement of every sector: agriculture, healthcare, finance (bigtime), politics, etc. However, because we cannot predict the Universe entirely, it is somewhat random. (lowercase "r")

Whether the Universe is uppercase "R" Random is a different question entirely. We can, however, make some progress on this question:

Claim 1: Causality exists

In short, we can predict things with far better than random accuracy. Thus, some things tend to cause other things. We might even be able to say that some things absolutely cause other things unless affected by some unpredictable randomness, but we don't even need this bold of a claim. Simply stated, much of the Universe is probably causal because we can predict events with better than random accuracy. To claim otherwise would be extremely unlikely and would imply that the entirety of human innovation and intelligence throughout history leading to prosperity and survival was simply a coincidence. It's possible, but very unlikely. We'll go ahead and accept the notion that causality exists in the Universe.

Claim 2: The Universe is not entirely Random.

For all things in the Universe that are not Random but are instead caused, randomness (lowercase) in their behavior is a result of either random or Random objects acting upon them. Thus, asking whether the Universe is Random is about asking whether or not there exists a Random object within it. It is not about asking whether every object is inherently random, because cause and effect can transfer the unpredictable behavior of a Random object across the Universe via things that are merely random.

Claim 3: Because we can observe cause and effect relationships, the Universe is, at most, a mixture of Random and causal objects, and at least, exclusively made of causal objects.

This brings us to the root of our question. When we ask "and what caused that?" over and over again, where do we end up? Well, there are 4 possible states of the Universe (finite or infinite space, finite or infinite time):

Finite Time + Finite Space: If time is finite, there was a beginning. If there was a beginning, there was a TON of Randomness which began the Universe. Thus, Randomness at least has existed within the Universe (although whether it still exists is less certain).
Finite Time + Infinite Space: (see above)
Infinite Time + Finite Space: Laws of entropy determine that if this were our Universe, you wouldn't be reading this blogpost, as all the energy in the Universe would have dissipated to equilibrium an infinite number of years ago. I suppose there are counter arguments to this, but I don't personally find them particularly strong. We have a rather large amount of empirical evidence that energy tends to dissipate (perhaps more empirical evidence for this than for any other claim in the Universe?).
Infinite Time + Infinite Space: This state is interesting because there's theoretically an infinite amount of energy (inside the infinite space) alongside an infinite amount of time for it to dissipate. Thus, while I don't have solid footing to say that this Universe does not exist, I think we can make a reasonable case for (capital R) Randomness in this Universe. Specifically, because the state of the Universe at any given time "t" is, itself, infinite, there are an infinite number of potential causes for an event. Thus, every event is Random because there are an infinite number of potential causes for any event. It may be asymptotically predictable given proximity to some causal events playing a more dominant role, but in the limit every event is Random.

Conclusion: Randomness with a capital R either exists or has existed before in the Universe because all 3 plausible configurations of the Universe necessitate events that have no cause, and an event with no cause cannot be predicted and is therefore Random.

Safe Crime Detection

Mon, 05 Jun 2017 12:00:00 +0000

TLDR: What if it was possible for surveillance to only invade the privacy of criminals or terrorists, leaving the innocent unsurveilled? This post proposes a way with a prototype in Python.

Edit: If you're interested in training Encrypted Neural Networks, check out the PySyft Library at OpenMined

Abstract: Modern criminals and terrorists hide amongst the patterns of innocent civilians, exactly mirroring daily life until the very last moments before becoming lethal, which can happen as quickly as a car turning onto a sidewalk or a man pulling out a knife on the street. As police intervention in an instantly lethal event is impossible, law enforcement has turned to prediction based on the surveillance of public and private data streams, facilitated by legislation like the Patriot Act, USA Freedom Act, and the UK's Counter-Terrorism Act. This legislation sparks a heated debate (and rightly so) on the tradeoff between privacy and security. In this blogpost, we explore whether much of this tradeoff between privacy and security is merely a technological limitation, surmountable by a combination of Homomorphic Encryption and Deep Learning. We supplement this discussion with a tutorial for a working prototype implementation and further analysis on where technological investments could be made to mature this technology. I am optimistic that it is possible to re-imagine the tools of crime prediction such that, relative to where we find ourselves today, citizen privacy is increased, surveillance is more effective, and the potential for mis-use is mitigated via modern techniques for Encrypted Artificial Intelligence.

Edit: The term "Prediction" seemed to trigger the assumption that I was proposing technology to predict "future" crimes. However, it was only intended to describe a system that can detect crime unfolding (including intent / pre-meditation) in accordance with agreed upon definitions of "intent" and "pre-meditation" in our criminal justice system, which do relate to future crimes. However, I am in no way advocating punishment for people who have committed no crime. So, I'm changing the title to "Detection" to better communicate this.

Edit 2: Some have critiqued this post by citing court cases when tools such as drug dogs or machine learning have been either inaccurate or biased based on unsuitable characteristics such as race. These constitute fixable defects in the predictive technology and have little to no bearing on this post. This post is, instead, about how homomorphic encryption can allow this technology to run out in the open (on private data), the nature of which does NOT constitute a search or seizure because it reveals no information about any citizen other than whether or not they are committing a crime (much like a drug dog giving your bag a sniff: it knows everything in your bag but doesn't tell anyone. It only informs the public whether or not you are likely committing a crime... triggering a search or seizure). Ways to make the underlying technology more accurate can be discussed elsewhere.

Edit 3: Others have critiqued this post by confusing it with tech for allocation of police resources, which uses high level statistical information to basically predict "this is a bad neighborhood". Note that tech such as this is categorically different than what I am proposing, as it makes judgements against large groups of people, many of whom have committed no crime. This technology is instead about detecting when crimes actually occur but would normally go un-discovered because no information about the crime's existence was made public (i.e., the "perfect crime").

I typically tweet out new blogposts when they're complete @iamtrask. If these ideas inspire you to help in some way, a share or upvote is the first place to start as a lack of awareness of these tools is the greatest obstacle at this stage. All in all, thank you for your time and attention, and I hope you enjoy the post.

Part 1: Ideal Citizen Surveillance

When you are collecting your bags at an international airport, often a drug sniffing dog will come up and give your bag a whiff. Amazingly, drug dogs are trained to detect concealed criminal activity with absolutely no invasion of your privacy. Before drug dogs, enforcing the same laws required opening up every bag and searching its contents looking for drugs, an expensive, time consuming, privacy invading procedure. However, with a drug dog, the only bags that need to be searched are those that actually contain drugs (according to the dog). Thus, the development of K9 narcotics units simultaneously increased the privacy, effectiveness, and efficiency of narcotics surveillance.

Source: http://www.vivapets.com/upload/Image/snifferdog.jpg

Similarly, in much of the developed world it is a commonly accepted practice to install an electronic fire alarm or burglar alarm in one's home. This first wave of the Internet of Things constitutes a voluntary, selective invasion of privacy. It is a piece of intelligence that is designed to only invade our privacy (and inform home security phone operators or law enforcement of the state of your doors and windows and give them permission to enter your home) if there is a great need for it (a threat to life or property such as a fire or burglar). Just like drug dogs, fire alarms and burglar alarms replace a much less efficient, far more invasive, expensive system: a guard or fire watchman standing at your house 24x7. Just like drug dogs, fire alarms and burglar alarms simultaneously increase privacy, effectiveness, and efficiency of home surveillance.

In these two scenarios, there is almost no detectable tradeoff between privacy and security. It is a non-issue as a result of technological solutions that filter through irrelevant, invasive information to only detect the rare bits of information indicative of a crime or a threat to life or property. This is the ideal state of surveillance. Surveillance is effective yet privacy is protected. This state is reachable as a result of several characteristics of these devices:

Privacy is only invaded if danger/criminal activity is highly probable.
The devices are accurate, with a low occurrence of False Positives.
Those with access to the device (homeowners and K9 handlers) aren't explicitly trying to fool it. Thus, its inner workings can be made public, allowing its protection of privacy to be fully known / auditable (no need for self-regulation of alarm manufacturers or dog handlers).

This combination of privacy protection, accuracy, and auditability is the key to the ideal state of surveillance, and it's quite intuitive. Why should every bag be opened and every airline passenger's privacy violated when less than 0.001% will actually contain drugs? Why should video feeds into people's homes be watched by security guards when 99.999% of the time there is no invasion or fire? More precisely, should it even be possible for a surveillance state to surveil the innocent, presenting the opportunity for corruption? Is it possible to create this limit to a surveillance state's powers while simultaneously increasing its effectiveness and reducing cost? What would such a technology look like? These are the questions explored below.
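To make the arithmetic behind these questions concrete, here is a small sketch comparing "search every bag" with "search only flagged bags". The 0.001% prevalence is the upper bound quoted above; the detector's hit rate and false alarm rate are hypothetical numbers chosen purely for illustration, not measurements of any real K9 unit.

```python
def bags_searched(n_bags, prevalence, hit_rate, false_alarm_rate):
    """Expected number of bags opened when only flagged bags are searched."""
    guilty = n_bags * prevalence
    innocent = n_bags - guilty
    return guilty * hit_rate + innocent * false_alarm_rate

N_BAGS = 1000000
PREVALENCE = 0.00001       # 0.001%, the figure used above
HIT_RATE = 0.95            # hypothetical detector sensitivity
FALSE_ALARM_RATE = 0.001   # hypothetical false positive rate

flagged = bags_searched(N_BAGS, PREVALENCE, HIT_RATE, FALSE_ALARM_RATE)
unsearched_share = 100 * (1 - flagged / N_BAGS)

print("bags opened when searching everyone:", N_BAGS)
print("bags opened when searching only flagged bags:", round(flagged))
print("share of passengers left unsearched: %.2f%%" % unsearched_share)
```

Under these assumed rates, roughly a thousand bags are opened instead of a million. The same function also shows how quickly that advantage erodes as the false alarm rate grows, which is why accuracy sits alongside privacy and auditability in the list above.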

Part 2: National Security Surveillance

At the time of writing, over 50 people have been killed by terror attacks in the last two weeks alone in Manchester, London, Egypt, and Afghanistan. My prayers go out to the victims and their families, and I deeply hope that we can find more effective ways to keep people safe. I am also reminded of the recent terror attack in Westminster, which claimed the lives of 4 people and injured over 50. The investigation into the attack in Westminster revealed that it was coordinated on WhatsApp. This has revived a heated debate on the tradeoff between privacy and safety. Governments want back-doors into apps like WhatsApp (which constitute unrestricted READ access to a live data stream), but many are concerned about trusting big brother to self-regulate the privacy of WhatsApp users. Furthermore, installing open backdoors makes these apps vulnerable to criminals discovering and exploiting them, further increasing risks to the public. This is reflective of the modern surveillance state.

Privacy is only violated as a means to an end of detecting extremely rare events (attacks)
There's a high degree of false positives in data collection (presumably 99.9999% of the data has nothing to do with an actual threat).
If the detection technology were released to the public, presumably bad actors would seek to evade it. Thus, it has to be deployed in secret behind a firewall (not auditable). This opens up the opportunity for mis-use (although in most nations I personally believe misuse is rare).

These three characteristics of national surveillance states are in stark contrast to the three ideal characteristics listed in Part 1. There is largely believed to be significant collateral damage to the privacy of innocent bystanders in data streams being surveilled. This is a result of the detection technology needing to remain secret. Unlike drug dogs, algorithms used to find bad actors (terrorists, criminals, etc.) can't be deployed (e.g., on WhatsApp) out in public for everyone to see and audit for privacy protection. If they were released to the public to be evaluated for privacy protection they would quickly be reverse engineered and rendered useless. Furthermore, even deploying them to people's phones (without auditing) to detect criminal activity would constitute a vulnerability. Bad actors would evade the algorithms by reverse engineering them and by viewing/intercepting/faking their predictions being sent back to the state. Instead, entire data streams must be re-directed to warehouses for stockpiling and analysis, as it is impossible to determine which subsets of the data stream are actually relevant using an automatic method out in public.

While terrorism is perhaps the most discussed domain in the tradeoff between privacy and safety, it is not the only one. Crimes such as murder take the lives of hundreds of thousands of individuals around the world. The US alone averages around 16,000 murders per year, which oddly can be abstracted to a logistical issue: Law enforcement does not know of a crime far enough in advance to intervene. On average, 16,000 Americans simply call 911 too late, if they manage to call at all.

The Chicken and Egg Problem of "Probable Cause": The challenge faced by the FBI and local law enforcement is incredibly similar to that of terrorism. The laws intended to protect citizens from the invasive nature of surveillance create a chicken and egg problem between observing "probable cause" for a crime (subsequently obtaining a warrant), and having access to the information indicative of "probable cause" in the first place. Thus, unless victims can somehow know (predict) they are about to be murdered far enough in advance to send out a public cry for help, law enforcement is often unable to prevent their death. Viewing crime prediction in this light is an interesting perspective, as it moves crime prediction from something a citizen must invoke for themselves to a public good that justifies public funding and support.

Bottom line: the cost of this chicken-and-egg "probable cause" issue is not only the invasion of citizen privacy, it is an extremely large number of human lives owing to the inability of people to predict when they will be harmed far enough in advance for law enforcement to intervene. Dialing 911 is often too little, too late. However, in rare cases like drug dogs or fire alarms, this is a non-issue as crime is detectable without significant collateral damage to privacy and thus "probable cause" is no longer a limiting factor to keeping the public safe.

Part 3: The Role of Artificial Intelligence

In a perfect world, there would be a "Fire Alarm" device for activities involving irreversible, heinous crimes such as assault, murder, or terrorism, that was private, accurate, and auditable. Fortunately, the R&D investment into devices for this kind of detection by commercial entities has been massive. To be clear, this investment hasn't been driven by the desire for consumer privacy. On the contrary, these devices were developed to achieve scale. Consider developing Gmail and wanting to offer a feature that filters out SPAM. You could invade people's privacy and read all their emails by hand, but it would be faster and cheaper to build a machine that can simply detect SPAM such that you can filter through hundreds of millions of emails per day with a few dozen machines. Given that law enforcement seeks to protect such a large population (presumably filtering through rather massive amounts of data looking for criminals/terrorists), it is reasonable to expect that there's a high degree of automation in this process. Bottom line, narrow AI based automation is probably involved. So, given this assumption, what we really lack is an ability to transform our AI agents such that:

they can be audited by a trusted party for privacy protection
they can't be reverse engineered when deployed
their predictions can't be known by those being surveilled
their predictions can't be falsified by the deploying party (such as a chat application)
they are reasonably efficient and scalable

In order to fully define this idea, we will be building a prototype "Fire Detector" for crime. In the next section, we're going to build a basic version of this detector using a 2 layer neural network. After that, we're going to upgrade this detector so that it meets the requirements listed above. For the sake of exposition, this detector is going to be trained on a SPAM dataset and thus will only detect SPAM, but it could conceivably be trained to detect any particular event you wanted (e.g., murder, arson, etc.). I only chose SPAM because it's relatively easy to train and gives me a simple, high quality network with which to demonstrate the methods of this blogpost. However, the spectrum of possible detectors is as broad as the field of AI itself.

Part 4: Building a SPAM Detector

So, our demo use case is that a local law enforcement officer (we'll call him "Bob") is hoping to crack down on people sending out SPAM emails. However, instead of reading everyone's emails, Bob only wants to detect when someone is sending SPAM so that he can file for an injunction and acquire a warrant with the police to further investigate. The first part of this process is to simply build an effective SPAM detector.

The Enron Spam Dataset: In order to teach an algorithm how to detect SPAM, we need a large set of emails that has been previously labeled as "SPAM" or "NOT SPAM". That way, our algorithm can study the dataset and learn to tell the difference between the two kinds of emails. Fortunately, a prominent energy company called Enron committed a few crimes recorded in email, and as a result a rather large subset of the company's emails were made public. As many of these were SPAM, a dataset was curated specifically for building SPAM detectors called the Enron Spam Dataset. I have further pre-processed this dataset for you in the following files: HAM and SPAM. Each line contains an email. There are 22,032 HAM emails, and 9,000 SPAM emails. We're going to set aside the last 1,000 in each category as our "test" dataset. We'll train on the rest.

The Model: For this model, we're going to optimize for speed and simplicity and use a simple bag-of-words Logistic Classifier. It's a neural network with 2 layers (input and output). We could get more sophisticated with an LSTM, but this blogpost isn't about filtering SPAM, it's about making surveillance less intrusive, more accountable, and more effective. Besides that, bag-of-words LR works really well for SPAM detection anyway (and for a surprisingly large number of other tasks as well). No need to overcomplicate. Below you'll find the code to build this classifier. If you're unsure how this works, feel free to review my previous post on A Neural Network in 11 Lines of Python.

(fwiw, this code works identically in Python 2 or 3 on my machine)


import numpy as np
from collections import Counter
import random
import sys

np.random.seed(12345)

# load one email per line; [:-2] drops the trailing two characters of each row
with open('spam.txt','r') as f:
    raw = f.readlines()

spam = list()
for row in raw:
    spam.append(row[:-2].split(" "))
    
with open('ham.txt','r') as f:
    raw = f.readlines()

ham = list()
for row in raw:
    ham.append(row[:-2].split(" "))
    
class LogisticRegression(object):
    
    def __init__(self, positives,negatives,iterations=10,alpha=0.1):
        
        # create vocabulary (real world use case would add a few million
        # other terms as well from a big internet scrape)
        cnts = Counter()
        for email in (positives+negatives):
            for word in email:
                cnts[word] += 1
        
        # convert to lookup table
        vocab = list(cnts.keys())
        self.word2index = {}
        for i,word in enumerate(vocab):
            self.word2index[word] = i
    
        # initialize decrypted weights
        self.weights = (np.random.rand(len(vocab)) - 0.5) * 0.1
        
        # train model on unencrypted information
        self.train(positives,negatives,iterations=iterations,alpha=alpha)
    
    def train(self,positives,negatives,iterations=10,alpha=0.1):
        
        for iter in range(iterations):
            error = 0
            n = 0
            for i in range(max(len(positives),len(negatives))):

                error += np.abs(self.learn(positives[i % len(positives)],1,alpha))
                error += np.abs(self.learn(negatives[i % len(negatives)],0,alpha))
                n += 2

            print("Iter:" + str(iter) + " Loss:" + str(error / float(n)))

    
    def softmax(self,x):
        # despite the name, this is the logistic sigmoid (single-output case)
        return 1/(1+np.exp(-x))

    def predict(self,email):
        pred = 0
        for word in email:
            # skip words that were not seen when the vocabulary was built
            if word in self.word2index:
                pred += self.weights[self.word2index[word]]
        pred = self.softmax(pred)
        return pred

    def learn(self,email,target,alpha):
        pred = self.predict(email)
        delta = (pred - target)# * pred * (1 - pred)
        for word in email:
            self.weights[self.word2index[word]] -= delta * alpha
        return delta
    
model = LogisticRegression(spam[0:-1000],ham[0:-1000],iterations=3)

# evaluate on holdout set

fp = 0
tn = 0
tp = 0
fn = 0

for i,h in enumerate(ham[-1000:]):
    pred = model.predict(h)

    if(pred < 0.5):
        tn += 1
    else:
        fp += 1
        
    if(i % 10 == 0):
        sys.stdout.write('\rI:'+str(tn+tp+fn+fp) + " % Correct:" + str(100*tn/float(tn+fp))[0:6])

for i,h in enumerate(spam[-1000:]):
    pred = model.predict(h)

    if(pred >= 0.5):
        tp += 1
    else:
        fn += 1

    if(i % 10 == 0):
        sys.stdout.write('\rI:'+str(tn+tp+fn+fp) + " % Correct:" + str(100*(tn+tp)/float(tn+tp+fn+fp))[0:6])
sys.stdout.write('\rI:'+str(tn+tp+fn+fp) + " Correct: %" + str(100*(tn+tp)/float(tn+tp+fn+fp))[0:6])

print("\nTest Accuracy: %" + str(100*(tn+tp)/float(tn+tp+fn+fp))[0:6])
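# note: fp/(tp+fp) is the share of flagged emails that were actually HAM,
# and fn/(tn+fn) is the share of unflagged emails that were actually SPAM
print("False Positives: %" + str(100*fp/float(tp+fp))[0:4] + "    <- privacy violation level out of 100.0%")
print("False Negatives: %" + str(100*fn/float(tn+fn))[0:4] + "   <- security risk level out of 100.0%")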

Iter:0 Loss:0.0455724486216
Iter:1 Loss:0.0173317643148
Iter:2 Loss:0.0113520767678
I:2000 Correct: %99.798
Test Accuracy: %99.7
False Positives: %0.3    <- privacy violation level out of 100.0%
False Negatives: %0.3   <- security risk level out of 100.0%
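One way to read the update inside `learn` (an interpretation, not something the code itself states) is as stochastic gradient descent on the cross-entropy loss of a logistic model. Here $x_1, \dots, x_k$ are the words of one email, $w$ is the weight vector, $y$ is the label (1 for SPAM, 0 for HAM), and $\alpha$ is the learning rate:

```latex
z = \sum_{j=1}^{k} w_{x_j}, \qquad p = \sigma(z) = \frac{1}{1 + e^{-z}}

\mathcal{L}(y, p) = -\bigl[\, y \log p + (1 - y) \log (1 - p) \,\bigr]

\frac{\partial \mathcal{L}}{\partial z} = p - y
\quad\Longrightarrow\quad
w_{x_j} \leftarrow w_{x_j} - \alpha \, (p - y)
\quad \text{for each word } x_j \text{ in the email}
```

Under this reading, `delta = (pred - target)` is exactly $p - y$. A squared-error loss would instead contribute the extra factor $p(1 - p)$, which matches the term that is commented out in the code.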

Feature: Auditability: a nice feature of our classifier is that it is a highly auditable algorithm. Not only does it give us accurate scores on the testing data, but we can open it up and look at how it weights various terms to make sure it's flagging emails based on what officer Bob is specifically looking for. It is with these insights that officer Bob seeks permission from his superior to perform his very limited surveillance over the email clients in his jurisdiction. Note, Bob has no access to read anyone's emails. He only has access to detect exactly what he's looking for. The purpose of this model is to be a measure of "probable cause", which Bob's superior can make the final call on given the privacy and security levels indicated above for this model.

Ok, so we have our classifier and Bob gets it approved by his boss (the chief of police?). Presumably, law enforcement officer "Bob" would hand this over to all the email clients within his jurisdiction. Each email client would then use the classifier to make a prediction each time it's about to send an email (commit a crime). This prediction gets sent to Bob, and eventually he figures out who has been anonymously sending out 10,000 SPAM emails every day within his jurisdiction.

Problem 1: His Predictions Get Faked - after 1 week of running his algorithm in everyone's email clients, everyone is still receiving tons of SPAM. However, Bob's Logistic Regression Classifier apparently isn't flagging ANY of it, even though it seems to work when he tests some of the missed SPAM on the classifier with his own machine. He suspects that someone is intercepting the algorithm's predictions and faking them to look like they're all "Negative". What's he to do?

Problem 2: His Model is Reverse Engineered - Furthermore, he notices that he can take his pre-trained model and sort it by its weight values, yielding the following result.
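A minimal sketch of that sorting step is shown here. The five words and their weights are hypothetical stand-ins, not values from the trained model; with the classifier built earlier you would pass `model.word2index` and `model.weights` instead.

```python
# Hypothetical stand-in for a trained model: a word -> index lookup and a weight list.
word2index = {"free": 0, "meeting": 1, "viagra": 2, "attached": 3, "winner": 4}
weights = [1.8, -1.2, 2.5, -0.9, 1.1]

def words_by_weight(word2index, weights):
    """Return (word, weight) pairs sorted from most SPAM-like to most HAM-like."""
    pairs = [(word, float(weights[i])) for word, i in word2index.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

for word, weight in words_by_weight(word2index, weights):
    print("%-10s %+.2f" % (word, weight))
```

Anyone holding the plaintext weights can run the same few lines, which is what makes the model both easy to audit and easy to evade.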

While this was advantageous for auditability (making the case to Bob's boss that this model is going to find only the information it's supposed to), it makes it vulnerable to attacks! So not only can people intercept and modify his model's predictions, but they can even reverse engineer the system to figure out which words to avoid. In other words, the model's capabilities and predictions are vulnerable to attack. Bob needs another line of defense.

Part 5: Homomorphic Encryption

In my previous blogpost Building Safe A.I., I outlined how one can train Neural Networks in an encrypted state (on data that is not encrypted) using Homomorphic Encryption. Along the way, I discussed how Homomorphic Encryption generally works and provided an implementation of Efficient Integer Vector Homomorphic Encryption with tooling for neural networks based on this implementation. However, as mentioned in the post, there are many homomorphic encryption schemes to choose from. In this post, we're going to use a different one called Paillier Cryptography, which is a probabilistic, asymmetric algorithm for public key cryptography. While a complete breakdown of this cryptosystem is something best saved for a different blogpost, I did fork and update a python library for paillier to be able to handle larger ciphertexts and plaintexts (longs) as well as a small bugfix in the logging here Paillier Cryptosystem Library. Pull that repo down, run "python setup.py install" and try out the following code.

As you can see, we can encrypt (positive or negative) numbers using a public key and then add their encrypted values together. We can then decrypt the resulting number which returns the output of whatever math operations we performed. Pretty cool, eh? We can use just these operations to encrypt our Logistic Regression classifier after training. For more on how this works, check out my previous post on the subject, otherwise let's jump straight into the implementation.
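To make the additive property concrete without depending on any library, here is a toy, pure-Python sketch of the Paillier scheme. It is not the API of the library mentioned above: the function names are hypothetical, the two primes are hard-coded and far too small to be secure, and it assumes Python 3.8+ for the modular inverse via `pow`.

```python
import math
import random

def generate_keypair(p=65537, q=2147483647):
    """Toy Paillier keys from two known primes (far too small to be secure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse of lam mod n (Python 3.8+)
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt integer m (may be negative) under public key n, using g = n + 1."""
    n_sq = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    # (n + 1)^m mod n^2 simplifies to 1 + m*n; r^n adds the randomness
    return ((1 + (m % n) * n) * pow(r, n, n_sq)) % n_sq

def decrypt(n, private_key, c):
    """Recover the plaintext integer from ciphertext c."""
    lam, mu = private_key
    x = pow(c, lam, n * n)
    m = ((x - 1) // n * mu) % n
    return m - n if m > n // 2 else m  # map the upper half back to negatives

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)

n, private_key = generate_keypair()
a = encrypt(n, 5)
b = encrypt(n, -3)
total = add_encrypted(n, a, b)
print(decrypt(n, private_key, total))  # 2
```

Whoever computes `total` never sees 5, -3, or 2; only the holder of the private key can read the result. Summing encrypted weights is all that the prediction step of a bag-of-words classifier requires.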


import phe as paillier
import math
import numpy as np
from collections import Counter
import random
import sys

np.random.seed(12345)

print("Generating paillier keypair")
# NOTE: a 64-bit key keeps this demo fast but is far too small to be secure
pubkey, prikey = paillier.generate_paillier_keypair(n_length=64)

print("Importing dataset from disk...")
f = open('spam.txt','r')
raw = f.readlines()
f.close()

spam = list()
for row in raw:
    spam.append(row[:-2].split(" "))
    
f = open('ham.txt','r')
raw = f.readlines()
f.close()

ham = list()
for row in raw:
    ham.append(row[:-2].split(" "))
    
class HomomorphicLogisticRegression(object):
    
    def __init__(self, positives,negatives,iterations=10,alpha=0.1):
        
        self.encrypted=False
        self.maxweight=10
        
        # create vocabulary (real world use case would add a few million
        # other terms as well from a big internet scrape)
        cnts = Counter()
        for email in (positives+negatives):
            for word in email:
                cnts[word] += 1
        
        # convert to lookup table
        vocab = list(cnts.keys())
        self.word2index = {}
        for i,word in enumerate(vocab):
            self.word2index[word] = i
    
        # initialize decrypted weights
        self.weights = (np.random.rand(len(vocab)) - 0.5) * 0.1
        
        # train model on unencrypted information
        self.train(positives,negatives,iterations=iterations,alpha=alpha)
        

    
    def train(self,positives,negatives,iterations=10,alpha=0.1):
        
        for iter in range(iterations):
            error = 0
            n = 0
            for i in range(max(len(positives),len(negatives))):

                error += np.abs(self.learn(positives[i % len(positives)],1,alpha))
                error += np.abs(self.learn(negatives[i % len(negatives)],0,alpha))
                n += 2

            print("Iter:" + str(iter) + " Loss:" + str(error / float(n)))

    
    # NOTE: despite its name, this is the logistic sigmoid (not a softmax)
    def softmax(self,x):
        return 1/(1+np.exp(-x))

    def encrypt(self,pubkey,scaling_factor=1000):
        if(not self.encrypted):
            self.pubkey = pubkey
            self.scaling_factor = float(scaling_factor)
            self.encrypted_weights = list()

            # encrypt this model's own weights (not the global "model" object)
            for weight in self.weights:
                self.encrypted_weights.append(self.pubkey.encrypt(
                    int(min(weight,self.maxweight) * self.scaling_factor)))

            self.encrypted = True            
            self.weights = None

            
        return self

    def predict(self,email):
        if(self.encrypted):
            return self.encrypted_predict(email)
        else:
            return self.unencrypted_predict(email)
    
    def encrypted_predict(self,email):
        pred = self.pubkey.encrypt(0)
        for word in email:
            pred += self.encrypted_weights[self.word2index[word]]
        return pred
    
    def unencrypted_predict(self,email):
        pred = 0
        for word in email:
            pred += self.weights[self.word2index[word]]
        pred = self.softmax(pred)
        return pred

    def learn(self,email,target,alpha):
        pred = self.predict(email)
        delta = (pred - target)# * pred * (1 - pred)
        for word in email:
            self.weights[self.word2index[word]] -= delta * alpha
        return delta
    
model = HomomorphicLogisticRegression(spam[0:-1000],ham[0:-1000],iterations=10)

encrypted_model = model.encrypt(pubkey)

# generate encrypted predictions. Then decrypt them and evaluate.

fp = 0
tn = 0
tp = 0
fn = 0

for i,h in enumerate(ham[-1000:]):
    encrypted_pred = encrypted_model.predict(h)
    try:
        pred = prikey.decrypt(encrypted_pred) / encrypted_model.scaling_factor
        if(pred < 0):
            tn += 1
        else:
            fp += 1
    except:
        print("overflow")

    if(i % 10 == 0):
        sys.stdout.write('\r I:'+str(tn+tp+fn+fp) + " % Correct:" + str(100*tn/float(tn+fp))[0:6])

for i,h in enumerate(spam[-1000:]):
    encrypted_pred = encrypted_model.predict(h)
    try:
        pred = prikey.decrypt(encrypted_pred) / encrypted_model.scaling_factor
        if(pred > 0):
            tp += 1
        else:
            fn += 1
    except:
        print("overflow")

    if(i % 10 == 0):
        sys.stdout.write('\r I:'+str(tn+tp+fn+fp) + " % Correct:" + str(100*(tn+tp)/float(tn+tp+fn+fp))[0:6])
sys.stdout.write('\r I:'+str(tn+tp+fn+fp) + " % Correct:" + str(100*(tn+tp)/float(tn+tp+fn+fp))[0:6])

print("\n Encrypted Accuracy: %" + str(100*(tn+tp)/float(tn+tp+fn+fp))[0:6])
print("False Positives: %" + str(100*fp/float(tp+fp))[0:4] + "    <- privacy violation level")
print("False Negatives: %" + str(100*fn/float(tn+fn))[0:4] + "   <- security risk level")


Generating paillier keypair
Importing dataset from disk...
Iter:0 Loss:0.0455724486216
Iter:1 Loss:0.0173317643148
Iter:2 Loss:0.0113520767678
Iter:3 Loss:0.00455875940625
Iter:4 Loss:0.00178564065045
Iter:5 Loss:0.000854385076612
Iter:6 Loss:0.000417669805378
Iter:7 Loss:0.000298985174998
Iter:8 Loss:0.000244521525096
Iter:9 Loss:0.000211014087681
 I:2000 % Correct:99.296
 Encrypted Accuracy: %99.2
False Positives: %0.0    <- privacy violation level
False Negatives: %1.57   <- security risk level

This model is really quite special (and fast!... around 1000 emails per second with a single thread on my laptop). Note that we don't use the sigmoid during prediction (only during training) as it's followed by a threshold at 0.5. Thus, at testing we can simply skip the sigmoid and threshold at 0 (which is identical to running the sigmoid and thresholding at 0.5). However, enough with the technicals, let's get back to Bob.
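For anyone who wants to convince themselves of that thresholding claim first, here's a small sketch. It assumes nothing beyond the standard library; the example weights and the two helper function names are made up for illustration. It mirrors the integer scaling that encrypt() applies (multiply by scaling_factor, truncate to int) and compares the sign of the scaled sum with the usual sigmoid-then-0.5 decision.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def classify_with_sigmoid(weights):
    # what the unencrypted model does: sum, sigmoid, threshold at 0.5
    return sigmoid(sum(weights)) > 0.5

def classify_from_scaled_sum(weights, scaling_factor=1000):
    # what the encrypted path does: integer-scaled weights, sum, threshold at 0
    scaled = [int(w * scaling_factor) for w in weights]
    return sum(scaled) / float(scaling_factor) > 0

for ws in ([0.8, -0.3, 0.2], [-1.5, 0.4], [2.0, -2.5, 0.1]):
    print(ws, classify_with_sigmoid(ws), classify_from_scaled_sum(ws))
```

The two decisions agree because the sigmoid is monotonic and equals 0.5 exactly at 0. The one caveat is that truncating to integers can nudge sums that sit extremely close to zero, so a larger scaling_factor buys more precision at the cost of larger plaintexts.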

Bob had a problem with people being able to see his predictions and fake them. However, now all the predictions are encrypted.

Furthermore, Bob had a problem with people reading his weights and reverse engineering what his algorithm had learned to detect. However, now all the weights themselves are also encrypted (and can predict in their encrypted state!).

Now when he deploys his model, no one can read what it is sending to spoof it (or even know what it is supposedly detecting) or reverse engineer it to further avoid its detection. This model has many of the desirable properties that we wanted. It's auditable by a third party, makes encrypted predictions, and its intelligence is also encrypted from those who might want to steal/fool it. Furthermore, it is quite accurate (with no false positives on the testing dataset), and also quite fast. Bob deploys his new model, receives encrypted predictions, and discovers that one particular person seems to be preparing to send out (what the model thinks is) 10,000 suspiciously SPAMMY emails. He reports the metric to his boss and a judge, obtains a warrant, and rids the world of SPAM forever!!!!

Part 6: Building Safe Crime Prediction

Let's take a second and consider the high level difference that this model can make for law enforcement. Present day, in order to detect events such as a murder or terrorist attack, law enforcement needs unrestricted access to data streams which might be predictive of the event. Thus, in order to detect an event that may occur in 0.0001% of the data, they have to have access to 100% of the data stream by re-directing it to a secret warehouse wherein (I assume) Machine Learning models are likely deployed.

However, with this approach the same Machine Learning models currently used to identify crimes can instead be encrypted and used as detectors which are deployed to the data stream itself (i.e., chat applications). Law Enforcement then only has access to the predictions of the model as opposed to having access to the entire dataset. This is similar to the use of drug dogs in an airport. Drug dogs eliminate the need for law enforcement to search everyone's bags looking for cocaine. Instead, a dog is TRAINED (just like a Machine Learning model) to exclusively detect the existence of narcotics. Barking == drugs. No barking == no drugs. POSITIVE neural network prediction means "a terrorist plot is being planned on this phone", NEGATIVE neural network prediction means "a terrorist plot is NOT being planned on this phone". Law enforcement has no need to see the data. They only need this one datapoint. Furthermore, as the model is a discrete piece of intelligence, it can be independently evaluated to ensure that it only detects what it's supposed to (just like we can independently audit what a drug dog is trained to detect by evaluating the dog's accuracy through tests). However, unlike drug dogs, Encrypted Artificial Intelligence could provide this ability for any crime that is detectable through digital evidence.

Auditing Concerns: So, who do we trust to perform the auditing? I'm not a political science expert, so I'll leave this for others to comment. However, I think that third party watchdogs, a mixture of government contractors, or perhaps even open source developers could perform this role. If there are enough versions of every detector, it would likely be very difficult for bad actors to figure out which one was being deployed against them (since they're encrypted). I see several plausible options here and, largely, auditing bodies over government organizations seems like the kind of problem that many people have thought about before me, so I'll leave this part for the experts. ;)

Ethical Concerns: Notably, literary work provides commentary around the ethical and moral implications of crime predictions leading to a conviction directly (such as in the 1956 short story "Minority Report", the origin of the term "precrime"). However, the primary value of crime prediction is not efficient punishment and imprisonment, it's the prevention of harm. Accordingly, there are two trivial ways to avoid this moral dilemma. First, the vast majority of crimes require smaller crimes in advance of a larger one (i.e., conspiracy to commit), and simply predicting the larger crime by being able to more accurately detect the smaller crimes of preparation avoids much of the moral dilemma. Secondly, pre-crime technology can simply be used as a method for how to best allocate police resources as well as a method for triggering a warrant/investigation (such as Bob did in our example). This latter use in particular is perhaps the best use of crime prediction technology if the privacy security tradeoff can be mitigated (the topic of this blogpost). A positive prediction should launch an investigation, not put someone behind bars directly.

Legal Concerns: United States v. Place ruled that because drug dogs are able to exclusively detect the odor of narcotics (without detecting anything else), they are not considered a "search". In other words, because they are able to classify only the crime without requiring a citizen to divulge any other information, it is not considered an invasion of privacy. Furthermore, I believe that the general feeling of the public around this issue reflects the law. A fluffy dog coming up and giving your rolling bag a quick sniff at the airport is a very effective yet privacy preserving form of surveillance. Curiously, the dog undoubtedly could be trained to detect the presence of any number of embarrassing things in your bag. However, it is only TRAINED to detect those indicative of a crime. In the same way, Artificial Intelligence agents can be TRAINED to detect signs indicative of a crime without detecting anything else. As such, for models that achieve a sufficiently high accuracy rate, these models could obtain a similar legal status as drug dogs.

Authoritarian Corruption Concerns: Perhaps you're wondering, "Why innovate in this space? Why propose new methods for surveillance? Aren't we being surveilled enough?". My answer is this: It should be impossible for those who do NOT harm one another (the innocent) to be surveilled by corporations or governments. Inversely, we want to detect anyone about to injure another human far enough in advance to stop them. Before recent technological advancements, these two statements were clearly impossible to achieve together. The purpose of this blogpost is to make one point: I believe it is technologically plausible to have both perfect safety and perfect privacy. By "perfect privacy", I mean privacy that is NOT subject to the whims of an authoritarian government, but is instead limited by auditable technology like Encrypted Artificial Intelligence. Who should be responsible for auditing this technology without revealing its techniques to bad actors? I'm not sure. Perhaps it's 3rd party watchdog organizations. Perhaps it's instead a system people opt-in to (like Fire Alarms) and there's a social contract established such that people can avoid those who do not opt in (because... why would you?). Perhaps it's developed entirely in the open source but is simply so effective that it can't be circumvented? These are some good questions to explore in subsequent discussions. This blogpost is not the whole solution. Social and government structures would undoubtedly need to adjust to the advent of this kind of tool. However, I do believe it is a significant piece of the puzzle, and I look forward to the conversations it can inspire.

Part 7: Future Work

First and foremost, we need modern float vector Homomorphic Encryption algorithms (FV, YASHE, etc.) supported in a major Deep Learning framework (PyTorch, Tensorflow, Keras, etc.). Furthermore, exploring how we can increase the speed and security of these algorithms is an actively innovated and vitally important line of work. Finally, we need to imagine how social structures could best partner with these new tools to protect people's safety without violating privacy (and continue to reduce the risk of authoritarian governments misusing the technology).

Part 8: Let's Talk

At the end of May, I had the pleasure of meeting with the brilliant Carrick Flynn and a number of wonderful folks from the Future of Humanity Institute, one of the world's leading labs on A.I. Safety. We were speaking as they have recently become curious about Homomorphic Encryption, especially in the context of Deep Learning and A.I. One of the use cases we explored over Indian cuisine was that of Crime Prediction, and this blogpost is an early derivative of our conversation (hopefully one of many). I hope you find these ideas as exciting as I do, and I encourage you to reach out to Carrick, myself, and FHI if you are further curious about this topic or the Existential Risks of A.I. more broadly. To that end, if you find yourself excited about the idea that these tools might reduce government surveillance down to only criminals and terrorists, don't just click on! Help spread the word! (upvote, share, etc.) Also, I'm particularly interested in hearing from you if you work in one of several industries (or can make introductions):

Law Enforcement: I'm very curious as to the extent to which these kinds of tools could become usable, particularly for local law enforcement. What are the major barriers to entry to tools such as these becoming useful for the triggering of a warrant? What kinds of certifications are needed? Do you know of similar precedent/cases? (i.e., drug dogs)
DARPA / Intelligence Community / Gov. Contractor: Similar question as for local law enforcement, but with the context being the federal space.
Legislation / Regulation: I envision a future where these tools become mature enough for legislation to be crafted such that they account for improved privacy/security tradeoffs (reduced privacy invasion but expedited warrant triggering procedure). Are there members of the legislative body who are actually interested in backing this type of development?

I typically tweet out new blogposts when they're complete @iamtrask. As mentioned above, if these ideas inspire you to help in some way, a share or upvote is the first place to start as a lack of awareness of these tools is the greatest obstacle at this stage. All in all, thank you for your time and attention, and I hope you enjoyed the post!

Relevant Links

http://blog.fastforwardlabs.com/2017/03/09/fairml-auditing-black-box-predictive-models.html
https://eprint.iacr.org/2013/075.pdf
https://iamtrask.github.io/2017/03/17/safe-ai/

Deep Learning without Backpropagation

Tue, 21 Mar 2017 12:00:00 +0000

TLDR: In this blogpost, we're going to prototype (from scratch) and learn the intuitions behind DeepMind's recently proposed Decoupled Neural Interfaces Using Synthetic Gradients paper.

I typically tweet out new blogposts when they're complete at @iamtrask. Feel free to follow if you'd be interested in reading more in the future and thanks for all the feedback!

Part 1: Synthetic Gradients Overview

Normally, a neural network compares its predictions to a dataset to decide how to update its weights. It then uses backpropagation to figure out how each weight should move in order to make the prediction more accurate. However, with Synthetic Gradients, individual layers instead make a "best guess" for what they think the data will say, and then update their weights according to this guess. This "best guess" is called a Synthetic Gradient. The data is only used to help update each layer's "guesser" or Synthetic Gradient generator. This allows individual layers to learn in isolation (most of the time), which increases the speed of training.

Edit: This paper also adds great intuitions on how/why Synthetic Gradients are so effective

Source: Decoupled Neural Interfaces Using Synthetic Gradients

The graphic above (from the paper) gives a very intuitive picture for what’s going on (from left to right). The squares with rounded off corners are layers and the diamond shaped objects are (what I call) the Synthetic Gradient generators. Let’s start with how a regular neural network layer is updated.

Part 2: Using Synthetic Gradients

Let's start by ignoring how the Synthetic Gradients are created and instead just look at how they are used. The far left box shows how these can work to update the first layer in a neural network. The first layer forward propagates into the Synthetic Gradient generator (M i+1), which then returns a gradient. This gradient is used instead of the real gradient (which would take a full forward propagation and backpropagation to compute). The weights are then updated as normal, pretending that this Synthetic Gradient is the real gradient. If you need a refresher on how weights are updated with gradients, check out A Neural Network in 11 Lines of Python and perhaps the followup post on Gradient Descent.

So, in short, Synthetic Gradients are used just like normal gradients, and for some magical reason they seem to be accurate (without consulting the data)! Seems like magic? Let’s see how they’re made.

Part 3: Generating Synthetic Gradients

Ok, this part is really clever, and frankly it's amazing that it works. How do you generate Synthetic Gradients for a neural network? Well, you use another network of course! Synthetic Gradient generators are nothing other than a neural network that is trained to take the output of a layer and predict the gradient that will likely happen at that layer.

A Sidenote: Related Work by Geoffrey Hinton

This actually reminds me of some work that Geoffrey Hinton did a couple years ago in which he showed that random feedback weights support learning in deep neural networks. Basically, you can backpropagate through randomly generated matrices and still accomplish learning. Furthermore, he showed that it had a kind of regularization effect. It was some interesting work for sure.

Ok, back to Synthetic Gradients. So, now we know that Synthetic Gradients are generated by another neural network that learns to predict the gradient at a step given the output at that step. The paper also says that any other relevant information could be used as input to the Synthetic Gradient generator network, but in the paper it seems like just the output of the layer is used for normal feedforward networks. Furthermore, the paper even states that a single linear layer can be used as the Synthetic Gradient generator. Amazing. We're going to try that out.

How do we learn the network that generates Synthetic Gradients?

This raises the question, how do we learn the neural networks that generate our Synthetic Gradients? Well, as it turns out, when we perform full forward and backpropagation, we actually get the "correct" gradient. We can compare this to our "synthetic" gradient in the same way we normally compare the output of a neural network to the dataset. Thus, we can train our Synthetic Gradient networks by pretending that our "true gradients" are coming from a mythical dataset... so we train them like normal. Neat!
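To make the "train them like normal" idea concrete, here's a minimal sketch of a single linear Synthetic Gradient generator learning to reproduce a target gradient by plain gradient descent on squared error. Everything in it is a stand-in I made up for illustration: layer_output plays the role of a layer's activations, and hidden_map is a hypothetical matrix used only to manufacture a "true" gradient to regress onto.

```python
import numpy as np

rng = np.random.RandomState(0)

layer_output = rng.rand(32, 4)       # stand-in for a layer's activations
hidden_map = rng.randn(4, 4)         # hypothetical: defines the "true" gradient
true_gradient = layer_output.dot(hidden_map)

generator = np.zeros((4, 4))         # a single linear layer (just a matrix)
alpha = 0.01

errors = []
for step in range(200):
    synthetic_gradient = layer_output.dot(generator)
    delta = synthetic_gradient - true_gradient
    errors.append(float(np.mean(delta ** 2)))
    # ordinary gradient descent: the "dataset" here is the true gradient
    generator -= layer_output.T.dot(delta) * alpha

print("first error:", errors[0], "last error:", errors[-1])
```

The only unusual thing is what plays the role of the labels. Instead of a dataset, the target is a gradient that some later part of the network computed.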

Wait... if our Synthetic Gradient Network requires backprop... what's the point?

Excellent question! The whole point of this technique was to allow individual neural networks to train without waiting on each other to finish forward and backpropagating. If our Synthetic Gradient networks require waiting for a full forward/backprop step, then we're back where we started but with more computation going on (even worse!). For the answer, let's revisit this visualization from the paper.

Source: Decoupled Neural Interfaces Using Synthetic Gradients

Focus on the second section from the left. See how the gradient (M i+2) backpropagates through (f i+1) and into M(i+1)? As you can see, each synthetic gradient generator is actually only trained using the Synthetic Gradients generated from the next layer. Thus, only the last layer actually trains on the data. All the other layers, including the Synthetic Gradient generator networks, train based on Synthetic Gradients. Thus, the network can train with each layer only having to wait on the synthetic gradient from the following layer (which has no other dependencies). Very cool!

Part 4: A Baseline Neural Network

Time to start coding! To get things started (so we have an easier frame of reference), I'm going to start with a vanilla neural network trained with backpropagation, styled in the same way as A Neural Network in 11 Lines of Python. (So, if it doesn't make sense, just go read that post and come back). However, I'm going to add an additional layer, but that shouldn't be a problem for comprehension. I just figured that since we're all about reducing dependencies, having more layers might make for a better illustration.

As far as the dataset we're training on, I'm going to generate a synthetic dataset (har! har!) using binary addition. So, the network will take two random binary numbers and predict their sum (also a binary number). The nice thing is that this gives us the flexibility to increase the dimensionality (~difficulty) of the task as needed. Here's the code for generating the dataset.
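A minimal sketch of such a generator might look like the following. The function name generate_dataset, its parameters, and the bit ordering (most significant bit first) are my own choices for illustration. It also assumes each operand is kept below half the representable range so that the sum always fits in output_dim bits.

```python
import numpy as np

def generate_dataset(output_dim=8, num_examples=1000, seed=1):
    """Random pairs of binary numbers and their binary sums.

    Returns x with shape (num_examples, 2 * output_dim) and
    y with shape (num_examples, output_dim), most significant bit first.
    """
    rng = np.random.RandomState(seed)
    # keep operands small enough that a + b never overflows output_dim bits
    max_operand = 2 ** (output_dim - 1)

    def to_bits(value):
        return [int(bit) for bit in format(value, "0{}b".format(output_dim))]

    x, y = [], []
    for _ in range(num_examples):
        a = int(rng.randint(0, max_operand))
        b = int(rng.randint(0, max_operand))
        x.append(to_bits(a) + to_bits(b))   # the two inputs, concatenated
        y.append(to_bits(a + b))            # their sum
    return np.array(x), np.array(y)

x, y = generate_dataset(output_dim=8, num_examples=1000)
print(x.shape, y.shape)  # (1000, 16) (1000, 8)
```

Raising output_dim is the knob for making the task harder, since both the input and output widths grow with it.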

And here's the code for a vanilla neural network training on that dataset.

Now, at this point I really feel it's necessary to do something that I almost never do in the context of learning, add a bit of object oriented structure. Normally, this obfuscates the network a little bit and makes it harder to see (from a high level) what's going on (relative to just reading a python script). However, since this post is about "Decoupled Neural Interfaces" and the benefits that they offer, it's really pretty hard to explain things without actually having those interfaces be reasonably decoupled. So, to make learning a little bit easier, I'm first going to convert the network above into exactly the same network but with a "Layer" class object that we'll soon convert into a DNI. Let's take a look at this Layer object.


class Layer(object):
    
    def __init__(self,input_dim, output_dim,nonlin,nonlin_deriv):
        
        self.weights = (np.random.randn(input_dim, output_dim) * 0.2) - 0.1
        self.nonlin = nonlin
        self.nonlin_deriv = nonlin_deriv
    
    def forward(self,input):
        self.input = input
        self.output = self.nonlin(self.input.dot(self.weights))
        return self.output
    
    def backward(self,output_delta):
        self.weight_output_delta = output_delta * self.nonlin_deriv(self.output)
        return self.weight_output_delta.dot(self.weights.T)
    
    def update(self,alpha=0.1):
        self.weights -= self.input.T.dot(self.weight_output_delta) * alpha

In this Layer class, we have several class variables. weights is the matrix we use for a linear transformation from input to output (just like a normal linear layer). Optionally, we can also include an output nonlin function which will put a non-linearity on the output of our network. If we don't want a non-linearity, we can simply set this value to lambda x:x. In our case, we're going to pass in the "sigmoid" function.

The second function we pass in is nonlin_deriv which is a special derivative function. This function needs to take the output from our nonlinearity and convert it to the derivative. For sigmoid, this is simply (out * (1 - out)) where "out" is the output of the sigmoid. This particular function exists for pretty much all of the common neural network nonlinearities.
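For reference, here's a minimal sketch of the two functions. The names sigmoid and sigmoid_out2deriv match the ones passed to the layers in the training code, but the bodies shown here are my assumption of the obvious definitions. The check at the bottom compares the output-based derivative against a numerical derivative.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_out2deriv(out):
    # derivative of the sigmoid, written in terms of the sigmoid's own output
    return out * (1 - out)

x = np.array([-2.0, 0.0, 3.0])
eps = 1e-6

# central-difference estimate of d(sigmoid)/dx
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid_out2deriv(sigmoid(x))

print(np.allclose(numeric, analytic, atol=1e-6))  # True
```

Working from the output rather than the input is convenient because the layer has already cached its output during the forward pass.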

Now, let's take a look at the various methods in this class. forward does what its name implies. It forward propagates through the layer, first through a linear transformation, and then through the nonlin function. backward accepts an output_delta parameter, which represents the real gradient (as opposed to a synthetic one) coming back from the next layer during backpropagation. We then use this to compute self.weight_output_delta, which is the derivative at the output of our weights (just inside the nonlinearity). Finally, it backpropagates the error to send to the previous layer and returns it.
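If you'd like to check that backward really returns the gradient with respect to the layer's input, a quick numerical gradient check does the job. This sketch uses a trimmed copy of the Layer class (forward and backward only) and assumes a squared error loss, for which the output delta is simply (output - target). The sigmoid definitions, shapes, and seed are arbitrary choices of mine.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_out2deriv(out):
    return out * (1 - out)

class Layer(object):
    # trimmed copy of the class above (forward and backward only)
    def __init__(self, input_dim, output_dim, nonlin, nonlin_deriv):
        self.weights = (np.random.randn(input_dim, output_dim) * 0.2) - 0.1
        self.nonlin = nonlin
        self.nonlin_deriv = nonlin_deriv

    def forward(self, input):
        self.input = input
        self.output = self.nonlin(self.input.dot(self.weights))
        return self.output

    def backward(self, output_delta):
        self.weight_output_delta = output_delta * self.nonlin_deriv(self.output)
        return self.weight_output_delta.dot(self.weights.T)

np.random.seed(0)
layer = Layer(3, 2, sigmoid, sigmoid_out2deriv)
x = np.random.rand(4, 3)
y = np.random.rand(4, 2)

def loss(inp):
    # assumed loss: half the summed squared error
    return 0.5 * np.sum((layer.forward(inp) - y) ** 2)

# analytic gradient of the loss with respect to the layer's input
out = layer.forward(x)
input_grad = layer.backward(out - y)

# numeric gradient for a single input entry via central differences
eps = 1e-5
bumped_up, bumped_down = x.copy(), x.copy()
bumped_up[0, 0] += eps
bumped_down[0, 0] -= eps
numeric = (loss(bumped_up) - loss(bumped_down)) / (2 * eps)

print(abs(numeric - input_grad[0, 0]) < 1e-6)  # True
```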

update is perhaps the simplest method of all. It simply takes the derivative at the output of the weights and uses it to perform a weight update. If any of these steps don't make sense to you, again, consult A Neural Network in 11 Lines of Python and come back. If everything makes sense, then let's see our layer objects in the context of training.


layer_1 = Layer(input_dim,layer_1_dim,sigmoid,sigmoid_out2deriv)
layer_2 = Layer(layer_1_dim,layer_2_dim,sigmoid,sigmoid_out2deriv)
layer_3 = Layer(layer_2_dim, output_dim,sigmoid, sigmoid_out2deriv)

for iter in range(iterations):
    error = 0

    for batch_i in range(int(len(x) / batch_size)):
        batch_x = x[(batch_i * batch_size):(batch_i+1)*batch_size]
        batch_y = y[(batch_i * batch_size):(batch_i+1)*batch_size]  
        
        layer_1_out = layer_1.forward(batch_x)
        layer_2_out = layer_2.forward(layer_1_out)
        layer_3_out = layer_3.forward(layer_2_out)

        layer_3_delta = layer_3_out - batch_y
        layer_2_delta = layer_3.backward(layer_3_delta)
        layer_1_delta = layer_2.backward(layer_2_delta)
        layer_1.backward(layer_1_delta)
        
        layer_1.update()
        layer_2.update()
        layer_3.update()

Given a dataset x and y, this is how we use our new layer objects. If you compare it to the script from before, pretty much everything happens in pretty much the same places. I just swapped out the script versions of the neural network for the method calls.

So, all we've really done is taken the steps in the script from the previous neural network and split them into distinct functions inside of a class. Below, we can see this layer in action.

If you pull both the previous network and this network into Jupyter notebooks, you'll see that the random seeds cause these networks to have exactly the same values. It seems that Trinket.io might not have perfect random seeding, such that these networks reach nearly identical values. However, I assure you that the networks are identical. If this network doesn't make sense to you, don't move on. Be sure you're comfortable with how this abstraction works before moving forward, as it's going to get a bit more complex below.

Part 6: Synthetic Gradients Based on Layer Output

Ok, so now we're going to use a very similar interface to the one above to integrate what we learned about Synthetic Gradients into our Layer object (and rename it DNI). First, I'm going to show you the class, and then I'll explain it. Check it out!


class DNI(object):
    
    def __init__(self,input_dim, output_dim,nonlin,nonlin_deriv,alpha = 0.1):
        
        # same as before
        self.weights = (np.random.randn(input_dim, output_dim) * 0.2) - 0.1
        self.nonlin = nonlin
        self.nonlin_deriv = nonlin_deriv


        # new stuff
        self.weights_synthetic_grads = (np.random.randn(output_dim,output_dim) * 0.2) - 0.1
        self.alpha = alpha
    
    # used to be just "forward", but now we update during the forward pass using Synthetic Gradients :)
    def forward_and_synthetic_update(self,input):

        # cache input
        self.input = input

        # forward propagate
        self.output = self.nonlin(self.input.dot(self.weights))
        
        # generate synthetic gradient via simple linear transformation
        self.synthetic_gradient = self.output.dot(self.weights_synthetic_grads)

        # update our regular weights using synthetic gradient (descent step, so we subtract like Layer.update)
        self.weight_synthetic_gradient = self.synthetic_gradient * self.nonlin_deriv(self.output)
        self.weights -= self.input.T.dot(self.weight_synthetic_gradient) * self.alpha
        
        # return backpropagated synthetic gradient (this is like the output of "backprop" method from the Layer class)
        # also return forward propagated output (feels weird i know... )
        return self.weight_synthetic_gradient.dot(self.weights.T), self.output
    
    # this is just like the "update" method from before... except it operates on the synthetic weights
    def update_synthetic_weights(self,true_gradient):
        self.synthetic_gradient_delta = self.synthetic_gradient - true_gradient
        self.weights_synthetic_grads -= self.output.T.dot(self.synthetic_gradient_delta) * self.alpha

So, the first big change. We have some new class variables. The only one that really matters is the self.weights_synthetic_grads variable, which is our Synthetic Generator neural network (just a linear layer... i.e., ...just a matrix).

Forward And Synthetic Update: The forward method has changed to forward_and_synthetic_update. Remember how we don't need any other part of the network to make our weight update? This is where the magic happens. First, forward propagation occurs like normal (line 22). Then, we generate our synthetic gradient by passing our output through the Synthetic Gradient generator. This part could be a more complicated neural network, but we've instead decided to keep things simple and just use a simple linear layer to generate our synthetic gradients. After we've got our gradient, we go ahead and update our normal weights (lines 28 and 29). Finally, we backpropagate our synthetic gradient from the output of the weights to the input so that we can send it to the previous layer.

Update Synthetic Gradient: Ok, so the gradient that we returned at the end of the "forward" method. That's what we're going to accept into the update_synthetic_weights method from the next layer. So, if we're at layer 2, then layer 3 returns a gradient from its forward_and_synthetic_update method and that gets input into layer 2's update_synthetic_weights. Then, we simply update our synthetic weights just like we would a normal neural network. We take the input to the synthetic gradient layer (self.output), and then perform an average outer product (matrix transpose -> matrix mul) with the output delta. It's no different than learning in a normal neural network, we've just got some special inputs and outputs in lieu of data.

Ok! Let's see it in action.
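Here's a minimal sketch of how three DNI layers can be wired together. It uses a condensed copy of the class, written with gradient descent (subtracting) updates, which is an assumption on my part about the intended sign convention. The toy data, layer sizes, learning rate, and step count are arbitrary placeholders rather than the settings from my experiments.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_out2deriv(out):
    return out * (1 - out)

class DNI(object):
    # condensed copy of the DNI class, using descent (-=) updates
    def __init__(self, input_dim, output_dim, nonlin, nonlin_deriv, alpha=0.1):
        self.weights = (np.random.randn(input_dim, output_dim) * 0.2) - 0.1
        self.weights_synthetic_grads = (np.random.randn(output_dim, output_dim) * 0.2) - 0.1
        self.nonlin = nonlin
        self.nonlin_deriv = nonlin_deriv
        self.alpha = alpha

    def forward_and_synthetic_update(self, input):
        self.input = input
        self.output = self.nonlin(self.input.dot(self.weights))
        self.synthetic_gradient = self.output.dot(self.weights_synthetic_grads)
        self.weight_synthetic_gradient = self.synthetic_gradient * self.nonlin_deriv(self.output)
        self.weights -= self.input.T.dot(self.weight_synthetic_gradient) * self.alpha
        return self.weight_synthetic_gradient.dot(self.weights.T), self.output

    def update_synthetic_weights(self, true_gradient):
        delta = self.synthetic_gradient - true_gradient
        self.weights_synthetic_grads -= self.output.T.dot(delta) * self.alpha

np.random.seed(1)
batch_x = np.random.randint(0, 2, size=(10, 6)).astype(float)  # toy inputs
batch_y = np.random.randint(0, 2, size=(10, 3)).astype(float)  # toy targets

layer_1 = DNI(6, 8, sigmoid, sigmoid_out2deriv, alpha=0.01)
layer_2 = DNI(8, 8, sigmoid, sigmoid_out2deriv, alpha=0.01)
layer_3 = DNI(8, 3, sigmoid, sigmoid_out2deriv, alpha=0.01)

for step in range(100):
    # each layer updates its own weights as soon as it has forward propagated
    _, layer_1_out = layer_1.forward_and_synthetic_update(batch_x)
    layer_1_delta, layer_2_out = layer_2.forward_and_synthetic_update(layer_1_out)
    layer_2_delta, layer_3_out = layer_3.forward_and_synthetic_update(layer_2_out)

    # only the last layer's generator is ever compared against the data
    layer_3_delta = layer_3_out - batch_y
    layer_3.update_synthetic_weights(layer_3_delta)

    # earlier generators learn from the gradient the next layer sent back
    layer_2.update_synthetic_weights(layer_2_delta)
    layer_1.update_synthetic_weights(layer_1_delta)

print("mean abs error:", float(np.mean(np.abs(layer_3_out - batch_y))))
```

Notice that no layer waits for a full backward pass. The weight update happens inside the forward call, and the generator update only needs the gradient handed back by the very next layer.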

Hmm... things aren't converging as I'd originally wanted them to. I mean, it is converging, but just not really very fast. Upon further inquiry, the hidden representations all start out pretty flat and random (which we're using as input to our gradient generators). In other words, two different training examples end up having nearly identical output representations at different layers. This seems to make it really difficult for the gradient generators to do their job. In the paper, the solution for this is Batch Normalization, which scales all the layer outputs to 0 mean and unit variance. This adds a lot of complexity to what is otherwise a fairly simple toy neural network. Furthermore, the paper also mentions you can use other forms of input to the gradient generators. I'm going to try using the output dataset. This still keeps things decoupled (the spirit of the DNI) but gives something really strong for the network to use to generate gradients from the very beginning. Let's check it out.

And things are training quite a bit faster! Thinking about what might make for good input to gradient generators is a really fascinating concept. Perhaps some combination between input data, output data, and batch normalized layer output would be optimal (feel free to give it a try!) Hope you've enjoyed this tutorial!

I typically tweet out new blogposts when they're complete at @iamtrask. Feel free to follow if you'd be interested in reading more in the future and thanks for all the feedback!

Building Safe A.I.

Fri, 17 Mar 2017 12:00:00 +0000

TLDR: In this blogpost, we're going to train a neural network that is fully encrypted during training (trained on unencrypted data). The result will be a neural network with two beneficial properties. First, the neural network's intelligence is protected from those who might want to steal it, allowing valuable AIs to be trained in insecure environments without risking theft of their intelligence. Second, the network can only make encrypted predictions (which presumably have no impact on the outside world because the outside world cannot understand the predictions without a secret key). This creates a valuable power imbalance between a user and a superintelligence. If the AI is homomorphically encrypted, then from its perspective, the entire outside world is also homomorphically encrypted. A human controls the secret key and has the option to either unlock the AI itself (releasing it on the world) or just individual predictions the AI makes (seems safer).

I typically tweet out new blogposts when they're complete at @iamtrask. Feel free to follow if you'd be interested in reading more in the future and thanks for all the feedback!

Edit: If you're interested in training Encrypted Neural Networks, check out the PySyft Library at OpenMined

Superintelligence

Many people are concerned that superpowerful AI will one day choose to harm humanity. Most recently, Stephen Hawking called for a new world government to govern the abilities that we give to Artificial Intelligence so that it doesn't turn to destroy us. These are pretty bold statements, and I think they reflect the general concern shared between both the scientific community and the world at large. In this blogpost, I'd like to give a tutorial on a potential technical solution to this problem with some toy-ish example code to demonstrate the approach.

The goal is simple. We want to build A.I. technology that can become incredibly smart (smart enough to cure cancer, end world hunger, etc.), but whose intelligence is controlled by a human with a key, such that the application of intelligence is limited. Unlimited learning is great, but unlimited application of that knowledge is potentially dangerous.

To introduce this idea, I'll quickly describe two very exciting fields of research: Deep Learning and Homomorphic Encryption.

Part 1: What is Deep Learning?

Deep Learning is a suite of tools for the automation of intelligence, primarily leveraging neural networks. As a field of computer science, it is largely responsible for the recent boom in A.I. technology as it has surpassed previous quality records for many intelligence tasks. For context, it played a big part in DeepMind's AlphaGo system that recently defeated the world champion Go player, Lee Sedol.

Question: How does a neural network learn?

A neural network makes predictions based on input. It learns to do this effectively by trial and error. It begins by making a prediction (which is largely random at first), and then receives an "error signal" indicating that it predicted too high or too low (usually probabilities). After this cycle repeats many millions of times, the network starts figuring things out. For more detail on how this works, see A Neural Network in 11 Lines of Python.

The big takeaway here is this error signal. Without being told how good its predictions are, it cannot learn. This will be important to remember.

Part 2: What is Homomorphic Encryption?

As the name suggests, Homomorphic Encryption is a form of encryption. In the asymmetric case, it can take perfectly readable text and turn it into gibberish using a "public key". More importantly, it can then take that gibberish and turn it back into the same text using a "secret key". However, unless you have the "secret key", you cannot decode the gibberish (in theory).

Homomorphic Encryption is a special type of encryption though. It allows someone to modify the encrypted information in specific ways without being able to read the information. For example, homomorphic encryption can be performed on numbers such that multiplication and addition can be performed on encrypted values without decrypting them. Here are a few toy examples.

Now, there are a growing number of homomorphic encryption schemes, each with different properties. It's a relatively young field and there are several significant problems still being worked through, but we'll come back to that later.

For now, let's just start with the following. Integer public key encryption schemes that are homomorphic over multiplication and addition can perform the operations in the picture above. Furthermore, because the public key allows for "one way" encryption, you can even perform operations between unencrypted numbers and encrypted numbers (by one-way encrypting them), as exemplified above by 2 * Cypher A. (Some encryption schemes don't even require that... but again... we'll come back to that later)

Part 3: Can we use them together?

Perhaps the most frequent intersection between Deep Learning and Homomorphic Encryption has manifested around Data Privacy. As it turns out, when you homomorphically encrypt data, you can't read it but you still maintain most of the interesting statistical structure. This has allowed people to train models on encrypted data (CryptoNets). Furthermore, a startup hedge fund called Numer.ai encrypts expensive, proprietary data and allows anyone to attempt to train machine learning models to predict the stock market. Normally they wouldn't be able to do this because it would constitute giving away incredibly expensive information (and normal encryption would make model training impossible).

However, this blog post is about doing the inverse, encrypting the neural network and training it on decrypted data.

A neural network, in all its amazing complexity, actually breaks down into a surprisingly small number of moving parts which are simply repeated over and over again. In fact, many state-of-the-art neural networks can be created using only the following operations:

Addition
Multiplication
Division
Subtraction
Sigmoid
Tanh
Exponential

So, let's ask the obvious technical question, can we homomorphically encrypt the neural network itself? Would we want to? As it turns out, with a few conservative approximations, this can be done.

Addition - works out of the box
Multiplication - works out of the box
Division - works out of the box? - simply 1 / multiplication
Subtraction - works out of the box? - simply negated addition
Sigmoid - hmmm... perhaps a bit harder
Tanh - hmmm... perhaps a bit harder
Exponential - hmmm... perhaps a bit harder

It seems like we'll be able to get Division and Subtraction pretty trivially, but these more complicated functions are... well... more complicated than simple addition and multiplication. In order to try to homomorphically encrypt a deep neural network, we need one more secret ingredient.

Part 4: Taylor Series Expansion

Perhaps you remember it from primary school. A Taylor Series allows one to compute a complicated (nonlinear) function using an infinite series of additions, subtractions, multiplications, and divisions. This is perfect! (except for the infinite part). Fortunately, if you stop short of computing the exact Taylor Series Expansion you can still get a close approximation of the function at hand. Here are a few popular functions approximated via Taylor Series (Source).

WAIT! THERE ARE EXPONENTS! No worries. Exponents are just repeated multiplication, which we can do. For something to play with, here's a little python implementation approximating the Taylor Series for our desirable sigmoid function (the formula for which you can look up on Wolfram Alpha). We'll take the first few parts of the series and see how close we get to the true sigmoid function.
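A minimal sketch of such an approximation follows. It assumes the standard Maclaurin coefficients for sigmoid (1/2, 1/4, -1/48, 1/480, -17/80640 for the powers 0, 1, 3, 5, 7); the function names sigmoid_exact and sigmoid_taylor are illustrative.

```python
import numpy as np

def sigmoid_exact(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_taylor(x, terms=4):
    # Maclaurin coefficients of sigmoid for x^0, x^1, x^3, x^5, x^7
    coeffs = [1 / 2, 1 / 4, -1 / 48, 1 / 480, -17 / 80640]
    powers = [0, 1, 3, 5, 7]
    # only additions and multiplications are needed to evaluate this
    return sum(c * x ** p for c, p in zip(coeffs[:terms], powers[:terms]))

for x in [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]:
    print(x, sigmoid_exact(x), sigmoid_taylor(x))
```

The approximation is tightest near zero and drifts as |x| grows, so the inputs to the activation need to stay in a modest range.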

With only the first four terms of the Taylor Series, we get very close to sigmoid for a relatively large range of inputs. Now that we have our general strategy, it's time to select a Homomorphic Encryption algorithm.

Part 5: Choosing an Encryption Algorithm

Homomorphic Encryption is a relatively new field, with the major landmark being the discovery of the first Fully Homomorphic algorithm by Craig Gentry in 2009. This landmark event created a foothold for many to follow. Most of the excitement around Homomorphic Encryption has been around developing Turing Complete, homomorphically encrypted computers. Thus, the quest for a fully homomorphic scheme seeks to find an algorithm that can efficiently and securely compute the various logic gates required to run arbitrary computation. The general hope is that people would be able to securely offload work to the cloud with no risk that the data being sent could be read by anyone other than the sender. It's a very cool idea, and a lot of progress has been made.

However, there are some drawbacks. In general, most Fully Homomorphic Encryption schemes are incredibly slow relative to normal computers (not yet practical). This has sparked an interesting thread of research to limit the number of operations to be Somewhat homomorphic so that at least some computations could be performed. Less flexible but faster, a common tradeoff in computation.

This is where we want to start looking. In theory, we want a homomorphic encryption scheme that operates on floats (but we'll settle for integers, as we'll see) instead of binary values. Binary values would work, but not only would it require the flexibility of Fully Homomorphic Encryption (costing performance), but we'd have to manage the logic between binary representations and the math operations we want to compute. A less powerful, tailored HE algorithm for floating point operations would be a better fit.

Despite this constraint, there is still a plethora of choices. Here are a few popular ones with characteristics we like:

The best one to use here is likely either YASHE or FV. YASHE was the method used for the popular CryptoNets algorithm, with great support for floating point operations. However, it's pretty complex. For the purpose of making this blogpost easy and fun to play around with, we're going to go with the slightly less advanced (and possibly less secure) Efficient Integer Vector Homomorphic Encryption. However, I think it's important to note that new HE algorithms are being developed as you read this, and the ideas presented in this blogpost are generic to any schemes that are homomorphic over addition and multiplication of integers and/or floating point numbers. If anything, it is my hope to raise awareness for this application of HE such that more HE algos will be developed to optimize for Deep Learning.

This encryption algorithm is also covered extensively by Yu, Lai, and Paylor in this work with an accompanying implementation here. The main bulk of the approach is in the C++ file vhe.cpp. Below we'll walk through a python port of this code with accompanying explanation for what's going on. This will also be useful if you choose to implement a more advanced scheme as there are themes that are relatively universal (general function names, variable names, etc.).

Part 6: Homomorphic Encryption in Python

Let’s start by covering a bit of the Homomorphic Encryption jargon:

Plaintext: this is your un-encrypted data. It's also called the "message". In our case, this will be a bunch of numbers representing our neural network.
Cyphertext: this is your encrypted data. We'll do math operations on the cyphertext which will change the underlying Plaintext.
Public Key: this is a pseudo-random sequence of numbers that allows anyone to encrypt data. It's ok to share this with people because (in theory) they can only use it for encryption.
Private/Secret Key: this is a pseudo-random sequence of numbers that allows you to decrypt data that was encrypted by the Public Key. You do NOT want to share this with people. Otherwise, they could decrypt your messages.

So, those are the major moving parts. They also correspond to particular variables with names that are pretty standard across different homomorphic encryption techniques. In this paper, they are the following:

S: this is a matrix that represents your Secret/Private Key. You need it to decrypt stuff.
M: This is your public key. You'll use it to encrypt stuff and perform math operations. Some algorithms don't require the public key for all math operations but this one uses it quite extensively.
c: This vector is your encrypted data, your "cyphertext".
x: This corresponds to your message, or your "plaintext". Some papers use the variable "m" instead.
w: This is a single "weighting" scalar variable which we use to re-weight our input message x (make it consistently bigger or smaller). We use this variable to help tune the signal/noise ratio. Making the signal "bigger" makes it less susceptible to noise at any given operation. However, making it too big increases our likelihood of corrupting our data entirely. It's a balance.
E or e: generally refers to random noise. In some cases, this refers to noise added to the data before encrypting it with the public key. This noise is generally what makes the decryption difficult. It's what allows two encryptions of the same message to be different, which is important to make the message hard to crack. Note, this can be a vector or a matrix depending on the algorithm and implementation. In other cases, this can refer to the noise that accumulates over operations. More on that later.

As is convention with many math papers, capital letters correspond to matrices, lowercase letters correspond to vectors, and italic lowercase letters correspond to scalars. Homomorphic Encryption has four kinds of operations that we care about: public/private keypair generation, one-way encryption, decryption, and the math operations. Let's start with decryption.

The formula on the left describes the general relationship between our secret key S and our message x. The formula on the right shows how we can use our secret key to decrypt our message. Notice that "e" is gone? Basically, the general philosophy of Homomorphic Encryption techniques is to introduce just enough noise that the original message is hard to get back without the secret key, but a small enough amount of noise that it amounts to a rounding error when you DO have the secret key. The brackets on the top and bottom represent "round to the nearest integer". Other Homomorphic Encryption algorithms round to various amounts. Modulus operators are nearly ubiquitous. Encryption, then, is about generating a c so that this relationship holds true. If S is a random matrix, then c will be hard to decrypt. The simpler, non-symmetric way of generating an encryption key is to just find the inverse of the secret key. Let's start there with some Python code.

import numpy as np

def generate_key(w,m,n):
    S = (np.random.rand(m,n) * w / (2 ** 16)) # proving max(S) < w
    return S

def encrypt(x,S,m,n,w):
    assert len(x) == len(S)
    
    e = (np.random.rand(m)) # proving max(e) < w / 2
    c = np.linalg.inv(S).dot((w * x) + e)
    return c

def decrypt(c,S,w):
    return (S.dot(c) / w).astype('int')

def get_c_star(c,m,l):
    c_star = np.zeros(l * m,dtype='int')
    for i in range(m):
        b = np.array(list(np.binary_repr(np.abs(c[i]))),dtype='int')
        if(c[i] < 0):
            b *= -1
        c_star[(i * l) + (l-len(b)): (i+1) * l] += b
    return c_star

def get_S_star(S,m,n,l):
    S_star = list()
    for i in range(l):
        S_star.append(S*2**(l-i-1))
    S_star = np.array(S_star).transpose(1,2,0).reshape(m,n*l)
    return S_star


x = np.array([0,1,2,5])

m = len(x)
n = m
w = 16
S = generate_key(w,m,n)

And when I run this code in an iPython notebook, I can perform the following operations (with corresponding output).

The key thing to look at are the bottom results. Notice that we can perform some basic operations to the cyphertext and it changes the underlying plaintext accordingly. Neat, eh?
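For readers without the notebook at hand, the session can be approximated with the self-contained sketch below. It follows the same relationship (S.dot(c) = w * x + e), but it is an assumption-laden variant rather than the exact cells: it rounds with np.rint to match the "round to the nearest integer" brackets, it assumes the random S is invertible, and the second message y is made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_key(w, m, n):
    # random secret key; assumed invertible, which random floats almost always are
    return rng.random((m, n)) * w / (2 ** 16)

def encrypt(x, S, w):
    e = rng.random(len(x))                    # noise, each entry in [0, 1)
    return np.linalg.inv(S).dot(w * x + e)    # c satisfies S.dot(c) = w * x + e

def decrypt(c, S, w):
    # dividing by w shrinks the noise below 0.5, so rounding removes it
    return np.rint(S.dot(c) / w).astype(int)

x = np.array([0, 1, 2, 5])
y = np.array([3, 3, 3, 3])    # hypothetical second message
w = 16
S = generate_key(w, len(x), len(x))

c_x = encrypt(x, S, w)
c_y = encrypt(y, S, w)

print(decrypt(c_x, S, w))          # recovers x
print(decrypt(c_x + c_y, S, w))    # adding cyphertexts adds the plaintexts
print(decrypt(c_x * 3, S, w))      # scaling a cyphertext scales the plaintext
```

Every operation also scales or adds the noise, which is why the multiplier here is kept small relative to w.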

Part 7: Optimizing Encryption

Important Lesson: Take a look at the decryption formulas again. If the secret key, S, is the identity matrix, then cyphertext c is just a re-weighted, slightly noisy version of the input x, which could easily be discovered given a handful of examples. If this doesn't make sense, Google "Identity Matrix Tutorial" and come back. It's a bit too much to go into here.

This leads us into how encryption takes place. Instead of explicitly allocating a self-standing "Public Key" and "Private Key", the authors propose a "Key Switching" technique, wherein they can swap out one Private Key S for another S'. More specifically, this private key switching technique involves generating a matrix M that can perform the transformation. Since M has the ability to convert a message from being unencrypted (secret key of the identity matrix) to being encrypted (secret key that's random and difficult to guess), this M becomes our public key!

That was a lot of information at a fast pace. Let's nutshell that again.

Here's what happened...

Given the two formulas above, if the secret key is the identity matrix, the message isn't encrypted.
Given the two formulas above, if the secret key is a random matrix, the generated message is encrypted.
We can make a matrix M that changes the secret key from one secret key to another.
When the matrix M converts from the identity to a random secret key, it is, by extension, encrypting the message in a one-way encryption.
Because M performs the role of a "one way encryption", we call it the "public key" and can distribute it like we would a public key since it cannot decrypt the code.

So, without further ado, let's see how this is done in Python.


import numpy as np

def generate_key(w,m,n):
    S = (np.random.rand(m,n) * w / (2 ** 16)) # proving max(S) < w
    return S

def encrypt(x,S,m,n,w):
    assert len(x) == len(S)
    
    e = (np.random.rand(m)) # proving max(e) < w / 2
    c = np.linalg.inv(S).dot((w * x) + e)
    return c

def decrypt(c,S,w):
    return (S.dot(c) / w).astype('int')

def get_c_star(c,m,l):
    c_star = np.zeros(l * m,dtype='int')
    for i in range(m):
        b = np.array(list(np.binary_repr(np.abs(c[i]))),dtype='int')
        if(c[i] < 0):
            b *= -1
        c_star[(i * l) + (l-len(b)): (i+1) * l] += b
    return c_star

def switch_key(c,S,m,n,T):
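    # floor(log2) + 1 bits are needed; ceil(log2) is one bit short when max|c| is an exact power of two
    l = int(np.floor(np.log2(np.max(np.abs(c))))) + 1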
    c_star = get_c_star(c,m,l)
    S_star = get_S_star(S,m,n,l)
    n_prime = n + 1
    

    S_prime = np.concatenate((np.eye(m),T.T),0).T
    A = (np.random.rand(n_prime - m, n*l) * 10).astype('int')
    E = (1 * np.random.rand(S_star.shape[0],S_star.shape[1])).astype('int')
    M = np.concatenate(((S_star - T.dot(A) + E),A),0)
    c_prime = M.dot(c_star)
    return c_prime,S_prime

def get_S_star(S,m,n,l):
    S_star = list()
    for i in range(l):
        S_star.append(S*2**(l-i-1))
    S_star = np.array(S_star).transpose(1,2,0).reshape(m,n*l)
    return S_star

def get_T(n):
    n_prime = n + 1
    T = (10 * np.random.rand(n,n_prime - n)).astype('int')
    return T

def encrypt_via_switch(x,w,m,n,T):
    c,S = switch_key(x*w,np.eye(m),m,n,T)
    return c,S

x = np.array([0,1,2,5])

m = len(x)
n = m
w = 16
S = generate_key(w,m,n)

The way this works is by making the S key mostly the identity matrix, simply concatenating a random vector T onto it. Thus, T really has all the information necessary for the secret key, even though we still have to create a matrix of size S to get things to work right.
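The key-switching idea can also be checked end to end with a compact, integer-only sketch. This is a rewrite under assumptions rather than the port above: the helper names (bit_decompose, expand_key, key_switch_matrix) are illustrative, the number of bits l is fixed by hand, and the noise E is drawn from {-1, 0, 1}. The new key is S_new = [I, T], and the check relies on S_new.dot(M) equalling S_star + E.

```python
import numpy as np

rng = np.random.default_rng(0)

def bit_decompose(c, l):
    # write every integer in c as l signed bits, most significant bit first
    bits = np.zeros(len(c) * l, dtype=np.int64)
    for i, value in enumerate(c):
        b = np.array([int(ch) for ch in np.binary_repr(abs(int(value)), width=l)])
        bits[i * l:(i + 1) * l] = b if value >= 0 else -b
    return bits

def expand_key(S, l):
    # S_star satisfies S_star.dot(bit_decompose(c, l)) == S.dot(c)
    powers = 2 ** np.arange(l - 1, -1, -1, dtype=np.int64)
    return np.kron(S, powers[None, :])

def key_switch_matrix(S, T, l):
    # M moves data held under key S to the new key S_new = [I, T]
    m, n = S.shape
    S_star = expand_key(S, l)
    A = rng.integers(0, 10, size=(T.shape[1], n * l))   # random mask
    E = rng.integers(-1, 2, size=S_star.shape)          # small noise
    M = np.vstack((S_star - T.dot(A) + E, A))
    S_new = np.hstack((np.eye(m, dtype=np.int64), T))
    return M, S_new

def decrypt(c, S, w):
    return np.rint(S.dot(c) / w).astype(np.int64)

x = np.array([0, 1, 2, 5])
w = 2 ** 10    # weighting scalar, large relative to the noise
l = 16         # bits per entry; must cover max(|w * x|)
m = len(x)

T = rng.integers(0, 10, size=(m, 1))    # the secret part of the key
M, S_new = key_switch_matrix(np.eye(m, dtype=np.int64), T, l)

c = M.dot(bit_decompose(w * x, l))      # encrypt using only M
print(decrypt(c, S_new, w))             # recovers x
print(decrypt(c + c, S_new, w))         # recovers 2 * x
```

Starting from the identity key means the "old cyphertext" is just w * x, so applying M is the one-way encryption step, while decryption still needs T.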

Part 8: Building an XOR Neural Network

So, now that we know how to encrypt and decrypt messages (and compute basic addition and multiplication), it's time to start trying to expand to the rest of the operations we need to build a simple XOR neural network. While technically neural networks are just a series of very simple operations, there are several combinations of these operations that we need some handy functions for. So, here I'm going to describe each operation we need and the high level approach we're going to take (basically which series of additions and multiplications we'll use). Then I'll show you code. For detailed descriptions check out this work.

Floating Point Numbers: We're going to do this by simply scaling our floats into integers. We'll train our network on integers as if they were floats. Let's say we're scaling by 1000. 0.2 * 0.5 = 0.1. If we scale up, 200 * 500 = 100000. We have to scale down by 1000 twice since we performed multiplication, but 100000 / (1000 * 1000) = 0.1 which is what we want. This can be tricky at first but you'll get used to it. Since this HE scheme rounds to the nearest integer, this also lets you control the precision of your neural net.
Vector-Matrix Multiplication: This is our bread and butter. As it turns out, the M matrix that converts from one secret key to another can also be used to apply a linear transformation.
Inner Dot Product: In the right context, the linear transformation above can also be an inner dot product.
Sigmoid: Since we can do vector-matrix multiplication, we can evaluate arbitrary polynomials given enough multiplications. Since we know the Taylor Series polynomial for sigmoid, we can evaluate an approximate sigmoid!
Elementwise Matrix Multiplication: This one is surprisingly inefficient. We have to do a Vector-Matrix multiplication or a series of inner dot products.
Outer Product: We can accomplish this via masking and inner products.
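The floating point trick in the first item can be sketched in a few lines of plain Python. This only illustrates the scaling arithmetic on unencrypted integers; the helper name to_fixed is an assumption for the example.

```python
scaling_factor = 1000

def to_fixed(value):
    # represent a float as an integer number of thousandths
    return int(round(value * scaling_factor))

a = to_fixed(0.2)    # 200
b = to_fixed(0.5)    # 500

# addition keeps a single scaling factor
total = a + b                           # 700, i.e. 0.7

# multiplication carries the scaling factor twice...
product = a * b                         # 100000
# ...so one factor is divided back out to stay on the same scale
rescaled = product // scaling_factor    # 100, i.e. 0.1

print(total / scaling_factor)       # 0.7
print(rescaled / scaling_factor)    # 0.1
```

This is why the functions that follow divide by scaling_factor after each inner product.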

As a general disclaimer, there might be more efficient ways of accomplishing these methods, but I didn't want to risk compromising the integrity of the homomorphic encryption scheme, so I sort of bent over backwards to just use the provided functions from the paper (with the allowed extension to sigmoid). Now, let's see how these are accomplished in Python.


import copy  # needed for copy.deepcopy below

def sigmoid(layer_2_c):
    out_rows = list()
    for position in range(len(layer_2_c)-1):

        M_position = M_onehot[len(layer_2_c)-2][0]

        layer_2_index_c = innerProd(layer_2_c,v_onehot[len(layer_2_c)-2][position],M_position,l) / scaling_factor

        x = layer_2_index_c
        x2 = innerProd(x,x,M_position,l) / scaling_factor
        x3 = innerProd(x,x2,M_position,l) / scaling_factor
        x5 = innerProd(x3,x2,M_position,l) / scaling_factor
        x7 = innerProd(x5,x2,M_position,l) / scaling_factor

        xs = copy.deepcopy(v_onehot[5][0])
        xs[1] = x[0]
        xs[2] = x2[0]
        xs[3] = x3[0]
        xs[4] = x5[0]
        xs[5] = x7[0]

        out = mat_mul_forward(xs,H_sigmoid[0:1],scaling_factor)
        out_rows.append(out)
    return transpose(out_rows)[0]

def load_linear_transformation(syn0_text,scaling_factor = 1000):
    syn0_text *= scaling_factor
    return linearTransformClient(syn0_text.T,getSecretKey(T_keys[len(syn0_text)-1]),T_keys[len(syn0_text)-1],l)

def outer_product(x,y):
    flip = False
    if(len(x) < len(y)):
        flip = True
        tmp = x
        x = y
        y = tmp
        
    y_matrix = list()

    for i in range(len(x)-1):
        y_matrix.append(y)

    y_matrix_transpose = transpose(y_matrix)

    outer_result = list()
    for i in range(len(x)-1):
        outer_result.append(mat_mul_forward(x * onehot[len(x)-1][i],y_matrix_transpose,scaling_factor))
    
    if(flip):
        return transpose(outer_result)
    
    return outer_result

def mat_mul_forward(layer_1,syn1,scaling_factor):
    
    input_dim = len(layer_1)
    output_dim = len(syn1)

    buff = np.zeros(max(output_dim+1,input_dim+1))
    buff[0:len(layer_1)] = layer_1
    layer_1_c = buff
    
    syn1_c = list()
    for i in range(len(syn1)):
        buff = np.zeros(max(output_dim+1,input_dim+1))
        buff[0:len(syn1[i])] = syn1[i]
        syn1_c.append(buff)
    
    layer_2 = innerProd(syn1_c[0],layer_1_c,M_onehot[len(layer_1_c) - 2][0],l) / float(scaling_factor)
    for i in range(len(syn1)-1):
        layer_2 += innerProd(syn1_c[i+1],layer_1_c,M_onehot[len(layer_1_c) - 2][i+1],l) / float(scaling_factor)
    return layer_2[0:output_dim+1]

def elementwise_vector_mult(x,y,scaling_factor):
    
    y =[y]
    
    one_minus_layer_1 = transpose(y)

    outer_result = list()
    for i in range(len(x)-1):
        outer_result.append(mat_mul_forward(x * onehot[len(x)-1][i],y,scaling_factor))
        
    return transpose(outer_result)[0]

Now, there's one bit that I haven't told you about yet. To save time, I'm pre-computing several keys, vectors, and matrices and storing them. This includes things like "the vector of all 1s" and one-hot encoding vectors of various lengths. This is useful for the masking operations above as well as some simple things we want to be able to do. For example, the derivative of sigmoid is sigmoid(x) * (1 - sigmoid(x)). Thus, precomputing these variables is handy. Here's the pre-computation step.


# HAPPENS ON SECURE SERVER

l = 100
w = 2 ** 25

aBound = 10
tBound = 10
eBound = 10

max_dim = 10

scaling_factor = 1000

# keys
T_keys = list()
for i in range(max_dim):
    T_keys.append(np.random.rand(i+1,1))

# one way encryption transformation
M_keys = list()
for i in range(max_dim):
    M_keys.append(innerProdClient(T_keys[i],l))

M_onehot = list()
for h in range(max_dim):
    i = h+1
    buffered_eyes = list()
    for row in np.eye(i+1):
        buffer = np.ones(i+1)
        buffer[0:i+1] = row
        buffered_eyes.append((M_keys[i-1].T * buffer).T)
    M_onehot.append(buffered_eyes)
    
c_ones = list()
for i in range(max_dim):
    c_ones.append(encrypt(T_keys[i],np.ones(i+1), w, l).astype('int'))
    
v_onehot = list()
onehot = list()
for i in range(max_dim):
    eyes = list()
    eyes_txt = list()
    for eye in np.eye(i+1):
        eyes_txt.append(eye)
        eyes.append(one_way_encrypt_vector(eye,scaling_factor))
    v_onehot.append(eyes)
    onehot.append(eyes_txt)

H_sigmoid_txt = np.zeros((5,5))

H_sigmoid_txt[0][0] = 0.5
H_sigmoid_txt[0][1] = 0.25
H_sigmoid_txt[0][2] = -1/48.0
H_sigmoid_txt[0][3] = 1/480.0
H_sigmoid_txt[0][4] = -17/80640.0

H_sigmoid = list()
for row in H_sigmoid_txt:
    H_sigmoid.append(one_way_encrypt_vector(row))

If you're looking closely, you'll notice that the H_sigmoid matrix is the matrix we need for the polynomial evaluation of sigmoid. :) Finally, we want to train our neural network with the following. If the neural network parts don't make sense, review A Neural Network in 11 Lines of Python. I've basically taken the XOR network from there and swapped out its operations with the proper utility functions for our encrypted weights.


import sys  # used for sys.stdout.write in the training loop

import numpy as np

np.random.seed(1234)

input_dataset = [[],[0],[1],[0,1]]
output_dataset = [[0],[1],[1],[0]]

input_dim = 3
hidden_dim = 4
output_dim = 1
alpha = 0.015

# one way encrypt our training data using the public key (this can be done onsite)
y = list()
for i in range(4):
    y.append(one_way_encrypt_vector(output_dataset[i],scaling_factor))

# generate our weight values
syn0_t = (np.random.randn(input_dim,hidden_dim) * 0.2) - 0.1
syn1_t = (np.random.randn(output_dim,hidden_dim) * 0.2) - 0.1

# one-way encrypt our weight values
syn1 = list()
for row in syn1_t:
    syn1.append(one_way_encrypt_vector(row,scaling_factor).astype('int64'))

syn0 = list()
for row in syn0_t:
    syn0.append(one_way_encrypt_vector(row,scaling_factor).astype('int64'))


# begin training
for iter in range(1000):
    
    decrypted_error = 0
    encrypted_error = 0
    for row_i in range(4):

        if(row_i == 0):
            layer_1 = sigmoid(syn0[0])
        elif(row_i == 1):
            layer_1 = sigmoid((syn0[0] + syn0[1])/2.0)
        elif(row_i == 2):
            layer_1 = sigmoid((syn0[0] + syn0[2])/2.0)
        else:
            layer_1 = sigmoid((syn0[0] + syn0[1] + syn0[2])/3.0)

        layer_2 = (innerProd(syn1[0],layer_1,M_onehot[len(layer_1) - 2][0],l) / float(scaling_factor))[0:2]

        layer_2_delta = add_vectors(layer_2,-y[row_i])

        syn1_trans = transpose(syn1)

        one_minus_layer_1 = [(scaling_factor * c_ones[len(layer_1) - 2]) - layer_1]
        sigmoid_delta = elementwise_vector_mult(layer_1,one_minus_layer_1[0],scaling_factor)
        layer_1_delta_nosig = mat_mul_forward(layer_2_delta,syn1_trans,1).astype('int64')
        layer_1_delta = elementwise_vector_mult(layer_1_delta_nosig,sigmoid_delta,scaling_factor) * alpha

        syn1_delta = np.array(outer_product(layer_2_delta,layer_1)).astype('int64')

        syn1[0] -= np.array(syn1_delta[0]* alpha).astype('int64')

        syn0[0] -= (layer_1_delta).astype('int64')

        if(row_i == 1):
            syn0[1] -= (layer_1_delta).astype('int64')
        elif(row_i == 2):
            syn0[2] -= (layer_1_delta).astype('int64')
        elif(row_i == 3):
            syn0[1] -= (layer_1_delta).astype('int64')
            syn0[2] -= (layer_1_delta).astype('int64')


        # So that we can watch training, I'm going to decrypt the loss as we go.
        # If this was a secure environment, I wouldn't be doing this here. I'd send
        # the encrypted loss somewhere else to be decrypted
        encrypted_error += int(np.sum(np.abs(layer_2_delta)) / scaling_factor)
        decrypted_error += np.sum(np.abs(s_decrypt(layer_2_delta).astype('float')/scaling_factor))

    
    sys.stdout.write("\r Iter:" + str(iter) + " Encrypted Loss:" + str(encrypted_error) +  " Decrypted Loss:" + str(decrypted_error) + " Alpha:" + str(alpha))
    
    # just to make logging nice
    if(iter % 10 == 0):
        print()
    
    # stop training when encrypted error reaches a certain level
    if(encrypted_error < 25000000):
        break
        
print("\nFinal Prediction:")

for row_i in range(4):

    if(row_i == 0):
        layer_1 = sigmoid(syn0[0])
    elif(row_i == 1):
        layer_1 = sigmoid((syn0[0] + syn0[1])/2.0)
    elif(row_i == 2):
        layer_1 = sigmoid((syn0[0] + syn0[2])/2.0)
    else:
        layer_1 = sigmoid((syn0[0] + syn0[1] + syn0[2])/3.0)

    layer_2 = (innerProd(syn1[0],layer_1,M_onehot[len(layer_1) - 2][0],l) / float(scaling_factor))[0:2]
    print("True Pred:" + str(output_dataset[row_i]) + " Encrypted Prediction:" + str(layer_2) + " Decrypted Prediction:" + str(s_decrypt(layer_2) / scaling_factor))


 Iter:0 Encrypted Loss:84890656 Decrypted Loss:2.529 Alpha:0.015
 Iter:10 Encrypted Loss:69494197 Decrypted Loss:2.071 Alpha:0.015
 Iter:20 Encrypted Loss:64017850 Decrypted Loss:1.907 Alpha:0.015
 Iter:30 Encrypted Loss:62367015 Decrypted Loss:1.858 Alpha:0.015
 Iter:40 Encrypted Loss:61874493 Decrypted Loss:1.843 Alpha:0.015
 Iter:50 Encrypted Loss:61399244 Decrypted Loss:1.829 Alpha:0.015
 Iter:60 Encrypted Loss:60788581 Decrypted Loss:1.811 Alpha:0.015
 Iter:70 Encrypted Loss:60327357 Decrypted Loss:1.797 Alpha:0.015
 Iter:80 Encrypted Loss:59939426 Decrypted Loss:1.786 Alpha:0.015
 Iter:90 Encrypted Loss:59628769 Decrypted Loss:1.778 Alpha:0.015
 Iter:100 Encrypted Loss:59373621 Decrypted Loss:1.769 Alpha:0.015
 Iter:110 Encrypted Loss:59148014 Decrypted Loss:1.763 Alpha:0.015
 Iter:120 Encrypted Loss:58934571 Decrypted Loss:1.757 Alpha:0.015
 Iter:130 Encrypted Loss:58724873 Decrypted Loss:1.75 Alpha:0.0155
 Iter:140 Encrypted Loss:58516008 Decrypted Loss:1.744 Alpha:0.015
 Iter:150 Encrypted Loss:58307663 Decrypted Loss:1.739 Alpha:0.015
 Iter:160 Encrypted Loss:58102049 Decrypted Loss:1.732 Alpha:0.015
 Iter:170 Encrypted Loss:57863091 Decrypted Loss:1.725 Alpha:0.015
 Iter:180 Encrypted Loss:55470158 Decrypted Loss:1.653 Alpha:0.015
 Iter:190 Encrypted Loss:54650383 Decrypted Loss:1.629 Alpha:0.015
 Iter:200 Encrypted Loss:53838756 Decrypted Loss:1.605 Alpha:0.015
 Iter:210 Encrypted Loss:51684722 Decrypted Loss:1.541 Alpha:0.015
 Iter:220 Encrypted Loss:54408709 Decrypted Loss:1.621 Alpha:0.015
 Iter:230 Encrypted Loss:54946198 Decrypted Loss:1.638 Alpha:0.015
 Iter:240 Encrypted Loss:54668472 Decrypted Loss:1.63 Alpha:0.0155
 Iter:250 Encrypted Loss:55444008 Decrypted Loss:1.653 Alpha:0.015
 Iter:260 Encrypted Loss:54094286 Decrypted Loss:1.612 Alpha:0.015
 Iter:270 Encrypted Loss:51251831 Decrypted Loss:1.528 Alpha:0.015
 Iter:276 Encrypted Loss:24543890 Decrypted Loss:0.732 Alpha:0.015
 Final Prediction:
True Pred:[0] Encrypted Prediction:[-3761423723.0718255 0.0] Decrypted Prediction:[-0.112]
True Pred:[1] Encrypted Prediction:[24204806753.166267 0.0] Decrypted Prediction:[ 0.721]
True Pred:[1] Encrypted Prediction:[23090462896.17028 0.0] Decrypted Prediction:[ 0.688]
True Pred:[0] Encrypted Prediction:[1748380342.4553354 0.0] Decrypted Prediction:[ 0.052]

When I train this neural network, this is the output that I see. Tuning was a bit tricky, as some combination of the encryption noise and the low precision makes for somewhat chunky learning. Training is also quite slow. A lot of this comes back to how expensive the transpose operation is. I'm pretty sure that I could do something quite a bit simpler, but, again, I wanted to err on the side of safety for this proof of concept.

Things to take away:

The weights of the network are all encrypted.
The data is unencrypted: plain 1s and 0s.
After training, the network could be decrypted for increased performance or training (or switch to a different encryption key).
The training loss and output predictions are all also encrypted values. We have to decode them in order to be able to interpret the network.

Part 9: Sentiment Classification

To make this a bit more real, here's the same network training on IMDB sentiment reviews based on a network from Udacity's Deep Learning Nanodegree. You can find the full code here


import time
import sys
from collections import Counter  # used by pre_process_data below
import numpy as np

# Let's tweak our network from before to model these phenomena
class SentimentNetwork:
    def __init__(self, reviews,labels,min_count = 10,polarity_cutoff = 0.1,hidden_nodes = 8, learning_rate = 0.1):
       
        np.random.seed(1234)
    
        self.pre_process_data(reviews, labels, polarity_cutoff, min_count)
        
        self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)
        
        
    def pre_process_data(self,reviews, labels, polarity_cutoff,min_count):
        
        print("Pre-processing data...")
        
        positive_counts = Counter()
        negative_counts = Counter()
        total_counts = Counter()

        for i in range(len(reviews)):
            if(labels[i] == 'POSITIVE'):
                for word in reviews[i].split(" "):
                    positive_counts[word] += 1
                    total_counts[word] += 1
            else:
                for word in reviews[i].split(" "):
                    negative_counts[word] += 1
                    total_counts[word] += 1

        pos_neg_ratios = Counter()

        for term,cnt in list(total_counts.most_common()):
            if(cnt >= 50):
                pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
                pos_neg_ratios[term] = pos_neg_ratio

        for word,ratio in pos_neg_ratios.most_common():
            if(ratio > 1):
                pos_neg_ratios[word] = np.log(ratio)
            else:
                pos_neg_ratios[word] = -np.log((1 / (ratio + 0.01)))
        
        review_vocab = set()
        for review in reviews:
            for word in review.split(" "):
                if(total_counts[word] > min_count):
                    if(word in pos_neg_ratios.keys()):
                        if((pos_neg_ratios[word] >= polarity_cutoff) or (pos_neg_ratios[word] <= -polarity_cutoff)):
                            review_vocab.add(word)
                    else:
                        review_vocab.add(word)
        self.review_vocab = list(review_vocab)
        
        label_vocab = set()
        for label in labels:
            label_vocab.add(label)
        
        self.label_vocab = list(label_vocab)
        
        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)
        
        self.word2index = {}
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i
        
        self.label2index = {}
        for i, label in enumerate(self.label_vocab):
            self.label2index[label] = i
         
        
    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Set number of nodes in input, hidden and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        print("Initializing Weights...")
        self.weights_0_1_t = np.zeros((self.input_nodes,self.hidden_nodes))
    
        self.weights_1_2_t = np.random.normal(0.0, self.output_nodes**-0.5, 
                                                (self.hidden_nodes, self.output_nodes))
        
        print("Encrypting Weights...")
        self.weights_0_1 = list()
        for i,row in enumerate(self.weights_0_1_t):
            sys.stdout.write("\rEncrypting Weights from Layer 0 to Layer 1:" + str(float((i+1) * 100) / len(self.weights_0_1_t))[0:4] + "% done")
            self.weights_0_1.append(one_way_encrypt_vector(row,scaling_factor).astype('int64'))
        print("")
        
        self.weights_1_2 = list()
        for i,row in enumerate(self.weights_1_2_t):
            sys.stdout.write("\rEncrypting Weights from Layer 1 to Layer 2:" + str(float((i+1) * 100) / len(self.weights_1_2_t))[0:4] + "% done")
            self.weights_1_2.append(one_way_encrypt_vector(row,scaling_factor).astype('int64'))           
        self.weights_1_2 = transpose(self.weights_1_2)
        
        self.learning_rate = learning_rate
        
        self.layer_0 = np.zeros((1,input_nodes))
        self.layer_1 = np.zeros((1,hidden_nodes))
        
    def sigmoid(self,x):
        return 1 / (1 + np.exp(-x))
    
    
    def sigmoid_output_2_derivative(self,output):
        return output * (1 - output)
    
    def update_input_layer(self,review):

        # clear out previous state, reset the layer to be all 0s
        self.layer_0 *= 0
        for word in review.split(" "):
            self.layer_0[0][self.word2index[word]] = 1

    def get_target_for_label(self,label):
        if(label == 'POSITIVE'):
            return 1
        else:
            return 0
        
    def train(self, training_reviews_raw, training_labels):

        training_reviews = list()
        for review in training_reviews_raw:
            indices = set()
            for word in review.split(" "):
                if(word in self.word2index.keys()):
                    indices.add(self.word2index[word])
            training_reviews.append(list(indices))

        layer_1 = np.zeros_like(self.weights_0_1[0])

        start = time.time()
        correct_so_far = 0
        total_pred = 0.5
        for i in range(len(training_reviews_raw)):
            review_indices = training_reviews[i]
            label = training_labels[i]

            layer_1 *= 0
            for index in review_indices:
                layer_1 += self.weights_0_1[index]
            layer_1 = layer_1 / float(len(review_indices))
            layer_1 = layer_1.astype('int64') # truncate to an integer (astype drops the fractional part)

            layer_2 = sigmoid(innerProd(layer_1,self.weights_1_2[0],M_onehot[len(layer_1) - 2][1],l) / float(scaling_factor))[0:2]

            if(label == 'POSITIVE'):
                layer_2_delta = layer_2 - (c_ones[len(layer_2) - 2] * scaling_factor)
            else:
                layer_2_delta = layer_2

            weights_1_2_trans = transpose(self.weights_1_2)
            layer_1_delta = mat_mul_forward(layer_2_delta,weights_1_2_trans,scaling_factor).astype('int64')

            self.weights_1_2 -= np.array(outer_product(layer_2_delta,layer_1))  * self.learning_rate

            for index in review_indices:
                self.weights_0_1[index] -= (layer_1_delta * self.learning_rate).astype('int64')

            # we're going to decrypt on the fly so we can watch what's happening
            total_pred += (s_decrypt(layer_2)[0] / scaling_factor)
            if((s_decrypt(layer_2)[0] / scaling_factor) >= (total_pred / float(i+2)) and label == 'POSITIVE'):
                correct_so_far += 1
            if((s_decrypt(layer_2)[0] / scaling_factor) < (total_pred / float(i+2)) and label == 'NEGATIVE'):
                correct_so_far += 1

            reviews_per_second = i / float(time.time() - start)

            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews_raw)))[:4] + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
            if(i % 100 == 0):
                print(i)

    
    def test(self, testing_reviews, testing_labels):
        
        correct = 0
        
        start = time.time()
        
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1
            
            reviews_per_second = i / float(time.time() - start)
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                            + " #Correct:" + str(correct) + " #Tested:" + str(i+1) + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")
    
    def run(self, review):
        
        # Input Layer


        # Hidden layer
        self.layer_1 *= 0
        unique_indices = set()
        for word in review.lower().split(" "):
            if word in self.word2index.keys():
                unique_indices.add(self.word2index[word])
        for index in unique_indices:
            self.layer_1 += self.weights_0_1[index]
        
        # Output layer
        layer_2 = self.sigmoid(self.layer_1.dot(self.weights_1_2))
        
        if(layer_2[0] >= 0.5):
            return "POSITIVE"
        else:
            return "NEGATIVE"

Progress:0.0% Speed(reviews/sec):0.0 #Correct:1 #Trained:1 Training Accuracy:100.%0
Progress:0.41% Speed(reviews/sec):1.978 #Correct:66 #Trained:101 Training Accuracy:65.3%100
Progress:0.83% Speed(reviews/sec):2.014 #Correct:131 #Trained:201 Training Accuracy:65.1%200
Progress:1.25% Speed(reviews/sec):2.011 #Correct:203 #Trained:301 Training Accuracy:67.4%300
Progress:1.66% Speed(reviews/sec):2.003 #Correct:276 #Trained:401 Training Accuracy:68.8%400
Progress:2.08% Speed(reviews/sec):2.007 #Correct:348 #Trained:501 Training Accuracy:69.4%500
Progress:2.5% Speed(reviews/sec):2.015 #Correct:420 #Trained:601 Training Accuracy:69.8%600
Progress:2.91% Speed(reviews/sec):1.974 #Correct:497 #Trained:701 Training Accuracy:70.8%700
Progress:3.33% Speed(reviews/sec):1.973 #Correct:581 #Trained:801 Training Accuracy:72.5%800
Progress:3.75% Speed(reviews/sec):1.976 #Correct:666 #Trained:901 Training Accuracy:73.9%900
Progress:4.16% Speed(reviews/sec):1.983 #Correct:751 #Trained:1001 Training Accuracy:75.0%1000
Progress:4.33% Speed(reviews/sec):1.940 #Correct:788 #Trained:1042 Training Accuracy:75.6%
....

Part 10: Advantages over Data Encryption

The most similar approach to this one is to encrypt training data and train neural networks on the encrypted data (accepting encrypted input and predicting encrypted output). This is a fantastic idea. However, it does have a few drawbacks. First and foremost, encrypting the data means that the neural network is completely useless to anyone without the private key for the encrypted data. This makes it impossible for data from different private sources to be trained on the same Deep Learning model. Most commercial applications have this requirement, requiring the aggregation of consumer data. In theory, we'd want every consumer to be protected by their own secret key, but homomorphically encrypting the data requires that everyone use the SAME key.

However, encrypting the network doesn't have this restriction.

With the approach above, you could train a regular, decrypted neural network for a while, encrypt it, and send it with a public key to Party A (who trains it for a while on their own data... which remains in their possession). Then, you could get the network back, decrypt it, re-encrypt it with a different key, and send it to Party B, who does some training on their data. Since the network itself is what's encrypted, you get total control over the intelligence that you're capturing along the way. Party A and Party B would have no way of knowing that they each received the same network, and this all happens without them ever being able to see or use the network on their own data. You, the company, retain control over the IP in the neural network, and each user retains control over their own data.
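The hand-off described above can be sketched with a deliberately insecure toy, in which "encryption" is nothing more than multiplication by a secret integer. This is a stand-in invented for illustration, not the scheme used earlier in this post, and the names `toy_encrypt`, `toy_decrypt`, and `train_step` are hypothetical. It only shows the shape of the workflow: each party updates weights it cannot read, using its own plain data, and the owner re-keys between parties.

```python
# Toy illustration only: multiplying by a secret integer is NOT secure.
# It preserves addition and scaling by plain integers, which is all we need
# to show the workflow.

SCALE = 1000  # fixed-point scaling so weights can be stored as integers

def toy_encrypt(weights, key):
    """'Encrypt' float weights as scaled integers multiplied by a secret key."""
    return [round(w * SCALE) * key for w in weights]

def toy_decrypt(cipher, key):
    """Undo toy_encrypt. Only the key holder can do this."""
    return [c / (key * SCALE) for c in cipher]

def train_step(enc_weights, enc_one, x, y):
    """One update by a party that holds plain data (x, y) but no key.

    enc_one is the owner's encryption of 1.0, so the party can form an
    encrypted error without ever learning the key.
    """
    enc_pred = sum(xi * wi for xi, wi in zip(x, enc_weights))
    enc_error = enc_pred - y * enc_one
    # step size of 1/2, done with integer division so values stay integers
    return [wi - (xi * enc_error) // 2 for xi, wi in zip(x, enc_weights)]

# Owner: encrypt the network and send it to Party A.
start_weights = [0.5, -0.2, 0.1]
key_a = 7919
enc_a = toy_encrypt(start_weights, key_a)
enc_a = train_step(enc_a, toy_encrypt([1.0], key_a)[0], x=[1, 0, 1], y=1)

# Owner: decrypt, then re-encrypt under a different key for Party B.
after_a = toy_decrypt(enc_a, key_a)
key_b = 104729
enc_b = toy_encrypt(after_a, key_b)
sent_to_b = list(enc_b)
enc_b = train_step(enc_b, toy_encrypt([1.0], key_b)[0], x=[0, 1, 1], y=0)
after_b = toy_decrypt(enc_b, key_b)
print(after_a, after_b)
```

Neither party ever handles plain weights, and the ciphertext sent to Party B looks nothing like the one Party A returned, even though both encode the same network.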

Part 11: Future Work

There are faster and more secure homomorphic encryption algorithms. Taking this work and porting it to YASHE is, I believe, a step in the right direction. Perhaps a framework would be appropriate to make encryption easier for the user, as it has a few systemic complications. In general, in order for many of these ideas to reach production level quality, HE needs to get faster. However, progress is happening quickly. I'm sure we'll be there before too long.

Part 12: Potential Applications

Decentralized AI: Companies can deploy models to be trained or used in the field without risking their intelligence being stolen.

Protected Consumer Privacy: the previous application opens up the possibility that consumers could simply hold onto their data, and "opt in" to different models being trained on their lives, instead of sending their data somewhere else. Companies have less of an excuse if their IP isn't at risk via decentralization. Data is power and it needs to go back to the people.

Controlled Superintelligence: The network can become as smart as it wants, but unless it has the secret key, all it can do is predict gibberish.

Tutorial: Deep Learning in PyTorch

Sun, 15 Jan 2017 12:00:00 +0000

EDIT: A complete revamp of PyTorch was released today (Jan 18, 2017), making this blogpost a bit obsolete. I will update this post with a new Quickstart Guide soon, but for now you should check out their documentation.

This Blogpost Will Cover:

Part 1: PyTorch Installation
Part 2: Matrices and Linear Algebra in PyTorch
Part 3: Building a Feedforward Network (starting with a familiar one)
Part 4: The State of PyTorch

Pre-Requisite Knowledge:

Simple Feedforward Neural Networks (Tutorial)
Basic Gradient Descent (Tutorial)

Torch is one of the most popular Deep Learning frameworks in the world, dominating much of the research community for the past few years (only recently being rivaled by major Google sponsored frameworks TensorFlow and Keras). Perhaps its only drawback to new users has been the fact that it requires one to know Lua, a language that used to be very uncommon in the Machine Learning community. Even today, this barrier to entry can seem a bit much for many new to the field, who are already in the midst of learning a tremendous amount without also having to pick up a completely new programming language.

However, thanks to the wonderful and brilliant Hugh Perkins, Torch recently got a new face, PyTorch... and it's much more accessible to the python hacker turned Deep Learning Extraordinaire than its Luariffic cousin. I have a passion for tools that make Deep Learning accessible, and so I'd like to lay out a short "Unofficial Startup Guide" for those of you interested in taking it for a spin. Before we get started, however, a question:

Why Use a Framework like PyTorch? In the past, I have advocated learning Deep Learning using only a matrix library. For the purposes of actually knowing what goes on under the hood, I think that this is essential, and the lessons learned from building things from scratch are real gamechangers when it comes to the messiness of tackling real world problems with these tools. However, when building neural networks in the wild (Kaggle Competitions, Production Systems, and Research Experiments), it's best to use a framework.

Why? Frameworks such as PyTorch allow you (the researcher) to focus exclusively on your experiment and iterate very quickly. Want to swap out a layer? Most frameworks will let you do this with a single line code change. Want to run on a GPU? Many frameworks will take care of it (sometimes with 0 code changes). If you built the network by hand in a matrix library, you might be spending a few hours working out these kinds of modifications. So, for learning, use a linear algebra library (like Numpy). For applying, use a framework (like PyTorch). Let's get started!

For New Readers: I typically tweet out new blogposts when they're complete @iamtrask. Feel free to follow if you'd be interested in reading more in the future and thanks for all the upvotes on Hacker News and Reddit! They mean a lot to me.

Part 1: Installation

Install Torch: The first thing you need to do is install torch and the "nn" package using luarocks. As torch is a very robust framework, the installation instructions should work well for you. After that, you should be able to run:

luarocks install nn

and be good to go. If any of these steps fails to work, copy paste what looks like the "error" and error description (should just be one sentence or so) from the command line and put it into Google (as is common practice when installing).

Clone the Repository: At the time of writing, PyTorch doesn't seem to be in the PyPI repository. So, we'll need to clone the repo in order to install it. Assuming you have git already installed (hint hint... if not go install it and come back), you should open up your Terminal application and navigate to an empty folder. Personally, I have a folder called "Laboratory" in my home directory (i.e. "cd ~/Laboratory/"), which caters to various childhood memories of mine. If you're not sure where to put it, feel free to use your Desktop for now. Once there, execute the commands:

git clone https://github.com/hughperkins/pytorch.git
cd pytorch/
pip install -r requirements.txt
pip install -r test/requirements.txt
source ~/torch/install/bin/torch-activate
./build.sh

For me, this worked flawlessly, finishing with the statement

Finished processing dependencies for PyTorch===4.1.1-SNAPSHOT

If you also see this output at the bottom of your terminal, congratulations! You have successfully installed PyTorch!

Startup Jupyter Notebook: While certainly not a requirement, I highly recommend playing around with this new tool using Jupyter Notebook, which is definitely best installed using conda. Take my word for it. Install it using conda. All other ways are madness.

Part 2: Matrices and Linear Algebra

In the spirit of starting with the basics, neural networks run on linear algebra libraries. PyTorch is no exception. So, the simplest building block of PyTorch is its linear algebra library.

Above, I created 4 matrices. Notice that the library doesn't call them matrices though. It calls them tensors.

Quick Primer on Tensors: A Tensor is just a more generic term than matrix or vector. 1-dimensional tensors are vectors. 2-dimensional tensors are matrices. Tensors with 3 or more dimensions are just referred to as tensors. If you're unfamiliar with these objects, here's a quick summary. A vector is "a list of numbers". A matrix is "a list of lists of numbers". A 3-d tensor is "a list of lists of lists of numbers". A 4-d tensor is... See the pattern? For more on how vectors and matrices are used to make neural networks, see my first blog post on a Neural Network in 11 lines of Python.
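To make the "list of lists of..." pattern concrete, here is a small sketch using plain Python lists rather than PyTorch tensors. The helper `ndim` is a hypothetical name introduced just for this illustration, and it assumes the lists are non-empty and evenly nested.

```python
def ndim(t):
    """Count nesting depth: how many 'list of' levels wrap the numbers."""
    depth = 0
    while isinstance(t, list):
        depth += 1
        t = t[0]  # assumes non-empty, evenly nested lists
    return depth

vector = [1, 2, 3]                                # a list of numbers
matrix = [[1, 2, 3], [4, 5, 6]]                   # a list of lists of numbers
tensor3 = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]    # a list of lists of lists

for name, t in [("vector", vector), ("matrix", matrix), ("tensor3", tensor3)]:
    print(name, "has", ndim(t), "dimension(s)")
```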

PyTorch Tensors: There appear to be 4 major types of tensors in PyTorch: Byte, Float, Double, and Long tensors. Each tensor type corresponds to the type of number (and more importantly the size/precision of the number) contained in each place of the matrix. So, if a 1-d Tensor is a "list of numbers", a 1-d Float Tensor is a list of floats. As a general rule of thumb, for weight matrices we use FloatTensors. For data matrices, we'll likely use either FloatTensors (for real valued inputs) or LongTensors (for integers). I'm a little bit surprised to not see IntTensor anywhere. Perhaps it has yet to be wrapped or is just non-obvious in the API.

The final thing to notice is that the matrices seem to come with lots of rather random looking numbers inside (but not sensibly random like "evenly random between 0 and 1"). This is actually a plus of the library in my book. Let me explain (skip 1 paragraph if you're familiar with this concept).
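The size/precision point can be illustrated without PyTorch at all. The sketch below uses Python's standard `array` module as a stand-in; the mapping of typecodes to the four tensor names is my assumption about the usual meaning of Byte, Float, Double, and Long, not something taken from the PyTorch API.

```python
from array import array

# Typecodes chosen to mirror the four tensor types named above. These are
# stdlib arrays standing in for tensors, not PyTorch objects.
kinds = {
    "Byte":   array("B", [0, 1, 255]),   # unsigned 8-bit integers
    "Float":  array("f", [0.1, 0.2]),    # 32-bit floats
    "Double": array("d", [0.1, 0.2]),    # 64-bit floats
    "Long":   array("q", [-1, 2**40]),   # 64-bit signed integers
}

for name, a in kinds.items():
    print(name, "bytes per element:", a.itemsize)

# Lower precision in action: 0.1 stored in 32 bits is a slightly different
# number from 0.1 stored in 64 bits.
print(kinds["Float"][0] == kinds["Double"][0])
```

Halving the bytes per element halves the memory a weight matrix needs, which is one reason 32-bit floats are the common default for weights.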

Why Seemingly Nonsensical Values: When you create a new matrix in PyTorch, the framework goes and "sets aside" enough RAM to store your matrix. However, "setting aside" memory is completely different from "changing all the values in that memory to 0". "Setting aside" memory while also "changing all the values to 0" is more computationally expensive. It's nice that this library doesn't assume you want the values to be any particular way. Instead, it just sets aside memory, and whatever 1s and 0s happen to be there from the last program that used that piece of RAM will show up in your matrix. In most cases, we're going to set the values in our matrices to be something else anyway (say... our input data values or a specific kind of random number range). So, the fact that it doesn't pre-set the matrix values saves you a bit of processing time, but the user needs to recognize that it's their responsibility to actively choose the values they want in a matrix. Numpy's usual constructors (such as np.zeros) are not like this, which makes Numpy a bit more user friendly but also less computationally efficient.

Basic Linear Algebra Operations: So, now that we know how to store numbers in PyTorch, let's talk a bit about how PyTorch manipulates them. How does one go about doing linear algebra in PyTorch?
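The difference between "setting aside" memory and "setting aside and zeroing" can be seen in NumPy as well, which offers both: `np.zeros` fills the values for you, while `np.empty` only reserves the memory. This is a NumPy sketch of the idea, not PyTorch code.

```python
import numpy as np

# np.empty only reserves memory; its contents are whatever was already there.
scratch = np.empty((2, 3), dtype=np.float32)

# np.zeros reserves memory AND writes 0 into every slot (a little more work).
zeros = np.zeros((2, 3), dtype=np.float32)

# Either way, the safe habit is to set the values explicitly before use,
# here with uniform random numbers between -1 and 1.
scratch[:] = np.random.uniform(-1, 1, size=scratch.shape)
print(scratch)
print(zeros)
```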

The Basic Neural Network Operations: Neural networks, amidst all their complexity, are actually mostly made up of rather simple operations. If you remember from A Neural Network in 11 Lines of Python, we can build a working network with only Matrix-Matrix Multiplication, Vector-Matrix Multiplication, Elementwise Operations (addition, subtraction, multiplication, and division), Matrix Transposition, and a handful of elementwise functions (sigmoid, and a special function to compute sigmoid's derivative at a point which uses only the aforementioned elementwise operations). Let's initialize some matrices and start with the elementwise operations.

Elementwise Operations: Above I have given several examples of elementwise addition (simply replacing the "+" sign with -, *, or / will give you the others). These act mostly how you would expect them to act, with a few hiccups here and there. Notably, if you accidentally add two Tensors that aren't aligned correctly (have different dimensions), the python kernel crashes as opposed to throwing an error. It could be just my machine, but error handling in wrappers is notoriously time consuming to finish. I expect this functionality will be worked out soon enough, and as we will see, there's a suitable substitute.

Vector/Matrix Multiplication: It appears that the native matrix multiplication functionality of Torch isn't wrapped. Instead, we get to use something a bit more familiar. (This feature is really, really cool.) Much of PyTorch can run on native numpy matrices. That's right! The convenient functionality of numpy is now integrated with one of the most popular Deep Learning Frameworks out there. So, how should you really do elementwise and matrix multiplication? Just use numpy! For completeness' sake, here are the same elementwise operations using PyTorch's numpy connector.
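One way to sidestep a hard crash on mismatched shapes is to check them yourself before operating. Below is a minimal sketch in plain NumPy; `checked_add` is a hypothetical helper name, and it is stricter than NumPy's own rules, which would quietly broadcast some mismatched shapes.

```python
import numpy as np

def checked_add(a, b):
    """Elementwise add that fails loudly when the shapes differ."""
    if a.shape != b.shape:
        raise ValueError("shape mismatch: %s vs %s" % (a.shape, b.shape))
    return a + b

a = np.array([[1., 2.], [3., 4.]], dtype=np.float32)
b = np.array([[10., 20.], [30., 40.]], dtype=np.float32)

print(checked_add(a, b))   # elementwise addition
print(a - b)               # elementwise subtraction
print(a * b)               # elementwise multiplication
print(a / b)               # elementwise division
```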

And now let's see how to do the basic matrix operations we need for a feedforward network.
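As a rough illustration in plain NumPy (the values and variable names here are made up for the example), the three matrix operations a feedforward network needs look like this:

```python
import numpy as np

x = np.array([1., 0., 1.])              # one input row: a vector of 3 features
X = np.array([[0., 0., 1.],
              [1., 1., 1.]])            # two input rows stacked into a 2x3 matrix
W = np.array([[ 0.5,  -0.5],
              [ 0.25,  0.75],
              [-1.0,   1.0]])           # 3x2 weight matrix

vec_mat = x.dot(W)    # vector-matrix multiplication, result has shape (2,)
mat_mat = X.dot(W)    # matrix-matrix multiplication, result has shape (2, 2)
W_t = W.T             # transposition swaps the axes, result has shape (2, 3)

print(vec_mat)
print(mat_mat)
print(W_t.shape)
```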

Other Neural Functions: Finally, we also need to be able to compute some nonlinearities efficiently. There are both numpy and native wrappers made available which seem to run quite fast. Additionally, sigmoid has a native implementation (something that numpy does not implement), which is quite nice and a bit faster than computing it explicitly in numpy.
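For reference, computing sigmoid "explicitly in numpy" takes only a line or two. The sketch below also includes the derivative written in terms of sigmoid's own output, which is the form used in the numpy network later in this post; the function names are my own.

```python
import numpy as np

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_output_to_derivative(out):
    """Slope of sigmoid, given sigmoid's output rather than its input."""
    return out * (1.0 - out)

z = np.array([-2.0, 0.0, 2.0])
out = sigmoid(z)
slope = sigmoid_output_to_derivative(out)
print(out)
print(slope)
```

The slope is largest when the output is 0.5 and shrinks as the output approaches 0 or 1, which is why confident predictions get smaller updates.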

I consider the fantastic integration between numpy and PyTorch to be one of the great selling points of this framework. I personally love prototyping with the full control of a matrix library, and PyTorch really respects this preference as an option. This is so nice relative to most other frameworks out there. +1 for PyTorch.

Part 3: Building a Feedforward Network

In this section, we really get to start seeing PyTorch shine. While understanding how matrices are handled is an important pre-requisite to learning a framework, the various layers of abstraction are where frameworks really become useful. In this section, we're going to take the bare bones 3 layer neural network from a previous blogpost and convert it to a network using PyTorch's neural network abstractions. In this way, as we wrap each part of the network with a piece of framework functionality, you'll know exactly what PyTorch is doing under the hood. Your goal in this section should be to relate the PyTorch abstractions (objects, function calls, etc.) in the PyTorch network we will build with the matrix operations in the numpy neural network (shown below).

import PyTorch
from PyTorch import np

def nonlin(x,deriv=False):
    if(deriv==True):
        return x*(1-x)

    return 1/(1+np.exp(-x))
    
X = np.array([[0,0,1],
              [0,1,1],
              [1,0,1],
              [1,1,1]])
                
y = np.array([[0],
              [1],
              [1],
              [0]])

np.random.seed(1)

# randomly initialize our weights with mean 0
syn0 = 2*np.random.random((3,4)) - 1
syn1 = 2*np.random.random((4,1)) - 1

for j in range(60000):

    # Feed forward through layers 0, 1, and 2
    l0 = X
    l1 = nonlin(np.dot(l0,syn0))
    l2 = nonlin(np.dot(l1,syn1))

    # how much did we miss the target value?
    l2_error = y - l2
    
    if (j% 10000) == 0:
        print("Error:" + str(np.mean(np.abs(l2_error))))
        
    # in what direction is the target value?
    # were we really sure? if so, don't change too much.
    l2_delta = l2_error*nonlin(l2,deriv=True)

    # how much did each l1 value contribute to the l2 error (according to the weights)?
    l1_error = l2_delta.dot(syn1.T)
    
    # in what direction is the target l1?
    # were we really sure? if so, don't change too much.
    l1_delta = l1_error * nonlin(l1,deriv=True)

    # lets update our weights
    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)

Error:0.496410031903
Error:0.00858452565325
Error:0.00578945986251
Error:0.00462917677677
Error:0.00395876528027
Error:0.00351012256786

Now that we've seen how to build this network (more or less "by hand"), let's start building the same network using PyTorch instead of numpy.

import PyTorch
from PyTorchAug import nn
from PyTorch import np

First, we want to import several packages from PyTorch. np is the numpy wrapper mentioned before. nn is the Neural Network package, which contains things like layer types, error measures, and network containers, as we'll see in a second.

# randomly initialize our weights with mean 0
net = nn.Sequential()
net.add(nn.Linear(3, 4))
net.add(nn.Sigmoid())
net.add(nn.Linear(4, 1))
net.add(nn.Sigmoid())
net.float()

The next section highlights the primary advantage of deep learning frameworks in general. Instead of declaring a bunch of weight matrices (like with numpy), we create layers and "glue" them together using nn.Sequential(). Contained in these "layer" objects is logic about how the layers are constructed, how each layer forward propagates predictions, and how each layer backpropagates gradients. nn.Sequential() knows how to combine these layers together to allow them to learn together when presented with a dataset, which is what we'll do next.
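To see what "layers glued together" means in terms of the matrix operations from the numpy network, here is a minimal numpy sketch of the idea. The classes below are hypothetical stand-ins that loosely mirror the names nn.Linear, nn.Sigmoid, and nn.Sequential; they are not PyTorch's implementation, and details such as the weight initialization are assumptions.

```python
import numpy as np

class Linear:
    """Holds a weight matrix and bias; knows how to go forward and backward."""
    def __init__(self, n_in, n_out, rng):
        self.W = rng.uniform(-1, 1, (n_in, n_out))
        self.b = np.zeros(n_out)
    def forward(self, x):
        self.x = x                       # remember the input for backward()
        return x.dot(self.W) + self.b
    def backward(self, grad_out):
        self.gW = self.x.T.dot(grad_out)
        self.gb = grad_out.sum(axis=0)
        return grad_out.dot(self.W.T)    # gradient w.r.t. this layer's input
    def update(self, lr):
        self.W -= lr * self.gW
        self.b -= lr * self.gb

class Sigmoid:
    def forward(self, x):
        self.out = 1.0 / (1.0 + np.exp(-x))
        return self.out
    def backward(self, grad_out):
        return grad_out * self.out * (1.0 - self.out)
    def update(self, lr):
        pass                             # no parameters to update

class Sequential:
    """Chains layers: forward in order, backward in reverse order."""
    def __init__(self):
        self.layers = []
    def add(self, layer):
        self.layers.append(layer)
    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x
    def backward(self, grad_out):
        for layer in reversed(self.layers):
            grad_out = layer.backward(grad_out)
        return grad_out
    def update(self, lr):
        for layer in self.layers:
            layer.update(lr)

rng = np.random.RandomState(1)
net = Sequential()
net.add(Linear(3, 4, rng))
net.add(Sigmoid())
net.add(Linear(4, 1, rng))
net.add(Sigmoid())

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
output = net.forward(X)
grad_input = net.backward(np.ones_like(output))
print(output.shape, grad_input.shape)
```

Each layer only needs to know its own forward and backward rule; the container handles the ordering, which is what makes swapping a layer a one-line change.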

X = np.array([[0,0,1],
            [0,1,1],
            [1,0,1],
            [1,1,1]]).astype('float32')
                
y = np.array([[0],
              [1],
              [1],
              [0]]).astype('float32')

This section is largely the same as before. We create our input (X) and output (y) datasets as numpy matrices. PyTorch seemed to want these matrices to be float32 values in order to do the implicit cast from numpy to PyTorch tensor objects well, so I added an .astype('float32') to ensure they were the right type.

crit = nn.MSECriterion()
crit.float()

This one might look a little strange if you're not familiar with neural network error measures. As it turns out, you can measure "how much you missed" in a variety of different ways. How you measure error changes how a network prioritizes different errors when training (what kinds of errors should it take most seriously). In this case, we're going to use the "Mean Squared Error". For a more in-depth coverage of this, please see Chapter 4 of Grokking Deep Learning.
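A quick numeric comparison shows how the choice of error measure changes what the network "takes seriously". This numpy sketch compares mean squared error with mean absolute error (the second measure is my choice for contrast, it is not used in this post) on two made-up sets of predictions.

```python
import numpy as np

def mean_squared_error(pred, target):
    return np.mean((pred - target) ** 2)

def mean_absolute_error(pred, target):
    return np.mean(np.abs(pred - target))

target = np.array([1.0, 1.0, 1.0, 1.0])
many_small = np.array([0.8, 0.8, 0.8, 0.8])   # four misses of 0.2 each
one_large = np.array([1.0, 1.0, 1.0, 0.2])    # a single miss of 0.8

for name, pred in [("many small", many_small), ("one large", one_large)]:
    print(name,
          "MAE:", mean_absolute_error(pred, target),
          "MSE:", mean_squared_error(pred, target))
```

Both sets of predictions have the same mean absolute error, but the mean squared error of the single large miss is four times higher, so a network trained with it will work hardest on its biggest mistakes.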



for j in range(2400):
    
    net.zeroGradParameters()

    # Feed forward through layers 0, 1, and 2
    output = net.forward(X)
    
    # how much did we miss the target value?
    loss = crit.forward(output, y)
    gradOutput = crit.backward(output, y)
    
    # how much did each l1 value contribute to the l2 error (according to the weights)?
    # in what direction is the target l1?
    # were we really sure? if so, don't change too much.
    gradInput = net.backward(X, gradOutput)
    
    # lets update our weights
    net.updateParameters(1)
    
    if (j% 200) == 0:
        print("Error:" + str(loss))

And now for the training of the network. I have annotated each section of the code with near identical annotations as the numpy network. In this way, if you look at them side by side, you should be able to see where each operation in the numpy network occurs in the PyTorch network.

One part might not look familiar. The "net.zeroGradParameters()" basically just zeros out the stored gradients for all of our weights before a new iteration. In our numpy network, the closest counterparts are the weight updates computed from the l2_delta and l1_delta variables. PyTorch re-uses the same memory allocations each time you forward propagate / back propagate (to be efficient, similar to what was mentioned in the Matrices section), so in order to keep from accidentally re-using the gradients from the previous iteration, you need to re-set them to 0. This is also a standard practice for most popular deep learning frameworks.

Finally, Torch also separates your "loss" from your "gradient". In our (somewhat oversimplified) numpy network, we just computed an "error" measure. As it turns out, your pure "error" and "delta" are actually slightly different measures (delta is the derivative of the error). Again, for deeper coverage, see Chapter 4 of GDL.
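Why stale gradients matter can be shown with a toy in numpy. The sketch assumes, as Torch-style frameworks generally do, that a backward pass adds into a preallocated buffer rather than overwriting it; `backward` here is a hypothetical stand-in, not a PyTorch call.

```python
import numpy as np

grad_buffer = np.zeros(3)    # allocated once, reused on every iteration

def backward(buffer, new_grad):
    """Stand-in for a backward pass that ADDS into a preallocated buffer."""
    buffer += new_grad

# Without zeroing: the second iteration's gradient is polluted by the first.
backward(grad_buffer, np.array([1.0, 2.0, 3.0]))
backward(grad_buffer, np.array([0.5, 0.5, 0.5]))
stale = grad_buffer.copy()

# With zeroing before the backward pass: only the current gradient remains.
grad_buffer[:] = 0
backward(grad_buffer, np.array([0.5, 0.5, 0.5]))
fresh = grad_buffer.copy()

print("without zeroing:", stale)
print("with zeroing:   ", fresh)
```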
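The relationship between the two can be checked numerically. In this numpy sketch, `mse_loss` and `mse_grad` are my own illustrative functions (assuming the loss is averaged over all output elements), and the output values are made up. The loss is a single number, the gradient has one entry per output, and nudging an output slightly changes the loss at the rate the gradient predicts.

```python
import numpy as np

def mse_loss(output, target):
    """A single number: how badly we missed overall."""
    return np.mean((output - target) ** 2)

def mse_grad(output, target):
    """Derivative of mse_loss with respect to each output element."""
    return 2.0 * (output - target) / output.size

output = np.array([[0.2], [0.7], [0.6], [0.1]])
target = np.array([[0.0], [1.0], [1.0], [0.0]])

loss = mse_loss(output, target)
grad = mse_grad(output, target)

# Check the first gradient entry by nudging that output a tiny amount.
eps = 1e-6
nudged = output.copy()
nudged[0, 0] += eps
numeric = (mse_loss(nudged, target) - loss) / eps

print("loss:", loss)
print("analytic gradient:", grad.ravel())
print("numeric estimate for first entry:", numeric)
```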

Putting it all together

import PyTorch
from PyTorchAug import nn
from PyTorch import np

# randomly initialize our weights with mean 0
net = nn.Sequential()
net.add(nn.Linear(3, 4))
net.add(nn.Sigmoid())
net.add(nn.Linear(4, 1))
net.add(nn.Sigmoid())
net.float()

X = np.array([[0,0,1],
            [0,1,1],
            [1,0,1],
            [1,1,1]]).astype('float32')
                
y = np.array([[0],
              [1],
              [1],
              [0]]).astype('float32')

crit = nn.MSECriterion()
crit.float()

for j in range(2400):
    
    net.zeroGradParameters()

    # Feed forward through layers 0, 1, and 2
    output = net.forward(X)
    
    # how much did we miss the target value?
    loss = crit.forward(output, y)
    gradOutput = crit.backward(output, y)
    
    # how much did each l1 value contribute to the l2 error (according to the weights)?
    # in what direction is the target l1?
    # were we really sure? if so, don't change too much.
    gradInput = net.backward(X, gradOutput)
    
    # lets update our weights
    net.updateParameters(1)
    
    if (j% 200) == 0:
        print("Error:" + str(loss))

Error:0.2521711587905884
Error:0.2500123083591461
Error:0.249952495098114
Error:0.24984735250473022
Error:0.2495250701904297
Error:0.2475520819425583
Error:0.22693687677383423
Error:0.13267411291599274
Error:0.04083901643753052
Error:0.016316475346684456
Error:0.008736669085919857
Error:0.005575092509388924

Your results may vary a bit. I do not yet see how random numbers are to be seeded. If I come across that in the future, I'll add an edit.

Part 4: The State of PyTorch

While still a new framework with lots of ground to cover to close the gap with its competitors, PyTorch already has a lot to offer. It looks like there's an LSTM test case in the works, and strong promise for building custom layers in .lua files that you can import into Python with some simple wrapper functions. If you want to build feedforward neural networks using the industry standard Torch backend without having to deal with Lua, PyTorch is what you're looking for. If you want to build custom layers or do some heavy sequence2sequence models, I think the framework will be there very soon (with documentation / test cases to describe best practices). Overall, I'm very excited to see where this framework goes, and I encourage you to Star/Follow it on Github

For New Readers: I typically tweet out new blogposts when they're complete at @iamtrask. Feel free to follow if you'd be interested in reading more in the future and thanks for all the upvotes on hacker news and reddit!

Grokking Deep Learning

Wed, 17 Aug 2016 12:00:00 +0000

If you passed high school math and can hack around in Python, I want to teach you Deep Learning.

Edit: 50% Coupon Code: "mltrask" (expires August 26)

I've decided to write a Deep Learning book in the same style as my blog, teaching Deep Learning from an intuitive perspective, all in Python, using only numpy. I wanted to make the lowest possible barrier to entry to learn Deep Learning.

Here's what you need to know:

• High School Math (basic algebra)
• Python... the basics

The Problem with most entry level Deep Learning resources these days is that they either assume advanced knowledge of Calculus, Linear Algebra, Differential Equations, and perhaps even Convex Optimization, or they just teach a "black box" framework like Torch, Keras, or TensorFlow (where you just hit "train" but you don't actually know what's going on under the hood). Both have their appropriate audience, but I don't believe that either are appropriate for your average python hacker looking for a 101 on the fundamentals.

My Solution is to teach Deep Learning from an intuitive standpoint, just like I've done in the other posts on this blog. Everything you need to know to understand Deep Learning will be explained like you would to a 5 year old, including the bits and pieces of Linear Algebra and Calculus that are necessary. You'll learn how neural networks work, and how to use them to classify images, understand language (including machine translation), and even play games.

At the time of writing, I think that this is the only Deep Learning resource that is taught this way. I hope you enjoy it.

Who This Book Is NOT For: people who would rather be taught using formulas. Individuals with advanced mathematical backgrounds should choose another resource. This book is for an introduction to Deep Learning. It's about lowering the barrier to entry.

Click To See the Early Release Page (where the first three chapters are)

How to Code and Understand DeepMind's Neural Stack Machine

Thu, 25 Feb 2016 12:00:00 +0000

Summary: I learn best with toy code that I can play with. This tutorial teaches DeepMind's Neural Stack machine via a very simple toy example, a short python implementation. I will also explain my thought process along the way for reading and implementing research papers from scratch, which I hope you will find useful.

I typically tweet out new blogposts when they're complete at @iamtrask. Feel free to follow if you'd be interested in reading more in the future and thanks for all the feedback!

Part 1: What is a Neural Stack?

A Simple Stack

Let's start with the definition of a regular stack before we get to a neural one. In computer science, a stack is a type of data structure. Before I explain it, let me just show it to you. In the code below, we "stack" a bunch of harry potter books on an (ascii art) table.

Just like in the example above, picture yourself stacking Harry Potter books onto a table. A stack is pretty much the same as a list with one exception: you can't add/remove a book to/from anywhere except the top. So, you can add another to the top ( stack.push(book) ) or you can remove a book from the top ( stack.pop() ), however you can't do anything with the books in the middle. Pushing is when we add a book to the top. Popping is when we remove a book from the top (and perhaps do something with it :) )
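A minimal sketch of such a stack follows. The class name `VerySimpleStack` and the `contents` attribute follow the names used later in this post; the method bodies and the book titles are my own assumption of what a bare-bones version looks like.

```python
class VerySimpleStack:
    """A plain stack: items go on and come off at the top only."""

    def __init__(self):
        self.contents = []

    def push(self, item):
        # Add an item to the top of the stack.
        self.contents.append(item)

    def pop(self):
        # Remove and return the item on top of the stack.
        return self.contents.pop()

stack = VerySimpleStack()
stack.push("Philosopher's Stone")
stack.push("Chamber of Secrets")
stack.push("Prisoner of Azkaban")

top_book = stack.pop()  # the last book pushed is the first one off
```

Notice that there is no method for reaching into the middle of `contents`. That restriction is the whole point of a stack.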

A Neural Stack

A close eye might ask, "We learn things with neural networks. What is there to learn with a data structure? Why would you learn how to do what you can easily code?" A neural stack is still just a stack. However, our neural network will learn how to use the stack to implement an algorithm. It will learn when to push and pop to correctly model output data given input data.

How will a neural network learn when to push and pop?

A neural network will learn to push and pop using backpropagation. Certainly a pre-requisite to this blogpost is an intuitive understanding of neural networks and backpropagation in general. Everything in this blogpost will be enough.

So, how will a neural network learn when to push and pop? To answer this question, we need to understand what a "correct sequence" of pushing and popping would look like. And that's right... it's a "sequence" of pushing and popping. So, that means that our input data and our correct output data will both be sequences. So, what kinds of sequences are stacks good at modeling?

When we push a sequence onto a stack and then pop that sequence off of the stack, the sequence pops off in reverse order to the original sequence that was pushed. So, if you have a sequence of 6 numbers, pushing 6 times and then popping 6 times is the correct sequence of pushing and popping to reverse a list.
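Here is that reversal written out with an ordinary Python list acting as the stack. The function name `reverse_with_stack` is just for illustration.

```python
def reverse_with_stack(sequence):
    """Push every item, then pop every item; the pops come out reversed."""
    stack = []
    for item in sequence:
        stack.append(item)                  # push
    reversed_items = []
    while stack:
        reversed_items.append(stack.pop())  # pop
    return reversed_items

result = reverse_with_stack([1, 2, 3, 4, 5, 6])
```

Six pushes followed by six pops is the "correct sequence" of operations that we would like a neural network to discover on its own.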

...so What is a Neural Stack?

A Neural Stack is a stack that can learn to correctly accept a sequence of inputs, remember them, and then transform them according to a pattern learned from data.

...and How Does It Learn?

A Neural Stack learns by:

1) accepting input data, pushing and popping it according to when a neural network says to push and pop. This generates a sequence of output data (predictions).

2) Comparing the output data to the correct output data to see how much the neural stack "missed".

3) Updating the neural network to more correctly push and pop next time. (using backpropagation)

... so basically... just like every other neural network learns...

And now for the money question...

Money Question: How does backpropagation learn to push and pop when the error is on the output of the stack and the neural network is on the input to the stack? Normally we backpropagate the error from the output of the network to the weights so that we can make a weight update. It seems like the stack is "blocking" the output from the decision making neural network (which controls the pushing and popping).

Money Answer: We make the neural stack "differentiable". If you haven't had calculus, the simplest way to think about it is that we will make the "neural stack" using a sequence of vector additions, subtractions, and multiplications. If we can figure out how to mimic the stack's behaviors using only these tools, then we will be able to backpropagate the error through the stack just like we backpropagate it through a neural network's hidden layers. And it will be quite familiar to us! We're already used to backpropagating through sequences of additions, subtractions, and multiplications. Figuring out how to mimic the operations of a stack in a fully differentiable way was the hard part... which is why Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom are so brilliant!!!

Part 2: Reading and Implementing Academic Papers

Where To Start....

As promised, I want to give a bit of "meta-learning" regarding how to approach implementing academic papers. So, pop open this paper and have a look around. As a disclaimer, there is no correct way to read academic papers. I wish only to share how I approached this one and why. Feel free to take any/all of it with a grain of salt. If you have lessons to add from experience, please comment on the hacker news or reddit posts if you came from there... or feel free to tweet @iamtrask. I'm happy to retweet good advice on this topic.

First Pass: Most people I know start by just reading a paper start to finish. Don't try to understand everything. Just get the high level goal of what's being accomplished, the key vocabulary terms involved, and a sense of the approach. Don't worry too much about formulas. Take time to look at pictures and tables. This paper has lots of good ones, which is helpful too. :) If this paper were about how to build a car, this first pass is just about learning "We're going to build a driving machine. It's going to be able to move and turn at 60 miles per hour down a curvy road. It runs on gasoline and has wheels. I think it will be driven by a human being." Don't worry about the alternator, transmission, or spark plugs... and certainly not the optimal temperature for combustion. Just get the general idea.

Second Pass: For the second pass, if you feel like you understand the background (which is always the first few sections... commonly labeled "Introduction" and "Related Work"), jump straight to the approach. In this paper the approach section starts with "3 Models" at the bottom of page 2. For this section, read each sentence slowly. These sections are almost always extremely dense. Each sentence is crafted with care, and without an understanding of each sentence in turn, the next might not make sense. At this point, still don't worry too much about the details of the formulas. Instead, just get an idea of the "major moving parts" in the algorithm. Focus on the what not the how. Again, if this were about building a car, this is about making a list of what each part is called and what it generally does like below...

Part Name	Variable	Description When First Reading
"The Memory"	V_t	Sort of like "self.contents" in our VerySimpleStack. This is where our stuff goes. More specifically, this is the state of our stack at timestep "t".
"The Controller"	?	The neural network that decides when to push or pop.
"Pop Signal"	u_t	How much the controller wants to pop.
"Push Signal"	d_t	How much the controller wants to push.
"Strength Signal"	s_t	Given u_t and d_t are real valued, it seems like we can push on or pop off "parts" of objects... This vector seems to keep up with how much of each variable we still have in V (or V_t really).
"Read Value"	r_t	This seems to be made by combining s_t and V_t somehow.... so some sort of weighted average of what's on the stack... interesting....
"Time"	_t	This is attached to many of the variables... I think it means the state of that variable at a specific timestep in the sequence.

As a sidenote, this is also a great time to create some mental mnemonics to remember which variable is which. In this case, since the "u" in "u_t" is open at the top... I thought that it looked like it has been "popped" open. It's also the "pop signal". In contrast, "d_t" is closed on top and is the "push signal". I found that this helped later when trying to read the formulas intuitively (which is the next step). If you don't know the variables by heart, it's really hard to figure out how they relate to each other in the formulas.

N More Passes: At this point, you just keep reading the method section until you have your working implementation (which you can evaluate using the later sections). So, this is generally how to read the paper. :)

Have Questions? Stuck? Feel free to tweet your question @iamtrask for help.

Part 3: Building a Toy Neural Stack

Where To Start....

Ok, so we have a general idea of what's going on. Where do we start? I'm always tempted to just start coding the whole big thing but inevitably I get halfway through with bugs I'll never find again. So, a word from experience, break down each project into distinct, testable sections. In this case, the smallest testable section is the "stack mechanism" itself. Why? Well, the bottom of page 5 gives it away "the three memory modules... contain no tunable parameters to optimize during training". In a word, they're deterministic. To me, this is always the easiest place to start. Debugging something with deterministic, constant behavior is always easier than debugging something you learn/optimize/constantly changes. Furthermore, the stack logic is at the core of the algorithm. Even better, Figure 1 section (a) gives its expected behavior which we can use as a sort of "unit test". All of these things make this a great place to start. Let's jump right into understanding this portion by looking at the diagram of the stack's architecture.

What I'm Thinking When I See This: Ok... so we can push green vectors v_t onto the stack. Each yellow bubble on the right of each green v_t looks like it corresponds to the weight of v_t in the stack... which can be 0 apparently (according to the far right bubble). So, even though there isn't a legend, I suspect that the yellow circles are in fact s_t. This is very useful. Furthermore, it looks like the graphs go from left to right. So, this is 3 timesteps. t=1, t=2, and t=3. Great. I think I see what's going on.

What I'm Thinking: Ok. I can line up the formulas with the picture above. The first formula sets each row of V_t which are the green bubbles. The second formula determines each row of s_t, which are the yellow bubbles. The final formula doesn't seem to be pictured above but I can see what its values are given the state of the stack. (e.g. r_3 = 0.9 * v_3 + 0 * v_2 + 0.1 * v_1) Now, what are these formulas really saying?

So, what we're going to try to do here is "tell the story" of each part of each formula in our head. Let's start with the first formula (1) which seems the least intimidating.

See the part circled in blue above? Notice that it's indexed by two numbers, t, and i. I actually overlooked this at first and it came back to bite me. V_t is the state of our stack's memory at time t. We can see its state in the picture at t=1, t=2, and t=3. However, at each timestep, the memory can have more than one value v_t inside of it! This will have significant implications for the code later (and the memory overhead).

Bottom Line: If V_t is the state of the stack's memory at time "t", then V is a list of ALL of the memory states the stack goes through (at every timestep). So, V is a list of lists of vectors. (a vector is a list of numbers). In your head, you can think of this as the shape of V. Make sure you can look away and tell yourself what the shapes of V, V_t, and V_t[i] are. You'll need that kind of information on the fly when you're reading the rest of the paper.
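If it helps to see those shapes concretely, here is a small sketch with nested Python lists and numpy vectors. The width of 3 and the particular vectors are assumptions for illustration, and the indices are 0-based because this is Python.

```python
import numpy as np

stack_width = 3  # assumed length of each vector stored on the stack

v_0 = np.array([1.0, 0.0, 0.0])
v_1 = np.array([0.0, 1.0, 0.0])

# V[t] is the memory at timestep t; with 0-based t it holds t + 1 vectors.
V = [
    [v_0],        # V[0]: one row after the first push
    [v_0, v_1],   # V[1]: two rows after the second push
]

memory_at_t1 = V[1]   # a list of vectors (think: a matrix)
row = V[1][0]         # a single vector of length stack_width
```

So `V` is a list of lists of vectors, `V[t]` is a list of vectors, and `V[t][i]` is one vector.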

Ok, so this next section defines a conditional function. In this case, that means that the function is different depending on what value "i" takes on. If our variable "i" is greater than or equal to 1 AND is less than t, then V_t[i] = V_t-1[i]. If, on the other hand, "i" equals "t", then V_t[i] = v_t.

So, what story is this telling? What is really going on? Well, what are we combining to make V_t[i]? We're either using V_t-1 or v_t. That's interesting. Also, since "i" determines each row of the stack, and we've got "if" statements depending on "i", that means that different rows of the stack are created differently. If "i" == "t"... which would be the newest member of the stack... then it's equal to some variable v_t. If not, however, then it seems like it equals whatever the previous timestep equaled at that row. Eureka!!!

So, all this formula is really saying is... each row of V_t is the same as it was in the previous timestep EXCEPT for the newest row v_t... which we just added! This makes total sense given that we're building a stack! Also, when we look at the picture again, we see that each timestep adds a new row... which is curious.

Interesting... so we ALWAYS add a row. We ALWAYS add v_t. That's very interesting. This sounds like we're always pushing. V_t has t rows. That must be why "i" can range from 1 to "t". We have "t" rows for "i" to index.
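Formula (1), as described above, fits in a couple of lines. This is a sketch under my own naming (`next_memory` is not from the paper), using 0-based timesteps.

```python
import numpy as np

def next_memory(V_prev, v_t):
    """Formula (1): keep every old row unchanged, then append v_t on top."""
    return [row.copy() for row in V_prev] + [np.asarray(v_t, dtype=float)]

V_0 = next_memory([], [1.0, 0.0, 0.0])
V_1 = next_memory(V_0, [0.0, 1.0, 0.0])
V_2 = next_memory(V_1, [0.0, 0.0, 1.0])
```

Every call adds exactly one row, whether or not the controller "really" wanted to push. How much that row counts is decided separately by the strengths.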

Take a deep breath. This formula is actually pretty simple, although there are a few things to note that could trip us up in the implementation. First, "i" doesn't seem to be defined at 0 (by "not defined" I mean that they didn't tell us what to do if i == 0... so what do we do?). To me, this means that the original implementation was probably written in Torch (Lua) instead of Theano (Python) because "i" seems to range from 1 to t (as opposed to 0 to t). "i" is an index into an array. In Lua, the first value in an array or list is at index 1. In Python (the language we're prototyping in), the first value in an array or list is at index 0. Thus, we'll need to compensate for the fact that we're coding this network in a different language from the original by subtracting 1 from each index. It's a simple fix but perhaps easy to miss.

So, now that we finished the first formula, can you tell what shape s_t[i] is going to be? It's also indexed by both t and i. However, there's a key difference. The "s" is lowercased which means that s_t is a vector (whereas V_t was a list of vectors... e.g. a matrix). Since s_t is a vector, then s_t[i] is a value from that vector. So, what's the shape of "s"? It's a list of "t" vectors. It's a matrix. (I suppose V is technically a strange kind of tensor.)


def s_t(i,t,u,d):
    if(i >= 0 and i < t):
        inner_sum = sum(s[t-1][i+1:t])
        out = max(0,s[t-1][i] - max(0,u[t] - (inner_sum)))
        return out
    elif(i == t):
        return d[t]
    else:
        print "Undefined i -> t relationship"

When we write an exact representation of the function in code, we get the function above.

This should be very familiar. Just like the first formula (1), formula (2) is also a conditional function based on "i" and "t". They're the same conditions so I suppose there's no additional explanation here. Note that the same 1 vs 0 indexing discrepancy between Lua and Python applies here. In the code, this blue circle is modeled on line 02.

So, we have two conditions that are identical as before. This means that the bottom part of the function (circled in blue) is only true if "i" == "t". "i" only equals "t" when we're talking about the row corresponding to the newest vector on our stack. So, what's the value of s_t for the newest member of our stack? This is where we remember back to the definitions we wrote out earlier. s_t was the strength/weight of each vector on the stack. d_t was our pushing weight. s_t is the current strength. d_t is the weight we pushed it on the stack with.

Pause here... try to figure it out for yourself. What is the relationship between s_t[i] and d_t when i == t?

Aha! This makes sense! For the newest vector that we just put on the stack (V_t[t]), it is added to the stack with the same weight (s_t[i]) that we push it onto the stack with (d_t). That's brilliant! This also answers the question of why we push every time! "not pushing" just means "pushing" with a weight equal to 0! (e.g. d_t == 0) If we push with d_t equal to zero then the weight of that vector on the stack is also equal to 0. This section is represented on line 07 in the code.


def s_t(i,t,u,d):
    if(i >= 0 and i < t):
        inner_sum = sum(s[t-1][i+1:t])
        out = max(0,s[t-1][i] - max(0,u[t] - (inner_sum)))
        return out
    elif(i == t):
        return d[t]
    else:
        print "Undefined i -> t relationship"

Ok, now we're getting to the meat of a slightly more complicated formula. So, we're going to break it into slightly smaller parts. I'm also re-printing the picture below to help you visualize the formula. This sum circled in blue sums from i+1 to t-1. Intuitively this is equivalent to summing "all of the weights between s_t[i] and the top of the stack". Stop here. Make sure you have that in your head.

Why t-1? Well, s_t-1 only runs to t-1. s_t-1[t] would overflow.

Why i+1? Well, think about being at the bottom of the ocean. Imagine that s_t is measuring the amount of water at each level in the ocean. Imagine then that the sum circled in blue is measuring "the weight of the water above me". I don't want to include the water even with me when measuring the "sum total amount of water between me and the ocean's surface." Perhaps I only want to measure the water that's "between me and the surface". That's why we start from i+1. I know that's kindof a silly analogy, but that's how I think about it in my head.

So, what's circled in blue is "the sum total amount of weight between the current strength and the top of the stack". We don't know what we're using it for yet, but just remember that for now.


def s_t(i,t,u,d):
    if(i >= 0 and i < t):
        inner_sum = sum(s[t-1][i+1:t])
        out = max(0,s[t-1][i] - max(0,u[t] - (inner_sum)))
        return out
    elif(i == t):
        return d[t]
    else:
        print "Undefined i -> t relationship"

In the code, this blue circle is represented on line 03. It's stored in the variable "inner_sum".
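To make the slice concrete, here is the same sum on a small, made-up strength vector. The numbers and the helper name `weight_above` are assumptions for illustration only.

```python
# Strengths at the previous timestep, bottom of the stack first.
# These values are made up for illustration.
s_prev = [0.7, 0.5, 0.2, 0.4]
t = len(s_prev)  # with 0-based indexing, s_prev holds rows 0 .. t-1

def weight_above(i):
    """Sum of the strengths strictly above row i, i.e. rows i+1 .. t-1."""
    return sum(s_prev[i + 1:t])

above_bottom = weight_above(0)  # 0.5 + 0.2 + 0.4
above_top = weight_above(3)     # nothing sits above the top row
```

The slice starts at `i + 1`, so a row's own strength is never counted as being "above" it.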

Look below at the circle in the next image. This is only a slight modification to the previous circle. So, if the previous circle was "the sum total amount of weight between the current strength and the top of the stack", this is "u_t" minus that weight. Remember what u_t was? It's our pop weight! So, this circle is "the amount we want to pop MINUS the weight between the current index and the top of the stack". Stop here and make sure you can picture it.

What does the blue circle above mean intuitively? Let's try the ocean floor analogy again except this time the ocean is dirt. You're buried alive. The sum from before is the sum total amount of weight above you. It's the sum total of all the dirt above you. u_t is the amount of dirt that can be "popped" off. So, if you're popping more than the amount of dirt above you, then this blue circle returns a positive number. You'll be uncovered! You'll be free! However, if u_t is smaller than the amount of dirt above you, then you'll still be buried. The circle will return "-1 * amount_of_dirt_still_above_me". This circle determines how deep in the dirt you are after dirt was popped off. Negative numbers mean you're still buried. Positive numbers mean you're above the ground. The next circle will reveal more about this. Stop and make sure you've got this in your head.

Now instead of picturing yourself as being under the ground, picture yourself as a gold digger from above ground. You're wanting to figure out how far you have to dig to get to some gold. You ask yourself, "if I dig 10 meters, how much (if any) gold will I get?" u_t is the 10 meters you're digging. The sum previously discussed is the distance from the gold to the surface of the ground. In this case, if u_t - the sum is negative, then you get no gold. If it is positive, then you get gold. That's why we take the max. At each level, we want to know if that vector will be popped off at all. Will the gold at that level be "removed" by popping "u_t" distance? Thus, this circle takes the "max between the difference and 0" so that the output is either "how much gold we get" or "0". The output is either "we're popping off this much" or it's "0". So, the "max" represents how much we have left to "pop" at the current depth in the stack. Stop here and make sure you've got this in your head.


def s_t(i,t,u,d):
    if(i >= 0 and i < t):
        inner_sum = sum(s[t-1][i+1:t])
        out = max(0,s[t-1][i] - max(0,u[t] - (inner_sum)))
        return out
    elif(i == t):
        return d[t]
    else:
        print "Undefined i -> t relationship"

Can you see the second "max" function on line 04? It is the code equivalent of the circled function above.
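Here is that inner max on its own, with two made-up cases. The helper name `pop_reaching_row` and the numbers are assumptions for illustration.

```python
def pop_reaching_row(u_t, weight_above):
    """How much of the pop signal is left once it reaches this row."""
    return max(0, u_t - weight_above)

# The pop (0.3) is used up by the 0.5 of weight sitting above the row.
blocked = pop_reaching_row(0.3, 0.5)

# The pop (0.9) digs through the 0.5 above and has about 0.4 left over.
reaches = pop_reaching_row(0.9, 0.5)
```

In gold digger terms, `blocked` is a dig that never reaches the gold and `reaches` is a dig that gets there with some shovel left over.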

Almost there! So, if the max from the previous circle indicates "whether u_t is large enough to pop this row off (and if so how much of it)", then s_t-1[i] - that number is how much we have left after popping.

If the previous circle was "how much left we have after popping", then this just guarantees that we can only have a positive amount left, which is exactly the desirable property. So, this function is really saying: given how much we're popping off at this time step (u_t), and how much is between this row and the top of the stack, how much weight is left in this row? Note that u_t doesn't have any effect if i == t. (It has no effect on how much we push on d_t) This means that we're popping before we push at each timestep. Stop here and make sure this makes sense in your head.
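Putting the pieces of formula (2) together, here is a sketch that takes the previous strengths and returns the new ones instead of reading a global `s`. The function name `update_strengths` is mine, and the pop/push signals below are illustrative values I picked so that the final state lines up with the r_3 example quoted earlier (0.9 * v_3 + 0 * v_2 + 0.1 * v_1).

```python
def update_strengths(s_prev, u_t, d_t):
    """Formula (2) as walked through above, with 0-based rows.

    s_prev holds the strengths before this timestep, bottom of the stack first.
    Returns the new strengths, with the pushed strength d_t on top.
    """
    s_new = []
    for i in range(len(s_prev)):
        weight_above = sum(s_prev[i + 1:])
        popped_here = max(0, u_t - weight_above)
        s_new.append(max(0, s_prev[i] - popped_here))
    s_new.append(d_t)  # we pop first, then push
    return s_new

# Illustrative signals (assumed values).
s_1 = update_strengths([], u_t=0.0, d_t=0.8)
s_2 = update_strengths(s_1, u_t=0.1, d_t=0.5)
s_3 = update_strengths(s_2, u_t=0.9, d_t=0.9)
```

Up to floating point rounding, the strengths go from [0.8] to [0.7, 0.5] to [0.3, 0, 0.9]: the big pop at the last step wipes out the middle row entirely and eats into the bottom one.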

And now onto the third formula! This sum should look very familiar. The only difference between this sum and the sum in function (2) is that in this function we sum all the way to "t" instead of stopping at "t-1". This means that we're including the most recently pushed strength (s_t) in the sum. Previously we did not.

def r_t(t):
    r_t_out = np.zeros(stack_width)
    for i in xrange(0,t+1):
        temp = min(s[t][i],max(0,1 - sum(s[t][i+1:t+1])))
        r_t_out += temp * V[t][i]
    return r_t_out

The circled sum above is represented with the "sum" function on line 04 of this code. The circled (1 - sum) immediately below this paragraph is equivalent to the (1 - sum) on line 04.

This "1" means a lot more than you might think. As a sneak peek, this formula is reading from the stack. How many vectors is it reading? It's reading "1" deep into the stack. The previous sum calculates (for every row of s_t) the sum of all the weights above it in the stack. In other words, the sum calculates the "depth" of each strength at index "i". Thus, 1 minus that sum calculates what's "left over". This difference will be positive for s_t[i] values that are less than 1.0 units deep in the stack. It will be negative for s_t[i] values that are deeper in the stack.

Taking the max of the previous circle guarantees that it's positive. The previous circle was negative for s_t[i] values that were deeper in the stack than 1.0. Since we're only interested in reading values up to 1.0 deep in the stack, all the weights for deeper values in the stack will be 0.

If the previous circle returned 1 - the sum of the weights above a layer in the stack, then it guarantees that the weight is 0 if the layer is too deep. However, if the vector is much shallower than 1.0, the previous circle would return a very positive number (perhaps as high as 1 for the vector at the top). This min function guarantees that the weight at this level doesn't exceed the strength that was pushed onto this level s_t[i].

def r_t(t):
    r_t_out = np.zeros(stack_width)
    for i in xrange(0,t+1):
        temp = min(s[t][i],max(0,1 - sum(s[t][i+1:t+1])))
        r_t_out += temp * V[t][i]
    return r_t_out

So, the circle in the image above is represented as the variable "temp" created on line 04. The circled section below is the output of the entire function, stored as "r_t_out".

So, what does this function do? It performs a weighted sum over the entire stack, multiplying each vector by s_t, if that vector is less than 1.0 depth into the stack. In other words, it reads the top 1.0 weight off of the stack by performing a weighted sum of the vectors, weighted by s_t. This is the vector that is read at time t and put into the variable r_t.
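Here is the read as a standalone sketch that takes the strengths and the memory as arguments. The name `read_stack` is mine, and the strengths are assumed values; I use rows of the identity matrix as the stored vectors so each row's share of the read is easy to see.

```python
import numpy as np

def read_stack(s_t, V_t):
    """Formula (3) as described above: weighted sum of the top 1.0 of strength."""
    r_t = np.zeros(len(V_t[0]))
    for i in range(len(s_t)):
        weight_above = sum(s_t[i + 1:])
        weight = min(s_t[i], max(0, 1 - weight_above))
        r_t += weight * np.asarray(V_t[i], dtype=float)
    return r_t

V_t = np.eye(3)          # three stored vectors, bottom of the stack first
s_t = [0.3, 0.0, 0.9]    # assumed strengths, bottom of the stack first

r_t = read_stack(s_t, V_t)   # approximately [0.1, 0.0, 0.9]
```

The top row contributes its full 0.9, the middle row has no strength left, and the bottom row only gets the remaining 0.1 even though its strength is 0.3.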

Writing the Code

Ok, so now we have an intuitive understanding of what these formulas are doing. Where do we start coding? Recall the following figure.

Let's recreate this behavior given the values of u_t, d_t, and r_t above. Remember that many of the "time indices" will be decreased by 1 relative to the Figure 1 above because we're working in Python (instead of Lua).

So, at risk of re-explaining all of the logic above, I'll point to which places in the code correspond to each function.

Lines 1 - 18 initialize all of our variables. v_0, v_1, and v_2 correspond to v_1, v_2, and v_3 in the picture of a stack operating. I made them be the rows of the identity matrix so that they'd be easy to see inside the stack. v_0 has a 1 in the first position (and zeros everywhere else). v_1 has a 1 in the second position, and v_2 has a 1 in the third position.

Lines 13-17 create our basic stack variables... all indexed by "t"

Lines 19-24 correspond to function (3).

Lines 26-33 correspond to function (2).

Lines 35-56 perform a push and pop operation on our stack.

Lines 45-55 correspond to function (1). Notice that function (1) is more about how to create V_t given V_t-1.

Lines 58-60 run and print out the exact operations in the picture from the paper. Follow along to make sure you see the data flow!

Lines 62-66 reset the stack variables so we can make sure we got the outputs correct (by making sure they equal the values from the paper)

Lines 68-70 assert that the operations from the graph in the paper produce the correct results
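For readers who want something compact to run, below is a sketch of the same pop, push, and read cycle in one small class. It is not the listing that the line numbers above refer to, so those line numbers do not apply to it. The class name `ToyNeuralStack`, the method name, and the signal values are my own assumptions.

```python
import numpy as np

class ToyNeuralStack:
    """Deterministic stack mechanism: pop by u_t, push v_t with d_t, then read."""

    def __init__(self, stack_width):
        self.stack_width = stack_width
        self.V = []  # stored vectors, bottom of the stack first
        self.s = []  # strength of each stored vector

    def push_and_pop(self, v_t, d_t, u_t):
        # Formula (2): remove u_t worth of strength from the top down, then push.
        self.s = [max(0, self.s[i] - max(0, u_t - sum(self.s[i + 1:])))
                  for i in range(len(self.s))] + [d_t]
        # Formula (1): old rows are kept and the new vector is appended.
        self.V = self.V + [np.asarray(v_t, dtype=float)]
        # Formula (3): read the top 1.0 of strength.
        r_t = np.zeros(self.stack_width)
        for i in range(len(self.s)):
            weight = min(self.s[i], max(0, 1 - sum(self.s[i + 1:])))
            r_t += weight * self.V[i]
        return r_t

# Rows of the identity matrix, so each vector is easy to spot in the reads.
v_0, v_1, v_2 = np.eye(3)
stack = ToyNeuralStack(stack_width=3)
r_0 = stack.push_and_pop(v_0, d_t=0.8, u_t=0.0)
r_1 = stack.push_and_pop(v_1, d_t=0.5, u_t=0.1)
r_2 = stack.push_and_pop(v_2, d_t=0.9, u_t=0.9)
```

Because nothing here is learned, the three reads are fully determined by the signals, which is what makes this part such a convenient unit test.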

Have Questions? Stuck? Feel free to tweet your question @iamtrask for help.

Part 4: Learning A Sequence

In the last section, we learned the intuitions behind the neural stack mechanism's formulas. We then constructed those exact formulas in python code and validated that they behaved identically to the example laid out in Figure 1 (a) of the paper. In this section, we're going to dig further into how the neural stack works. We will then teach our neural stack to learn a single sequence using backpropagation.

Let's Revisit How The Neural Stack Will Learn

In Part 1, we discussed how the neural stack is unique in that we can backpropagate error from the output of the stack back through to the input. The reason we can do this is that the stack is fully differentiable. (For background on determining whether a function is differentiable, please see Khan Academy. For more on derivatives and differentiability, see the rest of that tutorial.) Why do we care that the stack (as a function) is differentiable? Well, we used the "derivative" of the function to move the error around (more specifically... to backpropagate). For more on this, please see the Tutorial I Wrote on Basic Neural Networks, Gradient Descent, and Recurrent Neural Networks. I particularly recommend the last one because it demonstrates backpropagating through somewhat more arbitrary vector operations... kindof like what we're going to do here. :)

Perhaps you might say, "Hey Andrew, pre-requisite information is all nice and good... but I'd like a little bit more intuition on why this stack is differentiable". Let me try to simplify most of it to a few easy to use rules. Backpropagation is really about credit attribution. It's about the neural network saying "ok... I messed up. I was supposed to predict 1.0 but i predicted 0.9. What parts of the function caused me to do this? What parts can I change to better predict 1.0 next time?". Consider this problem.

y = a + b

What if I told you that when we ran this equation, y ended up being a little bit too low? It should have been 1.0 and it was 0.9. So, the error was 0.1. Where did the error come from? Clearly, it came from both a and b equally. So, this gives us an early rule of thumb: when we're summing two variables into another variable, the full error is passed back to each of the terms equally... because it's both their faults that it missed! This is a gross oversimplification of calculus, but it helps me remember how to do the chain rule in code on the fly... so I dunno. I find it helpful... at least as a mnemonic. Let's look at another problem.

y = a + 2*b

Same question. Different answer. In this case, the error is 2 times more significant because of b. So remember, when you are multiplying in the function, you have to multiply the error at any point by what that point was multiplied by. So, if the error at y is 0.1, the error at a is 0.1 and the error at b is 2 * 0.1 = 0.2. By the way, this generalizes to vector addition and multiplication as well. Consider in your head why the error would be twice as significant at b: y is twice as sensitive to changes in b!

y = a*b

Ok, last one for now. If you compute 0.1 error at y, what is the error at a? Well, we can't really know without knowing what b is, because b determines how sensitive y is to a. Funnily enough, the reverse is also true: a determines how sensitive y is to b! So, let's take this simple intuition and reflect on our neural stack in both the formal math functions and the code. (Btw, there are many more rules to know in Calculus, and I highly recommend taking a course on it from Coursera or Khan Academy, but these rules should pretty much get you through the Neural Stack.)
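If you'd like to convince yourself of these three rules numerically, here is a minimal sketch. It nudges each input by a tiny amount and measures how much y moves. The helper name `sensitivity` and the example values for a and b are assumptions made up for this illustration.

```python
# Finite-difference check of the three rules of thumb.
# `sensitivity` is a hypothetical helper, not part of the neural stack code.

def sensitivity(f, a, b, eps=1e-6):
    """Estimate how much f changes per unit change in a and in b."""
    base = f(a, b)
    da = (f(a + eps, b) - base) / eps
    db = (f(a, b + eps) - base) / eps
    return da, db

error_at_y = 0.1
a, b = 0.5, 0.4  # arbitrary example values

# Rule 1: y = a + b   -> a and b are equally responsible
add_da, add_db = sensitivity(lambda a, b: a + b, a, b)

# Rule 2: y = a + 2*b -> b is twice as responsible as a
scaled_da, scaled_db = sensitivity(lambda a, b: a + 2 * b, a, b)

# Rule 3: y = a*b     -> a's share depends on b, and b's share depends on a
mul_da, mul_db = sensitivity(lambda a, b: a * b, a, b)

# Error attributed to each input = error at y * sensitivity of y to that input
error_at_b_rule_2 = error_at_y * scaled_db
print("error at b for y = a + 2*b: %.3f" % error_at_b_rule_2)
```

The sensitivities come out as (1, 1), (1, 2) and (b, a) respectively, which is exactly what the rules of thumb say.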

def s_t(i,t,u,d):
    if(i >= 0 and i < t):
        inner_sum = sum(s[t-1][i+1:t])
        out = max(0,s[t-1][i] - max(0,u[t] - (inner_sum)))
        return out
    elif(i == t):
        return d[t]
    else:
        print "Undefined i -> t relationship"

def r_t(t):
    r_t_out = np.zeros(stack_width)
    for i in xrange(0,t+1):
        temp = min(s[t][i],max(0,1 - sum(s[t][i+1:t+1])))
        r_t_out += temp * V[t][i]
    return r_t_out

Read through each formula. What if the output of our python function r_t(t) was supposed to be 1.0 and it returned 0.9? We would (again) have an error of 0.1. Conceptually, this means we read our stack (by calling the function r_t(t)) and got back a number that was a little bit too low. So, can you see how we can take the simple rules above and move the error (0.1) from the output of the function (the final return statement of r_t) back through to the various inputs? In our case, those inputs include global variables s and V. It's really not any more complicated than the 3 rules we identified above! It's just a long chain of them. This would end up putting error onto s and V. This puts error onto the stack! I find this concept really fascinating. Error in memory! It's like telling the network "dude... you remembered the wrong thing man! Remember something more relevant next time!". Pretty sweet stuff.
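To make "putting error onto s" a bit more concrete, here is a small sketch that measures how sensitive a stack read is to one strength value. The function `read` is a standalone re-implementation of the r_t rule for a single timestep, and the two-entry stack is a made-up example, so treat both as assumptions for illustration only.

```python
import numpy as np

def read(s, V):
    """Same read rule as r_t, written for a single timestep."""
    r = np.zeros(V.shape[1])
    for i in range(len(s)):
        weight = min(s[i], max(0, 1 - np.sum(s[i + 1:])))
        r += weight * V[i]
    return r

# Hypothetical stack holding two entries
s = np.array([0.5, 0.4])           # strengths
V = np.array([[1.0, 0.0],
              [0.0, 1.0]])         # values (one row per entry)

# Nudge the strength of the bottom entry and see how the read moves
eps = 1e-6
s_nudged = s.copy()
s_nudged[0] += eps
sensitivity_to_s0 = (read(s_nudged, V) - read(s, V)) / eps

# Suppose the first dimension of the read was too low by 0.1
read_error = np.array([0.1, 0.0])
error_at_s0 = np.sum(read_error * sensitivity_to_s0)
print("error attributed to s[0]: %.3f" % error_at_s0)
```

Here the read is [0.5, 0.4], and because s[0] is smaller than the remaining read capacity, the read moves one-for-one with s[0] in the direction of V[0]. So the error landing on s[0] is sum(read_error * V[0]) = 0.1.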

So, if we were to code up this error attribution... this backpropagation... it would look like the following.

def s_t_error(i,t,u,d,error):
    if(i >= 0 and i < t):
        if(s_t(i,t,u,d) >= 0):
            s_delta[t-1][i] += error
            if(max(0,u[t] - np.sum(s[t-1][i+1:t-0])) >= 0):
                u_delta[t] -= error
                s_delta[t-1][i+1:t-0] += error
    elif(i == t):
        d_delta[t] += error
    else:
        print "Problem"

def r_t_error(t,r_t_error):
    for i in xrange(0, t+1):
        temp = min(s[t][i],max(0,1 - np.sum(s[t][i+1:t+1])))
        V_delta[t][i] += temp * r_t_error
        temp_error = np.sum(r_t_error * V[t][i])
    
        if(s[t][i] < max(0,1 - np.sum(s[t][i+1:t+1]))):
            s_delta[t][i] += temp_error
        else:
            if(max(0,1 - np.sum(s[t][i+1:t+1])) > 0):
                s_delta[t][i+1:t+1] -= temp_error # minus equal because of the (1-).. and drop the 1

Notice that I have variables V_delta, u_delta, and s_delta that I put the "errors" into. These are identically shaped variables to V, u, and s respectively. It's just a place to put the delta (since there are already meaningful variables in the regular V, u, and s that I don't want to erase).

From Error Propagation To Learning

Ok, so now we know how to move error around through two of our fun stack functions. How does this translate to learning? What are we trying to learn anyway?

Let's think back to our regular stack again. Remember the toy problem we had before? If we pushed an entire sequence onto a stack and then popped it off, we'd get the sequence back in reverse. What this requires is the correct sequence of pushing and popping. So, what if we pushed 3 items on our stack, could we learn to correctly pop them off by adjusting u_t? Let's give it a shot!
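As a quick reminder of what the regular (non-neural) stack does on this toy problem, here is a tiny sketch using a plain Python list as the stack. The three values are just an example sequence.

```python
# A plain Python list used as a stack: push three items, then pop three items.
stack = []
for item in [0.1, 0.2, 0.3]:
    stack.append(item)              # push

popped = [stack.pop() for _ in range(3)]   # pop
print(popped)
```

The items come back as 0.3, 0.2, 0.1. The neural stack has to learn the same "push, push, push, pop, pop, pop" schedule, except that pushing and popping are continuous strengths rather than discrete operations.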

Step one is to setup our problem. We need to pick a sequence, and initialize our u_t and d_t variables. (We're initializing both, but we're only going to try to adjust u_t). Something like this will do.


u_weights = np.array([0,0,0,0,0,0.01])
d_weights = np.array([1,1,1,0,0,0]).astype('float64')
# for i in xrange(1000):
stack_width = 2
copy_length = 5

sequence = np.array([[0.1,0.2,0.3],[0,0,0]]).T

# INIT
V = list() # stack states
s = list() # stack strengths 
d = list() # push strengths
u = list() # pop strengths

V_delta = list() # stack states
s_delta = list() # stack strengths 
d_delta = list() # push strengths
u_delta = list() # pop strengths

Ok, it's time to start "reading the net" a little bit. Notice that d_weights starts with three 1s in the first three positions. This is what's going to push our sequence onto the stack! By just fixing these weights to 1, we will push with a weight of 1 onto the stack. We're also jumping into something a bit fancy here but important to form the right picture of the stack in our heads. Our sequence has two dimensions and three values. The first dimension (column) has the sequence 0.1, 0.2, and 0.3. The second dimension is all zeros. So, the first item in our sequence is really [0.1, 0.0]. The second is [0.2, 0.0]. We're only focusing on optimizing for (reversing) the sequence in the first column, but I want to use two dimensions so that we can make sure our logic is generalized to multi-dimensional sequence inputs. We'll see why later. :)
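If the transposed layout is hard to picture, this short sketch prints the shape and the individual items of the same sequence array.

```python
import numpy as np

# Same construction as in the setup code: two rows, transposed into three 2-d items
sequence = np.array([[0.1, 0.2, 0.3], [0, 0, 0]]).T

print(sequence.shape)   # three timesteps, each a vector of width 2
first_item = sequence[0]
second_item = sequence[1]
first_column = sequence[:, 0]
```

Each row of `sequence` is one item that gets pushed, and `stack_width = 2` matches the width of each item.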

Also notice that we're initializing the "delta" variables. We also make a few changes to our functions from before to make sure we keep the delta variables maintaining the same shape as their respective base variables.

NOTE: When you hit "Play", the browser may freeze temporarily. Just wait for it to finish. Could be a minute or two.

So, at risk of making this blogpost too long (it probably already is), I'll leave it up to you to use what I've taught here and in previous blogposts to work through the backpropagation steps if you like. It's really just a sequence of applying the rules that we outlined above. Furthermore, everything else in the learning stack above is based on the concepts we already learned. I encourage you to play around with the code. All we did was backprop from the error in prediction back to the popping weight array u_weights... which stores the value we enter for u_t at each timestep. We then update the weights to apply a different u_t at the next iteration. To be clear, this is basically a neural network with only 3 parameters that we update. However, since that update is whether we pop or not, it has the opportunity to optimize this toy problem. Try to learn different sequences. Break it. Fix it. Play with it!

The Next Level: Learning to Push

So, why did we build that last toy example? Well, for me personally, I wanted to be able to sanity check that my backpropagation logic was correct. What better way to check that than to have it optimize the simplest toy problem I could devise? This is another best practice. Validate as much as you can along the way. What I know so far is that the deltas showing up in u_delta are correct, and if I use them to update future values of u_t, then the network converges to a sequence. However, what about d_t? Let's try to optimize both with a slightly harder problem (but only slightly... remember... we're validating code). Notice that there are very few code changes. We're just harvesting the derivatives from d_t as well to update d_weights just like we did u_weights (run for more iterations to get better convergence).
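Another common way to validate backpropagation code, in addition to the toy-problem approach used here, is a numerical gradient check: compare your hand-written derivatives against finite differences. The sketch below does this for a made-up two-parameter loss. The names `numerical_gradient`, `toy_loss` and `toy_loss_gradient` are assumptions for illustration and are not part of the neural stack code.

```python
def numerical_gradient(f, params, eps=1e-5):
    """Central-difference estimate of df/dparams[k] for every k."""
    grads = []
    for k in range(len(params)):
        up = list(params)
        down = list(params)
        up[k] += eps
        down[k] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

def toy_loss(params):
    # squared error of the product a*b against a target of 1.0
    a, b = params
    return (a * b - 1.0) ** 2

def toy_loss_gradient(params):
    # hand-derived with the chain rule: this is the code we want to validate
    a, b = params
    residual = a * b - 1.0
    return [2 * residual * b, 2 * residual * a]

params = [0.9, 0.8]
analytic = toy_loss_gradient(params)
numeric = numerical_gradient(toy_loss, params)
max_difference = max(abs(x - y) for x, y in zip(analytic, numeric))
print("largest disagreement: %.2e" % max_difference)
```

If the disagreement is tiny, the hand-written derivative is very likely right. The same trick can be pointed at u_delta or d_delta by treating the whole forward pass as `f`.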

Have Questions? Stuck? Feel free to tweet your question @iamtrask for help.

Part 5: Building a Neural Controller

We did it! At this point, we have built a neural stack and all of its components. However, there's more to do to get it to learn arbitrary sequences. So, for a little while we're going to return to some of our fundamentals of neural networks. The next phase is to control v_t, u_t, and d_t with an external neural network called a "Controller". This network will be a Recurrent Neural Network (because we're still modeling sequences). Thus, the knowledge of RNNs contained in the previous blogpost on Recurrent Neural Networks will be considered a pre-requisite.

To determine what kind of neural network we will use to control these various operations, let's take a look back at the formulas describing it in the paper.
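Written out to match the forward-propagation code in this section, the four quantities produced by the controller look roughly like the following. The notation mirrors the code's variable names and is an assumption read off the code, not a quote from the paper. Sigma is the sigmoid function, and the code multiplies o_prime_t on the left of each weight matrix, so the matrices may be transposed relative to this form.

```latex
\begin{aligned}
d_t &= \sigma(W_d \, o'_t + b_d) \\
u_t &= \sigma(W_u \, o'_t + b_u) \\
v_t &= \tanh(W_v \, o'_t + b_v) \\
o_t &= \tanh(W_o \, o'_t + b_o)
\end{aligned}
```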

The high level takeaway of these formulas is that all of our inputs to the stack are conditioned on a vector called o_prime_t. If you are familiar with vanilla neural networks already, then this should be easy work. The code for these functions looks something like this.


import numpy as np

def sigmoid(x):
    return 1/(1+np.exp(-x))

def tanh(x):
    return np.tanh(x)

o_prime_dim = 5
stack_width = 2
output_dim = 3

o_prime_t = np.random.rand(o_prime_dim) # random input data

# weight matrix
W_d = np.random.rand(o_prime_dim,1) * 0.2 - 0.1
# bias
b_d = np.random.rand(1) * 0.2 - 0.1

# weight matrix
W_u = np.random.rand(o_prime_dim,1) * 0.2 - 0.1
# bias
b_u = np.random.rand(1) * 0.2 - 0.1

# weight matrix
W_v = np.random.rand(o_prime_dim,stack_width) * 0.2 - 0.1
#bias
b_v = np.random.rand(stack_width) * 0.2 - 0.1

#weight matrix
W_o = np.random.rand(o_prime_dim,output_dim) * 0.2 - 0.1
#bias
b_o = np.random.rand(output_dim) * 0.2 - 0.1

#forward propagation
d_t = sigmoid(np.dot(o_prime_t,W_d) + b_d)
u_t = sigmoid(np.dot(o_prime_t,W_u) + b_u)
v_t = tanh(np.dot(o_prime_t,W_v) + b_v)
o_t = tanh(np.dot(o_prime_t,W_o) + b_o)

So, this is another point where it would be tempting to hook up all of these controllers at once, build an RNN, and see if it converges. However, this is not wise. Instead, let's (again) just bite off the smallest piece that we can test. Let's start by just controlling u_t and d_t with a neural network by altering our previous codebase. This is also a good time to add some object oriented abstraction to our neural stack since we won't be changing it anymore (make it work... THEN make it pretty :) ).

Note: I'm just writing the code inline for you to copy and run locally. This was getting a bit too computationally intense for a browser. I highly recommend downloading iPython Notebook and running all the rest of the examples in this blog in various notebooks. That's what I used to develop them. They're extremely effective for experimentation and rapid prototyping.


import numpy as np
np.random.seed(1)

def sigmoid(x):
    return 1/(1+np.exp(-x))

def sigmoid_out2deriv(out):
    return out * (1 - out)

def tanh(x):
    return np.tanh(x)

def tanh_out2deriv(out):
    return (1 - out**2)

def relu(x,deriv=False):
    if(deriv):
        return int(x >= 0)
    return max(0,x)

class NeuralStack():
    def __init__(self,stack_width=2,o_prime_dim=6):
        self.stack_width = stack_width
        self.o_prime_dim = o_prime_dim
        self.reset()
        
    def reset(self):
        # INIT STACK
        self.V = list() # stack states
        self.s = list() # stack strengths 
        self.d = list() # push strengths
        self.u = list() # pop strengths
        self.r = list()
        self.o = list()

        self.V_delta = list() # stack states
        self.s_delta = list() # stack strengths 
        self.d_error = list() # push strengths
        self.u_error = list() # pop strengths
        
        self.t = 0
        
    def pushAndPopForward(self,v_t,d_t,u_t):

        self.d.append(d_t)
        self.d_error.append(0)

        self.u.append(u_t)
        self.u_error.append(0)

        new_s = np.zeros(self.t+1)
        for i in xrange(self.t+1):
            new_s[i] = self.s_t(i)
        self.s.append(new_s)
        self.s_delta.append(np.zeros_like(new_s))

        if(len(self.V) == 0):
            V_t = np.zeros((0,self.stack_width))
        else:
            V_t = self.V[-1]
        self.V.append(np.concatenate((V_t,np.atleast_2d(v_t)),axis=0))
        self.V_delta.append(np.zeros_like(self.V[-1]))
        
        r_t = self.r_t()
        self.r.append(r_t)
        
        self.t += 1
        return r_t
    
    def s_t(self,i):
        if(i >= 0 and i < self.t):
            inner_sum = self.s[self.t-1][i+1:self.t-0]
            return relu(self.s[self.t-1][i] - relu(self.u[self.t] - np.sum(inner_sum)))
        elif(i == self.t):
            return self.d[self.t]
        else:
            print "Problem"
            
    def s_t_error(self,i,error):
        if(i >= 0 and i < self.t):
            if(self.s_t(i) >= 0):
                self.s_delta[self.t-1][i] += error
                if(relu(self.u[self.t] - np.sum(self.s[self.t-1][i+1:self.t-0])) >= 0):
                    self.u_error[self.t] -= error
                    self.s_delta[self.t-1][i+1:self.t-0] += error
        elif(i == self.t):
            self.d_error[self.t] += error
        else:
            print "Problem"
            
    def r_t(self):
        r_t_out = np.zeros(self.stack_width)
        for i in xrange(0,self.t+1):
            temp = min(self.s[self.t][i],relu(1 - np.sum(self.s[self.t][i+1:self.t+1])))
            r_t_out += temp * self.V[self.t][i]
        return r_t_out
            
    def r_t_error(self,r_t_error):
        for i in xrange(0, self.t+1):
            temp = min(self.s[self.t][i],relu(1 - np.sum(self.s[self.t][i+1:self.t+1])))
            self.V_delta[self.t][i] += temp * r_t_error
            temp_error = np.sum(r_t_error * self.V[self.t][i])

            if(self.s[self.t][i] < relu(1 - np.sum(self.s[self.t][i+1:self.t+1]))):
                self.s_delta[self.t][i] += temp_error
            else:
                if(relu(1 - np.sum(self.s[self.t][i+1:self.t+1])) > 0):
                    self.s_delta[self.t][i+1:self.t+1] -= temp_error # minus equal because of the (1-).. and drop the 1
            
    def backprop(self,all_errors_in_order_of_training_data):
        errors = all_errors_in_order_of_training_data
        for error in reversed(list((errors))):
            self.t -= 1
            self.r_t_error(error)
            for i in reversed(xrange(self.t+1)):
                self.s_t_error(i,self.s_delta[self.t][i])

        
u_weights = np.array([0,0,0,0,0,0.0])

o_primes = np.eye(6)
y = np.array([1,0])
W_op_d = (np.random.randn(6,1) * 0.2) - 0.2
b_d = np.zeros(1)

W_op_u = (np.random.randn(6,1) * 0.2) - 0.2
b_u = np.zeros(1)

for h in xrange(10000):
    
    sequence = np.array([np.random.rand(3),[0,0,0]]).T
    sequence = np.concatenate((sequence,np.zeros_like(sequence)))
    
    stack = NeuralStack()

    d_weights = sigmoid(np.dot(o_primes,W_op_d) + b_d)
    u_weights = sigmoid(np.dot(o_primes,W_op_u) + b_u)
    
    reads = list()
    for i in xrange(6):
        reads.append(stack.pushAndPopForward(sequence[i],d_weights[i],u_weights[i]))

    reads = np.array(reads)

    reads[:3] *= 0 # eliminate errors we're not ready to backprop

    errors_in_order = reads - np.array(list(reversed(sequence)))

    stack.backprop(errors_in_order)

    # WEIGHT UPDATE
    alpha = 0.5

    u_delta = np.atleast_2d(np.array(stack.u_error) * sigmoid_out2deriv(u_weights)[0])
    W_op_u -= alpha * np.dot(o_primes,u_delta.T)
    b_u -= alpha * np.mean(u_delta)

    d_delta = np.atleast_2d(np.array(stack.d_error) * sigmoid_out2deriv(d_weights)[0])
    W_op_d -= alpha * np.dot(o_primes,d_delta.T)
    b_d -= alpha * np.mean(d_delta)

    for k in range(len(d_weights)):
        if d_weights[k] < 0:
            d_weights[k] = 0
        if u_weights[k] < 0:
            u_weights[k] = 0
            

    if(h % 100 == 0):
        print errors_in_order

Runtime Output:


[[ 0.          0.        ]
 [ 0.          0.        ]
 [ 0.          0.        ]
 [-0.79249511  0.        ]
 [-0.19810149  0.        ]
 [-0.14038694  0.        ]]

 .....
 .....
 .....

 [[ 0.          0.        ]
 [ 0.          0.        ]
 [ 0.          0.        ]
 [-0.00174623  0.        ]
 [ 0.0023387   0.        ]
 [-0.00084291  0.        ]]

Note that I'm logging errors this time instead of the discrete sequence.

All this code is doing is using two weight matrices W_op_u and W_op_d (and their biases) to predict u_t and d_t. We created mock o_prime_t variables to be different at each timestep. Instead of taking the delta at u_t and changing u_weights directly, we used the delta at u_t to update the matrix W_op_u. Even though the code is cleaned up considerably, it's still doing the same thing for 99% of it.
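One way to see why this is "the same thing" is that the mock o_primes matrix is the identity, so each timestep simply selects its own row of the weight matrix. Here is a minimal sketch of that observation. The seed is arbitrary and only there to make the run repeatable.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

np.random.seed(0)  # arbitrary seed, just for repeatability

o_primes = np.eye(6)                          # one one-hot "o_prime" per timestep
W_op_u = (np.random.randn(6, 1) * 0.2) - 0.2
b_u = np.zeros(1)

u_weights = sigmoid(np.dot(o_primes, W_op_u) + b_u)

# With identity inputs (and a zero bias), timestep t just reads row t of W_op_u
same_as_rows = np.allclose(u_weights, sigmoid(W_op_u))
print(same_as_rows)
```

So with these mock inputs, W_op_u plays the same role that the raw u_weights array played earlier, only squashed through a sigmoid. Things get more interesting once o_prime_t comes from a real hidden layer.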

Building Out The Rest of the Controller

So, all we're really doing now is taking the RNN from my previous blogpost on Recurrent Neural Networks and using it to generate o_prime_t. We then hook up the forward and backpropagation and we get the following code. I'm going to write the code in sections here and describe (at a high level) what's going on. I'll then give you a single block with all the code together (that's runnable).


import numpy as np

def sigmoid(x):
    return 1/(1+np.exp(-x))

def sigmoid_out2deriv(out):
    return out * (1 - out)

def tanh(x):
    return np.tanh(x)

def tanh_out2deriv(out):
    return (1 - out**2)

def relu(x,deriv=False):
    if(deriv):
        return int(x >= 0)
    return max(0,x)

Seen this before. Just some utility nonlinearities to use at various layers. Note that I'm using relu here instead of using the "max(0,x)" from before. They are identical. So, wherever you used to see "max(0," you will now see "relu(".


class NeuralStack():
    def __init__(self,stack_width=2,o_prime_dim=6):
        self.stack_width = stack_width
        self.o_prime_dim = o_prime_dim
        self.reset()
        
    def reset(self):
        # INIT STACK
        self.V = list() # stack states
        self.s = list() # stack strengths 
        self.d = list() # push strengths
        self.u = list() # pop strengths
        self.r = list()
        self.o = list()

        self.V_delta = list() # stack states
        self.s_delta = list() # stack strengths 
        self.d_error = list() # push strengths
        self.u_error = list() # pop strengths
        
        self.t = 0
        
    def pushAndPopForward(self,v_t,d_t,u_t):

        self.d.append(d_t)
        self.d_error.append(0)

        self.u.append(u_t)
        self.u_error.append(0)

        new_s = np.zeros(self.t+1)
        for i in xrange(self.t+1):
            new_s[i] = self.s_t(i)
        self.s.append(new_s)
        self.s_delta.append(np.zeros_like(new_s))

        if(len(self.V) == 0):
            V_t = np.zeros((0,self.stack_width))
        else:
            V_t = self.V[-1]
        self.V.append(np.concatenate((V_t,np.atleast_2d(v_t)),axis=0))
        self.V_delta.append(np.zeros_like(self.V[-1]))
        
        r_t = self.r_t()
        self.r.append(r_t)
        
        self.t += 1
        return r_t
    
    def s_t(self,i):
        if(i >= 0 and i < self.t):
            inner_sum = self.s[self.t-1][i+1:self.t-0]
            return relu(self.s[self.t-1][i] - relu(self.u[self.t] - np.sum(inner_sum)))
        elif(i == self.t):
            return self.d[self.t]
        else:
            print "Problem"
            
    def s_t_error(self,i,error):
        if(i >= 0 and i < self.t):
            if(self.s_t(i) >= 0):
                self.s_delta[self.t-1][i] += error
                if(relu(self.u[self.t] - np.sum(self.s[self.t-1][i+1:self.t-0])) >= 0):
                    self.u_error[self.t] -= error
                    self.s_delta[self.t-1][i+1:self.t-0] += error
        elif(i == self.t):
            self.d_error[self.t] += error
        else:
            print "Problem"
            
    def r_t(self):
        r_t_out = np.zeros(self.stack_width)
        for i in xrange(0,self.t+1):
            temp = min(self.s[self.t][i],relu(1 - np.sum(self.s[self.t][i+1:self.t+1])))
            r_t_out += temp * self.V[self.t][i]
        return r_t_out
            
    def r_t_error(self,r_t_error):
        for i in xrange(0, self.t+1):
            temp = min(self.s[self.t][i],relu(1 - np.sum(self.s[self.t][i+1:self.t+1])))
            self.V_delta[self.t][i] += temp * r_t_error
            temp_error = np.sum(r_t_error * self.V[self.t][i])

            if(self.s[self.t][i] < relu(1 - np.sum(self.s[self.t][i+1:self.t+1]))):
                self.s_delta[self.t][i] += temp_error
            else:
                if(relu(1 - np.sum(self.s[self.t][i+1:self.t+1])) > 0):
                    self.s_delta[self.t][i+1:self.t+1] -= temp_error # minus equal because of the (1-).. and drop the 1

    def backprop_single(self,r_t_error):
        self.t -= 1
        self.r_t_error(r_t_error)
        for i in reversed(xrange(self.t+1)):
            self.s_t_error(i,self.s_delta[self.t][i])
    
    def backprop(self,all_errors_in_order_of_training_data):
        errors = all_errors_in_order_of_training_data
        for error in reversed(list((errors))):
            self.backprop_single(error)

This is pretty much the same Neural Stack we developed before. I broke the "backprop" method into two methods: backprop and backprop_single. Backpropagating over all the timesteps can be done by calling backprop. If you just want to backprop a single step at a time (which was useful when making sure to backprop through the RNN), then call backprop_single.

options = 2
sub_sequence_length = 2
sequence_length = sub_sequence_length*2

sequence = (np.random.random(sub_sequence_length)*options).astype(int)+1
sequence[-1] = 0
sequence

X = np.zeros((sub_sequence_length*2,options+1))
Y = np.zeros_like(X)
for i in xrange(len(sequence)):
    X[i][sequence[i]] = 1
    Y[-i-1][sequence[i]] = 1 # -i-1 so that i == 0 maps to the last row (reversed order)

x_dim = X.shape[1]
h_dim = 16
o_prime_dim = 16
stack_width = options
y_dim = Y.shape[1]

This segment of code does a couple of things. First, it constructs a training example sequence. "sub_sequence_length" is the length of the sequence that we want to remember and then reverse with the neural stack. "options" is the number of unique elements in the sequence. Setting options to 2 generates a binary sequence, which is what we're running here. The sequence_length is just double the sub_sequence_length. This is because we need to first encode the whole sequence and then decode the whole sequence. So, if the sub_sequence_length is 5, then we have to generate 10 training examples (5 encoding and 5 decoding). Note that we set the last number in the sequence to 0, which is a special index indicating that we have reached the end of the sequence. The network will learn to start reversing at this point.

X and Y are our input and output training examples respectively. They one-hot encode our training data for both inputs and outputs.
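Here is a small sketch of what that one-hot encoding produces for one concrete sequence. The sequence [1, 2, 2] is a made-up example, the end-of-sequence marker is left out for simplicity, and the targets are written with the Y[-i-1] indexing that the training loop later in this post uses.

```python
import numpy as np

options = 2
sequence = np.array([1, 2, 2])          # hypothetical example sequence
sub_sequence_length = len(sequence)

X = np.zeros((sub_sequence_length * 2, options + 1))
Y = np.zeros_like(X)
for i in range(len(sequence)):
    X[i][sequence[i]] = 1               # encode phase: first half of X
    Y[-i - 1][sequence[i]] = 1          # decode phase: last half of Y, reversed

encoded_inputs = [int(k) for k in np.argmax(X[:sub_sequence_length], axis=1)]
decoded_targets = [int(k) for k in np.argmax(Y[sub_sequence_length:], axis=1)]
print(encoded_inputs)
print(decoded_targets)
```

The first half of X holds the sequence in order, and the second half of Y holds the same sequence reversed. The remaining rows are all zeros.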

Finally, we have the dimensionality of our neural network. In accordance with the paper, we have the input dimension x_dim (equal to the number of "options" in our sequence plus one special character for the end of sequence marker). We also have two hidden layers. "h_dim" refers to the hidden layer in the recurrent neural network. "o_prime_dim" is the second hidden layer (generated from the recurrent hidden layer) which sends information into our neural stack. We have set the dimensionality of both hidden layers to 16. Note that this is WAY smaller than the 256 and 512 size layers in the paper. For ease, we're going to work with shorter binary sequences which require smaller hidden layer sizes (mostly because of the number of options, not the length of the sequence...).

"stack_width" is still the width of the vectors on the stack. In this case, I'm setting it to the number of options so that it can (in theory) one hot encode the input data into the stack. In theory you could actually use log_base_2(options) but this level of compression just requires more training time. I tried several experiments making this bigger with mixed results.

"y_dim" is the dimensionality of the output sequence to be predicted. Note that this could be (in theory) any sequence, but in this case it is the reverse of the input.


W_xh = (np.random.rand(x_dim,h_dim)*0.2) - 0.1
W_xh_update = np.zeros_like(W_xh)

W_hh = (np.random.rand(h_dim,h_dim)*0.2) - 0.1
W_hh_update = np.zeros_like(W_hh)

W_rh = (np.random.rand(stack_width,h_dim)*0.2) - 0.1
W_rh_update = np.zeros_like(W_rh)

b_h = (np.random.rand(h_dim)*0.2) - 0.1
b_h_update = np.zeros_like(b_h)

W_hop = (np.random.rand(h_dim,o_prime_dim) * 0.2) - 0.1
W_hop_update = np.zeros_like(W_hop)

b_op = (np.random.rand(o_prime_dim)*0.2) - 0.1
b_op_update = np.zeros_like(b_op)

W_op_d = (np.random.rand(o_prime_dim,1)*0.2) - 0.1
W_op_d_update = np.zeros_like(W_op_d)

W_op_u = (np.random.rand(o_prime_dim,1)*0.2) - 0.1
W_op_u_update = np.zeros_like(W_op_u)

W_op_v = (np.random.rand(o_prime_dim,stack_width)*0.2) - 0.1
W_op_v_update = np.zeros_like(W_op_v)

W_op_o = (np.random.rand(o_prime_dim,y_dim)*0.2) - 0.1
W_op_o_update = np.zeros_like(W_op_o)

b_d = (np.random.rand(1)*0.2)+0.1
b_d_update = np.zeros_like(b_d)

b_u = (np.random.rand(1)*0.2)-0.1
b_u_update = np.zeros_like(b_u)

b_v = (np.random.rand(stack_width)*0.2)-0.1
b_v_update = np.zeros_like(b_v)

b_o = (np.random.rand(y_dim)*0.2)-0.1
b_o_update = np.zeros_like(b_o)

This initializes all of our weight matrices necessary for the Recurrent Neural Network Controller. I generally used the notation W_ for weight matrices and b_ for biases. Following the W is shorthand for what it connects from and to. For example, W_xh connects the input (x) to the recurrent hidden layer (h). "op" is shorthand for "o_prime".

There is one other note here that you can find in the appendix of the paper. Initialization of b_d and b_u has significant impact on how well the neural stack is learned. In general, if the first iterations of the network don't push anything onto the stack, then no derivatives will backpropagate THROUGH the stack, and the neural network will just ignore it. Thus, initializing b_d (the push bias) to a higher number (+0.1 instead of -0.1) encourages the network to push onto the neural stack during the beginning of training. This has a nice parallel intuition to life. If you had a stack but never pushed anything onto it... how would you know what it does? Generally the same thing going on here intuitively.
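To get a feel for how much this shift matters at the start of training, here is a rough sketch. It ignores the contribution of the (small, randomly initialized) W_op_d and W_op_u terms, which is a simplifying assumption, and just pushes the bias ranges through a sigmoid.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Ranges produced by the initializers for the two biases
push_bias_low, push_bias_high = 0.1, 0.3     # b_d = rand * 0.2 + 0.1
pop_bias_low, pop_bias_high = -0.1, 0.1      # b_u = rand * 0.2 - 0.1

initial_push_range = (sigmoid(push_bias_low), sigmoid(push_bias_high))
initial_pop_range = (sigmoid(pop_bias_low), sigmoid(pop_bias_high))

print("push strength starts in %.3f..%.3f" % initial_push_range)
print("pop strength starts in  %.3f..%.3f" % initial_pop_range)
```

Under that assumption the initial push strength sits a little above 0.5 and the initial pop strength straddles 0.5, so early on the network tends to push at least as hard as it pops.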

The reason we have _update variables is that we're going to be implementing mini-batch updates. Thus, we'll create updates and save them in the _update variables and only occasionally update the actual variables. This makes for smoother training. See more on this in previous blogposts.
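The accumulate-then-apply pattern can be sketched in isolation like this. The weight matrix `W` and the random "gradients" are stand-ins invented for the example, and only the bookkeeping mirrors what the controller code does.

```python
import numpy as np

np.random.seed(0)   # arbitrary seed for repeatability
batch_size = 10

W = np.zeros((3, 2))               # hypothetical weight matrix
W_update = np.zeros_like(W)        # same shape, holds the pending update
times_applied = 0

for it in range(30):
    fake_update = np.random.rand(3, 2) * 0.01   # stand-in for alpha * outer(...)
    W_update += fake_update                     # collect, but don't touch W yet

    if it % batch_size == (batch_size - 1):
        W += W_update / batch_size              # apply the averaged update
        W_update *= 0                           # clear the accumulator
        times_applied += 1

print(times_applied)
```

Over 30 iterations with a batch size of 10, the real weights only change three times, each time by the average of ten collected updates.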


error = 0
max_len = 1
batch_size = 10
for it in xrange(1000000000):
    
    sub_sequence_length = np.random.randint(max_len)+1
    sequence = (np.random.random(sub_sequence_length)*options).astype(int)+1
#     sequence[-1] = 0
    sequence
    
    X = np.zeros((sub_sequence_length*2,options+1))
    Y = np.zeros_like(X)
    for i in xrange(len(sequence)):
        X[i][sequence[i]] = 1
        X[i][0] = 1
        Y[-i-1][sequence[i]] = 1

Ok, so the logic above that creates training examples doesn't exactly get used. I just use that section further up to experiment with the training example logic. I encourage you to do it. As you can see here, we randomly generate new training examples as we go. Note that "max_len" refers to the maximum length that we will model initially. As the error goes down (the neural network learns), this number will increase, modeling longer and longer sequences. Basically, we start by training the neural stack on short sequences, and once it gets good at those we start presenting longer ones. Experimenting with how long to start with was very fascinating to me. I highly encourage playing around with it.
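The "start short, then grow" schedule is easier to see with the network stripped out. In the sketch below the per-example error is produced by a made-up function, `fake_example_error`, so the numbers mean nothing. Only the threshold logic (accumulate error for 100 iterations, then compare against 5*max_len) follows the training code.

```python
def fake_example_error(it, max_len):
    # Made-up stand-in for the network's error on one training example:
    # it scales with sequence length and drops sharply after 300 iterations.
    per_step = 0.1 if it < 300 else 0.01
    return max_len * per_step

max_len = 1
error = 0.0
for it in range(1000):
    error += fake_example_error(it, max_len)

    if it % 100 == 99:                  # every 100 iterations...
        if error < 5 * max_len:         # ...if we're doing well enough...
            max_len += 1                # ...allow longer sequences
        error = 0.0                     # and start counting again

print(max_len)
```

With these fake errors, max_len stays at 1 for the first 300 iterations and then grows by one every 100 iterations after that.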


layers = list()
stack = NeuralStack(stack_width=stack_width,o_prime_dim=o_prime_dim)    
for i in xrange(len(X)):
    layer = {}

    layer['x'] = X[i]

    if(i == 0):
        layer['h_t-1'] = np.zeros(h_dim)
        layer['h_t-1'][0] = 1
        layer['r_t-1'] = np.zeros(stack_width)
        layer['r_t-1'][0] = 1
    else:
        layer['h_t-1'] = layers[i-1]['h_t']
        layer['r_t-1'] = layers[i-1]['r_t']

    layer['h_t'] = tanh(np.dot(layer['x'],W_xh) + np.dot(layer['h_t-1'],W_hh) + np.dot(layer['r_t-1'],W_rh) + b_h)
    layer['o_prime_t'] = tanh(np.dot(layer['h_t'],W_hop)+b_op)
    layer['o_t'] = tanh(np.dot(layer['o_prime_t'],W_op_o) + b_o)

    if(i < len(X)-1):
        layer['d_t'] = sigmoid(np.dot(layer['o_prime_t'],W_op_d) + b_d)
        layer['u_t'] = sigmoid(np.dot(layer['o_prime_t'],W_op_u) + b_u)
        layer['v_t'] = tanh(np.dot(layer['o_prime_t'],W_op_v) + b_v)

        layer['r_t'] = stack.pushAndPopForward(layer['v_t'],layer['d_t'],layer['u_t'])

    layers.append(layer)

This is the forward propagation step. Notice that it's just a recurrent neural network with one extra input r_t-1 which is the neural stack r_t from the previous timestep. Generally, you can also see that x->h, h->o_prime, o_prime->stack_controllers, stack_controllers->stack, stack->r_t, and then r_t is fed into the next layer. Study this portion of code until the "information flow" becomes clear. Also notice the convention I use of storing the intermediate variables into the "layers" list. This will help make backpropagation easier later. Note that the prediction of the neural network is layer['o_t'] which isn't exactly what we read off of the stack. Information must travel from the stack to the hidden layer and out through layer['o_t']. We'll talk more about how to encourage this in a minute.
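If the information flow is still fuzzy, it can help to run a single timestep on its own and look at the shapes. This sketch uses fresh random weights and zero vectors for the previous hidden state and previous read, and it stops just before the stack itself. The helper `small_random` and the seed are assumptions for illustration.

```python
import numpy as np

np.random.seed(0)   # arbitrary seed for repeatability

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def small_random(*shape):
    # same init range as the post: uniform in [-0.1, 0.1)
    return np.random.rand(*shape) * 0.2 - 0.1

x_dim, h_dim, o_prime_dim, stack_width, y_dim = 3, 16, 16, 2, 3

W_xh, W_hh = small_random(x_dim, h_dim), small_random(h_dim, h_dim)
W_rh, b_h = small_random(stack_width, h_dim), small_random(h_dim)
W_hop, b_op = small_random(h_dim, o_prime_dim), small_random(o_prime_dim)
W_op_d, b_d = small_random(o_prime_dim, 1), small_random(1)
W_op_u, b_u = small_random(o_prime_dim, 1), small_random(1)
W_op_v, b_v = small_random(o_prime_dim, stack_width), small_random(stack_width)
W_op_o, b_o = small_random(o_prime_dim, y_dim), small_random(y_dim)

x = np.array([0.0, 1.0, 0.0])        # one-hot input for this timestep
h_prev = np.zeros(h_dim)             # previous hidden state
r_prev = np.zeros(stack_width)       # previous stack read

# x, h_t-1 and r_t-1 all feed the recurrent hidden layer
h_t = np.tanh(np.dot(x, W_xh) + np.dot(h_prev, W_hh) + np.dot(r_prev, W_rh) + b_h)
# hidden layer -> o_prime
o_prime_t = np.tanh(np.dot(h_t, W_hop) + b_op)
# o_prime -> stack controls and the network's prediction
d_t = sigmoid(np.dot(o_prime_t, W_op_d) + b_d)   # push strength
u_t = sigmoid(np.dot(o_prime_t, W_op_u) + b_u)   # pop strength
v_t = np.tanh(np.dot(o_prime_t, W_op_v) + b_v)   # vector to push
o_t = np.tanh(np.dot(o_prime_t, W_op_o) + b_o)   # prediction

print(h_t.shape, o_prime_t.shape, d_t.shape, u_t.shape, v_t.shape, o_t.shape)
```

d_t and u_t are single numbers (wrapped in length-1 arrays), v_t has the width of the stack, and o_t has the width of the output.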


for i in list(reversed(xrange(len(X)))):
    layer = layers[i]

    layer['o_t_error'] = Y[i] - layer['o_t']
    error += np.sum(np.abs(layer['o_t_error']))
    if(it % 100 == 99):
        if(i == len(X)-1):
            if(it % 10000 == 9999):
                print "MaxLen:"+str(max_len)+ " Iter:" + str(it) + " Error:" + str(error) + "True:" + str(sequence) + " Pred:" + str(map(lambda x:np.argmax(x['o_t']),layers[sub_sequence_length:]))
            if(error < (5*max_len)):
                max_len+=1              
            error = 0

#                     sub_sequence_length += 1                
#             print str(Y[i]) + " - " + str(layer['o_t']) + " = " + str(layer['o_t_error'])        
    layer['o_t_delta'] = layer['o_t_error'] * tanh_out2deriv(layer['o_t'])

    layer['o_prime_t_error'] = np.dot(layer['o_t_delta'],W_op_o.T)

    if(i < len(X)-1):
        layer['r_t_error'] = layers[i+1]['r_t-1_error']
        stack.backprop_single(layer['r_t_error'])

        layer['v_t_error'] = stack.V_delta[i][i]
        layer['v_t_delta'] = layer['v_t_error'] * tanh_out2deriv(layer['v_t'])
        layer['o_prime_t_error'] += np.dot(layer['v_t_delta'],W_op_v.T)

        layer['u_t_error'] = stack.u_error[i]
        layer['u_t_delta'] = layer['u_t_error'] * sigmoid_out2deriv(layer['u_t'])
        layer['o_prime_t_error'] += np.dot(layer['u_t_delta'],W_op_u.T)

        layer['d_t_error'] = stack.d_error[i]
        layer['d_t_delta'] = layer['d_t_error'] * sigmoid_out2deriv(layer['d_t'])
        layer['o_prime_t_error'] += np.dot(layer['d_t_delta'],W_op_d.T)


    layer['o_prime_t_delta'] = layer['o_prime_t_error'] * tanh_out2deriv(layer['o_prime_t'])
    layer['h_t_error'] = np.dot(layer['o_prime_t_delta'],W_hop.T)

    if(i < len(X)-1):
        layer['h_t_error'] += layers[i+1]['h_t-1_error']

    layer['h_t_delta'] = layer['h_t_error'] * tanh_out2deriv(layer['h_t'])
    layer['h_t-1_error'] = np.dot(layer['h_t_delta'],W_hh.T)
    layer['r_t-1_error'] = np.dot(layer['h_t_delta'],W_rh.T)

This is the backpropagation step. Again, we're just taking the delta we get from the neural stack and then backpropagating it through all the layers just like we did with the recurrent neural network in the previous blogpost. Also note the logic inside the if(it % 100 == 99) check near the top of the loop. Every 100 iterations, if the accumulated error is below 5*max_len, then it increases the maximum length of the sequences it's trying to model by 1.

for i in xrange(len(X)):
    layer = layers[i]
    max_alpha = 0.005 * batch_size
    alpha = max_alpha / sub_sequence_length

    W_xh_update += alpha * np.outer(layer['x'],layer['h_t_delta'])
    W_hh_update += alpha * np.outer(layer['h_t-1'],layer['h_t_delta'])
    W_rh_update += alpha * np.outer(layer['r_t-1'],layer['h_t_delta'])
    b_h_update += alpha * layer['h_t_delta']

    W_hop_update += alpha * np.outer(layer['h_t'],layer['o_prime_t_delta'])
    b_op_update += alpha * layer['o_prime_t_delta']

    if(i < len(X)-1):
        W_op_d_update += alpha * np.outer(layer['o_prime_t'],layer['d_t_delta'])
        W_op_u_update += alpha * np.outer(layer['o_prime_t'],layer['u_t_delta'])
        W_op_v_update += alpha * np.outer(layer['o_prime_t'],layer['v_t_delta'])

        b_d_update += alpha * layer['d_t_delta']
        b_u_update += alpha * layer['u_t_delta']
        b_v_update += alpha * layer['v_t_delta']

    W_op_o_update += alpha * np.outer(layer['o_prime_t'],layer['o_t_delta'])
    b_o_update += alpha * layer['o_t_delta']

At this phase, we create our weight updates by taking the outer product of each layer's input activations with the deltas at the immediately following layer, scaled by alpha. We accumulate these updates in our _update variables. Note that this doesn't change the weights yet. It just collects the updates.
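
To see why an outer product is the right shape and value, here is a small stand-alone check for a single tanh layer. The names `x`, `W`, `y` and the squared-error loss are assumptions made for this sketch, not part of the listing; the point is only that `np.outer(input, delta)` matches a finite-difference estimate of the downhill direction.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(3)                      # layer input (plays the role of layer['h_t'])
W = rng.rand(3, 2) * 0.2 - 0.1       # weights (plays the role of W_hop)
y = np.array([1.0, -1.0])            # made-up target

def loss(weights):
    out = np.tanh(np.dot(x, weights))
    return 0.5 * np.sum((y - out) ** 2)

out = np.tanh(np.dot(x, W))
delta = (y - out) * (1 - out ** 2)   # error times tanh_out2deriv(out)
update = np.outer(x, delta)          # same shape as W

# finite-difference estimate of -dLoss/dW for comparison
eps = 1e-6
numeric = np.zeros_like(W)
for r in range(W.shape[0]):
    for c in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[r, c] += eps
        W_minus[r, c] -= eps
        numeric[r, c] = -(loss(W_plus) - loss(W_minus)) / (2 * eps)

print(np.allclose(update, numeric, atol=1e-6))
```

Since the error is defined as target minus output, the update already points downhill, which is why the listing adds it to the weights rather than subtracting it.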


if(it % batch_size == (batch_size-1)):
    W_xh += W_xh_update/batch_size
    W_xh_update *= 0
    
    W_hh += W_hh_update/batch_size
    W_hh_update *= 0
    
    W_rh += W_rh_update/batch_size
    W_rh_update *= 0
    
    b_h += b_h_update/batch_size
    b_h_update *= 0
    
    W_hop += W_hop_update/batch_size
    W_hop_update *= 0
    
    b_op += b_op_update/batch_size
    b_op_update *= 0
    
    W_op_d += W_op_d_update/batch_size
    W_op_d_update *= 0
    
    W_op_u += W_op_u_update/batch_size
    W_op_u_update *= 0
    
    W_op_v += W_op_v_update/batch_size
    W_op_v_update *= 0
    
    b_d += b_d_update/batch_size
    b_d_update *= 0
    
    b_d += b_d * 0.00025 * batch_size  # extra nudge: scales the push bias by (1 + 0.00025*batch_size) every mini-batch
    b_u += b_u_update/batch_size
    b_u_update *= 0
    
    b_v += b_v_update/batch_size
    b_v_update *= 0
    
    W_op_o += W_op_o_update/batch_size
    W_op_o_update *= 0
    
    b_o += b_o_update/batch_size
    b_o_update *= 0

And if we are at the end of a mini-batch, then we update the weights using the average of all the updates we had accumulated so far. We then clear out each _update variable by multiplying it by 0.
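
The accumulate, apply, clear pattern can be tried in isolation. This is a toy sketch with one made-up parameter vector `W` and invented per-example updates, not the network above:

```python
import numpy as np

batch_size = 4
W = np.zeros(2)
W_update = np.zeros_like(W)

# made-up per-example updates, one per training iteration
per_example = [np.array([1.0, 0.0]), np.array([3.0, 2.0]),
               np.array([0.0, 2.0]), np.array([4.0, 0.0])]

for it, upd in enumerate(per_example):
    W_update += upd                          # collect, do not apply yet
    if it % batch_size == (batch_size - 1):  # last example of the mini-batch
        W += W_update / batch_size           # apply the average update
        W_update *= 0                        # clear the accumulator in place

print(W)
```

Multiplying by 0 keeps the same array object, so anything holding a reference to the accumulator still sees the cleared values. With this many parameters, the same three lines could also be written once in a loop over a dictionary of weights.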

And We Have It!!!

So, here is all the code in one big file for you to run:

import numpy as np

def sigmoid(x):
    return 1/(1+np.exp(-x))

def sigmoid_out2deriv(out):
    return out * (1 - out)

def tanh(x):
    return np.tanh(x)

def tanh_out2deriv(out):
    return (1 - out**2)

def relu(x,deriv=False):
    if(deriv):
        return int(x >= 0)
    return max(0,x)

class NeuralStack():
    def __init__(self,stack_width=2,o_prime_dim=6):
        self.stack_width = stack_width
        self.o_prime_dim = o_prime_dim
        self.reset()
        
    def reset(self):
        # INIT STACK
        self.V = list() # stack states
        self.s = list() # stack strengths 
        self.d = list() # push strengths
        self.u = list() # pop strengths
        self.r = list()
        self.o = list()

        self.V_delta = list() # stack states
        self.s_delta = list() # stack strengths 
        self.d_error = list() # push strengths
        self.u_error = list() # pop strengths
        
        self.t = 0
        
    def pushAndPopForward(self,v_t,d_t,u_t):

        self.d.append(d_t)
        self.d_error.append(0)

        self.u.append(u_t)
        self.u_error.append(0)

        new_s = np.zeros(self.t+1)
        for i in xrange(self.t+1):
            new_s[i] = self.s_t(i)
        self.s.append(new_s)
        self.s_delta.append(np.zeros_like(new_s))

        if(len(self.V) == 0):
            V_t = np.zeros((0,self.stack_width))
        else:
            V_t = self.V[-1]
        self.V.append(np.concatenate((V_t,np.atleast_2d(v_t)),axis=0))
        self.V_delta.append(np.zeros_like(self.V[-1]))
        
        r_t = self.r_t()
        self.r.append(r_t)
        
        self.t += 1
        return r_t
    
    def s_t(self,i):
        if(i >= 0 and i < self.t):
            inner_sum = self.s[self.t-1][i+1:self.t-0]
            return relu(self.s[self.t-1][i] - relu(self.u[self.t] - np.sum(inner_sum)))
        elif(i == self.t):
            return self.d[self.t]
        else:
            print "Problem"
            
    def s_t_error(self,i,error):
        if(i >= 0 and i < self.t):
            if(self.s_t(i) >= 0):
                self.s_delta[self.t-1][i] += error
                if(relu(self.u[self.t] - np.sum(self.s[self.t-1][i+1:self.t-0])) >= 0):
                    self.u_error[self.t] -= error
                    self.s_delta[self.t-1][i+1:self.t-0] += error
        elif(i == self.t):
            self.d_error[self.t] += error
        else:
            print "Problem"
            
    def r_t(self):
        r_t_out = np.zeros(self.stack_width)
        for i in xrange(0,self.t+1):
            temp = min(self.s[self.t][i],relu(1 - np.sum(self.s[self.t][i+1:self.t+1])))
            r_t_out += temp * self.V[self.t][i]
        return r_t_out
            
    def r_t_error(self,r_t_error):
        for i in xrange(0, self.t+1):
            temp = min(self.s[self.t][i],relu(1 - np.sum(self.s[self.t][i+1:self.t+1])))
            self.V_delta[self.t][i] += temp * r_t_error
            temp_error = np.sum(r_t_error * self.V[self.t][i])

            if(self.s[self.t][i] < relu(1 - np.sum(self.s[self.t][i+1:self.t+1]))):
                self.s_delta[self.t][i] += temp_error
            else:
                if(relu(1 - np.sum(self.s[self.t][i+1:self.t+1])) > 0):
                    self.s_delta[self.t][i+1:self.t+1] -= temp_error # minus equal becuase of the (1-).. and drop the 1
    def backprop_single(self,r_t_error):
        self.t -= 1
        self.r_t_error(r_t_error)
        for i in reversed(xrange(self.t+1)):
            self.s_t_error(i,self.s_delta[self.t][i])
    
    def backprop(self,all_errors_in_order_of_training_data):
        errors = all_errors_in_order_of_training_data
        for error in reversed(list((errors))):
            self.backprop_single(error)

        
options = 2
sub_sequence_length = 5
sequence_length = sub_sequence_length*2

sequence = (np.random.random(sub_sequence_length)*options).astype(int)+1
sequence[-1] = 0
sequence

X = np.zeros((sub_sequence_length*2,options+1))
Y = np.zeros_like(X)
for i in xrange(len(sequence)):
    X[i][sequence[i]] = 1
    Y[-i-1][sequence[i]] = 1  # reversed target; this first X/Y pair is only used to size the weights below

sequence_length = len(X)
x_dim = X.shape[1]
h_dim = 16
o_prime_dim = 16
stack_width = options
y_dim = Y.shape[1]     


sub_sequence_length = 2

W_xh = (np.random.rand(x_dim,h_dim)*0.2) - 0.1
W_xh_update = np.zeros_like(W_xh)

W_hh = (np.random.rand(h_dim,h_dim)*0.2) - 0.1
W_hh_update = np.zeros_like(W_hh)

W_rh = (np.random.rand(stack_width,h_dim)*0.2) - 0.1
W_rh_update = np.zeros_like(W_rh)

b_h = (np.random.rand(h_dim)*0.2) - 0.1
b_h_update = np.zeros_like(b_h)

W_hop = (np.random.rand(h_dim,o_prime_dim) * 0.2) - 0.1
W_hop_update = np.zeros_like(W_hop)

b_op = (np.random.rand(o_prime_dim)*0.2) - 0.1
b_op_update = np.zeros_like(b_op)

W_op_d = (np.random.rand(o_prime_dim,1)*0.2) - 0.1
W_op_d_update = np.zeros_like(W_op_d)

W_op_u = (np.random.rand(o_prime_dim,1)*0.2) - 0.1
W_op_u_update = np.zeros_like(W_op_u)

W_op_v = (np.random.rand(o_prime_dim,stack_width)*0.2) - 0.1
W_op_v_update = np.zeros_like(W_op_v)

W_op_o = (np.random.rand(o_prime_dim,y_dim)*0.2) - 0.1
W_op_o_update = np.zeros_like(W_op_o)

b_d = (np.random.rand(1)*0.2)+0.1
b_d_update = np.zeros_like(b_d)

b_u = (np.random.rand(1)*0.2)-0.1
b_u_update = np.zeros_like(b_u)

b_v = (np.random.rand(stack_width)*0.2)-0.1
b_v_update = np.zeros_like(b_v)

b_o = (np.random.rand(y_dim)*0.2)-0.1
b_o_update = np.zeros_like(b_o)

error = 0
max_len = 1
batch_size = 10
for it in xrange(1000000000):
    
    sub_sequence_length = np.random.randint(max_len)+1
    sequence = (np.random.random(sub_sequence_length)*options).astype(int)+1
#     sequence[-1] = 0
    sequence
    
    X = np.zeros((sub_sequence_length*2,options+1))
    Y = np.zeros_like(X)
    for i in xrange(len(sequence)):
        X[i][sequence[i]] = 1
        X[i][0] = 1
        Y[-i-1][sequence[i]] = 1
    
    layers = list()
    stack = NeuralStack(stack_width=stack_width,o_prime_dim=o_prime_dim)    
    for i in xrange(len(X)):
        layer = {}

        layer['x'] = X[i]

        if(i == 0):
            layer['h_t-1'] = np.zeros(h_dim)
            layer['h_t-1'][0] = 1
            layer['r_t-1'] = np.zeros(stack_width)
            layer['r_t-1'][0] = 1
        else:
            layer['h_t-1'] = layers[i-1]['h_t']
            layer['r_t-1'] = layers[i-1]['r_t']

        layer['h_t'] = tanh(np.dot(layer['x'],W_xh) + np.dot(layer['h_t-1'],W_hh) + np.dot(layer['r_t-1'],W_rh) + b_h)
        layer['o_prime_t'] = tanh(np.dot(layer['h_t'],W_hop)+b_op)
        layer['o_t'] = tanh(np.dot(layer['o_prime_t'],W_op_o) + b_o)

        if(i < len(X)-1):
            layer['d_t'] = sigmoid(np.dot(layer['o_prime_t'],W_op_d) + b_d)
            layer['u_t'] = sigmoid(np.dot(layer['o_prime_t'],W_op_u) + b_u)
            layer['v_t'] = tanh(np.dot(layer['o_prime_t'],W_op_v) + b_v)

            layer['r_t'] = stack.pushAndPopForward(layer['v_t'],layer['d_t'],layer['u_t'])

        layers.append(layer)

    for i in list(reversed(xrange(len(X)))):
        layer = layers[i]

        layer['o_t_error'] = Y[i] - layer['o_t']
        error += np.sum(np.abs(layer['o_t_error']))
        if(it % 100 == 99):
            if(i == len(X)-1):
                if(it % 10000 == 9999):
                    print "MaxLen:"+str(max_len)+ " Iter:" + str(it) + " Error:" + str(error) + "True:" + str(sequence) + " Pred:" + str(map(lambda x:np.argmax(x['o_t']),layers[sub_sequence_length:]))
                if(error < (5*max_len)):
                    max_len+=1              
                error = 0

        layer['o_t_delta'] = layer['o_t_error'] * tanh_out2deriv(layer['o_t'])

        layer['o_prime_t_error'] = np.dot(layer['o_t_delta'],W_op_o.T)

        if(i < len(X)-1):
            layer['r_t_error'] = layers[i+1]['r_t-1_error']
            stack.backprop_single(layer['r_t_error'])

            layer['v_t_error'] = stack.V_delta[i][i]
            layer['v_t_delta'] = layer['v_t_error'] * tanh_out2deriv(layer['v_t'])
            layer['o_prime_t_error'] += np.dot(layer['v_t_delta'],W_op_v.T)

            layer['u_t_error'] = stack.u_error[i]
            layer['u_t_delta'] = layer['u_t_error'] * sigmoid_out2deriv(layer['u_t'])
            layer['o_prime_t_error'] += np.dot(layer['u_t_delta'],W_op_u.T)

            layer['d_t_error'] = stack.d_error[i]
            layer['d_t_delta'] = layer['d_t_error'] * sigmoid_out2deriv(layer['d_t'])
            layer['o_prime_t_error'] += np.dot(layer['d_t_delta'],W_op_d.T)


        layer['o_prime_t_delta'] = layer['o_prime_t_error'] * tanh_out2deriv(layer['o_prime_t'])
        layer['h_t_error'] = np.dot(layer['o_prime_t_delta'],W_hop.T)

        if(i < len(X)-1):
            layer['h_t_error'] += layers[i+1]['h_t-1_error']

        layer['h_t_delta'] = layer['h_t_error'] * tanh_out2deriv(layer['h_t'])
        layer['h_t-1_error'] = np.dot(layer['h_t_delta'],W_hh.T)
        layer['r_t-1_error'] = np.dot(layer['h_t_delta'],W_rh.T)

    for i in xrange(len(X)):
        layer = layers[i]
        max_alpha = 0.005 * batch_size
        alpha = max_alpha / sub_sequence_length

        W_xh_update += alpha * np.outer(layer['x'],layer['h_t_delta'])
        W_hh_update += alpha * np.outer(layer['h_t-1'],layer['h_t_delta'])
        W_rh_update += alpha * np.outer(layer['r_t-1'],layer['h_t_delta'])
        b_h_update += alpha * layer['h_t_delta']

        W_hop_update += alpha * np.outer(layer['h_t'],layer['o_prime_t_delta'])
        b_op_update += alpha * layer['o_prime_t_delta']

        if(i < len(X)-1):
            W_op_d_update += alpha * np.outer(layer['o_prime_t'],layer['d_t_delta'])
            W_op_u_update += alpha * np.outer(layer['o_prime_t'],layer['u_t_delta'])
            W_op_v_update += alpha * np.outer(layer['o_prime_t'],layer['v_t_delta'])

            b_d_update += alpha * layer['d_t_delta']
            b_u_update += alpha * layer['u_t_delta']
            b_v_update += alpha * layer['v_t_delta']

        W_op_o_update += alpha * np.outer(layer['o_prime_t'],layer['o_t_delta'])
        b_o_update += alpha * layer['o_t_delta']


    if(it % batch_size == (batch_size-1)):
        W_xh += W_xh_update/batch_size
        W_xh_update *= 0
        
        W_hh += W_hh_update/batch_size
        W_hh_update *= 0
        
        W_rh += W_rh_update/batch_size
        W_rh_update *= 0
        
        b_h += b_h_update/batch_size
        b_h_update *= 0
        
        W_hop += W_hop_update/batch_size
        W_hop_update *= 0
        
        b_op += b_op_update/batch_size
        b_op_update *= 0
        
        W_op_d += W_op_d_update/batch_size
        W_op_d_update *= 0
        
        W_op_u += W_op_u_update/batch_size
        W_op_u_update *= 0
        
        W_op_v += W_op_v_update/batch_size
        W_op_v_update *= 0
        
        b_d += b_d_update/batch_size
        b_d_update *= 0
        
        b_d += b_d * 0.00025 * batch_size  # extra nudge: scales the push bias by (1 + 0.00025*batch_size) every mini-batch
        b_u += b_u_update/batch_size
        b_u_update *= 0
        
        b_v += b_v_update/batch_size
        b_v_update *= 0
        
        W_op_o += W_op_o_update/batch_size
        W_op_o_update *= 0
        
        b_o += b_o_update/batch_size
        b_o_update *= 0

Expected Output:

If you run this code overnight on your CPU... you should see output that looks a lot like this. Note that a correct prediction is the reverse of the original sequence.

.......
.......
.......
.......
MaxLen:19 Iter:2789999 Error:168.444794145True:[2 2 1 1 2 1 2 1 1 2] Pred:[2, 1, 1, 2, 1, 2, 1, 1, 2, 2]
MaxLen:19 Iter:2799999 Error:207.262352698True:[1] Pred:[1]
MaxLen:19 Iter:2809999 Error:182.105266119True:[1 1 2 1 2 2 2 1 1 2 1] Pred:[1, 2, 1, 1, 2, 2, 2, 1, 2, 1, 1]
MaxLen:19 Iter:2819999 Error:184.174791858True:[2 1 2 2] Pred:[2, 2, 1, 2]
MaxLen:19 Iter:2829999 Error:206.158101496True:[2 1 2 2 2 1 1 2] Pred:[2, 1, 1, 2, 2, 2, 1, 2]
MaxLen:19 Iter:2839999 Error:209.114103766True:[1 2 1 2 2 2 1 1 1 1 2] Pred:[2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1]
MaxLen:19 Iter:2849999 Error:167.128615254True:[2 1 2 1 1 2 1 2 1 2 1 1 2 2 2 1 2] Pred:[2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2]
MaxLen:19 Iter:2859999 Error:200.380921774True:[2 2 2 1 1] Pred:[1, 1, 2, 2, 2]
MaxLen:19 Iter:2869999 Error:112.94541202True:[2 2 1 2 2 1 2 2 1 1 2 1 2] Pred:[2, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1, 2, 2]
MaxLen:19 Iter:2879999 Error:160.091183839True:[2 1 2 2 1 2] Pred:[2, 1, 2, 2, 1, 2]
MaxLen:19 Iter:2889999 Error:112.598129039True:[1 1 2] Pred:[2, 1, 1]
MaxLen:19 Iter:2899999 Error:186.041933391True:[2 2 2 1 1 2 1 2 2 1 1 2 2 1] Pred:[1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 2, 2]
MaxLen:19 Iter:2909999 Error:236.064449725True:[2 2 2 1 1 2 2 1 2 2 1 1 1 1 1 2 2] Pred:[2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1]
MaxLen:19 Iter:2919999 Error:152.776428031True:[1 1 2 1 1 1 1 2 2 1 1 1 2 2 1 2 2] Pred:[2, 2, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2]
MaxLen:19 Iter:2929999 Error:143.007796452True:[1 2 1 1] Pred:[1, 1, 2, 1]
MaxLen:19 Iter:2939999 Error:255.744221264True:[2 2 1 1 1 2 1] Pred:[1, 2, 1, 1, 1, 2, 2]
MaxLen:19 Iter:2949999 Error:183.147078344True:[1 2 1 1 2 2 1 1 1 1 2 1 1 1 2 1] Pred:[1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2]
MaxLen:19 Iter:2959999 Error:231.447857024True:[2 2 1 1] Pred:[1, 1, 2, 2]
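
Reading these lines by eye gets tedious for the longer sequences, so here is a tiny helper for checking one. The function name `is_reversed_copy` is made up for this sketch; the two examples are copied from the log above.

```python
def is_reversed_copy(true_seq, pred_seq):
    """True when pred_seq is exactly true_seq read backwards."""
    return list(pred_seq) == list(reversed(list(true_seq)))

# Iter:2859999 from the log above
short_ok = is_reversed_copy([2, 2, 2, 1, 1], [1, 1, 2, 2, 2])

# Iter:2949999 from the log above (the last predicted symbol is off)
long_ok = is_reversed_copy(
    [1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1],
    [1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2])

print(short_ok)
print(long_ok)
```

The short sequence is reproduced perfectly, while the 16-element one has a mistake, which fits the fairly high error values still being reported at this point.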

Something Went Wrong!!!

At this point, I declared victory! I broke open a few brewskis and kicked back. Yay! DeepMind's Neural Stack before my very eyes! How 'bout it! Alas... I started taking things apart and realized that something was wrong. Immediately after the log above ended, I printed out the following two variables.

stack.u

[array([ 0.52627134]),
 array([ 0.48875265]),
 array([ 0.4833596]),
 array([ 0.51072936]),
 array([ 0.51525512]),
 array([ 0.55418935]),
 array([ 0.51561233]),
 array([ 0.54271031]),
 array([ 0.46972942]),
 array([ 0.50030726]),
 array([ 0.50420808]),
 array([ 0.51277284]),
 array([ 0.49249017]),
 array([ 0.52770061]),
 array([ 0.53647627]),
 array([ 0.52879516]),
 array([ 0.57190229]),
 array([ 0.51895631]),
 array([ 0.50232574]),
 array([ 0.44804661]),
 array([ 0.50789469]),
 array([ 0.53620111]),
 array([ 0.57897974]),
 array([ 0.53155877])]

stack.d

[array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.]),
 array([ 1.])]

Disaster... the neural network somehow learned how to model these sequences by pushing all of them onto the stack and then only popping off each number half at a time. What does this mean? Honestly, it could certainly be that I didn't train long enough. What do we do? Andrew... why are you sharing this with us? Was this 30 pages of blogging all for nothing?!?!
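
To get a feel for what those numbers do to the stack, here is a small sketch that replays the strength rule from `NeuralStack.s_t` and the read rule from `NeuralStack.r_t` with idealised values: a push strength of exactly 1 and a pop strength of exactly 0.5 at every step. The real pop strengths only hover around 0.5, and the function names here are made up for illustration.

```python
def step_strengths(prev_s, d_t, u_t):
    """One push/pop update of the stack strengths (same rule as NeuralStack.s_t)."""
    new_s = []
    for i in range(len(prev_s)):
        above = sum(prev_s[i + 1:])              # strength sitting on top of entry i
        popped = max(0.0, u_t - above)           # pop strength that reaches entry i
        new_s.append(max(0.0, prev_s[i] - popped))
    new_s.append(d_t)                            # the newly pushed entry
    return new_s

def read_weights(s):
    """How much of each entry the read sees (same rule as NeuralStack.r_t)."""
    return [min(s[i], max(0.0, 1.0 - sum(s[i + 1:]))) for i in range(len(s))]

s = []
for t in range(4):
    s = step_strengths(s, d_t=1.0, u_t=0.5)      # idealised: d = 1, u = 0.5
    print("t=%d strengths=%s read=%s" % (t, s, read_weights(s)))
```

Under these assumptions every entry loses half of its strength once and then sits there, shielded by the newer pushes, while the read only ever sees the value that was just pushed. That suggests, though it doesn't prove, that the network is not really using the stack as last-in-first-out memory.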

At this point, we have reached a very realistic point in a neural network researcher's lifecycle. Furthermore, it's one that the paper's authors have discussed somewhat extensively both in the paper and in external presentations. If we're not careful, the network can discover unexpected ways of solving the problem that we give it. So, what do we do?

Part 6: When Things Really Get Interesting

I did end up getting the neural network to push and pop correctly. Here's the code. This blog is already like 80 pages long on my laptop so... Enjoy the puzzle!

Hint: Autoencoder


import numpy as np

def sigmoid(x):
    return 1/(1+np.exp(-x))

def sigmoid_out2deriv(out):
    return out * (1 - out)

def tanh(x):
    return np.tanh(x)

def tanh_out2deriv(out):
    return (1 - out**2)

def relu(x,deriv=False):
    if(deriv):
        return int(x >= 0)
    return max(0,x)

class NeuralStack():
    def __init__(self,stack_width=2,o_prime_dim=6):
        self.stack_width = stack_width
        self.o_prime_dim = o_prime_dim
        self.reset()
        
    def reset(self):
        # INIT STACK
        self.V = list() # stack states
        self.s = list() # stack strengths 
        self.d = list() # push strengths
        self.u = list() # pop strengths
        self.r = list()
        self.o = list()

        self.V_delta = list() # stack states
        self.s_delta = list() # stack strengths 
        self.d_error = list() # push strengths
        self.u_error = list() # pop strengths
        
        self.t = 0
        
    def pushAndPopForward(self,v_t,d_t,u_t):

        self.d.append(d_t)
        self.d_error.append(0)

        self.u.append(u_t)
        self.u_error.append(0)

        new_s = np.zeros(self.t+1)
        for i in xrange(self.t+1):
            new_s[i] = self.s_t(i)
        self.s.append(new_s)
        self.s_delta.append(np.zeros_like(new_s))

        if(len(self.V) == 0):
            V_t = np.zeros((0,self.stack_width))
        else:
            V_t = self.V[-1]
        self.V.append(np.concatenate((V_t,np.atleast_2d(v_t)),axis=0))
        self.V_delta.append(np.zeros_like(self.V[-1]))
        
        r_t = self.r_t()
        self.r.append(r_t)
        
        self.t += 1
        return r_t
    
    def s_t(self,i):
        if(i >= 0 and i < self.t):
            inner_sum = self.s[self.t-1][i+1:self.t-0]
            return relu(self.s[self.t-1][i] - relu(self.u[self.t] - np.sum(inner_sum)))
        elif(i == self.t):
            return self.d[self.t]
        else:
            print "Problem"
            
    def s_t_error(self,i,error):
        if(i >= 0 and i < self.t):
            if(self.s_t(i) >= 0):
                self.s_delta[self.t-1][i] += error
                if(relu(self.u[self.t] - np.sum(self.s[self.t-1][i+1:self.t-0])) >= 0):
                    self.u_error[self.t] -= error
                    self.s_delta[self.t-1][i+1:self.t-0] += error
        elif(i == self.t):
            self.d_error[self.t] += error
        else:
            print "Problem"
            
    def r_t(self):
        r_t_out = np.zeros(self.stack_width)
        for i in xrange(0,self.t+1):
            temp = min(self.s[self.t][i],relu(1 - np.sum(self.s[self.t][i+1:self.t+1])))
            r_t_out += temp * self.V[self.t][i]
        return r_t_out
            
    def r_t_error(self,r_t_error):
        for i in xrange(0, self.t+1):
            temp = min(self.s[self.t][i],relu(1 - np.sum(self.s[self.t][i+1:self.t+1])))
            self.V_delta[self.t][i] += temp * r_t_error
            temp_error = np.sum(r_t_error * self.V[self.t][i])

            if(self.s[self.t][i] < relu(1 - np.sum(self.s[self.t][i+1:self.t+1]))):
                self.s_delta[self.t][i] += temp_error
            else:
                if(relu(1 - np.sum(self.s[self.t][i+1:self.t+1])) > 0):
                    self.s_delta[self.t][i+1:self.t+1] -= temp_error # minus equal becuase of the (1-).. and drop the 1
    def backprop_single(self,r_t_error):
        self.t -= 1
        self.r_t_error(r_t_error)
        for i in reversed(xrange(self.t+1)):
            self.s_t_error(i,self.s_delta[self.t][i])
    
    def backprop(self,all_errors_in_order_of_training_data):
        errors = all_errors_in_order_of_training_data
        for error in reversed(list((errors))):
            self.backprop_single(error)

        
options = 2
sub_sequence_length = 5
sequence_length = sub_sequence_length*2

sequence = (np.random.random(sub_sequence_length)*options).astype(int)+2
sequence

X = np.zeros((sub_sequence_length*2,options+2))
Y = np.zeros_like(X)
for i in xrange(len(sequence)):
    X[i][sequence[i]] = 1
    X[-i-1][0] = 1
    X[i][1] = 1
    Y[-i-1][sequence[i]] = 1

sequence_length = len(X)
x_dim = X.shape[1]
h_dim = 8
o_prime_dim = 8
stack_width = 2
y_dim = Y.shape[1]


np.random.seed(1)
sub_sequence_length = 2

W_xh = (np.random.rand(x_dim,h_dim)*0.2) - 0.1
W_xh_update = np.zeros_like(W_xh)

W_hox = (np.random.rand(h_dim,x_dim)*0.2) - 0.1
W_hox_update = np.zeros_like(W_hox)

W_opx = (np.random.rand(o_prime_dim,x_dim)*0.2) - 0.1
W_opx_update = np.zeros_like(W_opx)

W_hh = (np.random.rand(h_dim,h_dim)*0.2) - 0.1
W_hh_update = np.zeros_like(W_hh)

W_rh = (np.random.rand(stack_width,h_dim)*0.2) - 0.1
W_rh_update = np.zeros_like(W_rh)

b_h = (np.random.rand(h_dim)*0.2) - 0.1
b_h_update = np.zeros_like(b_h)

W_hop = (np.random.rand(h_dim,o_prime_dim) * 0.2) - 0.1
W_hop_update = np.zeros_like(W_hop)

b_op = (np.random.rand(o_prime_dim)*0.2) - 0.1
b_op_update = np.zeros_like(b_op)

W_op_d = (np.random.rand(o_prime_dim,1)*0.2) - 0.1
W_op_d_update = np.zeros_like(W_op_d)

W_op_u = (np.random.rand(o_prime_dim,1)*0.2) - 0.1
W_op_u_update = np.zeros_like(W_op_u)

W_op_v = (np.random.rand(o_prime_dim,stack_width)*0.2) - 0.1
W_op_v_update = np.zeros_like(W_op_v)

W_op_o = (np.random.rand(o_prime_dim,y_dim)*0.2) - 0.1
W_op_o_update = np.zeros_like(W_op_o)

b_d = (np.random.rand(1)*0.2)+1
b_d_update = np.zeros_like(b_d)

b_u = (np.random.rand(1)*0.2)-1
b_u_update = np.zeros_like(b_u)

b_v = (np.random.rand(stack_width)*0.2)-0.1
b_v_update = np.zeros_like(b_v)

b_o = (np.random.rand(y_dim)*0.2)-0.1
b_o_update = np.zeros_like(b_o)

error = 0
reconstruct_error = 0
reconstruct_error_2 = 0
max_len = 3
batch_size = 50
for it in xrange(750000):
    
#     if(it % 100 == 0):
    sub_sequence_length = np.random.randint(max_len)+3
    sequence = (np.random.random(sub_sequence_length)*options).astype(int)+2
    sequence

    X = np.zeros((sub_sequence_length*2,options+2))
    Y = np.zeros_like(X)
    for i in xrange(len(sequence)):
        X[i][sequence[i]] = 1
        X[-i-1][0] = 1
        X[i][1] = 1
        Y[-i-1][sequence[i]] = 1
            

    layers = list()
    stack = NeuralStack(stack_width=stack_width,o_prime_dim=o_prime_dim)    
    for i in xrange(len(X)):
        layer = {}

        layer['x'] = X[i]

        if(i == 0):
            layer['h_t-1'] = np.zeros(h_dim)
#             layer['h_t-1'][0] = 1
            layer['r_t-1'] = np.zeros(stack_width)
#             layer['r_t-1'][0] = 1
        else:
            layer['h_t-1'] = layers[i-1]['h_t']
            layer['r_t-1'] = layers[i-1]['r_t']

        layer['h_t'] = tanh(np.dot(layer['x'],W_xh) + np.dot(layer['h_t-1'],W_hh) + np.dot(layer['r_t-1'],W_rh) + b_h)
        layer['xo_t'] = sigmoid(np.dot(layer['h_t'],W_hox))
        layer['o_prime_t'] = tanh(np.dot(layer['h_t'],W_hop)+b_op)
        layer['o_prime_x_t'] = sigmoid(np.dot(layer['o_prime_t'],W_opx))
        layer['o_t'] = sigmoid(np.dot(layer['o_prime_t'],W_op_o) + b_o)

        if(i < len(X)-1):
            layer['d_t'] = sigmoid(np.dot(layer['o_prime_t'],W_op_d) + b_d)
            layer['u_t'] = sigmoid(np.dot(layer['o_prime_t'],W_op_u) + b_u)
            layer['v_t'] = tanh(np.dot(layer['o_prime_t'],W_op_v) + b_v)

            layer['r_t'] = stack.pushAndPopForward(layer['v_t'],layer['d_t'],layer['u_t'])

        layers.append(layer)

    for i in list(reversed(xrange(len(X)))):
        layer = layers[i]

        layer['o_t_error'] = (Y[i] - layer['o_t'])

        if(i>0):
            layer['xo_t_error'] = layers[i-1]['x'] - layer['xo_t']
            layer['xo_t_delta'] = layer['xo_t_error'] * sigmoid_out2deriv(layer['xo_t'])            
            
            layer['x_o_prime_x_t_error'] = (layers[i-1]['x'] - layer['o_prime_x_t'])
            layer['x_o_prime_x_t_delta'] = layer['x_o_prime_x_t_error'] * sigmoid_out2deriv(layer['o_prime_x_t'])
        else:
            layer['xo_t_delta'] = np.zeros_like(layer['x'])
            layer['x_o_prime_x_t_delta'] = np.zeros_like(layer['x'])
#         if(it > 2000):
        layer['xo_t_delta'] *= 1
        layer['x_o_prime_x_t_delta'] *= 1
        

        error += np.sum(np.abs(layer['o_t_error']))
        if(i > 0):
            reconstruct_error += np.sum(np.abs(layer['xo_t_error']))
            reconstruct_error_2 += np.sum(np.abs(layer['x_o_prime_x_t_error']))
        if(it % 100 == 99):
            if(i == len(X)-1):
    
                if(it % 1000 == 999):
                    print "MaxLen:"+str(max_len)+ " Iter:" + str(it) + " Error:" + str(error)+ " RecError:" + str(reconstruct_error) + " RecError2:"+ str(reconstruct_error_2) + " True:" + str(sequence) + " Pred:" + str(map(lambda x:np.argmax(x['o_t']),layers[sub_sequence_length:]))
                    if(it % 10000 == 9999):
                        print "U:" + str(np.array(stack.u).T[0])
                        print "D:" + str(np.array(stack.d).T[0])
#                     print "o_t:"
#                     for l in layers[sub_sequence_length:]:
#                         print l['o_t'] 
#                     print "V_t:"
#                     for row in stack.V[-1]:
#                         print row
                if(error < max_len+4 and it > 10000):
                    max_len += 1
                    it = 0  # no effect on the loop counter: the for loop reassigns `it` on the next iteration
                error = 0
                reconstruct_error = 0
                reconstruct_error_2 = 0

        layer['o_t_delta'] = layer['o_t_error'] * sigmoid_out2deriv(layer['o_t'])

        layer['o_prime_t_error'] = np.dot(layer['o_t_delta'],W_op_o.T)
        layer['o_prime_t_error'] += np.dot(layer['x_o_prime_x_t_delta'],W_opx.T)
        if(i < len(X)-1):
            layer['r_t_error'] = layers[i+1]['r_t-1_error']
            stack.backprop_single(layer['r_t_error'])

            layer['v_t_error'] = stack.V_delta[i][i]
            layer['v_t_delta'] = layer['v_t_error'] * tanh_out2deriv(layer['v_t'])
            layer['o_prime_t_error'] += np.dot(layer['v_t_delta'],W_op_v.T)

            layer['u_t_error'] = stack.u_error[i]
            layer['u_t_delta'] = layer['u_t_error'] * sigmoid_out2deriv(layer['u_t'])
            layer['o_prime_t_error'] += np.dot(layer['u_t_delta'],W_op_u.T)

            layer['d_t_error'] = stack.d_error[i]
            layer['d_t_delta'] = layer['d_t_error'] * sigmoid_out2deriv(layer['d_t'])
            layer['o_prime_t_error'] += np.dot(layer['d_t_delta'],W_op_d.T)


        layer['o_prime_t_delta'] = layer['o_prime_t_error'] * tanh_out2deriv(layer['o_prime_t'])
        layer['h_t_error'] = np.dot(layer['o_prime_t_delta'],W_hop.T)
        layer['h_t_error'] += np.dot(layer['xo_t_delta'],W_hox.T)
        if(i < len(X)-1):
            layer['h_t_error'] += layers[i+1]['h_t-1_error']

        layer['h_t_delta'] = layer['h_t_error'] * tanh_out2deriv(layer['h_t'])
        layer['h_t-1_error'] = np.dot(layer['h_t_delta'],W_hh.T)
        layer['r_t-1_error'] = np.dot(layer['h_t_delta'],W_rh.T)

    for i in xrange(len(X)):
        layer = layers[i]
        max_alpha = 0.05 * batch_size
        alpha = max_alpha / sub_sequence_length

        W_xh_update += alpha * np.outer(layer['x'],layer['h_t_delta'])
        W_hh_update += alpha * np.outer(layer['h_t-1'],layer['h_t_delta'])
        W_rh_update += alpha * np.outer(layer['r_t-1'],layer['h_t_delta'])
        W_hox_update += alpha * np.outer(layer['h_t'],layer['xo_t_delta'])
        
        b_h_update += alpha * layer['h_t_delta']

        W_hop_update += alpha * np.outer(layer['h_t'],layer['o_prime_t_delta'])
        b_op_update += alpha * layer['o_prime_t_delta']
        
        W_opx_update += alpha * np.outer(layer['o_prime_t'],layer['x_o_prime_x_t_delta'])
        
        if(i < len(X)-1):
            W_op_d_update += alpha * np.outer(layer['o_prime_t'],layer['d_t_delta'])
            W_op_u_update += alpha * np.outer(layer['o_prime_t'],layer['u_t_delta'])
            W_op_v_update += alpha * np.outer(layer['o_prime_t'],layer['v_t_delta'])

            b_d_update += alpha * layer['d_t_delta']# * 10
            b_u_update += alpha * layer['u_t_delta']# * 10
            b_v_update += alpha * layer['v_t_delta']

        W_op_o_update += alpha * np.outer(layer['o_prime_t'],layer['o_t_delta'])
        b_o_update += alpha * layer['o_t_delta']


    if(it % batch_size == (batch_size-1)):
        W_xh += W_xh_update/batch_size
        W_xh_update *= 0
        
        W_hh += W_hh_update/batch_size
        W_hh_update *= 0
        
        W_rh += W_rh_update/batch_size
        W_rh_update *= 0
        
        b_h += b_h_update/batch_size
        b_h_update *= 0
        
        W_hop += W_hop_update/batch_size
        W_hop_update *= 0
        
        b_op += b_op_update/batch_size
        b_op_update *= 0
        
        W_op_d += W_op_d_update/batch_size
        W_op_d_update *= 0
        
        W_op_u += W_op_u_update/batch_size
        W_op_u_update *= 0
        
        W_op_v += W_op_v_update/batch_size
        W_op_v_update *= 0
        
        W_opx += W_opx_update/batch_size
        W_opx_update *= 0
        
        W_hox += W_hox_update/batch_size
        W_hox_update *= 0
        
        b_d += b_d_update/batch_size
        b_d_update *= 0
        
        b_u += b_u_update/batch_size
        b_u_update *= 0
        
        b_v += b_v_update/batch_size
        b_v_update *= 0
        
        W_op_o += W_op_o_update/batch_size
        W_op_o_update *= 0
        
        b_o += b_o_update/batch_size
        b_o_update *= 0

Training Time Output

....
....
....
....
MaxLen:3 Iter:745999 Error:7.56448795544 RecError:7.10891969494 RecError2:5.73615942287 True:[3 3 2 2 3] Pred:[3, 2, 2, 3, 3]
MaxLen:3 Iter:746999 Error:7.40633215737 RecError:6.69030096695 RecError2:6.19218015399 True:[3 2 3 2] Pred:[2, 3, 2, 3]
MaxLen:3 Iter:747999 Error:7.6670587332 RecError:7.00484905169 RecError2:5.98610855847 True:[2 2 2 2 2] Pred:[2, 2, 2, 2, 2]
MaxLen:3 Iter:748999 Error:7.58710695632 RecError:7.25143585612 RecError2:6.02631960485 True:[2 3 2] Pred:[2, 3, 2]
MaxLen:3 Iter:749999 Error:7.36136812467 RecError:7.05922903111 RecError2:5.87101840726 True:[3 3 2 3 2] Pred:[2, 3, 2, 3, 3]
U:[  1.42111936e-03   4.24234116e-05   4.38773521e-05   1.90306228e-03
   4.46779468e-05   1.61383041e-03   9.99610386e-01   9.76241526e-01
   9.99635875e-01]
D:[  9.96007159e-01   9.98850792e-01   9.98678243e-01   9.93183510e-01
   9.98787118e-01   2.95677776e-02   4.76367458e-04   4.53054877e-04
   5.11268758e-04]

Known Deviations / Ambiguities From the Paper (and Reasons)

1: The Controller is an RNN instead of an LSTM. I haven't finished the blogpost on LSTMs yet, and I wanted to only use previous blogposts as pre-requisite information.

2: Instead of padding using a single buffer token to signify when to repeat the sequence back, I turned the single buffer on during all of encoding and off for all of decoding. This is related to not having an LSTM to save the binary state. RNNs lose this kind of information and I wanted the network to converge quickly when training on the CPUs of this blog's readership.

3: I didn't see specifics on which nonlinearities were used in the RNN or how all the various weights were initialized. I chose to use best practices.

4: I trained this with a minibatch size of 50 instead of 10.

5: The hidden layers are considerably smaller. This also falls in the category of "getting it to converge faster for readers". However, small hidden layers also force the network to use the stack, which seems like a good reason to use them.

6: Not sure how many epochs this was trained on originally.

7: And of course... this was written in python using just a matrix library as opposed to Torch's deep learning framework. There are likely small things done as a best practice implicit in Torch's framework that might not be represented here.

8: I haven't attempted Queues or DeQueues yet... but in theory it's just a matter of swapping out the Neural Stack... that'd be a great project for a reader if you want to take this to the next level!

9: My timeframe for writing this blogpost was quite short. The post itself was written in < 24 hours. I'd like to do further experimentation with LSTMs and more benchmarking relative to the posted results in the paper. This, however, is primarily a teaching tool.

10: I haven't actually checked the backpropagation against the formulas in Appendix A of the paper. Again.. time constraint and I thought it would be more fun to try to figure them out independently.

11: I wasn't sure if o_prime_t was really generated as a PART of the recurrent hidden layer or if it was supposed to be one layer deeper (with a matrix between the recurrent hidden layer and o_prime). I assumed the latter but the former could be possible. If you happen to be an author on the paper and you're reading this far, I'd love to know.

If you have questions or comments, tweet @iamtrask and I'll be happy to help.

Anyone Can Learn To Code an LSTM-RNN in Python (Part 1: RNN)

Sun, 15 Nov 2015 12:00:00 +0000

Summary: I learn best with toy code that I can play with. This tutorial teaches Recurrent Neural Networks via a very simple toy example, a short python implementation. Chinese Translation Korean Translation

I'll tweet out (Part 2: LSTM) when it's complete at @iamtrask. Feel free to follow if you'd be interested in reading it and thanks for all the feedback!

Just Give Me The Code:

import copy, numpy as np
np.random.seed(0)

# compute sigmoid nonlinearity
def sigmoid(x):
    output = 1/(1+np.exp(-x))
    return output

# convert output of sigmoid function to its derivative
def sigmoid_output_to_derivative(output):
    return output*(1-output)


# training dataset generation
int2binary = {}
binary_dim = 8

largest_number = pow(2,binary_dim)
binary = np.unpackbits(
    np.array([range(largest_number)],dtype=np.uint8).T,axis=1)
for i in range(largest_number):
    int2binary[i] = binary[i]


# input variables
alpha = 0.1
input_dim = 2
hidden_dim = 16
output_dim = 1


# initialize neural network weights
synapse_0 = 2*np.random.random((input_dim,hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,output_dim)) - 1
synapse_h = 2*np.random.random((hidden_dim,hidden_dim)) - 1

synapse_0_update = np.zeros_like(synapse_0)
synapse_1_update = np.zeros_like(synapse_1)
synapse_h_update = np.zeros_like(synapse_h)

# training logic
for j in range(10000):
    
    # generate a simple addition problem (a + b = c)
    a_int = np.random.randint(largest_number//2) # int version
    a = int2binary[a_int] # binary encoding

    b_int = np.random.randint(largest_number//2) # int version
    b = int2binary[b_int] # binary encoding

    # true answer
    c_int = a_int + b_int
    c = int2binary[c_int]
    
    # where we'll store our best guess (binary encoded)
    d = np.zeros_like(c)

    overallError = 0
    
    layer_2_deltas = list()
    layer_1_values = list()
    layer_1_values.append(np.zeros(hidden_dim))
    
    # moving along the positions in the binary encoding
    for position in range(binary_dim):
        
        # generate input and output
        X = np.array([[a[binary_dim - position - 1],b[binary_dim - position - 1]]])
        y = np.array([[c[binary_dim - position - 1]]]).T

        # hidden layer (input ~+ prev_hidden)
        layer_1 = sigmoid(np.dot(X,synapse_0) + np.dot(layer_1_values[-1],synapse_h))

        # output layer (new binary representation)
        layer_2 = sigmoid(np.dot(layer_1,synapse_1))

        # did we miss?... if so, by how much?
        layer_2_error = y - layer_2
        layer_2_deltas.append((layer_2_error)*sigmoid_output_to_derivative(layer_2))
        overallError += np.abs(layer_2_error[0])
    
        # decode estimate so we can print it out
        d[binary_dim - position - 1] = np.round(layer_2[0][0])
        
        # store hidden layer so we can use it in the next timestep
        layer_1_values.append(copy.deepcopy(layer_1))
    
    future_layer_1_delta = np.zeros(hidden_dim)
    
    for position in range(binary_dim):
        
        X = np.array([[a[position],b[position]]])
        layer_1 = layer_1_values[-position-1]
        prev_layer_1 = layer_1_values[-position-2]
        
        # error at output layer
        layer_2_delta = layer_2_deltas[-position-1]
        # error at hidden layer
        layer_1_delta = (future_layer_1_delta.dot(synapse_h.T) + layer_2_delta.dot(synapse_1.T)) * sigmoid_output_to_derivative(layer_1)

        # let's update all our weights so we can try again
        synapse_1_update += np.atleast_2d(layer_1).T.dot(layer_2_delta)
        synapse_h_update += np.atleast_2d(prev_layer_1).T.dot(layer_1_delta)
        synapse_0_update += X.T.dot(layer_1_delta)
        
        future_layer_1_delta = layer_1_delta
    

    synapse_0 += synapse_0_update * alpha
    synapse_1 += synapse_1_update * alpha
    synapse_h += synapse_h_update * alpha    

    synapse_0_update *= 0
    synapse_1_update *= 0
    synapse_h_update *= 0
    
    # print out progress
    if(j % 1000 == 0):
        print("Error:" + str(overallError))
        print("Pred:" + str(d))
        print("True:" + str(c))
        out = 0
        for index,x in enumerate(reversed(d)):
            out += x*pow(2,index)
        print(str(a_int) + " + " + str(b_int) + " = " + str(out))
        print("------------")

Runtime Output:

Error:[ 3.45638663]
Pred:[0 0 0 0 0 0 0 1]
True:[0 1 0 0 0 1 0 1]
9 + 60 = 1
------------
Error:[ 3.63389116]
Pred:[1 1 1 1 1 1 1 1]
True:[0 0 1 1 1 1 1 1]
28 + 35 = 255
------------
Error:[ 3.91366595]
Pred:[0 1 0 0 1 0 0 0]
True:[1 0 1 0 0 0 0 0]
116 + 44 = 72
------------
Error:[ 3.72191702]
Pred:[1 1 0 1 1 1 1 1]
True:[0 1 0 0 1 1 0 1]
4 + 73 = 223
------------
Error:[ 3.5852713]
Pred:[0 0 0 0 1 0 0 0]
True:[0 1 0 1 0 0 1 0]
71 + 11 = 8
------------
Error:[ 2.53352328]
Pred:[1 0 1 0 0 0 1 0]
True:[1 1 0 0 0 0 1 0]
81 + 113 = 162
------------
Error:[ 0.57691441]
Pred:[0 1 0 1 0 0 0 1]
True:[0 1 0 1 0 0 0 1]
81 + 0 = 81
------------
Error:[ 1.42589952]
Pred:[1 0 0 0 0 0 0 1]
True:[1 0 0 0 0 0 0 1]
4 + 125 = 129
------------
Error:[ 0.47477457]
Pred:[0 0 1 1 1 0 0 0]
True:[0 0 1 1 1 0 0 0]
39 + 17 = 56
------------
Error:[ 0.21595037]
Pred:[0 0 0 0 1 1 1 0]
True:[0 0 0 0 1 1 1 0]
11 + 3 = 14
------------

Part 1: What is Neural Memory?

List the alphabet forward.... you can do it, yes?

List the alphabet backward.... hmmm... perhaps a bit tougher.

Try with the lyrics of a song you know?.... Why is it easier to recall forward than it is to recall backward? Can you jump into the middle of the second verse?... hmm... also difficult. Why?

There's a very logical reason for this....you haven't learned the letters of the alphabet or the lyrics of a song like a computer storing them as a set on a hard drive. You learned them as a sequence. You are really good at indexing from one letter to the next. It's a kind of conditional memory... you only have it when you very recently had the previous memory. It's also a lot like a linked list if you're familiar with that.

However, it's not that you don't have the song in your memory except when you're singing it. Instead, when you try to jump straight to the middle of the song, you simply have a hard time finding that representation in your brain (perhaps that set of neurons). It starts searching all over looking for the middle of the song, but it hasn't tried to look for it this way before, so it doesn't have a map to the location of the middle of the second verse. It's a lot like living in a neighborhood with lots of coves/cul-de-sacs. It's much easier to picture how to get to someone's house by following all the windy roads because you've done it many times, but knowing exactly where to cut straight across someone's backyard is really difficult. Your brain instead uses the "directions" that it knows... through the neurons at the beginning of a song. (for more on brain stuff, click here)

Much like a linked list, storing memory like this is very efficient. We will find that similar properties/advantages exist in giving our neural networks this type of memory as well. Some processes/problems/representations/searches are far more efficient if modeled as a sequence with a short term / pseudo conditional memory.

Memory matters when your data is a sequence of some kind. (It means you have something to remember!) Imagine having a video of a bouncing ball. (here... i'll help this time)

Each data point is a frame of your video. If you wanted to train a neural network to predict where the ball would be in the next frame, it would be really helpful to know where the ball was in the last frame! Sequential data like this is why we build recurrent neural networks. So, how does a neural network remember what it saw in previous time steps?

Neural networks have hidden layers. Normally, the state of your hidden layer is based ONLY on your input data. So, normally a neural network's information flow would look like this:

input -> hidden -> output

This is straightforward. Certain types of input create certain types of hidden layers. Certain types of hidden layers create certain types of output layers. It's kind of a closed system. Memory changes this. Memory means that the hidden layer is a combination of your input data at the current timestep and the hidden layer of the previous timestep.

(input + prev_hidden) -> hidden -> output

Why the hidden layer? Well, we could technically do this.

(input + prev_input) -> hidden -> output

However, we'd be missing out. I encourage you to sit and consider the difference between these two information flows. For a little helpful hint, consider how this plays out. Here, we have 4 timesteps of a recurrent neural network pulling information from the previous hidden layer.

(input + empty_hidden) -> hidden -> output
(input + prev_hidden) -> hidden -> output
(input + prev_hidden) -> hidden -> output
(input + prev_hidden) -> hidden -> output

And here, we have 4 timesteps of a recurrent neural network pulling information from the previous input layer

(input + empty_input) -> hidden -> output
(input + prev_input) -> hidden -> output
(input + prev_input) -> hidden -> output
(input + prev_input) -> hidden -> output

Maybe, if I colored things a bit, it would become more clear. Again, 4 timesteps with hidden layer recurrence:

(input + empty_hidden) -> hidden -> output
(input + prev_hidden) -> hidden -> output
(input + prev_hidden) -> hidden -> output
(input + prev_hidden) -> hidden -> output

.... and 4 timesteps with input layer recurrence....

(input + empty_input) -> hidden -> output
(input + prev_input) -> hidden -> output
(input + prev_input) -> hidden -> output
(input + prev_input) -> hidden -> output

Focus on the last hidden layer (4th line). In the hidden layer recurrence, we see a presence of every input seen so far. In the input layer recurrence, it's exclusively defined by the current and previous inputs. This is why we model hidden recurrence. Hidden recurrence learns what to remember whereas input recurrence is hard wired to just remember the immediately previous datapoint.
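One way to see this concretely is to change only the first input of a four-step sequence and check which final hidden layer notices. This is a minimal sketch with random, untrained weights. The names W_in, W_rec, W_prev and the two helper functions are made up for this illustration and are not part of the toy code in this post.

```python
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

input_dim, hidden_dim = 2, 4
W_in = rng.uniform(-1, 1, (input_dim, hidden_dim))    # input -> hidden
W_rec = rng.uniform(-1, 1, (hidden_dim, hidden_dim))  # previous hidden -> hidden
W_prev = rng.uniform(-1, 1, (input_dim, hidden_dim))  # previous input -> hidden

def last_hidden_with_hidden_recurrence(seq):
    hidden = np.zeros(hidden_dim)          # the "empty_hidden" of timestep 1
    for x in seq:
        hidden = sigmoid(x.dot(W_in) + hidden.dot(W_rec))
    return hidden

def last_hidden_with_input_recurrence(seq):
    prev_x = np.zeros(input_dim)           # the "empty_input" of timestep 1
    for x in seq:
        hidden = sigmoid(x.dot(W_in) + prev_x.dot(W_prev))
        prev_x = x
    return hidden

seq_a = np.array([[1., 0.], [0., 1.], [1., 1.], [0., 0.]])
seq_b = seq_a.copy()
seq_b[0] = [0., 1.]                        # change ONLY the first timestep

hidden_changed = not np.array_equal(
    last_hidden_with_hidden_recurrence(seq_a),
    last_hidden_with_hidden_recurrence(seq_b))
input_changed = not np.array_equal(
    last_hidden_with_input_recurrence(seq_a),
    last_hidden_with_input_recurrence(seq_b))

print("hidden recurrence noticed the change: %s" % hidden_changed)
print("input recurrence noticed the change: %s" % input_changed)
```

With input recurrence, the 4th hidden layer is computed from inputs 3 and 4 only, so it cannot differ between the two sequences. With hidden recurrence, a trace of input 1 is still in there (faint, but present).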

Now compare and contrast these two approaches with the backwards alphabet and middle-of-song exercises. The hidden layer is constantly changing as it gets more inputs. Furthermore, the only way that we could reach these hidden states is with the correct sequence of inputs. Now the money statement, the output is deterministic given the hidden layer, and the hidden layer is only reachable with the right sequence of inputs. Sound familiar?

What's the practical difference? Let's say we were trying to predict the next word in a song given the previous. The "input layer recurrence" would break down if the song accidentally had the same sequence of two words in multiple places. Think about it, if the song had the statements "I love you", and "I love carrots", and the network was trying to predict the next word, how would it know what follows "I love"? It could be carrots. It could be you. The network REALLY needs to know more about what part of the song it's in. However, the "hidden layer recurrence" doesn't break down in this way. It subtly remembers everything it saw (with memories becoming more subtle as they fade into the past). To see this in action, check out this.
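The "I love ..." ambiguity is easy to reproduce without any neural network at all. In the sketch below, the made-up lyric and the two lookup tables are purely illustrative. Keeping the full history as a key is only a stand-in for what a hidden layer could carry in the best case (a real hidden layer is lossy).

```python
from collections import defaultdict

lyrics = "i love you and i love carrots".split()

# "input layer recurrence": the context is only (previous word, current word)
two_word_followers = defaultdict(set)
for i in range(2, len(lyrics)):
    two_word_followers[(lyrics[i - 2], lyrics[i - 1])].add(lyrics[i])

# idealized "hidden layer recurrence": the context is everything seen so far
full_history_followers = defaultdict(set)
for i in range(2, len(lyrics)):
    full_history_followers[tuple(lyrics[:i])].add(lyrics[i])

after_i_love = sorted(two_word_followers[("i", "love")])
most_options = max(len(words) for words in full_history_followers.values())

print("after 'i love' (two-word context): %s" % after_i_love)
print("most candidates for any full-history context: %d" % most_options)
```

The two-word context leaves two candidates after "i love", while every full-history context in this tiny lyric has exactly one.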

stop and make sure this feels comfortable in your mind

Part 2: RNN - Neural Network Memory

Now that we have the intuition, let's dive down a layer (ba dum bump...). As described in the backpropagation post, our input layer to the neural network is determined by our input dataset. Each row of input data is used to generate the hidden layer (via forward propagation). Each hidden layer is then used to populate the output layer (assuming only 1 hidden layer). As we just saw, memory means that the hidden layer is a combination of the input data and the previous hidden layer. How is this done? Well, much like every other propagation in neural networks, it's done with a matrix. This matrix defines the relationship between the previous hidden layer and the current one.

Big thing to take from this picture, there are only three weight matrices. Two of them should be very familiar (same names too). SYNAPSE_0 propagates the input data to the hidden layer. SYNAPSE_1 propagates the hidden layer to the output data. The new matrix (SYNAPSE_h....the recurrent one), propagates from the hidden layer (layer_1) to the hidden layer at the next timestep (still layer_1).

stop and make sure this feels comfortable in your mind

The gif above reflects the magic of recurrent networks, and several very, very important properties. It depicts 4 timesteps. The first is exclusively influenced by the input data. The second one is a mixture of the first and second inputs. This continues on. You should recognize that, in some way, network 4 is "full". Presumably, timestep 5 would have to choose which memories to keep and which ones to overwrite. This is very real. It's the notion of memory "capacity". As you might expect, bigger layers can hold more memories for a longer period of time. Also, this is when the network learns to forget irrelevant memories and remember important memories. What significant thing do you notice in timestep 3? Why is there more green in the hidden layer than the other colors?

Also notice that the hidden layer is the barrier between the input and the output. In reality, the output is no longer a pure function of the input. The input is just changing what's in the memory, and the output is exclusively based on the memory. Another interesting takeaway. If there was no input at timesteps 2,3,and 4, the hidden layer would still change from timestep to timestep.
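Here is a small sketch of that last takeaway. I'm assuming "no input" means an all-zero input vector, and the weights are random and untrained, so the numbers themselves mean nothing. The point is only that the recurrent matrix keeps moving the hidden layer.

```python
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

input_dim, hidden_dim = 2, 4
synapse_0 = rng.uniform(-1, 1, (input_dim, hidden_dim))
synapse_h = rng.uniform(-1, 1, (hidden_dim, hidden_dim))

# input only at timestep 1, then zeros (assumed meaning of "no input")
inputs = np.array([[1., 1.], [0., 0.], [0., 0.], [0., 0.]])

hidden = np.zeros(hidden_dim)
history = []
for x in inputs:
    hidden = sigmoid(x.dot(synapse_0) + hidden.dot(synapse_h))
    history.append(hidden)

for t in range(1, len(history)):
    moved = np.abs(history[t] - history[t - 1]).max()
    print("timestep %d -> %d: largest change in hidden layer = %.4f" % (t, t + 1, moved))
```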

i know i've been stopping... but really make sure you got that last bit

Part 3: Backpropagation Through Time:

So, how do recurrent neural networks learn? Check out this graphic. Black is the prediction, errors are bright yellow, derivatives are mustard colored.

They learn by fully propagating forward from 1 to 4 (through an entire sequence of arbitrary length), and then backpropagating all the derivatives from 4 back to 1. You can also pretend that it's just a funny shaped normal neural network, except that we're re-using the same weights (synapses 0,1,and h) in their respective places. Other than that, it's normal backpropagation.
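If you'd like to convince yourself that "backpropagate from 4 back to 1 while re-using the same weights" really produces the right derivative, a numerical gradient check is a handy tool. The following is a rough sketch, not the toy code from this post: the tiny sizes, the squared-error loss, and the names forward, bptt_grad_Wh and numeric_grad_Wh are all assumptions made for the illustration. It also uses the (prediction - target) sign convention, so these gradients point uphill.

```python
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

T, input_dim, hidden_dim = 4, 2, 3
W0 = rng.uniform(-1, 1, (input_dim, hidden_dim))   # plays the role of synapse_0
Wh = rng.uniform(-1, 1, (hidden_dim, hidden_dim))  # plays the role of synapse_h
W1 = rng.uniform(-1, 1, (hidden_dim, 1))           # plays the role of synapse_1
X = rng.randint(0, 2, (T, input_dim)).astype(float)
Y = rng.randint(0, 2, (T, 1)).astype(float)

def forward(Wh):
    # propagate forward through all T timesteps with the SAME weights
    hiddens, outputs = [np.zeros(hidden_dim)], []
    for t in range(T):
        hiddens.append(sigmoid(X[t].dot(W0) + hiddens[-1].dot(Wh)))
        outputs.append(sigmoid(hiddens[-1].dot(W1)))
    loss = 0.5 * sum(((o - y) ** 2).sum() for o, y in zip(outputs, Y))
    return hiddens, outputs, loss

def bptt_grad_Wh(Wh):
    # walk backward from the last timestep to the first, summing the gradient
    hiddens, outputs, _ = forward(Wh)
    grad = np.zeros_like(Wh)
    future_delta = np.zeros(hidden_dim)
    for t in reversed(range(T)):
        h, h_prev = hiddens[t + 1], hiddens[t]
        out_delta = (outputs[t] - Y[t]) * outputs[t] * (1 - outputs[t])
        delta = (future_delta.dot(Wh.T) + out_delta.dot(W1.T)) * h * (1 - h)
        grad += np.outer(h_prev, delta)
        future_delta = delta
    return grad

def numeric_grad_Wh(Wh, eps=1e-6):
    # central finite differences, one weight at a time
    grad = np.zeros_like(Wh)
    for i in range(Wh.shape[0]):
        for j in range(Wh.shape[1]):
            up, down = Wh.copy(), Wh.copy()
            up[i, j] += eps
            down[i, j] -= eps
            grad[i, j] = (forward(up)[2] - forward(down)[2]) / (2 * eps)
    return grad

max_gap = np.abs(bptt_grad_Wh(Wh) - numeric_grad_Wh(Wh)).max()
print("largest gap between BPTT and numerical gradient: %.2e" % max_gap)
```

The two gradients should agree to many decimal places. If they don't, something in the backward pass is wrong.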

Part 4: Our Toy Code

We're going to be using a recurrent neural network to model binary addition. Do you see the sequence below? What do the colored ones in squares at the top signify?

source: angelfire.com

The colorful 1s in boxes at the top signify the "carry bit". They "carry the one" when the sum overflows at each place. This is the tiny bit of memory that we're going to teach our neural network how to model. It's going to "carry the one" when the sum requires it. (click here to learn about when this happens)

So, binary addition moves from right to left, where we try to predict the number beneath the line given the numbers above the line. We want the neural network to move along the binary sequences and remember when it has carried the 1 and when it hasn't, so that it can make the correct prediction. Don't get too caught up in the problem. The network actually doesn't care too much. Just recognize that we're going to have two inputs at each time step, (either a one or a zero from each number being added). These two inputs will be propagated to the hidden layer, which will have to remember whether or not we carry. The prediction will take all of this information into account to predict the correct bit at the given position (time step).
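For reference, here is the hand-written version of what we're asking the network to discover. The function name add_binary is just something I picked for this sketch. It walks from right to left and keeps a single carry value as its only "memory". A carry out of the leftmost position is simply dropped.

```python
def add_binary(a_bits, b_bits):
    """Add two equal-length bit lists (most significant bit first)."""
    carry = 0
    result = [0] * len(a_bits)
    for position in reversed(range(len(a_bits))):   # right to left
        total = a_bits[position] + b_bits[position] + carry
        result[position] = total % 2    # the bit we write beneath the line
        carry = total // 2              # the "carry the one" memory
    return result

a = [0, 0, 0, 0, 1, 0, 0, 1]   # 9
b = [0, 0, 1, 1, 1, 1, 0, 0]   # 60
total = add_binary(a, b)
print(total)                   # bits of 9 + 60 = 69
```

The network never gets to see the carry variable. It has to invent its own version of it inside the hidden layer.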

At this point, I recommend opening this page in two windows so that you can follow along with the line numbers in the code example at the top. That's how I wrote it.

Lines 0-2: Importing our dependencies and seeding the random number generator. We will only use numpy and copy. Numpy is for matrix algebra. Copy is to copy things.

Lines 4-11: Our nonlinearity and derivative. For details, please read this Neural Network Tutorial

Line 15: We're going to create a lookup table that maps from an integer to its binary representation. The binary representations will be our input and output data for each math problem we try to get the network to solve. This lookup table will be very helpful in converting from integers to bit strings.

Line 16: This is where I set the maximum length of the binary numbers we'll be adding. If I've done everything right, you can adjust this to add potentially very large numbers.

Line 18: This computes the largest number that is possible to represent with the binary length we chose

Line 19: This is a lookup table that maps from an integer to its binary representation. We copy it into the int2binary. This is kind of unnecessary but I thought it made things more obvious looking.
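If the lookup table feels abstract, here it is in isolation. This is a sketch of the same idea with np.arange in place of range. The helper bits_to_int is a name I made up for going back the other way.

```python
import numpy as np

binary_dim = 8
largest_number = 2 ** binary_dim

# one row per integer, one column per bit (most significant bit first)
binary = np.unpackbits(
    np.arange(largest_number, dtype=np.uint8).reshape(-1, 1), axis=1)
int2binary = {i: binary[i] for i in range(largest_number)}

def bits_to_int(bits):
    # hypothetical helper: decode a bit array back into an integer
    return int("".join(str(b) for b in bits.tolist()), 2)

print(int2binary[5])
print(bits_to_int(int2binary[5]))
```

One caveat: np.unpackbits only works on uint8 arrays, so this particular way of building the table covers a binary_dim of up to 8. Longer bit strings would need a different construction.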

Line 26: This is our learning rate.

Line 27: We are adding two numbers together, so we'll be feeding in two bit strings, one character at a time each. Thus, we need to have two inputs to the network (one for each of the numbers being added).

Line 28: This is the size of the hidden layer that will be storing our carry bit. Notice that it is way larger than it theoretically needs to be. Play with this and see how it affects the speed of convergence. Do larger hidden dimensions make things train faster or slower? More iterations or fewer?

Line 29: Well, we're only predicting the sum, which is one number. Thus, we only need one output

Line 33: This is the matrix of weights that connects our input layer and our hidden layer. Thus, it has "input_dim" rows and "hidden_dim" columns. (2 x 16 unless you change it). If you forgot what it does, look for it in the pictures in Part 2 of this blogpost.

Line 34: This is the matrix of weights that connects the hidden layer to the output layer. Thus, it has "hidden_dim" rows and "output_dim" columns. (16 x 1 unless you change it). If you forgot what it does, look for it in the pictures in Part 2 of this blogpost.

Line 35: This is the matrix of weights that connects the hidden layer in the previous time-step to the hidden layer in the current timestep. It also connects the hidden layer in the current timestep to the hidden layer in the next timestep (we keep using it). Thus, it has the dimensionality of "hidden_dim" rows and "hidden_dim" columns. (16 x 16 unless you change it). If you forgot what it does, look for it in the pictures in Part 2 of this blogpost.

Line 37 - 39: These store the weight updates that we would like to make for each of the weight matrices. After we've accumulated several weight updates, we'll actually update the matrices. More on this later.

Line 42: We're iterating over 10,000 training examples

Line 45: We're going to generate a random addition problem. So, we're initializing an integer randomly between 0 and half of the largest value we can represent. If we allowed the network to represent more than this, then adding two numbers could theoretically overflow (be a bigger number than we have bits to represent). Thus, we only add numbers that are less than half of the largest number we can represent.
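The arithmetic behind that choice is quick to check. With 8 bits, np.random.randint(128) draws from 0 through 127, so the worst case is 127 + 127 (the variable names half and biggest_sum are just for this sketch):

```python
binary_dim = 8
largest_number = 2 ** binary_dim        # 256 representable values: 0..255
half = largest_number // 2              # operands are drawn from 0..127

biggest_sum = (half - 1) + (half - 1)   # 127 + 127
fits = biggest_sum < largest_number

print("biggest possible sum: %d, fits in %d bits: %s" % (biggest_sum, binary_dim, fits))
```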

Line 46: We lookup the binary form for "a_int" and store it in "a"

Line 48: Same thing as line 45, just getting another random number.

Line 49: Same thing as line 46, looking up the binary representation.

Line 52: We're computing what the correct answer should be for this addition

Line 53: Converting the true answer to its binary representation

Line 56: Initializing an empty binary array where we'll store the neural network's predictions (so we can see it at the end). You could get around doing this if you want...but i thought it made things more intuitive

Line 58: Resetting the error measure (which we use as a means to track convergence... see my tutorial on backpropagation and gradient descent to learn more about this)

Lines 60-61: These two lists will keep track of the layer 2 derivatives and layer 1 values at each time step.

Line 62: Time step zero has no previous hidden layer, so we initialize one that's off.

Line 65: This for loop iterates through the binary representation

Line 68: X is the same as "layer_0" in the pictures. X is a list of 2 numbers, one from a and one from b. It's indexed according to the "position" variable, but we index it in such a way that it goes from right to left. So, when position == 0, this is the farthest bit to the right in "a" and the farthest bit to the right in "b". When position equals 1, this shifts to the left one bit.

Line 69: Same indexing as line 68, but instead it's the value of the correct answer (either a 1 or a 0)

Line 72: This is the magic!!! Make sure you understand this line!!! To construct the hidden layer, we first do two things. First, we propagate from the input to the hidden layer (np.dot(X,synapse_0)). Then, we propagate from the previous hidden layer to the current hidden layer (np.dot(prev_layer_1, synapse_h)). Then WE SUM THESE TWO VECTORS!!!!... and pass through the sigmoid function.

So, how do we combine the information from the previous hidden layer and the input? After each has been propagated through its various matrices (read: interpretations), we sum the information.
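A nice way to think about that sum: it is exactly what you'd get by gluing the input and the previous hidden layer together into one long vector and pushing it through one taller matrix. The sketch below checks this numerically with random values. The names stacked_input and stacked_weights are only for the illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
input_dim, hidden_dim = 2, 16

synapse_0 = rng.uniform(-1, 1, (input_dim, hidden_dim))
synapse_h = rng.uniform(-1, 1, (hidden_dim, hidden_dim))
X = np.array([[1., 0.]])
prev_layer_1 = rng.uniform(0, 1, (1, hidden_dim))

# the way the toy code does it: two propagations, then a sum
summed = np.dot(X, synapse_0) + np.dot(prev_layer_1, synapse_h)

# the equivalent view: one concatenated input, one stacked matrix
stacked_input = np.hstack([X, prev_layer_1])          # shape (1, 18)
stacked_weights = np.vstack([synapse_0, synapse_h])   # shape (18, 16)
concatenated = np.dot(stacked_input, stacked_weights)

same = np.allclose(summed, concatenated)
print("sum of two propagations == one stacked propagation: %s" % same)
```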

Line 75: This should look very familiar. It's the same as previous tutorials. It propagates the hidden layer to the output to make a prediction

Line 78: Compute by how much the prediction missed

Line 79: We're going to store the derivative (mustard orange in the graphic above) in a list, holding the derivative at each timestep.

Line 80: Calculate the sum of the absolute errors so that we have a scalar error (to track propagation). We'll end up with a sum of the error at each binary position.

Line 83: Rounds the output (to a binary value, since it is between 0 and 1) and stores it in the designated slot of d.

Line 86: Copies the layer_1 value into the list of hidden layers so that, at the next time step, we can use it as the previous hidden layer.

Line 90: So, we've done all the forward propagating for all the time steps, and we've computed the derivatives at the output layers and stored them in a list. Now we need to backpropagate, starting with the last timestep, backpropagating to the first

Line 92: Indexing the input data like we did before

Line 93: Selecting the current hidden layer from the list.

Line 94: Selecting the previous hidden layer from the list

Line 97: Selecting the current output error from the list

Line 99: this computes the current hidden layer error given the error at the hidden layer from the future and the error at the current output layer.
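Written out as an equation, that line reads as follows. The notation is mine, not the code's: \ell_1^{(t)} and \ell_2^{(t)} stand for layer_1 and layer_2 at timestep t, W_h and W_1 for synapse_h and synapse_1, and \odot for elementwise multiplication.

```latex
\delta_2^{(t)} = \left(y^{(t)} - \ell_2^{(t)}\right) \odot \ell_2^{(t)} \odot \left(1 - \ell_2^{(t)}\right)

\delta_1^{(t)} = \left(\delta_1^{(t+1)} W_h^{\top} + \delta_2^{(t)} W_1^{\top}\right) \odot \ell_1^{(t)} \odot \left(1 - \ell_1^{(t)}\right)
```

The first term inside the parentheses is the error arriving from the future hidden layer (future_layer_1_delta in the code, all zeros at the last timestep), and the second is the error arriving from the current output.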

Line 102-104: Now that we have the derivatives backpropagated at this current time step, we can construct our weight updates (but not actually update the weights just yet). We don't actually update our weight matrices until after we've fully backpropagated everything. Why? Well, we use the weight matrices for the backpropagation. Thus, we don't want to go changing them yet until the actual backprop is done. See the backprop blog post for more details.

Line 109 - 115: Now that we've backpropped everything and created our weight updates, it's time to update our weights (and empty the update variables).

Line 118 - end: Just some nice logging to show progress

Part 5: Questions / Comments

If you have questions or comments, tweet @iamtrask and I'll be happy to help.

Hinton's Dropout in 3 Lines of Python

Tue, 28 Jul 2015 12:00:00 +0000

Summary: Dropout is a vital feature in almost every state-of-the-art neural network implementation. This tutorial teaches how to install Dropout into a neural network in only a few lines of Python code. Those who walk through this tutorial will finish with a working Dropout implementation and will be empowered with the intuitions to install it and tune it in any neural network they encounter.

Followup Post: I intend to write a followup post to this one adding popular features leveraged by state-of-the-art approaches. I'll tweet it out when it's complete @iamtrask. Feel free to follow if you'd be interested in reading more and thanks for all the feedback!

Just Give Me The Code:

import numpy as np
X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
y = np.array([[0,1,1,0]]).T
alpha,hidden_dim,dropout_percent,do_dropout = (0.5,4,0.2,True)
synapse_0 = 2*np.random.random((3,hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,1)) - 1
for j in range(60000):
    layer_1 = (1/(1+np.exp(-(np.dot(X,synapse_0)))))
    if(do_dropout):
        layer_1 *= np.random.binomial([np.ones((len(X),hidden_dim))],1-dropout_percent)[0] * (1.0/(1-dropout_percent))
    layer_2 = 1/(1+np.exp(-(np.dot(layer_1,synapse_1))))
    layer_2_delta = (layer_2 - y)*(layer_2*(1-layer_2))
    layer_1_delta = layer_2_delta.dot(synapse_1.T) * (layer_1 * (1-layer_1))
    synapse_1 -= (alpha * layer_1.T.dot(layer_2_delta))
    synapse_0 -= (alpha * X.T.dot(layer_1_delta))

Part 1: What is Dropout?

As discovered in the previous post, a neural network is a glorified search problem. Each node in the neural network is searching for correlation between the input data and the correct output data.

Consider the graphic above from the previous post. The line represents the error the network generates for every value of a particular weight. The low-points (READ: low error) in that line signify the weight "finding" points of correlation between the input and output data. The balls in the picture signify various weights. They are trying to find those low points.

Consider the color. The balls' initial positions are randomly generated (just like weights in a neural network). If two balls randomly start in the same colored zone, they will converge to the same point. This makes them redundant! They're wasting computation and memory! This is exactly what happens in neural networks.

Why Dropout: Dropout helps prevent weights from converging to identical positions. It does this by randomly turning nodes off when forward propagating. It then back-propagates with all the nodes turned on. Let’s take a closer look.

Part 2: How Do I Install and Tune Dropout?

The highlighted code above demonstrates how to install dropout. To perform dropout on a layer, you randomly set some of the layer's values to 0 during forward propagation. This is demonstrated on line 10.

Line 9: parameterizes using dropout at all. You see, you only want to use Dropout during training. Do not use it at runtime or on your testing dataset.

EDIT: Line 10: has a second portion to increase the size of the values being propagated forward. This happens in proportion to the number of values being turned off. A simple intuition is that if you're turning off half of your hidden layer, you want to double the values that ARE pushing forward so that the output compensates correctly. Many thanks to @karpathy for catching this one.
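To see why the rescaling matters, here is a small sketch that applies a dropout mask to a layer of all ones and compares the average activation before and after. The layer of ones and the name keep_mask are just for the illustration. The mask is drawn with the binomial sampler, as in the code at the top.

```python
import numpy as np

rng = np.random.RandomState(0)
dropout_percent = 0.2

layer_1 = np.ones((1000, 50))    # pretend activations, all equal to 1

# 1 = keep the node, 0 = turn it off
keep_mask = rng.binomial(1, 1 - dropout_percent, size=layer_1.shape)

# scale the survivors up so the expected total signal stays the same
dropped = layer_1 * keep_mask * (1.0 / (1 - dropout_percent))

fraction_off = (keep_mask == 0).mean()
print("fraction turned off: %.3f" % fraction_off)
print("mean activation before: %.3f, after: %.3f" % (layer_1.mean(), dropped.mean()))
```

Roughly 20% of the values go to zero, the rest are scaled up by 1/(1 - 0.2), and the mean stays close to where it started. Without the scaling, the next layer would see a systematically smaller signal during training than at test time.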

Tuning Best Practice

Line 4: parameterizes the dropout_percent. This affects the probability that any one node will be turned off. A good initial configuration for this for hidden layers is 50%. If applying dropout to an input layer, it's best to not exceed 25%.

Hinton advocates tuning dropout in conjunction with tuning the size of your hidden layer. Increase your hidden layer size(s) with dropout turned off until you perfectly fit your data. Then, using the same hidden layer size, train with dropout turned on. This should be a nearly optimal configuration. Turn off dropout as soon as you're done training and voila! You have a working neural network!
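"Turn off dropout as soon as you're done training" can be as simple as a flag. The sketch below wraps the hidden layer in a function with a training switch. The function name hidden_layer and its parameters are hypothetical, chosen for this example rather than taken from the code above.

```python
import numpy as np

def hidden_layer(X, synapse_0, dropout_percent=0.5, training=True, rng=np.random):
    layer_1 = 1 / (1 + np.exp(-np.dot(X, synapse_0)))
    if training:
        # dropout only while training: mask, then rescale the survivors
        keep_mask = rng.binomial(1, 1 - dropout_percent, size=layer_1.shape)
        layer_1 = layer_1 * keep_mask / (1 - dropout_percent)
    return layer_1

rng = np.random.RandomState(0)
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
synapse_0 = 2 * rng.random_sample((3, 4)) - 1

test_a = hidden_layer(X, synapse_0, training=False)
test_b = hidden_layer(X, synapse_0, training=False)
train_a = hidden_layer(X, synapse_0, training=True, rng=rng)

print("test-time output is repeatable: %s" % np.array_equal(test_a, test_b))
```

With training=False the layer is deterministic, which is what you want when evaluating on a test set. With training=True every value is either zeroed or doubled (for a dropout_percent of 0.5).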

Want to Work in Machine Learning?

One of the best things you can do to learn Machine Learning is to have a job where you're practicing Machine Learning professionally. I'd encourage you to check out the positions at Digital Reasoning in your job hunt. If you have questions about any of the positions or about life at Digital Reasoning, feel free to send me a message on my LinkedIn. I'm happy to hear about where you want to go in life, and help you evaluate whether Digital Reasoning could be a good fit.

If none of the positions above feel like a good fit, continue your search! Machine Learning expertise is one of the most valuable skills in the job market today, and there are many firms looking for practitioners.


A Neural Network in 13 lines of Python (Part 2 - Gradient Descent)

Mon, 27 Jul 2015 12:00:00 +0000

Summary: I learn best with toy code that I can play with. This tutorial teaches gradient descent via a very simple toy example, a short python implementation.

Followup Post: I intend to write a followup post to this one adding popular features leveraged by state-of-the-art approaches (likely Dropout, DropConnect, and Momentum). I'll tweet it out when it's complete @iamtrask. Feel free to follow if you'd be interested in reading more and thanks for all the feedback!

Just Give Me The Code:

import numpy as np
X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
y = np.array([[0,1,1,0]]).T
alpha,hidden_dim = (0.5,4)
synapse_0 = 2*np.random.random((3,hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,1)) - 1
for j in xrange(60000):
    layer_1 = 1/(1+np.exp(-(np.dot(X,synapse_0))))
    layer_2 = 1/(1+np.exp(-(np.dot(layer_1,synapse_1))))
    layer_2_delta = (layer_2 - y)*(layer_2*(1-layer_2))
    layer_1_delta = layer_2_delta.dot(synapse_1.T) * (layer_1 * (1-layer_1))
    synapse_1 -= (alpha * layer_1.T.dot(layer_2_delta))
    synapse_0 -= (alpha * X.T.dot(layer_1_delta))

Part 1: Optimization

In Part 1, I laid out the basis for backpropagation in a simple neural network. Backpropagation allowed us to measure how each weight in the network contributed to the overall error. This ultimately allowed us to change these weights using a different algorithm, Gradient Descent.

The takeaway here is that backpropagation doesn't optimize! It moves the error information from the end of the network to all the weights inside the network so that a different algorithm can optimize those weights to fit our data. We actually have a plethora of different nonlinear optimization methods that we could use with backpropagation:

A Few Optimization Methods:
• Annealing
• Stochastic Gradient Descent
• AW-SGD (new!)
• Momentum (SGD)
• Nesterov Momentum (SGD)
• AdaGrad
• AdaDelta
• ADAM
• BFGS
• LBFGS

Visualizing the Difference:
• ConvNet.js
• RobertsDionne

Many of these optimizations are good for different purposes, and in some cases several can be used together. In this tutorial, we will walk through Gradient Descent, which is arguably the simplest and most widely used neural network optimization algorithm. By learning about Gradient Descent, we will then be able to improve our toy neural network through parameterization and tuning, and ultimately make it a lot more powerful.

Part 2: Gradient Descent

Imagine that you had a red ball inside of a rounded bucket like in the picture below. Imagine further that the red ball is trying to find the bottom of the bucket. This is optimization. In our case, the ball is optimizing its position (from left to right) to find the lowest point in the bucket.

(pause here.... make sure you got that last sentence.... got it?)

So, to gamify this a bit: the ball has two options, left or right. It has one goal: get as low as possible. So, it needs to press the left and right buttons correctly to find the lowest spot.

So, what information does the ball use to adjust its position to find the lowest point? The only information it has is the slope of the side of the bucket at its current position, pictured below with the blue line. Notice that when the slope is negative (downward from left to right), the ball should move to the right. However, when the slope is positive, the ball should move to the left. As you can see, this is more than enough information to find the bottom of the bucket in a few iterations. This is a sub-field of optimization called gradient optimization. (Gradient is just a fancy word for slope or steepness).

Oversimplified Gradient Descent:

Calculate slope at current position
If slope is negative, move right
If slope is positive, move left
(Repeat until slope == 0)
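
Here is a minimal sketch of the oversimplified version in Python 3. The bucket shape f(x) = x**2, the fixed step of 1, and the function names are all assumptions picked for illustration. Only the sign of the slope is used, exactly as in the pseudocode.

```python
# Oversimplified gradient descent: only the SIGN of the slope is used,
# and the ball always moves by one fixed-size step.

def slope(x):
    # derivative of the hypothetical bucket f(x) = x**2
    return 2 * x

def oversimplified_descent(x, step=1, max_steps=100):
    path = [x]
    for _ in range(max_steps):
        s = slope(x)
        if s == 0:          # flat ground: we are at the bottom
            break
        if s < 0:
            x = x + step    # negative slope -> move right
        else:
            x = x - step    # positive slope -> move left
        path.append(x)
    return path

path = oversimplified_descent(5)
print(path)   # [5, 4, 3, 2, 1, 0]
```

If the start is not a whole number of steps from the bottom (say 5.5), this version hops back and forth over the bottom until max_steps runs out, which is one reason to care about how far the ball moves.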

The question is, however, how much should the ball move at each time step? Look at the bucket again. The steeper the slope, the farther the ball is from the bottom. That's helpful! Let's improve our algorithm to leverage this new information. Also, let's assume that the bucket is on an (x,y) plane. So, its location is x (along the bottom). Increasing the ball's "x" position moves it to the right. Decreasing the ball's "x" position moves it to the left.

Naive Gradient Descent:

Calculate "slope" at current "x" position
Change x by the negative of the slope. (x = x - slope)
(Repeat until slope == 0)

Make sure you can picture this process in your head before moving on. This is a considerable improvement to our algorithm. For very positive slopes, we move left by a lot. For only slightly positive slopes, we move left by only a little. As it gets closer and closer to the bottom, it takes smaller and smaller steps until the slope equals zero, at which point it stops. This stopping point is called convergence.
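
To make that concrete, here is a small Python 3 sketch of the naive version. The bucket f(x) = x**2 / 4 and the tolerance are assumptions of mine: floating point numbers rarely land on a slope of exactly 0, so the loop stops once the slope is "close enough".

```python
def slope(x):
    # derivative of the hypothetical bucket f(x) = x**2 / 4
    return x / 2

def naive_descent(x, tolerance=1e-6, max_steps=1000):
    steps = [x]
    # "repeat until slope == 0", with a tolerance instead of exact zero
    while abs(slope(x)) > tolerance and len(steps) <= max_steps:
        x = x - slope(x)     # x = x - slope
        steps.append(x)
    return steps

steps = naive_descent(8.0)
print(steps[:5])   # [8.0, 4.0, 2.0, 1.0, 0.5]
```

Each step is half the size of the one before it, which is the "smaller and smaller steps" behavior described above.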

Part 3: Sometimes It Breaks

Gradient Descent isn't perfect. Let's take a look at its issues and how people get around them. This will allow us to improve our network to overcome these issues.

Problem 1: When slopes are too big

How big is too big? Remember our step size is based on the steepness of the slope. Sometimes the slope is so steep that we overshoot by a lot. Overshooting by a little is ok, but sometimes we overshoot by so much that we're even farther away than we started! See below.

What makes this problem so destructive is that overshooting this far means we land at an EVEN STEEPER slope in the opposite direction. This causes us to overshoot again EVEN FARTHER. This vicious cycle of overshooting leading to more overshooting is called divergence.
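
A quick Python 3 sketch shows the cycle in numbers. The steep bucket f(x) = 1.5 * x**2 is a made-up example chosen so that every naive update lands twice as far from the bottom as before.

```python
def slope(x):
    # derivative of a hypothetical steep bucket f(x) = 1.5 * x**2
    return 3 * x

x = 1.0
positions = [x]
for _ in range(5):
    x = x - slope(x)     # naive update: x = x - slope
    positions.append(x)

print(positions)   # [1.0, -2.0, 4.0, -8.0, 16.0, -32.0]
```

The sign flips on every step (we keep jumping over the bottom) and the distance doubles each time.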

Solution 1: Make Slopes Smaller

Lol. This may seem too simple to be true, but it's used in pretty much every neural network. If our gradients are too big, we make them smaller! We do this by multiplying them (all of them) by a single number between 0 and 1 (such as 0.01). This fraction is typically a single float called alpha. When we do this, we don't overshoot and our network converges.

Improved Gradient Descent:

alpha = 0.1 (or some number between 0 and 1)

Calculate "slope" at current "x" position
x = x - (alpha*slope)
(Repeat until slope == 0)
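
As a sketch of the difference alpha makes (Python 3, with a hypothetical steep bucket f(x) = 1.5 * x**2 and a helper name I made up), the same update rule either blows up or settles down depending only on that one number:

```python
def slope(x):
    # derivative of the hypothetical steep bucket f(x) = 1.5 * x**2
    return 3 * x

def descend(x, alpha, steps):
    for _ in range(steps):
        x = x - (alpha * slope(x))
    return x

print(descend(1.0, alpha=1.0, steps=5))    # -32.0 (no scaling: diverges)
print(descend(1.0, alpha=0.1, steps=50))   # a tiny number close to 0
```

With alpha = 0.1 each update only moves the ball 30% of the way toward the bottom of this particular bucket, so it never overshoots.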

Problem 2: Local Minimums

Sometimes your bucket has a funny shape, and following the slope doesn't take you to the absolute lowest point. Consider the picture below.

This is by far the most difficult problem with gradient descent. There are a myriad of options to try to overcome this. Generally speaking, they all involve an element of random searching to try lots of different parts of the bucket.

Solution 2: Multiple Random Starting States

There are a myriad of ways in which randomness is used to overcome getting stuck in a local minimum. This raises the question: if we have to use randomness to find the global minimum, why are we still optimizing in the first place? Why not just try randomly? The answer lies in the graph below.

Imagine that we randomly placed 100 balls on this line and started optimizing all of them. If we did so, they would all end up in only 5 positions, mapped out by the five colored balls above. The colored regions represent the domain of each local minimum. For example, if a ball randomly falls within the blue domain, it will converge to the blue minimum. This means that to search the entire space, we only have to randomly find 5 spaces! This is far better than pure random searching, which has to randomly try EVERY space (which could easily be millions of places on this black line depending on the granularity).
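
Here is a rough Python 3 sketch of that experiment. The bumpy curve f(x) = sin(3x) + 0.1 * x**2, the range of starting points, alpha, and the seed are all assumptions for illustration; the curve in the picture is a different one.

```python
import math
import random

def slope(x):
    # derivative of a hypothetical bumpy curve f(x) = sin(3x) + 0.1 * x**2
    return 3 * math.cos(3 * x) + 0.2 * x

def descend(x, alpha=0.05, steps=2000):
    for _ in range(steps):
        x = x - alpha * slope(x)
    return x

random.seed(0)   # fixed seed so reruns use the same starting points
starts = [random.uniform(-4, 4) for _ in range(100)]

# round so that balls that settled in the same valley count as one place
minima = sorted({round(descend(x), 3) for x in starts})
print(len(starts), "balls ended up in", len(minima), "places:", minima)
```

A hundred random starting points collapse into just a handful of resting places, one per valley that the balls happened to land in.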

In Neural Networks: One way that neural networks accomplish this is by having very large hidden layers. You see, each hidden node in a layer starts out in a different random starting state. This allows each hidden node to converge to different patterns in the network. Parameterizing this size allows the neural network user to potentially try thousands (or tens of billions) of different local minima in a single neural network.

Sidenote 1: This is why neural networks are so powerful! They have the ability to search far more of the space than they actually compute! We can search the entire black line above with (in theory) only 5 balls and a handful of iterations. Searching that same space in a brute force fashion could easily take orders of magnitude more computation.

Sidenote 2: A keen eye might ask, "Well, why would we allow a lot of nodes to converge to the same spot? That's actually wasting computational power!" That's an excellent point. The current state-of-the-art approaches to avoiding hidden nodes coming up with the same answer (by searching the same space) are Dropout and DropConnect, which I intend to cover in a later post.

Problem 3: When Slopes are Too Small

Neural networks sometimes suffer from the slopes being too small. The answer is also obvious but I wanted to mention it here to expand on its symptoms. Consider the following graph.

Our little red ball up there is just stuck! If your alpha is too small, this can happen. The ball just drops right into an instant local minimum and ignores the big picture. It doesn't have the oomph to get out of the rut.

And perhaps the more obvious symptom of deltas that are too small is that the convergence will just take a very, very long time.

Solution 3: Increase the Alpha

As you might expect, the solution to both of these symptoms is to increase the alpha. We might even multiply our deltas by a weight higher than 1. This is very rare, but it does sometimes happen.
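
The slow-convergence symptom is easy to measure. Below is a minimal Python 3 sketch that counts how many updates it takes to get close to the bottom for a few alphas. The bucket f(x) = x**2, the starting point, and the tolerance are assumptions chosen for illustration.

```python
def slope(x):
    return 2 * x   # derivative of the hypothetical bucket f(x) = x**2

def steps_to_converge(alpha, x=5.0, tolerance=1e-3, max_steps=100000):
    for step in range(max_steps):
        if abs(slope(x)) < tolerance:
            return step              # close enough to flat: stop counting
        x = x - alpha * slope(x)
    return max_steps                 # gave up before converging

for alpha in [0.001, 0.01, 0.1]:
    print("alpha", alpha, "needed", steps_to_converge(alpha), "steps")
```

On this bucket, every tenfold increase in alpha cuts the number of steps by roughly a factor of ten.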

Part 4: SGD in Neural Networks

So at this point you might be wondering, how does this relate to neural networks and backpropagation? This is the hardest part, so get ready to hold on tight and take things slow. It's also quite important.

That big nasty curve? In a neural network, we're trying to minimize the error with respect to the weights. So, what that curve represents is the network's error relative to the position of a single weight. So, if we computed the network's error for every possible value of a single weight, it would generate the curve you see above. We would then pick the value of the single weight that has the lowest error (the lowest part of the curve). I say single weight because it's a two-dimensional plot. Thus, the x dimension is the value of the weight and the y dimension is the neural network's error when the weight is at that position.

Stop and make sure you got that last paragraph. It's key.

Let's take a look at what this process looks like in a simple 2 layer neural network.

2 Layer Neural Network:

import numpy as np

# compute sigmoid nonlinearity
def sigmoid(x):
    output = 1/(1+np.exp(-x))
    return output

# convert output of sigmoid function to its derivative
def sigmoid_output_to_derivative(output):
    return output*(1-output)
    
# input dataset
X = np.array([  [0,1],
                [0,1],
                [1,0],
                [1,0] ])
    
# output dataset            
y = np.array([[0,0,1,1]]).T

# seed random numbers to make calculation
# deterministic (just a good practice)
np.random.seed(1)

# initialize weights randomly with mean 0
synapse_0 = 2*np.random.random((2,1)) - 1

for iter in xrange(10000):

    # forward propagation
    layer_0 = X
    layer_1 = sigmoid(np.dot(layer_0,synapse_0))

    # how much did we miss?
    layer_1_error = layer_1 - y

    # multiply how much we missed by the 
    # slope of the sigmoid at the values in l1
    layer_1_delta = layer_1_error * sigmoid_output_to_derivative(layer_1)
    synapse_0_derivative = np.dot(layer_0.T,layer_1_delta)

    # update weights
    synapse_0 -= synapse_0_derivative

print "Output After Training:"
print layer_1

So, in this case, we have a single error at the output (single value), which is computed on line 35 of the listing above (layer_1_error = layer_1 - y). Since we have 2 weights, the output "error plane" is a 3 dimensional space. We can think of this as an (x,y,z) space, where vertical is the error, and x and y are the values of our two weights in synapse_0.

Let's try to plot what the error plane looks like for the network/dataset above. So, how do we compute the error for a given set of weights? Lines 31, 32, and 35 (the forward pass and the error computation) show us that. If we take that logic and plot the overall error (a single scalar representing the network error over the entire dataset) for every possible set of weights (from -10 to 10 for x and y), it looks something like this.

Don't be intimidated by this. It really is as simple as computing every possible set of weights, and the error that the network generates at each set. x is the first synapse_0 weight and y is the second synapse_0 weight. z is the overall error. As you can see, our output data is positively correlated with the first input data. Thus, the error is minimized when x (the first synapse_0 weight) is high. What about the second synapse_0 weight? How is it optimal?
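
If you want to poke at the surface yourself, here is a minimal Python 3 sketch of the brute-force computation using only the standard library. Two assumptions: the "overall error" is taken to be the mean absolute error (the same measure printed later in this post), and the grid only tries whole-number weights from -10 to 10.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# same toy dataset as the 2 layer network above
X = [[0, 1], [0, 1], [1, 0], [1, 0]]
y = [0, 0, 1, 1]

def network_error(w0, w1):
    # ASSUMPTION: overall error = mean absolute error over the dataset
    total = 0.0
    for (x0, x1), target in zip(X, y):
        prediction = sigmoid(x0 * w0 + x1 * w1)
        total += abs(prediction - target)
    return total / len(X)

# brute force: compute the error for every pair of weights on the grid
grid = range(-10, 11)
surface = {(w0, w1): network_error(w0, w1) for w0 in grid for w1 in grid}

best = min(surface, key=surface.get)
print("lowest error on the grid:", best, surface[best])
```

On this grid the lowest error sits at the corner where the first weight is as high as possible and the second is as low as possible, which also hints at the answer to the question about the second weight.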

How Our 2 Layer Neural Network Optimizes

So, given that lines 31, 32, and 35 end up computing the error, it can be natural to see that lines 39, 40, and 43 (layer_1_delta, synapse_0_derivative, and the weight update) optimize to reduce the error. This is where Gradient Descent is happening! Remember our pseudocode?

Naive Gradient Descent:

Lines 39 and 40: Calculate "slope" at current "x" position
Line 43: Change x by the negative of the slope. (x = x - slope)
Line 28: (Repeat until slope == 0)

It's exactly the same thing! The only thing that has changed is that we have 2 weights that we're optimizing instead of just 1. The logic, however, is identical.
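
One way to convince yourself that those lines really compute a slope is to compare them against a slope measured by brute force. The Python 3 sketch below does that with the standard library only. The big assumption here is the error function: the post never names it, so I use half the sum of squared misses, which is the error whose derivative matches the arithmetic in the code. The starting weights and helper names are made up.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

X = [[0, 1], [0, 1], [1, 0], [1, 0]]
y = [0, 0, 1, 1]

def predictions(weights):
    return [sigmoid(sum(xi * wi for xi, wi in zip(row, weights))) for row in X]

def squared_error(weights):
    # ASSUMPTION: error = 0.5 * sum of squared misses over the dataset
    return 0.5 * sum((p - t) ** 2 for p, t in zip(predictions(weights), y))

def backprop_slope(weights):
    # same arithmetic as layer_1_delta and synapse_0_derivative above
    deltas = [(p - t) * p * (1 - p) for p, t in zip(predictions(weights), y)]
    return [sum(row[i] * d for row, d in zip(X, deltas))
            for i in range(len(weights))]

def numeric_slope(weights, eps=1e-6):
    # nudge one weight at a time and measure how the error changes
    slopes = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += eps
        down[i] -= eps
        slopes.append((squared_error(up) - squared_error(down)) / (2 * eps))
    return slopes

weights = [0.3, -0.2]   # arbitrary starting weights
print("backprop slope:", backprop_slope(weights))
print("measured slope:", numeric_slope(weights))
```

The two printouts agree to several decimal places, one slope per weight, which is all that "2 weights instead of 1" changes.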

Part 5: Improving our Neural Network

Remember that Gradient Descent had some weaknesses. Now that we have seen how our neural network leverages Gradient Descent, we can improve our network to overcome these weaknesses in the same way that we improved Gradient Descent in Part 3 (the 3 problems and solutions).

Improvement 1: Adding and Tuning the Alpha Parameter

What is Alpha? As described above, the alpha parameter reduces the size of each iteration's update in the simplest way possible. At the very last minute, right before we update the weights, we multiply the weight update by alpha (usually between 0 and 1, thus reducing the size of the weight update). This tiny change to the code has an absolutely massive impact on its ability to train.

We're going to jump back to our 3 layer neural network from the first post and add in an alpha parameter at the appropriate place. Then, we're going to run a series of experiments to align all the intuition we developed around alpha with its behavior in live code.

Improved Gradient Descent:

Calculate "slope" at current "x" position
Lines 56 and 57: Change x by the negative of the slope scaled by alpha. (x = x - (alpha*slope) )
(Repeat until slope == 0)

import numpy as np

alphas = [0.001,0.01,0.1,1,10,100,1000]

# compute sigmoid nonlinearity
def sigmoid(x):
    output = 1/(1+np.exp(-x))
    return output

# convert output of sigmoid function to its derivative
def sigmoid_output_to_derivative(output):
    return output*(1-output)
    
X = np.array([[0,0,1],
            [0,1,1],
            [1,0,1],
            [1,1,1]])
                
y = np.array([[0],
            [1],
            [1],
            [0]])

for alpha in alphas:
    print "\nTraining With Alpha:" + str(alpha)
    np.random.seed(1)

    # randomly initialize our weights with mean 0
    synapse_0 = 2*np.random.random((3,4)) - 1
    synapse_1 = 2*np.random.random((4,1)) - 1

    for j in xrange(60000):

        # Feed forward through layers 0, 1, and 2
        layer_0 = X
        layer_1 = sigmoid(np.dot(layer_0,synapse_0))
        layer_2 = sigmoid(np.dot(layer_1,synapse_1))

        # how much did we miss the target value?
        layer_2_error = layer_2 - y

        if (j% 10000) == 0:
            print "Error after "+str(j)+" iterations:" + str(np.mean(np.abs(layer_2_error)))

        # in what direction is the target value?
        # were we really sure? if so, don't change too much.
        layer_2_delta = layer_2_error*sigmoid_output_to_derivative(layer_2)

        # how much did each l1 value contribute to the l2 error (according to the weights)?
        layer_1_error = layer_2_delta.dot(synapse_1.T)

        # in what direction is the target l1?
        # were we really sure? if so, don't change too much.
        layer_1_delta = layer_1_error * sigmoid_output_to_derivative(layer_1)

        synapse_1 -= alpha * (layer_1.T.dot(layer_2_delta))
        synapse_0 -= alpha * (layer_0.T.dot(layer_1_delta))


Training With Alpha:0.001
Error after 0 iterations:0.496410031903
Error after 10000 iterations:0.495164025493
Error after 20000 iterations:0.493596043188
Error after 30000 iterations:0.491606358559
Error after 40000 iterations:0.489100166544
Error after 50000 iterations:0.485977857846

Training With Alpha:0.01
Error after 0 iterations:0.496410031903
Error after 10000 iterations:0.457431074442
Error after 20000 iterations:0.359097202563
Error after 30000 iterations:0.239358137159
Error after 40000 iterations:0.143070659013
Error after 50000 iterations:0.0985964298089

Training With Alpha:0.1
Error after 0 iterations:0.496410031903
Error after 10000 iterations:0.0428880170001
Error after 20000 iterations:0.0240989942285
Error after 30000 iterations:0.0181106521468
Error after 40000 iterations:0.0149876162722
Error after 50000 iterations:0.0130144905381

Training With Alpha:1
Error after 0 iterations:0.496410031903
Error after 10000 iterations:0.00858452565325
Error after 20000 iterations:0.00578945986251
Error after 30000 iterations:0.00462917677677
Error after 40000 iterations:0.00395876528027
Error after 50000 iterations:0.00351012256786

Training With Alpha:10
Error after 0 iterations:0.496410031903
Error after 10000 iterations:0.00312938876301
Error after 20000 iterations:0.00214459557985
Error after 30000 iterations:0.00172397549956
Error after 40000 iterations:0.00147821451229
Error after 50000 iterations:0.00131274062834

Training With Alpha:100
Error after 0 iterations:0.496410031903
Error after 10000 iterations:0.125476983855
Error after 20000 iterations:0.125330333528
Error after 30000 iterations:0.125267728765
Error after 40000 iterations:0.12523107366
Error after 50000 iterations:0.125206352756

Training With Alpha:1000
Error after 0 iterations:0.496410031903
Error after 10000 iterations:0.5
Error after 20000 iterations:0.5
Error after 30000 iterations:0.5
Error after 40000 iterations:0.5
Error after 50000 iterations:0.5

So, what did we observe with the different alpha sizes?

Alpha = 0.001
The network with a crazy small alpha hardly converged at all! This is because we made the weight updates so small that they hardly changed anything, even after 60,000 iterations! This is textbook Problem 3: When Slopes Are Too Small.

Alpha = 0.01
This alpha made a rather pretty convergence. It was quite smooth over the course of the 60,000 iterations but ultimately didn't converge as far as some of the others. This is still textbook Problem 3: When Slopes Are Too Small.

Alpha = 0.1
This alpha made some progress very quickly but then slowed down a bit. This is still Problem 3. We need to increase alpha some more.

Alpha = 1
As a keen eye might suspect, this had exactly the same convergence as if we had no alpha at all! Multiplying our weight updates by 1 doesn't change anything. :)

Alpha = 10
Perhaps you were surprised that an alpha that was greater than 1 achieved the best score after only 10,000 iterations! This tells us that our weight updates were being too conservative with smaller alphas. This means that with the smaller alpha values (less than 10), the network's weights were generally headed in the right direction, they just needed to hurry up and get there!

Alpha = 100
Now we can see that taking steps that are too large can be very counterproductive. The network's steps are so large that it can't find a reasonable low point in the error plane. This is textbook Problem 1. The alpha is too big, so it just jumps around on the error plane and never "settles" into a local minimum.

Alpha = 1000
And with an extremely large alpha, we see a textbook example of divergence, with the error increasing instead of decreasing... flatlining at 0.5. This is a more extreme version of Problem 1 where it overcorrects wildly and ends up very far away from any local minimum.

Let's Take a Closer Look

import numpy as np

alphas = [0.001,0.01,0.1,1,10,100,1000]

# compute sigmoid nonlinearity
def sigmoid(x):
    output = 1/(1+np.exp(-x))
    return output

# convert output of sigmoid function to its derivative
def sigmoid_output_to_derivative(output):
    return output*(1-output)
    
X = np.array([[0,0,1],
            [0,1,1],
            [1,0,1],
            [1,1,1]])
                
y = np.array([[0],
            [1],
            [1],
            [0]])



for alpha in alphas:
    print "\nTraining With Alpha:" + str(alpha)
    np.random.seed(1)

    # randomly initialize our weights with mean 0
    synapse_0 = 2*np.random.random((3,4)) - 1
    synapse_1 = 2*np.random.random((4,1)) - 1
        
    prev_synapse_0_weight_update = np.zeros_like(synapse_0)
    prev_synapse_1_weight_update = np.zeros_like(synapse_1)

    synapse_0_direction_count = np.zeros_like(synapse_0)
    synapse_1_direction_count = np.zeros_like(synapse_1)
        
    for j in xrange(60000):

        # Feed forward through layers 0, 1, and 2
        layer_0 = X
        layer_1 = sigmoid(np.dot(layer_0,synapse_0))
        layer_2 = sigmoid(np.dot(layer_1,synapse_1))

        # how much did we miss the target value?
        layer_2_error = y - layer_2

        if (j% 10000) == 0:
            print "Error:" + str(np.mean(np.abs(layer_2_error)))

        # in what direction is the target value?
        # were we really sure? if so, don't change too much.
        layer_2_delta = layer_2_error*sigmoid_output_to_derivative(layer_2)

        # how much did each l1 value contribute to the l2 error (according to the weights)?
        layer_1_error = layer_2_delta.dot(synapse_1.T)

        # in what direction is the target l1?
        # were we really sure? if so, don't change too much.
        layer_1_delta = layer_1_error * sigmoid_output_to_derivative(layer_1)
        
        synapse_1_weight_update = (layer_1.T.dot(layer_2_delta))
        synapse_0_weight_update = (layer_0.T.dot(layer_1_delta))
        
        if(j > 0):
            synapse_0_direction_count += np.abs(((synapse_0_weight_update > 0)+0) - ((prev_synapse_0_weight_update > 0) + 0))
            synapse_1_direction_count += np.abs(((synapse_1_weight_update > 0)+0) - ((prev_synapse_1_weight_update > 0) + 0))        
        
        synapse_1 += alpha * synapse_1_weight_update
        synapse_0 += alpha * synapse_0_weight_update
        
        prev_synapse_0_weight_update = synapse_0_weight_update
        prev_synapse_1_weight_update = synapse_1_weight_update
    
    print "Synapse 0"
    print synapse_0
    
    print "Synapse 0 Update Direction Changes"
    print synapse_0_direction_count
    
    print "Synapse 1"
    print synapse_1

    print "Synapse 1 Update Direction Changes"
    print synapse_1_direction_count



Training With Alpha:0.001
Error:0.496410031903
Error:0.495164025493
Error:0.493596043188
Error:0.491606358559
Error:0.489100166544
Error:0.485977857846
Synapse 0
[[-0.28448441  0.32471214 -1.53496167 -0.47594822]
 [-0.7550616  -1.04593014 -1.45446052 -0.32606771]
 [-0.2594825  -0.13487028 -0.29722666  0.40028038]]
Synapse 0 Update Direction Changes
[[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 1.  0.  1.  1.]]
Synapse 1
[[-0.61957526]
 [ 0.76414675]
 [-1.49797046]
 [ 0.40734574]]
Synapse 1 Update Direction Changes
[[ 1.]
 [ 1.]
 [ 0.]
 [ 1.]]

Training With Alpha:0.01
Error:0.496410031903
Error:0.457431074442
Error:0.359097202563
Error:0.239358137159
Error:0.143070659013
Error:0.0985964298089
Synapse 0
[[ 2.39225985  2.56885428 -5.38289334 -3.29231397]
 [-0.35379718 -4.6509363  -5.67005693 -1.74287864]
 [-0.15431323 -1.17147894  1.97979367  3.44633281]]
Synapse 0 Update Direction Changes
[[ 1.  1.  0.  0.]
 [ 2.  0.  0.  2.]
 [ 4.  2.  1.  1.]]
Synapse 1
[[-3.70045078]
 [ 4.57578637]
 [-7.63362462]
 [ 4.73787613]]
Synapse 1 Update Direction Changes
[[ 2.]
 [ 1.]
 [ 0.]
 [ 1.]]

Training With Alpha:0.1
Error:0.496410031903
Error:0.0428880170001
Error:0.0240989942285
Error:0.0181106521468
Error:0.0149876162722
Error:0.0130144905381
Synapse 0
[[ 3.88035459  3.6391263  -5.99509098 -3.8224267 ]
 [-1.72462557 -5.41496387 -6.30737281 -3.03987763]
 [ 0.45953952 -1.77301389  2.37235987  5.04309824]]
Synapse 0 Update Direction Changes
[[ 1.  1.  0.  0.]
 [ 2.  0.  0.  2.]
 [ 4.  2.  1.  1.]]
Synapse 1
[[-5.72386389]
 [ 6.15041318]
 [-9.40272079]
 [ 6.61461026]]
Synapse 1 Update Direction Changes
[[ 2.]
 [ 1.]
 [ 0.]
 [ 1.]]

Training With Alpha:1
Error:0.496410031903
Error:0.00858452565325
Error:0.00578945986251
Error:0.00462917677677
Error:0.00395876528027
Error:0.00351012256786
Synapse 0
[[ 4.6013571   4.17197193 -6.30956245 -4.19745118]
 [-2.58413484 -5.81447929 -6.60793435 -3.68396123]
 [ 0.97538679 -2.02685775  2.52949751  5.84371739]]
Synapse 0 Update Direction Changes
[[ 1.  1.  0.  0.]
 [ 2.  0.  0.  2.]
 [ 4.  2.  1.  1.]]
Synapse 1
[[ -6.96765763]
 [  7.14101949]
 [-10.31917382]
 [  7.86128405]]
Synapse 1 Update Direction Changes
[[ 2.]
 [ 1.]
 [ 0.]
 [ 1.]]

Training With Alpha:10
Error:0.496410031903
Error:0.00312938876301
Error:0.00214459557985
Error:0.00172397549956
Error:0.00147821451229
Error:0.00131274062834
Synapse 0
[[ 4.52597806  5.77663165 -7.34266481 -5.29379829]
 [ 1.66715206 -7.16447274 -7.99779235 -1.81881849]
 [-4.27032921 -3.35838279  3.44594007  4.88852208]]
Synapse 0 Update Direction Changes
[[  7.  19.   2.   6.]
 [  7.   2.   0.  22.]
 [ 19.  26.   9.  17.]]
Synapse 1
[[ -8.58485788]
 [ 10.1786297 ]
 [-14.87601886]
 [  7.57026121]]
Synapse 1 Update Direction Changes
[[ 22.]
 [ 15.]
 [  4.]
 [ 15.]]

Training With Alpha:100
Error:0.496410031903
Error:0.125476983855
Error:0.125330333528
Error:0.125267728765
Error:0.12523107366
Error:0.125206352756
Synapse 0
[[-17.20515374   1.89881432 -16.95533155  -8.23482697]
 [  5.70240659 -17.23785161  -9.48052574  -7.92972576]
 [ -4.18781704  -0.3388181    2.82024759  -8.40059859]]
Synapse 0 Update Direction Changes
[[  8.   7.   3.   2.]
 [ 13.   8.   2.   4.]
 [ 16.  13.  12.   8.]]
Synapse 1
[[  9.68285369]
 [  9.55731916]
 [-16.0390702 ]
 [  6.27326973]]
Synapse 1 Update Direction Changes
[[ 13.]
 [ 11.]
 [ 12.]
 [ 10.]]

Training With Alpha:1000
Error:0.496410031903
Error:0.5
Error:0.5
Error:0.5
Error:0.5
Error:0.5
Synapse 0
[[-56.06177241  -4.66409623  -5.65196179 -23.05868769]
 [ -4.52271708  -4.78184499 -10.88770202 -15.85879101]
 [-89.56678495  10.81119741  37.02351518 -48.33299795]]
Synapse 0 Update Direction Changes
[[ 3.  2.  4.  1.]
 [ 1.  2.  2.  1.]
 [ 6.  6.  4.  1.]]
Synapse 1
[[  25.16188889]
 [  -8.68235535]
 [-116.60053379]
 [  11.41582458]]
Synapse 1 Update Direction Changes
[[ 7.]
 [ 7.]
 [ 7.]
 [ 3.]]

What I did in the above code was count the number of times a derivative changed direction. That's the "Update Direction Changes" readout at the end of training. If a slope (derivative) changes direction, it means that it passed OVER the local minimum and needs to go back. If it never changes direction, it means that it probably didn't go far enough.

A Few Takeaways:

When the alpha was tiny, the derivatives almost never changed direction.
When the alpha was optimal, the derivative changed directions a TON.
When the alpha was huge, the derivative changed directions a medium amount.
When the alpha was tiny, the weights ended up being reasonably small too.
When the alpha was huge, the weights got huge too!
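
The counting trick itself is easy to try in one dimension. Here is a minimal Python 3 sketch on a hypothetical bucket f(x) = x**2 (the function name, alphas, and step count are my own choices, and this toy has only one weight, so it won't reproduce the patterns in the readouts above). It only shows what a "direction change" is: the slope flipping sign between two consecutive updates.

```python
def slope(x):
    return 2 * x   # derivative of the hypothetical bucket f(x) = x**2

def count_direction_changes(alpha, x=5.0, steps=20):
    changes = 0
    previous = None
    for _ in range(steps):
        current = slope(x)
        # a flip in sign means the last update jumped over the minimum
        if previous is not None and (current > 0) != (previous > 0):
            changes += 1
        x = x - alpha * current
        previous = current
    return changes

print(count_direction_changes(alpha=0.1))   # 0
print(count_direction_changes(alpha=0.9))   # 19
```

With alpha = 0.1 the ball creeps toward the bottom from one side and never crosses it. With alpha = 0.9 it jumps over the bottom on every single update, yet still converges because each jump lands a little closer than the last.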

Improvement 2: Parameterizing the Size of the Hidden Layer

Being able to increase the size of the hidden layer increases the amount of search space that we converge to in each iteration. Consider the network and output below.

import numpy as np

alphas = [0.001,0.01,0.1,1,10,100,1000]
hiddenSize = 32

# compute sigmoid nonlinearity
def sigmoid(x):
    output = 1/(1+np.exp(-x))
    return output

# convert output of sigmoid function to its derivative
def sigmoid_output_to_derivative(output):
    return output*(1-output)
    
X = np.array([[0,0,1],
            [0,1,1],
            [1,0,1],
            [1,1,1]])
                
y = np.array([[0],
            [1],
            [1],
            [0]])

for alpha in alphas:
    print "\nTraining With Alpha:" + str(alpha)
    np.random.seed(1)

    # randomly initialize our weights with mean 0
    synapse_0 = 2*np.random.random((3,hiddenSize)) - 1
    synapse_1 = 2*np.random.random((hiddenSize,1)) - 1

    for j in xrange(60000):

        # Feed forward through layers 0, 1, and 2
        layer_0 = X
        layer_1 = sigmoid(np.dot(layer_0,synapse_0))
        layer_2 = sigmoid(np.dot(layer_1,synapse_1))

        # how much did we miss the target value?
        layer_2_error = layer_2 - y

        if (j% 10000) == 0:
            print "Error after "+str(j)+" iterations:" + str(np.mean(np.abs(layer_2_error)))

        # in what direction is the target value?
        # were we really sure? if so, don't change too much.
        layer_2_delta = layer_2_error*sigmoid_output_to_derivative(layer_2)

        # how much did each l1 value contribute to the l2 error (according to the weights)?
        layer_1_error = layer_2_delta.dot(synapse_1.T)

        # in what direction is the target l1?
        # were we really sure? if so, don't change too much.
        layer_1_delta = layer_1_error * sigmoid_output_to_derivative(layer_1)

        synapse_1 -= alpha * (layer_1.T.dot(layer_2_delta))
        synapse_0 -= alpha * (layer_0.T.dot(layer_1_delta))

Training With Alpha:0.001
Error after 0 iterations:0.496439922501
Error after 10000 iterations:0.491049468129
Error after 20000 iterations:0.484976307027
Error after 30000 iterations:0.477830678793
Error after 40000 iterations:0.46903846539
Error after 50000 iterations:0.458029258565

Training With Alpha:0.01
Error after 0 iterations:0.496439922501
Error after 10000 iterations:0.356379061648
Error after 20000 iterations:0.146939845465
Error after 30000 iterations:0.0880156127416
Error after 40000 iterations:0.065147819275
Error after 50000 iterations:0.0529658087026

Training With Alpha:0.1
Error after 0 iterations:0.496439922501
Error after 10000 iterations:0.0305404908386
Error after 20000 iterations:0.0190638725334
Error after 30000 iterations:0.0147643907296
Error after 40000 iterations:0.0123892429905
Error after 50000 iterations:0.0108421669738

Training With Alpha:1
Error after 0 iterations:0.496439922501
Error after 10000 iterations:0.00736052234249
Error after 20000 iterations:0.00497251705039
Error after 30000 iterations:0.00396863978159
Error after 40000 iterations:0.00338641021983
Error after 50000 iterations:0.00299625684932

Training With Alpha:10
Error after 0 iterations:0.496439922501
Error after 10000 iterations:0.00224922117381
Error after 20000 iterations:0.00153852153014
Error after 30000 iterations:0.00123717718456
Error after 40000 iterations:0.00106119569132
Error after 50000 iterations:0.000942641990774

Training With Alpha:100
Error after 0 iterations:0.496439922501
Error after 10000 iterations:0.5
Error after 20000 iterations:0.5
Error after 30000 iterations:0.5
Error after 40000 iterations:0.5
Error after 50000 iterations:0.5

Training With Alpha:1000
Error after 0 iterations:0.496439922501
Error after 10000 iterations:0.5
Error after 20000 iterations:0.5
Error after 30000 iterations:0.5
Error after 40000 iterations:0.5
Error after 50000 iterations:0.5

Notice that the best error with 32 nodes is 0.0009 whereas the best error with 4 hidden nodes was only 0.0013. This might not seem like much, but it's an important lesson. We do not need any more than 3 nodes to represent this dataset. However, because we had more nodes when we started, we searched more of the space in each iteration and ultimately converged faster. Even though this is very marginal in this toy problem, this effect plays a huge role when modeling very complex datasets.

Part 6: Conclusion and Future Work

My Recommendation:

If you're serious about neural networks, I have one recommendation. Try to rebuild this network from memory. I know that might sound a bit crazy, but it seriously helps. If you want to be able to create arbitrary architectures based on new academic papers or read and understand sample code for these different architectures, I think that it's a killer exercise. I think it's useful even if you're using frameworks like Torch, Caffe, or Theano. I worked with neural networks for a couple years before performing this exercise, and it was the best investment of time I've made in the field (and it didn't take long).

Future Work

This toy example still needs quite a few bells and whistles to really approach the state-of-the-art architectures. Here are a few things you can look into if you want to further improve your network. (Perhaps I will in a followup post.)

• Bias Units
• Mini-Batches
• Delta Trimming
• Parameterized Layer Sizes
• Regularization
• Dropout
• Momentum
• Batch Normalization
• GPU Compatibility
• Other Awesomeness You Implement
