How can we make Machine Learning safer and more stable?

Introduction to Machine Learning Robustness

Machine Learning models are great and powerful. However, the usual characteristics of regular training can lead to serious consequences in terms of security and safety. In this blog post, we will take a step back to revisit a regular optimization problem using an example of a binary classification. We will show a way to create more robust and stable models that use features that are more meaningful to humans. In our experiments, we will do a simple binary classification to recognize the digits zero and one from the MNIST dataset. 

Firstly, we will introduce why regular training is not perfect. Next, we will briefly sum up what regular training looks like, and then we will outline more robust training. Finally, we will show the implementation of our experiments and our final results (the GitHub repository is here).


Machine Learning models have achieved extraordinary results across a variety of domains such as computer vision, speech recognition, and natural language modeling. Using a training dataset, a model looks for any correlations between features that are useful in predictions. Every deep neural network has millions of weak patterns, which interact, and on average, give the best results. Nowadays, models in use are huge regardless of the domain e.g. the Inception-V4 (computer vision) contains around 55 million parameters, the DeepSpeech-2 (speech recognition) over 100 million parameters, or the GPT-2 (NLP language model) over 1.5 billion parameters. To feed such big models, we are forced to use unsupervised (or semi-supervised) learning. As a result, we often end up with (nearly) black-box models, which make decisions using tons of well-generalized weak features, which are not interpretable to humans.  This fundamental property can lead to severe and dangerous consequences in the security and safety of deep neural networks in particular.

Why should we care about weak features?

The key is that they are (rather) imperceptible to humans. From the perspective of security, if you know how to fake weak features in input data, you can invisibly take full control of model predictions. This method is called an adversarial attack. This is based on finding a close perturbation of the input (commonly using a gradient), which crosses the decision boundary, and changes the prediction (sometimes to a chosen target, called targeted attacks). Unfortunately, most of the state-of-the-art models, regardless of the domain (image classification, speech recognition, object detection, malware detection), are vulnerable to this kind of attack. In fact, you do not even need to have access to the model itself. The models are so unstable that a rough model approximation is enough to fool them (transferability in black-box attacks).

Safety is another perspective.

Our incorrect assumption that training datasets reflect a true distribution sometimes comes back to haunt us (intensified by data poisoning). In deep neural networks, changes in distribution can unpredictably trigger weak features. This usually gives a slight decline in performance on average, which is fine. However, this decrease often comes as a result of rare events, in which the model will without a doubt offer wrong predictions  (think about incidents regarding self-driving cars).

Regular Binary Classification

Let’s summarize our regular training. The model based on the  blank  input makes a hypothesis  blank  to predict the correct target  blank , where  blank  in the binary classification. The binary loss function can be simplified to the one-argument function  blank, and we can use the elegant hinge loss, which is known as the soft-margin in the SVM. To fully satisfy the loss, the model has to not only ideally separate classes but also preserve a sufficient margin between them (figures code left and right).


For our experiment, we are using a simple linear classifier, so the model has only a single vector blank and bias blankThe landscape of loss in terms of  blank  for non-linear models is highly irregular (left figure, code here), however in our case, it is just a straight line (right figure, code here).


Using a dataset and an optimization method gradient descent, we follow the gradient and look for the model parameters, which minimize the loss function:


Super straightforward, right? This is our regular training. We minimize the expected value of the loss. Therefore, the model is looking for any (even weak) correlations, which improve the performance on average (no matter how disastrous its predictions sometimes are). 


Towards Robust Binary Classification

As we mentioned in the introduction, machine learning models (deep neural networks in particular) are sensitive to small changes. Therefore now, we allow the input to be perturbed a little bit. We are not interested in patterns which concern blank exclusively but the delta neighborhood around blank. In consequence, we face the min-max problem and two related challenges.
Firstly, how can we construct valid perturbations blank? We want to formulate a space (epsilon-neighborhood) around blank (figure below, code here), which sustains human understanding about this space. In our case, if a point blank describes the digit one, then we have to guarantee that each perturbation blank looks like the digit one. We do not know how to do this formally. However, we can (sure enough) assume that the small norm perturbations are correct  blank. In our experiments, we are using the infinity norm blank (others are common too). These tiny boxes are neat because valid perturbations are in the range blank to blank , independent of dimension.


The second challenge is how to solve the inner maximization problem. Most advanced machine learning models are highly non-linear, so this is tough in general. There are several methods to approximate the solution (a lower or upper bound), which we are going to cover in upcoming blog posts. Hopefully, in the linear case, we can easily solve this exactly because the loss directly depends on blank, our simplified loss landscape (formal details here).


The basic intuition is that we do not penalize high weights, which are far from the decision boundary (in contrast to regularization blankHowever, this is far from a complete explanation. 

Firstly, the loss does not penalize if a classifier makes a mistake that is close to the decision boundary (left figure, code here). The error tolerance dynamically changes with regards to model weights, shifting the loss curve. As opposed  to regular training, we do not force a strictly defined margin to be preserved, which sometimes can not be achieved.


Secondly, the back propagation is different (right figure, code here). The gradient is not only diminished, but also if it is smaller than epsilon, it can even change the sign. As a result, the weights that are smaller than epsilon are gently wiped off. 

Finally, our goal is to minimize the expected value of the loss not only of input blank, but the entire subspace around blank:



As we have already mentioned, today, we are doing a simple binary classification. Let’s briefly present regular training ( We reduce the MNIST dataset to include exclusively only the digits zero and one (transforming the original dataset here). We build a regular linear classifier, SGD optimizer, and hinge loss. We work with the high-level Keras under Tensorflow 2.0 with eager execution (PyTorch alike). 


In contrast to just invoking the built-in fit method, we build the custom routine to have full access to any variable or gradient. We abstract the train_step, which processes a single batch. We build several callbacks to collect partial results for further analysis.


The robust training is similar. The crucial change is the customized loss, which contains additionally  blank  term. More details are in



We do a binary classification to recognize the digits zero and one from the MNIST dataset. We train regular and robust models using presented scripts). Our regular model achieves super results (robust models are slightly worse). 


We have only single mistakes (figure below, code here). A dot around a digit causes general confusion. Nevertheless, we have incredibly precise classifiers. We have achieved human performance in recognizing handwritten digits zero and one, haven’t we? 


Not really. We can precisely predict (without overfitting, take a look at the tiny gap between the train and the test results), that’s all. We are absolutely far from human reasoning when it comes to recognizing the digits zero and one. To demonstrate this, we can check the model weights (figure below, code here). Our classifiers are linear therefore the reasoning is straightforward. You can imagine a kind of a single stamp. The decision moves toward the  digit one if black pixels are activated (and to the digit zero if white). The regular model contains a huge amount of weak features, which do not make sense for us, but generalize well. In contrast, robust models  wipe out weak features (which are smaller than epsilon), and stick with more robust and human aligned features.


Now, we will make several white-box adversarial attacks and try to fool our models. We evaluate the models on perturbed test datasets, in which each sample is moved directly towards a decision boundary. In our linear case, the perturbations can be easily defined as:

blankwhere we check out several epsilons (figure below, code here).


As we expect, the regular model is brittle due to the huge amount of weak features. Below, we present the misclassified samples which are closest to the boundary decision (predicted logits are around zero, figure code here). Now, we can understand how perturbed images are so readable to humans  e.g. blank where the regular classifier has the accuracy around zero. The regular classifier absolutely does not know what the digit zero or one looks like. In contrast, robust models are generally confused about the different digit structures, nonetheless, the patterns are more robust and understandable to us.


In the end, we present a slightly different perspective. Take a look at the logit distributions of misclassified samples. We see that the regular model is extremely confident about wrong predictions. In contrast, the robust model (even if it is fooled) is uncertain, because logits tend to be close to zero. Robust models seem to be more reliable (figure code here).



Machine Learning models are great and powerful. However, the characteristics of regular training can lead to serious consequences in terms of the security and safety of deep neural networks in particular. In this blog post, we have shown what simple robust training can look like. This is only a simple binary case, which (we hope) gives more intuition about the drawbacks of regular training, and shows why these problems are so vital. Of course, things are more complex if we want to force deep neural networks to be more robust because performance rapidly declines, and models become useless. Nonetheless, the machine learning community is working hard to popularize and develop the idea of a robust machine learning, as this blog post has tried to do.

In the next blog posts, we will present how to achieve more robust deep neural networks, and how they can be super meaningful.


Blog posts:



You might also like


Rafał Rolczyński

Machine Learning Engineer. Worked on both industrial and academic projects for startup companies, tech giants, and public sector institutions on a wide range of topics,from natural language processing to speech recognition.

Latest Blogposts

08.04.2021 / By Daria Karasek

How Outsourcing Can Solve Technical Debt

Technical debt has become widely prevalent nowadays. Since technology is constantly evolving, many businesses have to choose between acquiring new solutions or sticking with tried-and-tested ones. No right answer can be given to this hard choice. But even though debt sounds threatening for many, it doesn't always have to be.

30.03.2021 / By Marcin Krykowski

Top 15 Scala Libraries for Data Science in 2021

Python and R seem to be the first choice when it comes to data engineering. However, if we look a bit deeper, Scala is gaining more and more popularity. This is because, thanks to its functional nature, it is a more common choice when it comes to building powerful applications that need to cooperate with a vast amount of data.

23.03.2021 / By Tomasz Bogus

How To Tackle Technical Debt Quadrants?

You come across the term technical debt, read a little about this matter, know everything in theory, and you’re still not sure how to correctly identify whether some kind of design flaws are actually technical debt or not.  We have a solution that can help you with this issue – Technical Debt Quadrants. This method […]


Need a successful project?

Estimate project