Sanitizing User-generated Inputs

Strategies to effectively sanitize user-generated data.

November 21, 2016 Edwin Velazquez


Every input is an opening that can be exploited by malicious users. Even though modern web applications have more openings to potentially attack, the attention given to safely handling user-generated data seems to have decreased. This can be seen as a byproduct of the ever-increasing complexity of modern web applications and the limited mental resources that can be allocated to each portion of an application. The rise of web development frameworks is also partially to blame. By managing many of our security concerns, they create an abstraction between the logic and the developer. This abstraction can create a false sense of security, diverting a developer's mental resources to other parts of the application.

Many of our security concerns when dealing with user-generated data can be alleviated by understanding a few basic principles essential to data sanitation. As simple as these strategies might seem, understanding how to use them effectively is vital to developing secure interactive applications.

Whitelisting

Whitelisting is the process of defining the smallest set of “good values” and rejecting everything that does not fall within that range. Your goal, when implementing this strategy, is to reduce the size and variation of the input. Some examples of whitelisting in everyday applications are:

  • Reducing your data against a list of all possible values. States and provinces can be reduced to a list of allowable inputs. Example: ( NY, CA, PA, NJ, MA, FL, etc… ).
  • Reducing the range of the basic building block of data. In the case of a string, we can reduce the allowable set of characters.
  • Defining a range for the length of the input.
  • Typecasting. Converting the type of your data is an effective tool to reduce the size and variation of your data.
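
The four techniques above can be sketched as a few small validators. This is a minimal illustration in Python (the article does not prescribe a language); the field names, allowed character set, and numeric limits are assumptions made for the example, not rules from the article.

```python
import string

# Whitelist of all possible values, per the states/provinces example.
ALLOWED_STATES = {"NY", "CA", "PA", "NJ", "MA", "FL"}

# Reduced character set for a hypothetical free-text "name" field.
ALLOWED_CHARS = set(string.ascii_letters + " -'")

def whitelist_state(value):
    """Reduce input against a list of all possible values."""
    value = value.strip().upper()
    if value not in ALLOWED_STATES:
        raise ValueError("state not in whitelist")
    return value

def whitelist_name(value, max_len=50):
    """Restrict both the allowable character set and the length."""
    if not (1 <= len(value) <= max_len):
        raise ValueError("length out of range")
    if not set(value) <= ALLOWED_CHARS:
        raise ValueError("disallowed characters")
    return value

def whitelist_age(value):
    """Typecast to shrink the input's variation, then range-check it."""
    age = int(value)  # raises ValueError for non-numeric input
    if not (0 <= age <= 130):
        raise ValueError("age out of range")
    return age
```

Note how each validator rejects anything outside its narrow set rather than trying to enumerate what is dangerous; that is the defining trait of a whitelist.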

Whitelisting should be the first strategy implemented when sanitizing user-generated data. By reducing the size and variation of your input, you are also reducing the types of attacks a potential hacker might be able to exploit. Almost any data gathered from a user can be drastically whitelisted. The inability to whitelist user-generated data is a sign that the input is trying to do too much and should be further reduced into smaller, more manageable subsets. As you will soon see, whitelisted data makes each of the other two strategies far easier to implement.

Blacklisting

Blacklisting is the process of defining unwanted subsets of data and removing them. It is commonly used to remove characters and patterns that can be used to compromise a system. Some examples of blacklisting in everyday applications are:

  • Removing HTML tags and entities to prevent XSS attacks.
  • Removing characters and patterns commonly used in SQL injection.
  • Removing unneeded formatting characters from an input.

A validation strategy that relies entirely on blacklisting assumes that you know all the combinations of values, both currently known and those yet to be discovered, that can lead to a malicious attack. There are millions of different combinations that can be used to attack an application. Because of the impossibility of this task, it is important that all data is whitelisted before attempting any form of blacklisting. The reduction of variation in data also reduces the number of inputs that can compromise a system. It is important to note that you will never be able to create a bulletproof application, but you can reduce the probability that your application will be exploited over a period of time.

A common method to bypass blacklisting filters is to embed malicious data within malicious data. The removal of the obvious attack reveals another attack, often worse than the one initially removed. For this reason, it is important to perform any form of blacklisting recursively, stopping only when the data no longer changes. In the case below, the blacklist filter looks for script tags and removes them along with any content they hold inside.

Before:

  <scr<script>console.log('Testing 1');</script>ipt> document.write('cookie: ' + document.cookie) </sc<script>console.log('Testing 2');</script>ript>

After:

  <script> document.write('cookie: ' + document.cookie) </script>
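
A recursive version of this filter can be sketched as a loop that reapplies the blacklist until the input stops changing. This is an illustrative Python sketch, not a production XSS filter; the regular expression handles only the simple script-tag case from the example above.

```python
import re

# Blacklist pattern: a <script> tag and everything it holds inside.
SCRIPT_TAG = re.compile(r"<script[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL)

def strip_scripts(value):
    """Apply the blacklist filter repeatedly until the input stops
    changing, so a nested payload cannot survive a single pass."""
    while True:
        cleaned = SCRIPT_TAG.sub("", value)
        if cleaned == value:
            return cleaned
        value = cleaned

# The nested payload from the example above.
payload = ("<scr<script>console.log('Testing 1');</script>ipt>"
           " document.write('cookie: ' + document.cookie) "
           "</sc<script>console.log('Testing 2');</script>ript>")
```

A single pass over `payload` removes the two inner script tags and leaves behind the reassembled attack shown in the "After" sample; the recursive version keeps going and removes that one as well.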

Encoding / Escaping

Whitelisting and blacklisting are destructive actions; that is to say, information is lost when you pass your data through either filter. It is common to come across cases where a potentially dangerous portion of your data is actually desired. Since data travels to different destinations throughout an application, what might be dangerous in one location can be perfectly normal in another. In cases where you need to sanitize without altering your data, a better strategy is data encoding.

Encoding, or escaping, is the act of replacing characters or patterns with different characters or patterns in a way that allows the initial data to be recovered. It is important to note that encoding is reversible, meaning that a potentially harmful version of the data still lies within. Some examples of input encoding in everyday applications are:

  • URL escaping.
  • Encoding for privacy reasons.
  • HTML encoding.
  • CSS escaping.
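
Python's standard library happens to cover two of these cases directly, which makes for a compact illustration of the non-destructive property: the data survives intact, but is rendered harmless for its destination.

```python
import html
import urllib.parse

user_input = "<script>alert('hi')</script>"

# HTML-encode before interpolating into markup: the browser will
# render the payload as text instead of executing it.
as_html = html.escape(user_input)
# -> &lt;script&gt;alert(&#x27;hi&#x27;)&lt;/script&gt;

# URL-escape the same data before placing it in a query string.
as_url = urllib.parse.quote(user_input)

# Both encodings are reversible; the original data is recoverable.
assert html.unescape(as_html) == user_input
assert urllib.parse.unquote(as_url) == user_input
```

The same input needed two different encodings for two different destinations, which is exactly why the destination of the data has to drive the choice of encoder.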

Encoding vs Whitelisting / Blacklisting

It is common to see developers use some form of encoding or escaping as their only form of data sanitation for certain inputs. While encoding accomplishes some of the same results as the previous two strategies, it also comes with significant drawbacks that should be kept in mind. Most of these drawbacks stem from the fact that encoding adds another layer of complexity that can be used to attack your system. As we have stated before, data travels to many different destinations throughout your application, and at each stage we need to take the proper precautions when encoding and decoding data.

A common method to bypass validation involving encoded data is to encode the input twice before submission. The server, expecting the message to be encoded once, runs the decoder once and performs any validation and/or filtering on the resulting data. Validation is likely to pass since the input is still encoded, hiding the malicious attack. Just as with blacklisting, it is important to decode encoded input until no further change is detected.

Initial:

  <script> console.log("testing") </script>

After double encoding:

  %253Cscript%253E%2520console.log(%2522testing%2522)%2520%253C%252Fscript%253E

After decoding:

  %3Cscript%3E%20console.log(%22testing%22)%20%3C%2Fscript%3E
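
The defense mirrors the recursive blacklist: keep decoding until the input stops changing. A minimal Python sketch, using URL decoding for the example above; the round limit is an assumption added to guard against pathologically deep encodings.

```python
import urllib.parse

def fully_decode(value, max_rounds=10):
    """URL-decode repeatedly until no further change is detected,
    so a double-encoded payload cannot slip past validation
    half-decoded."""
    for _ in range(max_rounds):
        decoded = urllib.parse.unquote(value)
        if decoded == value:
            return decoded
        value = decoded
    raise ValueError("input encoded too many times")

# The double-encoded payload from the example above.
double_encoded = ("%253Cscript%253E%2520console.log(%2522testing%2522)"
                  "%2520%253C%252Fscript%253E")
```

Run on the double-encoded payload, a single `unquote` returns the still-encoded intermediate form, while `fully_decode` recovers the raw script tag that validation actually needs to see.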

Conclusion

By strategically combining these three simple strategies, you can dramatically reduce the probability of an attack through user-generated data. Start the sanitizing process by passing the input through a whitelist filter. Reduce variation in your data using techniques such as defining the allowable set of characters, restricting the length, and typecasting your input. If the data cannot be significantly reduced, break it into smaller subsets. Only after the whitelisting process should you start passing the input through blacklist filters. Remove any potentially malicious characters and patterns, keeping in mind the path your data will take. If you find yourself removing more than is manageable, go back and further whitelist your data. If an input is still potentially dangerous but vital to the functionality of your application, encode or escape the data. During the sanitation process, put yourself in the shoes of an attacker and look out for the many techniques used to bypass these filters. Over time this process will become second nature. With this knowledge and a bit of luck, your application will be free from successful attacks.