Untrusted data is most often data that comes from the HTTP request, in the form of URL parameters, form fields, headers, or cookies. But data that comes from databases, web services, and other sources is frequently untrusted from a security perspective. That is, untrusted data is input that can be manipulated to contain a web attack payload. The OWASP Code Review Guide has a decent list of methods that return untrusted data in various languages, but you should be careful about your own methods as well.
Untrusted data should always be treated as though it contains an attack. That means you should not send it anywhere without taking steps to make sure that any attacks are detected and neutralized. As applications get more and more interconnected, the likelihood of a buried attack being decoded or executed by a downstream interpreter increases rapidly.
Traditionally, input validation has been the preferred approach for handling untrusted data. However, input validation is not a great solution for injection attacks. First, input validation is typically done when the data is received, before the destination is known. That means that we don't know which characters might be significant in the target interpreter. Second, and possibly even more importantly, applications must allow potentially harmful characters in. For example, should poor Mr. O'Malley be prevented from registering in the database simply because SQL considers ' a special character?
While input validation is important and should always be performed, it is not a complete solution for injection attacks. It's better to think of input validation as defense in depth and use escaping as described below as the primary defense.
Escaping (aka Output Encoding)
"Escaping" is a technique used to ensure that characters are treated as data, not as characters that are relevant to the interpreter's parser. There are lots of different types of escaping, sometimes confusingly called output "encoding." Some of these techniques define a special "escape" character, and other techniques have a more sophisticated syntax that involves several characters.
Do not confuse output escaping with the notion of Unicode character encoding, which involves mapping a Unicode character to a sequence of bits. This level of encoding is automatically decoded, and does not defuse attacks. However, if there are misunderstandings about the intended charset between the server and browser, it may cause unintended characters to be communicated, possibly enabling XSS attacks. This is why it is still important to specify the Unicode character encoding (charset), such as UTF-8, for all communications.
Escaping is the primary means to make sure that untrusted data can't be used to convey an injection attack. There is no harm in escaping data properly - it will still render in the browser properly. Escaping simply lets the interpreter know that the data is not intended to be executed, and therefore prevents attacks from working.
Injection is an attack that involves breaking out of a data context and switching into a code context through the use of special characters that are significant in the interpreter being used. A data context is like <div>data context</div>. If the attacker's data gets placed into the data context, they might break out like this <div>data < script>alert("attack")</script> context</div>.
To really understand what's going on with XSS, you have to consider injection into the hierarchical structure of the HTML DOM. Given a place to insert data into an HTML document (that is, a place where a developer has allowed untrusted data to be included in the DOM), there are two ways to inject code:
- Injecting UP
- Injecting DOWN
The rules in this document have been designed to prevent both UP and DOWN varieties of XSS injection. To prevent injecting up, you must escape the characters that would allow you to close the current context and start a new one. To prevent attacks that jump up several levels in the DOM hierarchy, you must also escape all the characters that are significant in all enclosing contexts. To prevent injecting down, you must escape any characters that can be used to introduce a new sub-context within the current context.