OWASP Testing Guide Appendix D: Encoded Injection
Character Encoding is primarily used to represent characters, numbers and other symbols in a format that is suitable for a computer to understand, store, and render data. It is, in simple terms, the conversion of bytes into characters - characters belonging to different languages like English, Chinese, Greek or any other known language. A common and one of the early character encoding schemes is ASCII (American Standard Code for Information Interchange) that initially, used 7 bit coded characters. Today, the most common encoding scheme used is Unicode (UTF 8)
Character encoding can be used in different ways to attack a web application:
Web applications usually employ different types of input filtering mechanisms to limit the input that can be submitted by its users. If these input filters are not implemented sufficiently well, it is possible to slip a character or two through these filters. For instance, a / can be represented as 2F (hex) in ASCII, while the same character (/) is encoded as C0 AF in Unicode (2 byte sequence). Therefore, it is important for the input filtering control to be aware of the encoding scheme used. If the filter is found to be detecting UTF 8 encoded injections a different encoding scheme may be employed to bypass the filter.
Web browsers, in order to coherently display a web page, are required to be aware of the encoding scheme used. Ideally, this information should be provided to the browser through HTTP headers (“Content-Type”) as shown below:
Content-Type: text/html; charset=UTF-8
or through HTML META tag (“META HTTP-EQUIV”), as shown below:
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
It is through these character encoding declarations that the browser understands which set of characters to use when converting bytes to characters.
However, in the event of not receiving this information from the server, the browser either attempts to ‘guess’ the encoding scheme or reverts to a default scheme. In some cases, the user explicitly sets the default encoding in the browser to a different scheme. Any such mismatch in the encoding scheme used by the web page and the browser may cause the browser to interpret the page in a manner that is unintended or unexpected. This behavior of the browser is sometimes exploited to bypass output encoding mechanisms i.e HTML entities encoding (< for ‘<’, > for ‘>’).
Consider the following scenarios that better illustrate the idea of filter bypass using character encoding
All the scenarios given below form only a subset of the various ways obfuscation can be achieved in order to bypass input filters. Also, the success of encoded injections depends on the browser in use. For e.g US-ASCII encoded injections were previously successful only in IE browser but not in Firefox. Therefore, it may be noted that encoded injections, to a large extent, are browser dependent.
Consider a basic input validation filter that protects against injection of single quote character. In this case the following injection would easily bypass this filter:
The above uses HTML Entities to construct the injection string. HTML Entities encoding is used to display characters that have a special meaning in HTML. For instance, ‘>’ works as a closing bracket for a HTML tag. In order to actually display this character on the web page HTML character entities should be inserted in the page source. The injections mentioned above are one way of encoding. There are numerous other ways in which a string can be encoded (obfuscated) in order to bypass the above filter.