How to perform HTML entity encoding in Java

Status
Released 14/1/2008

Overview
Injection attacks rely on the fact that interpreters take data and execute it as commands. If an attacker can modify the data that's sent to an interpreter, they may be able to make it misbehave. One way to help prevent this from happening is to encode the attacker's data in such a way that the interpreter will not get confused. HTML entity encoding is just such an encoding mechanism for many interpreters.

This is not a guarantee, by the way. It's almost certain that someone, probably from the XML/Web Services world, will create an engine that performs HTML entity decoding automatically, thus reintroducing the injection threat. For the time being, however, HTML entity encoding works well enough to prevent many types of injection.

Approach
We're going to implement a simple little method that encodes special characters. The nice .NET folks over at Microsoft had the foresight to build this into their platform, but the Java community seems to resist adding validation to the Java EE environment despite all the security issues that it could solve. View layers such as JavaServer Faces, Spring MVC, WebWork and others automatically perform HTML encoding through custom tags, but that encoding is often incomplete.

For example, Spring provides both HTML and JavaScript encoding functionality (the htmlEscape and javaScriptEscape attributes of spring:message) that can be set at the form element level. HTML escape functionality in Spring can also be set at the page or servlet container level. Note that Spring's default entity encoder only encodes the "big 5" and does not handle double-encoding. The code that handles this functionality was last updated in 2003.

 {"#39", new Integer(39)},  // ' - apostrophe
 {"quot", new Integer(34)}, // " - double-quote
 {"amp", new Integer(38)},  // & - ampersand
 {"lt", new Integer(60)},   // < - less-than
 {"gt", new Integer(62)},   // > - greater-than

Encoding the "big 5" serves exactly the purpose it was designed for: it prevents HTML markup from being injected through illegal characters inside tags and attribute values. However, it does not prevent more elaborate injections, it does not help with out-of-range characters when outputting to single-byte encodings, and it does not prevent characters from being reinterpreted when the user switches the browser's encoding for the displayed page.
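To make the first two limitations concrete, here is a minimal sketch. The class BigFiveDemo and the method encodeBigFive are hypothetical names invented for this illustration; this is not Spring's actual code, only an encoder that covers the same five characters as the table above. It assumes the encoded value is written into an unquoted attribute, which is one case where none of the five characters is needed for an injection.

```java
public class BigFiveDemo {

    // Hypothetical helper: encodes only the five characters listed in the
    // table above and passes everything else through untouched.
    public static String encodeBigFive(String s) {
        StringBuilder buf = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\'': buf.append("&#39;");  break;
                case '"':  buf.append("&quot;"); break;
                case '&':  buf.append("&amp;");  break;
                case '<':  buf.append("&lt;");   break;
                case '>':  buf.append("&gt;");   break;
                default:   buf.append(c);
            }
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // None of the "big 5" appear in this input, so it is not changed.
        // In an unquoted attribute it adds a new event handler.
        String attack = "x onmouseover=alert(1)";
        System.out.println("<input value=" + encodeBigFive(attack) + ">");
        // prints: <input value=x onmouseover=alert(1)>

        // Encoding already-encoded data encodes the ampersand a second time.
        System.out.println(encodeBigFive(encodeBigFive("a & b")));
        // prints: a &amp;amp; b
    }
}
```

The second call shows what "does not handle double-encoding" means in practice: the encoder cannot tell that its input has already been encoded.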

The best place for a more complete method of HTML entity encoding is in some kind of ValidationEngine, but since it's a good candidate for being static, it doesn't much matter which class it ends up in.

Note that this implementation doesn't produce named entities like &lt; or &gt;. It emits numeric character references instead, but named entities are not difficult to add with a simple lookup table.

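 public static String HTMLEntityEncode( String s ) {
     StringBuffer buf = new StringBuffer();
     // A null input gives a length of -1, so the loop is skipped and "" is returned
     int len = ( s == null ? -1 : s.length() );
     for ( int i = 0; i < len; i++ ) {
         char c = s.charAt( i );
         if ( c>='a' && c<='z' || c>='A' && c<='Z' || c>='0' && c<='9' ) {
             // ASCII letters and digits are safe to pass through unchanged
             buf.append( c );
         }
         else {
             // Everything else becomes a numeric character reference, e.g. &#60;
             buf.append( "&#" + (int)c + ";" );
         }
     }
     return buf.toString();
 }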
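The lookup table mentioned in the note above could look something like the following. This is only a sketch: the class name NamedEntityEncoder and the choice of which entities go in the table are assumptions made for the example. It also walks the string by code point rather than by char, which is a change from the method above, so that a character outside the Basic Multilingual Plane comes out as one numeric reference instead of two surrogate halves.

```java
import java.util.HashMap;
import java.util.Map;

public class NamedEntityEncoder {

    // Hypothetical lookup table: code point -> entity name.
    // Extend it with whatever named entities you want to emit.
    private static final Map<Integer, String> NAMED = new HashMap<Integer, String>();
    static {
        NAMED.put((int) '"', "quot");
        NAMED.put((int) '&', "amp");
        NAMED.put((int) '<', "lt");
        NAMED.put((int) '>', "gt");
    }

    public static String encode(String s) {
        StringBuilder buf = new StringBuilder();
        if (s == null) {
            return buf.toString(); // same result as the method above: ""
        }
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
            if (cp >= 'a' && cp <= 'z' || cp >= 'A' && cp <= 'Z' || cp >= '0' && cp <= '9') {
                buf.append((char) cp);    // safe ASCII letter or digit
            } else {
                String name = NAMED.get(cp);
                if (name != null) {
                    buf.append('&').append(name).append(';');   // e.g. &lt;
                } else {
                    buf.append("&#").append(cp).append(';');    // e.g. &#39;
                }
            }
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("<b>O'Neil & co</b>"));
        // prints: &lt;b&gt;O&#39;Neil&#32;&amp;&#32;co&lt;&#47;b&gt;
    }
}
```

Anything not in the table still falls back to a numeric reference, so the output stays as conservative as before; the table only makes the most common cases easier to read.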

Libraries

 * The Jakarta Commons Lang package has a generic class for performing a wide range of String escaping functions.
 * jTidy includes an HTMLEncode class for performing HTML encoding.