r.va.gg

Data URI + SVG

Data URIs are great when you want to serve small resources that there's no point serving up in a combined sprite. Consider microjs.com which serves up an HTML file plus a single JavaScript file containing the latest data used to build the site. The build logic is in an embedded script, the CSS is also embedded, so it's pretty lean considering what you see and the amount of data displayed. But, notice the 3 icons for each project, 2 GitHub icons and a Twitter icon. They are PNG images, combined as a sprite but to avoid an additional HTTP request to fetch them they are simply embedded in the CSS which is embedded on the page:

.title .stat span {
  background-image: url("data:image/png;base64,iVBORw0KGgoAAAANSUhE...
}

Easy and quick and fairly well supported across browsers.

But Data URIs can do so much more, including embedding SVG!

url("data:image/svg+xml,<svg viewBox='0 0 40 40' height='25' width='25'
xmlns='http://www.w3.org/2000/svg'><path fill='rgb(91, 183, 91)' d='M2.379,
14.729L5.208,11.899L12.958,19.648L25.877,6.733L28.707,9.561L12.958,25.308Z'
/></svg>")

The above will produce a 25px square image, but the SVG is drawn in a 40x40 coordinate box because I'm using Raphaël Icon paths (you can try it yourself by replacing the d='' content with the path data you get when you click on any of the icons on the Raphaël Icons page).

SVG of course gives you perfectly scalable graphics, and embedding it in a Data URI in your CSS lets you use it in the same way that you use other CSS images, minus the need to fetch it via an additional HTTP request.

What's the catch?

It's the web, of course there's a catch, and of course it involves Internet Explorer!

For a start, you don't get SVG support in IE8 and below, which is a bit of a problem right now because IE8 is still very much with us, since IE9 isn't available for Windows XP users. But there's more than that. IE adheres to the spec more strictly than other browsers. There are 2 types of encoding for Data URIs: base64 and non-base64. If you leave the ;base64 off your string then most browsers let you get away with anything that doesn't conflict with standard CSS, so basically don't use ", or if you do, escape them with a simple \". What the Data URI spec says is:

...the data (as a sequence of octets) is represented using ASCII encoding for octets inside the range of safe URL characters and using the standard %xx hex encoding of URLs for octets outside that range.

And IE doesn't let you have it any other way. So you either encode your SVG into Base64 or escape it with %xx's, which kind of loses some of the elegance of SVG in CSS. But at least you'll get IE9+ support.
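
As a rough illustration of the two spec-compliant encodings, here is a minimal Java sketch that turns an SVG string into either a Base64 or a %xx-escaped Data URI. The class and method names (DataUriSketch, base64DataUri, escapedDataUri) are made up for this example, and the escaper is deliberately conservative: it escapes every byte outside the RFC 3986 unreserved set, which is more than browsers strictly need.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration only; not part of any library.
final class DataUriSketch {
    private static final String PREFIX = "data:image/svg+xml";

    // Base64 form: data:image/svg+xml;base64,<encoded bytes>
    static String base64DataUri(String svg) {
        byte[] bytes = svg.getBytes(StandardCharsets.UTF_8);
        return PREFIX + ";base64," + Base64.getEncoder().encodeToString(bytes);
    }

    // Escaped form: every byte outside A-Z a-z 0-9 - _ . ~ becomes %XX.
    static String escapedDataUri(String svg) {
        StringBuilder out = new StringBuilder(PREFIX).append(',');
        for (byte b : svg.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9') || c == '-' || c == '_' || c == '.' || c == '~';
            if (unreserved)
                out.append((char) c);
            else
                out.append('%').append(String.format("%02X", c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String svg = "<svg xmlns='http://www.w3.org/2000/svg'/>";
        System.out.println(base64DataUri(svg));
        System.out.println(escapedDataUri(svg));
    }
}
```

Either result can be dropped inside url("...") in a stylesheet; the escaped form stays roughly readable, the Base64 form does not.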

So here's some examples to fiddle with. Click through to the CSS tab to see the gory details. The first icon is Base64 encoded, the second icon is URL escaped (%xx), the rest are just plain SVG, so you'll get different results viewing in IE9 vs the rest.

SVG in Data URIs is an elegant solution (and a bit of fun) but only really useful at the moment if you don't need to support IE8 and below.

Update 17th Sept 2012

Below in the comments, Ben reports on his (much more rigorous) research into browser support; refer to that if you're serious about using SVG in Data URIs. An interesting result of his work comes from the issue he filed with Chromium (I don't know if this is a generic WebKit thing or not but you could easily test if you're interested). It turns out that Chromium/WebKit requires Base64 Data URIs to be multiples of 4 characters, so you just need to pad the end with = characters (one or two, as needed).
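
If your Base64 comes from a tool that strips the padding, the fix is mechanical. A minimal Java sketch, with a made-up class and method name, assuming the input is otherwise valid unpadded Base64:

```java
// Hypothetical helper for illustration only.
final class Base64PadSketch {
    // Appends '=' until the length is a multiple of 4.
    static String padToMultipleOfFour(String base64) {
        int remainder = base64.length() % 4;
        if (remainder == 0)
            return base64;
        if (remainder == 1) // no valid Base64 string has this length
            throw new IllegalArgumentException("not valid Base64: " + base64);
        StringBuilder padded = new StringBuilder(base64);
        for (int i = remainder; i < 4; i++)
            padded.append('=');
        return padded.toString();
    }

    public static void main(String[] args) {
        System.out.println(padToMultipleOfFour("QQ"));
        System.out.println(padToMultipleOfFour("QUI"));
    }
}
```

Most encoders, including java.util.Base64.getEncoder(), already emit the padding, so this only matters when something has removed it.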

A mod_geoip2 that properly handles X-Forwarded-For

This is just a short follow-up to my original post on Wrangling the X-Forwarded-For Header where I promised that one of the things I would follow up with was how to get MaxMind's mod_geoip2 to handle the X-Forwarded-For according to the rule:

Always use the leftmost non-private address.

Well, since it's turning out to be such a popular post I thought I'd better get it done to help anyone else out that's searching around for solutions. So, I've put up the code on my GitHub account here:

https://github.com/rvagg/mod_geoip2_xff

I'm maintaining a maxmind branch that contains the original code from MaxMind and the master contains my changes, so you can see a nice diff of what I've done.

I have to warn that I haven't done any serious C programming for more than 15 years or so, so my code probably isn't fantastic and I'm open to outside contributions from anyone with suggestions. The approach I've taken is to embed the regexes of my previous post into the module and walk through the IP addresses looking for a non-private match.

Since my initial release, based on MaxMind's 1.2.5, they've put out a 1.2.7 which includes the addition of a GeoIPUseLastXForwardedForIP flag. I can imagine what prompted this addition but as I said in my previous post this isn't the way to get the best IP address. As of writing, my current master branch is based on 1.2.7 and has this new flag, but because the first_public_ip_in_list check is done first the flag is mostly useless.

If anyone wants to hassle MaxMind on my behalf then feel free, I sent them an email a couple of months ago about this but received no answer.

Update 6-July-2012: A new release with some changes, details here.

JavaScript and Semicolons

In syntax terms, JavaScript is in the broad C-family of languages. The C-family is diverse and includes languages such as C (obviously), C++, Objective-C, Perl, Java, C# and the newer Go from Google and Rust from Mozilla. Common themes in these languages include:

  • The use of curly braces to surround blocks.
  • The general insignificance of white space (spaces, tabs, new lines) except in very limited cases. Indentation is optional and is therefore a matter of style and preference, plus programs can be written on as few or as many lines as you want.
  • The use of semicolons to end statements, expressions and other constructs. Semicolons become the delimiter that the new line character is in white-space-significant languages.

JavaScript’s rules for curly braces, white space and semicolons are consistent with the C-family, and its formal specification, known as the ECMAScript Language Specification, makes this clear:

Certain ECMAScript statements (empty statement, variable statement, expression statement, do-while statement, continue statement, break statement, return statement, and throw statement) must be terminated with semicolons.

But it doesn’t end there: JavaScript introduces what’s known as Automatic Semicolon Insertion (ASI). The specification continues:

Such semicolons may always appear explicitly in the source text. For convenience, however, such semicolons may be omitted from the source text in certain situations. These situations are described by saying that semicolons are automatically inserted into the source code token stream in those situations.

The general C-family rules for semicolons can be found in most teaching material for JavaScript and have been advocated by most of the prominent JavaScript personalities since 1995. In a recent post, JavaScript’s inventor, Brendan Eich, described ASI as “a syntactic error correction procedure” (as in “parsing error”, rather than “user error”).

The rest of this article about semicolons in JavaScript can be found on DailyJS.

Minifying HTML in the Servlet container

Google's mod_pagespeed is great. I've been using it for a while now on feedxl.com but the only filter that I actually find really useful is Collapse Whitespace; the rest of the filters I either already do myself as part of the site build process or I don't want applied. But, I imagine that there are a lot of admins out there that would really benefit from all of the clever things it can do.

Unfortunately it's just an Apache2 module so it's a bit difficult to use the cleverness elsewhere. I recently launched a new service that serves content directly from Apache Tomcat without passing through an Apache2 web server like I would normally do (because there was just no need!). Having got used to the nice whitespace optimisations you can get from mod_pagespeed I decided to implement a simple version of my own for Tomcat. With dynamic content you're better off not trying to optimise your whitespace during generation; leave it for post-processing so your logic stays clear.

So, enter HTMLMinifyFilter. It's nowhere near as clever as mod_pagespeed but it'll do for basic needs. The core of it is a regular expression that will remove certain patterns and it's configurable so you decide which patterns to include.

package au.com.xprime.misc.webapp.filter;

import java.io.*;
import java.util.regex.*;
import javax.servlet.*;
import javax.servlet.http.*; // HttpServletResponse, HttpServletResponseWrapper

public class HTMLMinifyFilter implements Filter {
	private Pattern regex = null;

	public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain) throws IOException, ServletException {
		HttpServletResponse response = (HttpServletResponse) res;
		ResponseWrapper wrapper = new ResponseWrapper(response);
		chain.doFilter(req, wrapper);
		String html = wrapper.toString();
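		// Only rewrite HTML responses; everything else passes through unchanged.
		if (regex != null && response.getContentType() != null && response.getContentType().startsWith("text/html"))
			html = regex.matcher(html).replaceAll("");
		// Count bytes in the response's own encoding, not the platform default charset.
		response.setContentLength(html.getBytes(response.getCharacterEncoding()).length);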
		PrintWriter out = response.getWriter();
		out.write(html);
		out.close();
	}

	public void destroy() {
	}

	public void init(FilterConfig config) throws ServletException {
		StringBuffer pattern = new StringBuffer();
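		// Every match is replaced with an empty string, so anything that must survive
		// has to sit in a zero-width assertion rather than inside the match itself.
		appendIf(config, "strip-linestart-whitespace", pattern, "(?<=^)[ \\t]+");
		appendIf(config, "strip-lineend-whitespace", pattern, "[ \\t]+(?:$)");
		// A look-ahead (?=...) keeps the last character of a run; a plain (?:...) group would consume it.
		appendIf(config, "strip-multiple-whitespace", pattern, "([ \\t](?=[ \\t]))+");
		appendIf(config, "strip-blank-lines", pattern, "(\\n[ \\t]*(?=\\n))+");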
		if (pattern.length() != 0)
			regex = Pattern.compile(pattern.toString(), Pattern.MULTILINE);
	}

	private void appendIf(FilterConfig config, String configKey, StringBuffer pattern, String s) {
		if (config.getInitParameter(configKey) != null && config.getInitParameter(configKey).equals("true")) {
			if (pattern.length() != 0)
				pattern.append('|');
			pattern.append(s);
		}
	}

	static class ResponseWrapper extends HttpServletResponseWrapper {
		private CharArrayWriter output;

		public ResponseWrapper(HttpServletResponse response) {
			super(response);
			this.output = new CharArrayWriter();
		}

		public String toString() {
			return output.toString();
		}

		public PrintWriter getWriter() {
			return new PrintWriter(output);
		}
	}
}

How does it work?

We start off by wrapping our response in an object that will supply a CharArrayWriter so we can capture and process whatever the rest of the stack is doing (credit for this idea goes here). We can then process the output with our regular expression(s) and pass it to the real response.
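
Stripped of the servlet API, the capture-then-post-process idea looks roughly like the sketch below. It is only an illustration: CaptureSketch and captureAndProcess are made-up names, and the Consumer stands in for whatever the rest of the filter chain would write.

```java
import java.io.CharArrayWriter;
import java.io.PrintWriter;
import java.util.function.Consumer;
import java.util.function.UnaryOperator;

// Hypothetical stand-alone illustration; the real filter does this through a response wrapper.
final class CaptureSketch {
    static String captureAndProcess(Consumer<PrintWriter> downstream, UnaryOperator<String> postProcess) {
        CharArrayWriter buffer = new CharArrayWriter(); // in-memory sink instead of the real response
        PrintWriter writer = new PrintWriter(buffer);
        downstream.accept(writer);                      // the "rest of the stack" writes here
        writer.flush();
        return postProcess.apply(buffer.toString());    // rewrite before handing on
    }

    public static void main(String[] args) {
        String result = captureAndProcess(w -> w.print("  <p>hello</p>  "), String::trim);
        System.out.println(result);
    }
}
```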

Before I explain what the regular expressions do I want to caution that this won't be satisfactory in certain situations. It's not aware of <script>, <pre> or any other content where whitespace may be important, so unless you're sure stripping whitespace doesn't matter you may want to find a more intelligent solution.

I've split the regex up into 4 optional parts that you turn on with init-parameters (explained later). Matches of each of these are replaced with an empty string:

strip-linestart-whitespace - (?<=^)[ \t]+

This regex will match whitespace at the beginning of any line. You'll notice that I'm not using \s for my whitespace match, this is because with multi-line pattern matching it'll also match \n and \r which we want to handle separately. The (?<=^) at the beginning is a non-capturing positive look-behind for line-start; so it'll match the start of the line but won't include it in our returned match-group so we only strip out the whitespace.

This option is likely to make the biggest impact on HTML minification on dynamic content because we love to use indentation to define structure.

strip-lineend-whitespace - [ \t]+(?:$)

Same deal as the linestart regex but this time we have (?:$), a non-capturing group around the line-end anchor. $ is zero-width, so the whitespace is matched right up to the end of the line without the line terminator becoming part of the match.

This will pick up any sloppiness in your HTML (I wish I could do this in Microsoft Word when I have to edit other people's documents, you can't see it, but it's still there!).

strip-multiple-whitespace - ([ \t](?=[ \t]))+

Here we have a group of one or more whitespace characters, each followed by another whitespace character that is checked with a positive look-ahead, (?=[ \t]), and so is not part of the match. Remember that we are replacing matches with an empty string, so the last character of a run has to stay outside the match to leave one intact. A plain non-capturing group, (?:[ \t]), won't do here: it still consumes the character, so a run of exactly two spaces would be removed entirely.

This is probably going to be the most dangerous if you might have content where whitespace is important, e.g. <script>, <pre>.

strip-blank-lines - (\n[ \t]*(?=\n))+

This is very similar to the multiple-whitespace regex but we match a newline, followed by zero or more whitespace characters, followed by a newline that is checked with a look-ahead, (?=\n), all repeated one or more times. So we'll get rid of any lines that don't contain content while keeping the final newline, so the surrounding lines aren't joined together.
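
To see the four parts working together outside a servlet container, here is a small stand-alone sketch. The class name is made up, and it assumes the look-ahead forms, (?=[ \t]) and (?=\n), for the last two patterns so that one space and one newline are left behind.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; combines the four patterns the same way init() does.
final class WhitespaceStripSketch {
    static final Pattern MINIFY = Pattern.compile(
            "(?<=^)[ \\t]+"              // strip-linestart-whitespace
            + "|[ \\t]+(?:$)"            // strip-lineend-whitespace
            + "|([ \\t](?=[ \\t]))+"     // strip-multiple-whitespace (keeps the last one)
            + "|(\\n[ \\t]*(?=\\n))+",   // strip-blank-lines (keeps the last newline)
            Pattern.MULTILINE);

    static String minify(String html) {
        return MINIFY.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<ul>\n    <li>a   b</li>  \n\n\n</ul>\n";
        System.out.print(minify(html));
    }
}
```

With that input the indentation, the trailing spaces, the extra spaces between a and b, and the blank lines all disappear, leaving three lines: <ul>, <li>a b</li> and </ul>.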

Configuration

You simply put the filter into your classpath somewhere and wire it up in web.xml. You first define the filter reference and any parameters:

<filter>
	<filter-name>htmlMinifyFilter</filter-name>
	<filter-class>au.com.xprime.misc.webapp.filter.HTMLMinifyFilter</filter-class>
	<init-param>
		<param-name>strip-linestart-whitespace</param-name>
		<param-value>true</param-value>
	</init-param>
	<init-param>
		<param-name>strip-lineend-whitespace</param-name>
		<param-value>true</param-value>
	</init-param>
	<init-param>
		<param-name>strip-multiple-whitespace</param-name>
		<param-value>true</param-value>
	</init-param>
	<init-param>
		<param-name>strip-blank-lines</param-name>
		<param-value>true</param-value>
	</init-param>
</filter>

Any of the parameters can be set to false or omitted altogether to turn it off.

Then you need to wire up the filter to any incoming URIs which is done just like servlet-mapping (but still hopelessly unhelpful, why can't we have proper regular expressions for these??). You'll notice that I'm only using a Writer so even though it checks for a text/html response before it does any rewriting you won't want it touching any binary data because we don't wrap getOutputStream(). So, either make sure the filter only gets applied to text/html URIs or modify the filter to be binary-safe. I only have a few URIs that I want to apply this to so I've put them in manually with one of these per URI:

<filter-mapping>
	<filter-name>htmlMinifyFilter</filter-name>
	<url-pattern>/myuri</url-pattern>
</filter-mapping>

But you can also do the simple url-pattern matching with *.ext or /*, etc.

And there you go! Cheap and easy HTML minification from within the Servlet container.

Handling X-Forwarded-For in Java and Tomcat

This is the first follow-up to my post on X-Forwarded-For, I'll assume you've at least scanned that article.

Revision of the security issues

It's important to recap the security message of my previous post. Don't assume that the content of the X-Forwarded-For header is either correct or syntactically valid. The header is not hard to spoof and there are only certain situations where you may be able to trust parts of the content of the header.

So, my simple advice is not to use this header for anything important. Don't use it for authentication purposes or anything else that has security implications. It really should only be used for your own information purposes or to provide customised content for the user where it's OK to be basing that customisation on false information, because this will be a possibility.

We use it on FeedXL for IP address geolocation using GeoIP to serve country specific information to visitors. Ultimately it doesn't really matter a whole lot if we get it wrong; while there are differences in the content the differences aren't major. It may cause some confusion but that confusion can be resolved if the customer wants to contact us. You sign up to FeedXL based on your country but we still let you select your country from a list even though we pre-select the one we guess from your IP address. And if you sign up to the wrong country then you won't get access to the correct database for your country; hardly a major security issue, more of an inconvenience. If you're spoofing X-Forwarded-For then you're probably not the kind of person who's going to get confused at the content, you're probably just poking around and are not really interested in our product anyway!

Extracting a useful IP address

I ended my last post with a generalised rule for extracting the most likely useful IP address from the X-Forwarded-For header:

Always use the leftmost non-private address.

And I gave a couple of regular expressions to help with this process:

([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) or (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})

to match an IP address, and

(^127\.0\.0\.1)|(^10\.)|(^172\.1[6-9]\.)|(^172\.2[0-9]\.)|(^172\.3[0-1]\.)|(^192\.168\.)

to match a private IP address.

Java use cases

In my Java code I have 2 uses for the IP address from X-Forwarded-For, both of these come up because we're working behind a load balancer (Amazon's Elastic Load Balancing) and don't have direct access to the remote host information:

  • Looking up the country information in the GeoIP database using their Java API. Most of our use of GeoIP is with mod_geoip in Apache but we also want to occasionally use it from within a servlet. For example, on our sign-up page we pre-select the country at the top of the page based on your IP address, this is done within Java.
  • More interesting logging from Tomcat: if I want to have AccessLogValve turned on, the host information isn't very interesting behind a load balancer.

A generic parser would serve both of these purposes!

Parsing X-Forwarded-For

I have created a simple utility class to do the parsing, called from wherever I need either an IP address or a hostname.

package au.com.xprime.webapp.util;

import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.servlet.http.HttpServletRequest;

public class InetAddressUtil {
	private static final String IP_ADDRESS_REGEX = "([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})";
	private static final String PRIVATE_IP_ADDRESS_REGEX = "(^127\\.0\\.0\\.1)|(^10\\.)|(^172\\.1[6-9]\\.)|(^172\\.2[0-9]\\.)|(^172\\.3[0-1]\\.)|(^192\\.168\\.)";
	private static Pattern IP_ADDRESS_PATTERN = null;
	private static Pattern PRIVATE_IP_ADDRESS_PATTERN = null;

	private static String findNonPrivateIpAddress(String s) {
		if (IP_ADDRESS_PATTERN == null) {
			IP_ADDRESS_PATTERN = Pattern.compile(IP_ADDRESS_REGEX);
			PRIVATE_IP_ADDRESS_PATTERN = Pattern.compile(PRIVATE_IP_ADDRESS_REGEX);
		}
		Matcher matcher = IP_ADDRESS_PATTERN.matcher(s);
		while (matcher.find()) {
			if (!PRIVATE_IP_ADDRESS_PATTERN.matcher(matcher.group(0)).find())
				return matcher.group(0);
			matcher.region(matcher.end(), s.length());
		}
		return null;
	}

	public static String getAddressFromRequest(HttpServletRequest request) {
		String forwardedFor = request.getHeader("X-Forwarded-For");
		if (forwardedFor != null && (forwardedFor = findNonPrivateIpAddress(forwardedFor)) != null)
			return forwardedFor;
		return request.getRemoteAddr();
	}

	public static String getHostnameFromRequest(HttpServletRequest request) {
		String addr = getAddressFromRequest(request);
		try {
			return Inet4Address.getByName(addr).getHostName();
		} catch (Exception e) {
		}
		return addr;
	}

	public static InetAddress getInet4AddressFromRequest(HttpServletRequest request) throws UnknownHostException {
		return Inet4Address.getByName(getAddressFromRequest(request));
	}
}

(Download here)

Given an HttpServletRequest we can call either getAddressFromRequest() or getHostnameFromRequest() to get the data we need.

We first use the general IP address regular expression and, in the while loop of findNonPrivateIpAddress(), step through each match we find, starting from the left of the string. This way we don't even look at the commas in the string and don't care if there are any spaces or not. We also get to avoid any nonsense data that may be in the string. If you spoof the header with a random string of characters then it'll be ignored. The code is quite strict in that it'll only bother with non-private IP addresses in the header, otherwise it will resort to the remote address of the request as a fall-back.
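
To make the behaviour concrete without a servlet container, the matching loop can be restated on its own. This is a sketch only: the class and method names are made up, the two regular expressions are the ones from the utility class, and the sample addresses come from the documentation ranges.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone restatement of the "leftmost non-private address" rule.
final class LeftmostPublicIpSketch {
    private static final Pattern IP = Pattern.compile(
            "([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})");
    private static final Pattern PRIVATE_IP = Pattern.compile(
            "(^127\\.0\\.0\\.1)|(^10\\.)|(^172\\.1[6-9]\\.)|(^172\\.2[0-9]\\.)|(^172\\.3[0-1]\\.)|(^192\\.168\\.)");

    // Returns the first address-shaped token that is not private, or null if there is none.
    static String leftmostPublic(String header) {
        Matcher m = IP.matcher(header);
        while (m.find()) {                       // find() resumes after the previous match
            String candidate = m.group(1);
            if (!PRIVATE_IP.matcher(candidate).find())
                return candidate;
        }
        return null;                             // caller falls back to the remote address
    }

    public static void main(String[] args) {
        System.out.println(leftmostPublic("10.0.0.1, 203.0.113.7, 198.51.100.2"));
        System.out.println(leftmostPublic("unknown, 192.168.0.10"));
        System.out.println(leftmostPublic("complete nonsense"));
    }
}
```

The first call prints 203.0.113.7; the other two print null because the header contains no public address. Note that the pattern only checks the shape of an address, so something like 999.1.1.1 would still be returned.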

Our hostname resolution is also prepared for failure and will return the original IP address if it can't get you a hostname.

Instead of just calling request.getRemoteAddr() and request.getRemoteHost() from your own code, you'd simply call InetAddressUtil.getAddressFromRequest(request) and InetAddressUtil.getHostnameFromRequest(request).

Extending Tomcat logging

You enable request logging in Tomcat by attaching an AccessLogValve to your context or host. It mirrors the custom formatting options that you'll find in Apache's CustomLog. So, you can print out a %h for the request hostname but behind a load balancer you'll just get the name or address of the load balancer that's forwarding the request. You could also just use %{X-Forwarded-For}i to get access to the raw header value, but this will either just be an IP address or a comma separated string of IP addresses. This may be useful for your purposes but not mine, I want a hostname!

Unfortunately, AccessLogValve doesn't lend itself to easy extension. There are two createAccessLogElement() methods that you'd ideally be able to override in your own subclass to return a new custom AccessLogElement for the character you've chosen to represent your log element, but their access modifiers don't allow it.

The best we can do is override the protected createLogElements and copy the functionality from there and extend with our own. However, in my extension of AccessLogValve I've assumed that the Tomcat boys will eventually fix the access modifiers for the createAccessLogElement() methods so I've just copied the whole class, named it AccessLogValve_ and changed the modifiers myself. The plan being to remove this in the future and take the _ off the extended class name in my code.

Here's my extended AccessLogValve

package au.com.xprime.catalina.valves;

import java.util.Date;
import org.apache.catalina.connector.Request;
import org.apache.catalina.connector.Response;
import au.com.xprime.webapp.util.InetAddressUtil;

public class AccessLogValve extends org.apache.catalina.valves.AccessLogValve_ {
	protected class ForwardedForAddrElement implements AccessLogElement {
		public void addElement(StringBuffer buf, Date date, Request request, Response response, long time) {
			buf.append(InetAddressUtil.getAddressFromRequest(request));
		}
	}
	protected class ForwardedForHostElement extends ForwardedForAddrElement {
		public void addElement(StringBuffer buf, Date date, Request request, Response response, long time) {
			buf.append(InetAddressUtil.getHostnameFromRequest(request));
		}
	}

	protected AccessLogElement createAccessLogElement(char pattern) {
		AccessLogElement accessLogElement = super.createAccessLogElement(pattern);
		if (accessLogElement instanceof StringElement) {
			switch (pattern) {
				case 'f' :
					return new ForwardedForAddrElement();
				case 'F' :
					return new ForwardedForHostElement();
			}
		}
		return accessLogElement;
	}
}

(Download here and AccessLogValve_ here)

Which gives me %f for the X-Forwarded-For IP address and %F for the X-Forwarded-For hostname. My valve pattern looks like this:

pattern="%F %f %h %l %u %t &quot;%r&quot; %s %b &quot;%{Referer}i&quot; &quot;%{User-Agent}i&quot;"

Simply compile, place together in a JAR, put it in your Tomcat lib directory then make sure you use the right class name when building your AccessLogValve descriptor. The lazy can find a JAR (including source) here.

Next I'll be getting dirty with C and hacking mod_geoip to do something similar.