r.va.gg

A mod_geoip2 that properly handles X-Forwarded-For

This is just a short follow-up to my original post on Wrangling the X-Forwarded-For Header where I promised that one of the things I would follow up with was how to get MaxMind's mod_geoip2 to handle the X-Forwarded-For according to the rule:

Always use the leftmost non-private address.

Well, since it's turning out to be such a popular post I thought I'd better get it done to help anyone else out that's searching around for solutions. So, I've put up the code on my GitHub account here:

https://github.com/rvagg/mod_geoip2_xff

I'm maintaining a maxmind branch that contains the original code from MaxMind and the master contains my changes, so you can see a nice diff of what I've done.

I have to warn that I haven't done any serious C programming for more than 15 years or so, my code probably isn't fantastic, and I'm open to outside contributions from anyone with suggestions. The approach I've taken is to embed the regexes of my previous post into the module and walk through the IP addresses looking for a non-private match.

Since my initial release, based on MaxMind's 1.2.5, they've put out a 1.2.7 which includes the addition of a GeoIPUseLastXForwardedForIP flag. I can imagine what prompted this addition but as I said in my previous post this isn't the way to get the best IP address. As of writing, my current master branch is based on 1.2.7 and has this new flag but because the first_public_ip_in_list is done first it's mostly useless.

If anyone wants to hassle MaxMind on my behalf then feel free, I sent them an email a couple of months ago about this but received no answer.

Update 6-July-2012: A new release with some changes, details here.

JavaScript and Semicolons

In syntax terms, JavaScript is in the broad C-family of languages. The C-family is diverse and includes languages such as C (obviously), C++, Objective-C, Perl, Java, C# and the newer Go from Google and Rust from Mozilla. Common themes in these languages include:

  • The use of curly braces to surround blocks.
  • The general insignificance of white space (spaces, tabs, new lines) except in very limited cases. Indentation is optional and is therefore a matter of style and preference, plus programs can be written on as few or as many lines as you want.
  • The use of semicolons to end statements, expressions and other constructs. Semicolons become the delimiter that the new line character is in white-space-significant languages.
JavaScript’s rules for curly braces, white space and semicolons are consistent with the C-family, and its formal specification, known as the ECMAScript Language Specification, makes this clear:

Certain ECMAScript statements (empty statement, variable statement, expression statement, do-while statement, continue statement, break statement, return statement, and throw statement) must be terminated with semicolons.
But it doesn’t end there–JavaScript introduces what’s known as Automatic Semicolon Insertion (ASI). The specification continues:

Such semicolons may always appear explicitly in the source text. For convenience, however, such semicolons may be omitted from the source text in certain situations. These situations are described by saying that semicolons are automatically inserted into the source code token stream in those situations.
The general C-family rules for semicolons can be found in most teaching material for JavaScript and have been advocated by most of the prominent JavaScript personalities since 1995. In a recent post, JavaScript’s inventor, Brendan Eich, described ASI as “a syntactic error correction procedure” (as in “parsing error”, rather than “user error”).

The rest of this article about semicolons in JavaScript can be found on DailyJS.

Minifying HTML in the Servlet container

Google's mod_pagespeed is great. I've been using it for a while now on feedxl.com but the only filter that I actually find really useful is Collapse Whitespace; the rest of the filters I either already do myself as part of the site build process or I don't want applied. But, I imagine that there are a lot of admins out there that would really benefit from all of the clever things it can do.

Unfortunately it's just an Apache2 module so it's a bit difficult to use the cleverness elsewhere. I recently launched a new service that serves content directly from Apache Tomcat without passing through an Apache2 web server like I would normally do (because there was just no need!). Having got used to the nice whitespace optimisations you can get from mod_pagespeed I decided to implement a simple version of my own for Tomcat. With dynamic content you're better off not trying to optimise your whitespace during generation; leave it for post-processing so your logic can stay clear.

So, enter HTMLMinifyFilter. It's nowhere near as clever as mod_pagespeed but it'll do for basic needs. The core of it is a regular expression that will remove certain patterns and it's configurable so you decide which patterns to include.

package au.com.xprime.misc.webapp.filter;

import java.io.*;
import java.util.regex.*;
import javax.servlet.*;
import javax.servlet.http.*; // HttpServletResponse and HttpServletResponseWrapper live here

public class HTMLMinifyFilter implements Filter {
    private Pattern regex = null;

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain) throws IOException, ServletException {
        HttpServletResponse response = (HttpServletResponse) res;
        ResponseWrapper wrapper = new ResponseWrapper(response);
        chain.doFilter(req, wrapper);
        String html = wrapper.toString();
        if (regex != null && response.getContentType() != null && response.getContentType().startsWith("text/html"))
            html = regex.matcher(html).replaceAll("");
        response.setContentLength(html.getBytes().length);
        PrintWriter out = response.getWriter();
        out.write(html);
        out.close();
    }

    public void destroy() {
    }

    public void init(FilterConfig config) throws ServletException {
        StringBuffer pattern = new StringBuffer();
        appendIf(config, "strip-linestart-whitespace", pattern, "(?<=^)[ \\t]+");
        appendIf(config, "strip-lineend-whitespace", pattern, "[ \\t]+(?:$)");
        appendIf(config, "strip-multiple-whitespace", pattern, "([ \\t](?:[ \\t]))+");
        appendIf(config, "strip-blank-lines", pattern, "(\\n[ \\t]*(?:\\n))+");
        if (pattern.length() != 0)
            regex = Pattern.compile(pattern.toString(), Pattern.MULTILINE);
    }

    private void appendIf(FilterConfig config, String configKey, StringBuffer pattern, String s) {
        if (config.getInitParameter(configKey) != null && config.getInitParameter(configKey).equals("true")) {
            if (pattern.length() != 0)
                pattern.append('|');
            pattern.append(s);
        }
    }

    static class ResponseWrapper extends HttpServletResponseWrapper {
        private CharArrayWriter output;

        public ResponseWrapper(HttpServletResponse response) {
            super(response);
            this.output = new CharArrayWriter();
        }

        public String toString() {
            return output.toString();
        }

        public PrintWriter getWriter() {
            return new PrintWriter(output);
        }
    }
}

How does it work?

We start off by wrapping our response in an object that will supply a CharArrayWriter so we can capture and process whatever the rest of the stack is doing (credit for this idea goes here). We can then process the output with our regular expression(s) and pass it to the real response.

Before I explain what the regular expressions do I want to caution that this won't be satisfactory in certain situations. It's not aware of <script>, <pre> or any other content where whitespace may be important, so unless you're sure stripping whitespace doesn't matter you may want to find a more intelligent solution.

I've split the regex up into 4 optional parts which you turn on with init-parameters (explained later). Matches of each of these are replaced with an empty string:

strip-linestart-whitespace - (?<=^)[ \t]+

This regex will match whitespace at the beginning of any line. You'll notice that I'm not using \s for my whitespace match, this is because with multi-line pattern matching it'll also match \n and \r which we want to handle separately. The (?<=^) at the beginning is a non-capturing positive look-behind for line-start; so it'll match the start of the line but won't include it in our returned match-group so we only strip out the whitespace.

This option is likely to make the biggest impact on HTML minification on dynamic content because we love to use indentation to define structure.

strip-lineend-whitespace - [ \t]+(?:$)

Same deal as the linestart regex but this time we have (?:$), a non-capturing group around the line-end anchor. Since $ is zero-width, only the whitespace itself ends up in the match.

This will pick up any sloppiness in your HTML (I wish I could do this in Microsoft Word when I have to edit other people's documents, you can't see it, but it's still there!).

strip-multiple-whitespace - ([ \t](?:[ \t]))+

Here we match a whitespace character followed by another whitespace character, repeated one or more times. The aim is to avoid stripping out all whitespace: because matches are replaced with an empty string, the second character of each pair has to survive the replacement so that one space is left intact.

This is probably going to be the most dangerous if you might have content where whitespace is important, e.g. <script>, <pre>.

strip-blank-lines - (\n[ \t]*(?:\n))+

This is very similar to the multiple-whitespace regex but we match a newline, followed by zero or more whitespace characters, followed by a non-captured newline, all repeated one or more times. So we'll get rid of any lines that don't contain content.
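To check what the patterns actually remove, it helps to run them outside a servlet container. The following is a minimal sketch rather than the filter's own code: the class name MinifyPatternDemo and the sample HTML are made up for illustration, and it assumes look-aheads, (?=...), in place of the (?:...) groups, because a non-capturing group still consumes the characters it matches whereas a look-ahead leaves them in the output.

```java
import java.util.regex.Pattern;

class MinifyPatternDemo {
    // Four alternatives joined with '|', in the same order as the filter builds them.
    // Assumption: look-aheads are used so the character after each match is kept.
    private static final Pattern REGEX = Pattern.compile(
        "^[ \\t]+"                     // whitespace at the start of a line
        + "|[ \\t]+(?=$)"              // whitespace at the end of a line
        + "|[ \\t]+(?=[ \\t])"         // all but the last character of a whitespace run
        + "|(?:\\n[ \\t]*)+(?=\\n)",   // blank lines, keeping the final newline
        Pattern.MULTILINE);

    static String minify(String html) {
        // Every match is replaced with an empty string, as in the filter.
        return REGEX.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<div>\n    <p>a   b</p>  \n\n\n</div>\n";
        System.out.print(minify(html));
    }
}
```

With the sample above, the indentation, the trailing spaces and the blank lines go, and the run of spaces between a and b collapses to a single space.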

Configuration

You simply put the filter into your classpath somewhere and wire it up in web.xml. You first define the filter reference and any parameters:

<filter>
    <filter-name>htmlMinifyFilter</filter-name>
    <filter-class>au.com.xprime.misc.webapp.filter.HTMLMinifyFilter</filter-class>
    <init-param>
        <param-name>strip-linestart-whitespace</param-name>
        <param-value>true</param-value>
    </init-param>
    <init-param>
        <param-name>strip-lineend-whitespace</param-name>
        <param-value>true</param-value>
    </init-param>
    <init-param>
        <param-name>strip-multiple-whitespace</param-name>
        <param-value>true</param-value>
    </init-param>
    <init-param>
        <param-name>strip-blank-lines</param-name>
        <param-value>true</param-value>
    </init-param>
</filter>

Any of the parameters can be set to false or omitted altogether to turn it off.

Then you need to wire up the filter to any incoming URIs which is done just like servlet-mapping (but still hopelessly unhelpful, why can't we have proper regular expressions for these??). You'll notice that I'm only using a Writer so even though it checks for a text/html response before it does any rewriting you won't want it touching any binary data because we don't wrap getOutputStream(). So, either make sure the filter only gets applied to text/html URIs or modify the filter to be binary-safe. I only have a few URIs that I want to apply this to so I've put them in manually with one of these per URI:

<filter-mapping>
    <filter-name>htmlMinifyFilter</filter-name>
    <url-pattern>/myuri</url-pattern>
</filter-mapping>

But you can also do the simple url-pattern matching with *.ext or /*, etc.

And there you go! Cheap and easy HTML minification from within the Servlet container.

Handling X-Forwarded-For in Java and Tomcat

This is the first follow-up to my post on X-Forwarded-For; I'll assume you've at least scanned that article.

Revision of the security issues

It's important to recap the security message of my previous post. Don't assume that the content of the X-Forwarded-For header is either correct or syntactically valid. The header is not hard to spoof and there are only certain situations where you may be able to trust parts of the content of the header.

So, my simple advice is not to use this header for anything important. Don't use it for authentication purposes or anything else that has security implications. It really should only be used for your own information purposes or to provide customised content for the user where it's OK to be basing that customisation on false information, because this will be a possibility.

We use it on FeedXL for IP address geolocation using GeoIP to serve country specific information to visitors. Ultimately it doesn't really matter a whole lot if we get it wrong; while there are differences in the content the differences aren't major. It may cause some confusion but that confusion can be resolved if the customer wants to contact us. You sign up to FeedXL based on your country but we still let you select your country from a list even though we pre-select the one we guess from your IP address. And if you sign up to the wrong country then you won't get access to the correct database for your country; hardly a major security issue, more of an inconvenience. If you're spoofing X-Forwarded-For then you're probably not the kind of person who's going to get confused at the content, you're probably just poking around and are not really interested in our product anyway!

Extracting a useful IP address

I ended my last post with a generalised rule for extracting the most likely useful IP address from the X-Forwarded-For header:

Always use the leftmost non-private address.
And I gave a couple of regular expressions to help with this process: ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) or (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) to match an IP address, and (^127\.0\.0\.1)|(^10\.)|(^172\.1[6-9]\.)|(^172\.2[0-9]\.)|(^172\.3[0-1]\.)|(^192\.168\.) to match a private IP address.

Java use cases

In my Java code I have 2 uses for the IP address from X-Forwarded-For, both of these come up because we're working behind a load balancer (Amazon's Elastic Load Balancing) and don't have direct access to the remote host information:

  • Looking up the country information in the GeoIP database using their Java API. Most of our use of GeoIP is with mod_geoip in Apache but we also want to occasionally use it from within a servlet. For example, on our sign-up page we pre-select the country at the top of the page based on your IP address, this is done within Java.
  • More interesting logging from Tomcat: if I want to have AccessLogValve turned on, the host information isn't very interesting behind a load balancer.
A generic parser would serve both of these purposes!

Parsing X-Forwarded-For

I have created a simple utility class to do the parsing, called from wherever I need either an IP address or a hostname.

package au.com.xprime.webapp.util;

import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.servlet.http.HttpServletRequest;

public class InetAddressUtil {
    private static final String IP_ADDRESS_REGEX = "([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})";
    private static final String PRIVATE_IP_ADDRESS_REGEX = "(^127\\.0\\.0\\.1)|(^10\\.)|(^172\\.1[6-9]\\.)|(^172\\.2[0-9]\\.)|(^172\\.3[0-1]\\.)|(^192\\.168\\.)";
    private static Pattern IP_ADDRESS_PATTERN = null;
    private static Pattern PRIVATE_IP_ADDRESS_PATTERN = null;

    private static String findNonPrivateIpAddress(String s) {
        if (IP_ADDRESS_PATTERN == null) {
            IP_ADDRESS_PATTERN = Pattern.compile(IP_ADDRESS_REGEX);
            PRIVATE_IP_ADDRESS_PATTERN = Pattern.compile(PRIVATE_IP_ADDRESS_REGEX);
        }
        Matcher matcher = IP_ADDRESS_PATTERN.matcher(s);
        while (matcher.find()) {
            if (!PRIVATE_IP_ADDRESS_PATTERN.matcher(matcher.group(0)).find())
                return matcher.group(0);
            matcher.region(matcher.end(), s.length());
        }
        return null;
    }

    public static String getAddressFromRequest(HttpServletRequest request) {
        String forwardedFor = request.getHeader("X-Forwarded-For");
        if (forwardedFor != null && (forwardedFor = findNonPrivateIpAddress(forwardedFor)) != null)
            return forwardedFor;
        return request.getRemoteAddr();
    }

    public static String getHostnameFromRequest(HttpServletRequest request) {
        String addr = getAddressFromRequest(request);
        try {
            return Inet4Address.getByName(addr).getHostName();
        } catch (Exception e) {
        }
        return addr;
    }

    public static InetAddress getInet4AddressFromRequest(HttpServletRequest request) throws UnknownHostException {
        return Inet4Address.getByName(getAddressFromRequest(request));
    }
}

(Download here)

Given an HttpServletRequest we can call either getAddressFromRequest() or getHostnameFromRequest() to get the data we need.

We first use the general IP address regular expression and, in the while loop of findNonPrivateIpAddress(), step through each match we find, starting from the left of the string. This way we don't even look at the commas in the string and don't care if there are any spaces or not. We also get to avoid any nonsense data that may be in the string. If you spoof the header with a random string of characters then it'll be ignored. The code is quite strict in that it'll only bother with non-private IP addresses in the header, otherwise it will resort to the remote address of the request as a fall-back.

Our hostname resolution is also prepared for failure and will return the original IP address if it can't get you a hostname.

Instead of just calling request.getRemoteAddr() and request.getRemoteHost() from our own code, you'd simply wrap them in InetAddressUtil.getAddressFromRequest(request) and InetAddressUtil.getHostnameFromRequest(request).

Extending Tomcat logging

You enable request logging in Tomcat by attaching an AccessLogValve to your context or host. It mirrors the custom formatting options that you'll find in Apache's CustomLog. So, you can print out a %h for the request hostname but behind a load balancer you'll just get the name or address of the load balancer that's forwarding the request. You could also just use %{X-Forwarded-For}i to get access to the raw header value, but this will either just be an IP address or a comma separated string of IP addresses. This may be useful for your purposes but not mine, I want a hostname!

Unfortunately, AccessLogValve doesn't lend itself to easy extension. There are two createAccessLogElement() methods that you'd ideally be able to override in your own subclass to return a new custom AccessLogElement for the character you've chosen to represent your log element.

The best we can do is override the protected createLogElements() and copy the functionality from there and extend it with our own. However, in my extension of AccessLogValve I've assumed that the Tomcat boys will eventually fix the access modifiers for the createAccessLogElement() methods so I've just copied the whole class, named it AccessLogValve_ and changed the modifiers myself. The plan being to remove this in the future and take the _ off the extended class name in my code.

Here's my extended AccessLogValve

package au.com.xprime.catalina.valves;

import java.util.Date;
import org.apache.catalina.connector.Request;
import org.apache.catalina.connector.Response;
import au.com.xprime.webapp.util.InetAddressUtil;

public class AccessLogValve extends org.apache.catalina.valves.AccessLogValve_ {
    protected class ForwardedForAddrElement implements AccessLogElement {
        public void addElement(StringBuffer buf, Date date, Request request, Response response, long time) {
            buf.append(InetAddressUtil.getAddressFromRequest(request));
        }
    }
    protected class ForwardedForHostElement extends ForwardedForAddrElement {
        public void addElement(StringBuffer buf, Date date, Request request, Response response, long time) {
            buf.append(InetAddressUtil.getHostnameFromRequest(request));
        }
    }

    protected AccessLogElement createAccessLogElement(char pattern) {
        AccessLogElement accessLogElement = super.createAccessLogElement(pattern);
        if (accessLogElement instanceof StringElement) {
            switch (pattern) {
                case 'f' :
                    return new ForwardedForAddrElement();
                case 'F' :
                    return new ForwardedForHostElement();
            }
        }
        return accessLogElement;
    }
}

(Download here and AccessLogValve here: http://src.vagg.org/java/AccessLogValve.java)

Which gives me %f for the X-Forwarded-For IP address and %F for the X-Forwarded-For hostname. My valve pattern looks like this:

pattern="%F %f %h %l %u %t &quot;%r&quot; %s %b &quot;%{Referer}i&quot; &quot;%{User-Agent}i&quot;"

Simply compile, place together in a JAR, put it in your Tomcat lib directory then make sure you use the right class name when building your AccessLogValve descriptor. The lazy can find a JAR (including source) here.

Next I'll be getting dirty with C and hacking mod_geoip to do something similar.

Wrangling the X-Forwarded-For Header

Until recently, we've served pages directly from the server for FeedXL.com but we've since moved to a load balancing situation with multiple servers behind a load balancer.

AWS & ELB

We use Amazon Web Services to host FeedXL and are now using their Elastic Load Balancing (ELB) service to spread the load across 3 Availability Zones in the main datacentre we operate from. We're doing this primarily for high availability purposes rather than to handle heavy load but the added benefit is that it lets us scale up really easily if we have any sudden spikes in our traffic. We're using some small instances at the front using Apache to handle the main traffic. The dynamic content is passed on to larger back-end instances running our webapp in Tomcat.

A couple of our important EBS volumes were among the last to be restored during Judgement Day, April 2011 and while we had regular snapshots we hesitated for too long before rebuilding our service in a different Availability Zone (or Region), partly because of lack of clear information about the outage from Amazon (we were continually given the impression that it wouldn't be long before things were back online, so why not wait just a tiny bit longer to restore to normal service than restore from slightly older snapshots?). Probably like many AWS customers impacted by the outage, we've increased our spend to boost our redundancy to better handle outages of this kind. We now span multiple Availability Zones and have increased the quality of our off-Region backups. I'm pretty sure that in the end Amazon has ended up doing very well from their rather embarrassing incident with many customers keen to avoid their own embarrassment the next time it happens.

However, switching to ELB hasn't been without hiccups.

GeoIP

We rely very heavily on GeoIP from MaxMind to serve content customised to each country. We have a large amount of functionality built right in to our Apache configuration that uses both rewrites and SSI to make our static content relatively dynamic. We even do spelling correction for UK/US English depending on where you view our site from! The main reason we customise content though is because FeedXL is a different product for each country. We have to maintain country specific feeds databases and we also mostly deal with local currencies so our price details change a little depending on where you are. We've had a very good experience with GeoIP with only a few mismatches reported by customers and they've always been corporate networks where traffic is routed internationally (Australia->USA or NZ->AU for example) or satellite connections without a likely country of origin.

The way that mod_geoip for Apache works is that it takes the request IP address and looks it up in its database to find the (most likely) country of origin, you then get environment variables in your Apache request: GEOIP_COUNTRY_CODE & GEOIP_COUNTRY_NAME. You can use these with mod_rewrite to do all sorts of crazy things, plus mod_include lets you do more straightforward things with your content. For example, if we want to make a North America specific announcement we might wrap our announcement block in <!--#if expr='"$GEOIP_COUNTRY_CODE" = "US" || "$GEOIP_COUNTRY_CODE" = "CA"' --> ... content ... <!--#endif -->.

However, one of the most important catches of load balancing is that your requests come to your web server from the load balancer itself and not the original client, so you don't get the raw IP address of the client built into your request. Instead, with ELB and most other load balancers you need to use the X-Forwarded-For HTTP header.

X-Forwarded-For

The X-Forwarded-For header was first introduced by Squid as a means of passing on the IP address of the client to the server. It has since been widely adopted by other proxy servers and load balancers so it's pretty much considered a standard even if it technically isn't.

What you are supposed to get as your header is this:

X-Forwarded-For: clientIP, server1IP, server2IP, server3IP

The client IP address should be first, followed by the first proxy server, followed by any other servers in a comma separated list. The final server that passes the request on to you won't be in the list: a proxy server or load balancer only appends the address of the server it received the request from if an X-Forwarded-For header was passed to it, otherwise it constructs a new X-Forwarded-For with just the client address in it. The address of the last server in the complete chain is simply the address of the client making the request to your server. But as usual in the web world there are no guarantees.
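As a rough illustration of that behaviour, here is a sketch of what a well-behaved proxy does with the header on each hop. The class and method names are hypothetical and real proxies differ in the details; the addresses are only sample values.

```java
class ForwardedForChain {
    // existingHeader: the X-Forwarded-For value received by the proxy (null if absent).
    // peerAddress: the address of the host that connected to this proxy.
    static String forward(String existingHeader, String peerAddress) {
        if (existingHeader == null || existingHeader.trim().isEmpty())
            return peerAddress;                     // no header yet: start a new list
        return existingHeader + ", " + peerAddress; // otherwise append to the end
    }

    public static void main(String[] args) {
        // First proxy sees the client directly, second proxy sees the first proxy.
        String header = forward(null, "10.208.4.38");
        header = forward(header, "58.163.1.4");
        System.out.println("X-Forwarded-For: " + header);
    }
}
```

Note that the second proxy's own address never appears in the header it sends on; the next hop only sees it as the remote address of the connection.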

Apache kindly gives you an HTTP_X_FORWARDED_FOR environment variable (although I can't find official documentation on this so I'm not sure of the specifics of what conditions may prevent you from getting this variable). You could use this in custom modules or standard modules that use environment variables such as mod_rewrite. If you want to log with it then you could configure your LogFormat to print it out with %{X-Forwarded-For}i to make your logs more interesting than just showing the load balancer hostname as %h.

mod_geoip has a configuration switch, GeoIPScanProxyHeaders On that tells it to use X-Forwarded-For (or HTTP_X_FORWARDED_FOR) to determine the client IP address rather than just the remote address.

There are some important catches to consider before you proceed to use this header to do anything interesting:

  1. Most importantly, headers can be crafted by anyone, never trust a header value unless you are certain that it can't be spoofed. I'd actually just simplify that to just never trust a header value. So if you are going to use it then don't use it for anything that has security implications.
  2. The client IP address that you get from the first entry may not actually be the address that you want. Most of the time the requests will probably come directly from the browser of your visitor but what if they are behind a proxy server within a private network themselves? The IP address you may end up with could be something like 10.1.34.121 which is of no value because it only tells you that they are sitting on a private network somewhere in the world.

Security Implications

This is pretty straightforward. If you're in the situation of handling traffic behind a load balancer then you may be able to guarantee that your traffic comes from the load balancer so the header is constructed by it, but consider the situation where X-Forwarded-For contains a chain of addresses, potentially from untrusted sources. If the header contains at least one server IP address then the client IP address will have been passed on by the upstream server with no way for your load balancer to verify its correctness; all it's doing is adding the address of the requesting host onto the end of the list.

There's also the possibility of direct connections to your web server(s). Are your servers walled off from the outside world with only the load balancer able to communicate with it? Is there a possibility that a client can make a direct connection to your server and construct its own X-Forwarded-For header? On AWS, all standard instances have a public IP address but you can set up your security groups to only allow access to port 80 from your load balancer. This is probably a good idea for many reasons.

Basically, I would suggest working on the assumption that X-Forwarded-For is only likely to be correct, nothing more.

Best Guess IP Address

When using X-Forwarded-For, the assumption normally made is that the first IP address in the list is the client address that you can use to do interesting things with, like IP address geolocation (à la GeoIP). But what about private addresses? What about the casual browser at McDonalds using their WiFi with a 10.x.x.x address or a company network with a 192.168.x.x internal address structure? You'll end up with a very unhelpful address that'll tell you nothing very interesting about the client.

There are 3 sets of address ranges in IPv4 (let's ignore IPv6 for now) that are reserved for private networks. Normally these are hidden behind NAT gateways and often traffic is forced to either manually or automatically route through a proxy server of some kind. The address ranges are:

  • 10.0.0.0 – 10.255.255.255
  • 172.16.0.0 – 172.31.255.255
  • 192.168.0.0 – 192.168.255.255
You can thank these beauties for extending the life of IPv4 way beyond what it would otherwise have been.
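For illustration, the three ranges can be checked numerically. This is only a sketch: the class and method names are hypothetical, and it assumes the input has already been validated as four dot-separated numbers.

```java
class PrivateRanges {
    // Assumes ip is already known to be four dot-separated numbers.
    static boolean isPrivate(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4)
            return false;
        int first = Integer.parseInt(parts[0]);
        int second = Integer.parseInt(parts[1]);
        return first == 10                                       // 10.0.0.0 to 10.255.255.255
            || (first == 172 && second >= 16 && second <= 31)    // 172.16.0.0 to 172.31.255.255
            || (first == 192 && second == 168);                  // 192.168.0.0 to 192.168.255.255
    }

    public static void main(String[] args) {
        System.out.println(isPrivate("10.1.34.121"));    // inside the 10.x.x.x range
        System.out.println(isPrivate("172.32.0.1"));     // just outside 172.16 to 172.31
        System.out.println(isPrivate("58.163.175.187")); // a public address
    }
}
```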

If you have a client behind one of these networks and it's not routed through a proxy server then you'll probably just get the IP address of the NAT gateway which is likely to be the address you want to use. If the request is routed through a proxy server then you may get an X-Forwarded-For that looks something like this:

X-Forwarded-For: 10.208.4.38, 58.163.175.187

Where the address you probably want is actually the (proxy) server address on the end rather than the private client address.

You may also have a chain of multiple servers, perhaps you have a downstream proxy server going through a larger upstream one before heading out of the network, so you may get something like this:

X-Forwarded-For: 10.208.4.38, 58.163.1.4, 58.163.175.187

Or, the downstream proxy server could be within the private network, perhaps a departmental proxy server connecting to a company-wide proxy server and then this may happen:

X-Forwarded-For: 10.208.4.38, 10.10.300.23, 58.163.175.187

This could of course be even more complex as you may have a longer chain of proxy servers (although I've never actually seen anyone chain more than 2 layers of proxy servers together in a network before).

So what general rule should we construct for extracting our usable client IP from these addresses?

Of course, I'm suggesting that the rule: always use the leftmost address is not correct as there is a good chance it may be a private IP address if there is more than 1 address in the list. Unfortunately this is the rule that mod_geoip adopts, if it finds a comma it just chops off the string at that comma. We immediately found this led to unsatisfactory results with ELB as we had more requests than we expected originating from private networks routed through proxy servers; and we heard about it in the form of error reports from our users ("where's the log in link?"--it's not normally displayed in countries where we haven't released FeedXL).

An alternative would be always use the rightmost address which would probably get you a pretty good guess in almost all cases. If there is more than one IP address in the list then the rightmost address will probably be the address where the request left whatever corporate or internal network the client was hidden behind, even if there are multiple layers. However, multiple layers of IP addresses suggests a fairly large network, possibly widely dispersed. There's also a chance that you have one proxy server piggybacking off a higher capacity upstream proxy server: for example, some ISPs run their own very large proxy servers that customers can use and may make ideal upstream connections for internal proxy servers with caching at both levels. The ISP proxy server is likely to be located in a very different place to the client though and if you're trying to pin down the IP address of the client using something like GeoIP City then you'll probably get the wrong city.

So, here's the rule that I suggest would be the best general case rule to allow you to extract the address most likely to be physically close to the real client:

Always use the leftmost non-private address.

We can do this because the rules are clear about what is and what is not a private IP address (see above).

Doing It the Regular Expression Way

First, remember that the X-Forwarded-For header is not very trustworthy. You don't want to even assume that it contains IP addresses! So, before you even check if an entry is a private IP address or not you should probably simply check if it's an IP address.

Here's a simple regular expression to match an IP address: ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) or alternatively, if you're working in an environment that supports \d then this will do the same thing: (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) (with or without the parentheses but as you'll see, they are useful for the next step).

Then you'll want to check if an IP address is private or not. Here's a regular expression that'll do that for you, given a valid IP address: (^127\.0\.0\.1)|(^10\.)|(^172\.1[6-9]\.)|(^172\.2[0-9]\.)|(^172\.3[0-1]\.)|(^192\.168\.). This matches all of the addresses in the ranges above and 127.0.0.1 as a bonus (quite possible in our chain!).

So a general algorithm could be something like this: walk through the string starting from the first match of our general IP address regular expression through to the last. For each match, check if the matched component matches our private IP address regular expression, if it does then proceed to the next address in the list, if it doesn't match then we have the IP address we want. If we get to the end of the list without finding an IP address that isn't private then we have to have some kind of generic fall-back.

Exactly what your fall-back might be depends on your environment and whether you trust the server passing you the request or not. In the case of ELB, if it's working properly we should never need the fall-back case. For FeedXL our fall-back for any failure during the GeoIP process is to just assume that they are coming from the country where most of our customers are from (currently Australia).
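The algorithm above might be sketched in Java like this. It uses the two regular expressions from this post, with the dots escaped so they only match a literal '.'; the class name, the method name and the fallback parameter are hypothetical, and what you pass as the fall-back is up to your environment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ForwardedForDemo {
    private static final Pattern IP_ADDRESS = Pattern.compile(
        "([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})");
    private static final Pattern PRIVATE_IP_ADDRESS = Pattern.compile(
        "(^127\\.0\\.0\\.1)|(^10\\.)|(^172\\.1[6-9]\\.)|(^172\\.2[0-9]\\.)|(^172\\.3[0-1]\\.)|(^192\\.168\\.)");

    // Returns the leftmost non-private address found in the header,
    // or the caller-supplied fallback if there is none.
    static String leftmostNonPrivate(String header, String fallback) {
        if (header == null)
            return fallback;
        Matcher matcher = IP_ADDRESS.matcher(header);
        while (matcher.find()) {                 // walks the matches left to right
            String candidate = matcher.group(1);
            if (!PRIVATE_IP_ADDRESS.matcher(candidate).find())
                return candidate;
        }
        return fallback;
    }

    public static void main(String[] args) {
        String fallback = "203.0.113.7"; // stand-in for the remote address of the request
        System.out.println(leftmostNonPrivate("10.208.4.38, 58.163.175.187", fallback));
        System.out.println(leftmostNonPrivate("10.208.4.38, 58.163.1.4, 58.163.175.187", fallback));
        System.out.println(leftmostNonPrivate("complete nonsense", fallback));
    }
}
```

Because the walk is driven by the IP address expression rather than by splitting on commas, stray spaces and non-address junk in the header are simply never matched.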

I have 2 follow-up posts to make after this one, first I'll show how I deal with X-Forwarded-For in both Tomcat and our own Java software, then I'll show how I've hacked mod_geoip to use the algorithm outlined above with excellent results.

Follow-up #1: Handling X-Forwarded-For in Java and Tomcat

Follow-up #2: A mod_geoip2 that properly handles X-Forwarded-For

Update July 30th 2011

I've just stumbled upon this, an "X-Forwarded-For Spoofer" Add-On for Firefox and I love the description, sums up the security concerns:

Some clients add X-Forwarded-For to HTTP requests in an attempt to help servers identify the originating IP address of a request. Some clients, however, can set X-Forwarded-For to any arbitrary value. Some servers assume X-Forwarded-For is unassailable. No server should.

With this add-on, you can assign an arbitrary IP address to the X-Forwarded-For field, attempt to perform XSS by including HTML in this field, or even attempt SQL injection.

May be useful for testing and debugging your web application.