Monday, October 15, 2007

Blog has been moved. I've gone Typo. Its new home is at http://blog.bcarlso.net.

Wednesday, September 26, 2007

Tomcat and Weak ETags

UPDATE: Transferred this to the new blog.

Being known as the HTTP/REST guy in the company, I was pulled into an interesting conversation about CSS/JavaScript caching issues this week and was fortunate enough to learn a couple of things on the way. It seems to be a common (anti)pattern in the Java world to consistently fight JavaScript browser caching issues by simply adding a query parameter to your script tags:

<script type="text/javascript" src="myscript.js?version=<%= Application.VERSION %>"></script>

I've seen (and used) a number of variations on the same theme, including going so far as to create a JSP custom tag for more advanced schemes.

The obvious solution is to use the Last-Modified header with the timestamp of the file. According to the spec, the client can utilize this value to create a conditional GET request by adding the If-Modified-Since header. If we look at all of the major browsers, they dutifully follow this pattern by sending the If-Modified-Since header the next time the resource is requested.

The "workflow" goes something like this:

GET /some-resource.html

HTTP/1.1 200 OK
Last-Modified: Wed, 26 Sep 2007 04:58:08 GMT

<html>
<head><title>Some Resource</title></head>
<body></body>
</html>

The next time the browser asks for the file and the file remains unchanged:

GET /some-resource.html
If-Modified-Since: Wed, 26 Sep 2007 04:58:08 GMT

HTTP/1.1 304 Not Modified
Last-Modified: Wed, 26 Sep 2007 04:58:08 GMT

Note the use of the 304 status code and no message body. This is an indication to the client that it is free to use the cached version of the resource.

What if the file has changed? This is as simple as returning the content with an updated Last-Modified date as seen below:

GET /some-resource.html
If-Modified-Since: Wed, 26 Sep 2007 04:58:08 GMT

HTTP/1.1 200 OK
Last-Modified: Thu, 27 Sep 2007 05:00:00 GMT

<html>
<head><title>Some Updated Resource</title></head>
<body></body>
</html>

Now that the server has returned the updated resource, the client should update its cache with the latest version and Last-Modified information.
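The server's half of that workflow boils down to one date comparison. Here is a minimal sketch of it in plain Java, not tied to Tomcat or any servlet API; the class and method names are made up for illustration, and it assumes both values arrive as well-formed HTTP dates.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

class LastModifiedCheck {
    /**
     * Returns the status a server would send: 304 if the resource has not
     * changed since the client's copy, 200 (with a body) otherwise.
     * Both arguments are HTTP dates such as "Wed, 26 Sep 2007 04:58:08 GMT";
     * ifModifiedSince is null when the client sent no such header.
     */
    static int statusFor(String ifModifiedSince, String lastModified) {
        if (ifModifiedSince == null) {
            return 200; // unconditional GET: always send the content
        }
        DateTimeFormatter http = DateTimeFormatter.RFC_1123_DATE_TIME;
        ZonedDateTime clientCopy = ZonedDateTime.parse(ifModifiedSince, http);
        ZonedDateTime current = ZonedDateTime.parse(lastModified, http);
        // Only a strictly newer resource warrants sending the body again.
        return current.isAfter(clientCopy) ? 200 : 304;
    }
}
```

Feeding it the dates from the three exchanges above gives 200 for the first request (no header), 304 for the unchanged file, and 200 once the file's timestamp moves to Thu, 27 Sep 2007 05:00:00 GMT.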

Easy, huh? Well, you'd think so... This works fine on most browsers, but unfortunately it doesn't work quite as you would expect in IE. Using Fiddler you can track what's actually going on and see that IE ignores the 200 + content returned via the conditional GET and takes the version of the resource from cache anyway!

This behavior in IE is, in my experience, the cause of many of our caching woes. Fortunately, there is a lesser known cousin to Last-Modified that IE supports pretty well. It's an HTTP header known as ETag (Entity tag).

ETags are also used to identify whether a resource has changed, and can be created a number of ways, including taking a hash of the response body or serializing the Last-Modified timestamp.
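To make those two options concrete, here is a rough sketch of both in plain Java. The class and method names are hypothetical, and MD5 is just one convenient choice of hash, not something the spec asks for.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ETagFactory {
    /** ETag from a hash of the response body: any byte change alters it. */
    static String fromBody(byte[] body) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest(body)) {
                hex.append(String.format("%02x", b));
            }
            return "\"" + hex + "\""; // ETag values are quoted strings
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    /** ETag from the serialized Last-Modified timestamp (epoch milliseconds). */
    static String fromLastModified(long lastModifiedMillis) {
        return "\"" + lastModifiedMillis + "\"";
    }
}
```

The hash version costs a pass over the content but changes only when the bytes do; the timestamp version is cheap but changes whenever the file is touched.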

The same workflow is used for ETag processing, but with a couple of different headers:

GET /some-resource.html

HTTP/1.1 200 OK
ETag: "1234567890"

<html>
<head><title>Some Resource</title></head>
<body></body>
</html>

GET /some-resource.html
If-None-Match: "1234567890"

HTTP/1.1 304 Not Modified
ETag: "1234567890"

GET /some-resource.html
If-None-Match: "1234567890"

HTTP/1.1 200 OK
ETag: "0987654321"

<html>
<head><title>Some Updated Resource</title></head>
<body></body>
</html>
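On the server, the check behind these three exchanges is a string comparison between the tag the client sends and the tag of the current version. A minimal sketch, with hypothetical names, assuming the tags themselves contain no commas (the header may carry a comma-separated list):

```java
class ETagCheck {
    /**
     * Returns 304 when one of the tags in If-None-Match equals the current
     * ETag, and 200 otherwise. ifNoneMatch is null when the header is absent.
     */
    static int statusFor(String ifNoneMatch, String currentETag) {
        if (ifNoneMatch == null) {
            return 200; // unconditional GET: always send the content
        }
        for (String candidate : ifNoneMatch.split(",")) {
            String tag = candidate.trim();
            // "*" matches any existing version of the resource.
            if (tag.equals("*") || tag.equals(currentETag)) {
                return 304;
            }
        }
        return 200; // the client's copy is stale: send the new content
    }
}
```

A client holding "1234567890" gets a 304 while that is still the current tag, and a 200 once the server's tag has moved on to "0987654321".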

Notice, same workflow, different headers. The difference in this case is that IE handles the 200 as expected, replacing the cached version with the new content and updating the ETag metadata in the cache for this resource. So to properly handle caching in IE all we have to do is set the ETag for JavaScript files. But how do we do that...

Well, the title of the post mentioned Tomcat, and this is where we actually talk about it. As it turns out, there's a "dark side" to ETag processing: something called a Weak ETag. Weak ETags are prefixed with "W/" and, using our example above, would look like this:

ETag: W/"1234567890"

The notion of a "Weak" ETag, as the spec puts it, is:


a weak value changes whenever the meaning of an entity changes


As I interpret it, let's say that you're downloading a Java source file via HTTP. You could take a hash of the program, excluding comments and whitespace, and return this as a Weak ETag. Subsequent updates to the comments or formatting of the document would not change the actual "meaning" of the returned result. If the code itself changed, however, a new Weak ETag would be generated and returned. Weak ETags are not, as far as I can tell, very well supported by browsers.
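As a rough illustration of that interpretation (my own toy example, not anything a real server does), the sketch below strips comments and collapses whitespace before hashing. The regular expressions are deliberately naive and would be fooled by comment markers inside string literals; the names are invented.

```java
class WeakETagSketch {
    /** Removes comments and collapses whitespace so only the code itself remains. */
    static String normalize(String source) {
        return source
                .replaceAll("(?s)/\\*.*?\\*/", " ") // block comments
                .replaceAll("//[^\\n]*", " ")        // line comments
                .replaceAll("\\s+", " ")             // runs of whitespace
                .trim();
    }

    /** Weak ETag that changes only when the normalized code changes. */
    static String weakETag(String source) {
        return "W/\"" + Integer.toHexString(normalize(source).hashCode()) + "\"";
    }
}
```

Two versions of a file that differ only in comments or formatting get the same weak tag; change a statement and the tag changes.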

The problem is that Tomcat shows loyalties to the dark side when it comes to static content. Tomcat's FileDirContext class does not populate the ETag for static content, leaving the decision about an ETag to DefaultServlet. DefaultServlet simply generates a Weak ETag (by concatenating the content length and the last modified time in milliseconds), sending it back to the browser to basically be ignored.
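Going by that description, the tag Tomcat sends has the shape sketched below. This is my reconstruction of the format, not a copy of DefaultServlet's code, and the class and method names are invented. It also shows how little separates the weak form from a strong one.

```java
class TomcatStyleETags {
    /** Weak form as described above: W/"<length>-<lastModifiedMillis>". */
    static String weak(long contentLength, long lastModifiedMillis) {
        return "W/" + strong(contentLength, lastModifiedMillis);
    }

    /** The same validator without the W/ prefix, i.e. a strong ETag. */
    static String strong(long contentLength, long lastModifiedMillis) {
        return "\"" + contentLength + "-" + lastModifiedMillis + "\"";
    }
}
```

For a 1024-byte file last modified at 1190782688000 ms, the weak tag is W/"1024-1190782688000" and the strong one is "1024-1190782688000".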

In my quest to figure out how to prevent these cache problems I turned to Google and the Tomcat source for help. I was hoping to find a configuration setting to prevent the Weak ETag behavior I was seeing for static content but turned up nothing. Instead I found a little gem hiding in the context.xml configuration file.

The Resources Element

As it turns out, you can configure your own context for serving static content using the Resources element in context.xml. It looks like this:

<Context>
<Resources className="org.example.StrongETagDirContext" />
...
</Context>

I extended the FileDirContext class and overrode the getAttributes() method:

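// Overrides FileDirContext's getAttributes(name, attrIds) to replace the ETag.
public Attributes getAttributes(String name, String[] attrIds) throws NamingException {
    ResourceAttributes r = (ResourceAttributes) super.getAttributes(name, attrIds);
    // Same length/last-modified pair DefaultServlet uses, but without the "W/" prefix.
    String strongETag = String.format("\"%s-%s\"", r.getContentLength(), r.getLastModifiedTime());
    r.setETag(strongETag);
    return r;
}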

This associates a strong ETag (instead of Tomcat's default Weak ETags) with each static resource served up. Now we have a conditional GET request that behaves well in all browsers and we can get rid of those hacks we've been using forever. I'm not sure Tomcat is doing the right thing with using Weak ETags by default, and I'll probably post some of these comments to the Tomcat mailing list for consideration, but for now I've got caching behaving as I would expect.

Tuesday, September 25, 2007

Passion and Programming

I just read J.B.'s post on TDD and complexity and it reminded me of a conversation I had last weekend and the conclusions I came to afterward. We were discussing software development and agile project management. The woman I was speaking to mentioned that I just don't have the passion for software development in me that I do for project management. I was kind of shocked by the statement and, after our conversation had finished, spent some time thinking about how that could be. I mean, I'm always talking about architecture, unit/acceptance testing, and automation with my colleagues; why doesn't that come through? One of the comments she made went something like "when I talk to other programmers they always talk about the 'cool' stuff that they did but, when I talk to you, you don't have any of those stories". Hmm. After a bit of pondering I realized that my passion is still there; it's just that my attitude has changed. I appreciate simplicity much more than I once did and the most elegant solutions for me are simple and clean. I look at code loaded with design patterns and wonder if it's all really necessary or if the programmers are just "respond[ing] by making that program intricate enough to challenge their professional skill".

Tuesday, June 26, 2007

Individuals and interactions over processes and tools



Today I heard about a project team that is transitioning to Agile and was looking to better prioritize their feature backlog. The team was using a voting system to determine priorities; however, they decided that a better solution was required. After some discussion, the project team decided to utilize business value points to prioritize the work. This is a good thing! The team recognized a deficiency in their system and adapted to improve their effectiveness.

Unfortunately I think the story takes a bit of a downward turn from then on.

The team decided to build a "backlog management/prioritization tool" and subordinate most of the backlog items to building it. Now, I don't think that backlog management tools are bad per se; however, I would prefer an existing tool such as Microsoft Excel, or even Version One or Rally, to hand-rolling a solution, and then only when the demands of the project really need it. From what I've heard so far, I don't believe the latter to be the case. Let me assume for a minute that it will take 1-2 months to develop a basic software package. Given 2 developers at an average cost of $60 per hour (base pay + benefits), a single month (2 developers x 160 hours x $60) already comes to around $19,200 for a basic application, not including the lost time on higher business value features sitting on the backlog. Is that worth the cost? My other thoughts on this revolve around committing to building a tool before having ever estimated business value points. Is this really a good idea? Looking to lean for inspiration, "defer commitments until the last possible moment" comes to mind. What if the organization pours even half of my estimated $20k into this tool and the business users decide that business value points are too ambiguous to be used effectively and want to go back to voting?

It is an interesting story and I don't know if it will have a happy ending. One thing is for sure: I will be in touch with my colleague to see how things shake out. More to come I'm sure...

Sunday, June 17, 2007

CURL is your friend


One of our systems requires that we download a file from a website and import the file's contents. When we first started importing the file we did it manually (which we all know is not the best use of a developer's time). Unfortunately, there were a few bumps in the road to automation.

1) No FTP site, HTTP access only.
2) No RSS/e-mail/notification of any kind that the file has been updated and a new one is available. (A Last Modified date existed on the page, but we found it was out of sync with the actual file modifications)
3) The file is large (~25MB).

Fortunately HTTP and the CURL utility make overcoming these limitations pretty easy.

Step 1: Use CURL to download the file

CURL is a Linux command line program that will retrieve the contents of a given URL. We can use CURL to easily get around limitation #1.

curl http://example.com/data-file.dat > data-file.dat

This works great. We can now incorporate the downloading of the file into a scheduled script. This leaves us with limitations #2 and #3. As it turns out, CURL's support for HTTP allows us to take advantage of HTTP in order to save the overhead of downloading a 25MB file.

Step 2: Mix in a little HTTP goodness

We've all heard of GET and POST, two of the methods defined in the HTTP spec. There is a lesser known HTTP method called HEAD that can be used to get information about a resource on the web without actually returning its contents. That seems to do the trick for limitation #3, but how does it help us with #2? Let's try it and see:

curl --head http://example.com/data-file.dat

Which returns something like this:

HTTP/1.1 200 OK
Content-Type: application/rdf+xml
Last-Modified: Sun, 17 Jun 2007 06:43:12 GMT
Expires: Sun, 17 Jan 2038 19:14:07 GMT
Server: Apache
Content-Length: 26214400
Date: Mon, 18 Jun 2007 00:35:28 GMT

The interesting item in this response is the "Last-Modified" header. It specifies the timestamp on the file it is returning. We can now use this in our script to compare the timestamp of the last file we downloaded with the timestamp of the file on the web and download the new file if necessary.
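If the scheduled job happens to be written in Java rather than shell, the same check can be sketched with the JDK's HttpURLConnection. This is only an illustration: the class and method names are mine, and how the timestamp of the previous download gets stored is left to the caller.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class RemoteFileCheck {
    /** Sends a HEAD request and returns Last-Modified in epoch milliseconds (0 if absent). */
    static long remoteLastModified(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try {
            conn.setRequestMethod("HEAD"); // headers only, no 25MB body
            return conn.getLastModified();
        } finally {
            conn.disconnect();
        }
    }

    /** True when the remote copy is newer than the one we last imported. */
    static boolean needsDownload(long remoteMillis, long localMillis) {
        return remoteMillis > localMillis;
    }
}
```

Note that getLastModified() returns 0 when the server sends no Last-Modified header, so with this comparison a missing header means "don't download"; a real script might prefer the opposite default.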

Step 3: Even more HTTP goodness

If you don't like to do date comparisons in script, HTTP offers another option: the If-Modified-Since header. This is a conditional GET request. If the resource specified has not been modified since the date passed in the header, the server should respond with a 304 (Not Modified). If the resource has been modified, the server will return a 200 (OK) along with the contents of the file.

So, to clean this up even more, simply run the curl command with the --time-cond option and the timestamp from the last file. Our final curl command looks like this:

curl http://example.com/data-file.dat --time-cond "Sun, 17 Jun 2007 06:43:12 GMT" > data-file.dat

This request will download the contents of the file only if it has been modified since the given timestamp.

Using CURL enables us to completely automate the downloading of that file in a standard way, even though the vendor didn't make it the easiest for us.
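For comparison, here is roughly what the same conditional GET looks like from Java using HttpURLConnection. Treat it as a sketch under my own assumptions: the names are made up, and error handling is reduced to "anything other than 200 leaves the local file alone". That last point matters because a shell redirect truncates its target before the command runs, whereas this version only writes after it has seen a 200.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class ConditionalDownload {
    /** True only for 200 OK; a 304 means the local copy is still current. */
    static boolean hasNewContent(int statusCode) {
        return statusCode == HttpURLConnection.HTTP_OK;
    }

    /**
     * Downloads url to target only if it changed after lastDownloadMillis.
     * Returns true when the local file was replaced.
     */
    static boolean downloadIfModified(String url, Path target, long lastDownloadMillis)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try {
            conn.setIfModifiedSince(lastDownloadMillis); // adds If-Modified-Since
            if (!hasNewContent(conn.getResponseCode())) {
                return false; // 304 (or an error): keep the existing file
            }
            try (InputStream body = conn.getInputStream()) {
                Files.copy(body, target, StandardCopyOption.REPLACE_EXISTING);
            }
            return true;
        } finally {
            conn.disconnect();
        }
    }
}
```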

Wednesday, May 09, 2007

My First Distributed Planning Session

We had our first distributed planning session today and I wasn't too impressed. Conference call + WebEx = Challenging

First off, I was using MS Paint + Mouse as my "virtual whiteboard". I bet WebEx has something that can be used and I could bring my Wacom tablet from home, but I'm a newbie to this so that will have to wait until another day. Another thing that's tough is that I rely on reading people's facial expressions to guide my line of questioning in planning meetings, but that aspect is also nonexistent in distributed planning. I've got several more coming up, so I'll be getting more practice...

I guess some distributed Agile sessions are in order at Agile 2007...

Sunday, May 06, 2007

IntelliJ Sucks! AKA keybinding hell.

Ok, out of morbid curiosity we decided to download and install the IntelliJ trial for doing our development during this iteration. Getting our Eclipse project imported and Tomcat installed was pretty painless so we thought what the heck, let's try it. It didn't last long. It really boiled down to our addiction to the keyboard, no mouse for us. Even with the Eclipse keybindings set, it didn't cover all of the shortcuts we use regularly. The "Quick Fix" keystroke (Ctrl+1) was not available, for instance. So we could probably have taken the time to get them completely set up "Eclipse Style", but how many IntelliJ shops are using Eclipse bindings? I guess my thoughts turn more and more towards thinking that the IDE doesn't really matter that much; it's all in what you're used to.