Sunday, June 17, 2007

CURL is your friend


One of our systems requires that we download a file from a website and import the file's contents. When we first started importing the file, we did it manually (which we all know is not the best use of a developer's time). Unfortunately, there were a few bumps in the road to automation:

1) No FTP site, HTTP access only.
2) No RSS, e-mail, or notification of any kind that the file has been updated and a new one is available. (A Last Modified date existed on the page, but we found it was out of sync with the actual file modifications.)
3) The file is large (~25MB).

Fortunately HTTP and the CURL utility make overcoming these limitations pretty easy.

Step 1: Use CURL to download the file

CURL is a command-line program, available on Linux and most other platforms, that retrieves the contents of a given URL. We can use CURL to easily get around limitation #1.

curl http://example.com/data-file.dat > data-file.dat

This works great, and we can now incorporate the download into a scheduled script. That leaves limitations #2 and #3. As it turns out, CURL's HTTP support lets us find out whether the file has changed without paying the cost of downloading all 25MB.
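As a rough sketch of what the download step of such a scheduled script could look like, the function below writes to a temporary file and only replaces the real file when curl reports success. The function name fetch_file and the ".part" suffix are made up for this illustration; they are not part of curl or of our actual script.

```shell
#!/bin/sh
# fetch_file URL DEST
# Download URL into DEST without ever leaving a half-written DEST behind.
# (fetch_file and the ".part" suffix are hypothetical names for this sketch.)
fetch_file() {
    url=$1
    dest=$2
    tmp="$dest.part"

    # --fail: exit non-zero on HTTP errors (4xx/5xx) instead of saving the error page
    # --location: follow redirects
    # --silent --show-error: no progress meter, but still print real errors
    if curl --fail --silent --show-error --location --output "$tmp" "$url"; then
        mv "$tmp" "$dest"      # replace the old copy only after a complete download
    else
        rm -f "$tmp"           # discard the partial download, keep the old copy
        return 1
    fi
}
```

A scheduled job would then call something like: fetch_file http://example.com/data-file.dat data-file.dat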

Step 2: Mix in a little HTTP goodness

We've all heard of GET and POST, two of the methods defined in the HTTP spec. There is a lesser-known HTTP method called HEAD, which returns information about a resource on the web without returning its contents. That seems to do the trick for limitation #3, but how does it help us with #2? Let's try it and see:

curl --head http://example.com/data-file.dat

Which returns something like this:

HTTP/1.1 200 OK
Content-Type: application/rdf+xml
Last-Modified: Sun, 17 Jun 2007 06:43:12 GMT
Expires: Sun, 17 Jan 2038 19:14:07 GMT
Server: Apache
Content-Length: 26214400
Date: Mon, 18 Jun 2007 00:35:28 GMT

The interesting item in this response is the "Last-Modified" header. It tells us when the file on the server was last changed. Our script can now compare this value with the timestamp of the last file we downloaded and fetch the new file only if necessary.
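One way to script that comparison is sketched below. It assumes the Last-Modified value from the previous download was saved as a line of text in a small "stamp" file, and it checks for simple string equality rather than doing real date arithmetic. The names last_modified, remote_changed, and the stamp file are all invented for this example.

```shell
#!/bin/sh
# last_modified
# Read HTTP response headers on stdin and print the Last-Modified value, if any.
# Header names are case-insensitive, so the match is done on a lowercased copy.
last_modified() {
    tr -d '\r' |
        awk 'tolower($0) ~ /^last-modified:/ { sub(/^[^:]*:[ \t]*/, ""); print; exit }'
}

# remote_changed URL STAMP_FILE
# Succeeds (exit 0) when the server's Last-Modified differs from the value
# saved in STAMP_FILE, or when the server sends no Last-Modified at all
# (assumption: in that case it is safer to download again).
remote_changed() {
    url=$1
    stamp_file=$2
    remote=$(curl --silent --head "$url" | last_modified)
    saved=$(cat "$stamp_file" 2>/dev/null || true)
    [ -z "$remote" ] || [ "$remote" != "$saved" ]
}
```

After a successful download, the script would write the new header value into the stamp file so that the next run has something to compare against.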

Step 3: Even more HTTP goodness

If you don't like to do date comparisons in a script, HTTP offers another option: the If-Modified-Since header. Sending it turns the request into a conditional GET. If the resource has not been modified since the date passed in the header, the server should respond with a 304 (Not Modified) and no body. If the resource has been modified, the server will return a 200 (OK) along with the contents of the file.
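To make the two outcomes concrete, here is a small sketch that sends the If-Modified-Since header by hand and branches on the status code the server returns. The function name conditional_get and the ".part" temporary file are hypothetical; --header, --output, and --write-out with %{http_code} are standard curl options.

```shell
#!/bin/sh
# conditional_get URL SINCE DEST
# Issue a conditional GET and act on the HTTP status code.
# SINCE is an HTTP date such as "Sun, 17 Jun 2007 06:43:12 GMT".
# (conditional_get and the ".part" suffix are hypothetical names for this sketch.)
conditional_get() {
    url=$1
    since=$2
    dest=$3
    tmp="$dest.part"

    # The body goes to the temp file; --write-out prints only the status code
    # on stdout, which is what gets captured here.
    status=$(curl --silent \
        --header "If-Modified-Since: $since" \
        --output "$tmp" \
        --write-out '%{http_code}' \
        "$url") || status=000

    case $status in
        200) mv "$tmp" "$dest"; echo "updated" ;;     # new contents arrived
        304) rm -f "$tmp"; echo "unchanged" ;;        # nothing to download
        *)   rm -f "$tmp"
             echo "unexpected status $status" >&2
             return 1 ;;
    esac
}
```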

So, to clean this up even more, simply run the curl command with the --time-cond option and the timestamp from the last file. Our final curl command looks like this:

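curl http://example.com/data-file.dat --time-cond "Sun, 17 Jun 2007 06:43:12 GMT" --output data-file.dat

This request will download the contents of the file only if it has been modified since the given timestamp; otherwise the server answers with a 304 and sends no body. Note the use of --output instead of a shell redirect (>): the redirect would truncate the existing data-file.dat before curl even runs, leaving an empty file whenever nothing new is available.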
Using CURL enables us to completely automate the downloading of that file in a standard way, even though the vendor didn't make it the easiest for us.
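For completeness, here is a hedged sketch of how the whole job might be wrapped up without storing a timestamp string anywhere. It relies on two documented curl behaviours: --time-cond also accepts a file name and uses that file's modification time, and --remote-time stamps the downloaded file with the server's Last-Modified time. The function name refresh_file and the ".part" suffix are invented for this example, and treating an empty download as "unchanged" is an assumption that fits our non-empty data file.

```shell
#!/bin/sh
# refresh_file URL DEST
# Download URL into DEST only if the server's copy is newer than DEST.
# (refresh_file and the ".part" suffix are hypothetical names for this sketch.)
refresh_file() {
    url=$1
    dest=$2
    tmp="$dest.part"
    rm -f "$tmp"

    # Reuse the positional parameters as curl's argument list.
    set -- --fail --silent --show-error --remote-time --output "$tmp"
    if [ -f "$dest" ]; then
        # Given a file name, --time-cond uses that file's modification time.
        set -- "$@" --time-cond "$dest"
    fi

    curl "$@" "$url" || { rm -f "$tmp"; return 1; }

    if [ -s "$tmp" ]; then
        mv "$tmp" "$dest"      # mv keeps the timestamp set by --remote-time
        echo "updated"
    else
        rm -f "$tmp"           # 304 (or empty body): keep the existing file
        echo "unchanged"
    fi
}
```

The scheduled script can then run the import only when refresh_file prints "updated".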
