Shearer Software

Andrew Shearer’s Drivel

99 and 44/100 percent pure.

Wednesday, December 29, 2004

AMC News

Just to keep everyone current who’s playing along at home: I recently became chair of the Narragansett AMC Young Members, which until we get some more leaders to post is practically my other weblog.

Also, my application with the Boston AMC Young Members to become a 3-season dayhike leader was approved. And I’m now a co-webmaster of the Boston AMC Young Members site. Busy, busy.

   Providence, Outdoors, Personal, General  Posted at 1:01 AM    Add a comment
Monday, October 4, 2004

Oct. 10 Day Hike

I’m going to co-lead a beginner/intermediate day hike in the North Pack Monadnock area on Oct. 10. It’s a Boston AMC Young Members event. Most participants will be in their 20s and 30s, but there’s no age limit. AMC membership is encouraged, but not required, and the trip is free.

See more details at the Narragansett AMC Young Members site, which I just set up a few days ago.

   Outdoors, General  Posted at 11:34 PM    Add a comment
Friday, September 3, 2004

Two hikes for you

I’m co-leading two upcoming New Hampshire hikes for the Boston AMC Young Members.

On Sunday, Oct. 3, join us for an interesting beginner hike on Squam Mountain during foliage season. 5 miles. Read more information and register.

On Sep. 17-19, we’ll have car camping and a somewhat advanced 10-mile hike over the Baldies loop trail, summiting North and South Baldface. It’s known as one of the best hikes in the White Mountains. We’ll have some terrific views of the Presidentials. Includes some steep, exposed ledges, and finishes up with a visit to Emerald Pool. Waitlist only. Read more information and register for the Polish Off the Baldies trip.

   Outdoors  Posted at 1:04 AM    Add a comment
Thursday, June 24, 2004

Kayaking trip this Sunday

I’m co-leading an easy kayak trip with the AMC this Sunday, June 27. It’s a combination Young Members/Kayak Committee event.

Easy Paddling on Pawtuxet River: One of RI’s major rivers with an abundance of wildlife and history. Pontiac Mills in Warwick to Rhodes on the Pawtuxet.

Some rental boats may still be available. AMC membership isn’t strictly required, though participants should consider it. Contact me if you’d like to come.

See also: pictures from past AMC trips and past kayaking trips.

   Outdoors, General  Posted at 12:02 AM    Add a comment
Thursday, June 17, 2004

AMC Trip Rating Bookmarklet

The AMC Boston hike/bike and young members committees use trip rating codes such as “B3B” to indicate difficulty. Recently there’s been some discussion of how hard this makes the listings to understand for newcomers. Here’s a workaround for users, and the core of a script that would help the authors automate a solution.

This tool translates the codes to English text. To try it out, click the link below and enter a code such as A2C.

Decode AMC Rating

To use this on any trip listing page, drag the Decode link up into your links toolbar, or open your Bookmarks or Favorites window and drag it in. You won’t even have to type the codes anymore; it can work on any text you’ve selected. Go to the AMC Boston YM trip listings, double-click any trip rating to select it, and choose the Decode AMC Rating bookmark.

This is part of a proposal to provide English tooltips for trip rating codes in announcement web pages and emails. Here’s an example of a trip rating with a tooltip. Move your mouse pointer over the following code and hold it still for a few seconds: AA1B.
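As a rough sketch of the first step such a decoder needs, here is a Python function that splits a code into its parts. The pattern (one or two letters, a digit, then a letter) is inferred only from the three examples in this post (B3B, A2C, AA1B). The function names and the lookup tables are hypothetical; the real English wording would have to come from the AMC's own rating key.

```python
import re

# Assumed shape, inferred from the examples B3B, A2C and AA1B:
# one or two letters, a single digit, then a single letter.
RATING = re.compile(r"^([A-Z]{1,2})(\d)([A-Z])$")


def split_rating(code):
    """Split a rating code such as 'AA1B' into its three parts."""
    match = RATING.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a recognizable trip rating: {code!r}")
    return match.groups()


def decode_rating(code, tables):
    """Translate each part using caller-supplied lookup tables.

    `tables` is a sequence of three dicts, one per part. Their contents
    are placeholders here and must be filled in from the official key.
    """
    parts = split_rating(code)
    return ", ".join(
        table.get(part, f"unknown code {part!r}")
        for part, table in zip(parts, tables)
    )
```

A caller would pass three dictionaries built from the published key, and the result could feed either the bookmarklet's popup or a tooltip attribute.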

   Outdoors, General  Posted at 12:57 AM    Add a comment
Thursday, June 10, 2004

Why We Aren’t Safe

Broken Windows: With viruses, worms, and vulnerabilities in the news, John Gruber wrote an excellent piece. “Here’s a billion-dollar question: Why are Windows users besieged by security exploits, but Mac users are not?”

And, like clockwork, here comes the latest Windows vulnerability:

Internet Explorer Carved Up By Zero-Day Hole:
“Two new vulnerabilities have been discovered in Internet Explorer which allow a complete bypass of security and provide system access to a computer, including the installation of files on someone’s hard disk without their knowledge, through a single click.

Worse, the holes have been discovered from analysis of an existing link on the Internet and a fully functional demonstration of the exploit have been produced and been shown to affect even fully patched versions of Explorer.

It has been rated ‘extremely critical’ by security company Secunia, and the only advice is to disable Active Scripting support for all but trusted websites.”

The article goes on to say that the code exploits three holes in Internet Explorer for Windows, including one that has been known since August 2003, and there’s no patch available for any of them. (You could turn off Active Scripting, which breaks functionality on many sites, or stop browsing web sites you don’t trust completely. If that’s not acceptable, you have to switch to another browser such as Mozilla, or switch to a Mac.)

   Mac OS X, Technology, General  Posted at 8:24 AM    Comments (1)

WordPress RSS Import

WordPress 1.2 now has its own RSS import feature. However, it’s based on a different technique (regular expressions) than the code I contributed in January (which uses a true XML SAX parser). So I’m posting the code here as open source under the GPL license. This code has some additional features:

  • It can import single files from either your local drive or from a URL you specify, or it can import entire folder hierarchies of RSS files (blogBrowser-style: one folder per year, one file per month), making it a general-purpose weblog batch import tool using RSS as the exchange format.
  • It aggregates RSS feeds, if you point one or more copies of it at feeds on the web and set it to run regularly. (Even when run frequently, it won’t import the same item twice.) You can also use this to maintain more than one WordPress site that shares the same content, such as a test site and a production site.
  • It handles time zones in a sophisticated way, preserving the timezone offset so that each item can appear on your weblog under the author’s original local time, while using GMT for all date comparisons.
  • It respects and stores modification dates if given in the RSS file.
  • If modification dates are given in the RSS file, it can optionally import only new or changed posts, leaving posts alone that haven’t been changed or that have been changed more recently on the local machine.
  • Using the above feature and two copies of WordPress, it can synchronize two or more weblogs, bidirectionally or multi-directionally. New and changed posts on any one weblog will automatically show up on the others.
  • It complies with the XML specification, for correct behavior with XML namespaces with arbitrary prefixes and CDATA sections in arbitrary locations, both of which can trip up a regular-expression-based parser.
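The time zone and modification date rules in the list above can be sketched in a few lines. This is an illustration in Python, not the plugin's actual code; it assumes ISO 8601 timestamps with explicit offsets, and the function names are made up for the example.

```python
from datetime import datetime


def parse_stamp(text):
    """Parse an ISO 8601 timestamp, keeping its original UTC offset."""
    return datetime.fromisoformat(text)


def should_import(remote_modified, local_modified):
    """Import only when the remote copy is strictly newer than the local one.

    Offset-aware datetimes compare by absolute instant, so the comparison
    is effectively done in GMT even though each stamp keeps its own offset.
    """
    if local_modified is None:  # the post does not exist locally yet
        return True
    return parse_stamp(remote_modified) > parse_stamp(local_modified)
```

For instance, 08:15 at -04:00 is later than 13:00 at +01:00 (12:15 GMT versus 12:00 GMT), so the remote post would be imported, yet it can still be displayed under its author's local time of 08:15.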

As long as your RSS feed passes the XML well-formedness test (which it probably does, even if it doesn’t validate according to the RSS Validator), you can use this RSS Import filter. If it’s not well-formed XML, you’re better off with the RSS import filter built into WordPress.
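To show why namespaces and CDATA matter, here is a small Python sketch (the import filter itself is a WordPress plugin, so this is only an analogy). The sample feed, the handler class, and the choice of the Dublin Core date element are assumptions made for the illustration.

```python
import io
import re
import xml.sax
import xml.sax.handler

DC_NS = "http://purl.org/dc/elements/1.1/"

# A well-formed feed that binds the Dublin Core namespace to an unusual
# prefix ("x") and wraps the title in a CDATA section.
FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:x="http://purl.org/dc/elements/1.1/">
<channel><item>
<title><![CDATA[Ham & <eggs>]]></title>
<x:date>2004-06-10T08:15:00-04:00</x:date>
</item></channel></rss>"""


class ItemHandler(xml.sax.ContentHandler):
    """Collect the title and Dublin Core date of each item."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False
        self._field = None  # key currently being collected, if any

    def startElementNS(self, name, qname, attrs):
        if name == (None, "item"):
            self.items.append({})
            self._in_item = True
        elif self._in_item and name == (None, "title"):
            self._field = "title"
        elif self._in_item and name == (DC_NS, "date"):
            self._field = "date"
        if self._field:
            self.items[-1].setdefault(self._field, "")

    def characters(self, content):
        if self._field:
            self.items[-1][self._field] += content

    def endElementNS(self, name, qname):
        if name == (None, "item"):
            self._in_item = False
        self._field = None


def parse_items(xml_text):
    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_namespaces, True)
    handler = ItemHandler()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_text.encode("utf-8")))
    return handler.items


# A naive pattern keyed to the conventional "dc:" prefix finds nothing,
# and a naive title pattern returns the raw CDATA wrapper.
naive_dates = re.findall(r"<dc:date>(.*?)</dc:date>", FEED)
naive_titles = re.findall(r"<title>(.*?)</title>", FEED)
```

The SAX handler matches on the namespace URI rather than the prefix, and the parser unwraps the CDATA section before the handler ever sees the text.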

Versions are available for WordPress 0.9 through 1.2.

More Info and Download

   Open Source, Software, General  Posted at 8:15 AM    Add a comment
Sunday, May 2, 2004

House on the Water

Below a Providence street as it crossed over the water, I once came across a large wooden crate-like structure hanging by heavy cables, only easily visible or accessible by boat. Liana Araujo-Lane saw my picture of it and wrote in with an explanation. Thanks, Liana!

“I believe that that cardboard wood thing was a project of a student at RISD for a while. She lived in it and commuted to school with a kayak every day. she had to co-ordinate her commuting times with the tides…it was pretty interesting.

I learned about this from a teacher I had over last summer at RISD who was friends with this girl.”

   Providence, Pictures  Posted at 5:04 PM    Add a comment
Wednesday, March 24, 2004

AMC Newport Cliff Walk

Interested in doing the Newport Cliff Walk? I’m co-leading an easy trip there with an Appalachian Mountain Club group on Sunday, April 25.

Young Members trips like this one are most often attended by people in their 20s and 30s, but all are welcome, and it’s not even required that you be a member. Other typical trips include kayaking, biking, hiking, rock climbing, and skiing, at any level from beginner to advanced. This trip isn’t very demanding; it’s billed as a “leisurely walk.” We’ll probably have some kind of social afterwards and talk about future trips.

I have an archive of pictures of past trips online (including two snowshoeing & cross-country skiing trips this March with the Boston chapter).

The official writeup:

Young Adult Members Walk/Social
Sunday, 4/25 11am
Join us for a leisurely walk on the Newport Cliff Walk before the tourists take over. 6 mile easy hike, picnic lunch to follow.

Leader, Deb Hanley
Co-leader, Andrew Shearer

Contact me to register.

   Outdoors, General  Posted at 1:14 AM    Add a comment
Wednesday, February 18, 2004

RSSFilter

Now available: RSSFilter, an open source Python module for modifying RSS files and blogBrowser-format RSS archives in place. It builds on XMLFilter. (Speaking of which, thanks to Mark Pilgrim for its recent mention in his b-links.)

The module can also be used as an RSS parser for valid XML feeds, though it trades in ultra-liberal parsing for its ability to safely modify files.

Operations such as inserting, modifying, or deleting a post are designed to cause minimal disruption to the rest of the file.

Read more and download.

   Python, Open Source, Software, General  Posted at 10:57 PM    Add a comment
Tuesday, February 3, 2004

Vermont Snowshoeing

Pictures from a fun weekend snowshoeing in Grafton and Londonderry, Vermont, on a trip organized by the Boston AMC Young Members.

   Outdoors, Personal, General  Posted at 11:10 PM    Add a comment
Monday, January 26, 2004

iPhoto comments, flattened with Text File Technology

Here’s a way to back up iPhoto’s image comments into an easy-to-read flat directory structure. (Translation: one big folder.) You’d want to do this when archiving your photos to CD or DVD, or when trying to merge photo libraries, or when leaving iPhoto for another program, or at any other time you want your comments saved in a non-proprietary, easily readable format.

As you may have read last week, when I upgraded to iPhoto 4, all the image descriptions temporarily disappeared from my online photo albums. (I caught the problem on my own staging server before it appeared on this site.) The culprit was a change in the way iPhoto stores photo comments. Comments are now entirely gone from the easy-to-parse AlbumData.xml file; iPhoto now stores them in a binary format that appears to be proprietary.

AppleScript to the rescue. Last week’s script saved the comments to text files and generated a directory structure that exactly paralleled iPhoto’s library, with one text file for each comment. These files were in folders for each day, which were in turn inside folders for each month, etc., guaranteeing there would be no name conflicts. I had rejected using the internal ID of each picture (which would have allowed a flat conflict-free directory structure) because the ID wasn’t user-visible anywhere in the iPhoto interface, making comment files named for the ID difficult to map back to the original pictures.

One of the comments on that post asked for a version that generated the comment files in one folder, based on the image’s filename. That was a good idea. Though the filename is not guaranteed to be unique, it often is in practice. Most digital cameras save unique serial numbers for each picture as part of the filename. So this is enough for most people. (The exceptions would be if you have more than one digital camera using a similar naming convention, or if your camera is configured to reset its numbering between rolls.)

If you like guaranteed accuracy, use my original script; if you like simplicity, use the following alternate script. It will only save one of the conflicting comments if photo filenames are duplicated. Dropping the parallel folder structure simplified the script, since this version doesn’t need to employ any POSIX path manipulation.

Copy the following into Script Editor and run. Tested with iPhoto 4.0 on Mac OS X 10.3. (It may also work with earlier versions; drop me a comment below if you’ve tried it.)

-- Export iPhoto Comments - Flat
-- Creates a text file corresponding to each picture with a comment, containing just the comment. The filenames of the text files correspond to the filenames of the images. So avoid having more than one image with the same filename (taken by two different cameras with similar naming conventions, perhaps). This isn’t a problem for most people, but if it is for you, use the slightly more complex version of the script that duplicates the iPhoto folder hierarchy: <http://www.shearersoftware.com/personal/weblog/2004/01/18/iphoto-4-has-comments-no-more>.
-- Note: this does not remove files in the comments folder when a comment disappears (due to deletion of either the comment or the image). To guard against this, you may want to delete the whole comment folder before rerunning this script. (Using a separate folder rather than storing comment files alongside the image makes this easier; you can flush the whole cache at once.)
-- Written to work around the fact that iPhoto 4 no longer stores photo comments in the AlbumData.xml file.
-- by Andrew Shearer, 2004-01-25 <mailto:ashearerw at shearersoftware dot com>

-- config
set commentsFolderName to "iPhoto Library - My Comments Cache - Flat"
set stripJPG to true --whether to strip .JPG extension
set openFolderInFinder to true
set commentFileSuffix to ".comment.txt"
set requiredAlbumPrefix to "Web-"
-- end config

tell application "Finder"
    --return some folder of (path to pictures folder)
    if not (exists folder named commentsFolderName of (path to pictures folder)) then make new folder at (path to pictures folder) with properties {name:commentsFolderName}
    set commentsFolderPath to folder named commentsFolderName of (path to pictures folder) as text
end tell
--set commentsFolderPath to POSIX path of (path to pictures folder) & commentsFolderName

tell application "iPhoto"
    repeat with theAlbum in (every album whose name starts with requiredAlbumPrefix)
        repeat with thePhoto in (every photo of theAlbum whose comment is not "")
            set commentText to comment of thePhoto as Unicode text
            set commentFilename to image filename of thePhoto
            if stripJPG then
                -- strip .JPG suffix (optionally)
                if commentFilename does not end with ".JPG" then
                    error "Error: file does not end with .JPG: \"" & commentFilename & "\""
                end if
                set commentFilename to text 1 through -5 of commentFilename
            end if
            -- add suffix to comment filename (.txt extension, etc.)
            set commentFilename to commentFilename & commentFileSuffix

            set f to open for access file (commentsFolderPath & commentFilename) with write permission
            set eof f to 0 -- truncate file, or old data can remain
            write commentText to f as Unicode text
            close access f
        end repeat -- photos in album
    end repeat -- albums
end tell

if openFolderInFinder then tell application "Finder" to open folder commentsFolderPath
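
Since the flat layout keeps only one comment when two photos share a filename, it may be worth checking for collisions before relying on it. The following Python sketch mirrors the script's naming rule; the function names are hypothetical, and the case-insensitive comparison is an assumption based on the default Mac file system ignoring case.

```python
from collections import Counter


def comment_filename(image_filename, strip_jpg=True, suffix=".comment.txt"):
    """Mirror the AppleScript naming rule: drop '.JPG', then add the suffix."""
    name = image_filename
    if strip_jpg:
        if not name.lower().endswith(".jpg"):
            raise ValueError(f"file does not end with .JPG: {name!r}")
        name = name[:-4]  # same as 'text 1 through -5' in the script
    return name + suffix


def colliding_names(image_filenames):
    """Return the comment filenames that more than one photo would map to."""
    counts = Counter(comment_filename(n).lower() for n in image_filenames)
    return sorted(name for name, count in counts.items() if count > 1)
```

An empty result means the flat layout is safe for that set of photos; anything else is a reason to use the original script with the parallel folder structure.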

   Mac OS X, Open Source, Pictures, Software  Posted at 11:38 AM    Comments (4)
Saturday, January 24, 2004

What’s This Site Running?

I’m now using the release version of WordPress 1.0 to generate the content area of this weblog. (The headers, footers, site navigation, and subscription list are generated by ShearerSite.)

In many ways, it’s going from one extreme to the other. My own system is based on static rendering without a database, to the point that the original data itself is kept in RSS-compliant XML files on the site, and HTML files are generated from those. So there’s no programmatic server overhead for retrieval, but there is for authoring, since all the dependent pages have to be re-rendered on the spot. I’m still a fan of this type of system, but I wanted to try something different. WordPress is about as different as you can get: by default, it runs a battery of regular expressions–dozens upon dozens of them–over each post to format it at retrieval time. (Some kind of static caching may be on its way, though, judging from hints in the database schema.) The administration interface is mostly very good, making it much easier to perform administration tasks such as adding new categories than my homegrown config-file-based system did.

Pros of WordPress: very hackable (the good way, by the site owner); terrific setup routines; good navigation controls, easy to set up; well-rounded feature set.

Cons: frequently passes HTML through finicky regular expressions; too much use of addslashes() for my taste, including some double applications; a few bugs in 1.0 (though, to be fair, 1.0.1 final is imminent).

Some changes I made to my own copy include:

  • Improvements for source code posting, as well as XHTML validity. Made some changes in the regular expressions in the wptexturize and wpautop filters. Unmodified, they kept turning some of my posts into invalid XHTML by adding an extra </p> tag. I also had some problems with snippets of source code that I posted. WordPress’s filters would get too smart, and try to produce curly quotes around strings, as well as em dashes before AppleScript comments. They would also tend to double-space the code, because newlines were turned into <br /> and a newline by wpautop, and the pre element honors both. I modified the code so that any filter could (optionally) avoid <pre> sections in the content, letting them go through unmodified. I did this using a loop and, much as I hated to add them, two more regular expressions.
  • Site-relative blog home page links, to handle my unorthodox split-directory setup.
  • Minor permalink change, to send out two-digit days and months.
  • RSS import and synchronization. (I already contributed my RSS 0.9/1.0/2.0 import and sync. code to the WordPress project, but it was far too near the 1.0 series’ release date to make it in.)
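
The idea of letting a filter skip pre sections can be sketched as follows. This is a Python analogy using a single regular expression and a split, not the PHP change described above; the function name is hypothetical, and nested or malformed pre elements are not handled.

```python
import re

# The capturing group makes re.split keep the <pre> blocks in the result,
# at the odd positions of the returned list.
PRE_BLOCK = re.compile(r"(<pre\b.*?</pre>)", re.DOTALL | re.IGNORECASE)


def apply_outside_pre(text, text_filter):
    """Run text_filter over everything except <pre>...</pre> sections."""
    parts = PRE_BLOCK.split(text)
    return "".join(
        part if index % 2 else text_filter(part)
        for index, part in enumerate(parts)
    )
```

Any formatting filter (curly quotes, automatic paragraphs) can then be wrapped so that posted source code passes through untouched.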
   Open Source, Software  Posted at 3:14 AM    Comments (5)

England Photos II: This Time, It’s Personal

In addition to the public England trip photos, the semi-private family photos are now up, so members on both sides of the Atlantic can see them equally easily. Use family for the username and my mother’s maiden name as the password when you click this photo link.

   Pictures, Personal  Posted at 2:09 AM    Add a comment
Sunday, January 18, 2004

iPhoto 4 has comments no more

I bought the upgrade to Apple’s iLife suite, released on Friday. Here’s a gotcha for developers who parse iPhoto’s AlbumData.xml file, though it doesn’t directly affect most users. It affects me, because my own code parses AlbumData.xml to generate my web-based photo albums (such as the England trip pictures I just posted).

Though the overall format of iPhoto’s XML file stays the same (and my script had no trouble reading it), the Comments and Date fields are gone! The Date field is renamed and in a different format, which is no problem to work around because the image file’s embedded EXIF data contains the date as well. The missing Comments field is a different story.

From my quick inspection, the comment data seems to be only stored in a newly introduced iPhoto.db file, which is in some binary format. The rationale for this is presumably performance, but that doesn’t completely make sense, since the photo title is still stored in the XML file and it may be changed just as often.

In any case, here’s a workaround that uses AppleScript to write a parallel folder structure holding just the comments, one per text file. Paste the following into a Script Editor window and run. Use this anytime you’d like to protect your comments from the vagaries of software or platform transitions or upgrades. (The parallel folder structure helps this; the script could have used iPhoto’s internal IDs and generated all the files in a single folder, but that wouldn’t have been as forward-compatible.) GPL-licensed.

Read the rest of this entry »

   Mac OS X, Python, Open Source, Pictures, Software  Posted at 4:57 PM    Comments (8)

England Photos

Here are pictures of the scenery in Dartmouth taken during my trip to England over the New Year. Uploading has been slow due to the sudden death of my cable modem. Family pictures are coming next, and are semi-private: you’ll need to enter “family” as the username and my mother’s maiden name in lowercase as the password. The aunts, uncles, and cousins involved should have no problem figuring that one out.

   Pictures, Personal  Posted at 4:57 PM    Add a comment

Counterfeiting Restrictions and Unintended Consequences

Macintouch has some interesting commentary on anti-counterfeiting measures that Adobe quietly slipped into Photoshop CS. The program now detects images containing currency and prevents you from working with them, even though doing so is perfectly legal, as long as you don’t then make a printout that’s double-sided or very close in size to the original.

[Tim Wright] It would be fairly easy to create other documents which would mistrigger this pattern [described in eurion.pdf].
Now the cat is out of the bag, I fully expect this to start appearing on magazine page backgrounds, books, any documents considered “sensitive”, grocery coupons, etc, which will rapidly render colour photocopiers pretty useless until they disable this feature.
For more amusement, why not put it onto t-shirts or baseball caps, which will neatly prevent people from printing (or editing) photos of you? I’m sure more inventive people will be able to think of plenty of other uses, like car decorations, wallpaper, badges and so on…

   Society, Technology  Posted at 3:41 PM    Add a comment
Wednesday, January 14, 2004

The Hole in Postel’s Law

“Be conservative in what you do, be liberal in what you accept from others.”

This law is making the rounds again, with arguments both pro and con. Here are my thoughts.

Postel’s Law is a great, useful principle for writing programs that communicate. However, the law is so elegant and successful that it’s easy to regard it as an absolute. And then, because be liberal in what you accept is such an open-ended goal, people go too far. Here’s an analysis of the problem, followed by a suggestion.

The first half of Postel’s Law, be conservative in what you transmit, is a well-specified rule with a clearly defined goal. The tools to achieve it are specs and validators. But the vague goal of the other half, to be liberal in what you accept, can turn into a bottomless hole. There’s hardly any limit to how loose an interpretation of the spec can get, how cleverly the code can guess at the sender’s intent, and how much code for special cases you can write to fix invalid data. Because such code can provide an immediate user benefit and a market advantage, it turns into an arms race. Often, the code ends up violating the spec itself, intentionally or unintentionally, which we’ll see below.

The Growing Hole

Plenty has been written praising be liberal in what you accept. So I won’t repeat it. Here are some of the problems:

It enlarges the spec. Every additional error condition fixed by a market leader becomes an (undocumented) part of the spec. Senders come to rely on it. The senders probably don’t even realize that their output is wrong because of the way software is written.

In the edit-run-debug cycle of the typical software development process, testing is often done just by trying the program out, not through any mathematical process or formal validation suite. HTML authoring tends to be done the same way. Though modern XP [Extreme Programming] test-first practices call for a thorough suite of test cases to be written before the actual code, most software still doesn’t have this advantage. HTML is an easy case for validation, with scores of easily accessible validators already written, much easier to test than most program code, yet the bulk of new pages in the world have probably never been through an HTML validator.

The problem is that, even after removing all obvious bugs, the product of this run-test-debug cycle can only run at the “seems-to-work” level. There’s no guarantee that it’s really working, and specifically no proof that the program or web page is being conservative in what it sends. If it’s a program that communicates with other types of programs, the developer will test it with real examples of those programs. So, when a developer writing program Z needs to interoperate with programs such as A and B, and A and B are silently fixing errors in the output of program Z, Z’s developer will declare the code “working” (because to all appearances, it is), and say “ship it!”. And everything will be fine until an edge case comes along that program A or program B either can’t fix or interprets differently. Or until someone tries program Z with program C, which didn’t get the memo about all the particular types of errors that programs A and B fix. All this because Z had a latent bug, due to the second half of Postel’s Law, because:

It hides violations of the other half of Postel’s Law. In other words, by being more liberal on the receiver, it becomes more difficult to find bugs in the sender.

As an example, Microsoft Internet Explorer sports what some have called a “ridiculous tolerance for errors in HTML markup”. Microsoft FrontPage has a well-known tendency to silently create invalid HTML markup. (One of the bugs: FrontPage 98 and 2000 will occasionally go through a valid page with spacer images and replace all of their alt="" attributes with the lone word alt, which is invalid HTML. A developer familiar with the SGML foundations of HTML might think the fix is to parse this as a boolean attribute, alt="alt", but IE and other browsers choose to interpret it as alt="".) Though I doubt that any such bugs are intentional, the tendencies of the two products feed on each other. If the developers of FrontPage were testing with a browser that flagged such errors, it’s likely that the bugs wouldn’t have made it to release.

The bind here is that Postel’s Law tries to make things work as often as possible for users, but people trying to test other programs are users too, and errors are also covered up for them. One way out of this would be some kind of Postel Kill Switch, a strict mode intended for interoperability testing. (Turning off the other half of the law at the same time, causing the program to send out data malformed in various ways, would be harder to switch on programmatically.) Though the strict mode might do some good, it has some drawbacks: it would require a different code path, making it prudent to test both modes; and even without the extra work that would entail, testers might not bother turning the feature on every time in the first place.
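
A strict mode of this kind might look like the following Python sketch. The field being parsed (a decimal length), the particular repairs, and all the names are invented for the illustration; the point is only that one flag selects between the two code paths.

```python
class StrictModeError(ValueError):
    """Raised when strict mode meets input that liberal mode would repair."""


def parse_length(raw, strict=False):
    """Parse a decimal length field.

    Liberal mode tolerates surrounding whitespace and a leading '+'.
    Strict mode (the hypothetical kill switch) accepts plain digits only.
    """
    if raw.isascii() and raw.isdigit():
        return int(raw)  # conforming input: both modes agree
    if strict:
        raise StrictModeError(f"malformed length field: {raw!r}")
    repaired = raw.strip().lstrip("+")
    if not (repaired.isascii() and repaired.isdigit()):
        raise ValueError(f"cannot repair length field: {raw!r}")
    return int(repaired)
```

The sketch also shows the drawback mentioned above: the repair branch runs only in liberal mode, so both modes need their own tests.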

Market Forces

Even though it’s usually more work to be more liberal, developers with time or money on their hands will still do it. They are often motivated just to provide convenience for their users, but with competitors in the same market, it has a predictable effect:

It increases the cost of entry. Accepting everything is a greedy strategy. It rewards the incumbents, and makes more work for newcomers. Not only do the newcomers have to catch up with all the error-fixing logic that the market leaders have been writing since the beginning, they have to somehow figure out what all those error conditions are. They’re not in the spec, and it’s almost certain that they’re not publicly documented anywhere. Even if the types of errors to be fixed were known, the new programs would have to fix them exactly the same way as the old ones, even in the face of multiple overlapping errors or ambiguous edge cases. And in some cases, this may require disregarding the spec, deliberately misinterpreting a valid document to match an overzealous fix.

Safety

This leads to one of the most damning consequences:

It makes software unreliable. Even the safest-looking fix can have unexpected consequences once others depend on it. (Which they will, and, unless the fix was added purely on speculation, already do.)

For instance, if you’re writing an HTML parser, and you see a lone ampersand (technically illegal–it should be encoded as &amp;) the liberally accepting thing to do is to display an ampersand, just as if it had been encoded properly. Which is fine, at that moment. If the users knew what had happened, they would probably thank you for soldiering on through the rest of the document and not giving up right there. But in reality they don’t even know it happened, and as the years go by, they will keep turning out pages with unencoded ampersands. (It’s the testers-are-also-users problem again.) New high-end content management systems will be deployed without anyone working with the system even knowing that they’re entering raw HTML into some of the text fields, and that they have to be careful with ampersands (yes, this already happens). A validator may catch the problem if it happens to crop up on the page at the time it’s checked, but most likely, no one will notice until the unlucky day that someone writes a classified for an electric guitar setup saying “For Sale: guitar&amp; $200.” Then the amp will just mysteriously disappear on the post, putting a guitar and $200 on sale. (If you think the example is contrived, note that in another attempt to apply Postel’s Law, real-world browsers end up expanding the error domain even further: “guitars&amplifiers” will have three letters dropped out of it, because the first browsers judged that to be most likely what the author intended. However, if you added spaces around the punctuation, whole words would show up. This is the kind of bizarre behaviour that makes people distrust computers.)
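
The disappearing amp can be reproduced with Python's standard library, whose html.unescape function follows the HTML5 character reference rules; those rules keep the legacy behaviour of recognizing a few entity names even without the closing semicolon. The variable names are just for the illustration.

```python
from html import unescape

# The classified ad: the writer meant "guitar & amp; $200",
# typed without the space after the ampersand.
ad = unescape("guitar&amp; $200")

# No semicolon at all, yet "&amp" is still read as an entity.
run_together = unescape("guitars&amplifiers")

# With spaces around the ampersand, nothing is treated as an entity.
spaced = unescape("guitars & amplifiers")

print(ad)            # guitar& $200
print(run_together)  # guitars&lifiers
print(spaced)        # guitars & amplifiers
```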

At its root, the ampersand problem is really just confusion over a weakly specified input format. (You can find similar examples on display in comment forums across the web, which often treat visitors to the spectacle of a web developer repeatedly trying to describe an HTML tag, only to have the tag itself disappear.) However, in this case being liberally accepting didn’t fix the problem; it just made its symptoms more rare, and therefore the real problem harder to find, more capricious, and more puzzling.

In an effort to do the right thing, some programs intentionally go against the spec. Internet Explorer (and therefore Outlook, when opening HTML mail) will disbelieve the content type specified by the web server, and choose a different type itself based on heuristics, a behaviour which is even documented. An XHTML document might not be rendered if it starts with a comment that’s too long, or a plain text file might be parsed as HTML because it contained a tag-like sequence of characters. The HTTP spec specifically forbids browsers to second-guess the content type provided by the server, but IE does it anyway. This makes IE compatible with many badly-configured web servers. It also frustrates the owners of well-configured web servers for whom IE always guesses wrongly.

In certain cases, outright bugs in complex code designed to tolerate many errors have the ironic effect of limiting the spec. For example, RSS is based on XML, but because of the existence of RSS feeds with invalid XML, liberal RSS parsers can’t be based on real XML parsers. Real XML parsers are thoroughly tested and widely deployed. But instead, the developers have to roll their own quasi-XML parsers (increasing the barrier to entry). The chance of getting some part of the XML spec wrong is high (making the software unreliable). This in turn has made feed developers reluctant at various times to begin using any XML features that don’t already appear in the most common feeds, such as CDATA blocks in the description element, namespaces, and XML comments, because they might break regexp-based parsers. (Mark Pilgrim’s Ultra Liberal Feed Parser is a solution for Python programmers, and while it gets everything right as far as I know, it still doesn’t much help developers in other languages.)

In this example, XML is special, because the XML spec itself violates Postel’s Law. It calls for clients to terminate parsing entirely when they encounter malformed content. While it may have been better if this decision hadn’t been made, that’s the current reality of XML parsers. Replacing them all with less flighty ones would be nice. (Any takers?)

Security

Finally, security. A whole class of security vulnerabilities results from automatically fixing errors in input data. Because the set of errors to be fixed is ill-defined, software downstream can take a radically different action than what the software upstream thought possible. Malicious users can exploit this.

Think of the difficulty just of reliably filtering out dangerous HTML tags and attributes from a comment left on a web site. The browser is working as hard as possible to be liberal in its definition of an HTML tag, working by unknown rules to fix almost-tags. Can the author of such a filter ever be truly certain that nothing gets through? (Thinking about this, the only sure way around it without writing an entire HTML validator would be to fully parse the HTML input into an intermediate HTML-free representation, then write it back out as guaranteed-valid HTML code. The only thing left to worry about: an overzealous fix that would cause the valid code to be misinterpreted.)
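
To make the parenthetical concrete, here is a minimal Python sketch of the parse-then-rewrite approach using the standard library's `html.parser`. The whitelist `ALLOWED_TAGS` and the names `CommentSanitizer` and `sanitize` are assumptions made up for this example, and it drops all attributes rather than trying to vet them. It's an illustration of the idea, not a vetted security filter.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; a real site would choose its own.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p"}


class CommentSanitizer(HTMLParser):
    """Parse a comment and re-emit only whitelisted tags and escaped text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the rewritten comment
        self.open_tags = []  # whitelisted tags currently open

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            # Attributes are dropped entirely; only the bare tag is written.
            self.out.append(f"<{tag}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close anything opened after it so the output nests properly.
            while self.open_tags:
                top = self.open_tags.pop()
                self.out.append(f"</{top}>")
                if top == tag:
                    break

    def handle_data(self, data):
        # All text is written back out escaped, never copied verbatim.
        self.out.append(escape(data))


def sanitize(comment):
    parser = CommentSanitizer()
    parser.feed(comment)
    parser.close()
    # Close any whitelisted tags the commenter left open.
    while parser.open_tags:
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)


if __name__ == "__main__":
    print(sanitize('<b onclick="x()">hi</b><script>alert(1)</script>'))
```

Nothing from the input is passed through as-is: every tag in the output was written by the filter itself, which is the point of going through an intermediate representation.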

The rule: Arbitrary fixes to bad input data will thwart any previous filtering or security checking of the data.

What to do?

A Suggestion

Future specs could require implementations to report whenever they encounter and correct errors, with an interface that could be as simple and non-intrusive as an exclamation point icon. (A newsreader, for instance, would place it next to a suspect newsfeed and link it to the Feed Validator.) There’s nothing particularly new about this kind of interface; several products, such as Opera, already do something similar. The trick would be that it would be required by the spec. The market leaders would be compelled to adopt it, not just the smaller products.

This behavior wouldn’t hamper a program’s ability to accept liberally; it would just let testers and other interested users know that the data had not been sent conservatively. It would thus remove the conflict of interest between the two parts of the law. The feature would be on by default, so testers wouldn’t need to activate it, but it wouldn’t be so annoying that users wanted it off (as a modal alert box would be).

This doesn’t mean that each implementation has to have a full-fledged validator aboard. Only errors detectable by reasonably straightforward means and cases where the implementation goes to extra lengths to make sense of the input would have to be flagged. That does give implementations some wiggle room.
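
Here is a rough Python sketch of how little code the bookkeeping might take, using unescaped ampersands as the error being fixed. The names `Lenient`, `accept_liberally`, and `status_icon` are invented for this example, and the regular expression is only an approximation of what counts as a bare ampersand.

```python
import re
from dataclasses import dataclass, field

# An "&" that does not begin something shaped like an entity or
# character reference (&name; &#123; &#x1F;). An approximation.
_BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")


@dataclass
class Lenient:
    """The repaired text plus a record of every fix that was applied."""
    text: str
    corrections: list = field(default_factory=list)


def accept_liberally(markup):
    corrections = []

    def fix(match):
        # Record the fix at the moment it is made.
        corrections.append(f"bare '&' at offset {match.start()} escaped")
        return "&amp;"

    return Lenient(_BARE_AMP.sub(fix, markup), corrections)


def status_icon(result):
    # The non-intrusive indicator: shown only if something was corrected.
    return "!" if result.corrections else ""


if __name__ == "__main__":
    result = accept_liberally('<a href="?a=1&b=2">x &amp; y</a>')
    print(result.text)
    print(status_icon(result), result.corrections)
```

The input is still accepted and repaired; the only difference is that the repair leaves a trace a tester can see.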

It’s important that this minor error display mechanism be required in order to comply with the spec. It can’t be voluntary on the part of the implementors. There’s nothing in it for them, at least not directly. To record the error as it’s fixed and display the fact takes extra code, albeit not much. Considering that the benefit goes mainly to future implementors as well as users of less liberal implementations that don’t know how to handle the same error, implementors will tend not to write that code unless nudged.

And the developers can be nudged, even for specs without trademarks or an official logo program. Having the requirement enshrined in the spec at least provides some social pressure for implementors to comply.

And some other things that seem to make sense right now:

  • Developers should also take great care to hold back and not misinterpret technically valid input in an attempt to do the right thing. Internet Explorer’s habit of second-guessing the Content-Type header is the kind of thing to avoid.
  • By the same token, to provide tolerant XML parsing, use a real standards-compliant XML parser first, and fall back to a handcoded quasi-XML parser only when that fails. (Or, if you can absolutely guarantee that the result will be identical, use the quasi-XML parser alone, but that guarantee is hard to make.)
  • To avoid unintentionally thwarting security filters, all heroic fixes to input should be made as far upstream in the call chain as possible. If there’s still a danger the downstream code will try to outsmart the upstream code, the upstream code could rewrite the input to be canonical and unmisinterpretable.
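
The strict-first approach from the second point can be sketched in a few lines of Python with the standard library's `xml.etree.ElementTree`. The fallback here is only a stand-in I made up for the example (it strips control characters that XML 1.0 doesn't allow and tries again); a real quasi-XML parser would handle far more. The name `parse_feed` is likewise hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that are not legal in XML 1.0 documents.
_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def parse_feed(text):
    """Return (root, repaired).

    The real XML parser always gets the first try, so well-formed feeds
    are never touched by the repair code. repaired is True only when the
    strict parse failed and the stand-in repair had to be applied.
    """
    try:
        return ET.fromstring(text), False
    except ET.ParseError:
        pass

    # Stand-in for a quasi-XML fallback: remove illegal control characters
    # and retry. This may still raise ET.ParseError for other errors.
    return ET.fromstring(_ILLEGAL.sub("", text)), True


if __name__ == "__main__":
    good = "<rss><title>Fine &amp; valid</title></rss>"
    bad = "<rss><title>Hi\x0cthere</title></rss>"
    for feed in (good, bad):
        root, repaired = parse_feed(feed)
        print(repr(root.find("title").text), "repaired" if repaired else "ok")
```

Returning the `repaired` flag alongside the result also gives the caller what it needs to show an error indicator, as suggested above.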
   Technology  Posted at 7:14 AM    Comments (1)

Back from England

Got back recently from visiting family in the mild weather of Dartmouth, England, where it’s not cold, and it’s not warm, and where the camera flash usually goes off outside at noon. Some pictures to come later.

   Personal  Posted at 6:34 AM    Add a comment
Recent Reading

A Heartbreaking Work of Staggering Genius, by Dave Eggers

Harry Potter and the Order of the Phoenix, by J. K. Rowling

Player Piano, by Kurt Vonnegut

Bad News, by Donald E. Westlake

The Blank Slate: The Modern Denial of Human Nature, by Steven Pinker

The Jungle, by Upton Sinclair

Gödel, Escher, Bach: An Eternal Golden Braid, by Douglas R. Hofstadter

Speaking With the Angel, by Nick Hornby (Editor)

In Progress

The Language Instinct, by Steven Pinker

The Corrections, by Jonathan Franzen