4. Principles 9–12: Publishing Data
9. Permanent: Data should be made available at a stable Internet location indefinitely. Providing documents with permanent web addresses helps the public share documents with others by allowing them to point others directly to the authoritative source of the document, rather than having to provide instructions on how to find it or distribute the document separately themselves. Permanent locations are especially useful on government websites, which are prone to being scrapped and re-created as political power shifts.
A common format for permalinks to documents, which is used at most newspaper websites, is “www.agency.gov/year/month/day/name.doc”. Web addresses of this form give a clue about the date and nature of the document, which helps users verify that they have the right link. The League of Technical Voters proposes that web addresses be used to help distinguish document versions by having a different but related web address for each published version of a document, as well as in the extreme case to identify paragraphs within documents (see citability.org). The American Association of Law Libraries’ principles call permanent addresses “persistent URLs (PURLs),” although PURLs are usually short URLs that can be updated at any time to redirect to the current location of a resource. The use of redirecting URLs should be a last resort when a persistent, descriptive URL cannot be created.
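The date-based address format above can be sketched in a few lines of Python. This is only an illustration: the `permalink` function, the `https://` scheme, and the `v2`-style path segment for versions are assumptions made for the example, not a convention prescribed by any of the sources named here.

```python
from datetime import date
from typing import Optional


def permalink(host: str, published: date, name: str,
              version: Optional[int] = None) -> str:
    """Build a descriptive address of the form host/year/month/day/name.

    The optional version segment is a hypothetical way of giving each
    published version a different but related address.
    """
    date_path = f"{published.year:04d}/{published.month:02d}/{published.day:02d}"
    if version is None:
        return f"https://{host}/{date_path}/{name}"
    return f"https://{host}/{date_path}/v{version}/{name}"


# The address carries the publication date and the document name.
print(permalink("www.agency.gov", date(2012, 3, 5), "report.doc"))
# A later version gets its own related address.
print(permalink("www.agency.gov", date(2012, 3, 5), "report.doc", version=2))
```

Because the address is derived only from the publication date and the document name, it does not need to change when the website behind it is rebuilt.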
When data changes over time, persistence means 1) retaining copies of all published versions of the data, and 2) maintaining stability of format from version to version. Changes to a data format should strive to be backwards compatible and use a two-stage deprecation process: warn first, then change.
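One way to picture the two-stage process is a field rename in a published record. The sketch below is a minimal illustration under stated assumptions: the field names `agency` and `agency_name`, the `deprecated_fields` notice, and the `export_record` function are all hypothetical.

```python
# Hypothetical rename: the field "agency" is being replaced by "agency_name".
OLD_FIELD = "agency"
NEW_FIELD = "agency_name"


def export_record(agency: str, stage: int) -> dict:
    """Export one record during a two-stage format change.

    Stage 1 (warn): publish both fields and flag the old one as deprecated,
    so existing consumers keep working while they migrate.
    Stage 2 (change): publish only the new field.
    """
    if stage == 1:
        return {
            OLD_FIELD: agency,
            NEW_FIELD: agency,
            "deprecated_fields": [OLD_FIELD],
        }
    if stage == 2:
        return {NEW_FIELD: agency}
    raise ValueError("stage must be 1 or 2")


print(export_record("Treasury", stage=1))
print(export_record("Treasury", stage=2))
```

During stage 1 the format is backwards compatible, because a consumer written against the old field still finds it, and the notice tells that consumer what to change before stage 2 arrives.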
10. Promote analysis: “Data published by the government should be in formats and approaches that promote analysis and reuse of that data.” Although I have discussed this throughout, it is worth emphasizing that the most critical value of open government data comes from the public’s ability to carry out its own analyses of raw data, rather than relying on a government’s own analysis. Most of the other principles relate to promoting analysis.
11. Safe file formats: “Government bodies publishing data online should always seek to publish using data formats that do not include executable content.” Executable content within documents poses a security risk to users of the data because the executable content may be malware (viruses, worms, etc.).
Even with anti-virus software installed, malware is spread easily through file formats that contain natively executable code (.exe files on Microsoft Windows), macros with full access to the user’s computer (Microsoft Office documents with macros enabled), and in rarer cases formats that permit scripting languages (PDFs), because the software that reads such formats is prone to bugs. In many cases the best protection for a user is to simply not open files that may contain executable content. Governments should not ask a user to choose between their security and access to government information, and so open government data should avoid these formats.
The most common violation of this principle has been the use of Microsoft Office documents with macros. These macros were once a widely used method of spreading computer viruses. This is rarer today, in part because of better security settings available in Microsoft products, and in part because, since Microsoft Office 2007, documents with macros are saved with file extensions ending in “m” (.docm, .xlsm). The new file naming convention lets document creators and document users tell whether a file may contain executable content. Documents whose extensions end in “x” (.docx, .xlsx) do not contain macros and therefore satisfy the principle of safe file formats.
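The naming convention makes a simple pre-publication check possible. The following sketch classifies a file by its extension; the `macro_status` function and its three labels are hypothetical, and the check is only a hint, since an extension says what a file claims to be and not what it actually contains.

```python
from pathlib import PurePosixPath

# Office 2007+ extensions: the "m" variants may carry macros,
# the "x" variants may not.
MACRO_ENABLED = {".docm", ".xlsm", ".pptm"}
MACRO_FREE = {".docx", ".xlsx", ".pptx"}


def macro_status(filename: str) -> str:
    """Classify a filename as macro-enabled, macro-free, or unknown."""
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix in MACRO_ENABLED:
        return "macro-enabled"
    if suffix in MACRO_FREE:
        return "macro-free"
    # Older formats such as .doc and .xls cannot be judged by name alone.
    return "unknown"


for name in ["budget.xlsm", "Report.DOCX", "minutes.doc"]:
    print(name, "->", macro_status(name))
```

A publisher could run a check like this over a batch of files and hold back anything that is not reported as macro-free.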
12. Provenance and trust: “Published content should be digitally signed or include attestation of publication/creation date, authenticity, and integrity.” Digital signatures help the public validate the source of the data they find so that they can trust that the data has not been modified since it was published.
Establishing provenance and trust in a machine-processable way is important for static information, but it is actually incompatible with the goal of re-analysis. A digital signature is a method to ensure that, byte for byte, the data you have is the same as the data published by the source. However, as I’ve argued throughout, it is the transformation of data into new forms by mediators that makes data most powerful. That necessarily changes the bytes. Digital signatures are useful in the direct relationship between the data publisher and the data consumer and should be used on source documents, but they cannot be used to maintain a sense of authenticity in re-uses of the data.
The value of digital signatures on source documents should not become a reason to hinder the sort of changes that need to be made to source documents to create innovative applications.
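The byte-for-byte property can be seen with a cryptographic hash. The sketch below uses a SHA-256 digest as a stand-in for the integrity check inside a digital signature (a real signature would additionally sign such a digest with the publisher’s private key); the sample data and the `digest` and `verify` functions are made up for illustration.

```python
import hashlib
import json


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of the exact bytes given."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, published_digest: str) -> bool:
    """Check that data is byte-for-byte what the publisher released."""
    return digest(data) == published_digest


# Hypothetical source document and the digest its publisher would announce.
source = b"agency,amount\nTreasury,100\nEnergy,250\n"
published_digest = digest(source)

# A re-use: the same facts, converted from CSV to JSON by a mediator.
rows = [line.split(",") for line in source.decode().splitlines()[1:]]
as_json = json.dumps(
    [{"agency": agency, "amount": int(amount)} for agency, amount in rows]
).encode()

print(verify(source, published_digest))   # an unmodified copy verifies
print(verify(as_json, published_digest))  # the transformed copy does not
```

The JSON version carries the same information as the source, yet it fails verification because its bytes differ, which is why a signature on the source document cannot vouch for re-uses of the data.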