2. Principle 5: Data Format Matters

The 8 Principles of Open Government Data also say that data should be:

5. “Machine processable: Data are reasonably structured to allow automated processing.”

The goal of this principle needs some unpacking. Before the 8 Principles of Open Government Data were published, the term I heard most often for this was “machine readable.” At the workshop, Aaron Swartz pointed out that any data can be read by a machine and that it is not the reading of the bytes that is important to openness but whether the machine can usefully process it. The group adopted “processable” instead. The machine-processable principle is important because as the sizes of data sets grow, the most interesting, informative, or innovative applications of government data require the use of a computer to search, sort, or transform it into a new form.

But as powerful as computers are, they don’t work well with uncertain data. Human language, for instance, is remarkably uncertain, at least from the point of view of the computer. It doesn’t know what any of this text means. Prose to a computer is like poetry to me. I just don’t get poetry. When I read a poem I need someone to explain to me what it means and how it is meant to be interpreted. And so it is for a computer. There is no meaning to a computer besides what a programmer gives it. Give a computer audio or images, and without the proper software it is like an octopus suddenly relocated to the middle of Times Square. Its mind could not begin to make sense of the signals coming from its eyes. (The refined octopus would ask of Times Square “why?”, but your common octopus would have no idea what is going on.) Think of how each type of computer file can be opened only in the software it was meant for: Microsoft Word documents are opened in Microsoft Word, PDFs are opened in Adobe Acrobat, and so on. Data is nothing without the ability to create software to process it.

Sometimes government data falls into a category for which there is an appropriate, standard data format — and corresponding application. Tabular data, such as spending data, can be saved into a spreadsheet format (e.g. Microsoft Excel or, better, the CSV format). Saving it in this format, compared to a scanned image of the same information, creates certainty. Programmers know how to make programs that work reliably with numbers in rows and columns. Software already exists for that. There is no reliable way to work with the same information when stored in a scanned image.

But other government data doesn’t fit into a standard data format. When the relationships between things encoded in the data become hierarchical, rather than tabular, it is time to think about using a dialect of XML. For instance, an organizational chart of a government department is hierarchical and would not fit nicely into a spreadsheet. The information could be typed into a spreadsheet, but being able to puzzle out how to interpret such a spreadsheet yourself is not the same as a computer having been programmed to interpret and process it. Much like the difference between poetry and prose, the use of an incorrect format can render the data completely opaque to software that could otherwise process it.
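
For example, here is a sketch of what hierarchical XML for an organizational chart might look like (the element names, attribute names, and office names are hypothetical, not any government’s standard):

<department name="Department of Examples">
  <office name="Office of the Secretary">
    <office name="Office of Public Affairs"/>
    <office name="Office of the General Counsel"/>
  </office>
  <office name="Office of Administration"/>
</department>

The nesting of the office elements mirrors the chain of command directly, which is exactly the structure that rows and columns struggle to convey.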

The principle of machine processability guides the choice of file format — free-form text is not a substitute for tabular and normalized records, and images of text are not a substitute for the text itself — but it also guides how the format is used. When publishing documents, it is important to avoid scanned images of printed documents, even though scanned images can be included in a PDF and PDF is a recommended format for certain types of documents. A scanned image is an uncertain representation of the text within it. An RSS or Atom feed of a schedule encodes dates and times with certainty, but it does not provide a way to include a meeting’s location or topic in a way that a computer could meaningfully process. If those fields are important, and that will depend on what will be done with the feed, then the feed will need to be augmented with additional XML for that purpose.
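
As a sketch of what such an augmentation could look like, here is a single Atom feed entry extended with a made-up “sched” namespace for the location and topic fields (the Atom elements and the Atom namespace URL are standard; the sched namespace, its URL, and its elements are hypothetical):

<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:sched="http://example.org/ns/schedule">
  <title>Markup on the Public Online Information Act</title>
  <id>http://example.org/meetings/2011-07-08-markup</id>
  <updated>2011-07-01T09:00:00-04:00</updated>
  <!-- Atom has no standard element for a meeting's location or
       topic, so these use hypothetical extension elements: -->
  <sched:location>Committee hearing room</sched:location>
  <sched:topic>Public Online Information Act</sched:topic>
</entry>

A consumer of the feed that knows about the extra elements can process them; one that does not can simply ignore them.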

XML is best thought of as a type of data format, rather than a particular format. There are many ways to apply XML to any particular case, and the choices made in applying XML should continue to be guided by the principle of machine processability. For instance, the following is a poor XML representation of two scheduled meetings:

<schedule>
<meeting>The committee will hold a hearing next Thursday to consider unfinished business on the defense appropriations bill.</meeting>
<meeting>The markup on the Public Online Information Act will resume at 10am on Friday.</meeting>
</schedule>

This representation merely encloses a description of each event, in words, in “tags” surrounded by angle brackets. Angle brackets are the hallmark of XML. If you are not familiar with XML, note how each “start” tag corresponds with an “end” tag of the same name, with a “/” indicating that it is an end tag.

The example above is well-formed XML. However, it is a useless use of XML because it would be difficult or impossible to program a computer to reliably extract the information expressed in the words. A better use of XML would identify the key terms of the description with additional tags that, given sufficient documentation, a programmer could program a computer to locate reliably. Here I add who, what, when, and subject tags that wrap particular parts of the meeting information:

<schedule>
<meeting><who>The committee</who> will hold a <what>hearing</what> <when>next Thursday</when> to consider unfinished business on the <subject>defense appropriations bill</subject>.</meeting>
<meeting>The <what>markup</what> on the <subject>Public Online Information Act</subject> will resume at <when>10am on Friday</when>.</meeting>
</schedule>

Now if there are only ever two meetings to deal with, it’s easy enough to process this information: get an intern to type it into a spreadsheet. But imagine the same task applied to tens of thousands of meetings in all of the parliaments worldwide. With the right XML, it’s easy to create an automated process to locate all of the subjects, because the format commits the author to using the precise text “<subject>” to identify subjects. XML creates reliability through standardization.

Still, so far it is impossible to figure out the dates of these meetings when ambiguous terms like “next Thursday” are used. How would a computer identify all of the meetings coming up in August 2011? We can use a standard date representation to clarify this for automated processing. The same approach addresses the problem of naming bills in Congress (there are often multiple bills with the same name or the same number!).

. . .
<meeting>The <what>markup</what> on the <subject bill="http://hdl.loc.gov/loc.uscongress/legislation.112hr1349">Public Online Information Act</subject> will resume at <when date="2011-07-08T10:00:00-04:00">10am on Friday</when>.</meeting>
. . .

I’ve added so-called XML “attributes” to attach additional information to the tags. The new date attribute specifies the date (“2011-07-08”), 24-hour time (“10:00:00”), and time zone (“-04:00”, meaning four hours behind UTC) of the meeting in a date format created by an international standards body (ISO 8601). The new bill attribute reliably identifies the bill to be discussed, using “http://hdl.loc.gov/loc.uscongress/legislation.112hr1349” to indicate the bill clearly, rather than the bill’s title alone, which is ambiguous. This is sometimes called normalization: the process of regularizing data so that it can be processed more reliably. Again, standardization creates certainty in computer processing. That certainty is required to produce useful results out of data. And as this example shows, the devil can be in the details. While XML is often the recommended data format, it is how the particular XML is crafted that determines whether the data will be usefully machine processable.

The choice of file format involves a consideration of both access and machine processability. Figure 21 lists recommended formats for different types of media, glossing over the details discussed previously. For tabular data such as spreadsheets, CSV (“comma separated values”) is the most widely usable format.[141] It is simply a text file with commas delimiting the columns. In the case of tabular data, simplicity is key.
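
For instance, a few lines of hypothetical spending data in CSV form look like this, with the first line naming the columns and each following line holding one record (the column names and values are made up for illustration):

agency,recipient,amount,date
"Department of Examples","Acme Paper Co.",12500.00,2011-07-08
"Department of Examples","Widget Works LLC",8300.50,2011-07-15

Any spreadsheet program, and a few lines of code in nearly any programming language, can read a file like this reliably, which is exactly the certainty the principle asks for.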

There is no one answer for the appropriate file format for documents because of two often competing requirements: being print-ready (e.g. including pagination, layout, etc.) and encoding the content in a machine-processable way. The PDF format is always print-ready (PDF/A-1b is the specific minimum requirement for documents that need to be archived), but PDF normally does not make the information contained in the document machine-processable. While “Tagged PDF” (PDF/A-1a) could achieve both goals simultaneously, most software that supports PDF output does not adhere to PDF/A-1a. It is therefore necessary in some cases to publish a document in two formats simultaneously, once in PDF to satisfy the needs of printing and once in a format such as XHTML or XML. As I noted above, however, the use of XML does not guarantee machine processability: it’s in how the XML is used.

For images, audio, and video the overriding concern is using a non-proprietary format, for the reasons discussed in the accessibility section. The most useful non-proprietary formats are PNGs for images and Ogg for audio and video.

When dealing with heterogeneous data — such as entities with arbitrary relationships between them that defy a simple hierarchy — the semantic web format RDF may be the most appropriate choice (more on that later).

Media          Recommended Data Formats
Spreadsheets   CSV (UTF-8), OpenOffice spreadsheet, or XML
Documents      XHTML/XML (for structure) plus PDF/A-1b (for pagination)
Images         JPEG, SVG, or PNG (as appropriate)
Audio          uncompressed WAV or Ogg Vorbis
Video          uncompressed AVI or Ogg Theora

Figure 21. Recommended Data Formats

Machine processability also implies that the data should be clean. The smallest of mistakes in the data can dramatically increase the cost of using it, because fixing mistakes almost always requires human intervention, which in effect makes the data not machine processable. This isn’t to say the data must be correct. Data that has been collected, e.g. from a survey or a measurement, can be reported as it was collected when its errors are uncorrectable or when correcting them involves judgment. Data is clean when its data format has been appropriately applied and when its values are normalized.
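
As a small, made-up illustration, the first two records below would frustrate automated processing, while the second two are clean: the agency’s name is spelled consistently, the amount contains only digits and a decimal point, and the date uses the standard format:

Dept. of Examples,"$12,500",7/8/11
DEPARTMENT OF EXAMPLES,8300.50,July 15 2011

"Department of Examples",12500.00,2011-07-08
"Department of Examples",8300.50,2011-07-15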

Machine processability is closely related to the notion of data quality. For more, see section 5.2.
