6. Other Practical Guidelines for Web Pages and Databases

The recommendations in this section cover narrower concerns about websites and databases and should be taken up only after the preceding principles are applied.

Google has made several recommendations from the point of view of web search.[157] The public’s ability to find government information is a crucial part of that information being open. Google’s first recommendation is to use the Sitemaps protocol, which helps search engines crawl websites more deeply and efficiently. Its second recommendation is to review whether search engines are blocked from parts of an agency’s website by a robots.txt file, which describes the agency’s policy on automated access to its website. A robots.txt file should be used sparingly so as not to limit the public’s ability to gather data from the agency or about the agency. As noted by Webcontent.gov, restricting access with a robots.txt file may be contrary to an Office of Management and Budget memorandum in the United States.[158]
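As a rough illustration of both recommendations, the sketch below builds a minimal Sitemaps-protocol file and checks which pages a robots.txt policy would hide from crawlers. It uses only the Python standard library. The function names, the example robots.txt rules, and the page addresses under youragency.gov are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace defined by the Sitemaps protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal Sitemaps-protocol XML document listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


def blocked_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical agency pages and robots.txt policy.
pages = [
    "http://www.youragency.gov/about",
    "http://www.youragency.gov/reports/2010.pdf",
]
policy = "User-agent: *\nDisallow: /reports/\n"

sitemap_xml = build_sitemap(pages)
hidden = blocked_urls(policy, pages)
```

Here `hidden` holds the pages that search engines would be told to skip. A review of this kind, run against an agency's real robots.txt file, is one way to spot public documents that are unintentionally excluded from search.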

Permanent web addresses (discussed earlier) are part of a larger picture of using globally unique identifiers (GUIDs). The concept is that any document, resource, data record, or entity mentioned in a database, or some might say every paragraph in a document, should have a unique identifier that others can use to point to it or cite it elsewhere. A web address is a globally unique identifier. Any web address refers to one document and nothing else, and this reliability promotes the dissemination of the document because it gives people a means to refer to it and direct others to it. GUIDs that persist across database versions allow users of the database to process changes more easily. If two datasets use a common set of GUIDs to refer to entities, such as campaign donors, then the value of the two datasets becomes more than the sum of their parts. The connections between the databases add great value to how they can be used. An easy (and accepted) way to choose GUIDs is to piggyback on your agency’s web domain, which provides a space of IDs for you to choose from that won’t clash with anyone else’s IDs. For instance, you may coin verbose GUIDs for entities such as "http://www.youragency.gov/guids/john_smith", rather than a simple, opaque, and non-globally-unique numeric ID such as "12345". Such GUIDs are a form of URI (uniform resource identifier), but the important part is that they are simply unique identifiers.
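The value of shared GUIDs can be shown with a small sketch. It coins GUIDs under the agency's web domain, as described above, and then combines two datasets that refer to the same donor by the same GUID. The two datasets, their field names, and the helper functions are invented for illustration and do not come from any real system.

```python
from urllib.parse import quote

# The agency's own domain provides a space of IDs that cannot clash with others.
GUID_BASE = "http://www.youragency.gov/guids/"


def mint_guid(local_id):
    """Coin a globally unique identifier from an ID that is only unique locally."""
    return GUID_BASE + quote(local_id, safe="")


def join_on_guid(left, right):
    """Merge the records of two datasets wherever they share a GUID."""
    shared = left.keys() & right.keys()
    return {guid: {**left[guid], **right[guid]} for guid in shared}


# Hypothetical dataset 1: campaign donations, keyed by donor GUID.
donations = {
    mint_guid("john_smith"): {"donated": 500},
    mint_guid("jane_doe"): {"donated": 250},
}

# Hypothetical dataset 2: lobbyist registrations, keyed by the same GUIDs.
registrations = {
    mint_guid("john_smith"): {"registered_lobbyist": True},
}

combined = join_on_guid(donations, registrations)
```

Because both datasets use the same identifier for John Smith, no name matching or guesswork is needed to connect them. A numeric ID such as "12345" would give no such guarantee, since another database could use the same number for a different person.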

The use of GUIDs in the form of URIs is part of a technological movement called Linked Open Data (LOD, see linkeddata.org). Promoted by the creator of the World Wide Web, Tim Berners-Lee,[159] the LOD method for publishing databases achieves data openness in a standard format, and the potential for interconnectivity with other databases, without the expense of reaching wide agreement on unified inter-agency or global data standards. LOD is a practical implementation of Semantic Web ideas, and several tools exist to expose legacy databases and spreadsheets in the LOD method. I have been writing about the uses of the Semantic Web for government data[160] for as long as I’ve been publishing legislative data. It has not caught on in the United States, but it has become a core part of Data.gov.uk and is a recommendation of the Australian Governments Open Access and Licensing Framework.[161]
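To give a feel for what publishing a record as Linked Open Data involves, the sketch below writes simple statements in N-Triples, one of the plain-text RDF formats. Each statement names its subject and its property with URIs. The choice of N-Triples, the use of the FOAF "name" property, and the function names are assumptions made for this example; real LOD tools handle many more cases, such as typed values and links between entities.

```python
# Property URI from the FOAF vocabulary, used here only as an example.
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"


def escape_literal(text):
    """Escape the characters that N-Triples does not allow raw inside a literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriples(statements):
    """Serialize (subject URI, property URI, text value) tuples as N-Triples."""
    return "".join(
        f'<{subject}> <{prop}> "{escape_literal(value)}" .\n'
        for subject, prop, value in statements
    )


# A hypothetical record about the entity identified by an agency GUID.
output = to_ntriples(
    [("http://www.youragency.gov/guids/john_smith", FOAF_NAME, "John Smith")]
)
```

The result is a single line stating that the entity identified by the GUID has the name "John Smith". Because the subject and the property are both URIs, another publisher can make further statements about the same entity without first coordinating on a shared schema.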

The W3C working draft Publishing Open Government Data[162] and the Linked Data Cookbook published by the W3C Government Linked Data committee[163] provide additional best practices with regard to GUIDs and Linked Open Data.
