2. Data Quality: Precision, Accuracy, and Cost
Many of the principles of open government data relate to a notion of data quality, meaning the suitability of the data for a particular purpose. Timeliness, for instance, is important if the data is to be useful for decisions in ongoing policy debates, but what constitutes “timely” depends on the particular circumstances of the debate. The choice of data format similarly depends on the purpose. For financial disclosure records, a spreadsheet listing the numbers is normally more useful than an image scan of the paper records because the intended use, searching for aberrations, is facilitated by computer processing of the names and numbers. If the most important use of the records were instead to locate forged signatures, then the image scans would become important. Data quality cannot be evaluated without a purpose in mind.
Government data normally represents facts about the real world (who voted on what, environmental conditions, financial holdings), and in those cases two measures become important: precision and accuracy. Precision is the depth of knowledge encoded by the data. Precision comes in many forms, such as the resolution of images, audio, and video, and the degree of disaggregation of statistics. Accuracy is the likelihood that the data reflects the truth. A scanned image of a government record is 100% accurate in some sense. But analog recordings like images, audio, and video have low precision with regard to the facts of what was recorded, such as the numeric values in images or the participants in recorded meetings. In many cases, there is no automated method to extract those details. Therefore, with regard to automated analysis, these details are not encoded by the data, and the data is not precise in those ways.
Government agencies have long prioritized accuracy in information dissemination. However, accuracy as defined here is a more nuanced notion because it is always relative to a particular purpose. Recordings are 100% accurate in the sense that they record physical events or objects reliably. A photo does not lie. If the intended purpose of the recording is to re-witness the event or object, then it has 100% accuracy. But if the intended purpose is to support the analysis of government records for oversight, then the colors, volumes, and other physical details present in the recording are not relevant for judging accuracy. What is relevant are the facts that were recorded, such as who the parties were in a transaction and the dollar amount that was exchanged. An image recording of a typed physical document, i.e. a scan, has low accuracy with regard to these facts because automated analysis of a large volume of such records could not avoid a large number of errors. OCR (optical character recognition) software used to “read” the letters and numbers in a scan will occasionally swap letters, yielding an incorrect reading of the facts.
Precision is often at odds with wide public consumability. A prepared report, which is still data, may be easily consumable by a general audience precisely because it looks at aggregates, summarizes trends, and focuses on conclusions. Therefore a report has low precision. The same information in raw form, at high precision, can be used by specialists such as developers, designers, and statisticians who can write articles, create infographics, or transform the same information into other consumable forms. The two ends of this spectrum are often mutually exclusive. A government-issued report in PDF format, say on environmental conditions, may be the most consumable for the public at large. But at the same time it provides little underlying data for an environmental scientist to draw alternative conclusions from. On the other hand, a table of worldwide temperature measurements would be of little value to the public at large, because only environmental scientists understand the climate models with which conclusions can be reached. Those scientists, in collaboration with a designer, could use the table to create a compelling visual explanation of climate change.
When discussing open government data, the principle of promoting analysis is primary when considering data quality. Large volumes of data are useless if they cannot be analyzed with automated, computerized processes. Therefore, when terms such as accuracy and precision are applied to data, they are always with respect to some automated process for analyzing the facts in the data. They do not refer to whether a recording captured the physical details of events or objects.
In a structured data format such as XML, greater precision breaks fields down into more subcomponents, for instance the difference between a single field for a name versus breaking the field down into first, middle, and last name. Names are particularly difficult to process in an automated way because they are so idiosyncratic: for instance, Congresswoman Debbie Wasserman Schultz is “Rep. Wasserman Schultz” not “Rep. Schultz” as you might think if you weren’t already familiar with her name. A more precise way to identify an individual, beyond a name, would be through a numeric identifier that relates the individual to other occurrences in other datasets. Greater precision is always better, other things being equal, but if the intended use of the data does not require processing names then we might not expect extra effort to be spent increasing the precision of names.
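The difference between a combined name field and a subdivided one can be sketched in Python. This is only an illustration: the element names (member, name, first, last), the identifier value, and the "last word is the last name" heuristic are all assumptions made for the example, not part of any real government schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical low-precision record: the whole name is one field.
LOW_PRECISION = "<member><name>Debbie Wasserman Schultz</name></member>"

# Hypothetical high-precision record: the name is split into subfields,
# and a made-up identifier links the person to other datasets.
HIGH_PRECISION = (
    '<member id="example-001">'
    "<name><first>Debbie</first><last>Wasserman Schultz</last></name>"
    "</member>"
)


def guess_last_name(full_name: str) -> str:
    """Naive heuristic: treat the final whitespace-separated word as the last name."""
    return full_name.split()[-1]


def last_name_from_combined(xml_text: str) -> str:
    """With a single name field, the last name can only be guessed."""
    full_name = ET.fromstring(xml_text).findtext("name")
    return guess_last_name(full_name)


def last_name_from_split(xml_text: str) -> str:
    """With subdivided fields, the last name is read directly, with no guessing."""
    return ET.fromstring(xml_text).findtext("name/last")


print(last_name_from_combined(LOW_PRECISION))  # Schultz (wrong)
print(last_name_from_split(HIGH_PRECISION))    # Wasserman Schultz
```

The heuristic is cheap but fails on exactly the kind of idiosyncratic name described above, while the subdivided record makes the same question trivial to answer.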
Precision and accuracy are intertwined with cost on both the producing and consuming ends. For the very same database, it may be possible to achieve high precision and high accuracy in processing the data, but only at high cost. When we ask for a database with high precision and high accuracy, we mean at a reasonable price.
Precision: The depth of knowledge encoded by the data.
Accuracy: The likelihood that the information extracted from the data is correct.
Data quality: Whether the data has an acceptable level of precision and accuracy for a particular purpose within an acceptable processing cost.
Let’s say the intended purpose of some data requires displaying the last name of each named individual in the dataset. In some cases, even if the names are given in a single combined field (“Debbie Wasserman Schultz”) it may be possible to determine the last name (“Wasserman Schultz”) with 100 percent accuracy. However, to do so might require hiring an intern to call up each individual and ask what exactly his or her last name is. In this case, precision and accuracy are very high, but so is cost. If the intern calls only half of the individuals and guesses for the rest, the precision stays the same, but accuracy would be reduced, meaning there is some chance the interpretation of the data will contain errors. But that’s what you can get by halving the cost.
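The trade-off in the intern example can be worked through as simple arithmetic. The sketch below assumes that a verified name is always correct and that a guessed name is correct with some fixed probability. The 90% guess accuracy, the record count, and the cost per call are hypothetical numbers chosen only to make the calculation concrete.

```python
def expected_accuracy(fraction_verified: float, guess_accuracy: float) -> float:
    """Expected share of correct last names.

    Verified names are assumed to be always correct; the remaining names
    are guessed and are correct with probability `guess_accuracy`.
    """
    return fraction_verified + (1.0 - fraction_verified) * guess_accuracy


def verification_cost(n_records: int, fraction_verified: float, cost_per_call: float) -> float:
    """Total cost when each verified record requires one phone call."""
    return n_records * fraction_verified * cost_per_call


# Hypothetical inputs: 1,000 names, 2.00 per call, guesses right 90% of the time.
for fraction in (1.0, 0.5, 0.0):
    accuracy = expected_accuracy(fraction, guess_accuracy=0.9)
    cost = verification_cost(1000, fraction, cost_per_call=2.0)
    print(f"verified={fraction:.0%}  accuracy={accuracy:.1%}  cost={cost:.2f}")
```

Under these assumed numbers, halving the calls halves the cost while accuracy falls from 100% to 95%. Whether that loss is acceptable depends on the purpose, which is the sense in which data quality is defined above.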
When government records are released as scanned images, the content of those records might be converted to text with high precision and accuracy using expensive OCR software (or human transcribers), or with low precision and accuracy using open source OCR software at a lower cost.
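One way to compare two transcription options is to measure the share of fields each one gets exactly right against a hand-checked sample. The following is a minimal sketch of that measurement. The dollar amounts and both sets of OCR output are invented for illustration and do not describe any real software.

```python
def field_accuracy(extracted: list[str], truth: list[str]) -> float:
    """Fraction of fields whose extracted value exactly matches the true value."""
    if len(extracted) != len(truth):
        raise ValueError("extracted and truth must have the same length")
    if not truth:
        raise ValueError("need at least one field to score")
    matches = sum(e == t for e, t in zip(extracted, truth))
    return matches / len(truth)


# Hypothetical dollar amounts as they appear on the paper records.
truth = ["1500", "72000", "310", "4800"]

# Hypothetical outputs from two transcription methods. Reading "1" as "l"
# or "0" as "O" is the kind of letter swap an OCR pass can introduce.
careful_output = ["1500", "72000", "310", "4800"]
cheap_output = ["l500", "72000", "31O", "4800"]

print(field_accuracy(careful_output, truth))  # 1.0
print(field_accuracy(cheap_output, truth))    # 0.5
```

A single swapped character makes the whole field wrong for analysis purposes, which is why the score here is computed per field rather than per character.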
When the cost of high precision and accuracy becomes prohibitive, data intended to expand government transparency becomes of limited value. Data quality is the suitability of data for a purpose taking into account the cost of obtaining acceptable levels of precision and accuracy.