Requesting metadata with electronic records

Why “data about data” is important

From the Spring 2011 issue of The News Media & The Law, page 17.

A federal court in New York City ruled in February that some metadata is presumed releasable under the federal Freedom of Information Act, the first time a court has ruled that the federal government must provide metadata with associated electronic records requested under FOIA.

The case, National Day Laborer Organizing Network v. U.S. Immigration and Customs Enforcement Agency, was brought after the Immigration and Customs Enforcement Agency and other federal agencies responded to a FOIA request by providing nearly 3,000 pages to the National Day Laborer Organizing Network in five unsearchable — and, according to the network, virtually unusable — PDF files.

The network, which advocates for day laborer rights, sued, arguing that the associated metadata should have been included with the released documents and that failing to include the data is a violation of FOIA.

Getting documents in searchable PDF form, rather than the static image format the National Day Laborer Organizing Network received, would enable the network to find relevant information through keyword searches.

It also would enable the network to distinguish between individual documents, which would allow the network to jump to particular sections, rather than having to sift through 3,000 pages with no idea of what information is contained on each page.

The government argued that the network never asked for the metadata in its FOIA request and, in order to receive metadata, it must be specifically requested.

In addition, the government argued that producing the metadata would require extensive time and money to search for the data and review it for any applicable exemptions. The government also claimed it had to provide static-image PDF files to prevent requesters from being able to recover the redacted portions of the documents.

As electronic records creation and storage becomes increasingly prevalent, the issue of metadata and FOIA will only become more common.

And journalists need to know why it may be important to specifically include a request for metadata when they request electronic documents.

 

What is metadata?

Metadata is “data about data,” according to David Donald, data editor at the Center for Public Integrity, an organization of investigative reporters who focus on government accountability issues. Metadata includes, for example, the information automatically created by a computer when a document is created or altered, such as time, date or author.

But it can also be manually created. This data, often called “embedded metadata,” can be some of the most helpful information because “tagging” documents for key words and topics can tell a journalist if a document contains information he or she is looking for and save hours in research time, Donald said.

Tagging databases is not something readily seen by those who view the documents, but it is important, Donald said.

When looking at multiple databases, tagging is essential, he explained: it allows a researcher to search for “GM,” for example, and have the search also return references to “General Motors” and “GMC.” Without tagging, it would be impossible to do such searches, he said.
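The kind of tag-based search Donald describes can be illustrated with a minimal Python sketch. It is not drawn from any agency system: the tag table, the sample records and the function names are all hypothetical, and a real database would maintain its tags very differently.

```python
import re

# Hypothetical tag table: each canonical tag lists the variants that
# should be treated as the same entity. These values are illustrative only.
TAGS = {
    "General Motors": {"gm", "gmc", "general motors"},
}

def expand_query(term, tags=TAGS):
    """Return every variant that shares a tag with the search term."""
    needle = term.strip().lower()
    for variants in tags.values():
        if needle in variants:
            return set(variants)
    return {needle}  # untagged term: only the term itself can match

def search(records, term, tags=TAGS):
    """Return records that mention the term or any tagged variant of it."""
    variants = expand_query(term, tags)
    # Match whole words only, so "gm" does not match inside "judgment".
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in sorted(variants)) + r")\b",
        re.IGNORECASE,
    )
    return [record for record in records if pattern.search(record)]

records = [
    "Recall notice issued by General Motors",
    "GMC truck inspection log",
    "Ford supplier contract",
]
matches = search(records, "GM")
```

With the tag table in place, a search for “GM” returns the first two records. With an empty tag table, the same search returns nothing, because neither record contains “GM” as a standalone word.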

Metadata is also a general term that covers a wide swath of electronic information, Donald said. According to the National Information Standards Organization, a nonprofit association that identifies and maintains technical standards to manage information and has published information concerning metadata and its uses, there are three main types of metadata: descriptive, structural and administrative.

Descriptive metadata includes information about a document, including authorship, title and keywords. This metadata is used when looking at memoranda or email messages to determine authorship and chronology of the exchanges.

Structural metadata concerns the makeup of a document, including pagination and if and how multiple files were combined. These data allow the reader to see information regarding page numbering, and graph and table insertions.

Administrative metadata helps manage a document by, among other things, containing information used to catalog and preserve the document. These data can be used to determine accuracy because a journalist can look at this information to see which of two similar graphics is more recent or which statistic or poll has a larger margin of error.

Metadata also helps archive and preserve a document as its software is updated or migrated when technology advances.

Administrative metadata allows a document to be updated and preserved so that, even when the software used to create it no longer exists, the information within the document remains accessible and uncorrupted. It does so by “track[ing] the lineage of a digital object (where it came from and how it has changed over time) . . . detail[ing] its physical characteristics, and . . . document[ing] its behavior in order to emulate it on future technologies,” according to the National Information Standards Organization.

The National Day Laborer Organizing Network requested metadata that fits all three types. The network requested the file names, custodians and modification dates of the files, which are descriptive metadata.

It requested the source path and production path of the documents: the path from where the item was collected and the path to the document from the production media, which are administrative metadata. The network also requested page number information and searchable PDF files, which include the structural metadata that allows maneuverability throughout the documents.

A document stripped of metadata “is totally worthless” because metadata is used to search massive amounts of information and tables and spreadsheets to find out if there is something of interest within, Donald said.

Donald said he uses “data dictionaries,” a type of structural metadata, before he even looks at the actual documents to determine if and where there is relevant information within the larger document. A data dictionary, or a record layout, is a document that lists the tables, fields and columns in a database. It also contains a codebook to decode the table fields because most databases are just numbers. A requester needs the list of codes to know even the names of the columns because they are generally all represented by numbers in the table, he said.

The metadata makes sense of the data that alone is “unusable,” Donald said. The most important metadata for him is the data dictionary because most of his work is looking at databases made up almost entirely of tables. One such table he has used is 1,600 columns wide and everything is in code. “There’s no way I could understand it without the data dictionary,” he said.
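A rough Python sketch shows how a data dictionary turns a coded table into something readable. The column identifiers, field names and code values below are invented for illustration and do not come from any real agency's record layout.

```python
# Hypothetical data dictionary: maps coded column names to readable names
# and coded cell values to their meanings. All names and codes are invented.
DATA_DICTIONARY = {
    "C001": {"name": "state", "codes": {"1": "Alabama", "2": "Alaska"}},
    "C002": {"name": "provider_type", "codes": {"10": "Hospital", "20": "Clinic"}},
    "C003": {"name": "claims_paid", "codes": None},  # plain number, no codebook
}

def decode_row(row, dictionary=DATA_DICTIONARY):
    """Translate one coded row into readable field names and values."""
    decoded = {}
    for column, value in row.items():
        entry = dictionary.get(column)
        if entry is None:
            # Column is missing from the dictionary: keep it exactly as received.
            decoded[column] = value
            continue
        codes = entry["codes"]
        # Look the value up in the codebook when one exists for this field.
        decoded[entry["name"]] = codes.get(value, value) if codes else value
    return decoded

raw_row = {"C001": "2", "C002": "10", "C003": "4521"}
readable = decode_row(raw_row)
```

Here the raw row is only a set of codes. After decoding, it reads as a state of “Alaska,” a provider type of “Hospital” and a claims figure of 4521. Without the dictionary, a requester would have no way to know what “C001” or “10” stood for.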

Donald also said he uses data dictionaries that are posted online to determine what documents he should request through FOIA, specifically referencing the Research Data Assistance Center.

ResDAC is a website where the Centers for Medicare & Medicaid Services posts its data dictionaries online. Before any FOIA requests are submitted, Donald and other journalists look up the data dictionaries for the agency’s documents to see which documents contain the information. Donald said this is a practice all agencies should adopt because it can reduce the number of FOIA requests, make the requests more specific and easier to process, and reduce the need for requesters to file follow-ups.

David Herzog, an associate professor at the Missouri School of Journalism and academic advisor to the National Institute for Computer-Assisted Reporting at Investigative Reporters and Editors Inc., said data dictionaries are “the road map to help understand [a document].” Without metadata like data dictionaries, you can get lost easily, he said.

Metadata is also used to create aggregations of information from multiple databases and facilitate interoperability of the information by allowing the information to be understood across multiple platforms. This is done through tagging, which allows computers to recognize when even slightly different information is actually the same, Donald said.

Metadata helps the information be understood from computer to computer with “minimal loss of content and functionality,” according to the National Information Standards Organization. Interoperability is achieved by having “a common core set of elements” that computers can recognize, allowing them to read the documents even if the documents originally came from a vastly different network or system, the organization says.

Interoperability is especially important when information is coming from government systems and it needs to be understood by a FOIA requester. Metadata is also important to determine the authenticity of documents, by determining who created the document, who edited it along the way and who altered it last.

Prior to the district court decision in National Day Laborer Organizing Network, which has been appealed to the U.S. Court of Appeals in New York City (2nd Cir.), courts had discussed access to metadata only in the context of discovery under the Federal Rules of Civil Procedure.

 

Metadata and the rules of discovery

The Sedona Conference, a nonprofit organization that focuses on complex litigation issues and has done considerable work on electronic records issues, first addressed metadata in its 2004 best practices manual. That guide, “Sedona Principles: Best Practices Recommendations and Principles for Addressing Electronic Document Production,” concluded there was little evidentiary value in metadata.

Courts addressing discovery conflicts during civil litigation repeatedly cited the Sedona Conference when they denied plaintiffs access to metadata because, as stated in the guide, parties to litigation were under “no obligation to preserve and produce metadata absent agreement of the parties.”

Issues involving such information, though not necessarily recognized as “metadata” at the time, arose as early as 1993 in Armstrong v. Executive Office of the President. The U.S. Court of Appeals for the District of Columbia Circuit held in Armstrong that paper copies of email did not satisfy requirements under the Federal Records Act, which allows original documents to be destroyed if copies exist, because paper printouts would lack “important information” like the sender, recipient and time of retrieval.

The issue in Armstrong was not directly about metadata, and the information was not identified as metadata, but the court acknowledged that information contained in electronic records does not always translate into hardcopy and that information can be important to preserving the integrity of official documents.

It was not until 2005 that a court first ordered a party to produce metadata and not until 2007 that the Sedona Conference changed its position on the value of metadata. The view of courts finally began to change in earnest after the Sedona Conference’s revisions.

From 2004 to 2007, courts generally found that litigants failed to demonstrate a need for metadata. There was a presumption that metadata was irrelevant, so parties seeking the information often failed when they could not produce a persuasive argument as to the specific need for the information. Federal courts in Wyeth v. Impax Laboratories, Inc. (2006), Kentucky Speedway, L.L.C. v. National Association of Stock Car Racing, Inc. (2006), and Michigan First Credit Union v. Cumis Insurance Society, Inc. (2007) followed the first Sedona Conference guidelines, often quoting the guide itself.

During that time frame, however, there were courts that recognized the potential value in metadata. In 2005, Williams v. Sprint/United Management Company was the first federal case where a court ordered the production of metadata. The plaintiffs in Williams received a spreadsheet during discovery in which metadata had been stripped and certain cells were locked, restricting plaintiffs’ ability to access much of the information contained within the documents.

The U.S. District Court in Kansas held that, because the discovery request was for documents “in the manner in which it was maintained,” defendants had to either provide the metadata or “show cause for its actions in scrubbing the metadata from the electronic spreadsheets prior to producing them.”

The defendants’ argument that metadata could be used to recover privileged information that had been redacted from the spreadsheet was dismissed by the court as circumstantial. The court also said defendants failed to show evidence of what privileged information could be exposed. As such, the defendants were ordered to produce the spreadsheets with metadata intact and the cells unlocked.

Using metadata to reveal redacted information, as argued by the defense, is not completely unheard-of. Metadata can be used to preserve prior versions of documents and can move with a file when it is copied.

In 2005, the same year as the Williams decision, the Pentagon posted a redacted report on its website about an incident in Iraq where a U.S. soldier accidentally killed an Italian secret service agent. Readers were able to use the document’s metadata to reveal blacked-out information in the PDF file when the portions were copied and pasted into a Word document.

The key to the Williams decision is that the court found that the phrase “as they are maintained” implied that metadata was included in the request. Courts following the Sedona Conference guidelines generally found the presumption was that metadata was not part of a discovery request unless the parties specifically included metadata in the agreement.

However, the Williams court found that requesting an electronic document “as it is maintained” should include the metadata because the removal of metadata “requires an affirmative act by the producing party that alters the electronic document.”

The 2007 edition of the Sedona Conference’s guide reflects a change in the treatment of metadata that was first expressed by the Williams court. The guide uses nearly identical language in saying that metadata is presumed part of a request for electronic documents and can only be removed upon court protective order or agreement of the parties.

 

Metadata and state records laws

Prior to February’s decision in the federal district court, a few state courts addressed the issue of metadata and public records laws. Courts in New York, Arizona and Washington have held that metadata is subject to disclosure under their state public records laws.

In February 2010, the appellate court in New York held, in Irwin v. Onondaga County Resource Recovery Agency, that “system metadata,” which the court defined as automatically created metadata that identifies the file name, sizes, creation dates and modification dates of electronically stored documents, is a public record and therefore subject to the state’s Freedom of Information Law.

The court held that the metadata in question “is at its core the electronic equivalent of notes on a file folder.” The court did not discuss whether other forms of metadata would also be considered public records under the law.
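For readers unfamiliar with what this kind of system metadata looks like in practice, the following Python sketch reads it from an ordinary file using the standard library. The file name and the `system_metadata` helper are hypothetical, and the sketch reports only name, size and modification time, because a true creation date is not available in a portable way on every operating system.

```python
import os
import tempfile
from datetime import datetime, timezone

def system_metadata(path):
    """Collect automatically recorded file system metadata for one file."""
    info = os.stat(path)
    return {
        "file_name": os.path.basename(path),
        "size_bytes": info.st_size,
        # Last modification time, converted to a readable UTC timestamp.
        "modified": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }

# Create a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "memo.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("draft")
    meta = system_metadata(path)
```

None of these values appear in the text of the memo itself. The computer records them alongside the file, much like the notes on a file folder that the court described.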

In the 2009 decision in Lake v. City of Phoenix, the Arizona Supreme Court held that, not only are electronic records subject to the state’s public records law, but the embedded data within an electronic record is also disclosable.

“When a public officer uses a computer to make a public record, the metadata forms part of the document as much as the words on a page,” the court held.

Finally, the Supreme Court of Washington held last October that metadata within a public record is a public record under the Public Records Act in O’Neill v. City of Shoreline. The court found that when the record itself is a public record, there is “no doubt” that “its embedded metadata is also a public record and must be disclosed.”

 

Metadata and federal FOIA

The federal FOIA does not specifically address the issue of metadata and when or if the data is presumed open under the law.

The court in National Day Laborer Organizing Network held that the result in a FOIA case should be the same as in other courts that have examined metadata, in discovery litigation and state records laws: “By now, it is well accepted, if not indisputable, that metadata is generally considered to be an integral part of an electronic record.”

The National Day Laborer Organizing Network requested access to documents pertaining to the Secure Communities program from several federal agencies: U.S. Immigration and Customs Enforcement, the U.S. Department of Homeland Security, the Federal Bureau of Investigation and the U.S. Department of Justice Office of Legal Counsel.

Secure Communities, which operates in 38 states and is expected to go nationwide by 2013, is a program that allows ICE to use local police records to identify persons subject to deportation.

The network requested the documents to determine if localities may “opt-out” of the program when it goes nationwide.

The court ruled that the network had requested that electronic documents be provided in their “native format” because the network specifically asked that Excel workbooks not be provided as PDF screenshots. As such, the court was satisfied that the government was on notice that metadata must be included with the electronic files.

Instead of getting the information in their native format as requested, the network received nearly 3,000 pages stripped of all metadata.

The files included PDF files of paper documents and screenshots of electronic files, merged indiscriminately with other electronic files.

There were no page numbers and, because the PDFs were screenshots of files and not text, they were unsearchable, rendering the files virtually unusable to the network for research purposes.

The standard in FOIA is that records be provided in the format requested if “readily producible by the agency in that form or format.” The court held that the standard refers to technical ability. The government did not argue it was unable to produce the records in the format desired, only that reviewing all the metadata would be burdensome, the court found. An increase in time and cost of retrieval does not mean the records are not readily producible, according to the court.

The Federal Rules of Civil Procedure should “inform highly experienced litigators as to what is expected of them when making a document production in the twenty-first century,” the court held.

The Federal Rules require that electronically stored information, stored in an electronically-searchable form, “should not be produced in a form that removes or significantly degrades” the requester’s ability to search the document. Producing “records in a form that makes it difficult or burdensome for the requesting party to use the information efficiently” is not acceptable under the Federal Rules or FOIA, the court held.

Under FOIA, metadata is “presumptively producible,” the court held. Producing records that cannot be electronically searched is “inappropriate,” the court reiterated.

The court left room for metadata to be redacted in some instances, depending on the type of record and how the records are generally maintained by the agency. The government must now rebut the presumption that metadata is readily producible.

The court does not cite specific examples of how this could be shown other than to posit that there are situations when metadata may not be maintained as part of the official record.

“It is no longer acceptable for any party, including the Government, to produce a significant collection of static images of [electronically stored information] without accompanying load files,” the court said.

 

Moving forward

The court in National Day Laborer Organizing Network placed emphasis on the fact that the plaintiffs had put the government “on notice” that metadata was part of the request. Because the government was not “on notice” from the initial request, but only from a subsequent email, the court held that only documents requested after that email must include metadata.

There’s a lesson in the court’s ruling. “Ask for [metadata] up front,” Donald said. He noted that, often, missing metadata is not necessarily intentional and some problems can be resolved through a couple of conversations. Asking for metadata from the beginning can prevent a lot of conflicts later, Donald said.

Herzog said he tells journalists to be specific in their requests and ask for the metadata. “Ask for supportive documents and anything necessary for understanding the database,” he said.

“Metadata should almost always be FOIA-able,” Donald said. “There aren’t privacy concerns . . . There’s a difference between who created the database and who ends up in the database.”

Metadata is the information telling you who created the database and who worked on it later, Donald said. “You should be able to track who created the documents,” he said.

Donald also expressed concern about the often poor quality of metadata in government documents and the general lack of metadata that he finds. While some metadata is automatically generated, other metadata is the result of technicians establishing key words and tags in the many databases.

The most important tags are those that allow a researcher to search and get responses not only to his keyword, but also to the many different abbreviations and short forms it may take within the databases, Donald said.

The problem is there is not a uniform system of tagging and there is not enough tagging done to be truly helpful for journalists and researchers.

Creating bad metadata, or not creating enough metadata, can significantly harm the functionality of documents and databases, Donald said.

“Without a tag, there’s no way to tell if information refers to the same thing. Metadata allows databases to talk to each other.”

And without tagging, it is hard for journalists to even tell if the database they are looking at contains the information they need or want, he added.

Donald commended Data.gov, a federal website established to provide a centralized location for the public to find government documents, for including metadata in the files posted on the site.

However, the recent push by the government to get documents online quickly has resulted in many documents being posted without metadata, he said.

Thus, despite the document being posted on the Internet, one may still have to file a FOIA request to truly understand information that the government already deemed open to the public, Donald said.