Nov 25, 2022 7:00 AM

Redacted Documents Are Not as Secure as You Think

Popular redaction tools don’t always work as promised, and new attacks can reveal hidden information, researchers say.

For years, if you wanted to protect sensitive text in a document, you could grab a pair of scissors or a scalpel and cut out the information. If this didn’t work, a chunky black marker pen would do the job. Now that most documents are digitized, securely redacting their contents has become harder. The majority of redactions—by government officials and courts—involve placing black boxes over text in PDFs.

When this redaction is done incorrectly, people’s safety and national security can be put at risk. New research from a team at the University of Illinois looked at the most popular tools for redacting PDF documents and found many of them wanting. The findings, from researchers Maxwell Bland, Anushya Iyer, and Kirill Levchenko, say two of the most popular tools for redacting documents offer no protection to the underlying text at all, with the text accessible by copying and pasting it. Plus, a new attack method they devised makes it possible to extract secret details from the redacted text.

The flaws aren’t just theoretical. After examining millions of publicly available documents with blacked-out redactions—including from the US court system, the US Office of the Inspector General, and Freedom of Information Act requests—the researchers found thousands of documents that exposed people’s names and other sensitive details. “I’ve been in lots of discussions with the US court system, I provided them 710 different documents that were just trivial copy-paste style redactions,” says Bland, the paper’s lead author.

Officials usually redact sections of text in documents because those parts contain people’s personal information, or they decide the information shouldn’t be released to protect an organization’s interests. Court documents may redact names of confidential informants or whistleblowers; policy documents may redact information that could damage national security if it is made public.

During the new research, which has been published as a preprint, the team analyzed 11 popular redaction tools. They discovered that PDFzorro and PDFescape Online allowed full access to text that had allegedly been redacted. All they needed to do to access the text was copy and paste it. The researchers registered CVE numbers—used to catalog unique security vulnerabilities—for both of the issues.

PDFzorro did not respond to WIRED’s request for comment. When we tested the tool, it was possible to access PDFzorro redactions by highlighting them. However, if you click on an option to “lock” the PDF before you download it, the text can’t be accessed. Meanwhile, a customer service representative from PDFescape Online said the software has been recently acquired by a new company and they have “rolled out an update for PDFescape Online” that includes security fixes. “The mentioned redaction tool has been removed and will be reworked to be fully compliant,” they said.

The Illinois research goes further than copy and paste. It also demonstrates a new way to attack PDF documents and use hidden fingerprints to reveal names that have been redacted. The team focused on names, Bland says, as they are commonly redacted and sensitive. It does not appear possible to unredact large blocks of text, the researchers say. To reveal people’s names, the team built a tool, dubbed Edact-Ray, that can “identify, break, and fix redaction information leaks.”

“Even if you do the redaction, supposedly correctly, even if you remove the text, there’s a lot of latent information that is dependent on the content that was redacted, and even that can leak information,” Levchenko says. “If you redact a name in a PDF, if the attacker has any context—they know this is an American—they will be able to, with high probability, either recover that name or narrow it down to a very small list of candidates.”

Edact-Ray focuses on the size of glyphs (broadly, characters or letters) and their positioning. “It’s pretty clear to a lot of people that the letter ‘L’ is skinnier than a letter ‘M,’ and that if you redacted just the letter ‘L,’ then you might be able to tell it is different from a redaction with just the letter ‘M,’” Bland says. The tool is essentially able to automatically compare the size of the redaction and the position of the letters with a predefined “dictionary” of words to estimate what has been replaced.
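
To make the width comparison concrete, here is a minimal sketch in Python. It is not the researchers' tool. The glyph widths, the name list, and the function names are all hypothetical, and it ignores capital letters, kerning, and the positioning shifts the real attack relies on. It only shows how the width of a black box can rule out most names in a dictionary.

```python
# Hypothetical advance widths in arbitrary units. Real fonts define their
# own per-glyph widths; these numbers are made up for illustration.
NARROW_AND_WIDE = {"i": 2, "j": 2, "l": 2, "f": 3, "r": 3, "t": 3, "m": 8, "w": 7}
DEFAULT_WIDTH = 5  # assumed width for every letter not listed above


def text_width(word):
    """Sum the assumed widths of each letter (case is ignored here)."""
    return sum(NARROW_AND_WIDE.get(ch, DEFAULT_WIDTH) for ch in word.lower())


def candidates(redaction_width, dictionary, tolerance=0):
    """Return the names whose rendered width matches the redaction box."""
    return [
        name
        for name in dictionary
        if abs(text_width(name) - redaction_width) <= tolerance
    ]


# A hypothetical dictionary of names an attacker might try.
NAMES = ["Lee", "Kim", "Amy", "Ann", "Emma", "Jill"]

print(candidates(26, NAMES))  # only one name is this wide
print(candidates(15, NAMES))  # two names share this width
```

With these made-up widths, a box 26 units wide fits only "Emma," while a box 15 units wide narrows the list to "Kim" and "Ann," which mirrors the two outcomes Levchenko describes: recovering the name outright or cutting the options to a short list.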

The software works by inferring how the original document was produced (for instance, in Microsoft Word) and then reverse engineering the specifics of the document. “That tells us about how the text was laid out,” Levchenko says. “Once we know that, we have a model for how that tool laid out the text and how and what information it deposited throughout the rest of the document.” From here, it is possible to simulate what the original text may have been and produce a list of likely matches. During testing, the team was able to eliminate 80,000 guesses per second.

“We found, for example, that redacting a surname from a PDF generated by Microsoft Word set using 10-point Calibri leaves enough residual information to uniquely identify the name in 14 percent of all cases,” the team’s research paper concludes, adding that this is likely to be a “lower bound on the extent of vulnerable redactions.”

Daniel Lopresti, a professor of computer science at Lehigh University who has studied redaction techniques, says the research is impressive. It “presents a comprehensive study of redaction tools and the ways in which they can be broken, including exploiting nearly invisible aspects of a document’s typography,” says Lopresti, who was not involved with the research. “The picture it paints is scary; too often redaction is done badly.”

The vast majority of the organizations impacted by real-world redaction failures highlighted in the research—including the US Department of Justice, the US courts system, the Office of Inspector General, and Adobe—did not respond to WIRED’s request for comment. Bland and the research paper say that many of the organizations have engaged with the team’s research.

Microsoft did not address data being leaked from Word documents that are converted to PDFs. “Customers can save a document as a PDF, but it is the role of the redaction tool to censor or obscure information,” says Jeff Jones, a senior director at Microsoft. Jones adds that people should “review” data and their files before converting them to a format that is going to be shared.

Meanwhile, Mike Lissner, executive director of the Free Law Project, a nonprofit that helps open up court data and provided access to legal documents for the research, says the organization has developed a system that can help identify badly redacted documents. “This works well, but by the time a document is published in a court’s filing system, the secret is out, so we’re working on tools that will integrate with document management systems that lawyers use,” Lissner says.

Digital document redaction has proved challenging for years, with countless examples of failures to properly secure sensitive information. Sometimes it is human error; other times, technical failings are at fault. “It’s hard to redact something as complicated as a PDF to completely remove the information,” Levchenko says. PDFs can contain text, images, tables, metadata, and more.

Multiple high-profile redaction failures have exposed information that someone wanted to keep secret. These have involved mistakes in the redaction process, failure to properly protect the information, and the inclusion of enough details to allow people to decipher what the redactions were meant to be.

For instance, in 1991 researchers used a “desktop computer” to reverse engineer the Dead Sea Scrolls to reveal their full text and open the documents up to more people. Back in 2008, details about secret wiretapping agreements between the US government and telecoms firms could be accessed using copy and paste. In 2016, Edward Snowden was revealed as the target of US spying following a failure to redact his personal details. In October 2020, journalists were able to decipher redactions in Ghislaine Maxwell’s court deposition. And in February 2021, the European Commission published a version of its Covid-19 contract for the AstraZeneca vaccine that it didn’t properly redact.

When it comes to effectively redacting documents and protecting people’s information, the Illinois researchers hope their work will highlight another way PDFs can be attacked and encourage the creators of software to include measures that prevent hidden information from being leaked. They say that for now the NSA’s guidelines for redacting documents are perhaps the best way to protect redactions. The guide says if you redact Word documents, you should change the content of the original document before redacting the resulting PDF. Change someone’s name to a row of “x” characters or the word “redacted,” just to be safe.
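
The advice to change the source text before converting it can be sketched in a few lines of Python. This is an illustration only, not code from the NSA guide or the researchers. The function name and the choice of a fixed-length placeholder are assumptions, made so that the replacement's width says nothing about the length of the name it replaced.

```python
import re


def scrub(text, sensitive_terms, placeholder="xxxxxxxx"):
    """Replace each sensitive term with the same fixed placeholder.

    Hypothetical helper: run it on the source text before exporting to
    PDF, so the layout is computed from the placeholder, not the name.
    """
    for term in sensitive_terms:
        # re.escape keeps names containing characters such as "." literal.
        text = re.sub(re.escape(term), placeholder, text, flags=re.IGNORECASE)
    return text


print(scrub("Agent Jill met Emma.", ["Jill", "Emma"]))
```

Because both names become the same eight-character row, any black box later drawn over them in the PDF covers text of identical width.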
