
GitHub Arctic Code Vault has likely captured sensitive patient medical records from multiple healthcare facilities in a data leak attributed to MedData.

The private data was leaked last year in GitHub repositories whose contributors carry the "Arctic Code Vault" badge.

This means these repositories could now be part of a huge open-source repository collection meant to last 1,000 years.

Although the matter sits in a gray area of international copyright law and of regulations protecting patients' personally identifiable information (PII), extracting and removing the archived data could be a daunting task for anyone.

Leaked patient medical data to sit for 1,000 years in the Vault

Last year, GitHub launched an archival initiative called the Arctic Code Vault, which focused on preserving the vast majority of open-source artifacts published on the website by porting them onto physical media that could stand the test of time.

To preserve the open-source community's contributions over the last few decades, billions of lines of code from GitHub repositories, current as of February 2nd, 2020, were printed on a hardened film designed to last for a thousand years.

These rolls of film were then shipped off to the GitHub Arctic Code Vault, situated in a remote coal mine deep under an Arctic mountain in Svalbard, Norway, which is relatively close to the North Pole.

But given its popularity and vast adoption rate, GitHub has been used in all kinds of situations: from developers storing legitimate software code, to attackers abusing GitHub to host malware like Gitpaste-12, to repositories that were later found to be leaking passwords and API keys that shouldn't have made their way onto GitHub to begin with.

Should these artifacts also get their place in history?

In an ironic twist of fate, Dutch researcher Jelle Ursem, in collaboration with Dissent Doe of DataBreaches.net, discovered this could be the case with patient medical records associated with the MedData data leak.

This week, multiple medical facilities including Memorial Hermann, University of Chicago, Aspirus, OSF Healthcare, King’s Daughters, and SCL Health have come forward, issuing privacy incident and HIPAA breach notices related to the MedData PII leak.

According to these notices, confidential patient records kept by MedData, a national provider of healthcare revenue cycle management solutions, were uploaded by one of their former employees to GitHub during or before September 2019.

Although the files were removed by GitHub on December 17th, 2020, considering the Arctic Vault archive was finalized on February 2nd, 2020, the data very likely made its way into the historic collection:

Contributors to the GitHub repository containing patient data carry the Arctic Code Vault Contributor badge (Source: DataBreaches.net)

In August 2020, Ursem and Doe jointly published details on nine healthcare data leaks on GitHub that impacted the medical records of 150,000 to 200,000 patients.

Shortly afterward, the researchers identified another data leak on GitHub, which they traced to MedData.

They then informed MedData of this leak on December 10, 2020.

But it wasn't until now that impacted patients were notified by the company:

"Impacted covered entities whose patient's data was affected were notified on February 8, 2021. Letters were mailed to impacted individuals and applicable regulatory agencies on March 31, 2021," states MedData in an incident notice, which continues:

"From our investigation, it appears that impacted information may have included individuals’ names, in combination with one or more of the following data elements: physical address, date of birth, Social Security number, diagnosis, condition, claim information, date of service, subscriber ID (subscriber IDs may be Social Security numbers), medical procedure codes, provider name, and health insurance policy number."

MedData asks GitHub to remove data from vault

Last year, after Ursem informed MedData of this data leak and of the possibility that the data had slipped into GitHub's Arctic Vault, MedData contacted GitHub to ask for logs of the vault and to discuss removing the data from it, say the researchers.

"We do not know what transpired after that, although there had been some muttering that MedData might sue GitHub to get the logs," say Ursem and Doe in a report published April 1st, which the researchers wished was an April Fools' Day joke.

In 2020, Ursem had asked GitHub what would happen if a repository containing PII or other sensitive data made its way into the Arctic Code Vault.

He wondered whether GitHub could just go in and extract a single repository, or whether someone's medical data would now be part of the 1,000-year collection.

The researcher told BleepingComputer:

"GitHub indeed didn't get back to me, possibly for legal reasons. I don't even think anyone had remotely considered this might happen."

"This is actually the first occurrence of something that I noticed may have ended up in the vault, but there's no telling how much more data that's not supposed to be there is in there, because there is no public way to verify this unfortunately."

"Imagine if a current day researcher stumbled upon an archive from a thousand years ago today that detailed people's medical issues from an era, described so thoroughly."

"They would have a field day," Ursem told BleepingComputer in an email interview.

Although, realistically, nobody is likely to go through the trouble of getting to the Vault to retrieve leaked materials now purged from GitHub, the incident does raise the question of what course of action is open to GitHub and to companies when leaks such as this recent MedData one take place.

Regulations around the world, such as HIPAA, the UK Data Protection Act, and GDPR, strictly dictate how healthcare records and patient PII are supposed to be handled, and the steps that need to be taken in the event of a data breach.

Last year, GitHub removed the youtube-dl source code following a DMCA (copyright) complaint, only to reinstate it later.

But because this code is fairly old, it was very likely archived in the Arctic Code Vault, according to the criteria GitHub specified for which repositories get archived.

The Arctic Code Vault FAQ also states that repositories deleted from GitHub may not be deleted from all warm storage partners:

"Keeping a historic view is an important part of each archive. If you have a concern about your repository continuing to be a part of the archive, please contact the archives."

"For the GitHub Arctic Code Vault, we are unable to remove data that has already been stored."

But, according to GitHub, archives have a special status under GDPR, giving them some safe harbor:

"Warm storage contains more thorough information, but archives have a special legal status under GDPR which protects them. GitHub’s Legal Team has approved the Archive Program," states the FAQ section.

This indicates that copyrighted works or otherwise legally objectionable material, although removed from GitHub, could continue to sit in the remote Vault for a millennium.

"We hope that GitHub cooperated with MedData, but we raise the issue here because we will bet you that many developers and firms have never even considered what might happen that could go so very wrong," the researchers concluded in their latest report.

Update 7:46 AM ET: Changed the headline and parts of the article to make it clear it is likely patient records from the MedData leak have been archived in the Vault.
