Deleted a GitHub repository? That data might not actually be gone forever, even if the site’s UI indicates otherwise, per a recent report by Truffle Security.
According to Truffle CEO and co-founder Dylan Ayrey, the issue seems to arise from the fork function, which allows users to create downstream clones of repositories that share code and visibility settings with the original, upstream repository. Users are likely to assume forking creates a “completely separate, isolated copy” of the original repository, Ayrey told IT Brew, but few realize that a fork stays linked to the original behind the scenes.
“As it turns out, the process of creating that fork created something called the fork network,” Ayrey said. “And the fork network, the way it works under the hood—all of the different forks can share the same underlying pool data.”
When a repository is forked, it’s mirrored to the underlying pool data. Although the owner of the original repository can click a button to delete it, they can’t delete the underlying pool data, which remains accessible via any forks in perpetuity. If a fork is public, that means the underlying pool data is, too, Ayrey explained.
Here’s one way that can create a problem, according to Truffle researchers. Company A creates a repository, and Company B forks it, becoming part of Company A’s fork network. Company A continues to update the code but eventually abandons and deletes it, assuming the data is gone forever. But Company B retains access to a hidden copy of the original repository, including all updates added to the project after it was forked.
“A lot of companies had operated under the assumption that when they clicked that delete button, if the other forks hadn’t explicitly pulled that new code in, it was gone,” Ayrey said.
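To make the Company A and Company B scenario concrete, here is a minimal toy model of a fork network in Python. It is a sketch under stated assumptions, not GitHub's actual implementation: the `ForkNetwork` class, the repository names, and the placeholder key are all hypothetical, and hashing the raw text with SHA-1 is a simplification of how Git identifies commits. The only idea it borrows from the report is that forks share one underlying pool, and that deleting a repository removes references rather than data.

```python
import hashlib


class ForkNetwork:
    """Toy model: one shared commit pool, many repositories pointing into it."""

    def __init__(self):
        self.pool = {}   # commit hash -> content, shared by the whole network
        self.repos = {}  # repository name -> set of commit hashes it references

    def create(self, name):
        self.repos[name] = set()

    def fork(self, upstream, name):
        # A fork starts with the same references; no commit data is copied.
        self.repos[name] = set(self.repos[upstream])

    def commit(self, repo, content):
        # Simplification: real Git hashes a structured commit object.
        digest = hashlib.sha1(content.encode()).hexdigest()
        self.pool[digest] = content
        self.repos[repo].add(digest)
        return digest

    def delete(self, name):
        # Deleting a repository drops its references only; the pool is untouched.
        del self.repos[name]

    def fetch(self, repo, digest):
        # Any surviving repository in the network can read the pool by hash.
        if repo not in self.repos:
            raise KeyError(repo)
        return self.pool[digest]


net = ForkNetwork()
net.create("company-a/project")
net.commit("company-a/project", "initial release")
net.fork("company-a/project", "company-b/project")

# Company A keeps working after the fork, then deletes its repository.
late_commit = net.commit("company-a/project", "API_KEY=example-not-real")
net.delete("company-a/project")

# Company B never pulled the late commit, yet it is still reachable by hash.
print(net.fetch("company-b/project", late_commit))
```

In this model the late commit never appears in Company B's own reference set, which is why nothing in a normal listing would reveal it. It is reachable only by someone who already has the hash.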
The same process applies in reverse: If Company B deletes their fork, Company A can continue to access a hidden copy of Company B’s version of the data. Ayrey said forking can also create bridges between supposedly private, internal code repositories (which might contain sensitive data) and public release versions.
“We found a ton of instances from big companies where you could access internal versions of code that was never meant to be made public, that the company never realized was public,” Ayrey said. That included 40 valid API keys from deleted forks derived from just three repositories of a “large AI company,” the report stated.
Destructive actions in the network “remove references to commit data from the standard GitHub UI and normal git operations,” the researchers wrote. But other users can still access the data if they know the commit hash, which, the researchers added, can in many cases be easily queried or brute-forced.
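A rough sense of why brute-forcing can be practical comes from counting. The sketch below assumes that a commit can be addressed by a short prefix of its hash, as Git itself allows for prefixes as short as four hexadecimal characters. The article does not spell out the researchers' exact method, so the helper names and the in-memory `pool` here are hypothetical stand-ins, and nothing in the code contacts GitHub.

```python
import hashlib
import itertools

HEX_DIGITS = "0123456789abcdef"


def search_space(prefix_length):
    """Number of distinct hexadecimal prefixes of the given length."""
    return len(HEX_DIGITS) ** prefix_length


def resolve_prefix(pool, prefix):
    """Return the full hash if exactly one pooled commit starts with prefix."""
    matches = [digest for digest in pool if digest.startswith(prefix)]
    return matches[0] if len(matches) == 1 else None


def brute_force(pool, prefix_length):
    """Try every prefix of the given length and collect the hashes it resolves."""
    found = set()
    for combo in itertools.product(HEX_DIGITS, repeat=prefix_length):
        full_hash = resolve_prefix(pool, "".join(combo))
        if full_hash is not None:
            found.add(full_hash)
    return found


# Hypothetical pool of "deleted" commits, keyed by a SHA-1 of their text.
pool = {
    hashlib.sha1(text.encode()).hexdigest(): text
    for text in ("initial release", "add config", "remove config")
}

print(search_space(4))  # 16 ** 4 = 65536 candidate prefixes
recovered = brute_force(pool, 4)
print(len(recovered), "of", len(pool), "hashes recovered")
```

A four-character prefix leaves only 65,536 candidates to try, a small number for an automated client. Each added character multiplies the search space by 16.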
This is all explained in GitHub’s documentation, but it is “very surprising and counterintuitive” to users who are unaware of it, Ayrey added. He compared the functionality to the nonprofit Sunlight Foundation’s Politwoops, a project that archived public officials’ deleted tweets.
“It’s not that every time a politician deletes their tweet, something interesting is there, but often that is the case,” Ayrey said.
GitHub told IT Brew in a statement (also shared with other outlets) that everything was working as intended, referencing the company’s documentation.
“GitHub is committed to investigating reported security issues,” the statement read. “We are aware of this report and have validated that this is expected and documented behavior inherent to how fork networks work.”