Microsoft Copilot continues to expose private GitHub repositories


In August 2024, a LinkedIn post caused alarm by alleging that ChatGPT (and, by association, Microsoft Copilot) was capable of accessing data from private GitHub repositories. Such a claim, if true, could have significant ramifications for data security and privacy.

Eager to uncover the truth behind the claim, the research team at Lasso, a digital security company, undertook a thorough investigation. What they found was a digital conundrum involving cached, publicly exposed, and now private data—a phenomenon they have since dubbed “Zombie Data.”

Beginning the investigation

The investigation began with the LinkedIn post, which hinted at ChatGPT potentially leveraging data from a GitHub repository that had been made private. Lasso’s team conducted a quick search, discovering that the repository in question was indexed by Bing during its public phase but was no longer accessible directly on GitHub.

When the team queried ChatGPT about the repository, it became apparent that the tool wasn’t accessing the repository directly but drawing on content that had been indexed while the repository was public. As Lasso noted, ChatGPT relies on Bing for web indexing when crafting replies, which explained the behaviour: repositories that were once public but later made private had their data captured by Bing’s cache.

However, this discovery prompted two pressing questions: What happens to the data within repositories that were turned private or deleted? And how many other repositories might be affected by this phenomenon?

A close-to-home discovery

As part of the investigation, Lasso decided to test their own systems. A quick Bing search revealed that one of their organisational repositories had been indexed despite being made private on GitHub. Internal audits showed that this repository had been mistakenly made public for a brief period before it was secured.

Testing whether the cached data was retrievable, the team probed ChatGPT. While ChatGPT could only infer the repository’s existence through Bing’s cache, it did not provide actionable data. However, another of Microsoft’s AI tools, Copilot, presented a far more concerning result.

Unlike ChatGPT, Microsoft Copilot was able to extract actual data from the time the repository was public. This suggested that Copilot was accessing a cached snapshot of the repository’s contents—the aforementioned “Zombie Data,” information users believe to be private or deleted but which remains accessible if cached by external tools or systems.

Microsoft Copilot highlights risks of ‘Zombie Data’

This revelation raised serious questions about data privacy on platforms as ubiquitous as GitHub. Key issues identified include:

  • “Zombie Data” persistence: Data that was momentarily public can remain retrievable indefinitely via caches like Bing’s, even after being set to private. In Lasso’s words: “Any information that was ever public, even for a short period, could remain accessible and distributed by Microsoft Copilot.”
  • Private code at risk: Repositories holding sensitive organisational data, especially those accidentally made public before being secured, are particularly at risk. They may contain credentials, tokens, and other critical assets that could be exploited.
  • Microsoft’s role: The issue was compounded by Microsoft Copilot’s ability to access cached snapshots via Bing. This connection raised questions about whether tools developed by the tech giant are adequately handling user safeguards, particularly given that GitHub, Bing, and Copilot are all part of Microsoft’s ecosystem.

Systematic investigation finds widespread exposure

Using Google BigQuery’s GitHub activity dataset, Lasso compiled a list of all repositories that had been public at some point during 2024 but were now set to private.

Their research workflow included the following steps:

  1. Identifying public activity: They isolated repositories that had been public but were no longer accessible, whether deleted or set to private (a minimal query sketch follows this list).
  2. Probing Bing’s cache: For each repository flagged as “missing,” the team conducted Bing searches for cached records associated with the repository.
  3. Scanning exposed data: Extracted cached data underwent analysis for sensitive information, including secrets, tokens, keys, and unlisted dependencies (see the scanning sketch after the findings below).
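Lasso has not published its exact queries, but the first step can be approximated against the public GH Archive dataset on BigQuery. The sketch below is a minimal illustration, assuming the `githubarchive.month.*` tables and a configured `google-cloud-bigquery` client: any GH Archive event implies the repository was public at that time, and a subsequent 404 from the GitHub API indicates it has since been deleted or made private.

```python
# Minimal sketch of step 1, assuming the public GH Archive dataset on BigQuery.
# Lasso's actual queries and tooling are not published.
import requests
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
SELECT DISTINCT repo.name AS repo_name
FROM `githubarchive.month.2024*`
WHERE type = 'PublicEvent'   -- any GH Archive event implies public activity;
                             -- PublicEvent narrows to repos flipped to public
LIMIT 1000
"""

candidates = [row.repo_name for row in client.query(QUERY).result()]

# A repository that was public in 2024 but now returns 404 is private or deleted.
missing = []
for name in candidates:
    resp = requests.get(f"https://api.github.com/repos/{name}")
    if resp.status_code == 404:
        missing.append(name)

print(f"{len(missing)} of {len(candidates)} repositories are no longer accessible")
```

At Lasso’s scale, the GitHub checks would need an authenticated client and batching, since unauthenticated API calls are tightly rate-limited.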

Lasso’s findings were startling:

  • 20,580 GitHub repositories were identified as accessible through Bing’s cache despite being private or deleted.  
  • 16,290 organisations were affected, including major players like Microsoft, Google, Intel, Huawei, PayPal, IBM, and Tencent.  
  • 100+ vulnerable packages and 300+ private credentials and secrets (for platforms such as GitHub, OpenAI, and Google Cloud) were exposed, illustrating the sheer depth of the issue.
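The scanning step (step 3 above) can be illustrated with a simple pattern-based pass over retrieved cache text. This is a minimal sketch rather than Lasso’s actual tooling: the GitHub and Google patterns follow well-documented token formats, while the OpenAI pattern is a loose assumption, since key formats vary.

```python
# Pattern-based secret scan over cached text; illustrative patterns only,
# not Lasso's detection rules.
import re

SECRET_PATTERNS = {
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),          # classic PAT
    "google_api_key": re.compile(r"\bAIza[0-9A-Za-z_\-]{35}\b"),
    "openai_key": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),            # assumed format
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def scan_for_secrets(text: str) -> list[tuple[str, str]]:
    """Return (pattern_name, match) pairs found in a blob of cached text."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        hits.extend((name, match) for match in pattern.findall(text))
    return hits

if __name__ == "__main__":
    sample = "config = {'token': 'ghp_" + "a" * 36 + "'}"
    for name, value in scan_for_secrets(sample):
        print(f"possible {name}: {value[:12]}...")
```

Production scanners such as Gitleaks or TruffleHog add entropy checks and hundreds of rules, but the principle is the same.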

Response from Microsoft and the partial Copilot fix

Lasso reported the vulnerability to Microsoft. While Microsoft acknowledged the issue, it categorised it as “low severity,” citing limited impact. Nevertheless, the company acted swiftly to mitigate the problem.

Within two weeks, Bing’s cached link feature was removed, and the cc.bingj.com domain – which stores cached pages – was disabled for all users. However, the fix was only surface-level. Cached results continued to appear in Bing searches and, most alarmingly, Copilot retained access to sensitive data hidden from human users.

In January 2025, Lasso tested the situation once more after learning of a GitHub repository referenced in a TechCrunch report. Although the repository had been removed by Microsoft on legal grounds, Copilot still managed to retrieve its content, reaffirming concerns that Bing-powered systems can sidestep safeguards that apply to human users.

Implications of the findings

The explosion of LLMs has introduced an entirely new threat vector to organisational data security. Unlike traditional breaches that result from leaks or hacking, Copilot’s ability to surface cached “Zombie Data” has exposed vulnerabilities that few organisations were prepared for.

Based on their research, Lasso outlined several key takeaways:  

  • Assume data is compromised once public: Organisations should treat any data that becomes public as potentially compromised forever, as it may be harnessed by indexing engines or AI systems for future training and retrieval.
  • Evolving threat intelligence: Security monitoring should extend to LLMs and AI copilots to assess whether they expose sensitive data through overly permissive retrieval behaviour.
  • Enforcing strict permissions: AI systems’ eagerness to respond can overstep boundaries, leading to oversharing. Organisations must ensure such tools respect strict permissions and access controls.
  • Foundational hygiene still matters: Despite emerging risks, basic cyber hygiene practices remain invaluable. Keeping sensitive repositories private, avoiding hardcoded tokens, and securing internal packages through official repositories are essential measures (a short example follows this list).
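As a minimal illustration of the hardcoding point, secrets can be read from the environment (or a secrets manager) at runtime, so they never enter version control and therefore never reach a search engine’s cache. The variable name `GITHUB_TOKEN` here is only an example.

```python
# Read credentials from the environment instead of committing them to the repo.
import os

# Bad: a literal token committed to the repository survives in git history and
# in any external cache of the file.
# GITHUB_TOKEN = "ghp_..."

# Better: supply the secret at runtime and fail fast if it's missing.
token = os.environ.get("GITHUB_TOKEN")
if token is None:
    raise RuntimeError("GITHUB_TOKEN is not set; refusing to start")
```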

Lasso’s findings, coupled with Microsoft’s partial response, highlight the ongoing challenge posed by “Zombie Data” and the growing influence of generative AI tools. In an era when data is king and LLMs are voracious consumers, organisations must manage every byte leaving their networks—once it’s out, it may never truly come back.

