
Users will share data for AI training with GitHub Copilot, unless they opt out

GitHub has announced that, starting April 24, it will change how it uses data for its AI assistant Copilot. Interaction data from users of Copilot Free, Pro, and Pro+ will be used by default to train and improve AI models unless users explicitly opt out. The change does not apply to Copilot Business or Copilot Enterprise.

Neowin notes that, in practice, this amounts to an opt-out model: users who take no action before April 24 will be automatically enrolled in the training program. This shifts the responsibility onto users to actively adjust their privacy settings, which could spark debate over transparency and informed consent.

With this move, GitHub, a Microsoft subsidiary, is following a broader trend in the AI sector, where real-world data is becoming increasingly important for improving model performance. According to the company, training on real interactions leads to more accurate, context-aware suggestions, intended to help developers write code more efficiently and securely.

The data GitHub intends to use includes, among other things, Copilot prompts and output, code snippets, context around the cursor position, and user feedback on suggestions. Information such as file structures and interactions with features like chat and inline suggestions may also be included. In effect, this covers virtually every interaction a user has with Copilot.

Distinction Between Stored and Active Data

Notably, GitHub explicitly distinguishes between data at rest and active interactions. Content from private repositories is not used unless it is actively processed via Copilot: once a user invokes Copilot within a private repository, that interaction data may be used for model training, unless the user has opted out.

Users who do not want their data to be used can disable this in their privacy settings. GitHub states that existing preferences will be respected: users who previously chose not to share data for product improvement will automatically remain excluded from the new training program.

The decision is partly based on earlier experiments within Microsoft, where employee interaction data was already used to improve models. According to the company, this led to higher acceptance rates for suggestions and better performance across various programming languages, and it expects that expanding to a broader user group will reinforce this trend.

In addition, Microsoft emphasizes that the collected data may be shared with affiliated companies within its own organization, but not with external AI model providers. In doing so, the company aims to alleviate concerns about data sharing with third parties. Nevertheless, the use of developer data for training purposes remains a sensitive topic.

GitHub states that the future of AI-assisted software development depends on real-world input. By training models with actual development workflows, the company aims to further position Copilot as a reliable and productive assistant for programmers.
