DeepSeek AI Models Trained: Google Gemini Data Involved?
In the fast-moving world of artificial intelligence, where breakthroughs happen almost weekly, the latest controversy centers on the training data behind DeepSeek’s newest models. As of June 4, 2025, DeepSeek, the Chinese AI lab making waves with its advanced language and reasoning models, is under intense scrutiny—not just for its impressive performance on math and coding benchmarks, but for speculation that its latest models may have been trained using data from Google’s Gemini AI.
Let’s face it: the AI arms race is heating up, and everyone wants to know where the boundaries lie—especially when it comes to training data. DeepSeek’s rapid rise has been fueled by massive computational resources and a staggering 14.8 trillion tokens of training data[5]. But recent revelations have raised questions about the sources of that data, with independent developers and researchers pointing to possible influences from Google’s Gemini models. This isn’t just a technical curiosity—it’s a story about the ethics, transparency, and future of AI development.
The Rise of DeepSeek: A Brief History
DeepSeek isn’t a household name like OpenAI or Google, but in AI circles, it’s quickly becoming synonymous with cutting-edge research and resource-intensive training. Founded in China, DeepSeek built its own data center clusters from day one, giving it direct control over its training infrastructure[1]. This approach has paid off: the lab’s models have consistently ranked among the top performers on global benchmarks, especially in math and coding tasks[3].
But what really sets DeepSeek apart is the scale of its operations. Training its latest models required 2.788 million GPU hours on around 2,000 Nvidia H800 chips—a staggering investment in hardware and energy that few companies outside the tech giants can match[5]. The result? Models with advanced language comprehension and the ability to tackle complex reasoning tasks.
The Latest Development: Did DeepSeek Train on Google’s Data?
On May 28, 2025, DeepSeek released an updated version of its R1 reasoning model, dubbed R1-0528. The model quickly gained attention for its strong performance on math and coding benchmarks—but also for some curious quirks in its output[2].
Independent researchers, including Sam Paech, a Melbourne-based developer known for his “emotional intelligence” evaluations of AI, noticed that the new DeepSeek model “prefers words and expressions similar to those Google’s Gemini 2.5 Pro favors.” In a post on X, Paech speculated that DeepSeek may have switched from training on synthetic outputs from OpenAI to synthetic outputs from Gemini[2].
Another developer, the pseudonymous creator of the AI evaluation project SpeechMap, observed that the “thoughts” generated by DeepSeek’s model—the internal reasoning traces it produces as it works toward an answer—read remarkably like those generated by Google’s Gemini models[2].
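For readers curious how this kind of forensic comparison works, here is a minimal sketch of the simplest version of the idea: measuring lexical overlap between two models’ outputs. To be clear, this is an illustrative toy, not the methodology Paech or the SpeechMap developer actually used, and the sample outputs are invented for the example.

```python
# Toy sketch of a stylometric comparison between model outputs.
# NOT the researchers' actual methodology; it only illustrates the
# idea of measuring lexical overlap with cosine similarity.
from collections import Counter
import math

def word_freqs(samples):
    """Lowercased word counts pooled across a list of output samples."""
    counts = Counter()
    for text in samples:
        counts.update(text.lower().split())
    return counts

def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented samples; real analyses pool thousands of responses to
# identical prompts from each model.
model_a = ["Let us delve into the nuanced interplay of these factors."]
model_b = ["Let us delve into the nuanced landscape of these factors."]

score = cosine_similarity(word_freqs(model_a), word_freqs(model_b))
print(f"Lexical similarity: {score:.3f}")  # closer to 1.0 = more similar
```

In practice, researchers compare distributions over thousands of prompts and look for statistically unusual overlap in favored words and phrasings, but the underlying intuition is the same.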
DeepSeek has not officially disclosed the sources of its training data, leaving the door open for speculation. This isn’t the first time the company has faced such questions. In December 2024, developers noticed that DeepSeek’s V3 model sometimes identified itself as ChatGPT, OpenAI’s chatbot, suggesting that it had been trained on ChatGPT chat logs—a practice that raises ethical and legal questions about data provenance[2].
The Broader Context: Training Data and AI Ethics
The use of data from rival AI models is a contentious issue in the AI community. While it’s common for researchers to train models on publicly available datasets, using outputs from proprietary models—especially those from direct competitors—can blur the lines of fair use and intellectual property.
Let’s be real: the AI ecosystem thrives on open collaboration, but also on competition. When a model like DeepSeek’s shows signs of being trained on Google’s data, it raises questions about the boundaries of acceptable practice. Is it fair game to use outputs from another company’s model? What are the risks of “model contamination,” where one model’s biases or quirks are inherited by another?
These questions aren’t just academic. They have real-world implications for the reliability, safety, and fairness of AI systems. And as AI models become more powerful, the stakes only get higher.
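To make the idea of “model contamination” concrete, here is a minimal sketch in which a toy bigram “student” stands in for a real distilled model: train it on nothing but a teacher’s outputs, and it inherits the teacher’s verbal tics. The teacher text below is invented for illustration; real distillation pipelines are vastly more complex.

```python
# Deliberately tiny illustration of "model contamination": a bigram
# "student" trained only on a teacher model's outputs reproduces the
# teacher's pet phrasings.
import random
from collections import defaultdict

teacher_outputs = [
    "as a large language model I must note the nuanced interplay here",
    "as a large language model I cannot verify the nuanced interplay here",
]

# "Train" the student: record which word follows which in the teacher text.
transitions = defaultdict(list)
for line in teacher_outputs:
    words = line.split()
    for prev, nxt in zip(words, words[1:]):
        transitions[prev].append(nxt)

def generate(start, max_words=10):
    """Sample a continuation from the student's learned transitions."""
    out = [start]
    for _ in range(max_words):
        choices = transitions.get(out[-1])
        if not choices:
            break
        out.append(random.choice(choices))
    return " ".join(out)

print(generate("as"))  # the student parrots the teacher's verbal quirks
```

Scale that dynamic up to trillions of tokens, and it becomes clear how one model’s biases, refusal patterns, or signature phrasings could quietly propagate into another.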
DeepSeek’s Training Infrastructure and Scale
To understand why DeepSeek is a force to be reckoned with, it’s worth looking at the numbers behind its training process (a quick back-of-the-envelope check follows the list):
- Training Dataset: 14.8 trillion tokens—an enormous volume that exposes the model to a vast range of linguistic structures and nuances[5].
- GPU Hours: 2.788 million H800 GPU hours—a colossal computational effort that reflects the intensity and rigor of the training process[5].
- Hardware: Around 2,000 Nvidia H800 chips—state-of-the-art hardware that enables parallel processing and accelerated learning[5].
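Here is that back-of-the-envelope check on the reported figures, assuming (my simplification, not a reported detail) that all ~2,000 GPUs ran in parallel at full utilization:

```python
# Back-of-the-envelope arithmetic on the reported training figures.
gpu_hours = 2_788_000   # total H800 GPU hours reported for training
num_gpus = 2_000        # approximate cluster size

wall_clock_hours = gpu_hours / num_gpus  # assumes perfect parallelism
wall_clock_days = wall_clock_hours / 24

print(f"{wall_clock_hours:,.0f} hours ≈ {wall_clock_days:.0f} days of training")
# -> 1,394 hours ≈ 58 days
```

Roughly two months of uninterrupted cluster time, which is a useful way to see why this scale of training is out of reach for all but a handful of labs.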
This level of investment is rare outside the biggest tech companies, and it gives DeepSeek a significant edge in the race to develop more capable AI models.
Industry Reactions and Expert Perspectives
The AI community is divided on the issue of data provenance. Some argue that as long as the data is publicly available or synthetically generated, it’s fair game. Others worry about the risks of “model contamination” and the erosion of intellectual property rights.
Sam Paech’s observations have sparked lively debate on social media. “If you’re wondering why new deepseek r1 sounds a bit different, I think they probably switched from training on synthetic openai to synthetic gemini outputs,” he tweeted[2]. This kind of forensic analysis is becoming increasingly common as developers seek to understand the inner workings of black-box models.
Interestingly enough, DeepSeek is not alone in facing these questions. Other AI labs have been accused of using data from rival models, and the issue is likely to become more prominent as AI adoption grows.
Real-World Applications and Impacts
DeepSeek’s models are not just academic curiosities—they have real-world applications. The R1 model’s strong performance on math and coding benchmarks suggests it could be used for everything from automated tutoring to code generation and data analysis[3].
But the controversy over training data raises questions about the reliability and fairness of these applications. If a model inherits biases or quirks from another company’s model, how can users trust its outputs? And what are the legal and ethical implications for companies that deploy such models in sensitive domains like healthcare or finance?
Comparing DeepSeek and Google’s Gemini
To put the current controversy in context, here’s a quick comparison of DeepSeek’s latest model and Google’s Gemini:
| Feature/Aspect | DeepSeek R1-0528 | Google Gemini 2.5 Pro |
| --- | --- | --- |
| Training Data | 14.8 trillion tokens (sources unclear; may include Gemini outputs) | Proprietary, large-scale datasets |
| Hardware | ~2,000 Nvidia H800 GPUs | Google’s proprietary TPUs |
| Benchmark Performance | Strong in math and coding | Strong in math and coding |
| Transparency | Limited; sources not disclosed | Limited, though Google is more established in public disclosures |
| Controversy | Accusations of using rival data | None currently |
This table highlights the similarities and differences between the two models, as well as the ongoing questions about data provenance.
Future Implications: Where Do We Go From Here?
The DeepSeek controversy is a microcosm of the broader challenges facing the AI industry. As models become more powerful and more widely used, questions about data provenance, intellectual property, and ethical responsibility will only become more urgent.
For companies like DeepSeek, the path forward is clear: greater transparency about training data sources and a commitment to ethical practices. For the industry as a whole, this episode should serve as a wake-up call to establish clearer guidelines and standards for data use.
As someone who’s followed AI for years, I suspect the next big breakthrough won’t just be technical; it will be cultural. The way we think about data, collaboration, and competition in AI will shape the future of the field.
A Preview of What’s to Come
DeepSeek’s latest models are pushing the boundaries of what’s possible in AI, but not without controversy. The debate over training data sources is just beginning, and it’s sure to spark more discussion—and maybe even some new rules—in the months ahead.