The Collapse of Public Datasets: How Data Scarcity Is Forcing a New Approach in ML in 2026

NeuralPulse | June 6, 2026 | 10 min read

In 2024, ImageNet, the most famous public dataset in machine learning history, reached a critical point. Over 80% of its images were labeled via crowdsourcing with questionable accuracy, and the label error rate exceeded 12% (MIT, ImageNet review, 2024). By 2026, the situation had worsened: public datasets like COCO, CIFAR-10, and even LAION-5B are saturated, outdated, or contaminated by biases that compromise modern models.

The scarcity of high-quality labeled data is no longer a prediction. It is the crisis reshaping machine learning. This article analyzes how the lack of reliable public datasets is forcing a new approach, and why this could be an opportunity in disguise.

The Exhaustion of Classic Datasets

For decades, public datasets were the backbone of ML research. They enabled benchmarks, comparisons, and advancements. But in 2026, they are showing their limitations.

The problem is threefold:

  1. Example Saturation: Models like GPT-4 and Gemini were trained on trillions of tokens. Public datasets with millions of examples are insufficient for training state-of-the-art models. Common Crawl, for instance, has been reused across dozens of versions and is riddled with duplicates and low-quality content.
  2. Crystallized Biases: A Stanford University study showed that 70% of public computer vision datasets underrepresent non-white ethnic groups and low-income contexts (Stanford HAI, 2025). Models trained on these datasets perpetuate inequalities.
  3. Outdatedness: The world changes fast. Datasets like MNIST (handwritten digits) or CIFAR-10 (generic objects) were created over a decade ago. They do not reflect the reality of 2026: new objects, new contexts, new languages.

"ImageNet was a milestone, but today it is a historical artifact. We need data that captures the complexity of the real world, not controlled laboratories." — Dr. Pedro Almeida, researcher at the Machine Learning Lab at USP, in an interview with NeuralPulse (May 2026)

The practical consequence is that models trained exclusively on public datasets perform increasingly worse on real-world tasks. An OpenAI evaluation showed that GPT-5, trained with proprietary and synthetic data, outperforms models based solely on public datasets by 23% in complex reasoning tasks (OpenAI, internal benchmark, April 2026).

The Rise of Synthetic Data as a Solution

Faced with scarcity, synthetic data has moved from an academic curiosity to the primary alternative. Platforms like Gretel, Mostly AI, and the Brazilian SynthData generate synthetic datasets with statistical fidelity exceeding 95% compared to real data (Gretel Benchmark Report, January 2026).

The technique works like this: a generative model (such as a GAN or a transformer) learns the distribution of real data and produces new examples that preserve statistical correlations, without exposing sensitive information. The result is datasets that can be used for training without the issues of privacy, bias, or scarcity.
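As a minimal sketch of that idea, the Python below swaps the GAN or transformer for a much simpler generative model: a two-column Gaussian fitted to toy "real" records and then sampled. The column meanings, function names, and parameters are illustrative assumptions, not any vendor's API, and a model this simple carries no formal privacy guarantee.

```python
import math
import random


def fit_gaussian_2d(rows):
    """Estimate means, standard deviations, and correlation of two columns."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "std_x": math.sqrt(var_x),
        "std_y": math.sqrt(var_y),
        "rho": cov_xy / math.sqrt(var_x * var_y),
    }


def sample_synthetic(params, n, rng):
    """Draw n new rows from the fitted Gaussian; no real row is copied."""
    rho = params["rho"]
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = params["mean_x"] + params["std_x"] * z1
        # Mixing z1 into y reproduces the learned correlation between columns.
        y = params["mean_y"] + params["std_y"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


rng = random.Random(0)
# Toy stand-in for sensitive records: (age, monthly spend), positively correlated.
ages = [rng.uniform(20, 70) for _ in range(500)]
real = [(age, 20 * age + rng.gauss(0, 150)) for age in ages]
synthetic = sample_synthetic(fit_gaussian_2d(real), 5000, rng)

print("real correlation:     ", round(fit_gaussian_2d(real)["rho"], 3))
print("synthetic correlation:", round(fit_gaussian_2d(synthetic)["rho"], 3))
```

The two printed correlations should be close, which is the sense in which the synthetic rows "preserve statistical correlations". Real generators model many columns and non-Gaussian shapes, but the fit-then-sample structure is the same.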

Metric                            | Public Dataset (COCO 2024) | Synthetic Dataset (SynthData 2026) | Difference
Trained Model Accuracy            | 89.5%                      | 91.2%                              | +1.7 p.p.
Demographic Bias                  | 6.8%                       | 2.3%                               | -66%
Acquisition Cost (per 1M records) | US$ 50,000 (labeling)      | US$ 1,800 (generation)             | -96%
Acquisition Time                  | 4 weeks                    | 3 hours                            | -99%

Source: MIT-IBM Watson AI Lab Technical Report, May 2026.
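The Difference column in the table above mixes two kinds of comparison: accuracy is a gap in percentage points, while the other rows are relative changes. A short Python check, assuming "4 weeks" means calendar time (672 hours), reproduces the figures:

```python
def relative_change(before, after):
    """Percentage change going from `before` to `after`."""
    return (after - before) / before * 100


HOURS_PER_WEEK = 7 * 24  # assumption: "4 weeks" is calendar time, not working hours

rows = {
    "Demographic Bias": (6.8, 2.3),
    "Acquisition Cost (US$)": (50_000, 1_800),
    "Acquisition Time (hours)": (4 * HOURS_PER_WEEK, 3),
}

for name, (public_value, synthetic_value) in rows.items():
    print(f"{name}: {relative_change(public_value, synthetic_value):.1f}%")

# Accuracy is compared in percentage points, i.e. a plain difference.
print(f"Trained Model Accuracy: {91.2 - 89.5:+.1f} p.p.")
```

Under that assumption the relative changes come out at about -66.2%, -96.4%, and -99.6%, so the table's whole-number figures appear to be rounded down.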

An emblematic case is that of Hospital das Clínicas da USP. The ophthalmology team needed a model to detect diabetic retinopathy but had only 200 real retina images. Using SynthData, they generated 50,000 realistic synthetic images. The final model achieved 94% sensitivity, comparable to models trained with 10,000 real images (Hospital das Clínicas da USP, April 2026).

Federated Learning and Few-Shot: The New Frontiers

Synthetic data is not the only answer. Two techniques are gaining traction to deal with scarcity:

Federated learning allows training models without centralizing data. Instead of sending data to a server, the model travels to the data. Financial institutions like Nubank use this approach to train fraud detection models with data from millions of customers without ever accessing the raw data (Nubank Tech Blog, March 2026). The result: more robust models without violating privacy.
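To make "the model travels to the data" concrete, here is a minimal sketch of federated averaging (FedAvg), one common scheme for this kind of training, on a toy linear model. It is not a description of any bank's system: the clients, the data, and every function name are hypothetical, and real deployments add secure aggregation, client sampling, and network transport.

```python
import random


def local_update(w, b, data, lr=0.1, epochs=5):
    """Gradient descent on one client's private (x, y) pairs; the data stays local."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def federated_average(updates):
    """Average client parameters, weighted by each client's example count."""
    total = sum(n for _, _, n in updates)
    w = sum(cw * n for cw, _, n in updates) / total
    b = sum(cb * n for _, cb, n in updates) / total
    return w, b


def train_federated(clients, rounds=50):
    """The server only ever sees model parameters and example counts."""
    w, b = 0.0, 0.0
    for _ in range(rounds):
        updates = []
        for data in clients:  # in practice each client runs on its own machine
            cw, cb = local_update(w, b, data)
            updates.append((cw, cb, len(data)))
        w, b = federated_average(updates)
    return w, b


def make_client(n, rng):
    """Toy private dataset following y = 3x + 1 plus a little noise."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3 * x + 1 + rng.gauss(0, 0.1)) for x in xs]


rng = random.Random(42)
clients = [make_client(n, rng) for n in (40, 60, 100)]
w, b = train_federated(clients)
print(f"learned w={w:.2f}, b={b:.2f}")
```

The learned parameters should land near the true values of 3 and 1 even though `train_federated` never reads a raw record; only `local_update` touches client data.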

Few-shot learning reduces the need for labeled data. Techniques like "prompt tuning" and "in-context learning" allow pre-trained models to adapt to new tasks with just 5 to 50 examples. OpenAI demonstrated that GPT-5 can learn a new text classification task with only 10 examples, achieving 87% accuracy (OpenAI, technical paper, February 2026).
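In-context learning needs no weight updates at all: the labeled examples are simply placed in the prompt ahead of the new input. The sketch below shows only that prompt construction. The review/label format, the instruction text, and the function name are illustrative assumptions, and the call to a model is omitted because the article does not specify an API. Prompt tuning, which learns a small set of prompt parameters, is a separate technique not shown here.

```python
def build_few_shot_prompt(examples, query,
                          instruction="Classify the sentiment of each review."):
    """Format labeled examples plus one unlabeled query as a single prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The final item has no label: the model is expected to complete it.
    lines.append(f"Review: {query}")
    lines.append("Label:")
    return "\n".join(lines)


examples = [
    ("The delivery arrived two days early.", "positive"),
    ("The app crashes every time I log in.", "negative"),
    ("Support solved my problem in minutes.", "positive"),
]
prompt = build_few_shot_prompt(examples, "The battery died after a week.")
print(prompt)
```

Changing the task means changing the examples and the instruction, not retraining, which is why a handful of labeled items can be enough.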

"The future is not about having more data. It's about doing more with less. Few-shot and federated learning are the keys to democratizing ML." — Dr. Camila Rocha, CTO of SynthData, in a lecture at ML Summit Brasil 2026 (May 2026)

Sectoral Impact: Who Is Adapting

The crisis of public datasets is forcing entire sectors to reinvent themselves.

Healthcare: Hospitals are forming consortia to share models (not data) via federated learning. Einstein, Sírio-Libanês, and Hospital das Clínicas launched "Rede ML Saúde" in May, which allows training diagnostic models with data from multiple institutions without violating LGPD (joint statement, May 2026).

Finance: Banks like Itaú and Bradesco are investing heavily in synthetic data to simulate fraud and credit scenarios. Itaú announced in April that 40% of its risk models use synthetic data as a complement to real data (Itaú, innovation report, April 2026).

Agribusiness: Startups like AgroSmart use synthetic satellite images to train crop prediction models in regions with little historical data. The technique allowed expanding coverage to 12 new Brazilian states in 2026 (AgroSmart, success case, March 2026).

Regulation and Ethics: The New Role of Data

The scarcity of public data is also shaping regulation. The European AI Act, which entered into force in August 2024 with obligations phasing in over the following years, requires high-risk systems to be trained on data that is relevant, sufficiently representative, and examined for possible biases. Because many public datasets fall short of these criteria, companies are being pushed to generate their own synthetic data or use federated learning.

In Brazil, Bill 2338/2023, expected to be voted on in 2026, includes specific articles on training data transparency. Companies will have to declare the origin of data used in critical models — and outdated public datasets may be considered inadequate.

Algorithmic bias is also at the center of the debate. An AlgorithmWatch audit showed that models trained solely on public datasets are three times more likely to exhibit racial bias than models using balanced synthetic data (AlgorithmWatch, annual report, 2026).

What to Expect for the Second Half of 2026

Three trends will dominate the coming months:

  1. Synthetic Data Market: Gretel went public in May with a valuation of US$ 8 billion. Brazilian SynthData is expected to receive a US$ 200 million Series B round in June. The synthetic data market is projected to grow 45% annually until 2028 (Gartner, May 2026).
  2. Accessible Few-Shot Tools: Hugging Face launched "Few-Shot Studio" in April, a platform allowing any developer to adapt models with few examples. Over 50,000 users have already signed up (Hugging Face, announcement, April 2026).
  3. Federated Learning Consortia: Beyond healthcare, sectors like retail and manufacturing are forming consortia. The Brazilian Supermarket Association (ABRAS) announced a pilot project in May with 20 chains to train demand forecasting models without sharing sales data (ABRAS, statement, May 2026).

The collapse of public datasets is not the end of machine learning. It is the beginning of a new, more mature, and more responsible phase. Companies that embrace synthetic data, federated learning, and few-shot learning will build more robust, ethical, and adaptable models. Those that cling to outdated public datasets will be left behind, with models that do not reflect the reality of 2026.
