Research on ML Dataset Search Trends (2019–2024)

In this study, we analyzed trends and statistics related to the search for machine learning (ML) datasets over the past five years. To do this, we selected the top 600 most popular dataset-related queries from the Semrush database. The full list of queries can be found in the file.

We collected data on search trends for these queries from Google Trends for the period January 2019 to December 2024, using worldwide data with no country-specific filtering.

The resulting dynamics allow us to identify the most popular dataset-related queries over time, as well as broader topic areas of interest.

These insights can be valuable for entrepreneurs launching businesses in the ML/AI and tech space, for ML engineers, and for educational institutions.

Key Findings

Here are the most consistent and notable trends in dataset-related searches over the past five years:

  1. Retail Sector – Strong upward trend
    • The search term “retail datasets” shows a rapid increase in interest (+602%).
    • The term “online retail dataset” grew from zero to 70 search points, indicating a sharp rise in demand for datasets related to online transactions.
  2. Medical and Healthcare Sector – Significant growth in search volume:
    • “mimic dataset” increased by +1082%.
    • “healthcare datasets” rose by +321%.
    • Other terms in this group also showed notable growth from zero, including:
      • “cardiovascular disease dataset”
      • “covid 19 dataset”
      • “public health dataset”
  3. Other Emerging High-Growth Queries – Terms that started from zero in 2019 and gained notable traction by 2024:
    • NLP datasets and frameworks:
      • “natural questions dataset” (50 search volume)
      • “alpaca dataset” (67)
      • “sharegpt dataset” (4)
    • Geospatial and Satellite Imagery:
      • “aid dataset” (58)
      • “geospatial datasets” (59)
      • “google earth engine datasets” (41)
    • Transportation and Logistics:
      • “truck dataset” (37)
      • “droid dataset” (8)
    • Emotion and Face Recognition:
      • “emotion detection dataset” (69)
    • Credit Card Fraud Detection:
      • “credit card fraud detection dataset” (59)

To sum up, over the past five years, dataset-related searches in the machine learning (ML) field have revealed several key trends reflecting evolving interests and priorities. Notably, retail datasets have experienced a strong upward trajectory: searches for “retail datasets” grew by about 600%, and “online retail dataset” rose from zero to roughly 70 search points, indicating growing demand for data related to e-commerce and online transactions. Meanwhile, medical and healthcare datasets have seen significant growth, with “mimic dataset” rising by more than 1,000% and “healthcare datasets” by more than 300%, alongside increased interest in specific health-related datasets like those for cardiovascular disease and COVID-19.

In addition to these established areas, several emerging high-growth queries have appeared since 2019, gaining traction by 2024. These include datasets focused on natural language processing (NLP), such as the “natural questions dataset” and “alpaca dataset,” as well as geospatial and satellite imagery datasets like “geospatial datasets” and “google earth engine datasets.”

Other notable growth areas include transportation and logistics datasets, emotion and face recognition datasets, and credit card fraud detection datasets.

Cluster | ID | Avg Volume 2019 | Avg Volume 2024 | % Growth
load_dataset | 1 | 0.18 | 77.63 | 42890.00%
mimic dataset | 10 | 3.92 | 46.31 | 1082.00%
Retail dataset | 8 | 10.53 | 73.92 | 602.00%
Kaggle dataset | 11 | 11.71 | 54.67 | 367.00%
healthcare datasets | 14 | 18.5 | 77.81 | 321.00%
Power BI dataset | 3 | 13.24 | 55.34 | 318.00%
IMDB dataset | 9 | 18.35 | 58.38 | 218.00%
image dataset | 2 | 29.06 | 76.79 | 164.00%
Free datasets | 13 | 27.56 | 69.67 | 153.00%
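
The article does not state how the % Growth column was computed. A minimal sketch, assuming it is the plain relative change between the 2019 and 2024 averages, reproduces several of the published figures from the rounded table values. The function name `percent_growth` is hypothetical, introduced here only for illustration.

```python
def percent_growth(avg_start: float, avg_end: float) -> float:
    """Relative change from avg_start to avg_end, in percent (assumed formula)."""
    if avg_start == 0:
        # Division by zero: growth from a zero baseline has no finite value.
        raise ValueError("growth is undefined when the starting average is zero")
    return (avg_end - avg_start) / avg_start * 100


# Rounded averages copied from the cluster table: (Avg Volume 2019, Avg Volume 2024)
clusters = {
    "Retail dataset": (10.53, 73.92),
    "healthcare datasets": (18.5, 77.81),
    "image dataset": (29.06, 76.79),
}

for name, (v2019, v2024) in clusters.items():
    print(f"{name}: {percent_growth(v2019, v2024):+.0f}%")
# Retail dataset: +602%
# healthcare datasets: +321%
# image dataset: +164%
```

Because the table values are rounded, recomputed figures can drift from the published ones, most visibly when the 2019 average is very small (as with load_dataset at 0.18). The formula is also undefined for a zero baseline, which is one reason queries that started from zero are better reported by their 2024 average than by a percentage.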

Presenting the keyword clusters over time, month by month from 2019 to 2024, reveals the following picture:

  • "load_dataset" showed the most dramatic increase, with growth of roughly 42,890%. This query represents a cluster related to Python commands used to load and visualize data from datasets (for example, "huggingface load_dataset" and "datasets load_dataset").
  • The "retail dataset" cluster (+602%) also indicates a strong and growing interest in commercial and e-commerce datasets.
  • There is notable and sustained interest in medical datasets, particularly around:
    • "mimic dataset" (+1082%)
    • "healthcare datasets" (+321%)

Zero-to-Growth Queries

We gave special attention to search queries that had zero volume in 2019 but showed meaningful search interest by 2024. These are emerging topics that gained visibility and relevance over time.

Keyword | Cluster | Avg Volume 2019 | Avg Volume 2024
adni dataset | 0 | 0 | 87.75
ncbi datasets | 15 | 0 | 86.83
llm datasets | 15 | 0 | 84.17
pip install datasets | 15 | 0 | 81.75
huggingface dataset | 1 | 0 | 81.67
hugging face dataset | 0 | 0 | 80.75
roboflow dataset | 0 | 0 | 80.08
gsm8k dataset | 0 | 0 | 79.33
hugging face datasets | 15 | 0 | 78.83
airflow dataset | 0 | 0 | 78.50
load_dataset huggingface | 1 | 0 | 77.67
huggingface load_dataset | 1 | 0 | 77.67
huggingface datasets | 15 | 0 | 77.42
datasets huggingface | 15 | 0 | 77.42
plantvillage dataset | 0 | 0 | 76.42
supply chain dataset | 0 | 0 | 75.17
automate power bi dataset refresh | 3 | 0 | 75.00
create feature for dataset | 0 | 0 | 74.67
datasets load_dataset | 1 | 0 | 72.25
dataset load_dataset | 1 | 0 | 70.67
online retail dataset | 8 | 0 | 70.17
emotion detection dataset | 0 | 0 | 69.75
sample superstore dataset | 12 | 0 | 68.25
ai training datasets | 15 | 0 | 67.33
alpaca dataset | 0 | 0 | 67.17
mmlu dataset | 0 | 0 | 66.33
common crawl dataset | 0 | 0 | 65.83
imdb reviews dataset | 9 | 0 | 61.58
credit card fraud detection dataset | 0 | 0 | 59.67
airflow datasets | 15 | 0 | 58.58
aid dataset | 0 | 0 | 58.50
real-time datasets | 15 | 0 | 58.08
from datasets import load_dataset | 16 | 0 | 57.00
geospatial datasets | 15 | 0 | 56.92
mr dataset | 0 | 0 | 56.75
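
The 2024 averages in this table are all multiples of 1/12, which suggests each one is the mean of twelve monthly Google Trends values. Under that assumption, the selection of zero-to-growth queries can be sketched as below. The function names and the `min_2024` threshold are hypothetical (the article does not define what counts as "meaningful" interest), and the monthly values are made up for illustration.

```python
def yearly_average(monthly_values):
    """Mean of the 12 monthly interest values for one year."""
    if len(monthly_values) != 12:
        raise ValueError("expected 12 monthly values")
    return sum(monthly_values) / len(monthly_values)


def zero_to_growth(series_by_query, min_2024=50.0):
    """Return (query, avg_2024) pairs whose 2019 average is zero,
    sorted by 2024 average in descending order.

    min_2024 is an assumed cutoff, not a value taken from the study.
    """
    selected = []
    for query, years in series_by_query.items():
        avg_2019 = yearly_average(years[2019])
        avg_2024 = yearly_average(years[2024])
        if avg_2019 == 0 and avg_2024 >= min_2024:
            selected.append((query, avg_2024))
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Made-up monthly values, for illustration only (not the study's data).
sample = {
    "query a": {2019: [0] * 12, 2024: [60] * 6 + [90] * 6},  # zero in 2019, high in 2024
    "query b": {2019: [0] * 12, 2024: [10] * 12},            # zero in 2019, low in 2024
    "query c": {2019: [20] * 12, 2024: [80] * 12},           # already searched in 2019
}

print(zero_to_growth(sample))
# [('query a', 75.0)]
```

Only "query a" passes both conditions: "query b" stays below the assumed cutoff, and "query c" already had search volume in 2019.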

Full list of zero-to-growth queries

Key Observations from Emerging Dataset Queries

Upon analyzing the search queries, we identified a strong presence of topics related to dataset libraries, repositories, and frameworks, as well as several thematic clusters:

Dataset Libraries, Repositories, and Frameworks:

  • huggingface dataset (81 search volume)
  • ncbi datasets (86)
  • roboflow dataset (80)
  • alpaca dataset (67)
  • pile dataset (54)
  • airflow dataset (78)
  • cleaning dataset in python (17)

NLP Datasets:

  • alpaca dataset (67)
  • mmlu dataset (66)
  • common crawl dataset (65)
  • pile dataset (54)
  • c4 dataset (51)
  • natural questions dataset (50)
  • laion dataset (46)
  • the pile dataset (41)
  • humaneval dataset (37)
  • gsm8k dataset (79)

Clinical and Medical Data:

  • adni dataset (87 search volume)
  • cdc datasets (34)
  • cardiovascular disease dataset (30)
  • covid 19 dataset (14)
  • mimic iii dataset (28)
  • public health datasets (43)

Geospatial and Satellite Imagery Data:

  • aid dataset (58)
  • geospatial datasets (59)
  • google earth engine datasets (41)

Automotive and Transportation Data:

  • truck dataset (37)
  • droid dataset (8)

Emotion and Facial Recognition:

  • emotion detection dataset (69)

Anti-Fraud and Credit Risk:

  • credit card fraud detection dataset (59)

Datasets Word Cloud 2024

If we break down all dataset-related search queries from 2024 into individual words and sort them by frequency, we get the following picture:

The most common keywords include R, health, NeurIPS, public, and other popular technologies and topics.
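
The word-frequency step described above can be sketched as follows. This is a minimal illustration, not the exact procedure used for the word cloud: the function name `word_frequencies` and the `ignore` parameter are hypothetical, and the sample queries are a handful taken from the lists in this article rather than the full 2024 query set.

```python
from collections import Counter


def word_frequencies(queries, ignore=frozenset()):
    """Split each query into lowercase words and count occurrences.

    Words listed in `ignore` are skipped, for example the ever-present
    "dataset" and "datasets".
    """
    counts = Counter()
    for query in queries:
        counts.update(word for word in query.lower().split() if word not in ignore)
    return counts


queries = [
    "public health datasets",
    "healthcare datasets",
    "geospatial datasets",
    "alpaca dataset",
    "mmlu dataset",
]

print(word_frequencies(queries).most_common(2))
# [('datasets', 3), ('dataset', 2)]
```

Since nearly every query contains "dataset" or "datasets", passing those words in `ignore` is likely needed before the more informative terms rise to the top.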
