In this study, we analyzed trends and statistics related to searches for machine learning (ML) datasets over the past five years. To do this, we selected the 600 most popular dataset-related queries from the Semrush database. The full list of queries can be found in the file.
We collected data on search trends for these queries from Google Trends for the period January 2019 to December 2024, using worldwide data with no country-specific filtering.
The resulting dynamics allow us to identify the most popular dataset-related queries over time, as well as broader topic areas of interest.
These insights can be valuable for entrepreneurs launching businesses in the ML/AI and tech space, ML engineers, as well as educational institutions.
Key Findings
Here are the most consistent and notable trends in dataset-related searches over the past five years:
- Retail Sector – Strong upward trend:
  - The search term “retail datasets” shows a rapid increase in interest (+602%).
  - The term “online retail dataset” grew from zero to 70 search points, indicating a sharp rise in demand for datasets related to online transactions.
- Medical and Healthcare Sector – Significant growth in search volume:
  - “mimic dataset” increased by +1082%.
  - “healthcare datasets” rose by +321%.
  - Other terms in this group also showed notable growth from zero, including:
    - “cardiovascular disease dataset”
    - “covid 19 dataset”
    - “public health dataset”
- Other Emerging High-Growth Queries – Terms that started from zero in 2019 and gained notable traction by 2024:
  - NLP datasets and frameworks:
    - “natural questions dataset” (50 search volume)
    - “alpaca dataset” (67)
    - “sharegpt dataset” (4)
  - Geospatial and Satellite Imagery:
    - “aid dataset” (58)
    - “geospatial datasets” (59)
    - “google earth engine datasets” (41)
  - Transportation and Logistics:
    - “truck dataset” (37)
    - “droid dataset” (8)
  - Emotion and Face Recognition:
    - “emotion detection dataset” (69)
  - Credit Card Fraud Detection:
    - “credit card fraud detection dataset” (59)
To sum up, over the past five years, dataset-related searches in the machine learning (ML) field have revealed several key trends reflecting evolving interests and priorities. Notably, retail datasets have experienced a strong upward trajectory: searches for “retail datasets” increased by about 600%, and “online retail dataset” grew from zero to roughly 70 search points, indicating growing demand for data related to e-commerce and online transactions. Meanwhile, medical and healthcare datasets have seen significant growth, with “mimic dataset” rising by more than 1,000% and “healthcare datasets” by more than 300%, alongside increased interest in specific health-related datasets like those for cardiovascular disease and COVID-19.
In addition to these established areas, several emerging high-growth queries have appeared since 2019, gaining traction by 2024. These include datasets focused on natural language processing (NLP), such as the “natural questions dataset” and “alpaca dataset,” as well as geospatial and satellite imagery datasets like “geospatial datasets” and “google earth engine datasets.”
Other notable growth areas include transportation and logistics datasets, emotion and face recognition datasets, and credit card fraud detection datasets.
Cluster | Cluster ID | Avg Volume 2019 | Avg Volume 2024 | % Growth |
---|---|---|---|---|
load_dataset | 1 | 0.18 | 77.63 | 42890.00% |
mimic dataset | 10 | 3.92 | 46.31 | 1082.00% |
Retail dataset | 8 | 10.53 | 73.92 | 602.00% |
Kaggle dataset | 11 | 11.71 | 54.67 | 367.00% |
healthcare datasets | 14 | 18.5 | 77.81 | 321.00% |
Power BI dataset | 3 | 13.24 | 55.34 | 318.00% |
IMDB dataset | 9 | 18.35 | 58.38 | 218.00% |
image dataset | 2 | 29.06 | 76.79 | 164.00% |
Free datasets | 13 | 27.56 | 69.67 | 153.00% |
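The % Growth column can be approximated from the two average-volume columns. The snippet below is a minimal sketch, assuming growth is the simple percentage change between the 2019 and 2024 averages; the function name `percent_growth` is ours, not part of the study's tooling. Because the table shows rounded averages, recomputed values can differ from the published ones by about a percentage point.

```python
def percent_growth(avg_start, avg_end):
    """Percentage change from avg_start to avg_end.

    Growth from a zero baseline has no finite percentage, so it is
    reported as infinity (or 0.0 when both averages are zero).
    """
    if avg_start == 0:
        return float("inf") if avg_end > 0 else 0.0
    return (avg_end - avg_start) / avg_start * 100.0


# Rounded (Avg Volume 2019, Avg Volume 2024) pairs copied from the table.
clusters = {
    "mimic dataset": (3.92, 46.31),
    "Retail dataset": (10.53, 73.92),
    "healthcare datasets": (18.5, 77.81),
}

growth = {name: percent_growth(a, b) for name, (a, b) in clusters.items()}

# Print clusters from fastest- to slowest-growing.
for name, pct in sorted(growth.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {pct:+.0f}%")
```

Queries that start from a zero average are better handled separately, since a percentage is not meaningful for them.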
Plotting the keyword clusters month by month from 2019 to 2024 reveals the following picture:
- "load_dataset" showed the most dramatic increase, growing by about 42,890%. This query represents a cluster related to Python commands used to load and visualize data from datasets.
- The "retail dataset" cluster (+602%) also indicates a strong and growing interest in commercial and e-commerce datasets.
- There is notable and sustained interest in medical datasets, particularly around:
  - "mimic dataset" (+1082%)
  - "healthcare datasets" (+321%)
Zero-to-Growth Queries
We paid special attention to search queries that had zero volume in 2019 but showed meaningful search interest by 2024. These are emerging topics that gained visibility and relevance over time.
Keyword | Cluster ID | Avg Volume 2019 | Avg Volume 2024 |
---|---|---|---|
adni dataset | 0 | 0 | 87.75 |
ncbi datasets | 15 | 0 | 86.83 |
llm datasets | 15 | 0 | 84.17 |
pip install datasets | 15 | 0 | 81.75 |
huggingface dataset | 1 | 0 | 81.67 |
hugging face dataset | 0 | 0 | 80.75 |
roboflow dataset | 0 | 0 | 80.08 |
gsm8k dataset | 0 | 0 | 79.33 |
hugging face datasets | 15 | 0 | 78.83 |
airflow dataset | 0 | 0 | 78.50 |
load_dataset huggingface | 1 | 0 | 77.67 |
huggingface load_dataset | 1 | 0 | 77.67 |
huggingface datasets | 15 | 0 | 77.42 |
datasets huggingface | 15 | 0 | 77.42 |
plantvillage dataset | 0 | 0 | 76.42 |
supply chain dataset | 0 | 0 | 75.17 |
automate power bi dataset refresh | 3 | 0 | 75.00 |
create feature for dataset | 0 | 0 | 74.67 |
datasets load_dataset | 1 | 0 | 72.25 |
dataset load_dataset | 1 | 0 | 70.67 |
online retail dataset | 8 | 0 | 70.17 |
emotion detection dataset | 0 | 0 | 69.75 |
sample superstore dataset | 12 | 0 | 68.25 |
ai training datasets | 15 | 0 | 67.33 |
alpaca dataset | 0 | 0 | 67.17 |
mmlu dataset | 0 | 0 | 66.33 |
common crawl dataset | 0 | 0 | 65.83 |
imdb reviews dataset | 9 | 0 | 61.58 |
credit card fraud detection dataset | 0 | 0 | 59.67 |
airflow datasets | 15 | 0 | 58.58 |
aid dataset | 0 | 0 | 58.50 |
real-time datasets | 15 | 0 | 58.08 |
from datasets import load_dataset | 16 | 0 | 57.00 |
geospatial datasets | 15 | 0 | 56.92 |
mr dataset | 0 | 0 | 56.75 |
Full list of zero-to-growth queries
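The selection of zero-to-growth queries can be sketched as a filter over monthly interest values. This is an illustrative sketch only: the `"YYYY-MM"` key format, the helper names, the `min_end_avg` threshold, and the sample series are all assumptions made for the example, not the study's actual pipeline or data.

```python
from statistics import mean


def yearly_average(monthly, year):
    """Average of the monthly values whose 'YYYY-MM' key falls in `year`."""
    values = [v for month, v in monthly.items() if month.startswith(f"{year}-")]
    return mean(values) if values else 0.0


def zero_to_growth(series, start_year=2019, end_year=2024, min_end_avg=1.0):
    """Keywords averaging zero in start_year and at least min_end_avg in end_year.

    min_end_avg is a hypothetical cutoff for "meaningful" interest.
    """
    result = {}
    for keyword, monthly in series.items():
        start_avg = yearly_average(monthly, start_year)
        end_avg = yearly_average(monthly, end_year)
        if start_avg == 0 and end_avg >= min_end_avg:
            result[keyword] = end_avg
    # Highest end-year average first, as in the table above.
    return dict(sorted(result.items(), key=lambda kv: kv[1], reverse=True))


def flat_year(year, value):
    """Helper for the demo: twelve identical monthly values for one year."""
    return {f"{year}-{m:02d}": value for m in range(1, 13)}


# Made-up series for illustration; these are not values from the study.
series = {
    "example new dataset": {**flat_year(2019, 0), **flat_year(2024, 60)},
    "example steady dataset": {**flat_year(2019, 20), **flat_year(2024, 40)},
}

emerging = zero_to_growth(series)
print(emerging)
```

Only the first example keyword passes the filter, because the second already had non-zero interest in 2019.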
Key Observations from Emerging Dataset Queries
Upon analyzing the search queries, we identified a strong presence of topics related to dataset libraries, repositories, and frameworks, as well as several thematic clusters:
Dataset Libraries, Repositories, and Frameworks:
- huggingface dataset (81 search volume)
- ncbi datasets (86)
- roboflow dataset (80)
- alpaca dataset (67)
- pile dataset (54)
- airflow dataset (78)
- cleaning dataset in python (17)
NLP Datasets:
- alpaca dataset (67)
- mmlu dataset (66)
- common crawl dataset (65)
- pile dataset (54)
- c4 dataset (51)
- natural questions dataset (50)
- laion dataset (46)
- the pile dataset (41)
- humaneval dataset (37)
- gsm8k dataset (79)
Clinical and Medical Data:
- adni dataset (87 search volume)
- cdc datasets (34)
- cardiovascular disease dataset (30)
- covid 19 dataset (14)
- mimic iii dataset (28)
- public health datasets (43)
Geospatial and Satellite Imagery Data:
- aid dataset (58)
- geospatial datasets (59)
- google earth engine datasets (41)
Automotive and Transportation Data:
- truck dataset (37)
- droid dataset (8)
Emotion and Facial Recognition:
- emotion detection dataset (69)
Anti-Fraud and Credit Risk:
- credit card fraud detection dataset (59)
Datasets Word Cloud 2024
If we break down all dataset-related search queries from 2024 into individual words and sort them by frequency, we get the following picture:
The most common keywords include R, health, NeurIPS, public, and other popular technologies and topics.
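The word breakdown described above can be sketched with a simple counter. This is a minimal illustration, assuming each query counts once per word regardless of its search volume; the tokenization rule and the `STOPWORDS` set are our own choices, and the sample queries are a handful taken from the lists in this article rather than the full 2024 query set.

```python
import re
from collections import Counter


def word_frequencies(queries):
    """Count how often each lowercase word appears across all queries."""
    counts = Counter()
    for query in queries:
        # Keep letters, digits, and underscores so tokens like
        # "load_dataset" or "gsm8k" stay intact.
        counts.update(re.findall(r"[a-z0-9_]+", query.lower()))
    return counts


queries = [
    "hugging face dataset",
    "hugging face datasets",
    "public health datasets",
    "geospatial datasets",
]

# Words that appear in nearly every query and would dominate the cloud.
STOPWORDS = {"dataset", "datasets"}

counts = word_frequencies(queries)
top = [(word, n) for word, n in counts.most_common() if word not in STOPWORDS]
print(top)
```

Filtering out "dataset" and "datasets" is a design choice: without it, those two words would crowd out the topical terms a word cloud is meant to surface.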