20 Best Free Healthcare Datasets for ML in 2025

Top 20 healthcare datasets for machine learning—free, diverse, and ready to train. Includes EHRs, X-rays, dialogues, audio, and commercial-grade data.

Why Healthcare Data Matters in ML

Medical data is messy, complex, and often high-stakes—exactly the kind of challenge machine learning was born to handle.

With the right training set, ML models can:

  • flag anomalies on X-rays in milliseconds,
  • predict ICU readmission days in advance,
  • transcribe or translate physician-patient conversations,
  • or even detect COVID from a simple cough recording.

But without clean, diverse, and context-rich datasets? Those same models collapse under bias, false positives, or clinical irrelevance. That’s why great healthcare ML begins not with modeling—but with data. 

What Makes a Dataset “Good”?

Not all “free medical data” is worth your training cycles. Here's what sets great datasets apart:

Realistic

The best datasets reflect real-world noise: typos, missing fields, patient diversity, low-resolution images, regional coding. Clean is great—but not too clean.

Representative

Bias in healthcare kills. If your training data mostly includes young white males, your model will fail on everyone else. Good datasets represent age, sex, ethnicity, and geography.
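
A quick way to act on this is to measure group shares before training. The snippet below is a minimal sketch, not a full fairness audit: the field names (`sex`, `age_band`), the toy records, and the 10% threshold are all illustrative assumptions.

```python
from collections import Counter

def group_shares(records, field):
    """Return each group's share of the records for one demographic field."""
    counts = Counter(r.get(field, "missing") for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: n / total for group, n in counts.items()}

def underrepresented(shares, min_share=0.10):
    """List groups whose share falls below the chosen threshold."""
    return sorted(g for g, s in shares.items() if s < min_share)

# Toy records; field names and values are illustrative only.
records = (
    [{"sex": "M", "age_band": "18-39"}] * 14
    + [{"sex": "F", "age_band": "18-39"}] * 4
    + [{"sex": "F", "age_band": "65+"}] * 1
    + [{"sex": "M"}] * 1  # record with a missing age field
)

sex_shares = group_shares(records, "sex")
age_shares = group_shares(records, "age_band")
flagged = underrepresented(age_shares)
print(sex_shares, flagged)
```

Counting records with a missing field as their own group keeps gaps visible instead of silently dropping them.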

Well-labeled

Whether it’s ICD codes, expert annotations on X-rays, or dialogue intents in transcripts—good labels mean better learning.

Legal to Use

Every dataset in the main list is free to access, though several require registration or an approved application, and each states its terms for academic or commercial use. No license? No training.

20 Best Free Healthcare Datasets (Ranked & Explained)

These 20 datasets are free to access, actively used in 2025, and sorted by relevance across five key categories: EHR, imaging, audio, dialogue, and surveys. Each one includes a direct use case for ML teams, from academic experiments to real-world deployments.

Electronic Health Records (EHR) 

1. MIMIC-IV 

Type: EHR, ICU

Format: Tabular + Clinical Notes

Volume: 40,000+ patients (2008–2019)

Access: Free (academic use only)

Use Case: Mortality prediction, time-series modeling, clinical NLP

One of the most widely cited clinical datasets, MIMIC-IV combines ICU monitoring data, vitals, lab results, prescriptions, and de-identified clinical notes. It’s a go-to foundation for time-series forecasting and patient outcome prediction.


2. EHRSHOT

Type: Longitudinal EHR

Format: Tabular

Volume: 6,739 patients, 41M events

Access: Free (academic use only)

Use Case: Few-shot training, foundation model pretraining, sequence modeling

Built by Stanford for benchmarking general-purpose EHR models, this dataset includes longitudinal patient timelines with diagnoses, labs, and medications. It’s structured to stress-test generalization and is ideal for prompt-based or cross-task transfer learning. 
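
To make the few-shot idea concrete, here is a minimal sketch of drawing a k-shot training set: up to k examples per label, with everything else held out for evaluation. It is not EHRSHOT's official split procedure; the `patient_id` and `label` fields and the toy data are hypothetical.

```python
import random
from collections import defaultdict

def sample_k_shot(examples, k, seed=0):
    """Pick up to k training examples per label; the rest stay held out."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)
    support, held_out = [], []
    for label in sorted(by_label):
        items = by_label[label][:]
        rng.shuffle(items)
        support.extend(items[:k])
        held_out.extend(items[k:])
    return support, held_out

# Toy binary task: every fifth patient is a positive case.
examples = [{"patient_id": i, "label": int(i % 5 == 0)} for i in range(50)]
support, held_out = sample_k_shot(examples, k=4)
print(len(support), len(held_out))
```

Sampling per label rather than from the whole pool guarantees that rare outcomes still appear in the tiny training set.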

Medical Imaging

3. MIMIC-CXR-JPG

Type: Chest X-rays

Format: Image + Report

Volume: 377,000+ images

Access: Free (academic use only)

Use Case: Diagnostic classification, report generation, attention maps 

One of the largest public chest radiograph datasets, sourced from real hospital practice and labeled with 14 common clinical findings (e.g., edema, pleural effusion, pneumonia). Comes with patient metadata and links to the corresponding radiology reports.


4. The Cancer Imaging Archive (TCIA)

Type: CT, MRI, PET

Format: DICOM + Segmentation

Volume: 100+ collections

Access: Free (academic use only)

Use Case: Tumor segmentation, radiomics, multi-modal training 

TCIA hosts expertly curated datasets covering various cancer types, often with segmentation masks, genomic data, and clinical annotations. Frequently used for research into 3D segmentation, survival analysis, and image-genome alignment. 

5. MedPix 

Type: Multimodal clinical imaging

Format: Photo + Metadata

Volume: 59,000+ diagnostic cases

Access: Free (open access)

Use Case: Cross-modal retrieval, medical image classification, education

This image-centric case library spans radiology, dermatology, and pathology, with rich metadata including diagnosis, history, and key image findings. Used for model pretraining and clinical decision support prototypes. 

6. PatchCamelyon (PCam)

Type: Histopathology

Format: 96×96 image patches

Volume: 327,680 samples

Access: Free (open access)

Use Case: Binary classification, patch-based detection, fast prototyping

Derived from whole-slide scans of lymph node tissue, PCam offers a compact, efficient benchmark for evaluating image classification architectures. Despite its small patch size, it is widely used in research on metastasis detection.
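
The binary label is worth understanding before you train. As the PCam documentation describes it, a patch counts as positive when its central 32×32 region contains at least one tumor pixel, and tumor tissue in the outer border does not affect the label. The sketch below illustrates that rule on plain nested lists; the helper names are hypothetical, and you should confirm the convention against the release you download.

```python
PATCH = 96   # patch side length, as listed above
CENTER = 32  # side of the central region that decides the label

def center_label(mask):
    """Return 1 if any tumor pixel lies in the central region of a binary mask."""
    start = (PATCH - CENTER) // 2  # first row/column of the central region
    stop = start + CENTER          # one past its last row/column
    return int(any(mask[r][c]
                   for r in range(start, stop)
                   for c in range(start, stop)))

def blank_mask():
    """Build an all-zero PATCH x PATCH mask."""
    return [[0] * PATCH for _ in range(PATCH)]

edge_only = blank_mask()
edge_only[5][5] = 1    # tumor pixel in the outer border only

centered = blank_mask()
centered[48][48] = 1   # tumor pixel inside the central region

print(center_label(edge_only), center_label(centered))
```

This is why augmentations that shift or crop patches need care: moving tumor tissue in or out of the center can silently invalidate the label.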

7. CAMELYON17

Type: Whole-slide histopathology

Format: WSI (DICOM) + Annotations

Volume: 100+ annotated slides from 5 hospitals

Access: Free (academic use only)

Use Case: Tumor segmentation, domain generalization, weak supervision

CAMELYON17 includes whole-slide lymph node scans with detailed labels for metastatic regions. Used in grand challenges, it’s become a benchmark for weakly supervised tumor detection across domains.
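
Because the slides come from several hospitals, a common way to test domain generalization is to hold out one site at a time. The following is a minimal sketch of that leave-one-site-out split; the `slide_id` and `site` fields and the toy slide list are assumptions, not the dataset's actual metadata schema.

```python
from collections import defaultdict

def leave_one_site_out(slides):
    """Yield (held_out_site, train, test), holding each site out once."""
    by_site = defaultdict(list)
    for slide in slides:
        by_site[slide["site"]].append(slide)
    for site in sorted(by_site):
        test = by_site[site]
        train = [s for other, group in by_site.items()
                 if other != site for s in group]
        yield site, train, test

# Toy metadata: 20 slides spread evenly over 5 hypothetical sites.
slides = [{"slide_id": f"s{i}", "site": f"hospital_{i % 5}"} for i in range(20)]
folds = list(leave_one_site_out(slides))

for site, train, test in folds:
    print(site, len(train), len(test))
```

A random split would mix scanners and staining protocols across train and test, which tends to overstate how well a model transfers to an unseen hospital.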

8. PAD-UFES-20

Type: Dermatology (mobile-acquired)

Format: Photo

Volume: 2,298 images, 6 disease classes

Access: Free (academic use only)

Use Case: Skin disease classification, fairness testing, mobile ML

Collected via smartphone in Brazilian clinics, this dataset emphasizes real-world variance: different lighting, skin tones, and resolutions. It’s great for building inclusive models or mobile apps targeting skin lesion triage. 

9. LC25000

Type: Histopathology (lung and colon)

Format: Microscopy images

Volume: 25,000 images

Access: Free (academic use only)

Use Case: Cancer detection, visual diagnosis, 2D tissue classification

Includes balanced samples from benign and malignant lung and colon tissues. Clean, easy to preprocess, and widely adopted for training CNNs on microscopy data. Especially useful for model benchmarking or augmentation studies. 

10. COVID-19 Radiography Dataset

Type: X-rays and CT scans

Format: Image + Metadata

Volume: 21,000+ images

Access: Free (open access)

Use Case: Infection detection, multi-class classification, public health modeling

Combines COVID, pneumonia, and healthy X-rays from multiple global sources, with class labels and metadata. Frequently used in rapid-response ML projects during the pandemic, it remains relevant for studying transfer learning and medical robustness.

Audio & Signal Data

11. COUGHVID

Type: Cough recordings

Format: Audio

Volume: 25,000+ samples

Access: Free (open access)

Use Case: Disease detection, acoustic modeling, mobile diagnostics 

Collected via crowdsourcing, with a subset of recordings reviewed by clinicians, this dataset includes coughs from healthy and symptomatic individuals. It’s a common benchmark for respiratory sound classification and real-time detection in telehealth settings.


12. Coswara

Type: Speech, cough, and breath audio

Format: Audio + Metadata

Volume: Thousands of samples

Access: Free (open access)

Use Case: Voice biometrics, COVID screening, multilingual signal analysis

Designed for multilingual research, Coswara provides sustained phonation, breath cycles, and coughing from diverse Indian populations. Ideal for audio-based symptom detection, especially in noisy real-world conditions. 

Medical Dialogue

13. MedDialog (EN)

Type: Doctor–patient Q&A (English)

Format: Text (structured dialogues)

Volume: 300,000+ dialogues across 96 diseases

Access: Free (open access)

Use Case: Medical chatbot training, disease-specific Q&A, few-shot NLP

Collected from English-language online consultation platforms, this dataset contains over 300K doctor–patient dialogues covering 96 diseases. Each interaction follows a structured Q&A format, making it suitable for training domain-specific chatbots and building disease-focused dialogue agents.


14. MedDialog (CN)

Type: Doctor–patient Q&A (Chinese)

Format: Text (multi-turn dialogues)

Volume: 1.1 million+ dialogues

Access: Free (open access)

Use Case: Multilingual NLP, clinical NLU, cross-lingual benchmarking 

The Chinese version of MedDialog contains anonymized patient consultations, making it ideal for training multilingual large language models for healthcare use cases in Chinese-speaking populations. 

Population Surveys & Multimodal Panels

15. DHS Program

Type: Global household health surveys

Format: Tabular + Questionnaire

Volume: 90+ countries, thousands of indicators

Access: Free (open access)

Use Case: Epidemiological modeling, health equity, feature correlation

The Demographic and Health Surveys provide data on fertility, nutrition, HIV, and maternal health across the Global South. It’s ideal for public health prediction and regional analysis.


16. HINTS

Type: U.S. health communication survey

Format: Tabular

Volume: 5 major waves, 20+ years

Access: Free (open access)

Use Case: Behavioral modeling, patient tech adoption, segmentation

The Health Information National Trends Survey tracks how adults in the U.S. access and trust medical information. Valuable for training personalization engines and behavioral prediction tools.

17. MEPS 

Type: U.S. medical expenditure survey

Format: Tabular

Volume: National sample, thousands of records

Access: Free (open access)

Use Case: Cost modeling, policy simulation, socioeconomic impact

This dataset tracks medical spending, insurance coverage, and service utilization in American households. Great for economic ML modeling and predictive healthcare cost estimation. 

18. OpenSAFELY

Type: UK primary care + COVID records

Format: Tabular + Clinical codes

Volume: Millions of records (pseudonymized)

Access: Free (academic request required)

Use Case: Risk factor analysis, vaccine effectiveness, real-world evidence

Built for pandemic-time research, OpenSAFELY includes secure NHS-linked data on medications, diagnoses, and mortality. It supports statistical modeling and large-scale medical inference. 

19. UK Biobank 

Type: Multimodal biomedical cohort

Format: Genomic + Imaging + Tabular

Volume: 500,000+ participants

Access: Free (academic application required)

Use Case: Imaging-genetics fusion, biomarker discovery, LLM fine-tuning 

Combines genomics, MRI, blood biomarkers, and lifestyle surveys. Widely used for multi-modal ML in aging, oncology, cardiology, and cognitive science. Requires academic application for access. 

20. HealthData.gov

Type: Open U.S. government health data portal

Format: Tabular + Multiformat

Volume: Thousands of datasets

Access: Free (open access)

Use Case: Exploratory modeling, regional trend analysis, synthetic data

Aggregates open datasets on Medicare, hospital quality, opioid use, vaccine uptake, and more. A flexible source for feature engineering and cross-domain model development.

Commercial-Grade Datasets (For Production Use)

Free datasets are great for research—but production systems need more. At Unidata, we offer curated medical datasets built for real-world deployment: annotated CT scans, surgical video, and multimodal patient records from Eastern Europe. 

Unidata Chest CT Collection

Type: Thoracic CT scans

Format: DICOM + XML annotations

Volume: 150,000+ slices (7,435 patients, 25 clinics)

Access: Paid (commercial license)

Use Case: Lung cancer detection, tuberculosis screening, segmentation model pretraining

A multi-institutional dataset of thoracic CT slices collected from 25 clinics across Eastern Europe and Central Asia. Contains over 150,000 manually annotated DICOM images labeled by radiologists for nodules, consolidations, cavities, and other pathologies. Includes scanner metadata and demographic details, making it ideal for domain adaptation and model

URL: https://unidata.pro/datasets/ct-scan-chest/

Unidata Brain MRI Collection

Type: Brain MRI scans

Format: DICOM + XML annotations

Volume: 2,000,000+ images across 50+ studies

Access: Paid (commercial license)

Use Case: Tumor segmentation, brain pathology detection, neuroimaging research

This dataset includes over 2 million DICOM images from 50+ detailed brain MRI studies collected across Eastern Europe. Each study features high-resolution scans in T1, T2, and FLAIR sequences with expert-created XML annotations highlighting tumors, lesions, and structural anomalies. Suitable for fine-grained segmentation models, tumor classification, and neuro-AI research requiring full DICOM fidelity. 

URL: https://unidata.pro/datasets/brain-mri-image-dicom/

Unidata Spine MRI Collection

Type: Spine MRI scans

Format: DICOM + XML annotations

Volume: 2,400,000+ images across 67+ studies

Access: Paid (commercial license)

Use Case: Disc herniation detection, spinal alignment analysis, orthopedic model training

This collection contains over 2.4 million spine MRI images sourced from 67+ fully annotated studies. It covers cervical, thoracic, and lumbar regions, with pixel-level segmentation masks and diagnostic metadata. Designed to support ML pipelines in orthopedic imaging, including spinal abnormality classification and pretraining for radiology foundation models. 

URL: https://unidata.pro/datasets/spine-mri-image-dicom/

Final Takeaways

Great healthcare AI starts with great data. Public datasets help you build fast and learn fast—but real-world performance needs real-world diversity.

Cheat Sheet: All 23 Healthcare Datasets

Use this list to find your baseline. When you're ready to scale, train on data that reflects the patients you serve.

Best Free Healthcare Datasets for Machine Learning (2025)
- name: MIMIC-IV
  type: EHR, ICU
  format: Tabular + Clinical Notes
  volume: 40,000+ patients (2008–2019)
  access: Free (academic use only)
  use_case: Mortality prediction, time-series modeling, clinical NLP

- name: EHRSHOT
  type: Longitudinal EHR
  format: Tabular
  volume: 6,739 patients, 41M events
  access: Free (academic use only)
  use_case: Few-shot training, foundation model pretraining, sequence modeling

- name: MIMIC-CXR-JPG
  type: Chest X-rays
  format: Image + Report
  volume: 377,000+ images
  access: Free (academic use only)
  use_case: Diagnostic classification, report generation, attention maps

- name: The Cancer Imaging Archive (TCIA)
  type: CT, MRI, PET
  format: DICOM + Segmentation
  volume: 100+ collections
  access: Free (academic use only)
  use_case: Tumor segmentation, radiomics, multi-modal training

- name: MedPix
  type: Multimodal clinical imaging
  format: Photo + Metadata
  volume: 59,000+ diagnostic cases
  access: Free (open access)
  use_case: Cross-modal retrieval, medical image classification, education

- name: PatchCamelyon (PCam)
  type: Histopathology
  format: 96×96 image patches
  volume: 327,680 samples
  access: Free (open access)
  use_case: Binary classification, patch-based detection, fast prototyping

- name: CAMELYON17
  type: Whole-slide histopathology
  format: WSI (DICOM) + Annotations
  volume: 100+ annotated slides from 5 hospitals
  access: Free (academic use only)
  use_case: Tumor segmentation, domain generalization, weak supervision

- name: PAD-UFES-20
  type: Dermatology (mobile-acquired)
  format: Photo
  volume: 2,298 images, 6 disease classes
  access: Free (academic use only)
  use_case: Skin disease classification, fairness testing, mobile ML

- name: LC25000
  type: Histopathology (lung and colon)
  format: Microscopy images
  volume: 25,000 images
  access: Free (academic use only)
  use_case: Cancer detection, visual diagnosis, 2D tissue classification

- name: COVID-19 Radiography Dataset
  type: X-rays and CT scans
  format: Image + Metadata
  volume: 21,000+ images
  access: Free (open access)
  use_case: Infection detection, multi-class classification, public health modeling

- name: COUGHVID
  type: Cough recordings
  format: Audio
  volume: 25,000+ samples
  access: Free (open access)
  use_case: Disease detection, acoustic modeling, mobile diagnostics

- name: Coswara
  type: Speech, cough, and breath audio
  format: Audio + Metadata
  volume: Thousands of samples
  access: Free (open access)
  use_case: Voice biometrics, COVID screening, multilingual signal analysis

- name: MedDialog (EN)
  type: Doctor–patient Q&A (English)
  format: Text (structured dialogues)
  volume: 300,000+ dialogues across 96 diseases
  access: Free (open access)
  use_case: Medical chatbot training, disease-specific Q&A, few-shot NLP

- name: MedDialog (CN)
  type: Doctor–patient Q&A (Chinese)
  format: Text (multi-turn dialogues)
  volume: 1.1 million+ dialogues
  access: Free (open access)
  use_case: Multilingual NLP, clinical NLU, cross-lingual benchmarking

- name: DHS Program
  type: Global household health surveys
  format: Tabular + Questionnaire
  volume: 90+ countries, thousands of indicators
  access: Free (open access)
  use_case: Epidemiological modeling, health equity, feature correlation

- name: HINTS
  type: U.S. health communication survey
  format: Tabular
  volume: 5 major waves, 20+ years
  access: Free (open access)
  use_case: Behavioral modeling, patient tech adoption, segmentation

- name: MEPS
  type: U.S. medical expenditure survey
  format: Tabular
  volume: National sample, thousands of records
  access: Free (open access)
  use_case: Cost modeling, policy simulation, socioeconomic impact

- name: OpenSAFELY
  type: UK primary care + COVID records
  format: Tabular + Clinical codes
  volume: Millions of records (pseudonymized)
  access: Free (academic request required)
  use_case: Risk factor analysis, vaccine effectiveness, real-world evidence

- name: UK Biobank
  type: Multimodal biomedical cohort
  format: Genomic + Imaging + Tabular
  volume: 500,000+ participants
  access: Free (academic application required)
  use_case: Imaging-genetics fusion, biomarker discovery, LLM fine-tuning

- name: HealthData.gov
  type: Open U.S. government health data portal
  format: Tabular + Multiformat
  volume: Thousands of datasets
  access: Free (open access)
  use_case: Exploratory modeling, regional trend analysis, synthetic data

- name: Unidata Chest CT Collection
  type: Thoracic CT scans
  format: DICOM + XML annotations
  volume: 150,000+ slices (7,435 patients, 25 clinics)
  access: Paid (commercial license)
  use_case: Lung cancer detection, tuberculosis screening, segmentation model pretraining

- name: Unidata Brain MRI Collection
  type: Brain MRI scans
  format: DICOM + XML annotations
  volume: 2,000,000+ images across 50+ studies
  access: Paid (commercial license)
  use_case: Tumor segmentation, brain pathology detection, neuroimaging research

- name: Unidata Spine MRI Collection
  type: Spine MRI scans
  format: DICOM + XML annotations
  volume: 2,400,000+ images across 67+ studies
  access: Paid (commercial license)
  use_case: Disc herniation detection, spinal alignment analysis, orthopedic model training
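
If you want to filter the cheat sheet programmatically, for example to keep only open-access entries, a few lines of Python are enough. This is a minimal sketch that assumes the list is saved as plain text in the same "- name:" plus "key: value" layout shown above; the parser is a hypothetical helper, not a general YAML reader, and the `SHEET` excerpt holds just three of the entries.

```python
def parse_cheat_sheet(text):
    """Parse '- name: ...' entries followed by indented 'key: value' lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- "):
            entries.append({})  # a dash starts a new dataset entry
            line = line[2:]
        key, _, value = line.partition(":")
        if entries:
            entries[-1][key.strip()] = value.strip()
    return entries

def by_access(entries, keyword):
    """Return names of entries whose access field mentions the keyword."""
    return [e["name"] for e in entries
            if keyword.lower() in e.get("access", "").lower()]

# Short excerpt of the cheat sheet above.
SHEET = """
- name: MedPix
  type: Multimodal clinical imaging
  access: Free (open access)

- name: MIMIC-IV
  type: EHR, ICU
  access: Free (academic use only)

- name: Unidata Chest CT Collection
  type: Thoracic CT scans
  access: Paid (commercial license)
"""

entries = parse_cheat_sheet(SHEET)
open_access = by_access(entries, "open access")
paid = by_access(entries, "paid")
print(open_access, paid)
```

Whatever the filter returns, read the dataset's actual license before training on it; the access field here is only a summary.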
 
