20 Best Free Sports Datasets for ML 2025

Sports data is your playbook: choose right, win fast. This multi-sport, ML-ready shortlist covers free datasets and APIs (a few with paid tiers), a quick comparison cheat-sheet, and clear notes on how to plug each one into live pipelines (prediction, CV, tracking).

Open Football APIs for Real-Time Modeling

1. StatsBomb Open Data

Volume: 30+ competitions, thousands of matches since 2018
Access: Free with attribution
Format: JSON event files + CSV match data
Task Fit: Event classification, player analysis, match prediction

If “context is king,” this is royalty—pressures, pass heights, shot freeze-frames, the lot. It punishes lazy features and rewards smart ones (xG, buildup chains, pitch zones). If your model reads the game instead of raw rows, it’ll shine here.
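
As a rough illustration of the "smart features" idea, here is a minimal sketch that turns a shot location into two classic xG inputs: distance to goal and the angle the goal mouth opens up. It assumes a 120 x 80 pitch with the attacking goal centred at (120, 40) and posts at y = 36 and y = 44; check the data specification before relying on those numbers. The function name `shot_features` is made up for this example.

```python
import math

# Assumed pitch convention: 120 x 80 units, attacking goal centred at
# (120, 40) with posts at y = 36 and y = 44. Verify against the data spec.
GOAL_X = 120.0
LEFT_POST_Y = 36.0
RIGHT_POST_Y = 44.0


def shot_features(x: float, y: float) -> dict:
    """Return distance to goal centre and the opening angle between the posts."""
    goal_center_y = (LEFT_POST_Y + RIGHT_POST_Y) / 2
    distance = math.hypot(GOAL_X - x, goal_center_y - y)
    # Angle (radians) subtended by the goal mouth as seen from the shot spot.
    to_right_post = math.atan2(RIGHT_POST_Y - y, GOAL_X - x)
    to_left_post = math.atan2(LEFT_POST_Y - y, GOAL_X - x)
    return {"distance": distance, "angle": abs(to_right_post - to_left_post)}


# Hypothetical shot taken 12 units out, dead centre.
features = shot_features(108.0, 40.0)
print(features)
```

Both values can go straight into a logistic regression or gradient-boosted xG baseline; richer context (pressure, freeze-frames) can be layered on later.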

2. Open Football Data API

Volume: Live & historical match results, fixtures, odds
Access: Free API (registration required)
Format: REST API with JSON responses
Task Fit: Predictive modeling, betting analytics, match classification

Plug-and-play football feeds without the plumbing drama. Great for spinning up live win-probability, odds-driven features, and alerting dashboards. Mind the rate limits, cache smartly, and your models stay real-time sharp.
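
"Cache smartly" can be as simple as a time-to-live wrapper around whatever function makes the HTTP call. The sketch below is one possible shape, not the API's own client: `TTLCache`, `fake_fetch`, and the `/fixtures/today` path are all hypothetical, and the fetcher is faked so the example runs without a network.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache fetch results per key and reuse them until they expire."""

    def __init__(self, fetch: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        cached = self._store.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh: no upstream call
        value = self._fetch(key)
        self._store[key] = (now, value)
        return value


# Stand-in for a real HTTP request; the endpoint path is hypothetical.
calls = []


def fake_fetch(path: str) -> dict:
    calls.append(path)
    return {"path": path, "call_number": len(calls)}


fake_now = [0.0]  # controllable clock so the example is deterministic
cache = TTLCache(fake_fetch, ttl_seconds=30.0, clock=lambda: fake_now[0])

first = cache.get("/fixtures/today")
second = cache.get("/fixtures/today")  # served from cache
fake_now[0] = 31.0
third = cache.get("/fixtures/today")   # expired, fetched again
print(len(calls))
```

Pick the TTL per endpoint: fixtures can sit for minutes, live scores for seconds.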

3. College Football Data API

Volume: 1,000+ games per season
Access: Free API
Format: REST API with JSON (games, drives, plays, rosters)
Task Fit: Win prediction, recruitment analysis, player performance

Saturday chaos, structured. Play-by-play, rosters, and drive data let you model tempo, field position, and coaching tendencies. If your features capture scheme and pace, expect serious lift on win-probability curves. 

Player Tracking Datasets (Basketball & Football)

4. Kaggle: NBA Shot Logs (2014–15)

Volume: 128,000+ shots from 2014–15 NBA season
Access: Free (Kaggle account required)
Format: CSV (shot location, outcome, context)
Task Fit: Shot prediction, spatial analysis, player efficiency

A slam-dunk playground for spatial models: shot location, outcome, defender context. Perfect for heatmaps, shot quality, and player profiles without chasing proprietary feeds. If distance and angle make it into your features, buckets follow.
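
A quick way to get a feel for shot quality is to bucket attempts by range and defender distance, then compare field-goal percentage per bucket. This is a sketch under assumptions: the column names `SHOT_DIST`, `CLOSE_DEF_DIST`, and `FGM` should be checked against the CSV header, the rows are invented, and the 8 ft / 22 ft / 4 ft cut-offs are arbitrary choices for illustration.

```python
from collections import defaultdict

# Hypothetical rows; column names are assumptions about the CSV header.
shots = [
    {"SHOT_DIST": 2.1, "CLOSE_DEF_DIST": 1.0, "FGM": 1},
    {"SHOT_DIST": 3.4, "CLOSE_DEF_DIST": 5.2, "FGM": 1},
    {"SHOT_DIST": 24.8, "CLOSE_DEF_DIST": 6.1, "FGM": 1},
    {"SHOT_DIST": 25.3, "CLOSE_DEF_DIST": 2.0, "FGM": 0},
    {"SHOT_DIST": 15.0, "CLOSE_DEF_DIST": 3.9, "FGM": 0},
    {"SHOT_DIST": 23.9, "CLOSE_DEF_DIST": 7.5, "FGM": 0},
]


def bucket(shot: dict) -> tuple:
    """Label a shot by range band and whether the nearest defender is tight."""
    dist = shot["SHOT_DIST"]
    if dist < 8:
        band = "rim"
    elif dist < 22:
        band = "midrange"
    else:
        band = "three"
    coverage = "tight" if shot["CLOSE_DEF_DIST"] < 4 else "open"
    return band, coverage


def fg_pct_by_bucket(rows: list) -> dict:
    """Field-goal percentage for every (band, coverage) bucket that has shots."""
    made = defaultdict(int)
    attempts = defaultdict(int)
    for row in rows:
        key = bucket(row)
        attempts[key] += 1
        made[key] += row["FGM"]
    return {key: made[key] / attempts[key] for key in attempts}


table = fg_pct_by_bucket(shots)
for key in sorted(table):
    print(key, round(table[key], 2))
```

The same bucket labels work as categorical features for a shot-prediction model.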

5. SoccerNet 

Volume: 500+ full broadcast matches with event labels
Access: Free (research registration required)
Format: Video frames, bounding boxes, JSON event annotations
Task Fit: Player tracking, action recognition, event detection

The gold standard for football video ML. Synchronized multi-camera footage, broadcast commentary, and precise event tags make it ideal for benchmarking. If your detector can survive motion blur and crowd noise here, it’s ready for prime time. 

6. Metrica Sports Sample Data

Volume: Full-match tracking + event logs
Access: Free (GitHub)
Format: CSV/JSON tracking coordinates + synchronized event data
Task Fit: Player tracking, tactical analysis, computer vision

Think of it as GPS for 22 dots sprinting, passing, and colliding. You get both event logs and full-match positional streams, perfectly synced. A sandbox for anyone testing CV models or tactical visualizations beyond static stats. 
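
Positional streams become useful once they are turned into physical quantities. The sketch below derives per-frame speed for one player. It assumes coordinates normalised to [0, 1], a 25 Hz frame rate, and a 105 m x 68 m pitch; confirm all three against the dataset's documentation. The track itself is made up.

```python
import math

# Assumptions: normalised [0, 1] coordinates, 25 frames per second,
# and a 105 m x 68 m pitch. Adjust to match the actual data.
FRAME_RATE_HZ = 25.0
PITCH_LENGTH_M = 105.0
PITCH_WIDTH_M = 68.0


def speeds_m_per_s(track: list) -> list:
    """Convert consecutive (x, y) positions into per-frame speeds in m/s."""
    speeds = []
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        dx = (x1 - x0) * PITCH_LENGTH_M
        dy = (y1 - y0) * PITCH_WIDTH_M
        speeds.append(math.hypot(dx, dy) * FRAME_RATE_HZ)
    return speeds


# Hypothetical three-frame track for one player.
track = [(0.500, 0.500), (0.502, 0.500), (0.502, 0.503)]
speeds = speeds_m_per_s(track)
print([round(s, 2) for s in speeds])
```

Raw frame-to-frame speeds are noisy, so a moving average or similar smoothing step is usually worth adding before counting sprints.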

Historical Box Scores for Outcome Prediction

7. Sports Reference 

Volume: Decades of MLB, NBA, NFL, NHL data
Access: Free
Format: Web tables + downloadable CSVs
Task Fit: Trend analysis, win prediction, player projections

The encyclopedia every U.S. sports analyst secretly bookmarks. Box scores, advanced stats, and historical leaders make it prime territory for long-range forecasting. If your model can’t find signal here, it probably won’t find it anywhere.

8. Lahman Baseball Database

Volume: Over 150 years of MLB stats
Access: Free download
Format: CSV/SQL database files
Task Fit: Historical trend analysis, performance prediction

Baseball’s memory palace, digitized. From dead-ball era oddities to modern OPS+, it’s all in structured tables. A dream dataset for time-series experiments that span generations of players and shifting styles of play.

9. Division III Basketball Play-by-Play 

Volume: 300,000+ plays from multiple Division III games
Access: Free (Kaggle)
Format: CSV logs with timestamps, players, and events
Task Fit: Sequence modeling, outcome prediction, time-series

A raw look into small-college basketball where structure meets chaos. Every pass, foul, and run of play is timestamped—perfect for training models that understand momentum and clutch shifts. Ideal for testing RNNs, LSTMs, or transformers built for sports flow. 
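
Before any RNN, LSTM, or transformer sees play-by-play data, the log has to be cut into fixed-length training samples. One common approach is a sliding window that pairs the last few events with the one that follows. The sketch below shows the idea; the event labels are hypothetical and `make_windows` is just an illustrative helper.

```python
def make_windows(events: list, window: int) -> list:
    """Pair each run of `window` consecutive events with the event that follows."""
    samples = []
    for start in range(len(events) - window):
        context = events[start:start + window]
        target = events[start + window]
        samples.append((context, target))
    return samples


# Hypothetical event labels; real logs need their own vocabulary.
plays = ["rebound", "pass", "shot_made", "foul", "free_throw", "turnover"]
samples = make_windows(plays, window=3)
for context, target in samples:
    print(context, "->", target)
```

Build windows within a single game only, so a sample never straddles the final play of one game and the tip-off of the next.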

10. NHL Play-by-Play Data 

Volume: 10+ years of NHL logs
Access: Free (Kaggle)
Format: CSV event logs with shots, penalties, goals
Task Fit: Shot analysis, win prediction, efficiency metrics

Hockey isn’t chaos—it’s structured chaos, and this dataset proves it. Play-by-play sequences let you analyze shot quality, penalty impact, and even goalie hot streaks. A sturdy launchpad for predictive hockey analytics.

Event-Level Sports Data for xG & Tactics

11. FIFA 23 Player Dataset 

Volume: 19,000+ players, 100+ attributes
Access: Free (Kaggle)
Format: CSV (player attributes, positions, clubs, nations)
Task Fit: Classification, clustering, scouting

Ratings, traits, and roles—enough signal to build a scouting engine that actually feels smart. Slice by league, position group, or age curve and surface “hidden gems” your rivals overlook. Great playground for similarity search, role archetyping, and squad planning.
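
Similarity search over attribute vectors can start with plain cosine similarity. The sketch below ranks players by how closely their profile matches a target; the player names and the four attributes are invented for illustration and are not columns taken from the dataset.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(target: str, players: dict, top_n: int = 2) -> list:
    """Rank every other player by cosine similarity to the target."""
    scores = [
        (name, cosine_similarity(players[target], vector))
        for name, vector in players.items()
        if name != target
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical players and attributes: [pace, passing, defending, shooting].
players = {
    "Winger A": [90, 70, 35, 78],
    "Winger B": [88, 72, 38, 75],
    "Centre-back C": [60, 55, 88, 30],
    "Playmaker D": [65, 90, 50, 70],
}

ranking = most_similar("Winger A", players)
print(ranking)
```

Because raw ratings are all positive, every pair scores fairly high; standardising each attribute first usually makes the ranking more discriminating.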

12. Football Manager Complete Dataset

Volume: 150,000+ players
Access: Free (Kaggle)
Format: CSV (player stats, attributes, positions, nations)
Task Fit: Recommendation, scouting analysis

A cult dataset reborn — clean, deep, and refreshingly current. Attribute-rich player profiles make it perfect for training recommender systems or similarity searches. Whether you’re matching midfield archetypes or ranking potential signings, this one’s pure transfer gold. 

13. WTA & ATP Tennis Stats and Results

Volume: WTA and ATP matches from 1949–2021
Access: Free (Kaggle)
Format: CSV (match results, player stats, tournament metadata)
Task Fit: Outcome prediction, ranking models

Seven decades of tennis history — Grand Slams, upsets, and dominance cycles captured in one dataset. Ideal for modeling Elo-style ratings, predicting match outcomes, or studying era-based performance trends. If your model respects surface and fatigue, this set rewards nuance. 
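
For readers new to Elo-style ratings, here is a minimal sketch of the standard update rule applied to a chronological list of results. The starting rating of 1500 and K-factor of 32 are conventional defaults chosen for illustration, and the match list is invented; surface and fatigue adjustments are left out.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner: float, loser: float, k: float = 32.0) -> tuple:
    """Return the new (winner, loser) ratings after one match."""
    surprise = 1.0 - expected_score(winner, loser)
    return winner + k * surprise, loser - k * surprise


# Hypothetical results in chronological order: (winner, loser).
matches = [
    ("Player A", "Player B"),
    ("Player A", "Player C"),
    ("Player C", "Player A"),
]

ratings = {}
for winner, loser in matches:
    w = ratings.get(winner, 1500.0)  # unseen players start at 1500
    l = ratings.get(loser, 1500.0)
    ratings[winner], ratings[loser] = update_elo(w, l)

print({name: round(r, 1) for name, r in ratings.items()})
```

One way to respect surface is to keep a separate ratings dictionary per surface and blend it with the overall rating at prediction time.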

Multi-Sport APIs and Data Sources

14. balldontlie NBA API 

Volume: Historical & current NBA games, players, and stats
Access: Free tier (recent API versions require a free API key; rate limits apply)
Format: REST API with JSON responses
Task Fit: Real-time dashboards, trend analysis, prediction features

Clean, consistent NBA endpoints without scraping drama. Pull games, box scores, players, and season splits straight into notebooks or BI tools. Great for building live tiles, baseline models, and stat pipelines in a single afternoon. 

15. Sports Stats API

Volume: Covers football, basketball, hockey, tennis
Access: Free tier + paid plans
Format: REST API with JSON (multi-sport endpoints)
Task Fit: Multi-sport modeling, visualization, predictions

One doorway, many sports. Pull consistent JSON across leagues, wire it into your ETL, and ship a unified analytics layer fast. Ideal for teams that need breadth without juggling five different vendor schemas. 

16. ESPN Sports Data via Flipside LiveQuery 

Volume: Scores, schedules, and player stats across major U.S. sports
Access: Free (requires Flipside account)
Format: SQL-based API queries returning JSON/CSV
Task Fit: Trend analysis, visualization, performance tracking

Finally—ESPN data without the scraping pain. Query real game stats, schedules, and leaderboards directly through SQL endpoints. Ideal for analysts who want clean pipelines from ESPN’s ecosystem into BI dashboards or ML notebooks in minutes. 

17. FiveThirtyEight Sports Data

Volume: Multiple datasets (NBA, NFL, MLB, more)
Access: Free (GitHub)
Format: CSV with documentation/READMEs
Task Fit: Prediction, sports betting, storytelling

The datasets behind headline-grabbing forecasts, packaged for immediate use. Clean columns, sensible dictionaries, and repeatable structures make baselines quick to build. Great for demos, benchmarks, and explainable models your PM can love.

18. DataHub Football Data Collection

Volume: 60K+ match results from global leagues and tournaments
Access: Free (open source, downloadable CSV/JSON)
Format: CSV/JSON (team stats, results, goals, standings)
Task Fit: Experimental modeling, benchmarking, reproducibility

A clean, structured, and open dataset that brings worldwide football stats to your fingertips. No scraping, no rate limits—just tidy data ready for ML models, dashboards, or quick EDA. Ideal for testing match outcome prediction or transfer learning across leagues.

19. Jeff Sackmann Tennis Data (ATP & WTA)

Volume: 60K+ ATP & WTA matches (1968–2024)
Access: Free (open GitHub repo)
Format: CSV (match results, players, stats)
Task Fit: Outcome prediction, ranking models, time-series

A long-running open tennis dataset curated by Jeff Sackmann. Clean, consistent columns for player, surface, round, and result make it a solid base for predictive models or ranking algorithms with minimal preprocessing.

20. UCI Sports Datasets

Volume: Small-to-mid datasets (athletics, gym, swimming)
Access: Free
Format: CSV/ARFF; some sensor streams
Task Fit: Classification, biomechanics, activity recognition

A classic playground for quick experiments and teaching notebooks. Sensor-rich tasks like activity recognition let you test pipelines without heavy ETL. When you need clean, compact data to prove a point, start here. 

🏅 Sports Dataset Cheat-Sheet (2025)

⚽ Football / Soccer Analytics

Dataset | Volume | Data Type | Special Features | Ideal Use Case | License / Access
SoccerNet v3 | 500+ full matches | Video + JSON annotations | Multi-camera, event tags, sync audio | Video action detection, temporal localization | Free (research)
StatsBomb Open Data | 30+ competitions | JSON events, CSV matches | Detailed events, pressures, xG | Tactical modeling, xG pipelines | Free (attribution)
Open Football Data API | Live + historical | REST/JSON API | Results, fixtures, odds | Real-time prediction, betting analytics | Free (registration)
Understat xG Data | Top 5 leagues (2014–2024) | JSON (shots, players, teams) | Shot locations + xG values | xG modeling, form tracking | Free (public)
DataHub Football Data | 60K+ matches worldwide | CSV/JSON | Clean schema, global coverage | Outcome prediction, cross-league benchmarking | Free (open source)

🏀 Basketball Analytics

Dataset | Volume | Data Type | Special Features | Ideal Use Case | License / Access
NBA Shot Logs (2014–15) | 128K+ shots | CSV | Shot location, defender context | Spatial models, shot prediction | Free (Kaggle)
balldontlie NBA API | All seasons since 1979 | REST/JSON API | Games, players, stats | Dashboards, forecasting, live features | Free (public)
Division III Basketball Play-by-Play | 300K+ plays | CSV | Timestamps, sequential play data | Sequence modeling, win prediction | Free (Kaggle)

🎾 Tennis Analytics

Dataset | Volume | Data Type | Special Features | Ideal Use Case | License / Access
Jeff Sackmann Tennis Data | 60K+ ATP & WTA matches | CSV | Clean stats, surfaces, tournaments | Ranking, match prediction | Free (GitHub)
WTA & ATP Stats (1949–2021) | ~72 years of results | CSV | Players, tournaments, rankings | Outcome prediction, era analysis | Free (Kaggle)
Match Charting Project | 10K+ charted matches | CSV | Manual shot sequences | Tactics, sequence modeling | Free (open)

⚾ Baseball Analytics

Dataset | Volume | Data Type | Special Features | Ideal Use Case | License / Access
Lahman Baseball Database | 150+ years of MLB stats | CSV/SQL | Structured tables by season | Performance forecasting, sabermetrics | Free (public)
Retrosheet Play-by-Play | 100+ seasons | CSV / event text | Pitch-by-pitch, substitutions | Game simulation, strategy modeling | Free (public)

🏈 Multisport & General Analytics

Dataset | Volume | Data Type | Special Features | Ideal Use Case | License / Access
OpenSports Dataset (DataHub) | 50K+ records | CSV/JSON | Unified schema across sports | Cross-sport analytics, feature engineering | Free (open source)
SportsMOT | 240 video sequences | Video + JSON | Multi-object tracking | Object detection, motion tracking | Free (research)

Conclusion

From detailed football event logs to real-time APIs spanning football, basketball, baseball, hockey, and tennis, these datasets cover most analytics needs. Whether you’re modeling match outcomes, building scouting engines, or training CV models, there’s a dataset here to fuel your project.
