20 Best Handwriting Datasets for Machine Learning

14 minutes read
Handwriting Datasets for Machine Learning

Handwriting is messy. It loops, smudges, and slants in a hundred different ways depending on who’s holding the pen. And yet, we expect machines to read it — accurately, instantly, and at scale.

Whether you’re building an OCR engine, transcribing historical archives, or training a model to spot forged signatures, it all starts with the right data. A good handwriting dataset doesn’t just show characters — it captures the quirks of human expression in ink.

In this guide, we’ll walk through:

  1. What makes a handwriting dataset useful (and what to avoid)
  2. How to choose one based on task, format, and language
  3. A curated list of 20 datasets worth your time in 2025

Let’s get into it.

How to Choose the Right Dataset

Picking a handwriting dataset isn’t about size alone. You want the right mix of realism, annotation quality, and task alignment. Here’s what actually matters:

Writing style
Cursive or printed? Block letters or slanted scrawl? Match it to your application.

Granularity
Do you need just characters, or full sentences with layout data? Some datasets go pixel-deep. Others just tag words.

Language and script
Latin, Cyrillic, Chinese, Bangla? Multilingual datasets are out there — but not all are balanced.

License
Open-source is great, but some historical datasets need permissions. Always check before you train.

Capture format
Offline (scanned paper) or online (stylus-tracked strokes)? Choose based on how your model will be used. 

Digits and Basic Characters

Let’s start simple. If your model can’t reliably tell a 3 from an 8, you’ve got bigger problems down the line. These datasets are built for that first step: recognizing single digits and characters in controlled settings. Clean, compact, and surprisingly competitive when pushed to the edge.

1. MNIST

MNIST

Type: Digits 

Language: English 

Volume: 70,000 images (28×28)

Format: Offline scans

Access: Free (commercial use permitted)

Annotations: Single-digit labels

This is the “hello world” of handwriting datasets. Despite its age, MNIST remains a staple for testing classification models. Its simplicity is a strength: everything’s clean, centered, and well-labeled. If your system can’t nail MNIST, it’s not ready for the real world.

2. EMNIST

EMNIST

Type: Digits + Letters 

Language: English

Volume: 800,000+ samples

Format: Offline

Access: Free (commercial use permitted)

Annotations: Characters (upper/lower/digits), multiple splits

EMNIST picks up where MNIST stops. It includes letters — upper and lower case — and keeps the digit class for continuity. With multiple splits available, you can tailor the dataset to your use case. It’s a better fit for mixed-input tasks like license plate readers or digit-letter combo fields. 

🧠 Dataset Spotlight: EMNIST

dataset_name: EMNIST
type: Digits + Letters
access: Free (commercial use permitted)
format: Offline scanned images
ideal_for: Digit-letter classification, OCR training, license plate parsing

3. Chars74K

Chars74K dataset

Type: Characters 

Languages: English, Kannada 

Volume: 74,000+ images

Format: Offline (scene + hand-drawn) 

Access: Free (commercial use permitted)

Annotations: Character labels by source (scene, synthetic, handwriting)

If MNIST is tidy, Chars74K is wild. Pulled from photos, digital drawings, and scanned docs, this set challenges your model with real-world noise. It’s a great dataset to move from lab conditions to unpredictable environments — especially in multilingual settings or domain adaptation.

4. HASYv2

HASYv2 Kaggle dataset

Type: Symbols 

Language: LaTeX-style Math

Volume: 168,000 images

Format: Offline (binary 32×32) 

Access: Free (commercial use permitted)

Annotations: 369 math symbol classes

Most OCR tools break down when math enters the picture. HASYv2 gives you a head start. Each image is a crisp 32×32 black-and-white render of handwritten math symbols. Great for academic tools, equation parsing, or any workflow involving formulas.

Words and Sentences

Once your model graduates from isolated digits, it needs to tackle full words and connected cursive. These datasets teach it how humans actually write — across lines, with varied pressure, spacing quirks, and all the delightful chaos of real handwriting. Ideal for building transcription systems, smart OCR, or full-page text recognition.

5. IAM Handwriting Database

IAM Handwriting Database

Type: Sentences, Words, Characters

Language: English 

Volume: 1,539 pages from 657 writers

Format: Offline (scanned forms)

Access: Free (academic only)

Annotations: Sentences, words, characters (bounding boxes + transcripts)

This is the most cited offline handwriting dataset — and for good reason. IAM is the gold standard for full-line recognition in English. It’s neatly segmented, well-annotated, and large enough to train deep models. If you’re building anything from a smart scanner to a handwriting-to-text pipeline, start here.

📝 Dataset Spotlight: IAM Handwriting

dataset_name: IAM Handwriting Database
type: Sentences, Words, Characters
access: Free (academic only)
format: Offline scans with bounding boxes and transcripts
ideal_for: Full-line OCR, handwriting-to-text models, segmentation tasks

6. RIMES

RIMES

Type: Documents, Sentences, Words 

Language: French 

Volume: Thousands of letters and forms

Format: Offline (realistic business letters) 

Access: Free (academic only)

Annotations: Word-level segmentation, transcripts, metadata

Need French handwriting data? RIMES delivers. It’s built from simulated business mail — letters, invoices, forms — and mimics how handwriting appears in real-world documents. It’s excellent for training document-level models that work across layouts and writing styles, especially in multilingual deployments.

7. READ 2016 (ICFHR)

READ 2016 (ICFHR) dataset

Type: Full documents 

Language: Primarily German + multilingual 

Volume: Historical archives

Format: Offline scans (historic papers) 

Access: Free (academic only)

Annotations: Lines, baselines, region segmentation, full transcripts

Designed for the handwriting competition at ICFHR 2016, this dataset is a playground for layout-aware models. The documents are historical, messy, and packed with structure. If you’re doing handwritten layout analysis, line segmentation, or want to simulate archival OCR — this is a strong candidate. 

Historical Manuscripts

Modern handwriting is messy. Old handwriting? It’s a puzzle. From faded ink to century-old flourishes, these datasets are built for models that need to work with historical documents. Whether you're digitizing archives, training transcription engines, or building paleographic tools — this is where you train them to read the past. 

8. Bentham Dataset

Bentham Dataset

Type: Pages, Lines, Words 

Language: English (18th–19th century) 

Volume: 6,000+ page images
Format: Offline scans (manuscripts) 

Access: Free (via Transkribus platform)
Annotations: Full transcripts, structured line segmentation

Handwriting gets harder when it’s 200 years old. The Bentham dataset gives you digitized manuscripts from philosopher Jeremy Bentham’s archive — complete with dense cursive, old spellings, and vintage quirks. A strong testbed for training models on historical material, especially with a clean transcription pipeline already in place. 

🏛️ Dataset Spotlight: Bentham

dataset_name: Bentham Dataset
type: Pages, Lines, Words
access: Free (via Transkribus)
format: Scanned historical manuscripts with structured annotation
ideal_for: Historical OCR, cursive transcription, digital humanities

9. Saint Gall

Saint Gall dataset

Type: Pages, Lines 

Language: Latin 

Volume: 60 manuscript pages (9th century)

Format: Offline scans 

Access: Free (research use only)

Annotations: 1,410 lines, full transcripts, word boundaries

This dataset brings medieval Latin to machine-readable form. Sourced from a 9th-century monastic manuscript, Saint Gall is dense, handwritten, and historically significant. It’s ideal for training models on ancient scripts, historical OCR systems, or building tools for digital humanities research.

10. Konzil and Patzig

Konzil
Patzig

Type: Full documents 

Language: German (old variants) 

Volume: 100+ archival files

Format: Offline scans 

Access: Paid / Restricted (project-based academic access)

Annotations: Region layouts, lines, transcripts, metadata

These are raw, handwritten German archival documents — legal records, council notes, historical letters. Konzil and Patzig challenge your model with shifting layouts, inconsistent line spacing, and period-specific language. Best for testing layout robustness and handling

11. Digital Peter

Digital Peter dataset

Type: Lines, Pages 

Language: Russian cursive (18th century) 

Volume: 9,694 images

Format: Offline (digitized manuscripts) 

Access: Free (project-based open access)

Annotations: Line-level segmentation, transcript metadata

Peter the Great’s handwriting — yes, literally. This dataset captures early modern Russian script across nearly 10,000 scanned lines. It’s excellent for training models in Cyrillic cursive, working with historical Slavic texts, or enhancing language-specific document digitization pipelines.

Online Handwriting (Stylus-based)

Instead of scanning paper, these datasets capture how handwriting is created — stroke by stroke. Used for digital ink recognition, handwriting input systems, and behavioral biometrics, they’re ideal when timing, pen pressure, and pen paths matter more than pixels.

12. BRUSH

BRUSH dataset

Type: Online strokes 

Language: English 

Volume: 27,649 word samples from 170 writers

Format: Stylus-tracked trajectories 

Access: Free (commercial use permitted)

Annotations: Character and word-level transcripts, stroke order

BRUSH records how people write, not just what they write. It tracks pen movement, speed, and order — making it a solid foundation for building stylus input engines or gesture-aware handwriting systems. If you’re working on smart tablets, signature apps, or custom keyboards, this is a practical start.

✍️ Dataset Spotlight: BRUSH

dataset_name: BRUSH
type: Online strokes
access: Free (commercial use permitted)
format: Stylus-tracked stroke trajectories
ideal_for: Stylus input engines, stroke-aware OCR, behavioral biometrics

13. DeepWriting

DeepWriting dataset

Type: Online strokes 

Language: English 

Volume: ~30,000 sequences (varied tasks)

Format: Pen trajectories (x, y, pressure, time) 

Access: Free (research use only)

Annotations: Stroke-level time series, writing samples across tasks

Built for writer-specific modeling, DeepWriting is less about decoding text and more about understanding how people write. It’s a great fit for behavioral authentication, stroke prediction, and training personalized handwriting input models.

14. CASIA-OLHWDB

CASIA-OLHWDB

Type: Online strokes 

Language: Chinese 

Volume: 1.5M+ characters

Format: Stylus-captured stroke sequences 

Access: Free (academic only)

Annotations: Character-level with stroke timing, direction

For Chinese handwriting, CASIA is a powerhouse. It captures both the order and dynamics of writing over a massive vocabulary. If your system needs to handle non-Latin scripts, complex stroke composition, or multilingual handwriting on touchscreens — start here.

Multilingual and Non-Latin Datasets

Latin script isn’t the whole story. Models deployed in the real world need to handle complex curves, stacked characters, and unfamiliar writing systems. These datasets are built for that challenge — whether it’s decoding Bengali newspapers, handwritten Chinese homework, or mixed Kazakh-Russian text.

15. BanglaLekha-Isolated 

BanglaLekha-Isolated 

Type: Isolated characters 

Language: Bangla 

Volume: 166,105 samples from 2,000+ writers

Format: Offline scans 

Access: Free (commercial use permitted)

Annotations: Unicode character labels, demographics (age, gender, district)

BanglaLekha is one of the largest open handwriting datasets in Bengali. It includes digits, basic characters, and compound forms — all labeled and scanned from real handwritten samples. For Bangla OCR systems, it’s the most accessible starting point available today.

16. BN-HTRd

BN-HTRd dataset

Type: Documents and words 

Language: Bangla 

Volume: 786 pages, 108,181 words

Format: Offline scans of printed and handwritten text 

Access: Free (academic only)

Annotations: Word and line-level transcriptions

Where BanglaLekha isolates characters, BN-HTRd goes full-document. It combines dense paragraph-level data with accurate ground truth — making it better suited for full-page OCR, paragraph recognition, and real-world layout handling in Bengali text workflows.

17. CASIA-HWDB 

CASIA-HWDB dataset

Type: Characters (offline) 

Language: Chinese 

Volume: 1.17 million images

Format: Offline scans 

Access: Free (academic only)

Annotations: Character labels, writer ID, style metadata

If you’re working with Chinese, this is your training ground. CASIA-HWDB contains over a million handwritten characters covering thousands of classes. It’s the default for handwritten Chinese recognition and scales well from digit-level to large vocabulary models.

🌐 Dataset Spotlight: CASIA-HWDB

dataset_name: CASIA-HWDB
type: Chinese Characters (offline)
access: Free (academic use only)
format: Scanned handwritten images with writer and style metadata
ideal_for: Non-Latin OCR, Chinese script training, large-vocabulary handwriting recognition

18. HKR (Kazakh-Russian Handwriting) 

HKR (Kazakh-Russian Handwriting) 

Type: Sentences 

Languages: Kazakh, Russian 

Volume: 63,000+ sentences

Format: Offline scans 

Access: Free (research use only)

Annotations: Sentence-level transcriptions, mixed-script tags

HKR is one of the few modern handwriting datasets that mixes languages and scripts. Collected from over 200 native speakers, it reflects how Kazakh and Russian are used side by side — perfect for multilingual recognition systems and research into script-switching behavior.

Scene Text and Natural Images

Not all handwriting sits neatly on paper. In the wild, it appears on whiteboards, shop signs, menus, product labels — usually under bad lighting and weird angles. These datasets teach your model to recognize text when it's embedded in the visual noise of real life.

19. IIIT 5K-Word 

IIIT 5K-Word dataset

Type: Word images from scene photos 

Language: English 

Volume: 5,000 word crops
Format: Cropped natural scene photos 

Access: Free (commercial use permitted)
Annotations: Word-level transcripts, bounding boxes

Sourced from Google image search, IIIT 5K is a compact but widely used dataset for scene-text recognition. It’s great for evaluating how well your system handles diverse fonts, occlusions, and background clutter — without having to process full-page documents.

20. Street View Text (SVT) 

Street View Text (SVT) dataset

Type: Street sign words 

Language: English 

Volume: 647 images, 8,000+ words

Format: Scene text from Google Street View 

Access: Free (commercial use permitted)

Annotations: Word crops, transcripts, location metadata 

Pulled from Google Street View, SVT captures text as it really appears in public spaces — on signs, storefronts, and billboards. Ideal for building AR systems, mobile OCR apps, or smart navigation tools that read the environment on the fly.

🧾 Dataset Spotlight: Street View Text (SVT)

dataset_name: Street View Text (SVT)
type: Scene Text (street signs)
access: Free (commercial use permitted)
format: Natural image crops with bounding boxes and metadata
ideal_for: Scene text recognition, smart navigation, AR text detection

Final Takeaways

Handwriting comes in many forms — digits, sentences, scripts, symbols. The best dataset depends on what your model needs to learn. Start with clean data, match the format to your use case, and scale up from there. The right training set will do more for your system than any tweak downstream.

🗂️ Cheat Sheet: All 20 Handwriting Datasets

- dataset_name: MNIST
  type: Digits
  access: Free (commercial use permitted)
  format: Offline scans (28×28)
  ideal_for: Digit classification, baseline testing
- dataset_name: EMNIST
  type: Digits + Letters
  access: Free (commercial use permitted)
  format: Offline
  ideal_for: Digit-letter recognition, OCR training
- dataset_name: Chars74K
  type: Characters
  access: Free (commercial use permitted)
  format: Offline (scene + hand-drawn)
  ideal_for: Multilingual OCR, character classification
- dataset_name: HASYv2
  type: Symbols
  access: Free (commercial use permitted)
  format: Offline (binary 32×32)
  ideal_for: Math OCR, symbol classification
- dataset_name: IAM Handwriting Database
  type: Words + Sentences
  access: Free (academic only)
  format: Offline scans
  ideal_for: Full-line OCR, handwriting-to-text models
- dataset_name: RIMES
  type: Documents, Sentences, Words
  access: Free (academic only)
  format: Offline (business letters)
  ideal_for: French handwriting OCR, document modeling
- dataset_name: READ 2016
  type: Full documents
  access: Free (academic only)
  format: Offline scans
  ideal_for: Layout-aware OCR, archival transcription
- dataset_name: Bentham Dataset
  type: Pages, Lines, Words
  access: Free (via Transkribus platform)
  format: Offline scans (manuscripts)
  ideal_for: Historical OCR, digital humanities
- dataset_name: Saint Gall
  type: Pages, Lines
  access: Free (research use only)
  format: Offline scans
  ideal_for: Latin OCR, medieval script recognition
- dataset_name: Konzil and Patzig
  type: Full documents
  access: Paid / Restricted (project-based academic access)
  format: Offline scans
  ideal_for: Layout analysis, old German script OCR
- dataset_name: Digital Peter
  type: Lines, Pages
  access: Free (project-based open access)
  format: Offline (digitized manuscripts)
  ideal_for: Cyrillic cursive OCR, Slavic text recognition
- dataset_name: BRUSH
  type: Online strokes
  access: Free (commercial use permitted)
  format: Stylus-tracked trajectories
  ideal_for: Stylus input, behavioral biometrics
- dataset_name: DeepWriting
  type: Online strokes
  access: Free (research use only)
  format: Pen trajectories
  ideal_for: Stroke modeling, personalized handwriting
- dataset_name: CASIA-OLHWDB
  type: Online strokes
  access: Free (academic only)
  format: Stylus-captured sequences
  ideal_for: Chinese stroke recognition, handwriting input
- dataset_name: BanglaLekha-Isolated
  type: Isolated characters
  access: Free (commercial use permitted)
  format: Offline scans
  ideal_for: Bangla OCR, character classification
- dataset_name: BN-HTRd
  type: Documents and words
  access: Free (academic only)
  format: Offline scans
  ideal_for: Full-page Bangla OCR, layout-aware systems
- dataset_name: CASIA-HWDB
  type: Characters (offline)
  access: Free (academic only)
  format: Offline scans
  ideal_for: Chinese OCR, large vocabulary modeling
- dataset_name: HKR
  type: Sentences
  access: Free (research use only)
  format: Offline scans
  ideal_for: Multilingual OCR, script-switching recognition
- dataset_name: IIIT 5K-Word
  type: Scene word crops
  access: Free (commercial use permitted)
  format: Cropped natural scene photos
  ideal_for: Scene-text OCR, image word spotting
- dataset_name: Street View Text (SVT)
  type: Street sign words
  access: Free (commercial use permitted)
  format: Scene text from Google Street View
  ideal_for: AR navigation, street OCR

FAQs

1. What is the best handwriting dataset for OCR training?
It depends on your language and task. For English OCR, the IAM Handwriting Database is the gold standard. For digits, start with MNIST or EMNIST.
2. Are there any handwriting datasets with both printed and cursive writing?
Yes. Datasets like IAM and RIMES include a mix of printed and cursive styles, which makes them ideal for training flexible recognition systems.
3. Can I use handwriting datasets for commercial projects?
Some datasets are free for commercial use (like MNIST, EMNIST, Chars74K), while others are limited to academic or research use. Always check the license before training.
4. Is there a handwriting dataset for stylus or online input?
Yes. BRUSH, DeepWriting, and CASIA-OLHWDB are designed for online handwriting — capturing strokes, pressure, and timing from stylus input.
5. Which handwriting datasets support multiple languages or scripts?
Multilingual datasets include CASIA-HWDB (Chinese), BanglaLekha (Bangla), Digital Peter (Russian), and HKR (Kazakh–Russian). These are useful for non-Latin OCR training.

Insights into the Digital World

20 Best Handwriting Datasets for Machine Learning

Handwriting is messy. It loops, smudges, and slants in a hundred different ways depending on who’s holding the pen. And […]

What Is Entity Linking? The NLP Trick That Connects the Dots

Imagine reading “Paris” in a sentence. Are we talking about the capital of France, Paris Hilton, or the ancient hero […]

20 Best Free Healthcare Datasets for ML in 2025

Top 20 healthcare datasets for machine learning—free, diverse, and ready to train. Includes EHRs, X-rays, dialogues, audio, and commercial-grade data. […]

20 Best Financial Datasets for Machine Learning

Why Financial Data Powers ML Most datasets are static snapshots. Financial data? It’s alive. Markets move. Policies shift. Consumers panic. […]

AI for Image Recognition: How Machines Learned to See—and Why It Matters 

Your phone sorts photos by face. Your car knows when you’re not paying attention. And warehouses spot defects in milliseconds. […]

Automatic Speech Recognition (ASR): How Machines Learn to Listen

1. What Is Automatic Speech Recognition? Talk to your phone. Rant to your car. Whisper to your smart speaker. And […]

Lidar vs Radar: Complete Guide 2025

They both “see” the world — but in totally different ways. Lidar sketches every curve and corner in laser-sharp detail. […]

Facial Recognition – What is It and How It Works

Facial recognition has quietly slipped into our everyday lives. It helps you unlock your phone, breeze through airport security, or […]

Research on the Most Stressful Driving Regions in the UK

Over 100,000 road accidents take place across the UK each year — a toll that includes injuries and fatalities. In […]

ML Dataset Trends Research and Statistics

Research on ML Dataset Search Trends (2019–2024)

In this study, we analyzed trends and statistics related to the search for machine learning (ML) datasets over the past […]

Ready to get started?

Tell us what you need — we’ll reply within 24h with a free estimate

    What service are you looking for? *
    What service are you looking for?
    Data Labeling
    Data Collection
    Ready-made Datasets
    Human Moderation
    Medicine
    Other (please describe below)
    What's your budget range? *
    What's your budget range?
    < $1,000
    $1,000 – $5,000
    $5,000 – $10,000
    $10,000 – $50,000
    $50,000+
    Not sure yet
    Where did you hear about Unidata? *
    Where did you hear about Unidata?
    Head of Client Success
    Andrew
    Head of Client Success

    — I'll guide you through every step, from your first
    message to full project delivery

    Thank you for your
    message

    It has been successfully sent!

    We use cookies to enhance your experience, personalize content, ads, and analyze traffic. By clicking 'Accept All', you agree to our Cookie Policy.