Commercial

Portuguese Speech Recognition Dataset

The dataset contains 406 hours of annotated telephone dialogues from 590 native Portuguese speakers. It provides audio recordings, transcriptions, and speaker metadata for training speech recognition systems, NLP models, and other machine learning applications on diverse Portuguese speech.

  • Hours
    406
  • Speakers
    590
  • Word Accuracy Rate
    98%
  • NLP
  • LLM
  • Machine Learning
  • Audio Processing
  • ASR
  • Voice Recognition
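
The page does not state how the Word Accuracy Rate figure is calculated. A common convention, assumed here, is word accuracy = 1 - WER, where WER is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below illustrates that convention only; the function names and the example sentences are illustrative and not part of the dataset.

```python
def word_edit_distance(ref_words: list, hyp_words: list) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - WER, with simple lowercase whitespace tokenization."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return 1.0 - word_edit_distance(ref_words, hyp_words) / len(ref_words)


# Illustrative sentences: one of five reference words is missing.
accuracy = word_accuracy("bom dia como posso ajudar", "bom dia como ajudar")
print(f"{accuracy:.0%}")
```

Real evaluations usually normalize punctuation and numerals before scoring, which this sketch leaves out.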

Dataset Info

Characteristic | Data
Description | Audio of telephone dialogues in Portuguese for training NLP models in real-world conversational scenarios
Data types | Audio
Tasks | Speech recognition, NLP
Country | Portugal (PRT)
Hours of telephone dialogue | 406
Number of speakers | 590
Labeling | Annotation (text content, speaker's ID, gender, age and other attributes)
Gender | Male (46%), Female (54%)
Recording device | Android smartphone, iPhone

Statistics

Distribution by gender: Male 46%, Female 54%

Technical Characteristics

Characteristic | Data
Audio Format | WAV
Sampling Rate | 16 kHz
Number of Channels | Mono
Bit Depth | 16-bit
Recording condition | Low background noise (indoor)
Source and collection methodology: Data was collected by a partner of Unidata.
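
For storage planning, the figures above give a rough size estimate. The sketch below assumes the audio is stored as uncompressed PCM and that the 406 hours refers to total audio duration; neither is stated explicitly on this page. It also ignores WAV headers, transcripts, and metadata files.

```python
# Values taken from the tables on this page.
HOURS = 406
SAMPLE_RATE_HZ = 16_000
CHANNELS = 1          # mono
BYTES_PER_SAMPLE = 2  # 16-bit


def raw_pcm_bytes(hours, sample_rate_hz, channels, bytes_per_sample):
    """Size of uncompressed PCM audio in bytes (assumption: no compression)."""
    seconds = hours * 3600
    return seconds * sample_rate_hz * channels * bytes_per_sample


total_bytes = raw_pcm_bytes(HOURS, SAMPLE_RATE_HZ, CHANNELS, BYTES_PER_SAMPLE)
print(f"{total_bytes / 1e9:.1f} GB")  # prints "46.8 GB"
```

At 32,000 bytes per second of audio, the estimate comes to roughly 47 GB before any archive compression.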

Dataset Use Cases

  • Call Centers and Customer Service

    Improving Speech Recognition in Telephone Conversations

    The Portuguese Speech Recognition Dataset provides telephone dialogues recorded by native speakers with various accents. The audio recordings and transcripts help train recognition systems for call centers, supporting accurate speech transcription, customer intent detection, and other language processing tasks in real-world commercial use cases.

  • Healthcare and Telemedicine

    Enhancing Voice Technology for Remote Consultations

    This Portuguese audio dataset supports medical applications by offering speech samples suited to healthcare communication systems. Automatic speech recognition and language processing models trained on this diverse dataset can understand spoken words more accurately, helping doctors and patients interact in natural language on telemedicine platforms.

  • AI and Machine Learning Development

    Training Models for Speech Processing Tasks

    The dataset consists of high-quality Portuguese telephone dialogues and audio samples for machine learning and deep learning projects. With diverse accents and natural speech signals, it can be used to improve recognition technology, build multilingual speech models, and create custom datasets for voice assistants, transcription services, and automatic speech translation systems.

  • Education and Language Technology

    Building Tools for Portuguese Language Learning

    This dataset enables the creation of speech processing applications for language learning platforms. By using annotated audio files with accurate transcriptions, learning models can better process natural speech, recognize voice patterns, and adapt to various accents, helping learners gain proficiency in Portuguese through interactive voice technology.

FAQs

What is included in the Portuguese Speech Recognition Dataset?
This dataset consists of 406 hours of telephone speech recordings from 590 speakers in Portugal. It includes audio files in WAV format with annotations such as transcriptions, speaker ID, gender, and age.
What types of annotations are provided?
The dataset includes fully transcribed dialogues along with metadata annotations for speaker ID, gender, and age. These labels are essential for building speech recognition models with high accuracy.
Is the dataset suitable for commercial use?
Yes, the dataset is licensed for both research and commercial usage. It can be applied in customer service systems, voice recognition software, speech-to-text solutions, and other language processing technologies.
What should I consider before buying this dataset?
When purchasing it, check the audio format, sampling rate, and the number of native Portuguese speakers included. Make sure the annotations and speech samples match your goals for speech recognition, NLP models, or machine learning projects.
Do Unidata datasets follow GDPR or other data privacy regulations?
Yes. All Unidata datasets are curated in compliance with GDPR and international privacy laws. The audio recordings were collected only from legally permissible sources to ensure ethical data collection and safe commercial usage.
How long does it take to receive the dataset?
After submitting a request, Unidata will review the details and complete the necessary documentation with you. Once the agreement is signed and payment is received, the dataset will be delivered within 3-10 days.
Is it unique data?
Yes. This dataset consists of telephone speech collected directly from real participants through Unidata’s partner, making it unique and not available in open-source repositories.
Is this a real-world dataset or synthetic data?
This is a real-world dataset. The Portuguese Speech Recognition Dataset was collected from native speakers using Android and iPhone devices in natural recording conditions with low background noise. It provides authentic speech signals, audio files, and voice patterns for training automatic speech recognition and NLP applications.
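
The pre-purchase checks suggested in the FAQ (audio format and sampling rate) can be run against a sample file. The sketch below uses Python's standard `wave` module and assumes the sample is a plain PCM WAV file; the expected values come from the Technical Characteristics table, while the function names are illustrative.

```python
import wave

# Expected format, taken from the Technical Characteristics table.
EXPECTED = {"sample_rate_hz": 16_000, "channels": 1, "bit_depth": 16}


def read_wav_format(path):
    """Read basic format fields from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        return {
            "sample_rate_hz": sample_rate,
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,
            "duration_s": wav.getnframes() / sample_rate,
        }


def format_mismatches(path, expected=EXPECTED):
    """Return {field: (expected, actual)} for every field that differs."""
    actual = read_wav_format(path)
    return {
        field: (want, actual[field])
        for field, want in expected.items()
        if actual[field] != want
    }
```

Calling `format_mismatches` on a downloaded sample (for example a hypothetical `sample.wav`) returns an empty dictionary when the file matches the listed 16 kHz, mono, 16-bit format.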
Still have questions about using Unidata datasets? Read our user-guides

What our clients are saying

UniData

4 3 Reviews

Paul 2025-02-21

Very Positive Experience!

The team was very responsive when requesting a specific dataset, and was able to work with us on what data we specifically needed and custom pricing for our use case. Overall a great experience, and would recommend them to others!

Thorsten 2025-01-09

Very good experience

We got in touch with UniData to buy several datasets from them. Communication was very cooperative, quick, and friendly. We were able to find contract conditions that suited both parties well. I also appreciate the team's dedication to understand and address the needs of the customer. And the datasets we bought from UniData matched with our expectations.

Max Crous 2024-10-08

Data purchase

Our team got in touch with UniData for purchasing video data. The team at UniData was transparent, timely, and pleasant to communicate and negotiate with. Their samples and descriptions aligned well with the data we received. We will certainly reach out to UniData again if we're in search of 3rd party video data.

Abhijeet Zilpelwar 2025-02-26

Data is well organized and easy to…

Data is well organized and easy to consume. We could download and use it for training within few hours of receiving the data links.

Why Choose Us

Unidata offers unparalleled expertise in AI data solutions, delivering superior data quality and optimized workflows

Expertise

Our team consists of industry-leading experts in AI data solutions

Quality

We ensure superior data quality to maximize your AI project's potential

Efficiency

Our optimized workflows accelerate your model training processes

Proven Results

Our track record of case studies demonstrates our ability to deliver outstanding outcomes

Customization

Our track record of case studies demonstrates our ability to deliver outstanding outcomes

Support

We provide ongoing support and consultation to ensure continuous success
1000+ full-time assessors
