What Is Data Curation? 

Introduction

Definition of Data Curation

Data curation is part of the data management process. It involves a comprehensive approach to managing, organizing, and enhancing the value of data throughout its lifecycle. The goal is to make sure that data remains accurate and accessible over time.

Data curation involves collecting, organizing, annotating, preserving, and sharing data to make it more useful for individuals, groups of people (e.g., researchers, policymakers, businesses), or the general public. Data curation is not just about storing data – it’s about ensuring data quality, context, and relevance.

Why is Data Curation Important?

Today, the volume of generated data is expanding rapidly: International Data Corporation (IDC) estimates that the global datasphere will reach 175 zettabytes by 2025. This immense volume of data demands a comprehensive and effective approach to storing and curating data.

Data curation is crucial for several reasons:

Quality Control

It ensures that the data is accurate and reliable, providing a foundation for informed decision-making.

Efficiency

Properly curated data is easier to access and analyze – this saves time and resources, as well as expedites business processes.

Long-Term Value

Curation adds value to data, making it a reusable asset for future research and analysis.

Data Verification

In the academic and scientific fields, data curation is essential for verifying research results and facilitating peer review. Researchers can replicate studies and build upon previous work instead of starting from scratch.

The Data Curation Process

Data curation is a multistage, meticulously structured process. Its complexity depends on the type of data and the intended use, but it generally includes several key stages.

Main Steps in the Data Curation Process

1. Data Collection

First, data is gathered from various sources, including sensors, surveys, transactions, or online interactions. During this phase, it’s crucial to ensure that data collection meets ethical and legal requirements. 

2. Data Assessment

Once collected, the data undergoes an assessment phase where its quality and accuracy are evaluated. This involves checking for completeness, reliability, and relevance to the intended purpose.
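
The assessment step can be sketched as a simple completeness check over a batch of records. The field names and sample data below are purely illustrative:

```python
def completeness(records):
    """Fraction of non-missing values per field across all records."""
    fields = {f for record in records for f in record}
    return {
        f: sum(1 for r in records if r.get(f) not in (None, "")) / len(records)
        for f in fields
    }

# Hypothetical survey extract with some gaps.
survey = [
    {"id": 1, "age": 34, "city": "Oslo"},
    {"id": 2, "age": None, "city": "Bergen"},
    {"id": 3, "age": 29, "city": ""},
]

report = completeness(survey)
# Fields falling below a chosen threshold get flagged for review.
flagged = [f for f, ratio in report.items() if ratio < 1.0]
```

In practice the same idea extends to reliability and relevance checks; completeness is simply the easiest property to quantify.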

3. Data Cleaning

Next, the data is cleaned: errors and inconsistencies are identified and corrected during this phase. Data cleaning ensures the integrity of the dataset.
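
A minimal sketch of this step, using invented records: duplicate entries are dropped and an inconsistently formatted field is normalized.

```python
def clean(records):
    """Drop duplicate ids and normalize whitespace/casing in 'city'."""
    seen, cleaned = set(), []
    for r in records:
        if r["id"] in seen:
            continue  # duplicate record: keep only the first occurrence
        seen.add(r["id"])
        cleaned.append({**r, "city": r["city"].strip().title()})
    return cleaned

raw = [
    {"id": 1, "city": "  oslo "},
    {"id": 2, "city": "BERGEN"},
    {"id": 1, "city": "Oslo"},  # duplicate of the first record
]
tidy = clean(raw)
```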

4. Metadata Creation

During this stage, metadata is added to the dataset to document its origin, context, and format.
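
As a sketch, metadata can be kept alongside the dataset as a simple sidecar document. The keys below loosely follow Dublin Core element names, and all values are invented:

```python
import json

metadata = {
    "title": "Customer Survey 2024",   # invented example values
    "creator": "Research Team",
    "date": "2024-03-01",
    "format": "text/csv",
    "description": "Responses collected via the annual online survey.",
    "source": "internal survey platform",
}

# Serialized next to the dataset so future users understand its context.
sidecar = json.dumps(metadata, indent=2)
```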

5. Data Transformation

Data is then transformed into a standardized format, which makes it easier to find and analyze. Transformation may involve standardizing naming conventions for variables or fields within the dataset.
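
For example, standardizing field names might look like the following sketch, which normalizes arbitrary column labels to snake_case (the labels are invented):

```python
import re

def to_snake_case(name):
    """Normalize a field name to snake_case."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())   # spaces, punctuation
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)  # camelCase boundaries
    return name.strip("_").lower()

columns = ["Customer ID", "orderDate", "Unit Price ($)"]
standardized = [to_snake_case(c) for c in columns]
```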

6. Data Storage

After undergoing all these steps, data is then stored in a suitable format and location – for example, in databases, data warehouses, or cloud storage solutions.

7. Data Preservation

Long-term preservation is necessary for data to remain accessible. This includes maintaining data’s format compatibility and safeguarding against data loss.
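
One common safeguard against silent data loss is a fixity check: record a checksum at ingest and re-verify it over time. A minimal sketch, with invented file contents:

```python
import hashlib

def fixity(data: bytes) -> str:
    """Checksum recorded at ingest and re-verified periodically."""
    return hashlib.sha256(data).hexdigest()

original = b"survey_id,age\n1,34\n"   # illustrative file contents
recorded = fixity(original)

# Later: recompute and compare to detect corruption or tampering.
intact = fixity(original) == recorded
corrupted = fixity(b"survey_id,age\n1,35\n") != recorded
```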

8. Data Sharing and Access

The final step is making the data available to users. This requires establishing access controls and distribution mechanisms that balance openness with privacy.

Who is Involved in Data Curation?

Several key roles are typically involved in the data curation process:

Data Curators

These are the professionals who lead the curation process. They have a strong understanding of both the subject matter and the technical aspects of data management.

Data Scientists and Analysts

These individuals focus on analyzing the data, deriving insights, and ensuring the data is ready for analysis. They play a key role in data cleaning and annotation processes. 

Data Stewards

Data stewards and curators work together to maximize the value of data. Stewards are responsible for the data governance aspect: they ensure that data is managed according to the organization’s policies, compliance requirements, and ethical standards.

IT and Data Management Professionals

These professionals focus on the technical aspects of storing, preserving, and securing the data. They manage databases, storage systems, and data security protocols.

Subject Matter Experts (SMEs)

SMEs provide essential knowledge about the context and use of the data, assisting in its annotation and ensuring its relevance and accuracy.

Responsibilities of Data Curators

Data curators have a comprehensive set of responsibilities that span the entire data lifecycle:

Assessment and Acquisition

Evaluating potential data sources and overseeing the data collection process to make sure it meets all the requirements.

Quality Assurance

Implementing processes to check and maintain the quality of the data.

Metadata Management

Creating and managing metadata so the data is well-documented and its lineage is clear.

Access and Sharing

Establishing protocols for data access, sharing, and distribution. Data curators also ensure compliance with data privacy and security policies.

Preservation

Implementing strategies for the long-term preservation of data, choosing the right storage solutions and formats.

Community Engagement

Communicating with end-users, stakeholders, and the broader community to ensure the data meets their needs.

Data Curation vs. Data Management: What Is the Difference?

Data curation and data management are two critical processes in data science. While they overlap in certain areas, each process has distinct objectives, methods, and outcomes. 

| Aspect | Data Curation | Data Management |
| --- | --- | --- |
| Goal | Enhancing the usability and value of data for specific research or analysis | Overseeing the entire lifecycle of data to ensure its availability and quality |
| Focus | In-depth handling of specific datasets | Broad oversight of all data assets |
| Activities | Selection, annotation, enrichment, and preservation of data | Data architecture design, integration, storage, security, governance, and policy implementation |
| Outcome | A dataset that is ready for specific analytical or research needs | A structured environment where data is securely stored, accessible, and managed |
| Tools Used | Data analysis software, metadata management tools, data archiving systems | Database management systems, data governance tools, security software |

Challenges in Data Curation

Data Volume and Variety

The sheer volume and diversity of data types can be overwhelming, making it difficult to curate data effectively. Innovative methods in artificial intelligence (AI) and machine learning (ML) can help overcome this challenge by automating parts of the curation process. Automation saves time and alleviates the workload, allowing data curators to focus on the parts of the process that require human judgment.

Data Quality

Ensuring data quality is a widespread challenge, given the various sources and potential for errors. To maintain high data quality, curators should implement strict QA protocols, including regular audits and validation checks.
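
Validation checks can be expressed as a set of named rules applied to each record. A minimal sketch, with invented rule names and fields:

```python
rules = {
    "age_in_range": lambda r: 0 <= r.get("age", -1) <= 120,
    "id_present":   lambda r: r.get("id") is not None,
}

def validate(record, rules):
    """Return the names of all rules a record fails."""
    return [name for name, check in rules.items() if not check(record)]

failures = validate({"id": 7, "age": 150}, rules)   # age out of range
```

Running such rules on every ingest, and again during periodic audits, turns quality assurance from a one-off effort into a repeatable protocol.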

Resource Constraints

Data curation is a multistage process that often requires significant resources, including skilled personnel and advanced technology. Investing in automation from the start can significantly cut expenses in the future.

Compliance and Security

Adhering to data privacy regulations and ensuring the security of data is becoming increasingly challenging. Compliance standards are hard to implement, and failures can be quite costly: GDPR fines can reach 20 million euros or 4% of a company’s global annual turnover. That’s why comprehensive data governance policies that align with legal requirements should be developed from the get-go. Robust security measures, in turn, make future compliance easier.

Tools and Technologies for Data Curation

Software and Platforms Used in Data Curation

Several software solutions and platforms have been developed to support the various aspects of data curation:

Data Management Platforms

Platforms like CKAN and Dataverse provide solutions for data publishing, sharing, and management. These platforms offer useful features such as metadata creation, data storage, and access control.

Data Cleaning Tools

Tools such as OpenRefine and Trifacta are designed to clean and transform data, helping curators ensure accuracy and consistency.

Metadata Management Tools

Dublin Core, formally the Dublin Core Metadata Element Set (DCMES), is a set of 15 core metadata elements used to describe digital or physical resources. It allows for the detailed documentation and annotation of datasets. MODS (Metadata Object Description Schema) is another valuable standard in data curation: a bibliographic metadata schema used to represent bibliographic data in machine-readable form.
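
As a sketch, a Dublin Core record can be produced with Python's standard XML library. The namespace URI below is the official DCMES one; the element values are invented:

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"   # official DCMES namespace
ET.register_namespace("dc", DC)

record = ET.Element("metadata")
for element, value in [
    ("title", "Customer Survey 2024"),
    ("creator", "Research Team"),
    ("date", "2024-03-01"),
    ("format", "text/csv"),
]:
    node = ET.SubElement(record, f"{{{DC}}}{element}")
    node.text = value

xml_text = ET.tostring(record, encoding="unicode")
```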

Digital Preservation Systems

Tools like Archivematica and Preservica support the long-term preservation of digital data, ensuring it remains accessible and usable over time.

Automation in Data Curation

Data scientists often report that 80% of their time is spent preparing data for analysis, leaving only 20% for the analysis itself. Automation plays a critical role in enhancing the efficiency and accuracy of the data curation process. Machine learning algorithms and AI can automate repetitive tasks such as data cleaning, classification, and annotation. For example, ML models can be trained to identify and rectify inconsistencies in datasets, reducing the manual workload and minimizing human errors.

Automation also extends to the extraction and analysis of data. Natural Language Processing (NLP) technologies, for instance, can automatically analyze textual data and extract relevant information and insights. This significantly speeds up the data curation process.
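
Full NLP pipelines rely on dedicated libraries, but as a toy illustration of automated extraction, even a pattern-based pass can pull structured fields out of free text (the note below is invented):

```python
import re

note = "Contact anna@example.com by 2024-06-15 regarding invoice #4821."

extracted = {
    "emails": re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", note),
    "dates": re.findall(r"\d{4}-\d{2}-\d{2}", note),
}
```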

The integration of these automated tools and technologies into data curation workflows not only streamlines the process but also enables data curators to focus on more strategic aspects of data management, such as quality analysis and decision-making. With the increasing complexity and volume of data, automation in data curation is becoming not just beneficial but essential in managing data efficiently.

Case Studies: Data Curation Across Different Industries

Healthcare: Genomic Data Curation at NCBI

The National Center for Biotechnology Information (NCBI) curates genomic data through its GenBank database – a comprehensive public database of nucleotide sequences and supporting bibliographic and biological annotation. This curation is vital for research in genomics, medicine, and biology, facilitating scientific discoveries and advancements in healthcare.

Finance: Bloomberg's Financial Data Services

Bloomberg is a prominent example in the financial industry, providing extensive data curation through its financial data services. Bloomberg collects, integrates, and delivers high-quality financial information, including market data, pricing, analytics, and news, to support investment and financial decisions worldwide.

Retail: Walmart's Data Café

Walmart has established a data analytics hub known as the Data Café (Collaborative Analytics Facilities for Enterprise), where vast amounts of data from over 200 sources, including sales, finance, social media, and logistics, are curated and analyzed to improve decision-making and operational efficiency in real time. This private cloud processes 2.5 PB of data every hour, and the hub can manage, model, and visualize more than 200 internal and external data streams along with 40 PB of transactional data.

Conclusion

Data curation plays a pivotal role in managing and employing data across various industries. By systematically collecting, organizing, cleaning, and preserving data, organizations can ensure its relevance and accessibility over time. 
