The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Updated Apr 10, 2025 - Python
Refine high-quality datasets and visual AI models
A Doctor for your data
The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.
Interactively explore unstructured datasets from your dataframe.
Resources for Data Centric AI
A curated, but incomplete, list of data-centric AI resources.
Automatically find issues in image datasets and practice data-centric computer vision.
Curated list of open source tooling for data-centric AI on unstructured data.
Lab assignments for Introduction to Data-Centric AI, MIT IAP 2024 👩🏽‍💻
Official PyTorch implementation of the paper "Dataset Distillation with Neural Characteristic Function: A Minmax Perspective" (NCFM) in CVPR 2025 (Highlight).
Official Repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale"
[NeurIPS 2021] WRENCH: Weak supeRvision bENCHmark
[NeurIPS 2023] This is the code for the paper `Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias`.
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
Introduction to Data-Centric AI, MIT IAP 2023 🤖
OpenDataVal: a Unified Benchmark for Data Valuation in Python (NeurIPS 2023)
Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning
nbsynthetic is a simple and robust tabular synthetic data generation library for small and medium-sized datasets
Papers about training data quality management for ML models.