Cocoon uses LLMs to explore your tables, document their structures, and generate SQL for cleaning. The outputs include (1) SQL queries for cleaning, (2) a YAML file for table documentation, and (3) an HTML report for summaries.
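To make "SQL for cleaning" concrete, here is a minimal sketch of the general kind of query such a tool might produce. It is not Cocoon's actual output or API: the table `raw_orders`, its columns, and the cleaning rules are all hypothetical, and SQLite (from the Python standard library) stands in for whatever database you use.

```python
import sqlite3

# Hypothetical messy table; schema and values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        ("  Alice ", "10.50", "SHIPPED"),
        ("bob", "N/A", "shipped"),
        ("Carol", " 7 ", "Pending "),
    ],
)

# A hand-written cleaning query (an assumption, not generated by Cocoon):
# trim whitespace, map placeholder strings to NULL, cast amounts to numbers,
# and normalize the casing of status values.
CLEANING_SQL = """
SELECT
    TRIM(customer) AS customer,
    CASE WHEN TRIM(amount) IN ('', 'N/A') THEN NULL
         ELSE CAST(TRIM(amount) AS REAL) END AS amount,
    LOWER(TRIM(status)) AS status
FROM raw_orders
ORDER BY rowid
"""

cleaned = conn.execute(CLEANING_SQL).fetchall()
for row in cleaned:
    print(row)
```

Because the cleaning is expressed as a plain SELECT, the raw table is left untouched and the query can be reviewed before it is applied.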
Cocoon is open-source. Try out Cocoon in Google Colab.
This requires an LLM API (e.g., GPT-4, Claude-3, Gemini-Ultra, or a local LLM) but offers an interactive experience with no size or column limitations. It also supports databases (e.g., Snowflake, DuckDB).
Need support or have questions? Contact us.
More example results, from Kaggle datasets
Data Cleaning is based on the following research papers:
@inproceedings{huang2024relationalizing,
title={Relationalizing Tables with Large Language Models: The Promise and Challenges},
author={Huang, Zezhou and Wu, Eugene},
booktitle={2024 IEEE 40th International Conference on Data Engineering Workshops (ICDEW)},
organization={IEEE},
year={2024}
}