Apr 14, 2021
Data cleaning is naturally framed as probabilistic inference in a generative model, combining a prior distribution over ground-truth databases with a likelihood that models the noisy channel by which the data are filtered and corrupted to yield incomplete, dirty, and denormalized datasets. Based on this view, we present PClean, a probabilistic programming language for leveraging dataset-specific knowledge to clean and normalize dirty data. PClean is powered by (1) a non-parametric model of relational database instances, customizable via probabilistic programs, (2) a sequential Monte Carlo inference algorithm that exploits the model's structure, and (3) near-optimal SMC proposals and blocked Gibbs rejuvenation moves constructed on a per-dataset basis. We show empirically that short (<50-line) PClean programs can be faster and more accurate than generic PPL inference on multiple data-cleaning benchmarks; perform comparably in terms of accuracy and runtime to state-of-the-art data-cleaning systems (unlike generic PPL inference given the same runtime); and scale to real-world datasets with millions of records.
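The framing in the abstract, a prior over clean data combined with a noisy-channel likelihood, can be illustrated on a single field. The sketch below is not PClean code and does not use PClean's API, model, or inference algorithm. It is a minimal Python illustration under assumed ingredients: a hypothetical prior over three city names (`PRIOR`), a hypothetical typo channel in which each character edit multiplies the likelihood by `typo_rate`, and exact posterior enumeration in place of sequential Monte Carlo.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical prior over ground-truth values (an assumption for illustration).
PRIOR = {"Boston": 0.5, "Austin": 0.3, "Houston": 0.2}


def likelihood(observed: str, clean: str, typo_rate: float = 0.1) -> float:
    """Unnormalized noisy channel: every character edit costs a factor typo_rate."""
    return typo_rate ** edit_distance(observed, clean)


def posterior(observed: str) -> dict:
    """Posterior over clean values: prior times likelihood, then normalized."""
    weights = {c: p * likelihood(observed, c) for c, p in PRIOR.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def clean_value(observed: str) -> str:
    """Return the most probable clean value given the dirty observation."""
    post = posterior(observed)
    return max(post, key=post.get)
```

Calling `clean_value("Bostn")` picks `"Boston"`, because a single edit explains the observation far better than the several edits needed for the other candidates. Exact enumeration like this only works for a tiny candidate set. The abstract's point is that realistic relational models need structured inference such as SMC with tailored proposals.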
The 24th International Conference on Artificial Intelligence and Statistics was held virtually from Tuesday, 13 April 2021 to Thursday, 15 April 2021.