Oral: PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming

Apr 14, 2021

Sprecher:innen

Über

Data cleaning is naturally framed as probabilistic inference in a generative model, combining a prior distribution over ground-truth databases with a likelihood that models the noisy channel by which the data are filtered and corrupted to yield incomplete, dirty, and denormalized datasets. Based on this view, we present PClean, a probabilistic programming language for leveraging dataset-specific knowledge to clean and normalize dirty data. PClean is powered by (1) a non-parametric model of relational database instances, customizable via probabilistic programs, (2) a sequential Monte Carlo inference algorithm that exploits the model's structure, and (3) near-optimal SMC proposals and blocked Gibbs rejuvenation moves constructed on a per-dataset basis. We show empirically that short (<50-line) PClean programs can be faster and more accurate than generic PPL inference on multiple data-cleaning benchmarks; perform comparably in terms of accuracy and runtime to state-of-the-art data-cleaning systems (unlike generic PPL inference given the same runtime); and scale to real-world datasets with millions of records.

Organisator

Kategorien

Über AISTATS 2021

The 24th International Conference on Artificial Intelligence and Statistics was held virtually from Tuesday, 13 April 2021 to Thursday, 15 April 2021.

Präsentation speichern

Soll diese Präsentation für 1000 Jahre gespeichert werden?

Wie speichern wir Präsentationen?

Ewigspeicher-Fortschrittswert: 1 = 0.1%

Freigeben

Empfohlene Videos

Präsentationen, deren Thema, Kategorie oder Sprecher:in ähnlich sind

Interessiert an Vorträgen wie diesem? AISTATS 2021 folgen