Documentation

API Snapshots: Core, Memory, Pig, Hive, Misc

Table of Contents

Sketch Origins

Sketching is a relatively recent development in the theoretical field of Stochastic Streaming Algorithms1, which deals with algorithms that can extract information from a stream of data in a single pass (sometimes called “one-touch” processing) using various randomization techniques.

Sketching is a synergistic blend of theoretical mathematics, statistics and computer science, refers to a broad range of algorithms, and has experienced a great deal of interest and growth since the mid 1990’s coinciding with the growth of the Internet and the need to process and analyze Big Data. The term sketch, with its allusion to an artist’s sketch, has become the popular term to describe these algorithms and associated data structures that implement the theory.

SketchOrigins

Philippe Flajolet is often refered to as the father of sketching with his research in analytic combinatorics and analysis of algorithms. His 1985 paper Probabilistic counting Algorithms for Data Base Applications co-authored with G. Nigel Martin is one of the earliest papers that outlines the sketching concepts. The recent book, Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches by Graham Cormode, et al, is an excellent review of this field.

At this point it is useful to describe the sketch elements of a common sub-class of sketching algorithms used for solving the count-distinct problem.


1Also known as “Approximate Query Processing”, see Sketch Techniques for Approximate Query Processing