Documentation

API Snapshots: Core, Memory, New Memory, Pig, Hive, Misc

Table of Contents

Key Features

Common Sketch Properties

  • Single-pass, “one-touch” algorithms enable efficient processing in either real-time or batch.
  • Mergeable algorithms enable parallel processing, which is critical for large systems.
  • Space Sub-linear algorithm not only start small but grow very slowly or not at all as the size of the input stream grows.
  • Query results are approximate but within well defined error bounds that are user configurable by trading off sketch size with accuracy.
  • Designed for Large-scale computing environments that must handle Big Data.( e.g., Hadoop, Pig, Hive, Druid, Spark), and are heavily used within Yahoo.
  • Maven deployable and registered with The Central Repository.
  • Comprehensive unit tests and testing tools are provided.
  • Extensive documentation with the systems developer in mind.

Built-In, General Purpose Functions

  • General purpose Memory Package for managing data off the Java Heap. This enables systems designers the ability to manage their own large data heaps with dedicated processor threads that would otherwise put undue pressure on the Java heap and its garbage collection.
  • General purpose implementaion of Austin Appleby’s 128-bit MurmurHash3 algorithm, with a number of useful extensions.

Robust, High Quality Implementations.

  • Extensive test code leveraging TestNG.
  • Benchmarking, speed and accuracy characterization and performance testing code included in the sketches-misc repository.
  • High Test Code coverage ( Core, Pig, Hive,) as measured by JaCoCo and published by Coveralls.
  • Comprehensive Javadocs that satisfy Java JDK8 standards.
  • Suitable for production environments.

Opportunities to Extend

  • There is ample opportunity for interested parties to contribute additional algorithms in this exciting area.

Key Algorithms

Count Distinct / Count Unique

Solves Computational Challenges Associated with Unique Identifiers

  • Estimating cardinality of a stream with many duplicates
  • Performing set operations (e.g., Union, Intersection, and Difference) on sets of unique identifiers
  • Estimates of the error bounds of the result can be obtained directly from the result sketch
  • Two families of Count Unique algorithms:

Quantiles

  • Get normal or inverse PDFs or CDFs of the distributions of any numeric value from your raw data in a single pass.
  • Well defined error bounds on the result.

Frequent Items

  • Get the most frequent items from a stream of items.

Tuple Sketch

  • Associative sketches that are useful for performing approximate join operations and extracting other kinds of behavior associated with unique identifiers.