Table of Contents
Common Sketch Properties
- Single-pass, “one-touch” algorithms enable efficient processing in either real-time or batch.
- Mergeable algorithms enable parallel processing, which is critical for large systems.
- Space Sub-linear algorithm not only start small but grow very slowly or not at all as the size of the input stream grows.
- Query results are approximate but within well defined error bounds that are user
configurable by trading off sketch size with accuracy.
- Designed for Large-scale computing environments
that must handle Big Data.( e.g.,
and are heavily used within Yahoo.
- Maven deployable and registered with
The Central Repository.
- Comprehensive unit tests and testing tools are provided.
- Extensive documentation with the systems developer in mind.
Built-In, General Purpose Functions
- General purpose Memory Package for managing data off the Java Heap.
This enables systems designers the ability to manage their own large data heaps with
dedicated processor threads that would otherwise put undue pressure on the Java heap and
its garbage collection.
- General purpose implementaion of Austin Appleby’s 128-bit MurmurHash3 algorithm,
with a number of useful extensions.
Robust, High Quality Implementations.
- Extensive test code leveraging TestNG.
- Benchmarking, speed and accuracy characterization and performance testing code
included in the sketches-misc repository.
- High Test Code coverage (
Hive,) as measured by
JaCoCo and published by
- Comprehensive Javadocs that satisfy
Java JDK8 standards.
- Suitable for production environments.
Opportunities to Extend
- There is ample opportunity for interested parties to contribute additional algorithms in this exciting area.
Count Distinct / Count Unique
Solves Computational Challenges Associated with Unique Identifiers
- Estimating cardinality of a stream with many duplicates
- Performing set operations (e.g., Union, Intersection,
and Difference) on sets of unique identifiers
- Estimates of the error bounds of the result can be obtained directly from the result sketch
- Two families of Count Unique algorithms:
- Get normal or inverse PDFs or CDFs of the distributions of any numeric value from your raw data in a
- Well defined error bounds on the result.
- Get the most frequent items from a stream of items.
- Associative sketches that are useful for performing approximate join operations and
extracting other kinds of behavior associated with unique identifiers.