Table of Contents
# Quantiles StreamA Study

## Versions

## The Paper

## The Input Data

## Creating the Exact CDF and Histograms For Comparison

## The Results

### K = 256, Size = 21408 bytes, *a priori* Accuracy = +/- 0.717%

### K = 32, Size = 3232 bytes, *a priori* Accuracy = +/- 5.4%

## The Evaluation Source

The goal of this article is to compare the accuracy performance of the DataSketches Quantiles Sketch to an exact, brute-force computation using actual data extracted from one of our back-end servers.

Please get familiar with the Definitions for quantiles.

- DataSketches sketches-core, Released Version 0.12.0 Maven Central Repository Updated Aug 7, 2018.

The implementation of this Quantiles Sketch was originally inspired by [Mergable Summaries]https://dl.acm.org/citation.cfm?id=2213562), PODS, May, 2012 paper by Agarwal, Cormode, Huang, Phillips, Wei, and Yi.

The data file used for this evaluation, *streamA.txt*, is real data extracted from one of our back-end servers. It represents one hour of web-site time-spent data measured in milliseconds. The data in this file has a smooth and well-behaved value distribution with a wide dynamic range. It is a text file and consists of consecutive strings of numeric values separated by a line-feeds. Its size is about 2GB.

In order to measure the accuracy of the Approximate Histogram, we need an exact, brute-force analysis of *streamA.txt*.

The process for creating these comparison standards can be found here.

The DataSketches Quantiles Sketch provides a very straightforward tradeoff between sketch size and accuracy as the next two examples show.

The CDF plot:

The green dots in the above plot represents the Exact cumulative distribution (CDF) of ranks to quantile values. The red circles represent the values returned from the DS Quantiles Sketch *getQuantiles(double[])* function.

The plot reveals a very tight fit to the exact quantiles over the full range. The thin black and blue lines just to the left and right of the plotted points represent the lower-bound and upper-bound error derived from the sketch’s *getLowerBound()* and *getUpperBound()* functions.

The Histogram plot:

The next plot is the Histogram generated from the values returned by the *getPMF(double[])* function. Each of the returned values is multiplied by *getN()* to produce the respecive mass of each bin. The Green bars represent the Exact Distribution, and the Orange bars represent the results obtained from the DS Quantiles sketch. You can see that they match pretty well.

The CDF plot:

Note the wider interval between the upper and lower bounds curves:

The Histogram plot:

A little noiser but still tracks the exact shape pretty well:

The following are used to create the plots above.

- DSQsketch profiler
- DSQsketch K256 config
- DSQsketch K32 config
- StreamA Data file This is stored using git-lfs.

Run the above profilers as a java application and supply the config file as the single argument. The program will check if the data file has been unzipped, and if not it will unzip it for you.