Table of Contents
# Definitions for Quantiles Studies

## Hypothetically Sorted Stream (HSS) of Values

## Rank and Mass

## getQuantile(rank) and getRank(value)

### Accuracy

### The Sketch and Data Independence

The goal of this article is to compare the accuracy performance of the Druid built-in Approximate Histogram to an exact, brute-force computation using actual data extracted from one of our back-end servers.

Consider an input stream of *n* values. Now imagine if this input stream was sorted from the smallest to largest value. This is what we call the *hypothetical sorted stream* (HSS) of input values. We call it *hypothetical* because we never actually perform this sort (it would be very expensive) so we ask you to imagine it.

We define the *natural rank* of a value in this HSS as its zero-based index that ranges from 0 to *n-1*. If we divide the natural rank by *n*, it becomes the *normalized or fractional rank*, and would range from 0 to *(n-1)/n* and be in the interval [0, 1). In our documentation when we refer to *rank* we are referring to the *normalized or fractional rank* and not the *natural rank*.

This *normalized rank* is closely associated with the concept of *mass*. If there are 1 million values in the HSS, then the value associated with the rank 0.5 represents the median value, or the center-point of the HSS *mass*, where half of the values (500,000) are below the median and half are above.

We now have two hypothetical streams of values, a HSS of values and the stream of ranks where each value in the HSS is assigned a unique rank.
The definiton of *quantile* is the *value* associated with a particular *rank*. So the function *getQuantile(r)* returns the value associated rank *r*.
Similarly, the function *getRank(v)* returns the rank associated with the value *v*.

If a particular value *v* is repeated several times in the HSS, there is obviously a range of ranks that are assigned to each of the identical *v* values. This results in asymmetry between *v = getQuantile(r)* and *r = getRank(v)*. So how do we answer the following two queries when there are duplicate values in the HSS??

- What rank is returned from the query
*getRank(v)*? - What value is returned from the query
*getQuantile(r)*?

It turns out that there are 3 different rules for *getRank(v)* that you will find in the literature when *v* has duplicates in the HSS:

*min rank rule*: return the smallest rank associated with the range of duplicate values*v*.*max rank rule*: return the largest rank associated with the range of duplicate values*v*.*mid rank rule*: return the rank closest to the midpoint of the range of duplicate values*v*.

In our analysis and algorithms we use the *min rank rule*.

If the value provided to *getRank(v)* does not exist in the HSS, the rule is to return the rank of the next largest value greater than *v*.
This is the rank that would include *v* as if it existed.

For example: given the following HSS and associated ranks:

```
Hypothetically Sorted Stream (HSS)
Val Natural Rank Normalized Rank (divided by n = 9)
1.00 0 0.00
2.00 1 0.11
3.00 2 0.22
5.00 3 0.33
5.00 4 0.44
5.00 5 0.56
7.00 6 0.67
8.00 7 0.78
9.00 8 0.89
```

The *getRank(v)* will return for each of the following:

```
getRank(v)
Val Rank
0.00 0.00
0.50 0.00
1.00 0.00
1.50 0.11
2.00 0.11
2.50 0.22
3.00 0.22
3.50 0.33
4.00 0.33
4.50 0.33
5.00 0.33
5.50 0.67
6.00 0.67
6.50 0.67
7.00 0.67
7.50 0.78
8.00 0.78
8.50 0.89
9.00 0.89
```

For the *getQuantile(r)* query:

*Strictly less-than quantile rule*: return the value associated with the largest rank that is strictly less-than*r*.

The *getQuantile(r)* will return for each of the following:

```
getQuantile(r)
Rank Val
0.00 1.00
0.10 1.00
0.20 2.00
0.30 3.00
0.40 5.00
0.50 5.00
0.60 5.00
0.70 7.00
0.80 8.00
0.90 9.00
1.00 9.00
```

Accuracy for quantiles sketches is measured with respect to the *rank* only. Not the values.

A specified accuracy of 1% at the median (rank = 0.50) means that the true value (if you could extract it from the HSS) should be
between *getQuantile(0.49)* and *getQuantile(0.51)*. Note that this is an absolute rank error and not a relative rank error.
Measured at the 10th percentile means that the true value (from the HSS) should be
between *getQuantile(0.09)* and *getQuantile(0.11)*.

A *sketch* is an implementation of a *streaming algorithm*. By definition, a sketch has only one chance to examine each item of the stream. It is this property that makes a sketch a *streaming* algorithm and useful for real-time analysis of very large streams that may be impractical to actually store.

We also assume that the sketch knows nothing about the input data stream: its length, the range of the values or how the values are distributed. If the authors of a particular algorithm require the user to know any of the above attributes of the input data stream in order to “tune” the algorithm, the algorithm is not a sketch as it would require multiple passes through the input data in order to produce correct results.

The only thing the user needs to know is how to extract the values from the stream so that they can be fed into the sketch.
It is also reasonable that the user knows the *type* of values in the stream: e.g., are they alphanumeric strings, numeric strings, or numeric primitives. These properties may determine the type of sketch to use as well as how to extract the appropriate quantities to feed into the sketch.