public class KllFloatsSketch extends Object
This is a stochastic streaming sketch that enables nearreal time analysis of the approximate distribution of values from a very large stream in a single pass, requiring only that the values are comparable. The analysis is obtained using getQuantile() or getQuantiles() functions or the inverse functions getRank(), getPMF() (Probability Mass Function), and getCDF() (Cumulative Distribution Function).
Given an input stream of N numeric values, the absolute rank of any specific value is defined as its index (0 to N1) in the hypothetical sorted stream of all N input values.
The normalized rank (rank) of any specific value is defined as its absolute rank divided by N. Thus, the normalized rank is a value between zero and one. In the documentation and Javadocs for this sketch absolute rank is never used so any reference to just rank should be interpreted to mean normalized rank.
This sketch is configured with a parameter k, which affects the size of the sketch and its estimation error.
The estimation error is commonly called epsilon (or eps) and is a fraction between zero and one. Larger values of k result in smaller values of epsilon. Epsilon is always with respect to the rank and cannot be applied to the corresponding values.
The relationship between the normalized rank and the corresponding values can be viewed as a two dimensional monotonic plot with the normalized rank on one axis and the corresponding values on the other axis. If the yaxis is specified as the valueaxis and the xaxis as the normalized rank, then y = getQuantile(x) is a monotonically increasing function.
The functions getQuantile(rank) and getQuantiles(...) translate ranks into corresponding values. The functions getRank(value), getCDF(...) (Cumulative Distribution Function), and getPMF(...) (Probability Mass Function) perform the opposite operation and translate values into ranks.
The getPMF(...) function has about 13 to 47% worse rank error (depending on k) than the other queries because the mass of each "bin" of the PMF has "doublesided" error from the upper and lower edges of the bin as a result of a subtraction, as the errors from the two edges can sometimes add.
The default k of 200 yields a "singlesided" epsilon of about 1.33% and a "doublesided" (PMF) epsilon of about 1.65%.
A getQuantile(rank) query has the following guarantees:
A getRank(value) query has the following guarantees:
A getPMF() query has the following guarantees:
A getCDF(...) query has the following guarantees;
From the above, it might seem like we could make some estimates to bound the value returned from a call to getQuantile(). The sketch, however, does not let us derive error bounds or confidences around values. Because errors are independent, we can approximately bracket a value as shown below, but there are no error estimates available. Additionally, the interval may be quite large for certain distributions.
Modifier and Type  Field and Description 

static int 
DEFAULT_K
The default value of K.

Constructor and Description 

KllFloatsSketch()
Constructor with the default k (rank error of about 1.65%)

KllFloatsSketch(int k)
Constructor with a given parameter k.

Modifier and Type  Method and Description 

double[] 
getCDF(float[] splitPoints)
Returns an approximation to the Cumulative Distribution Function (CDF), which is the
cumulative analog of the PMF, of the input stream given a set of splitPoint (values).

int 
getK()
Returns the parameter k

static int 
getKFromEpsilon(double epsilon,
boolean pmf)
Gets the approximate value of k to use given epsilon, the normalized rank error.

static int 
getMaxSerializedSizeBytes(int k,
long n)
Returns upper bound on the serialized size of a sketch given a parameter k and stream
length.

float 
getMaxValue()
Returns the max value of the stream.

float 
getMinValue()
Returns the min value of the stream.

long 
getN()
Returns the length of the input stream.

double 
getNormalizedRankError()
Deprecated.
replaced by
getNormalizedRankError(boolean) 
double 
getNormalizedRankError(boolean pmf)
Gets the approximate rank error of this sketch normalized as a fraction between zero and one.

static double 
getNormalizedRankError(int k)
Deprecated.
replaced by
getNormalizedRankError(int, boolean) 
static double 
getNormalizedRankError(int k,
boolean pmf)
Gets the normalized rank error given k and pmf.

int 
getNumRetained()
Returns the number of retained items (samples) in the sketch.

double[] 
getPMF(float[] splitPoints)
Returns an approximation to the Probability Mass Function (PMF) of the input stream
given a set of splitPoints (values).

float 
getQuantile(double fraction)
Returns an approximation to the value of the data item
that would be preceded by the given fraction of a hypothetical sorted
version of the input stream so far.

float 
getQuantileLowerBound(double fraction)
Gets the lower bound of the value interval in which the true quantile of the given rank
exists with a confidence of at least 99%.

float[] 
getQuantiles(double[] fractions)
This is a more efficient multiplequery version of getQuantile().

float[] 
getQuantiles(int numEvenlySpaced)
This is also a more efficient multiplequery version of getQuantile() and allows the caller to
specify the number of evenly spaced fractional ranks.

float 
getQuantileUpperBound(double fraction)
Gets the upper bound of the value interval in which the true quantile of the given rank
exists with a confidence of at least 99%.

double 
getRank(float value)
Returns an approximation to the normalized (fractional) rank of the given value from 0 to 1,
inclusive.

int 
getSerializedSizeBytes()
Returns the number of bytes this sketch would require to store.

static KllFloatsSketch 
heapify(org.apache.datasketches.memory.Memory mem)
Heapify takes the sketch image in Memory and instantiates an onheap sketch.

boolean 
isEmpty()
Returns true if this sketch is empty.

boolean 
isEstimationMode()
Returns true if this sketch is in estimation mode.

KllFloatsSketchIterator 
iterator() 
void 
merge(KllFloatsSketch other)
Merges another sketch into this one.

byte[] 
toByteArray()
Returns serialized sketch in a byte array form.

String 
toString() 
String 
toString(boolean withLevels,
boolean withData)
Returns a summary of the sketch as a string.

void 
update(float value)
Updates this sketch with the given data item.

public static final int DEFAULT_K
public KllFloatsSketch()
public KllFloatsSketch(int k)
k
 parameter that controls size of the sketch and accuracy of estimatespublic int getK()
public long getN()
public boolean isEmpty()
public int getNumRetained()
public boolean isEstimationMode()
public void update(float value)
value
 an item from a stream of items. NaNs are ignored.public void merge(KllFloatsSketch other)
other
 sketch to merge into this onepublic float getMinValue()
public float getMaxValue()
public float getQuantile(double fraction)
We note that this method has a fairly large overhead (microseconds instead of nanoseconds) so it should not be called multiple times to get different quantiles from the same sketch. Instead use getQuantiles(), which pays the overhead only once.
If the sketch is empty this returns NaN.
fraction
 the specified fractional position in the hypothetical sorted stream.
These are also called normalized ranks or fractional ranks.
If fraction = 0.0, the true minimum value of the stream is returned.
If fraction = 1.0, the true maximum value of the stream is returned.public float getQuantileUpperBound(double fraction)
fraction
 the given normalized rank as a fractionpublic float getQuantileLowerBound(double fraction)
fraction
 the given normalized rank as a fractionpublic float[] getQuantiles(double[] fractions)
This returns an array that could have been generated by using getQuantile() with many different fractional ranks, but would be very inefficient. This method incurs the internal setup overhead once and obtains multiple quantile values in a single query. It is strongly recommend that this method be used instead of multiple calls to getQuantile().
If the sketch is empty this returns null.
fractions
 given array of fractional positions in the hypothetical sorted stream.
These are also called normalized ranks or fractional ranks.
These fractions must be in the interval [0.0, 1.0], inclusive.public float[] getQuantiles(int numEvenlySpaced)
If the sketch is empty this returns null.
numEvenlySpaced
 an integer that specifies the number of evenly spaced fractional ranks.
This must be a positive integer greater than 0. A value of 1 will return the min value.
A value of 2 will return the min and the max value. A value of 3 will return the min,
the median and the max value, etc.public double getRank(float value)
The resulting approximation has a probabilistic guarantee that be obtained from the getNormalizedRankError(false) function.
If the sketch is empty this returns NaN.
value
 to be rankedpublic double[] getPMF(float[] splitPoints)
The resulting approximations have a probabilistic guarantee that be obtained from the getNormalizedRankError(true) function.
If the sketch is empty this returns null.
splitPoints
 an array of m unique, monotonically increasing float values
that divide the real number line into m+1 consecutive disjoint intervals.
The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and
exclusive of the right splitPoint, with the exception that the last interval will include
the maximum value.
It is not necessary to include either the min or max values in these splitpoints.public double[] getCDF(float[] splitPoints)
The resulting approximations have a probabilistic guarantee that be obtained from the getNormalizedRankError(false) function.
If the sketch is empty this returns null.
splitPoints
 an array of m unique, monotonically increasing float values
that divide the real number line into m+1 consecutive disjoint intervals.
The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and
exclusive of the right splitPoint, with the exception that the last interval will include
the maximum value.
It is not necessary to include either the min or max values in these splitpoints.@Deprecated public double getNormalizedRankError()
getNormalizedRankError(boolean)
KllFloatsSketch
public double getNormalizedRankError(boolean pmf)
pmf
 if true, returns the "doublesided" normalized rank error for the getPMF() function.
Otherwise, it is the "singlesided" normalized rank error for all the other queries.KllFloatsSketch
@Deprecated public static double getNormalizedRankError(int k)
getNormalizedRankError(int, boolean)
getNormalizedRankError()
that
specifies k.k
 the configuration parameterKllFloatsSketch
public static double getNormalizedRankError(int k, boolean pmf)
getNormalizedRankError(boolean)
.k
 the configuation parameterpmf
 if true, returns the "doublesided" normalized rank error for the getPMF() function.
Otherwise, it is the "singlesided" normalized rank error for all the other queries.KllFloatsSketch
public static int getKFromEpsilon(double epsilon, boolean pmf)
epsilon
 the normalized rank error between zero and one.pmf
 if true, this function returns the value of k assuming the input epsilon
is the desired "doublesided" epsilon for the getPMF() function. Otherwise, this function
returns the value of k assuming the input epsilon is the desired "singlesided"
epsilon for all the other queries.KllFloatsSketch
public int getSerializedSizeBytes()
public static int getMaxSerializedSizeBytes(int k, long n)
k
 parameter that controls size of the sketch and accuracy of estimatesn
 stream lengthpublic String toString(boolean withLevels, boolean withData)
withLevels
 if true include information about levelswithData
 if true include sketch datapublic byte[] toByteArray()
public static KllFloatsSketch heapify(org.apache.datasketches.memory.Memory mem)
mem
 a Memory image of a sketch.
See Memorypublic KllFloatsSketchIterator iterator()
Copyright © 2015–2020 The Apache Software Foundation. All rights reserved.