Table of Contents
## Size and Byte Array Structures

### Update Sketch Families – Hash Table Form

### Compact Sketch Family – Compact Form

#### Compact Sketch Sizing

##### Quick Select Sketch (Default)

##### Alpha Sketch

### Set Operation Family

#### Union

#### Intersection

#### A not B

All the sketches in the *theta* package share a common byte array data structure and similar
strategies for managing use of memory when the sketches are instantiated on the Java heap or
when instantiated in off-heap native memory.

The byte array structure has two forms *Hash Table* and *Compact*.

The Hash Table form is similar to how the sketch is instantiated on the Java heap. Hash tables consume more space depending on how full the table is. However, updating the sketch is much faster in this form and is the default for all the Update Sketches.

There are two Update Sketch families, *QuickSelect*, and *Alpha*.
The QuickSelect family has both a Java heap and a direct, off-heap variants while the
Alpha sketch is designed only for operation on the Java heap.

The Update Sketches consist of an internal hash table cache and a few other primitives.
How the cache is sized and/or resized can be controlled by the user specified
*ResizeFactor enum* that has the values *X1, X2, X4*, and *X8* (default).
The meaning of these values is as follows:

- X1: A Resize Factor of 1 means the cache will not increase its size. A new sketch will
immediately allocate space for twice the given Nominal Entries.
For example, if
*nomEntries is 4096*the internal cache will be allocated at 8192 entries. Each entry is 8 bytes. - X2: A Resize Factor of 2 means the cache may increase the cache size by factors of 2. A new sketch will initially allocate a very small cache (typically enough for 16 entries), and, when required, will increase the cache size by factors of 2 until the maximum cache size is reached, which is twice the Nominal Entries.
- X4: A Resize Factor of 4 is similar to X2, except the resizing factor is 4.
- X8: A Resize Factor of 8 is similar to X2, except the resizing factor is 8. This is the default.

Resizing a cache is costly as a new array needs to be allocated and the hash table rebuilt. Depending on the application, the user can select the Resize Factor to trade-off memory size and overall update speed performance.

When an Update Sketch is converted to a byte array using *toByteArray()*,
the internal structure is a 24-byte preamble followed by a non-contiguous
hash table array of 8-byte, long data entries.
This enables the sketch to be quickly reconstructed from the byte array so that updating
of the sketch can be continued.

Once the updating of a sketch is completed the HT is no longer needed, so the sketch can be
stored in a compact form.
The size of this compact form is a simple function of the number of retained hash values
(8 bytes) and a small preamble that varies from 8 to 24 bytes depending on the
internal state of the sketch. An empty sketch is represented by only 8 bytes.
The upper limit of the size of the sketch varies by the type of sketch but is
in the range of *8 k to 16k*.

Compact Sketches optimize memory usage and sketch merge performance, are inherently *read only*
and can only be created from an existing Update Sketch or Set Operation
(*Union, Intersection, or AnotB*) object.
The internal structure of Compact Sketches is an 8, 16, or 24 byte preamble followed
by a contiguous data array of 8-byte, long data entries.
This data array can be either *ordered* or *unordered*.
This data structure is the same whether the Compact Sketch is instantiated on the
Java heap or in off-heap direct memory.
An empty Compact Sketch consumes only 8 bytes.

The *ordered* form is ideal for systems environments where the building of the sketches
from data occur in an offline system (like a Hadoop grid), then compacted, ordered and
uploaded to a real-time query engine (like Druid) where the compact sketches can be quickly
merged to satisfy end-user queries.
The ordered form enables *early stop* enhancements to the merge algorithm
that makes the merging extreemly fast (~ 10’s of millions of sketches per second).

The *unordered* form is more desirable for systems environments where the
building of the sketches from data occur in real-time and queried in real-time.
In this environment there is no need to pay the cost of the sort.

Thus, the choice of *ordered* or *unordered* is a tradeoff between
real-time sketch build & getEstimate() performance and offline sketch-build
and real-time merge performance.

The Compact Sketch is created four different ways selected by the ordering preference
(*dstOrdered*) and whether the Compact Sketch will reside on-heap or off-heap (*dstMemory*).

- Unordered, On-heap
- Update Sketch:
*compact(false, null)* - Set Operation:
*getResult(false, null)*

- Update Sketch:
- Ordered, On-heap
- Update Sketch:
*compact(true, null)* - Set Operation:
*getResult(true, null)*

- Update Sketch:
- Unordered, Off-heap
- Update Sketch:
*compact(false, Memory mem)* - Set Operation:
*getResult(false, Memory mem)*

- Update Sketch:
- Ordered, Off-heap
- Update Sketch:
*compact(true, Memory mem)* - Set Operation:
*getResult(true, Memory mem)*

- Update Sketch:

These tables compute the size of a sketch after it has been converted into Compact Form.
The size of a sketch during the build phase is explained above as the sketch starts small and
resizes by the configurable *Resize Factor* up to the in-memory size of *2k*8* bytes plus
a few primitives.

Note: a sketch entry = 8 bytes.

The number of valid entries in the Quick Select Sketch after it enters estimation mode
statistically varies from *k* to *15k/8* with an average of about *3k/2*.
It is a user option to force a rebuild() prior to compacting the sketch in which case the
number of valid entries is never larger than *k*.

Empty | After Rebuild() | Estimating Avg | Estimating Max | |
---|---|---|---|---|

Nominal Entries (k) : Formula -> | 8 | k*8 +24 | k*12 + 24 | k*15 + 24 |

16 | 8 | 152 | 216 | 264 |

32 | 8 | 280 | 408 | 504 |

64 | 8 | 536 | 792 | 984 |

128 | 8 | 1,048 | 1,560 | 1,944 |

256 | 8 | 2,072 | 3,096 | 3,864 |

512 | 8 | 4,120 | 6,168 | 7,704 |

1,024 | 8 | 8,216 | 12,312 | 15,384 |

2,048 | 8 | 16,408 | 24,600 | 30,744 |

4,096 | 8 | 32,792 | 49,176 | 61,464 |

8,192 | 8 | 65,560 | 98,328 | 122,904 |

16,384 | 8 | 131,096 | 196,632 | 245,784 |

32,768 | 8 | 262,168 | 393,240 | 491,544 |

65,536 | 8 | 524,312 | 786,456 | 983,064 |

131,072 | 8 | 1,048,600 | 1,572,888 | 1,966,104 |

262,144 | 8 | 2,097,176 | 3,145,752 | 3,932,184 |

524,288 | 8 | 4,194,328 | 6,291,480 | 7,864,344 |

1,048,576 | 8 | 8,388,632 | 12,582,936 | 15,728,664 |

The number of valid entries in the Alpha Sketch after it enters estimation mode
is a random variable that has a probability distribution with a mean of *k*
and a standard deviation of *sqrt(k)*.
The last column computes the maximum size with a confidence of 99.997% representing
plus 4 standard deviations.

Empty | Estimating Avg | Std Dev | Max @ 99.997% | |
---|---|---|---|---|

Nominal Entries (k) : Formula -> | 8 | k*8 + 24 | sqrt(k) | (k+4SD)*8 +24 |

512 | 8 | 4,120 | 23 | 4,844 |

1,024 | 9 | 8,216 | 32 | 9,240 |

2,048 | 10 | 16,408 | 45 | 17,856 |

4,096 | 11 | 32,792 | 64 | 34,840 |

8,192 | 12 | 65,560 | 91 | 68,456 |

16,384 | 13 | 131,096 | 128 | 135,192 |

32,768 | 14 | 262,168 | 181 | 267,961 |

65,536 | 15 | 524,312 | 256 | 532,504 |

131,072 | 16 | 1,048,600 | 362 | 1,060,185 |

262,144 | 17 | 2,097,176 | 512 | 2,113,560 |

524,288 | 18 | 4,194,328 | 724 | 4,217,498 |

1,048,576 | 19 | 8,388,632 | 1,024 | 8,421,400 |

The *Union* operation has both a Java heap and a direct, off-heap variant.
When a Union operation is converted to a byte array using *toByteArray()*,
the internal structure is a 32-byte preamble followed by a non-contiguous hash
table array of 8-byte, long data entries.
This enables the Union to be quickly reconstructed from the byte array
so that updating of the Union can be continued.

The *Intersection* operation has both a Java heap and a direct, off-heap variant.
When an Intersection operation is converted to a byte array using *toByteArray()*,
the internal structure is a 24-byte preamble followed by a non-contiguous hash
table array of 8-byte, long data entries.
This enables the Intersection to be quickly reconstructed from the byte array
so that updating of the Intersection can be continued.

The *A not B* operation is asymmetric and stateless.
Both the *A* and *B* arguments are presented and the difference
is computed and returned.
There is no need for a byte array form.