Table of Contents
## Theta Sketch Set Operations Accuracy

### The Theta Rules

### Union Set Expressions

#### Source sketches and target with the same *Nominal Entries* or *k*

#### Source sketches and target with different *Nominal Entries* or *k*

### Mixed Set Expressions (Union, Intersection, AnotB)

#### Source sketches and target with the same *Nominal Entries* or *k*

RSE = *sqrt( est(Union(A,B)) / est(SetOperation(A,B)) ) * sqrt( 1/ (k-1) )*
##### Example

*F* = (size of Union(A,B) ) / (size of Intersection(A,B).
RSE_{intersection} = *sqrt(F) * (RSE of input Sketches)*
RSE_{intersection} = sqrt(4M/4K) * 1/sqrt(4K) = 31.63 * 0.016 = 0.5 = 50%.
#### Source sketches and target with different *Nominal Entries* or *k*

All set operations (Union, Intersection, Difference) between two sketches *A* and *B*
must obey the following two rules:

*θ*._{result}= min( θ_{A}, θ_{B})- All entries retained in the result sketch must be less than
*θ*._{result}

These rules can be extended to arbitrary set expressions as:

*θ*._{result}= min{ θ_{i}}- All entries retained in the result sketch must be less than
*θ*._{result}

As long as all source sketches and target have been configured with the same *Nominial Entries*
or *k*, and there has been no other intervening *Intersection* or *AnotB* operations,
the accuracy of the resulting *Union* sketch will have the same Relative Standard Error (RSE)
as determined by the *Nominal Entries* or *k* value. In other words,

*Union Estimate = k/θ*._{result}- RSE
._{union}= 1 / sqrt(k - 1)

This remains true no matter how many sketches are unioned together.

The Relative Standard Error (RSE) of the resulting *Union* sketch will determined by
the *Theta Rules* above. Ultimately, the source sketch with the smallest theta will
dominate the overall resulting error of the result. Given two sketches with equal cardinalities and
different values of *k*, the sketch with the smaller value of *k* will have the smallest
value of theta and will largely determine the error distribution of the result.

Conceptually, the Intersection and AnotB functions operate by first performing a Union of all the values from both source sketches and then identifying the appropriate proper subset of that Union set creating a result sketch with the Union theta (which is the minimum theta of the source sketches) and the qualifying subset of values.

This means, of course, that depending on the operations and the data, the result set could have
zero, all, or some number in between of the retained values of the Union sketch.
Mixed set expressions can produce an error distribution that is larger that of a standard sketch
of a given *Nominal Entries* or *k* and is mathematically described in
Sketch Equations / Subsets of Fixed *k* Sampling.

When the source and target sketches have the same value of *k*,
the accuracy of the result sketch has a relatively straightforward mathematical solution and intuition:

The intuition for this can be explained using this simple example of intersection (set difference would behave similarly):

Suppose we have as inputs two segments of unique identifiers, *A* and *B*.

Assumptions:

- We will use a sketch size of
*k = 4096*entries with an RSE ~ 1/sqrt(k) = 1.6%. - Segment
*A*has 4 million (4M) unique identifiers. - Segment
*B*has 4 thousand (4K) unique identifiers and is a proper subset of*A*.

Steps:

- Build the Sketches. Sketch
*S*will be fed segment_{A}*A*and Sketch*S*will be fed segment_{B}*B*.- Segment
*B*fits exactly in*S*, so_{B}*θ*= 4K/4K = 1.0._{B} - Sketch
*S*will end up with_{A}*θ*= 4K/4M = .001._{A}

- Segment

The hash values retained in *S _{A}* represent a uniform sampling of all of the unique identifiers
in segment

Even though in the raw data all the values of segment *B* are in segment *A*, the probability
that all the 4K samples of *S _{B}* appear

Applying the *Theta Rules*:

*θ*_{result}= min(0.001, 1.0) = .001- All the entries from
*S*already qualify._{A} - Since all 4K values in
*S*are uniformly distributed between 0 and 1.0, only .001 of them or approximately 4 of the bottom values will remain._{B} - The resulting intersection sketch will have, on average, only 4 values and a theta of .001.

The mean estimate from the intersection sketch will be 4/.001 = 4K. This happens to be correct using this hand-wavy analysis but in general is a random result with a variance. The proof that the estimate will be unbiased is in the attached Sketch Equations.

The RSE of a sketch with only 4 values is ~ 1/sqrt(4) = .5 or 50% error.
This is considerably larger than the RSE of either *S _{A}* or

The insight to be gained from this example is that it was the theta (sampling rate) of the sketch of the
larger population that caused the increase in error.
And, for this example, increasing the sketch size of *S _{A}* would improve the resulting error.
The general case may be more complex.

More formally, if we define a factor *F* to be the ratio:

Then a simple way to compute the resulting RSE of an intersection (or difference) is

And in our example:

In the general case this scenario is even more complex and difficult to predict mathematically, but a conservative
estimate would be to use the above equations substituting *k* with the smallest of the participating
*k* values.