Theta Sketch Pig UDFs


  • get the sketches-core and sketches-pig jars
  • save the script below as theta.pig
  • adjust the jar versions and paths in the script if necessary
  • save the data below into a file called “data.txt”
  • copy the data to HDFS: “hdfs dfs -copyFromLocal data.txt”
  • run the Pig script: “pig theta.pig”

theta.pig script: building sketches, merging sketches and getting estimates

register sketches-core-0.5.2.jar;
register sketches-pig-0.5.2.jar;

define dataToSketch org.apache.datasketches.pig.theta.DataToSketch('32');
define unionSketch org.apache.datasketches.pig.theta.Union('32');
define getEstimate org.apache.datasketches.pig.theta.Estimate();

a = load 'data.txt' as (id, category);
b = group a by category;
c = foreach b generate flatten(group) as (category), flatten(dataToSketch(a.id)) as (sketch);
-- Sketches can be stored at this point in binary format to be used later:
-- store c into 'intermediate/$date' using BinStorage();
-- The next two lines print the results in human readable form for the purpose of this example
d = foreach c generate category, getEstimate(sketch);
dump d;

-- This can be a separate query
-- For example, the first part can produce a daily intermediate feed and store it,
-- and this part can load several instances of this daily intermediate feed and merge them
-- c = load 'intermediate/$date1,intermediate/$date2' using BinStorage() as (category, sketch);
e = group c all;
f = foreach e generate flatten(unionSketch(c.sketch)) as (sketch);
g = foreach f generate getEstimate(sketch);
dump g;

data.txt (tab separated)

The example input data has two fields: id and category. There are two categories, ‘a’ and ‘b’, with 50 unique IDs in each. Most of the IDs appear in both categories, so there are 60 unique IDs in total.
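A file with this shape can be produced, for example, with the short Python sketch below. The ID ranges (1 to 50 for category a, 11 to 60 for category b, giving 40 overlapping IDs) and the helper names make_rows and write_tsv are assumptions chosen only to match the counts described above; they are not the original data.

```python
# Hypothetical generator for data.txt. The ID ranges are an assumption chosen
# to match the description: 50 IDs per category, 60 unique IDs overall.


def make_rows():
    """Return (id, category) pairs: 'a' gets IDs 1..50, 'b' gets IDs 11..60."""
    rows = [(i, "a") for i in range(1, 51)]
    rows += [(i, "b") for i in range(11, 61)]
    return rows


def write_tsv(path, rows):
    """Write one tab-separated 'id<TAB>category' line per row."""
    with open(path, "w") as out:
        for ident, category in rows:
            out.write(f"{ident}\t{category}\n")


if __name__ == "__main__":
    write_tsv("data.txt", make_rows())
```

Running the script writes data.txt to the current directory, ready for the copy-to-HDFS step.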

Results

From ‘dump d’:


From ‘dump g’ (merged across categories):


The expected exact result would be (60.0). The estimate has high relative error because the sketch was configured with only 32 nominal entries.
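To illustrate why 32 nominal entries give a coarse estimate, here is a minimal K-Minimum-Values sketch in Python. It is a simplified stand-in for the idea behind Theta sketches, not the DataSketches implementation: the class KmvSketch, the SHA-256 based hash, and the ID ranges are all assumptions made for illustration. For an estimator of this kind the relative standard error is roughly 1/sqrt(k), which is about 18% for k = 32.

```python
import hashlib


def unit_hash(item):
    """Map an item to a pseudo-uniform float in [0, 1) using SHA-256."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    # Keep the top 53 bits so the value is exactly representable as a float.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2.0**53


class KmvSketch:
    """Toy K-Minimum-Values sketch: retains the k smallest hash values."""

    def __init__(self, k, hashes=()):
        self.k = k
        self.hashes = sorted(set(hashes))[:k]

    @classmethod
    def from_items(cls, k, items):
        return cls(k, (unit_hash(x) for x in items))

    def union(self, other):
        """Merge two sketches by keeping the k smallest hashes of both."""
        return KmvSketch(min(self.k, other.k), self.hashes + other.hashes)

    def estimate(self):
        if len(self.hashes) < self.k:
            return float(len(self.hashes))  # fewer than k values: exact count
        return (self.k - 1) / self.hashes[-1]  # (k - 1) / k-th smallest hash


if __name__ == "__main__":
    a = KmvSketch.from_items(32, range(1, 51))   # hypothetical category a: 50 IDs
    b = KmvSketch.from_items(32, range(11, 61))  # hypothetical category b: 50 IDs
    # Exact answers for comparison: 50, 50 and 60.
    print(a.estimate(), b.estimate(), a.union(b).estimate())
```

Merging two such sketches yields the same retained values as sketching the combined input directly, which is why the daily intermediate sketches in the script above can be stored and unioned later without revisiting the raw data.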

theta_setops.pig script: set operations on sketches

register sketches-core-0.7.0.jar;
register sketches-pig-0.7.0.jar;

define dataToSketch org.apache.datasketches.pig.theta.DataToSketch('32');
define unionSketch org.apache.datasketches.pig.theta.Union();
define intersect org.apache.datasketches.pig.theta.Intersect();
define anotb org.apache.datasketches.pig.theta.AexcludeB();
define estimate org.apache.datasketches.pig.theta.Estimate();

a = load 'setops_data.txt' as (id1, id2);
b = group a all;
c = foreach b generate
  flatten(dataToSketch(a.id1)) as (sketch1),
  flatten(dataToSketch(a.id2)) as (sketch2);
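d = foreach c generate
  sketch1, -- pass sketches through to have all estimates in one place
  sketch2,
  flatten(unionSketch(TOBAG(sketch1, sketch2))) as (a_union_b),
  flatten(intersect(TOBAG(sketch1, sketch2))) as (a_intersect_b),
  flatten(anotb(sketch1, sketch2)) as (a_not_b),
  flatten(anotb(sketch2, sketch1)) as (b_not_a);
-- Convert every sketch to a cardinality estimate
e = foreach d generate
  estimate(sketch1),
  estimate(sketch2),
  estimate(a_union_b),
  estimate(a_intersect_b),
  estimate(a_not_b),
  estimate(b_not_a);
dump e;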
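For orientation, each sketch produced in step d approximates the cardinality of an ordinary set expression. The Python sketch below computes those quantities exactly on two small hypothetical ID sets (the values are assumptions, not the contents of setops_data.txt). It also makes the argument order of AexcludeB explicit: anotb(x, y) corresponds to the IDs in x that are not in y.

```python
def set_op_cardinalities(ids1, ids2):
    """Exact counterparts of the set-operation columns built in step d."""
    s1, s2 = set(ids1), set(ids2)
    return {
        "a_union_b": len(s1 | s2),      # unionSketch(TOBAG(sketch1, sketch2))
        "a_intersect_b": len(s1 & s2),  # intersect(TOBAG(sketch1, sketch2))
        "a_not_b": len(s1 - s2),        # anotb(sketch1, sketch2)
        "b_not_a": len(s2 - s1),        # anotb(sketch2, sketch1)
    }


if __name__ == "__main__":
    # Hypothetical inputs: 100 IDs in the first column, 80 in the second,
    # with 40 IDs in common.
    print(set_op_cardinalities(range(0, 100), range(60, 140)))
```

The three disjoint pieces a_not_b, a_intersect_b, and b_not_a add up to a_union_b for exact sets; with sketches this identity holds only approximately, since each estimate carries its own error.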

setops_data.txt (tab separated)