Distinct Count

This is a special topic for EECS 376.


Background

Say $a_1, \cdots, a_M$ is a stream of IP addresses sent to a router, and the router wants to figure out how many distinct IPs appear. Let $n$ be the number of distinct IPs. A hashmap implementation would take $O(n)$ space.

We can allow the estimate $\hat{n}$ to deviate from $n$ within a range:

$$\Pr\left[(1 - \epsilon)\, n \leq \hat{n} \leq (1 + \epsilon)\, n\right] \geq 1 - \delta.$$

Call it the “$(\epsilon, \delta)$” guarantee.

We are aiming for a space of $O(\epsilon^{-2} \log{\frac{1}{\delta}})$.

Aim

  • Find an unbiased estimator (random hash function).
  • Reduce the variance (apply Chebyshev).
  • Reduce the error probability (apply Chernoff-Hoeffding).

In general, a random hash function behaves like $h: \text{IP address} \rightarrow [0, 1]$: each $h(a)$ is uniformly distributed on $[0, 1]$, and the values for different inputs are mutually independent.
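Such an idealized hash function can be approximated in practice by scaling a cryptographic digest into $[0, 1)$. A minimal sketch (the name `uniform_hash` and the salt mechanism are my own choices for illustration, not part of the lecture):

```python
import hashlib

def uniform_hash(key: str, salt: str = "") -> float:
    """Map a key (e.g. an IP address) to a pseudo-uniform value in [0, 1).

    The salt selects a different function from the hash family.
    """
    digest = hashlib.sha256((salt + "|" + key).encode()).digest()
    # Interpret the first 8 bytes of the digest as an integer in [0, 2^64).
    return int.from_bytes(digest[:8], "big") / 2**64

print(uniform_hash("192.168.0.1"))             # deterministic: duplicates hash identically
print(uniform_hash("192.168.0.1", salt="1"))   # a different hash function
```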

Min Sketch Method

We use the “min sketch”: $Y = \min(h(a_1), \cdots, h(a_M))$.

$Y$ is a good statistic because it filters out duplicates of the $a_i$: since $h$ is deterministic, repeated elements hash to the same value, so the distribution of $Y$ is affected only by the number of distinct elements in $\{a_1, \cdots, a_M\}$.
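As a sketch of how the router maintains $Y$ in one pass (reusing the hypothetical `uniform_hash` above), note that only a single float is stored no matter how long the stream is:

```python
def min_sketch(stream, salt: str = "") -> float:
    """Y = min of h(a) over the stream; duplicates cannot change the minimum."""
    y = 1.0
    for a in stream:
        y = min(y, uniform_hash(a, salt))
    return y

print(min_sketch(["1.2.3.4", "5.6.7.8", "1.2.3.4"]))  # duplicate has no effect
print(min_sketch(["1.2.3.4", "5.6.7.8"]))             # same value as above
```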

Intuitive Guess

This process is equivalent to taking the smallest of $n$ independent uniform random values in $[0, 1]$. Intuitively, the expected minimum is $\frac{1}{n+1}$: the $n$ values split $[0, 1]$ into $n+1$ segments of equal expected length. As a sanity check, for $n = 1$ the expected value is $\frac{1}{2}$.
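A quick Monte Carlo check of this guess (the trial count is arbitrary):

```python
import random

def avg_min(n: int, trials: int = 100_000) -> float:
    """Empirical mean of min(U_1, ..., U_n) for i.i.d. uniforms on [0, 1]."""
    return sum(min(random.random() for _ in range(n))
               for _ in range(trials)) / trials

for n in (1, 4, 9):
    print(n, round(avg_min(n), 4), 1 / (n + 1))  # empirical mean vs. 1/(n+1)
```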

Mathematical Proof

The CDF of $Y$ is $\Pr[Y \leq y] = 1 - (1 - y)^n$, since $Y > y$ exactly when all $n$ values exceed $y$. Therefore the expectation of $Y$ is

$$E[Y] = \int_0^1 \Pr[Y > y] \, dy = \int_0^1 (1 - y)^n \, dy = \frac{1}{n+1}.$$

Inverting suggests the estimator $\hat{n} = \frac{1}{Y} - 1$.

But the expected value of $\hat{n}$ is infinite, as $E\left[\frac{1}{Y}\right] \rightarrow \infty$. To prove this, we partition $[0, 1]$ into dyadic intervals:

$$\left[\tfrac{1}{2}, 1\right],\ \left[\tfrac{1}{4}, \tfrac{1}{2}\right),\ \left[\tfrac{1}{8}, \tfrac{1}{4}\right),\ \cdots$$

On the interval $\left[2^{-(i+1)}, 2^{-i}\right)$ we have $\frac{1}{Y} \geq 2^i$, and once $2^{-i}$ is small, $\Pr\left[Y \in \left[2^{-(i+1)}, 2^{-i}\right)\right] \approx n \cdot 2^{-(i+1)}$, so each interval contributes about $\frac{n}{2}$. In reality,

$$E\left[\frac{1}{Y}\right] \geq \sum_{i=0}^{\infty} 2^i \cdot \Pr\left[Y \in \left[2^{-(i+1)}, 2^{-i}\right)\right] = \infty.$$
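The divergence is visible empirically: the sample mean of $\frac{1}{Y}$ tends to keep growing with the number of trials instead of converging. A small demo (parameters arbitrary; individual runs are noisy because the tail is heavy):

```python
import random

def mean_inv_min(n: int, trials: int) -> float:
    """Sample mean of 1/Y, Y = min of n uniforms; the true mean is infinite."""
    return sum(1 / min(random.random() for _ in range(n))
               for _ in range(trials)) / trials

for trials in (10**3, 10**4, 10**5, 10**6):
    print(trials, mean_inv_min(10, trials))  # grows (roughly) with log(trials)
```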

Multiple Hash Functions

We use $k$ independent hash functions and take their mean, i.e., $Z = \frac{Y_1 + Y_2 + \cdots + Y_k}{k}$, where $Y_j = \min(h_j(a_1), \cdots, h_j(a_M))$.

The expectation of $Z$ is equal to that of $Y$, i.e., $E[Z] = \frac{1}{n+1}$, and its variance is reduced.
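A sketch of the averaged estimator, using the salt of the hypothetical `uniform_hash` above to index $k$ (approximately) independent hash functions:

```python
def estimate_distinct(stream, k: int) -> float:
    """Average k min-sketches and invert: n-hat = 1/Z - 1."""
    ys = [1.0] * k
    for a in stream:
        for i in range(k):
            ys[i] = min(ys[i], uniform_hash(a, salt=str(i)))
    z = sum(ys) / k
    return 1 / z - 1

ips = [f"10.0.0.{i % 50}" for i in range(1000)]  # 50 distinct addresses
print(estimate_distinct(ips, k=300))             # should be near 50
```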

Variance of $Z$

Remember that, for independent random variables,

$$\mathrm{Var}(aX) = a^2 \, \mathrm{Var}(X), \qquad \mathrm{Var}(X_1 + \cdots + X_k) = \mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_k).$$

Therefore the variance of $Z$ is

$$\mathrm{Var}(Z) = \frac{1}{k^2} \sum_{j=1}^{k} \mathrm{Var}(Y_j) = \frac{\mathrm{Var}(Y)}{k}.$$

Then we calculate the variance of $Y_j$; since the $Y_j$ are identically distributed, denote any one of them by $Y$. First,

$$E[Y^2] = \int_0^1 \Pr[Y^2 > x] \, dx = \int_0^1 \Pr[Y > \sqrt{x}] \, dx = \int_0^1 (1 - \sqrt{x})^n \, dx.$$

Let $u = 1 - \sqrt{x}$, then $x = (1-u)^2$, $dx = -2(1-u) \, du$, so

$$E[Y^2] = \int_0^1 2(1-u)\, u^n \, du = \frac{2}{(n+1)(n+2)},$$

and

$$\mathrm{Var}(Y) = E[Y^2] - E[Y]^2 = \frac{2}{(n+1)(n+2)} - \frac{1}{(n+1)^2} = \frac{n}{(n+1)^2 (n+2)} \leq \frac{1}{(n+1)^2}.$$

Therefore we get the variance of $Z$:

$$\mathrm{Var}(Z) = \frac{\mathrm{Var}(Y)}{k} \leq \frac{1}{k (n+1)^2} = \frac{E[Z]^2}{k}.$$
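A Monte Carlo check of this bound (sizes arbitrary): the empirical variance of $Z$ should match $\frac{n}{k(n+1)^2(n+2)}$ and stay below $\frac{1}{k(n+1)^2}$.

```python
import random

def empirical_var_Z(n: int, k: int, trials: int = 20_000) -> float:
    """Empirical variance of Z = mean of k independent minima of n uniforms."""
    zs = [sum(min(random.random() for _ in range(n)) for _ in range(k)) / k
          for _ in range(trials)]
    m = sum(zs) / trials
    return sum((z - m) ** 2 for z in zs) / trials

n, k = 9, 10
print(empirical_var_Z(n, k))             # empirical
print(n / (k * (n + 1) ** 2 * (n + 2)))  # exact Var(Z)
print(1 / (k * (n + 1) ** 2))            # the upper bound used above
```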

Chebyshev Inequality

Using Chebyshev,

$$\Pr\left[\left|Z - E[Z]\right| \geq \epsilon \, E[Z]\right] \leq \frac{\mathrm{Var}(Z)}{\epsilon^2 \, E[Z]^2} \leq \frac{1}{k \epsilon^2}.$$

Picking $\frac{1}{k\epsilon^2} \leq \frac{1}{3}$ (the constant $\frac{1}{3}$ is arbitrary; any failure probability below $\frac{1}{2}$ works), we need $k = \frac{3}{\epsilon^2}$.

$\Pr\left[\frac{1}{Z} - 1 \text{ is a good estimate}\right] \geq \frac{2}{3}$, where “good” means $Z$ falls within $(1 \pm \epsilon)\, E[Z]$.
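For example, with $\epsilon = 0.1$ this means $k = \frac{3}{0.1^2} = 300$ copies, and Chebyshev gives

$$\Pr\left[\left|Z - E[Z]\right| \geq 0.1 \, E[Z]\right] \leq \frac{1}{300 \cdot 0.01} = \frac{1}{3}.$$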

We could pick $k$ large enough that Chebyshev alone gives failure probability $\delta$, but that is not practical: it requires $k = \frac{1}{\delta \epsilon^2}$, i.e., we pay a factor of $\frac{1}{\delta}$ in space for the accuracy, rather than $\log\frac{1}{\delta}$.

Improvements

We run $l$ independent copies $Z_1, \cdots, Z_l$ and take their median:

$$\hat{n} = \frac{1}{\mathrm{median}(Z_1, \cdots, Z_l)} - 1.$$

The median is more robust than a single $Z$: as long as more than half of the copies are good estimates, the median is itself a good estimate.

Define $I_i = 1$ if $\frac{1}{Z_i} - 1$ is a good estimate, and $I_i = 0$ otherwise.

Then $E[I_i] \geq \frac{2}{3}$; letting $X = I_1 + \cdots + I_l$, we have $E[X] \geq \frac{2l}{3}$.

By Chernoff-Hoeffding,

$$\Pr\left[X \leq \frac{l}{2}\right] = \Pr\left[\frac{X}{l} \leq \frac{2}{3} - \frac{1}{6}\right] \leq e^{-2 \left(\frac{1}{6}\right)^2 l} = e^{-l/18}.$$

Setting $e^{-l/18} \leq \delta$ gives $l = 18 \ln \frac{1}{\delta} = O\left(\log \frac{1}{\delta}\right)$: the number of copies grows only logarithmically in $\frac{1}{\delta}$.
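Putting the pieces together, a sketch of the full $(\epsilon, \delta)$ estimator (it assumes the hypothetical `uniform_hash` defined earlier; the constants follow the derivation above, and the helper names are my own):

```python
import math
from statistics import median

def distinct_count(stream, eps: float, delta: float) -> float:
    """(eps, delta)-approximate count of distinct items in one pass."""
    k = math.ceil(3 / eps**2)                # copies per group (Chebyshev)
    l = math.ceil(18 * math.log(1 / delta))  # groups to median over (Chernoff)
    ys = [[1.0] * k for _ in range(l)]       # k * l running minima
    for a in stream:
        for j in range(l):
            for i in range(k):
                ys[j][i] = min(ys[j][i], uniform_hash(a, salt=f"{j},{i}"))
    zs = [sum(group) / k for group in ys]
    return 1 / median(zs) - 1

ips = [f"10.0.{i % 20}.{i % 7}" for i in range(300)]  # 140 distinct pairs
print(distinct_count(ips, eps=0.3, delta=0.1))        # should be near 140
```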

The space (in bits): we store $k \cdot l = \frac{3}{\epsilon^2} \cdot 18 \ln \frac{1}{\delta}$ running minima, each in (say) 64 bits, i.e., $3 \epsilon^{-2} \cdot 18 \ln{\frac{1}{\delta}} \cdot 64$ bits $= O\left(\epsilon^{-2} \log \frac{1}{\delta}\right)$ space.
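Concretely, with $\epsilon = 0.1$ and $\delta = 0.01$ (and assuming 64-bit values for the minima, a choice of mine for illustration):

$$k \cdot l \cdot 64 = 300 \cdot \left(18 \ln 100\right) \cdot 64 \approx 300 \cdot 83 \cdot 64 \approx 1.6 \times 10^6 \text{ bits} \approx 200\ \text{KB},$$

independent of the stream length $M$.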

Author: Beijie Liu
Posted on: 2023-12-04
Updated on: 2024-01-01
