Distinct Count
This is a special topic for EECS 376.
Background
Say $a_1, \cdots, a_M$ is a sequence of IP addresses sent to a router, and the router wants to figure out how many distinct IPs it has seen. Let $n$ be the number of distinct IPs. A hashmap would take $O(n)$ space to implement.
We can allow the estimate $\hat{n}$ to deviate from $n$ within a range:
$$Pr\left[(1-\epsilon)n \leq \hat{n} \leq (1+\epsilon)n\right] \geq 1 - \delta.$$
Call it the “$(\epsilon, \delta)$” guarantee.
We are aiming for a space of $O(\epsilon^{-2} \log{\frac{1}{\delta}})$.
Aim
- Find an unbiased estimator (random hash function).
- Reduce the variance (apply Chebyshev).
- Reduce the error probability (apply Chernoff-Hoeffding).
In general, a random hash function is a map $h: \text{IP addresses} \rightarrow [0, 1]$, where each $h(a)$ is uniformly distributed on $[0, 1]$ and the values $h(a)$ for distinct $a$ are independent.
Min Sketch Method
We are using the “min sketch”: $Y = \min(h(a_1), \cdots, h(a_M))$.
$Y$ is a good quantity because it filters out duplicates of $a_i$: equal elements hash to equal values, so the distribution of $Y$ is only affected by the number of distinct elements in $a_1, \cdots, a_M$.
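As a sketch in Python (the names `h` and `min_sketch` are illustrative, not from the course), a seeded cryptographic hash can stand in for a random hash function into $[0, 1)$:

```python
import hashlib

def h(seed, a):
    # Stand-in for a random hash function: map item a to a
    # pseudo-uniform value in [0, 1), keyed by seed.
    digest = hashlib.sha256(f"{seed}:{a}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def min_sketch(seed, stream):
    # Y = min(h(a_1), ..., h(a_M)); duplicates never change the minimum.
    return min(h(seed, a) for a in stream)

stream = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2"]
# Y depends only on the set of distinct elements, not on repetitions:
assert min_sketch(0, stream) == min_sketch(0, ["10.0.0.1", "10.0.0.2", "10.0.0.3"])
```

The final assertion is exactly the duplicate-filtering property: repeating an element cannot change the minimum.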
Intuitive Guess
This process is equivalent to taking the smallest of $n$ independent uniform random values in $[0, 1]$. Intuitively, the expected result is $\frac{1}{n+1}$: the $n$ points split $[0, 1]$ into $n+1$ gaps of expected length $\frac{1}{n+1}$ each. As a sanity check, for $n = 1$ the expected minimum is $\frac{1}{2}$.
Mathematical Proof
Therefore the expectation of $Y$ is
$$E[Y] = \int_0^1 Pr(Y > y)\, dy = \int_0^1 (1-y)^n\, dy = \frac{1}{n+1}.$$
Solving $E[Y] = \frac{1}{n+1}$ for $n$ suggests the estimator $\hat{n} = \frac{1}{Y} - 1$.
But this expected value of $\hat{n}$ is infinite, since $E\left[\frac{1}{Y}\right] = \infty$. To prove this, partition $[0, 1]$ into the dyadic intervals
$$\cdots, \left[\tfrac{1}{8}, \tfrac{1}{4}\right], \left[\tfrac{1}{4}, \tfrac{1}{2}\right], \left[\tfrac{1}{2}, 1\right].$$
On the interval $\left[\frac{1}{2^{j+1}}, \frac{1}{2^j}\right]$ we have $\frac{1}{Y} \geq 2^j$, while $Pr\left(Y \in \left[\frac{1}{2^{j+1}}, \frac{1}{2^j}\right]\right) \approx \frac{n}{2^{j+1}}$ for large $j$. So each interval contributes roughly $\frac{n}{2}$ to $E\left[\frac{1}{Y}\right]$, and since there are infinitely many intervals the sum diverges.
Multiple Hash Functions
We use $k$ independent hash functions $h_1, \cdots, h_k$, compute a min sketch $Y_i$ with each, and take their mean, i.e., $Z = \frac{Y_1 + Y_2 + \cdots + Y_k}{k}$.
The expectation of $Z$ is equal to that of $Y$, i.e., $E[Z] = \frac{1}{n+1}$, and its variance is reduced by a factor of $k$.
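A quick empirical check of the variance reduction (hypothetical helper names; a seeded SHA-256 hash stands in for the $k$ random hash functions):

```python
import hashlib
import statistics

def h(seed, a):
    # Stand-in for a random hash function: pseudo-uniform in [0, 1), keyed by seed.
    digest = hashlib.sha256(f"{seed}:{a}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def z_sketch(stream, k, base_seed=0):
    # Z = (Y_1 + ... + Y_k) / k, with Y_i the min sketch under hash function i.
    ys = [min(h(base_seed + i, a) for a in stream) for i in range(k)]
    return sum(ys) / k

stream = [f"ip-{j}" for j in range(100)]  # 100 distinct items
# Sample Z across disjoint blocks of seeds; averaging k sketches should
# shrink the variance by roughly a factor of k.
var_k1 = statistics.pvariance([z_sketch(stream, 1, s * 100) for s in range(60)])
var_k30 = statistics.pvariance([z_sketch(stream, 30, s * 100) for s in range(60)])
assert var_k30 < var_k1
```

Each value of `base_seed` selects a fresh block of hash functions, so the 60 samples of $Z$ behave like independent draws.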
Variance of $Z$
Remember that
$$Var(Y) = E[Y^2] - E[Y]^2.$$
Therefore the variance of $Z$ is
$$Var(Z) = \frac{1}{k^2} \sum_{i=1}^{k} Var(Y_i) = \frac{Var(Y)}{k}.$$
Then we calculate the variance of $Y_i$, denoted $Y$. First,
$$E[Y^2] = \int_0^1 Pr(Y^2 > x)\, dx = \int_0^1 Pr(Y > \sqrt{x})\, dx = \int_0^1 (1 - \sqrt{x})^n\, dx.$$
Let $u = 1 - \sqrt{x}$, then $x = (1-u)^2$, $dx = -2(1-u)\, du$, so
$$E[Y^2] = 2\int_0^1 u^n (1-u)\, du = 2\left(\frac{1}{n+1} - \frac{1}{n+2}\right) = \frac{2}{(n+1)(n+2)},$$
and hence
$$Var(Y) = \frac{2}{(n+1)(n+2)} - \frac{1}{(n+1)^2} = \frac{n}{(n+1)^2(n+2)} \leq \frac{1}{(n+1)^2}.$$
Therefore we get the variance of $Z$:
$$Var(Z) = \frac{Var(Y)}{k} \leq \frac{1}{k(n+1)^2}.$$
Chebyshev Inequality
Using Chebyshev,
$$Pr\left(\left|Z - \frac{1}{n+1}\right| \geq \frac{\epsilon}{n+1}\right) \leq \frac{Var(Z)}{\left(\frac{\epsilon}{n+1}\right)^2} \leq \frac{1}{k\epsilon^2}.$$
Picking $\frac{1}{k\epsilon^2} \leq \frac{1}{3}$ (the constant $\frac{1}{3}$ is arbitrary, just some small value), we get $k = \frac{3}{\epsilon^2}$.
$Pr\left(\frac{1}{Z} - 1 \textit{ is a good estimate}\right) \geq \frac{2}{3}$.
We could pick $k$ larger to push the failure probability all the way down to $\delta$, but that requires $k = \frac{1}{\delta \epsilon^2}$, which is not practical: we would pay a factor of $\frac{1}{\delta}$ rather than $\log\frac{1}{\delta}$ for the accuracy.
Improvements
We take $l$ independent copies $Z_1, \cdots, Z_l$ and output the median of the resulting estimates:
$$\hat{n} = \text{median}\left(\frac{1}{Z_1} - 1, \cdots, \frac{1}{Z_l} - 1\right).$$
The median is more robust than a single $Z$: as long as more than half of the estimates are good, the median is a good estimate.
Define $I_i = 1$ when $\frac{1}{Z_i} - 1$ is a good estimate, and $I_i = 0$ otherwise.
$E[I_i] \geq \frac{2}{3}$, $X = I_1 + \cdots + I_l$, then $E(X) \geq \frac{2l}{3}$.
By Chernoff-Hoeffding,
$$Pr\left(X \leq \frac{l}{2}\right) = Pr\left(\frac{X}{l} \leq \frac{2}{3} - \frac{1}{6}\right) \leq e^{-2 \left(\frac{1}{6}\right)^2 l} = e^{-l/18}.$$
Setting $e^{-l/18} \leq \delta$ gives $l = 18 \ln\frac{1}{\delta} = O\left(\log\frac{1}{\delta}\right)$: a logarithmic number of repetitions suffices.
The space (in bits): we store $k \cdot l$ sketch values of 64 bits each, i.e., $3\epsilon^{-2} \cdot O\left(\ln\frac{1}{\delta}\right) \cdot 64$ bits $= O\left(\epsilon^{-2} \log\frac{1}{\delta}\right)$ bits.
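Putting the pieces together, a minimal end-to-end sketch in Python (hypothetical function names; a seeded SHA-256 hash stands in for the $k \cdot l$ random hash functions):

```python
import hashlib
from statistics import median

def h(seed, a):
    # Stand-in for a random hash function: pseudo-uniform in [0, 1), keyed by seed.
    digest = hashlib.sha256(f"{seed}:{a}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def distinct_count(stream, k, l):
    # Median of l independent estimates; each estimate is 1/Z - 1,
    # where Z averages k min sketches (k ~ 3/eps^2, l ~ 18 ln(1/delta)).
    estimates = []
    for j in range(l):
        ys = [min(h(j * k + i, a) for a in stream) for i in range(k)]
        z = sum(ys) / k
        estimates.append(1 / z - 1)
    return median(estimates)

stream = [f"ip-{j % 200}" for j in range(400)]  # 200 distinct IPs, repeated
est = distinct_count(stream, k=300, l=9)  # should land close to 200
```

A streaming implementation would keep only the $k \cdot l$ running minima (one per hash function) and update each on every arriving element, matching the space bound above.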