{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OpOHLyhPpaxC"
   },
   "source": [
    "# Algorithms for Data Science"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NvZ7rLY2pg8i"
   },
   "source": [
    "## Counting Distinct Items"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NqIyJ2BgplVA"
   },
   "source": [
    "### 1. Preliminaries \n",
    "\n",
    "The objective of this lab is to implement the Flajolet-Martin approach to count distinct items. First, we generate an universe of $N$ strings of length $12$, and take $d$ items which will constitute our universe of distinct items."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "id": "uj1GxrPF_DxQ"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['aoqikdbuqloh', 'ohvypdjixavl', 'iggotzsuzyta']\n"
     ]
    }
   ],
   "source": [
    "import random\n",
    "from string import ascii_lowercase\n",
    "\n",
    "#parameters\n",
    "N = 256 #universe of N \n",
    "d = 3 #distinct items\n",
    "stream_size = 10000\n",
    "\n",
    "#generate some random strings of size 10\n",
    "U = []\n",
    "for _ in range(N):\n",
    "  U.append(''.join(random.choice(ascii_lowercase) for i in range(12)))\n",
    "\n",
    "D = random.sample(U,k=d)\n",
    "\n",
    "print(D)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GgKzrJj3_oiX"
   },
   "source": [
    "### 2. Flajolet-Martin: Creating a Hash Function, Estimating Distinct Items Using Trailing 0s\n",
    "\n",
    "In the following we create a hash function $h(x)$, which also takes as a parameter a hashable and $N$, and returns a value in $0,\\dots,N-1$. We simulate a stream taking random values from $D$, count the trailing $0$s in its hash value, keep the maximum value $R$, and then output $2^R$ as the estimator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "id": "IbQ0B1a3BpAV",
    "outputId": "b5f8e92e-b215-4b12-9691-aa5310c29145"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Estimation of distinct items: 8\n"
     ]
    }
   ],
   "source": [
    "import math\n",
    "import random\n",
    "from datetime import datetime\n",
    "\n",
    "random.seed(datetime.now().timestamp())\n",
    "\n",
    "def h(x,n):\n",
    "  return hash(x)%n\n",
    "\n",
    "#method for counting trailing 0s\n",
    "def trailing_0(x):\n",
    "  x1 = x\n",
    "  t0 = 0\n",
    "  while x1%2==0 and x1!=0:\n",
    "    t0 += 1\n",
    "    x1 //= 2\n",
    "  return t0\n",
    "\n",
    "#simulating the stream\n",
    "R = 0\n",
    "for _ in range(stream_size):\n",
    "  #take a random string from the distinct pool\n",
    "  s = random.choice(D)\n",
    "  #check its hash value\n",
    "  hv = h(s,2*N) #to allow more space for hash values\n",
    "  r = trailing_0(hv)\n",
    "  if r>R:  #we get the max number of trailing consecutive ceros  \n",
    "    R=r\n",
    "\n",
    "est = int(math.pow(2,R))\n",
    "\n",
    "print('Estimation of distinct items: %d'%est)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Dz9rNT2Na35t"
   },
   "source": [
    "### 3. **TASK** Flajolet-Martin: Using Multiple Hash Functions\n",
    "\n",
    "Implement the refined version of the above estimator, using multiple ($k$) hash functions (use the method of generating several pairs of numbers presented last time in the lab) and compute:\n",
    "1. The average of the $k$ estimators\n",
    "2. The median of the $k$ estimators\n",
    "3. Divide the estimators into groups (vary the group size); take the median in each group and then the average over the groups.\n",
    "\n",
    "Compare the three methods' final outputs. What do you notice?\n",
    "\n",
    "_Note_: you can use the Python 3.4 _statistics_ package (not available in previous versions) to compute medians, averages, and other statistics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "3Va-_6fda-jf"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Using 20 hash functions:\n",
      "Max trailing zeros:    [4, 2, 2, 0, 0, 6, 4, 0, 2, 0, 0, 0, 2, 2, 3, 2, 1, 2, 2, 1]\n",
      "Individual estimators: [16, 4, 4, 1, 1, 64, 16, 1, 4, 1, 1, 1, 4, 4, 8, 4, 2, 4, 4, 2]\n",
      "\n",
      "Method 1 - Average of 20 estimators: 7.30\n",
      "Method 2 - Median of 20 estimators: 4.00\n",
      "\n",
      "Method 3 - Group median then average:\n",
      "  Group size 2 (10 groups): 12.50 (error: 316.7%)\n",
      "  Group size 4 (5 groups): 5.80 (error: 93.3%)\n",
      "  Group size 5 (4 groups): 4.00 (error: 33.3%)\n",
      "  Group size 10 (2 groups): 4.00 (error: 33.3%)\n",
      "\n",
      "============================================================\n",
      "COMPARISON:\n",
      "============================================================\n",
      "True number of distinct items: 3\n",
      "Method 1 (Average):            7.30\n",
      "Method 2 (Median):             4.00\n",
      "Method 3 varies by group size (see above)\n",
      "\n",
      "Relative errors:\n",
      "Method 1: 143.3%\n",
      "Method 2: 33.3%\n",
      "\n",
      "Method 3 relative errors:\n",
      "  Group size 2: 316.7%\n",
      "  Group size 4: 93.3%\n",
      "  Group size 5: 33.3%\n",
      "  Group size 10: 33.3%\n",
      "\n",
      "============================================================\n",
      "ANALYSIS AND CONCLUSIONS:\n",
      "============================================================\n",
      "\n",
      "1. OUTLIER SENSITIVITY:\n",
      "   - Method 1 (Average) is heavily influenced by outlier estimators\n",
      "   - The exponential nature of 2^R means a few hash functions with high \n",
      "     trailing zeros (like R=6) can drastically inflate the average\n",
      "   - In this run, estimators ranged from 1 to 64, \n",
      "     causing Method 1 to overestimate by 143.3%\n",
      "\n",
      "2. ROBUSTNESS OF MEDIAN:\n",
      "   - Method 2 (Median = 4) performed much better (33.3% error)\n",
      "   - The median is resistant to extreme values, focusing on the \"typical\" \n",
      "     estimator rather than being skewed by outliers\n",
      "   - With 20 hash functions, the median reflects the central tendency\n",
      "\n",
      "3. GROUP-BASED APPROACH (METHOD 3):\n",
      "   - Size 2: 12.50 (error 316.7%) | Group medians: [16, 4, 64, 16, 4, 1, 4, 8, 4, 4]\n",
      "     => Too small: pairs containing outliers can't filter them\n",
      "   - Size 4: 5.80 (error 93.3%) | Group medians: [4, 16, 1, 4, 4]\n",
      "     => Outliers likely concentrated in few groups\n",
      "   - Size 5: 4.00 (error 33.3%) | Group medians: [4, 4, 4, 4]\n",
      "     => Good balance of outlier filtering and averaging\n",
      "   - Size 10: 4.00 (error 33.3%) | Group medians: [4, 4]\n",
      "     => Behaves like pure median with less averaging benefit\n",
      "\n",
      "   Why Method 3 varies:\n",
      "   - Small groups: Can't filter outliers (median of [1,64] is still high)\n",
      "   - Medium groups: Work when outliers are spread across groups\n",
      "   - Large groups: Converge to Method 2 behavior\n",
      "\n",
      "4. RECOMMENDATION:\n",
      "   Best performer: Method 2\n",
      "    => Use Method 2 (median) or Method 3 with group size 4-5 for reliable estimates\n",
      "    => Avoid Method 1 due to outlier sensitivity\n",
      "\n",
      "   Results may change due to the random stream information. OVERALL, WE CONFIRM THE CLASSES IDEAS: \n",
      "\n",
      "1. Method 2 (Median) is generally more robust to outliers than Method 1 (Average)\n",
      "   - Mean is sensible to outliers, a single hash function with very high trailing zeros will produce this effect \n",
      "   - As most of cases have high trailing ceros, method 1 doesnt perform well\n",
      "\n",
      "2. Method 3 (Group median then average) provides a good balance:\n",
      "   - Reduces impact of outliers within each group via median\n",
      "   - Works best with moderate group sizes (e.g., 4-5)\n",
      "\n",
      "3. The median-based approaches (Methods 2 & 3) typically have lower variance\n",
      "   across multiple runs, making them more reliable estimators.\n",
      "\n",
      "4. The Flajolet-Martin algorithm tends to overestimate due to the exponential\n",
      "   2^R, so using medians helps mitigate this.\n",
      "\n",
      "5. On Big O notation it will have a O(n) complexity, which is fast enough to analyze real time data\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Generate k pairs of random numbers for hash functions\n",
    "# Using the (a,b) method: h_i(x) = (a_i * hash(x) + b_i) mod m\n",
    "k = 20 \n",
    "\n",
    "hash_params = []\n",
    "for i in range(k):\n",
    "    a = random.randint(1, 2*N)\n",
    "    b = random.randint(0, 2*N)\n",
    "    hash_params.append((a, b))\n",
    "\n",
    "#print(hash_params)\n",
    "\n",
    "def h(x, a, b):\n",
    "    \"\"\"Hash function using parameters a, b\"\"\"\n",
    "    return (a * hash(x) + b) % (2*N)\n",
    "\n",
    "# Simulate the stream with k hash functions\n",
    "R = [0] * k  # max trailing zeros for each hash function\n",
    "\n",
    "for _ in range(stream_size):\n",
    "    s = random.choice(D)\n",
    "    \n",
    "    for i in range(k): #k = number of hashes\n",
    "        a, b = hash_params[i]\n",
    "        hv = h(s, a, b)\n",
    "        r = trailing_0(hv)\n",
    "        if r > R[i]:\n",
    "            R[i] = r\n",
    "\n",
    "# Compute estimators for each hash function\n",
    "estimators = [int(math.pow(2, r)) for r in R]\n",
    "\n",
    "print(f\"Using {k} hash functions:\")\n",
    "print(f\"Max trailing zeros:    {R}\")\n",
    "print(f\"Individual estimators: {estimators}\")\n",
    "print()\n",
    "\n",
    "# Method 1: Average of all estimators\n",
    "avg_estimate = sum(estimators) / len(estimators)\n",
    "print(f\"Method 1 - Average of {k} estimators: {avg_estimate:.2f}\")\n",
    "\n",
    "# Method 2: Median of all estimators\n",
    "median_estimate = sorted(estimators)[len(estimators)//2]\n",
    "print(f\"Method 2 - Median of {k} estimators: {median_estimate:.2f}\")\n",
    "\n",
    "# Method 3: Divide into groups, take median per group, then average\n",
    "print(\"\\nMethod 3 - Group median then average:\")\n",
    "group_sizes = [2, 4, 5, 10] #different group sizes for checking precision on each case\n",
    "method3_errors = {}  # Store errors for later use\n",
    "method3_results = {}  # Store final estimates for later use\n",
    "\n",
    "for group_size in group_sizes:\n",
    "    if k % group_size != 0:\n",
    "        print(f\"  Group size {group_size}: skipped (doesn't divide {k} evenly)\")\n",
    "        continue\n",
    "    \n",
    "    num_groups = k // group_size\n",
    "    group_medians = []\n",
    "    \n",
    "    for g in range(num_groups):\n",
    "        group = estimators[g * group_size : (g + 1) * group_size] #we check group by group e.g (2 => 4)\n",
    "        group_medians.append(sorted(group)[len(group)//2])\n",
    "    \n",
    "    final_estimate = sum(group_medians) / len(group_medians)\n",
    "    error = abs(final_estimate - d) / d * 100\n",
    "    method3_errors[group_size] = error\n",
    "    method3_results[group_size] = final_estimate\n",
    "\n",
    "    print(f\"  Group size {group_size} ({num_groups} groups): {final_estimate:.2f} (error: {error:.1f}%)\")\n",
    "\n",
    "print(\"\\n\" + \"=\"*60)\n",
    "print(\"COMPARISON:\")\n",
    "print(\"=\"*60)\n",
    "print(f\"True number of distinct items: {d}\")\n",
    "print(f\"Method 1 (Average):            {avg_estimate:.2f}\")\n",
    "print(f\"Method 2 (Median):             {median_estimate:.2f}\")\n",
    "print(f\"Method 3 varies by group size (see above)\")\n",
    "print(\"\\nRelative errors:\")\n",
    "print(f\"Method 1: {abs(avg_estimate - d)/d * 100:.1f}%\")\n",
    "print(f\"Method 2: {abs(median_estimate - d)/d * 100:.1f}%\")\n",
    "print(\"\\nMethod 3 relative errors:\")\n",
    "for group_size, error in method3_errors.items():\n",
    "    print(f\"  Group size {group_size}: {error:.1f}%\")\n",
    "\n",
    "print(\"\\n\" + \"=\"*60)\n",
    "print(\"ANALYSIS AND CONCLUSIONS:\")\n",
    "print(\"=\"*60)\n",
    "\n",
    "# Find best performing methods\n",
    "best_method2 = abs(median_estimate - d)\n",
    "best_method3_size = min(method3_errors.keys(), key=lambda x: method3_errors[x])\n",
    "best_method3_error = method3_errors[best_method3_size]\n",
    "\n",
    "print(f\"\"\"\n",
    "1. OUTLIER SENSITIVITY:\n",
    "   - Method 1 (Average) is heavily influenced by outlier estimators\n",
    "   - The exponential nature of 2^R means a few hash functions with high \n",
    "     trailing zeros (like R={max(R)}) can drastically inflate the average\n",
    "   - In this run, estimators ranged from {min(estimators)} to {max(estimators)}, \n",
    "     causing Method 1 to overestimate by {abs(avg_estimate - d)/d * 100:.1f}%\n",
    "\n",
    "2. ROBUSTNESS OF MEDIAN:\n",
    "   - Method 2 (Median = {median_estimate:.0f}) performed much better ({abs(median_estimate - d)/d * 100:.1f}% error)\n",
    "   - The median is resistant to extreme values, focusing on the \"typical\" \n",
    "     estimator rather than being skewed by outliers\n",
    "   - With {k} hash functions, the median reflects the central tendency\n",
    "\n",
    "3. GROUP-BASED APPROACH (METHOD 3):\"\"\")\n",
    "\n",
    "# Brief analysis of each group size\n",
    "for group_size in sorted(method3_results.keys()):\n",
    "    result = method3_results[group_size]\n",
    "    error = method3_errors[group_size]\n",
    "    num_groups = k // group_size\n",
    "    \n",
    "    # Calculate group medians\n",
    "    group_medians = []\n",
    "    for g in range(num_groups):\n",
    "        group = estimators[g * group_size : (g + 1) * group_size]\n",
    "        group_medians.append(sorted(group)[len(group)//2])\n",
    "    \n",
    "    print(f\"   - Size {group_size}: {result:.2f} (error {error:.1f}%) | Group medians: {group_medians}\")\n",
    "    \n",
    "    if group_size == 2:\n",
    "        print(f\"     => Too small: pairs containing outliers can't filter them\")\n",
    "    elif group_size <= 5:\n",
    "        if error < 50:\n",
    "            print(f\"     => Good balance of outlier filtering and averaging\")\n",
    "        else:\n",
    "            print(f\"     => Outliers likely concentrated in few groups\")\n",
    "    else:\n",
    "        print(f\"     => Behaves like pure median with less averaging benefit\")\n",
    "\n",
    "print(f\"\"\"\n",
    "   Why Method 3 varies:\n",
    "   - Small groups: Can't filter outliers (median of [1,64] is still high)\n",
    "   - Medium groups: Work when outliers are spread across groups\n",
    "   - Large groups: Converge to Method 2 behavior\n",
    "\n",
    "4. RECOMMENDATION:\n",
    "   Best performer: {\"Method 2\" if best_method2 <= min(method3_errors.values()) else f\"Method 3 (group size {best_method3_size})\"}\n",
    "    => Use Method 2 (median) or Method 3 with group size 4-5 for reliable estimates\n",
    "    => Avoid Method 1 due to outlier sensitivity\n",
    "\n",
    "   Results may change due to the random stream information. OVERALL, WE CONFIRM THE CLASSES IDEAS: \n",
    "\n",
    "1. Method 2 (Median) is generally more robust to outliers than Method 1 (Average)\n",
    "   - Mean is sensible to outliers, a single hash function with very high trailing zeros will produce this effect \n",
    "   - As most of cases have high trailing ceros, method 1 doesnt perform well\n",
    "   \n",
    "2. Method 3 (Group median then average) provides a good balance:\n",
    "   - Reduces impact of outliers within each group via median\n",
    "   - Works best with moderate group sizes (e.g., 4-5)\n",
    "   \n",
    "3. The median-based approaches (Methods 2 & 3) typically have lower variance\n",
    "   across multiple runs, making them more reliable estimators.\n",
    "   \n",
    "4. The Flajolet-Martin algorithm tends to overestimate due to the exponential\n",
    "   2^R, so using medians helps mitigate this.\n",
    "\n",
    "5. On Big O notation it will have a O(n) complexity, which is fast enough to analyze real time data\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "FZHcE7Jve1PR"
   },
   "source": [
    "_You can use this cell to write your discussion of the results_"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "m2_ds_algods_lab3_distinct.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
