{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OpOHLyhPpaxC"
   },
   "source": [
    "# Algorithms for Data Science"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NvZ7rLY2pg8i"
   },
   "source": [
    "## Filtering Stream Items"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NqIyJ2BgplVA"
   },
   "source": [
    "### 1. Preliminaries \n",
    "\n",
    "The objective of this lab is to implement algorithms for filtering \"good\" items on streams. We will start by the simple implementation using only one hash function, and then it will be required of you to implement the full Bloom filter. We assume a random stream $S$ of $m$ email strings. We assume that the first $g$ emails are the good ones, that we have $n$ bits allocated in the bit array $B$ (for simplicity, implemented as an array here)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "id": "uj1GxrPF_DxQ"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['rwusx@lsecm', 'fiswg@zojph', 'xltmq@czpnq', 'hhxzj@amgek', 'fejal@xrtlf', 'rrmfd@erckq', 'eedjh@aqoia', 'bkrgm@fnxhw', 'qniur@xvxne', 'stxak@btfpi', 'nrvcf@opcgq', 'iiizu@iuoor', 'iqqwj@ylvxf', 'pmwru@wkoci', 'fgoge@wlggi', 'xfthb@tzqko', 'nttpb@nqmmb', 'kxakn@qdmlq', 'pctog@ghxpg', 'ytrnl@oapbt', 'bgpoo@kvatz', 'zabvn@femql', 'qhpzu@vnbbr', 'kbcyf@wzeii', 'igtmi@wzaqb', 'fvpie@htmrh', 'lrkkp@mzfka', 'rzuve@kzhwq', 'qtqhj@hrpgx', 'pbafw@yqwlo', 'fkssd@jxyxj', 'hlfng@izict', 'fjjuf@eevai', 'atmef@tiami', 'twadb@yppgt', 'tyjhj@qjiqi', 'wxfjt@wixoo', 'ozmfj@iolfh', 'rayst@pufbl', 'luhqz@yerzx', 'bcmgz@iwivu', 'aocrs@cwnxf', 'vfxvd@zbvid', 'fevct@pgzdp', 'wkjmt@ysfsd', 'alroq@yzxpe', 'vzbrd@pkyon', 'svqfz@ddvam', 'tkuaa@qxrsb', 'djdfk@qrdzp', 'skhys@bkaow', 'eekpu@pmymp', 'ajlcm@balst', 'yspqc@sbvvq', 'knqgk@gmnir', 'ehaly@qbwqv', 'ufgnn@sbrva', 'zizqq@nrjsf', 'suofs@tlfad', 'zhihg@zynbt', 'cofsp@hgnec', 'kkewx@qarld', 'oqabx@jbiyu', 'waoiu@wxrlm', 'ayriu@ddxtr', 'avoot@cbfhp', 'csrmd@piwnd', 'nogbd@bwjid', 'dzauh@mohkj', 'vrgss@fxprt', 'zvpcc@abhle', 'awvnm@avwyu', 'tjznm@cpjze', 'kjuwe@eohoj', 'vjyir@suvkw', 'micze@ahjyw', 'bvnqq@jqodo', 'pkzsu@bhgfa', 'cdwsl@nryee', 'edjyi@xdxmx', 'viock@ljgyf', 'yzubo@gmqel', 'iahdw@qpnwt', 'xynph@ytvuo', 'ovomk@sycqi', 'fsaay@hewcv', 'sjrom@ukiif', 'ldtur@qbtii', 'nrvut@csovl', 'umqtf@dejwd', 'arkaw@fjndw', 'omwzs@wquxe', 'cczxv@acnnk', 'vpgep@jldff', 'zprfz@erxfx', 'jowjk@rdwlg', 'xkwgp@rarhm', 'mvlog@ztvwb', 'teivl@bkeba', 'oseng@azodw']\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'\\nD is a dictonary with different mails, they will be sent in the streams \\n'"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import random\n",
    "from string import ascii_lowercase\n",
    "\n",
    "#parameters\n",
    "m = 100 #number of different mails\n",
    "g = 10  #number of elements in the sample\n",
    "stream_size = 10000\n",
    "n = 512\n",
    "\n",
    "#generate some random strings of size 5 + 1 + 5\n",
    "D = []\n",
    "for _ in range(m):\n",
    "  D.append(''.join(random.choice(ascii_lowercase) for i in range(5))+\\\n",
    "           '@'+''.join(random.choice(ascii_lowercase) for i in range(5)))\n",
    "\n",
    "print(D)\n",
    "\"\"\"\n",
    "D is a dictonary with different mails, they will be sent in the streams \n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GgKzrJj3_oiX"
   },
   "source": [
    "### 2. Creating a Hash Function, Filtering Items Using a Single Hash\n",
    "\n",
    "In the following we create a hash function $h(x)$, which also takes as a parameter a value and $n$, and returns a value in $0\\dots n-1$. We populate the byte array $B$, and then we simulate a stream taking random values from $D$ and checking whether the value is good or not. We measure the true positive, false positive, and false negative rates. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "id": "IbQ0B1a3BpAV",
    "outputId": "0ee899b2-b9d8-4a43-8dce-ecf772cfc42a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'fiswg@zojph', 'qniur@xvxne', 'xltmq@czpnq', 'eedjh@aqoia', 'bkrgm@fnxhw', 'rrmfd@erckq', 'stxak@btfpi', 'rwusx@lsecm', 'fejal@xrtlf', 'hhxzj@amgek'}\n",
      "[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]\n",
      "False positive rate: 0.039412\n"
     ]
    }
   ],
   "source": [
    "n = 128\n",
    "\n",
    "#hash function\n",
    "def h(x,n):\n",
    "  return hash(x)%n #x module 128 is the hash function \n",
    "\n",
    "good_set = set(D[:g]) #just for checking TP and FP rates (we get the 10 first values, those will be the right ones)\n",
    "print(good_set) #results may differ, because when removing elements the set reorders\n",
    "\n",
    "#allocate the array of 0s\n",
    "B = [0] * n #Data stream with bits \n",
    "\n",
    "#fill the byte array (we apply the hash function to all valid mails)\n",
    "for i in range(g): B[h(D[i],n)] = 1\n",
    "    \n",
    "print(B) \n",
    "\n",
    "tp = 0 # good items passing\n",
    "fp = 0 # bad items passing\n",
    "tn = 0 # bad items discarded\n",
    "fn = 0 # good items discarded\n",
    "\n",
    "#simulate a stream\n",
    "for _ in range(stream_size):\n",
    "  #take a random email\n",
    "  s = random.choice(D)\n",
    "  #check its hash value\n",
    "  if B[h(s,n)]==1: #good (we apply hash over the receiving mail, if the result index is also 1 on the B array it is valid)\n",
    "    if s not in good_set:\n",
    "      fp += 1\n",
    "    else:\n",
    "      tp += 1\n",
    "  else: #bad \n",
    "    if s in good_set:\n",
    "      fn += 1\n",
    "    else:\n",
    "      tn += 1\n",
    "\n",
    "print('False positive rate: %f'%(float(fp)/float(tn+fp)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "poLqQQeBanUH"
   },
   "source": [
    "We may want to create a random hash function that can also be pairwise independent when we will need to generate $k$ independent pairwise hashes.\n",
    "The following procedure can be implemented:\n",
    "* choose a large prime number $p$\n",
    "* generate two random numbers $a$ and $b$ in the range $\\{1,\\dots,p\\}$\n",
    "* the hash is then $h_{a,b}(x)=ax+b \\mod p$\n",
    "* we can also restrict it into $\\{0,\\dots,n-1\\}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 55
    },
    "id": "muzsN80wa3gY",
    "outputId": "addce3d3-7228-4c61-9402-c2ebc55f0075"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n"
     ]
    }
   ],
   "source": [
    "p = 1223543677\n",
    "\n",
    "a = random.randrange(p)\n",
    "b = random.randrange(p)\n",
    "\n",
    "#this is our new hashing function\n",
    "def h(x,a,b,p,n):\n",
    "  return ((a*hash(x)+b)%p)%n\n",
    "#remark: here we use hash(x) instead of the values to allow for all hashable python types\n",
    "#   e.g., strings, tuples\n",
    "\n",
    "#reinitialize the array, for testing\n",
    "B = [0] * n\n",
    "\n",
    "for i in range(g): \n",
    "  B[h(D[i],a,b,p,n)] = 1 #this process will be offline, we initialize our valid values matrix\n",
    "\n",
    "print(B)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Dz9rNT2Na35t"
   },
   "source": [
    "### 3. **TASK** - Bloom Filters\n",
    "\n",
    "Your task is to implement the Bloom filters as described in the class lecture. For this, you have to:\n",
    "1. generate $k$ random pairwise independent hash functions (_hint_: use the example shown above)\n",
    "2. initialize $B$, by setting $1$ in each $h_i(x)$, $i\\in\\{1,\\dots,k\\}$, for all items $x$ in the good set\n",
    "3. an item $s$ in the stream is considered good if, for all $i\\in\\{1,\\dots,k\\}$, we have $B[h_i(s)]=1$\n",
    "\n",
    "Measure the true positive and false positive rate for various values of $k$ and compare to the values obtained when setting $k=n/m\\ln 2$ (to the nearest integer value). What do you notice?\n",
    "\n",
    "Rates:\n",
    "\n",
    "$\n",
    "  \\text{false positive rate} \\frac{FP}{FP+TN}\n",
    "$\n",
    "\n",
    "$\n",
    "  \\text{true positive rate} \\frac{TP}{TP+FN}\n",
    "$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "id": "3Va-_6fda-jf"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of hash functions (k): 8\n",
      "Optimal k = (n/g) * ln(2) = (128/10) * ln(2) = 8\n",
      "a, b parameters used in each hash function: [(139292676, 1174084466), (256819263, 550627202), (1181409591, 863280643), (342086226, 889190627), (779173856, 1190258261), (106211458, 256922262), (472181094, 1181866919), (728572705, 1031484711)]\n",
      "==================== RESULTS  ====================\n",
      "False positive rate: 0.000000\n",
      "True positive rate : 1.000000\n"
     ]
    }
   ],
   "source": [
    "# Lets generate all of our k random hash functions \n",
    "p = 1223543677\n",
    "\n",
    "# As mentioned in the presentation, the best number of hash functions is followed by:\n",
    "# k = (n/m) * ln(2)\n",
    "# where n = size of array B, m = number of valid mails\n",
    "import math\n",
    "n_hashes = int((len(B)/g) * math.log(2))\n",
    "\n",
    "# Generate k hash functions with their parameters stored (so then we can apply the same ones)\n",
    "hash_functions = []\n",
    "for _ in range(n_hashes): \n",
    "    a = random.randrange(p)\n",
    "    b = random.randrange(p)\n",
    "    hash_functions.append((a, b))\n",
    "\n",
    "\n",
    "# Reinitialize the array\n",
    "B = [0] * n\n",
    "\n",
    "# Initialize B using all k hash functions for each valid email\n",
    "for i in range(g): \n",
    "    for a, b in hash_functions:\n",
    "        index = h(D[i], a, b, p, n)\n",
    "        B[index] = 1\n",
    "\n",
    "# Reset counters\n",
    "tp = 0 # good items passing\n",
    "fp = 0 # bad items passing\n",
    "tn = 0 # bad items discarded\n",
    "fn = 0 # good items discarded\n",
    "\n",
    "# Simulate stream with Bloom filter\n",
    "for _ in range(stream_size):\n",
    "    s = random.choice(D)\n",
    "    \n",
    "    # Check if ALL hash functions map to 1\n",
    "    passes_filter = True\n",
    "    for a, b in hash_functions:\n",
    "        if B[h(s, a, b, p, n)] == 0: #if just 1 of the hashes returns 0, it is not valid (NO FALSE NEGATIVES)\n",
    "            passes_filter = False\n",
    "            break\n",
    "    \n",
    "    if passes_filter:\n",
    "        if s not in good_set:\n",
    "            fp += 1\n",
    "        else:\n",
    "            tp += 1\n",
    "    else:\n",
    "        if s in good_set:\n",
    "            fn += 1\n",
    "        else:\n",
    "            tn += 1\n",
    "\n",
    "print(f'Number of hash functions (k): {n_hashes}')\n",
    "print(f'Optimal k = (n/g) * ln(2) = ({n}/{g}) * ln(2) = {n_hashes}')\n",
    "print(f\"a, b parameters used in each hash function: {hash_functions}\") # each a, b combination that we will be using \n",
    "print('=' * 20, \"RESULTS \", '=' * 20)\n",
    "print(f'False positive rate: {float(fp)/float(tn+fp):.6f}')\n",
    "print(f'True positive rate : {float(tp)/float(tp+fn):.6f}')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "FZHcE7Jve1PR"
   },
   "source": [
    "_You can use this cell to write your discussion of the results_"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "m2_ds_algods_lab2_filtering_solved.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}