{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "City5_WJEfye"
   },
   "source": [
    "# Algorithms for Data Science"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ySZT0ubAEjJ3"
   },
   "source": [
    "## Finding Similar Items"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "mGqRZ-W1EoJQ"
   },
   "source": [
    "The objective of this lab is to implement the Min-Hashing and Locality Sensitive Hashing. This lab needs Python and Jupyter, along with the NumPy package.\n",
    "\n",
    "1. We first load the required libraries and the files containing the documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 72
    },
    "colab_type": "code",
    "id": "onDbJw2yEfXa",
    "outputId": "db289d42-dd72-4488-e6f8-55940a404660"
   },
   "outputs": [
    {
     "ename": "ModuleNotFoundError",
     "evalue": "No module named 'numpy'",
     "output_type": "error",
     "traceback": [
      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
      "\u001b[31mModuleNotFoundError\u001b[39m                       Traceback (most recent call last)",
      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[5]\u001b[39m\u001b[32m, line 2\u001b[39m\n\u001b[32m      1\u001b[39m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01msys\u001b[39;00m\n\u001b[32m----> \u001b[39m\u001b[32m2\u001b[39m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mnumpy\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mas\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mnp\u001b[39;00m\n\u001b[32m      3\u001b[39m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01murllib\u001b[39;00m\u001b[34;01m.\u001b[39;00m\u001b[34;01mrequest\u001b[39;00m\n\u001b[32m      4\u001b[39m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mre\u001b[39;00m\n",
      "\u001b[31mModuleNotFoundError\u001b[39m: No module named 'numpy'"
     ]
    }
   ],
   "source": [
    "import sys\n",
    "import numpy as np\n",
    "import urllib.request\n",
    "import re\n",
    "import string\n",
    "import random\n",
    "\n",
    "file_location = 'https://phparis.net/slides/algo_4_ds/week2/tweets.txt' #you can change this to a local file on your computer\n",
    "\n",
    "#keeping document in memory\n",
    "infile = urllib.request.urlopen(file_location)\n",
    "docs = []\n",
    "for line in infile: \n",
    "  docs.append(str(line.strip()).lower())\n",
    "\n",
    "#print(docs)\n",
    "for i in range(len(docs)): \n",
    "   if i % 100 == 0: \n",
    "      print(docs[i] + \"\\n\")\n",
    "\n",
    "      \n",
    "print(\"Number of documents: %d\"%len(docs))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "dzCsuFbiGmI7"
   },
   "source": [
    "2. We transform the document into $k$-shingles, and we hash them to their integer values. We compute the Jaccard similarity between two documents given as sets of shingle ids."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 52
    },
    "colab_type": "code",
    "id": "EoyEXIR7GzeR",
    "outputId": "b62e1f90-476f-4faf-c9ac-65e043bbd2d9"
   },
   "outputs": [],
   "source": [
    "k = 5 #k for shingles\n",
    "\n",
    "shingle_id = {}\n",
    "id_shingle = []\n",
    "m = []\n",
    "ids = 0\n",
    "\n",
    "total_shingles = 0\n",
    "\n",
    "for d in docs:\n",
    "  #removing whitespace\n",
    "  d_new = ''.join(c for c in d if c.isalnum())\n",
    "  char_shing = [d_new[i:i+k] for i in range(len(d_new)-k+1)]\n",
    "  total_shingles += len(char_shing)\n",
    "  sid = set()\n",
    "  for sh in char_shing:\n",
    "    if sh not in shingle_id:\n",
    "      shingle_id[sh]=ids\n",
    "      id_shingle.append(sh)\n",
    "      ids=ids+1\n",
    "    sid.add(shingle_id[sh])\n",
    "  m.append(sid)\n",
    "\n",
    "print (\"Unique shingles: %d\"%len(id_shingle))\n",
    "print (\"Total shingles: %d\"%total_shingles)\n",
    "  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "colab_type": "code",
    "id": "2vciVEmjUiIU",
    "outputId": "057a20f3-f743-4633-f1ec-cffd7ecb806d"
   },
   "outputs": [],
   "source": [
    "def jaccard_similarity(doc1, doc2):\n",
    "  if len(doc1)==0 or len(doc2)==0:\n",
    "    return 0.0\n",
    "  else:\n",
    "    inter = doc1.intersection(doc2)\n",
    "    union = doc1.union(doc2)\n",
    "    return float(len(inter))/float(len(union))\n",
    "\n",
    "#example\n",
    "print(f\"This is the jaccard similarity between set from doc1 and doc2: {jaccard_similarity(m[0],m[1])}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "yovaXpHCG0FL"
   },
   "source": [
    "3. We implement the method for min-hashing given a permutation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "colab_type": "code",
    "id": "fpd3J3HsHW6m",
    "outputId": "74affc79-b1f4-4110-e4e8-4239bc2f8a88"
   },
   "outputs": [
    {
     "ename": "NameError",
     "evalue": "name 'id_shingle' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
      "\u001b[31mNameError\u001b[39m                                 Traceback (most recent call last)",
      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[6]\u001b[39m\u001b[32m, line 8\u001b[39m\n\u001b[32m      5\u001b[39m             \u001b[38;5;28;01mreturn\u001b[39;00m shingle_id\n\u001b[32m      6\u001b[39m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mfloat\u001b[39m(\u001b[33m'\u001b[39m\u001b[33minf\u001b[39m\u001b[33m'\u001b[39m)  \u001b[38;5;66;03m# if doc is empty\u001b[39;00m\n\u001b[32m----> \u001b[39m\u001b[32m8\u001b[39m perm = \u001b[38;5;28mlist\u001b[39m(\u001b[38;5;28mrange\u001b[39m(\u001b[38;5;28mlen\u001b[39m(\u001b[43mid_shingle\u001b[49m)))\n\u001b[32m      9\u001b[39m random.shuffle(perm)\n\u001b[32m     11\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33m\"\u001b[39m\u001b[33mExample of using minhash for document 0 and a random perm\u001b[39m\u001b[33m\"\u001b[39m + min_hash(m[\u001b[32m0\u001b[39m],perm))\n",
      "\u001b[31mNameError\u001b[39m: name 'id_shingle' is not defined"
     ]
    }
   ],
   "source": [
    "def min_hash(doc, perm):\n",
    "    # perm is a permutation of shingle indices\n",
    "    for shingle_id in perm:\n",
    "        if shingle_id in doc:\n",
    "            return shingle_id\n",
    "    return float('inf')  # if doc is empty\n",
    "\n",
    "perm = list(range(len(id_shingle)))\n",
    "random.shuffle(perm)\n",
    "\n",
    "print(f\"Example of using minhash for document 0 and a random perm {min_hash(m[0],perm)}\")\n",
    "\n",
    "print(f\"Number of documents (sets of the documents) {len(m)}\")\n",
    "#print(len(id_shingle))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "hJD2piyyHXXe"
   },
   "source": [
    "4. Implement the full Min-Hashing signature matrix for a given number $n$ of permutations. Implement the similarity estimation based on Min-Hashing (i.e., the number of permutation on which two documents agree)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "96nNQIfgWLto"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Creating Min-Hash signatures with 100 permutations...\n"
     ]
    },
    {
     "ename": "NameError",
     "evalue": "name 'm' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
      "\u001b[31mNameError\u001b[39m                                 Traceback (most recent call last)",
      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[4]\u001b[39m\u001b[32m, line 49\u001b[39m\n\u001b[32m     46\u001b[39m n_permutations = \u001b[32m100\u001b[39m  \u001b[38;5;66;03m# Number of hash functions\u001b[39;00m\n\u001b[32m     48\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[33mCreating Min-Hash signatures with \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mn_permutations\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m permutations...\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m---> \u001b[39m\u001b[32m49\u001b[39m signature_matrix, permutations = create_minhash_signature(\u001b[43mm\u001b[49m, n_permutations)\n\u001b[32m     51\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33mSignature matrix shape: \u001b[39m\u001b[38;5;132;01m{\u001b[39;00msignature_matrix.shape\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m\"\u001b[39m)\n\u001b[32m     53\u001b[39m \u001b[38;5;66;03m# Test similarity estimation\u001b[39;00m\n",
      "\u001b[31mNameError\u001b[39m: name 'm' is not defined"
     ]
    }
   ],
   "source": [
    "def create_minhash_signature(docs, n_permutations):\n",
    "    \"\"\"\n",
    "    OBJ: Create Min-Hashing signature matrix\n",
    "    \n",
    "    Args:\n",
    "        documents: list of sets (each set contains shingle IDs for a document)\n",
    "        n_permutations: number of hash functions (permutations) to use\n",
    "    =>\n",
    "        signature_matrix: numpy array of shape (n_permutations, len(documents))\n",
    "    \"\"\"\n",
    "    n_docs = len(docs)\n",
    "    n_shingles = len(id_shingle)\n",
    "    \n",
    "    # Initialize signature matrix\n",
    "    signature_matrix = np.full((n_permutations, n_docs), float('inf'))\n",
    "    \n",
    "    # Generate n permutations\n",
    "    permutations = []\n",
    "    for i in range(n_permutations):\n",
    "        perm = list(range(n_shingles))\n",
    "        random.shuffle(perm)\n",
    "        permutations.append(perm)\n",
    "    \n",
    "    # Compute signatures\n",
    "    for doc_idx, doc in enumerate(docs):\n",
    "        for perm_idx, perm in enumerate(permutations):\n",
    "            signature_matrix[perm_idx, doc_idx] = min_hash(doc, perm)\n",
    "    \n",
    "    return signature_matrix, permutations\n",
    "\n",
    "def minhash_similarity(sig1, sig2):\n",
    "    \"\"\"\n",
    "    OBJ: Estimate Jaccard similarity using Min-Hash signatures\n",
    "    \n",
    "    Args:\n",
    "        sig1, sig2: signature vectors for two documents => double(estimated similarity)\n",
    "    => \n",
    "    Number of similarity between both signatures from perms\n",
    "    \"\"\"\n",
    "    if len(sig1) != len(sig2):\n",
    "        raise ValueError(\"Signature vectors must have same length\")\n",
    "    \n",
    "    matches = np.sum(sig1 == sig2)\n",
    "    return float(matches) / len(sig1)\n",
    "\n",
    "# Example usage\n",
    "n_permutations = 100  # Number of hash functions\n",
    "\n",
    "print(f\"\\nCreating Min-Hash signatures with {n_permutations} permutations...\")\n",
    "signature_matrix, permutations = create_minhash_signature(m, n_permutations)\n",
    "\n",
    "print(f\"Signature matrix shape: {signature_matrix.shape}\")\n",
    "\n",
    "# Test similarity estimation\n",
    "doc1_idx = 0 \n",
    "doc2_idx = 1 \n",
    "actual_jaccard = jaccard_similarity(m[doc1_idx], m[doc2_idx])\n",
    "estimated_jaccard = minhash_similarity(signature_matrix[:, doc1_idx], signature_matrix[:, doc2_idx]) #: means all rows\n",
    "\n",
    "print(f\"\\nSimilarity between documents {doc1_idx} and {doc2_idx}:\")\n",
    "print(f\"Actual Jaccard similarity: {actual_jaccard:.4f}\")\n",
    "print(f\"Min-Hash estimated similarity: {estimated_jaccard:.4f}\")\n",
    "print(f\"Estimation error: {abs(actual_jaccard - estimated_jaccard):.4f}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "wpS0J_z4WN0T"
   },
   "source": [
    "5. __TASK__ Implement Locality-Sensitive Hashing, given $b$ bands of $r$ rows such that $br=n$. Compute the similarity threshold needed using the formula in the lecture $t=(1/b)^{1/r}$. Assume that signatures in the same band are similar only if the are the same (i.e., they agree on all rows within that band). Check for similarity all documents that agree in at least one band, and compare with the true jaccard similarity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "h08gagG8YUkh"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "LSH with b=20 bands, r=5 rows per band\n",
      "Similarity threshold t = (1/b)^(1/r) = 0.5493\n"
     ]
    },
    {
     "ename": "NameError",
     "evalue": "name 'signature_matrix' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
      "\u001b[31mNameError\u001b[39m                                 Traceback (most recent call last)",
      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[7]\u001b[39m\u001b[32m, line 56\u001b[39m\n\u001b[32m     53\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33mSimilarity threshold t = (1/b)^(1/r) = \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mt\u001b[38;5;132;01m:\u001b[39;00m\u001b[33m.4f\u001b[39m\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m\"\u001b[39m)\n\u001b[32m     55\u001b[39m \u001b[38;5;66;03m# Run LSH\u001b[39;00m\n\u001b[32m---> \u001b[39m\u001b[32m56\u001b[39m candidate_pairs = lsh(\u001b[43msignature_matrix\u001b[49m, b, r)\n\u001b[32m     57\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33mLSH found \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mlen\u001b[39m(candidate_pairs)\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m candidate pairs\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m     59\u001b[39m \u001b[38;5;66;03m# Compare LSH candidates with true Jaccard similarity\u001b[39;00m\n",
      "\u001b[31mNameError\u001b[39m: name 'signature_matrix' is not defined"
     ]
    }
   ],
   "source": [
    "def lsh_hash_band(band_signature):\n",
    "    \"\"\"Hash a band signature to a string\"\"\"\n",
    "    return str(tuple(band_signature))\n",
    "\n",
    "def lsh(signature_matrix, b, r):\n",
    "    \"\"\"\n",
    "    Args:\n",
    "        signature_matrix: Min-Hash signatures\n",
    "        b: number of bands\n",
    "        r: number of rows per band (b * r = n_permutations)\n",
    "    =>\n",
    "        candidate_pairs: set of document pairs that hash to same bucket in at least one band\n",
    "    \"\"\"\n",
    "    n_docs = signature_matrix.shape[1]\n",
    "    candidate_pairs = set()\n",
    "    \n",
    "    # Process each band\n",
    "    for band in range(b):\n",
    "        start_row = band * r\n",
    "        end_row = start_row + r\n",
    "        \n",
    "        # Hash table for this band\n",
    "        buckets = {}\n",
    "        \n",
    "        # Hash each document's band signature\n",
    "        for doc in range(n_docs):\n",
    "            band_sig = []\n",
    "            for row in range(start_row, end_row):\n",
    "                band_sig.append(signature_matrix[row, doc])\n",
    "            \n",
    "            hash_val = lsh_hash_band(band_sig)\n",
    "            \n",
    "            if hash_val not in buckets:\n",
    "                buckets[hash_val] = []\n",
    "            buckets[hash_val].append(doc)\n",
    "        \n",
    "        # Find pairs in same bucket\n",
    "        for bucket_docs in buckets.values():\n",
    "            if len(bucket_docs) > 1:\n",
    "                for i in range(len(bucket_docs)):\n",
    "                    for j in range(i + 1, len(bucket_docs)):\n",
    "                        candidate_pairs.add((bucket_docs[i], bucket_docs[j]))\n",
    "    \n",
    "    return candidate_pairs\n",
    "\n",
    "# LSH parameters\n",
    "b = 20  # bands\n",
    "r = 5   # rows per band (so b*r = 100 = n_permutations)\n",
    "\n",
    "# Calculate similarity threshold\n",
    "t = (1.0 / b) ** (1.0 / r)\n",
    "print(f\"LSH with b={b} bands, r={r} rows per band\")\n",
    "print(f\"Similarity threshold t = (1/b)^(1/r) = {t:.4f}\")\n",
    "\n",
    "# Run LSH\n",
    "candidate_pairs = lsh(signature_matrix, b, r)\n",
    "print(f\"LSH found {len(candidate_pairs)} candidate pairs\")\n",
    "\n",
    "# Compare LSH candidates with true Jaccard similarity\n",
    "print(\"LSH Candidate pairs vs True Jaccard similarity:\")\n",
    "print(\"Doc1\\tDoc2\\tMin-Hash\\tTrue Jaccard\\tAbove threshold?\")\n",
    "print(\"-\" * 60)\n",
    "\n",
    "candidate_list = list(candidate_pairs)\n",
    "for i in range(min(10, len(candidate_list))):\n",
    "    doc1, doc2 = candidate_list[i]\n",
    "    minhash_sim = minhash_similarity(signature_matrix[:, doc1], signature_matrix[:, doc2])\n",
    "    #print(f\"Min hash similarity: {minhash_sim}\")\n",
    "    true_jaccard = jaccard_similarity(m[doc1], m[doc2])\n",
    "    #print(f\"True Jaccard Similarity: {true_jaccard} \")\n",
    "    above_threshold = \"Yes\" if true_jaccard >= t else \"No\"\n",
    "    \n",
    "    print(f\"{doc1}\\t{doc2}\\t{minhash_sim:.4f}\\t\\t{true_jaccard:.4f}\\t\\t{above_threshold}\")\n",
    "\n",
    "# Statistics (to see its similarity with the \"unprocessed data\")\n",
    "if len(candidate_pairs) > 0:\n",
    "    true_positives = 0  # candidates with true similarity >= threshold\n",
    "    for doc1, doc2 in candidate_pairs:\n",
    "        true_jaccard = jaccard_similarity(m[doc1], m[doc2])\n",
    "        if true_jaccard >= t:\n",
    "            true_positives += 1\n",
    "    \n",
    "    precision = true_positives / len(candidate_pairs)\n",
    "    print(\"LSH Statistics:\")\n",
    "    print(f\"Candidates above threshold: {true_positives}/{len(candidate_pairs)}\")\n",
    "    print(f\"Precision: {precision:.4f}\")\n",
    "\n",
    "\"\"\"\n",
    "The only \"problem\" i see is that as we are using always different permutations, altough the conclusions could be the same;\n",
    "as the pairs and precisions change the algorithm seems not deterministic (always not giving the same result). \n",
    "This should be solved if instead we used determined hash functions or a randomization seed instead of the \"mathematical theorical\" approach \n",
    "of the permutations that was suggested. \n",
    "TO CHANGE THIS: \n",
    "random.seed(number) could be added \n",
    "Creating a hash_function that could be generated by another method called generate_hash_functions. \n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "m2_ds_algods_lab1_similar.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
