Code

   1{
   2 "cells": [
   3  {
   4   "cell_type": "markdown",
   5   "metadata": {},
   6   "source": [
   7    "# Huffman Compression Algorithm"
   8   ]
   9  },
  10  {
  11   "cell_type": "markdown",
  12   "metadata": {},
  13   "source": [
  14    "### By Todor Nedkovski"
  15   ]
  16  },
  17  {
  18   "cell_type": "markdown",
  19   "metadata": {},
  20   "source": [
  21    "### Table of contents"
  22   ]
  23  },
  24  {
  25   "cell_type": "markdown",
  26   "metadata": {},
  27   "source": [
  28    "This notebook includes a brief explanation of how the Huffman Compression Algorithm works.\n",
  29    "\n",
  30    "We are going to talk about a couple of questions:\n",
  31    "\n",
  32    "1. What is the difference between lossless and lossy compression?\n",
  33    "2. How are Huffman trees constructed?\n",
  34    "3. How can we get back the uncompressed data from the Huffman tree?\n",
  35    "4. Comparison with other lossless compression algorithms.\n",
  36    "\n",
  37    "In the end, we are going to review what we have just learned and think about other problems we might encounter while compressing files."
  38   ]
  39  },
  40  {
  41   "cell_type": "markdown",
  42   "metadata": {},
  43   "source": [
  44    "## Abstract"
  45   ]
  46  },
  47  {
  48   "cell_type": "markdown",
  49   "metadata": {},
  50   "source": [
  51    "In the early days of computing (around the 1980s), computers did not have much more than 10 MB of storage capacity. It's really mind-blowing that your phone now holds ten thousand times as much as computers did back then. The problem is that the files we want to store seem to keep pace with that growth. How can we store more data in less space? The answer is compression! There are a lot of compression algorithms, but we are going to talk about a specific one."
  52   ]
  53  },
  54  {
  55   "cell_type": "markdown",
  56   "metadata": {},
  57   "source": [
  58    "## What is the difference between fixed-length and variable-length codes?"
  59   ]
  60  },
  61  {
  62   "cell_type": "markdown",
  63   "metadata": {},
  64   "source": [
  65    "### What is a fixed-length code?"
  66   ]
  67  },
  68  {
  69   "cell_type": "markdown",
  70   "metadata": {},
  71   "source": [
  72    "In a fixed-length code, each codeword has the same number of bits."
  73   ]
  74  },
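  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For example, the illustrative mapping $M_{0}=\\{\\,a\\mapsto 00,b\\mapsto 01,c\\mapsto 10\\,\\}$ is a fixed-length code: every codeword is exactly two bits long. ASCII is a well-known fixed-length code in which every character takes up the same number of bits."
   ]
  },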
  75  {
  76   "cell_type": "markdown",
  77   "metadata": {},
  78   "source": [
  79    "### What is a variable-length code?"
  80   ]
  81  },
  82  {
  83   "cell_type": "markdown",
  84   "metadata": {},
  85   "source": [
  86    "In coding theory, a variable-length code is a code which maps source symbols to a variable number of bits. There are three sub-types of variable-length codes."
  87   ]
  88  },
  89  {
  90   "cell_type": "markdown",
  91   "metadata": {},
  92   "source": [
  93    "### Non-singular"
  94   ]
  95  },
  96  {
  97   "cell_type": "markdown",
  98   "metadata": {},
  99   "source": [
 100    "A code is non-singular if each character is mapped to a different bit string. In other words the mapping from characters to bit strings is injective (one to one).\n",
 101    "\n",
 102    "1. For example, the mapping $M_{1}=\\{\\,a\\mapsto 0,b\\mapsto 0,c\\mapsto 1\\,\\}$ is not non-singular because both \"a\" and \"b\" map to the same bit string \"0\". Any extension of this mapping will generate a lossy coding. It may be useful when some loss of information is acceptable.\n",
 103    "2. However, the mapping $M_{2}=\\{\\,a\\mapsto 1,b\\mapsto 011,c\\mapsto 01110,d\\mapsto 1110,e\\mapsto 10011,f\\mapsto 0\\}$ is non-singular."
 104   ]
 105  },
 106  {
 107   "cell_type": "markdown",
 108   "metadata": {},
 109   "source": [
 110    "### Uniquely decodable"
 111   ]
 112  },
 113  {
 114   "cell_type": "markdown",
 115   "metadata": {},
 116   "source": [
 117    "A code is **uniquely decodable** if its extension is non-singular.\n",
 118    "\n",
 119    "1. The mapping $M_{3}=\\{\\,a\\mapsto 0,b\\mapsto 01,c\\mapsto 011\\,\\}$ is uniquely decodable. \n",
 120    "2. Consider again the code $M_{2}$ from the previous section. This code is not uniquely decodable, since the string 011101110011 can be interpreted as the sequence of codewords 01110 – 1110 – 011, but also as the sequence of codewords 011 – 1 – 011 – 10011. The two decodings of this encoded string are cdb and babe."
 121   ]
 122  },
 123  {
 124   "cell_type": "markdown",
 125   "metadata": {},
 126   "source": [
 127    "### Prefix"
 128   ]
 129  },
 130  {
 131   "cell_type": "markdown",
 132   "metadata": {},
 133   "source": [
 134    "For something to be a prefix code, the entire set of possible encoded values must not contain any values that start with any other value in the set. For example: $M_{4}=\\{\\,a\\mapsto 0,b\\mapsto 10,c\\mapsto 110\\,\\}$ is a prefix code, because none of the values start with any of the other values. However, $M_{3}=\\{\\,a\\mapsto 0,b\\mapsto 01,c\\mapsto 011\\,\\}$ is not a prefix code, because one of the values starts with another of the values.\n",
 135    "\n",
 136    "Prefix codes are useful because you can pick out each value without needing to know where the value starts and ends."
 137   ]
 138  },
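  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the prefix property concrete, here is a small helper of ours (the name `is_prefix_code` is made up for illustration) that tests whether a set of codewords forms a prefix code:\n",
    "\n",
    "```python\n",
    "def is_prefix_code(codewords):\n",
    "    # A set of codewords is a prefix code if no codeword\n",
    "    # is the beginning of another codeword.\n",
    "    for a in codewords:\n",
    "        for b in codewords:\n",
    "            if a != b and b.startswith(a):\n",
    "                return False\n",
    "    return True\n",
    "\n",
    "print(is_prefix_code([\"0\", \"10\", \"110\"]))  # M4 -> True\n",
    "print(is_prefix_code([\"0\", \"01\", \"011\"]))  # M3 -> False\n",
    "```"
   ]
  },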
 139  {
 140   "cell_type": "markdown",
 141   "metadata": {},
 142   "source": [
 143    "## What is the difference between lossless and lossy compression?"
 144   ]
 145  },
 146  {
 147   "cell_type": "markdown",
 148   "metadata": {},
 149   "source": [
 150    "### What is lossless compression?"
 151   ]
 152  },
 153  {
 154   "cell_type": "markdown",
 155   "metadata": {},
 156   "source": [
 157    "Lossless compression is capable of reconstructing the original form of the data exactly,\n",
 158    "so the quality of the data is not compromised.\n",
 159    "This technique allows a file to be restored to its original form.\n",
 160    "Lossless compression can be applied to any file format and can improve the compression ratio."
 161   ]
 162  },
 163  {
 164   "cell_type": "markdown",
 165   "metadata": {},
 166   "source": [
 167    "### What is lossy compression?"
 168   ]
 169  },
 170  {
 171   "cell_type": "markdown",
 172   "metadata": {},
 173   "source": [
 174    "Lossy compression eliminates some amount of data that is not noticeable.\n",
 175    "This technique **does not** allow a file to be restored to its original form, but it significantly reduces the size.\n",
 176    "The lossy compression technique is beneficial if the quality of the data is not your priority.\n",
 177    "It slightly degrades the quality of the file or data, but it is convenient when one wants to send or store data.\n",
 178    "This type of data compression is used for organic data like audio signals and images."
 179   ]
 180  },
 181  {
 182   "cell_type": "markdown",
 183   "metadata": {},
 184   "source": [
 185    "## When can we get away with lossy compression?"
 186   ]
 187  },
 188  {
 189   "cell_type": "markdown",
 190   "metadata": {},
 191   "source": [
 192    "As the definition above states, lossy compression eliminates the data that is not noticeable.\n",
 193    "This is why, for the most part, we use lossy compression for images, videos and audio files.\n",
 194    "Most of the information is still there, and you would not notice a few milliseconds missing from a video or a couple of pixels missing from an image.\n",
 195    "Basically, anything except text qualifies: unless you are into guessing the missing characters in words, text needs to be compressed losslessly."
 196   ]
 197  },
 198  {
 199   "cell_type": "markdown",
 200   "metadata": {},
 201   "source": [
 202    "## What is Huffman Compression Algorithm?"
 203   ]
 204  },
 205  {
 206   "cell_type": "markdown",
 207   "metadata": {},
 208   "source": [
 209    "The Huffman Compression Algorithm is a lossless, prefix-code compression algorithm developed by David A. Huffman (then a student at MIT) and published in his 1952 paper. I guess you can say it is old. The main idea is to use shorter sequences of bits for more frequent characters."
 210   ]
 211  },
 212  {
 213   "cell_type": "markdown",
 214   "metadata": {},
 215   "source": [
 216    "To begin generating the Huffman tree, each character gets a weight equal to the number of times it\n",
 217    "occurs in the file. For example, in the \"happy hip hop\" example, the character 'p' has weight 4,\n",
 218    "'h' has weight 3, the space has weight 2, and the other characters have weight 1. Our first task is to\n",
 219    "calculate these weights.\n",
 220    "\n",
 221    "1. Create a collection of singleton trees, one for each character, with weight equal to the character frequency.\n",
 222    "2. From the collection, pick out the two trees with the smallest weights and remove them. Combine them into a new tree whose root has a weight equal to the sum of the weights of the two nodes.\n",
 223    "3. Add the new tree back into the collection.\n",
 224    "4. Repeat steps 2 and 3 until there is only one tree left.\n",
 225    "\n",
 226    "<img src=\"imgs/first.png\" style=\"width: 400px;\">"
 227   ]
 228  },
 229  {
 230   "cell_type": "markdown",
 231   "metadata": {},
 232   "source": [
 233    "We start by choosing the two smallest nodes. There are four nodes with the minimal weight of one, so it\n",
 234    "doesn't matter which two we pick. We choose 'o' and 'y' and combine them into a new tree whose\n",
 235    "root is the sum of the weights chosen. We replace those two nodes with the combined tree. The nodes\n",
 236    "remaining in the collection are shown in the light gray box at each stage.\n",
 237    "\n",
 238    "<img src=\"imgs/second.png\" style=\"width: 400px;\"/>"
 239   ]
 240  },
 241  {
 242   "cell_type": "markdown",
 243   "metadata": {},
 244   "source": [
 245    "Now we repeat that step; this time there is no choice for the minimal nodes: it must be 'a' and 'i'.\n",
 246    "We take those out and combine them into a weight 2 tree. Note how the collection of nodes shrinks by\n",
 247    "one each iteration (we remove two nodes and add a new one back in).\n",
 248    "\n",
 249    "<img src=\"imgs/third.png\" style=\"width: 400px;\"> "
 250   ]
 251  },
 252  {
 253   "cell_type": "markdown",
 254   "metadata": {},
 255   "source": [
 256    "Again, we pull out the two smallest nodes and build a tree of weight 4:\n",
 257    "\n",
 258    "<img src=\"imgs/fourth.png\" style=\"width: 400px;\"> "
 259   ]
 260  },
 261  {
 262   "cell_type": "markdown",
 263   "metadata": {},
 264   "source": [
 265    "One more iteration combines the weight 3 and 2 trees into a combined tree of weight 5:\n",
 266    "\n",
 267    "<img src=\"imgs/fifth.png\" style=\"width: 400px;\"> "
 268   ]
 269  },
 270  {
 271   "cell_type": "markdown",
 272   "metadata": {},
 273   "source": [
 274    "Combining the two 4s gets a tree of weight 8:\n",
 275    "\n",
 276    "<img src=\"imgs/sixth.png\" style=\"width: 400px;\"> "
 277   ]
 278  },
 279  {
 280   "cell_type": "markdown",
 281   "metadata": {},
 282   "source": [
 283    "And finally, we combine the last two to get our final tree. The root node of the final tree will always\n",
 284    "have a weight equal to the number of characters in the input file.\n",
 285    "\n",
 286    "<img src=\"imgs/seventh.png\" style=\"width: 400px;\"> "
 287   ]
 288  },
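  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The construction we just walked through translates almost directly into code. Here is a minimal sketch of ours using Python's `heapq` module to always pick the two smallest trees (the notebook's own list-based implementation comes later):\n",
    "\n",
    "```python\n",
    "import heapq\n",
    "from collections import Counter\n",
    "\n",
    "def build_huffman_tree(text):\n",
    "    # Step 1: one singleton entry per character: (weight, tiebreaker, tree).\n",
    "    heap = [(w, i, char) for i, (char, w) in enumerate(Counter(text).items())]\n",
    "    heapq.heapify(heap)\n",
    "    count = len(heap)\n",
    "    while len(heap) > 1:\n",
    "        # Steps 2-3: remove the two smallest trees, merge, push the sum back.\n",
    "        w1, _, t1 = heapq.heappop(heap)\n",
    "        w2, _, t2 = heapq.heappop(heap)\n",
    "        heapq.heappush(heap, (w1 + w2, count, (t1, t2)))\n",
    "        count += 1\n",
    "    # Step 4: the single remaining entry holds the root of the final tree.\n",
    "    return heap[0][2]\n",
    "\n",
    "print(build_huffman_tree(\"happy hip hop\"))\n",
    "```"
   ]
  },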
 289  {
 290   "cell_type": "markdown",
 291   "metadata": {},
 292   "source": [
 293    "Remember that it is essential that you use the same tree to do both encoding and decoding of your files.\n",
 294    "Since each Huffman tree creates a unique encoding of a particular file, you need to ensure that your\n",
 295    "decoding algorithm generates the exact same tree, so that you can get back the file."
 296   ]
 297  },
 298  {
 299   "cell_type": "markdown",
 300   "metadata": {},
 301   "source": [
 302    "### The Pseudo-EOF"
 303   ]
 304  },
 305  {
 306   "cell_type": "markdown",
 307   "metadata": {},
 308   "source": [
 309    "You can now see that our tree has an asymmetrical, jagged form. We have not talked about this, but trees usually should not look like this. What happens when we try to store a Huffman-encoded sequence on disk in a file? Each file on your computer is typically stored as a number of bytes (groups of eight bits). Therefore, if you try to write a Huffman-encoded string of bits into a file and its length is not an exact multiple of eight, arbitrary filler bits will end up at the end of the last byte."
 310   ]
 311  },
 312  {
 313   "cell_type": "markdown",
 314   "metadata": {},
 315   "source": [
 316    "For example, the word \"ahoy\" encoded with the Huffman tree above would look something like this:\n",
 317    "\n",
 318    "<center><b>1101001100111</b></center>\n",
 319    "\n",
 320    "Those are exactly thirteen bits, which means that 3 more will be added. Let's say that those 3 bits are **111**.\n",
 321    "\n",
 322    "In that case, the bit sequence would be written to disk as\n",
 323    "\n",
 324    "<center><b>1101001100111111</b></center>\n",
 325    "\n",
 326    "If we were to then load this back from disk and decode it into a sequence of characters, we would get \"ahoyi\". Even worse, if those filler bits did not form a codeword in our tree (say \"000\"), we would get an error.\n",
 327    "\n",
 328    "This is clearly a problem. We need a special character which tells us where our compressed data ends. It is never printed, and we can add it to our tree in the same way we have added all the other characters. Here is a possible tree with ■:\n",
 329    "\n",
 330    "<img src=\"imgs/eight.png\" style=\"width: 400px;\">\n",
 331    "\n",
 332    "If we encode \"happy hip hop■\" we get the following bitstring\n",
 333    "\n",
 334    "<center><b>001100101011110110011011001100111010010</b></center>\n",
 335    "\n",
 336    "| Symbol | Codeword |\n",
 337    "|--------|----------|\n",
 338    "| h      | 00       |\n",
 339    "| a      | 1100     |\n",
 340    "| p      | 10       |\n",
 341    "| p      | 10       |\n",
 342    "| y      | 1111     |\n",
 343    "| (space)| 011      |\n",
 344    "| h      | 00       |\n",
 345    "| i      | 1101     |\n",
 346    "| p      | 10       |\n",
 347    "| (space)| 011      |\n",
 348    "| h      | 00       |\n",
 349    "| o      | 1110     |\n",
 350    "| p      | 10       |\n",
 351    "| ■      | 010      |\n",
 352    "| (padding) | ignored |\n",
 353    "\n",
 354    "This ■ is called *pseudo end of file* or *Pseudo-EOF* for short."
 355   ]
 356  },
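  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see the byte-boundary problem in code, here is a small sketch of ours (the helper name `pack_bits` is made up) that pads an encoded bitstring to a multiple of eight bits and packs it into bytes; the padding is exactly the filler the pseudo-EOF lets the decoder ignore:\n",
    "\n",
    "```python\n",
    "def pack_bits(bitstring):\n",
    "    # Pad with filler bits (here zeros) up to a multiple of eight.\n",
    "    padding = (8 - len(bitstring) % 8) % 8\n",
    "    padded = bitstring + \"0\" * padding\n",
    "    # Turn each group of eight bits into one byte.\n",
    "    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))\n",
    "\n",
    "print(len(\"1101001100111\"))        # 13 bits...\n",
    "print(pack_bits(\"1101001100111\"))  # ...packed into 2 bytes\n",
    "```"
   ]
  },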
 357  {
 358   "cell_type": "markdown",
 359   "metadata": {},
 360   "source": [
 361    "From here on we are going to write some code, so it is better to gather all of our imports in one place."
 362   ]
 363  },
 364  {
 365   "cell_type": "code",
 366   "execution_count": null,
 367   "metadata": {},
 368   "outputs": [],
 369   "source": [
 370    "import numpy as np\n",
 371    "import matplotlib.pyplot as plt"
 372   ]
 373  },
 374  {
 375   "cell_type": "markdown",
 376   "metadata": {},
 377   "source": [
 378    "Let's implement it in Python."
 379   ]
 380  },
 381  {
 382   "cell_type": "markdown",
 383   "metadata": {},
 384   "source": [
 385    "Now we will create our Node tree class\n",
 386    "\n",
 387    "```python\n",
 388    "class NodeTree(object):\n",
 389    "\n",
 390    "    def __init__(self, left=None, right=None):\n",
 391    "        self.left = left\n",
 392    "        self.right = right\n",
 393    "\n",
 394    "    def children(self):\n",
 395    "        return (self.left, self.right)\n",
 396    "\n",
 397    "    def nodes(self):\n",
 398    "        return (self.left, self.right)\n",
 399    "\n",
 400    "    def __str__(self):\n",
 401    "        return '%s_%s' % (self.left, self.right)\n",
 402    "```"
 403   ]
 404  },
 405  {
 406   "cell_type": "markdown",
 407   "metadata": {},
 408   "source": [
 409    "We need to calculate the character frequencies.\n",
 410    "\n",
 411    "```python\n",
 412    "def calculate_frequency(data):\n",
 413    "    freq = {}\n",
 414    "    for c in data:\n",
 415    "        if c in freq:\n",
 416    "            freq[c] += 1\n",
 417    "        else:\n",
 418    "            freq[c] = 1\n",
 419    "    return freq\n",
 420    "```"
 421   ]
 422  },
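  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(As a side note, the standard library can do this in one call: `collections.Counter(data)` builds the same character-to-count mapping.)"
   ]
  },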
 423  {
 424   "cell_type": "markdown",
 425   "metadata": {},
 426   "source": [
 427    "After that we sort them in descending order of frequency, so the least frequent characters end up at the tail of the list:\n",
 428    "\n",
 429    "```python\n",
 430    "freq = sorted(freq.items(), key=lambda x: x[1], reverse=True)\n",
 431    "```"
 432   ]
 433  },
 434  {
 435   "cell_type": "markdown",
 436   "metadata": {},
 437   "source": [
 438    "Now we need to create the function that merges our nodes  \n",
 439    "\n",
 440    "```python\n",
 441    "def merge(nodes):\n",
 442    "    while len(nodes) > 1:\n",
 443    "        (key1, c1) = nodes[-1]\n",
 444    "        (key2, c2) = nodes[-2]\n",
 445    "        nodes = nodes[:-2]\n",
 446    "        node = NodeTree(key1, key2)\n",
 447    "        nodes.append((node, c1 + c2))\n",
 448    "\n",
 449    "        nodes = sorted(nodes, key=lambda x: x[1], reverse=True)\n",
 450    "    return nodes\n",
 451    "```"
 452   ]
 453  },
 454  {
 455   "cell_type": "markdown",
 456   "metadata": {},
 457   "source": [
 458    "Finally, we walk the tree to build the dictionary that maps each character to its codeword ('0' for a left edge, '1' for a right edge):\n",
 459    "\n",
 460    "```python\n",
 461    "def huffman_code_tree(node, left=True, binString=''):\n",
 462    "    if type(node) is str:\n",
 463    "        return {node: binString}\n",
 464    "    (l, r) = node.children()\n",
 465    "    d = dict()\n",
 466    "    d.update(huffman_code_tree(l, True, binString + '0'))\n",
 467    "    d.update(huffman_code_tree(r, False, binString + '1'))\n",
 468    "    return d\n",
 469    "```"
 470   ]
 471  },
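  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check, here is what the function returns on a tiny hand-built tree of our own (not part of the pipeline):\n",
    "\n",
    "```python\n",
    "tree = NodeTree(NodeTree('a', 'b'), 'c')\n",
    "print(huffman_code_tree(tree))  # {'a': '00', 'b': '01', 'c': '1'}\n",
    "```"
   ]
  },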
 472  {
 473   "cell_type": "markdown",
 474   "metadata": {},
 475   "source": [
 476    "All combined it should look something like this"
 477   ]
 478  },
 479  {
 480   "cell_type": "code",
 481   "execution_count": null,
 482   "metadata": {
 483    "scrolled": true
 484   },
 485   "outputs": [],
 486   "source": [
 487    "class NodeTree(object):\n",
 488    "\n",
 489    "    def __init__(self, left=None, right=None):\n",
 490    "        self.left = left\n",
 491    "        self.right = right\n",
 492    "\n",
 493    "    def children(self):\n",
 494    "        return (self.left, self.right)\n",
 495    "\n",
 496    "    def nodes(self):\n",
 497    "        return (self.left, self.right)\n",
 498    "\n",
 499    "    def __str__(self):\n",
 500    "        return '%s_%s' % (self.left, self.right)\n",
 501    "\n",
 502    "def calculate_frequency(data):\n",
 503    "    freq = {}\n",
 504    "    for c in data:\n",
 505    "        if c in freq:\n",
 506    "            freq[c] += 1\n",
 507    "        else:\n",
 508    "            freq[c] = 1\n",
 509    "    return freq\n",
 510    "\n",
 511    "def huffman_code_tree(node, left=True, binString=''):\n",
 512    "    if type(node) is str:\n",
 513    "        return {node: binString}\n",
 514    "    (l, r) = node.children()\n",
 515    "    d = dict()\n",
 516    "    d.update(huffman_code_tree(l, True, binString + '0'))\n",
 517    "    d.update(huffman_code_tree(r, False, binString + '1'))\n",
 518    "    return d\n",
 519    "\n",
 520    "def merge(nodes):\n",
 521    "    while len(nodes) > 1:\n",
 522    "        (key1, c1) = nodes[-1]\n",
 523    "        (key2, c2) = nodes[-2]\n",
 524    "        nodes = nodes[:-2]\n",
 525    "        node = NodeTree(key1, key2)\n",
 526    "        nodes.append((node, c1 + c2))\n",
 527    "\n",
 528    "        nodes = sorted(nodes, key=lambda x: x[1], reverse=True)\n",
 529    "    return nodes\n",
 530    "\n",
 531    "def encode(text, table):\n",
 532    "    # Concatenate the codeword of every character in the text.\n",
 533    "    result = \"\"\n",
 534    "    for c in text:\n",
 535    "        result += table[c]\n",
 536    "    return result\n",
 537    "\n",
 538    "def decode(encoded_text, table):\n",
 539    "    # Invert the code table once so each codeword lookup is O(1).\n",
 540    "    reverse = {v: k for k, v in table.items()}\n",
 541    "    decoded_text = \"\"\n",
 542    "    buffer = \"\"\n",
 543    "    for bit in encoded_text:\n",
 544    "        buffer += bit\n",
 545    "        # With a prefix code, the first full codeword match is the right one.\n",
 546    "        if buffer in reverse:\n",
 547    "            decoded_text += reverse[buffer]\n",
 548    "            buffer = \"\"\n",
 549    "    return decoded_text"
 550   ]
 551  },
 552  {
 553   "cell_type": "code",
 554   "execution_count": null,
 555   "metadata": {},
 556   "outputs": [],
 557   "source": [
 558    "def compress(string):\n",
 559    "    freq = calculate_frequency(string)\n",
 560    "    \n",
 561    "    freq = sorted(freq.items(), key=lambda x: x[1], reverse=True)\n",
 562    "\n",
 563    "    nodes = freq\n",
 564    "\n",
 565    "    nodes = merge(nodes)\n",
 566    "\n",
 567    "    huffmanCode = huffman_code_tree(nodes[0][0])\n",
 568    "        \n",
 569    "    return (huffmanCode, freq, encode(string, huffmanCode))"
 570   ]
 571  },
 572  {
 573   "cell_type": "markdown",
 574   "metadata": {},
 575   "source": [
 576    "And of course a function to uncompress it."
 577   ]
 578  },
 579  {
 580   "cell_type": "code",
 581   "execution_count": null,
 582   "metadata": {},
 583   "outputs": [],
 584   "source": [
 585    "def uncompress(encoded_text, tree):\n",
 586    "    return decode(encoded_text, tree)"
 587   ]
 588  },
 589  {
 590   "cell_type": "markdown",
 591   "metadata": {},
 592   "source": [
 593    "It's also good to have a function to print the encoding tree."
 594   ]
 595  },
 596  {
 597   "cell_type": "code",
 598   "execution_count": null,
 599   "metadata": {},
 600   "outputs": [],
 601   "source": [
 602    "def print_tree(huffmanCode, freq):\n",
 603    "    print(' Char | Huffman code ')\n",
 604    "    print('----------------------')\n",
 605    "    for (char, frequency) in freq:\n",
 606    "        print(' %-4r |%12s' % (char, huffmanCode[char]))"
 607   ]
 608  },
 609  {
 610   "cell_type": "markdown",
 611   "metadata": {},
 612   "source": [
 613    "Let's implement the data-reading function in advance, because we are going to use it a lot later on."
 614   ]
 615  },
 616  {
 617   "cell_type": "code",
 618   "execution_count": null,
 619   "metadata": {},
 620   "outputs": [],
 621   "source": [
 622    "def get_data(file):\n",
 623    "    with open(\"data/\" + file + \".txt\") as data:  # close the file when done\n",
 624    "        text = data.readlines()\n",
 625    "    return text[0]"
 626   ]
 627  },
 628  {
 629   "cell_type": "code",
 630   "execution_count": null,
 631   "metadata": {},
 632   "outputs": [],
 633   "source": [
 634    "text = get_data(\"non_random_text\")\n",
 635    "(code, freq, encoding) = compress(text)\n",
 636    "uncompress(encoding, code) == text"
 637   ]
 638  },
 639  {
 640   "cell_type": "markdown",
 641   "metadata": {},
 642   "source": [
 643    "From the code above we see that our uncompress function works!"
 644   ]
 645  },
 646  {
 647   "cell_type": "code",
 648   "execution_count": null,
 649   "metadata": {
 650    "scrolled": true
 651   },
 652   "outputs": [],
 653   "source": [
 654    "text = get_data(\"non_random_text\")\n",
 655    "(code, freq, encoding) = compress(text)\n",
 656    "print_tree(code, freq)"
 657   ]
 658  },
 659  {
 660   "cell_type": "markdown",
 661   "metadata": {},
 662   "source": [
 663    "Here we can see that the most common character, in this case \"e\", is encoded with fewer bits because its frequency is higher, while \"K\" is encoded with more bits because its frequency is lower. I wonder what would happen if we gave more \"random\" data to our algorithm?"
 664   ]
 665  },
 666  {
 667   "cell_type": "code",
 668   "execution_count": null,
 669   "metadata": {
 670    "scrolled": true
 671   },
 672   "outputs": [],
 673   "source": [
 674    "text = get_data(\"random_text\")\n",
 675    "(code, freq, encoding) = compress(text)\n",
 676    "print_tree(code, freq)"
 677   ]
 678  },
 679  {
 680   "cell_type": "markdown",
 681   "metadata": {},
 682   "source": [
 683    "Interesting... I expected a more uneven distribution of the frequencies. I wonder if there is a correlation between character count and the randomness of our data. We can easily calculate the randomness, but first let's look at the compressed/uncompressed ratio."
 684   ]
 685  },
 686  {
 687   "cell_type": "markdown",
 688   "metadata": {},
 689   "source": [
 690    "Now we need a function that counts how many bits the encoded text takes up with a specific code tree:"
 691   ]
 692  },
 693  {
 694   "cell_type": "code",
 695   "execution_count": null,
 696   "metadata": {},
 697   "outputs": [],
 698   "source": [
 699    "def calculate_bits(code, freq):\n",
 700    "    bits = 0\n",
 701    "    for (char, count) in freq:\n",
 702    "        bits += len(code[char]) * count\n",
 703    "    return bits"
 704   ]
 705  },
 706  {
 707   "cell_type": "markdown",
 708   "metadata": {},
 709   "source": [
 710    "We can calculate how much a particular code tree compresses the data with the function below:"
 711   ]
 712  },
 713  {
 714   "cell_type": "code",
 715   "execution_count": null,
 716   "metadata": {},
 717   "outputs": [],
 718   "source": [
 719    "def comparison(file):\n",
 720    "    text = get_data(file)\n",
 721    "\n",
 722    "    (code, freq, encoded) = compress(text)\n",
 723    "\n",
 724    "    after_compression = calculate_bits(code, freq)\n",
 725    "\n",
 726    "    before_compression = len(text) * 8  # the uncompressed text uses 8 bits per character\n",
 727    "\n",
 728    "    percentage = (after_compression / before_compression) * 100\n",
 729    "\n",
 730    "    print(str(before_compression) + \" bits before\")\n",
 731    "    print(str(after_compression) + \" bits after\")\n",
 732    "    print(\"%.2f\" % percentage + '%')"
 733   ]
 734  },
 735  {
 736   "cell_type": "code",
 737   "execution_count": null,
 738   "metadata": {
 739    "scrolled": true
 740   },
 741   "outputs": [],
 742   "source": [
 743    "comparison(\"non_random_text\")"
 744   ]
 745  },
 746  {
 747   "cell_type": "markdown",
 748   "metadata": {},
 749   "source": [
 750    "That means the compressed file takes up about $45.52\\%$ of the space of the original file. Let's try that with the random data."
 751   ]
 752  },
 753  {
 754   "cell_type": "code",
 755   "execution_count": null,
 756   "metadata": {},
 757   "outputs": [],
 758   "source": [
 759    "comparison(\"random_text\")"
 760   ]
 761  },
 762  {
 763   "cell_type": "markdown",
 764   "metadata": {},
 765   "source": [
 766    "It really should not be surprising that we get only $19.95\\%$ here."
 767   ]
 768  },
 769  {
 770   "cell_type": "markdown",
 771   "metadata": {},
 772   "source": [
 773    "We can definitely say that more unpredictable data compresses less. We don't know why yet, but soon we might."
 774   ]
 775  },
 776  {
 777   "cell_type": "markdown",
 778   "metadata": {},
 779   "source": [
 780    "This is the file \"non_random_text\":"
 781   ]
 782  },
 783  {
 784   "cell_type": "code",
 785   "execution_count": null,
 786   "metadata": {
 787    "scrolled": true
 788   },
 789   "outputs": [],
 790   "source": [
 791    "get_data(\"non_random_text\")"
 792   ]
 793  },
 794  {
 795   "cell_type": "markdown",
 796   "metadata": {},
 797   "source": [
 798    "And this is the file \"random_text\":"
 799   ]
 800  },
 801  {
 802   "cell_type": "code",
 803   "execution_count": null,
 804   "metadata": {
 805    "scrolled": true
 806   },
 807   "outputs": [],
 808   "source": [
 809    "get_data(\"random_text\")"
 810   ]
 811  },
 812  {
 813   "cell_type": "markdown",
 814   "metadata": {},
 815   "source": [
 816    "## What is entropy?"
 817   ]
 818  },
 819  {
 820   "cell_type": "markdown",
 821   "metadata": {},
 822   "source": [
 823    "Entropy, or more specifically Shannon entropy, helps us quantify the \"randomness\" of a distribution, in our case of text."
 824   ]
 825  },
 826  {
 827   "cell_type": "markdown",
 828   "metadata": {},
 829   "source": [
 830    "### Information for an Event"
 831   ]
 832  },
 833  {
 834   "cell_type": "markdown",
 835   "metadata": {},
 836   "source": [
 837    "The intuition behind quantifying information is the idea of measuring how much surprise there is in an event. Those events that are rare (low probability) are more surprising and therefore have more information than those events that are common (high probability).\n",
 838    "\n",
 839    "In other words:\n",
 840    "\n",
 841    "1. **Low Probability Event**: High Information (surprising).\n",
 842    "2. **High Probability Event**: Low Information (unsurprising).\n"
 843   ]
 844  },
 845  {
 846   "cell_type": "markdown",
 847   "metadata": {},
 848   "source": [
 849    "$$h(p_i) = - \\log_{2}{p_i}$$"
 850   ]
 851  },
 852  {
 853   "cell_type": "markdown",
 854   "metadata": {},
 855   "source": [
 856    "The log is base 2 because we want to measure our data in bits. The negative sign ensures that the result is always positive or zero, since probabilities are at most 1 and their logarithms are therefore at most 0. Consider a flip of a single fair coin: the probability of heads (and of tails) is 0.5. Let's make some examples with Python!"
 857   ]
 858  },
 859  {
 860   "cell_type": "code",
 861   "execution_count": null,
 862   "metadata": {},
 863   "outputs": [],
 864   "source": [
 865    "def information_for_p(p):\n",
 866    "    # information content (surprise) of an event with probability p, in bits\n",
 867    "    h = -np.log2(p)\n",
 868    "    print('p(x)=%.3f, information: %.3f bits' % (p, h))"
 869   ]
 870  },
 871  {
 872   "cell_type": "code",
 873   "execution_count": null,
 874   "metadata": {},
 875   "outputs": [],
 876   "source": [
 877    "information_for_p(0.5)"
 878   ]
 879  },
 880  {
 881   "cell_type": "markdown",
 882   "metadata": {},
 883   "source": [
 884    "When the probability of our event is 0.5, the information content of the event is 1 bit. If we flip that coin, say, 100 times, the information for this sequence would be 100 bits. (That's called foreshadowing.)"
 885   ]
 886  },
 887  {
 888   "cell_type": "markdown",
 889   "metadata": {},
 890   "source": [
 891    "Now consider a coin that is not fair, where the chance of getting heads is 0.9."
 892   ]
 893  },
 894  {
 895   "cell_type": "code",
 896   "execution_count": null,
 897   "metadata": {
 898    "scrolled": false
 899   },
 900   "outputs": [],
 901   "source": [
 902    "information_for_p(0.9)"
 903   ]
 904  },
 905  {
 906   "cell_type": "markdown",
 907   "metadata": {},
 908   "source": [
 909    "Here our event is more frequent, and thus its information content is smaller."
 910   ]
 911  },
 912  {
 913   "cell_type": "markdown",
 914   "metadata": {},
 915   "source": [
 916    "Let's plot the function $-\\log_2{p_i}$"
 917   ]
 918  },
 919  {
 920   "cell_type": "code",
 921   "execution_count": null,
 922   "metadata": {},
 923   "outputs": [],
 924   "source": [
 925    "probs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]\n",
 926    "\n",
 927    "info = [-np.log2(p) for p in probs]\n",
 928    "\n",
 929    "plt.plot(probs, info, marker='.')\n",
 930    "plt.title('Probability vs Information')\n",
 931    "plt.xlabel('Probability')\n",
 932    "plt.ylabel('Information')\n",
 933    "plt.show()"
 934   ]
 935  },
 936  {
 937   "cell_type": "markdown",
 938   "metadata": {},
 939   "source": [
 940    "We should not be surprised by two things: first, the graph is not linear, which is normal because our function is a logarithm; and second, the inverse correlation between information content and probability."
 941   ]
 942  },
 943  {
 944   "cell_type": "markdown",
 945   "metadata": {},
 946   "source": [
 947    "### Calculate the Entropy for a Random Variable"
 948   ]
 949  },
 950  {
 951   "cell_type": "code",
 952   "execution_count": null,
 953   "metadata": {},
 954   "outputs": [],
 955   "source": [
 956    "def entropy(probabilities, base):\n",
 957    "    h = 0\n",
 958    "    for p in probabilities:\n",
 959    "        h += -np.log(p) / np.log(base) * p  # weight each event's information by its probability\n",
 960    "    return h"
 961   ]
 962  },
 963  {
 964   "cell_type": "code",
 965   "execution_count": null,
 966   "metadata": {},
 967   "outputs": [],
 968   "source": [
 969    "p = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]\n",
 970    "print('entropy: %.3f bits' % entropy(p, 2))"
 971   ]
 972  },
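  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a fair six-sided die this prints roughly 2.585 bits, which is exactly $\\log_2{6}$: with six equally likely outcomes, every roll carries $-\\log_2{\\frac{1}{6}}$ bits of surprise."
   ]
  },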
 973  {
 974   "cell_type": "markdown",
 975   "metadata": {},
 976   "source": [
 977    "Written mathematically, it looks like this:\n",
 978    "\n",
 979    "$$H(x) = -\\sum_{i = 1}^n p_i {\\log_2{p_i}}$$"
 980   ]
 981  },
 982  {
 983   "cell_type": "markdown",
 984   "metadata": {},
 985   "source": [
 986    "Now we are able to calculate the entropy of our distribution. We also need a function that calculates the character frequencies and another that calculates the probability of each character."
 987   ]
 988  },
 989  {
 990   "cell_type": "code",
 991   "execution_count": null,
 992   "metadata": {},
 993   "outputs": [],
 994   "source": [
 995    "def calculate_probability(frequencies, base_count):\n",
 996    "    prob = []\n",
 997    "    for f in frequencies.values():\n",
 998    "        prob.append(f/base_count)\n",
 999    "    return prob\n",
1000    "\n",
1001    "def entropy(data, base):  # redefined to work directly on raw data\n",
1002    "    frequencies = calculate_frequency(data)\n",
1003    "    probabilities = calculate_probability(frequencies, len(data))\n",
1004    "    \n",
1005    "    h = 0\n",
1006    "    \n",
1007    "    for p in probabilities:\n",
1008    "        h += (np.log(p) / np.log(base)) * p\n",
1009    "        \n",
1010    "    return -h "
1011   ]
1012  },
1013  {
1014   "cell_type": "code",
1015   "execution_count": null,
1016   "metadata": {},
1017   "outputs": [],
1018   "source": [
1019    "string = get_data(\"random_text\")\n",
1020    "base = 2\n",
1021    "\n",
1022    "entropy(string, base)"
1023   ]
1024  },
1025  {
1026   "cell_type": "code",
1027   "execution_count": null,
1028   "metadata": {},
1029   "outputs": [],
1030   "source": [
1031    "string = get_data(\"non_random_text\")\n",
1032    "base = 2\n",
1033    "\n",
1034    "entropy(string, base)"
1035   ]
1036  },
1037  {
1038   "cell_type": "markdown",
1039   "metadata": {},
1040   "source": [
1041    "For our random text the entropy is approximately 6.35, which means that on average we need 6.35 bits to represent a single character. High entropy means unpredictability, which is a synonym for randomness; likewise, low entropy means more predictability. (Side note: at first I thought there would be a correlation between individual character frequencies and the entropy of the data, but there is none.)"
1042   ]
1043  },
1044  {
1045   "cell_type": "markdown",
1046   "metadata": {},
1047   "source": [
1048    "We have come to the conclusion that Huffman coding is an entropy encoding scheme."
1049   ]
1050  },
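  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The connection can be made precise: by Shannon's source coding theorem, the entropy $H(x)$ is a lower bound on the average codeword length $L$ of any uniquely decodable code, and Huffman coding comes within one bit of that bound:\n",
    "\n",
    "$$H(x) \\leq L < H(x) + 1$$"
   ]
  },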
1051  {
1052   "cell_type": "markdown",
1053   "metadata": {},
1054   "source": [
1055    "## How can we get back the uncompressed data from the Huffman tree?"
1056   ]
1057  },
1058  {
1059   "cell_type": "markdown",
1060   "metadata": {},
1061   "source": [
1062    "One problem we will encounter while trying to send the compressed data is that the receiver does not have the tree needed to recover the data. There are a couple of ways to resolve this. We could agree on the coding tree in advance, but for that we would need to know the letter frequencies in advance. The second option is to prefix the bit sequence with a header containing enough information to reconstruct the Huffman encoding tree, stored at the head of the file in a human-readable format. Reading this header allows the receiver to recreate the tree."
1063   ]
1064  },
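  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of the second option (the header format and the helper names `write_header`/`read_header` are made up for illustration), we can serialize the frequency table and rebuild the same tree on the receiving side:\n",
    "\n",
    "```python\n",
    "import ast\n",
    "\n",
    "def write_header(freq):\n",
    "    # freq is the list of (char, count) pairs produced by compress()\n",
    "    return \"\".join(\"%r %d\\n\" % (char, count) for (char, count) in freq)\n",
    "\n",
    "def read_header(header):\n",
    "    freq = []\n",
    "    for line in header.splitlines():\n",
    "        char, count = line.rsplit(\" \", 1)\n",
    "        freq.append((ast.literal_eval(char), int(count)))\n",
    "    return freq\n",
    "\n",
    "# The receiver can now run merge() and huffman_code_tree() on the\n",
    "# recovered frequencies to rebuild the exact same code table.\n",
    "```"
   ]
  },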
1065  {
1066   "cell_type": "markdown",
1067   "metadata": {},
1068   "source": [
1069    "## Comparison with other lossless compression algorithms"
1070   ]
1071  },
1072  {
1073   "cell_type": "markdown",
1074   "metadata": {},
1075   "source": [
1076    "We are going to use an existing comparison that reports the file sizes. But first, let's define the compression algorithms mentioned below."
1077   ]
1078  },
1079  {
1080   "cell_type": "markdown",
1081   "metadata": {},
1082   "source": [
1083    "**Run-length encoding (RLE)** is a form of lossless data compression in which runs of data (sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run. For example, the string \"aaaaaeeeeeffff\" translates to \"5a5e4f\". As you can see, this works well only if we have runs of consecutive identical characters."
1084   ]
1085  },
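  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of RLE (our own, built on `itertools.groupby`):\n",
    "\n",
    "```python\n",
    "from itertools import groupby\n",
    "\n",
    "def rle_encode(text):\n",
    "    # collapse each run of identical characters into count + character\n",
    "    return \"\".join(\"%d%s\" % (len(list(group)), char) for char, group in groupby(text))\n",
    "\n",
    "print(rle_encode(\"aaaaaeeeeeffff\"))  # 5a5e4f\n",
    "```"
   ]
  },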
1086  {
1087   "cell_type": "markdown",
1088   "metadata": {},
1089   "source": [
1090    "**Shannon Fano Algorithm** is an entropy encoding technique for lossless data compression of multimedia. Named after Claude Shannon and Robert Fano, it assigns a code to each symbol based on their probabilities of occurrence. It is a variable length encoding scheme, that is, the codes assigned to the symbols will be of varying length."
1091   ]
1092  },
1093  {
1094   "cell_type": "markdown",
1095   "metadata": {},
1096   "source": [
1097    "**Adaptive Huffman coding** (also called **Dynamic Huffman coding**) is an adaptive coding technique based on Huffman coding, with the exception that it builds the tree as the symbols are being sent, with no prior knowledge of the character frequencies. It is used in live video and audio compression."
1098   ]
1099  },
1100  {
1101   "cell_type": "markdown",
1102   "metadata": {},
1103   "source": [
1104    "**Arithmetic coding** is a form of entropy encoding used in lossless data compression. Normally, a string of characters such as the words \"hello there\" is represented using a fixed number of bits per character, as in the ASCII code. Arithmetic coding instead encodes the entire message into a single number, a fraction in the interval $[0, 1)$."
1105   ]
1106  },
1107  {
1108   "cell_type": "markdown",
1109   "metadata": {},
1110   "source": [
1111    "<img src=\"imgs/datasize.png\" style=\"width: 500px;\">"
1112   ]
1113  },
1114  {
1115   "cell_type": "markdown",
1116   "metadata": {},
1117   "source": [
1118    "<img src=\"imgs/comparison.png\" style=\"width: 700px;\">"
1119   ]
1120  },
1121  {
1122   "cell_type": "markdown",
1123   "metadata": {},
1124   "source": [
1125    "The overall behaviour of Shannon-Fano coding, Huffman coding and Adaptive Huffman coding is very similar, while Arithmetic coding stands apart. Unlike Huffman coding, it requires no code tree to be transmitted to the receiver, and encoding is done on a group of symbols rather than symbol by symbol, which leads to higher compression ratios."
1126   ]
1127  },
1128  {
1129   "cell_type": "markdown",
1130   "metadata": {},
1131   "source": [
1132    "## Overview"
1133   ]
1134  },
1135  {
1136   "cell_type": "markdown",
1137   "metadata": {},
1138   "source": [
1139    "We have talked about what Huffman coding is and how the tree is generated. We made, and then rejected, the hypothesis that data with evenly distributed character counts is not random. We have talked about entropy, and about how and why exactly we measure the randomness of data, and we ran a test that confirms it. This is meant to be a brief explanation of Huffman coding and its properties."
1140   ]
1141  },
1142  {
1143   "cell_type": "markdown",
1144   "metadata": {},
1145   "source": [
1146    "Resources used:\n",
1147    "\n",
1148    "NumPy, Matplotlib"
1149   ]
1150  },
1151  {
1152   "cell_type": "markdown",
1153   "metadata": {},
1154   "source": [
1155    "## References"
1156   ]
1157  },
1158  {
1159   "cell_type": "markdown",
1160   "metadata": {},
1161   "source": [
1162    "1. [https://en.wikipedia.org/wiki/Huffman_coding](https://en.wikipedia.org/wiki/Huffman_coding)\n",
1163    "2. [https://www.youtube.com/watch?v=dM6us854Jk0](https://www.youtube.com/watch?v=dM6us854Jk0)\n",
1164    "3. [https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1126/handouts/220%20Huffman%20Encoding.pdf](https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1126/handouts/220%20Huffman%20Encoding.pdf)\n",
1165    "4. [http://160592857366.free.fr/joe/ebooks/ShareData/A%20Comparitive%20Study%20of%20Text%20Compression%20Algorithms.pdf](http://160592857366.free.fr/joe/ebooks/ShareData/A%20Comparitive%20Study%20of%20Text%20Compression%20Algorithms.pdf)\n",
1166    "5. [http://home.cse.ust.hk/faculty/golin/COMP271Sp03/Notes/MyL17.pdf](http://home.cse.ust.hk/faculty/golin/COMP271Sp03/Notes/MyL17.pdf)\n",
1167    "6. [https://www.programiz.com/dsa/huffman-coding](https://www.programiz.com/dsa/huffman-coding) - Huffman source code\n",
1168    "7. [https://en.wikipedia.org/wiki/Prefix_code](https://en.wikipedia.org/wiki/Prefix_code)\n",
1169    "8. [https://www.tutorialspoint.com/huffman-codes-and-entropy-in-data-structure](https://www.tutorialspoint.com/huffman-codes-and-entropy-in-data-structure)\n",
1170    "9. [https://machinelearningmastery.com/what-is-information-entropy/](https://machinelearningmastery.com/what-is-information-entropy/)\n",
1171    "10. [https://en.wikipedia.org/wiki/Variable-length_code](https://en.wikipedia.org/wiki/Variable-length_code)\n",
1172    "11. [https://leimao.github.io/blog/Huffman-Coding/](https://leimao.github.io/blog/Huffman-Coding/)\n",
1173    "12. [https://www.princeton.edu/~cuff/ele201/kulkarni_text/information.pdf](https://www.princeton.edu/~cuff/ele201/kulkarni_text/information.pdf)"
1174   ]
1175  }
1176 ],
1177 "metadata": {
1178  "kernelspec": {
1179   "display_name": "Python 3 (ipykernel)",
1180   "language": "python",
1181   "name": "python3"
1182  },
1183  "language_info": {
1184   "codemirror_mode": {
1185    "name": "ipython",
1186    "version": 3
1187   },
1188   "file_extension": ".py",
1189   "mimetype": "text/x-python",
1190   "name": "python",
1191   "nbconvert_exporter": "python",
1192   "pygments_lexer": "ipython3",
1193   "version": "3.11.5"
1194  }
1195 },
1196 "nbformat": 4,
1197 "nbformat_minor": 4
1198}



Discussion
Viktor Bechev
05.01.2024 22:29

It is not mandatory to use OOP, correct. For us, the projects are one of the three tools we use to judge how well you have mastered the technology we teach. In other words, we would value deep understanding of a topic other than Python, but that is not the focus of our course. We are not looking for a well-developed topic; we are looking for a good amount of functionality, tests and good style, all of it written in Python. It is great that you have the knowledge related to compression algorithms, but what we actually evaluate, the Python code, is extremely little here. The recent homework assignments average 150-200 lines of code, so there is no way the final project can be 50 (or 200, for that matter). The project gives 60 points, which makes it equivalent to 6 homework assignments. Instead of me giving you examples of appropriately sized projects, take a look at your colleagues' proposals. The ones with 8+ points in particular are generally well-described projects whose scope is quite good or even exceeds our requirements (which is not a bad thing; whoever enjoys it and has the time is welcome to write even more code).
Todor Nedkovski
05.01.2024 18:03

In principle, I was thinking this would be my project, with a few additions in terms of functionality. As far as I understand, we are not required to use OOP. I also don't know how large the project is supposed to be. If you could give me some more information, I would be grateful.
Viktor Bechev
05.01.2024 12:42

A great paper. But I don't understand what your idea for the project is. Huffman is a few dozen lines of code, all of which you already have in this Notebook. What do you think your project at the end of the course will be?
История

t1{t1{
2 "cells": [2 "cells": [
3  {3  {
4   "cell_type": "markdown",4   "cell_type": "markdown",
5   "metadata": {},5   "metadata": {},
6   "source": [6   "source": [
7    "# Huffman Compression Algorithm"7    "# Huffman Compression Algorithm"
8   ]8   ]
9  },9  },
10  {10  {
11   "cell_type": "markdown",11   "cell_type": "markdown",
12   "metadata": {},12   "metadata": {},
13   "source": [13   "source": [
14    "### By Todor Nedkovski"14    "### By Todor Nedkovski"
15   ]15   ]
16  },16  },
17  {17  {
18   "cell_type": "markdown",18   "cell_type": "markdown",
19   "metadata": {},19   "metadata": {},
20   "source": [20   "source": [
21    "### Table of contents"21    "### Table of contents"
22   ]22   ]
23  },23  },
24  {24  {
25   "cell_type": "markdown",25   "cell_type": "markdown",
26   "metadata": {},26   "metadata": {},
27   "source": [27   "source": [
28    "This notebook includes a brief explanation about Huffman Compression Algorithm works.\n",28    "This notebook includes a brief explanation about Huffman Compression Algorithm works.\n",
29    "\n",29    "\n",
30    "We are going to talk about a couple of questions:\n",30    "We are going to talk about a couple of questions:\n",
31    "\n",31    "\n",
32    "1. What is the difference betwenn lossless and lossy compression?\n",32    "1. What is the difference betwenn lossless and lossy compression?\n",
33    "2. How are Huffman trees constructed?\n",33    "2. How are Huffman trees constructed?\n",
34    "3. How can we get back the uncompressed data from the Huffman tree?\n",34    "3. How can we get back the uncompressed data from the Huffman tree?\n",
35    "4. Comparison with other lossless compression algorithms.\n",35    "4. Comparison with other lossless compression algorithms.\n",
36    "\n",36    "\n",
37    "In the end we are going to overview what we have just learned and think of other problems we might encounter while compressing files. "37    "In the end we are going to overview what we have just learned and think of other problems we might encounter while compressing files. "
38   ]38   ]
39  },39  },
40  {40  {
41   "cell_type": "markdown",41   "cell_type": "markdown",
42   "metadata": {},42   "metadata": {},
43   "source": [43   "source": [
44    "## Abstract"44    "## Abstract"
45   ]45   ]
46  },46  },
47  {47  {
48   "cell_type": "markdown",48   "cell_type": "markdown",
49   "metadata": {},49   "metadata": {},
50   "source": [50   "source": [
51    "In the early days computers (around 1980s) did not have much more then a 10 MB storage capacity. It's really mind blowing that you have ten thousand times as much now in your phone then you did back then. The problem is that the files we want to store seem to keep pace with that growth. How can we store more data with less space? The answear is compression! There are a lot of compression algorithms but we are going to talk about a specific one."51    "In the early days computers (around 1980s) did not have much more then a 10 MB storage capacity. It's really mind blowing that you have ten thousand times as much now in your phone then you did back then. The problem is that the files we want to store seem to keep pace with that growth. How can we store more data with less space? The answear is compression! There are a lot of compression algorithms but we are going to talk about a specific one."
52   ]52   ]
53  },53  },
54  {54  {
55   "cell_type": "markdown",55   "cell_type": "markdown",
56   "metadata": {},56   "metadata": {},
57   "source": [57   "source": [
58    "## What is the difference bewtween Fixed and Variable length code"58    "## What is the difference bewtween Fixed and Variable length code"
59   ]59   ]
60  },60  },
61  {61  {
62   "cell_type": "markdown",62   "cell_type": "markdown",
63   "metadata": {},63   "metadata": {},
64   "source": [64   "source": [
65    "### What is a Fixed Length code "65    "### What is a Fixed Length code "
66   ]66   ]
67  },67  },
68  {68  {
69   "cell_type": "markdown",69   "cell_type": "markdown",
70   "metadata": {},70   "metadata": {},
71   "source": [71   "source": [
72    "In a fixed-length code each codeword has the same number of bits"72    "In a fixed-length code each codeword has the same number of bits"
73   ]73   ]
74  },74  },
75  {75  {
76   "cell_type": "markdown",76   "cell_type": "markdown",
77   "metadata": {},77   "metadata": {},
78   "source": [78   "source": [
79    "### What is Variable-Length code"79    "### What is Variable-Length code"
80   ]80   ]
81  },81  },
82  {82  {
83   "cell_type": "markdown",83   "cell_type": "markdown",
84   "metadata": {},84   "metadata": {},
85   "source": [85   "source": [
86    "In coding theory a variable-length code is a code which maps source symbols to a variable number of bits. There are 3 sub-types of the variable-length code."86    "In coding theory a variable-length code is a code which maps source symbols to a variable number of bits. There are 3 sub-types of the variable-length code."
87   ]87   ]
88  },88  },
89  {89  {
90   "cell_type": "markdown",90   "cell_type": "markdown",
91   "metadata": {},91   "metadata": {},
92   "source": [92   "source": [
93    "### Non-singular"93    "### Non-singular"
94   ]94   ]
95  },95  },
96  {96  {
97   "cell_type": "markdown",97   "cell_type": "markdown",
98   "metadata": {},98   "metadata": {},
99   "source": [99   "source": [
100    "A code is non-singular if each character is mapped to a different bit string. In other words the mapping from characters to bit strings is injective (one to one).\n",100    "A code is non-singular if each character is mapped to a different bit string. In other words the mapping from characters to bit strings is injective (one to one).\n",
101    "\n",101    "\n",
102    "1. For example, the mapping $M_{1}=\\{\\,a\\mapsto 0,b\\mapsto 0,c\\mapsto 1\\,\\}$ is not non-singular because both \"a\" and \"b\" map to the same bit string \"0\". Any extension of this mapping will generate a lossy coding. It may be useful when some loss of information is acceptable.\n",102    "1. For example, the mapping $M_{1}=\\{\\,a\\mapsto 0,b\\mapsto 0,c\\mapsto 1\\,\\}$ is not non-singular because both \"a\" and \"b\" map to the same bit string \"0\". Any extension of this mapping will generate a lossy coding. It may be useful when some loss of information is acceptable.\n",
103    "2. However, the mapping $M_{2}=\\{\\,a\\mapsto 1,b\\mapsto 011,c\\mapsto 01110,d\\mapsto 1110,e\\mapsto 10011,f\\mapsto 0\\}$ is non-singular."103    "2. However, the mapping $M_{2}=\\{\\,a\\mapsto 1,b\\mapsto 011,c\\mapsto 01110,d\\mapsto 1110,e\\mapsto 10011,f\\mapsto 0\\}$ is non-singular."
104   ]104   ]
105  },105  },
106  {106  {
107   "cell_type": "markdown",107   "cell_type": "markdown",
108   "metadata": {},108   "metadata": {},
109   "source": [109   "source": [
110    "### Uniquely decodable"110    "### Uniquely decodable"
111   ]111   ]
112  },112  },
113  {113  {
114   "cell_type": "markdown",114   "cell_type": "markdown",
115   "metadata": {},115   "metadata": {},
116   "source": [116   "source": [
117    "A code is **uniquely decodable** if its extension is non-singular.\n",117    "A code is **uniquely decodable** if its extension is non-singular.\n",
118    "\n",118    "\n",
119    "1. The mapping $M_{3}=\\{\\,a\\mapsto 0,b\\mapsto 01,c\\mapsto 011\\,\\}$ is uniquely decodable. \n",119    "1. The mapping $M_{3}=\\{\\,a\\mapsto 0,b\\mapsto 01,c\\mapsto 011\\,\\}$ is uniquely decodable. \n",
120    "2. Consider again the code $M_{2}$ from the previous section. This code is not uniquely decodable, since the string 011101110011 can be interpreted as the sequence of codewords 01110 – 1110 – 011, but also as the sequence of codewords 011 – 1 – 011 – 10011. The two decodings of this encoded string are cdb and babe."120    "2. Consider again the code $M_{2}$ from the previous section. This code is not uniquely decodable, since the string 011101110011 can be interpreted as the sequence of codewords 01110 – 1110 – 011, but also as the sequence of codewords 011 – 1 – 011 – 10011. The two decodings of this encoded string are cdb and babe."
121   ]121   ]
122  },122  },
123  {123  {
124   "cell_type": "markdown",124   "cell_type": "markdown",
125   "metadata": {},125   "metadata": {},
126   "source": [126   "source": [
127    "### Prefix"127    "### Prefix"
128   ]128   ]
129  },129  },
130  {130  {
131   "cell_type": "markdown",131   "cell_type": "markdown",
132   "metadata": {},132   "metadata": {},
133   "source": [133   "source": [
134    "For something to be a prefix code, the entire set of possible encoded values must not contain any values that start with any other value in the set. For example: $M_{4}=\\{\\,a\\mapsto 0,b\\mapsto 10,c\\mapsto 110\\,\\}$ is a prefix code, because none of the values start with any of the other values. However, $M_{3}=\\{\\,a\\mapsto 0,b\\mapsto 01,c\\mapsto 011\\,\\}$ is not a prefix code, because one of the values starts with another of the values.\n",134    "For something to be a prefix code, the entire set of possible encoded values must not contain any values that start with any other value in the set. For example: $M_{4}=\\{\\,a\\mapsto 0,b\\mapsto 10,c\\mapsto 110\\,\\}$ is a prefix code, because none of the values start with any of the other values. However, $M_{3}=\\{\\,a\\mapsto 0,b\\mapsto 01,c\\mapsto 011\\,\\}$ is not a prefix code, because one of the values starts with another of the values.\n",
135    "\n",135    "\n",
136    "Prefix codes are useful because you can pick out each value without needing to know where the value starts and ends."136    "Prefix codes are useful because you can pick out each value without needing to know where the value starts and ends."
137   ]137   ]
138  },138  },
139  {139  {
140   "cell_type": "markdown",140   "cell_type": "markdown",
141   "metadata": {},141   "metadata": {},
142   "source": [142   "source": [
143    "## What is the difference betwenn lossless and lossy compression?"143    "## What is the difference betwenn lossless and lossy compression?"
144   ]144   ]
145  },145  },
146  {146  {
147   "cell_type": "markdown",147   "cell_type": "markdown",
148   "metadata": {},148   "metadata": {},
149   "source": [149   "source": [
150    "### What is lossless compression?"150    "### What is lossless compression?"
151   ]151   ]
152  },152  },
153  {153  {
154   "cell_type": "markdown",154   "cell_type": "markdown",
155   "metadata": {},155   "metadata": {},
156   "source": [156   "source": [
157    "The Lossless compression method is capable of reconstituting the original form of the data.\n",157    "The Lossless compression method is capable of reconstituting the original form of the data.\n",
158    "The quality of the data is not compromised.\n",158    "The quality of the data is not compromised.\n",
159    "This technique allows a file to restore its original form.\n",159    "This technique allows a file to restore its original form.\n",
160    "Lossless compression can be applied to any file format can improve the performance of the compression ratio."160    "Lossless compression can be applied to any file format can improve the performance of the compression ratio."
161   ]161   ]
162  },162  },
163  {163  {
164   "cell_type": "markdown",164   "cell_type": "markdown",
165   "metadata": {},165   "metadata": {},
166   "source": [166   "source": [
167    "### What is lossy compression?"167    "### What is lossy compression?"
168   ]168   ]
169  },169  },
170  {170  {
171   "cell_type": "markdown",171   "cell_type": "markdown",
172   "metadata": {},172   "metadata": {},
173   "source": [173   "source": [
174    "The Lossy compression method eliminates some amount of data that is not noticeable.\n",174    "The Lossy compression method eliminates some amount of data that is not noticeable.\n",
175    "This technique **does not** allow a file to restore in its original form but significantly reduces the size.\n",175    "This technique **does not** allow a file to restore in its original form but significantly reduces the size.\n",
176    "The lossy compression technique is beneficial if the quality of the data is not your priority.\n",176    "The lossy compression technique is beneficial if the quality of the data is not your priority.\n",
177    "It slightly degrades the quality of the file or data but is convenient when one wants to send or store the data.\n",177    "It slightly degrades the quality of the file or data but is convenient when one wants to send or store the data.\n",
178    "This type of data compression is used for organic data like audio signals and images."178    "This type of data compression is used for organic data like audio signals and images."
179   ]179   ]
180  },180  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## When can we get away with lossy compression?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As the definition above states, lossy compression eliminates the data that is not noticeable.\n",
    "That is why we mostly use lossy compression for images, videos and audio files:\n",
    "most of the information is still there, and you would not notice a few milliseconds missing from a video or a couple of pixels missing from an image.\n",
    "Basically, everything except text. Unless you are into guessing missing characters in words, I guess text should stay lossless."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What is the Huffman Compression Algorithm?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Huffman Compression Algorithm is a prefix, lossless compression algorithm developed by David A. Huffman (then a student at MIT) and published in his 1952 paper. I guess you can say it is old. The main idea is to use shorter sequences of bits for more frequent characters."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To begin generating the Huffman tree, each character gets a weight equal to the number of times it\n",
    "occurs in the file. For example, in the \"happy hip hop\" example, the character 'p' has weight 4,\n",
    "'h' has weight 3, the space has weight 2, and the other characters have weight 1. Our first task is to\n",
    "calculate these weights.\n",
    "\n",
    "1. Create a collection of singleton trees, one for each character, with weight equal to the character frequency.\n",
    "2. From the collection, pick out the two trees with the smallest weights and remove them. Combine them into a new tree whose root has a weight equal to the sum of the weights of the two nodes.\n",
    "3. Add the new tree back into the collection.\n",
    "4. Repeat steps 2 and 3 until there is only one tree left.\n",
    "\n",
    "<img src=\"imgs/first.png\" style=\"width: 400px;\">"
   ]
  },
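  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick illustration of the weight-counting step (just a sketch using Python's collections.Counter, not the function we implement later), the weights for \"happy hip hop\" come out exactly as listed above:\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "weights = Counter(\"happy hip hop\")\n",
    "print(weights)\n",
    "# Counter({'p': 4, 'h': 3, ' ': 2, 'a': 1, 'y': 1, 'i': 1, 'o': 1})\n",
    "```"
   ]
  },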
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We start by choosing the two smallest nodes. There are four nodes with the minimal weight of one; it\n",
    "doesn't matter which two we pick. We choose 'o' and 'y' and combine them into a new tree whose\n",
    "root is the sum of the weights chosen. We replace those two nodes with the combined tree. The nodes\n",
    "remaining in the collection are shown in the light gray box at each stage.\n",
    "\n",
    "<img src=\"imgs/second.png\" style=\"width: 400px;\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we repeat that step; this time there is no choice for the minimal nodes, it must be 'a' and 'i'.\n",
    "We take those out and combine them into a weight-2 tree. Note how the collection of nodes shrinks by\n",
    "one each iteration (we remove two nodes and add a new one back in).\n",
    "\n",
    "<img src=\"imgs/third.png\" style=\"width: 400px;\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, we pull out the two smallest nodes and build a tree of weight 4:\n",
    "\n",
    "<img src=\"imgs/fourth.png\" style=\"width: 400px;\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One more iteration combines the weight-3 and weight-2 trees into a combined tree of weight 5:\n",
    "\n",
    "<img src=\"imgs/fifth.png\" style=\"width: 400px;\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Combining the two 4s gets a tree of weight 8:\n",
    "\n",
    "<img src=\"imgs/sixth.png\" style=\"width: 400px;\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And finally, we combine the last two to get our final tree. The root node of the final tree will always\n",
    "have a weight equal to the number of characters in the input file.\n",
    "\n",
    "<img src=\"imgs/seventh.png\" style=\"width: 400px;\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remember that it is essential that you use the same tree to do both encoding and decoding of your files.\n",
    "Since each Huffman tree creates a unique encoding of a particular file, you need to ensure that your\n",
    "decoding algorithm generates the exact same tree, so that you can get back the file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The Pseudo-EOF"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can now see that our tree has an asymmetrical (jagged) form. We have not talked about this, but trees usually should not look like this. What happens when we try to store a Huffman-encoded sequence on disk in a file? Each file on your computer is typically stored as a number of bytes (groups of eight bits). Therefore, if you try to write a Huffman-encoded string of bits into a file and its length is not an exact multiple of eight, your operating system will pad the end with arbitrary bits."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For example, the word \"ahoy\" translated with the Huffman tree above would look something like this:\n",
    "\n",
    "<center><b>1101001100111</b></center>\n",
    "\n",
    "Those are exactly thirteen bits, which means that 3 more will be added. Let's say that those 3 bits are **111**.\n",
    "\n",
    "In that case, the bit sequence would be written to disk as\n",
    "\n",
    "<center><b>1101001100111111</b></center>\n",
    "\n",
    "If we were to then load this back from disk and decode it into a sequence of characters, we would get \"ahoyi\". Even worse, if those random bits do not form a codeword in our tree (say \"000\"), we would get an error.\n",
    "\n",
    "This is clearly a problem. We need a special character which tells us where our compressed data ends. It won't be seen, and we can add it to our tree in the same way we have added all the other characters. Here is a possible tree with ■:\n",
    "\n",
    "<img src=\"imgs/eight.png\" style=\"width: 400px;\">\n",
    "\n",
    "If we encode \"happy hip hop■\" we get the following bitstring:\n",
    "\n",
    "<center><b>001100101011110110011011001100111010010</b></center>\n",
    "\n",
    "| Symbol | Codeword |\n",
    "|--------|----------|\n",
    "| h      | 00       |\n",
    "| a      | 1100     |\n",
    "| p      | 10       |\n",
    "| p      | 10       |\n",
    "| y      | 1111     |\n",
    "|        | 011      |\n",
    "| h      | 00       |\n",
    "| i      | 1101     |\n",
    "| p      | 10       |\n",
    "|        | 011      |\n",
    "| h      | 00       |\n",
    "| o      | 1110     |\n",
    "| p      | 10       |\n",
    "| ■      | 010      |\n",
    "| ignore | ignore   |\n",
    "\n",
    "This ■ is called a *pseudo end of file*, or *Pseudo-EOF* for short."
   ]
  },
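  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a minimal sketch of the idea, assuming the code table from the figure above (the function names and the string 'EOF' standing in for ■ are hypothetical). We append the Pseudo-EOF codeword, pad the bits to a whole number of bytes, and stop decoding as soon as the Pseudo-EOF appears, so the padding is never misread:\n",
    "\n",
    "```python\n",
    "CODE = {'h': '00', 'a': '1100', 'p': '10', 'y': '1111',\n",
    "        ' ': '011', 'i': '1101', 'o': '1110', 'EOF': '010'}\n",
    "\n",
    "def encode_with_eof(text):\n",
    "    bits = ''.join(CODE[c] for c in text) + CODE['EOF']\n",
    "    return bits + '0' * (-len(bits) % 8)  # pad to a multiple of 8\n",
    "\n",
    "def decode_until_eof(bits):\n",
    "    lookup = {v: k for k, v in CODE.items()}\n",
    "    result, buffer = '', ''\n",
    "    for bit in bits:\n",
    "        buffer += bit\n",
    "        if buffer in lookup:\n",
    "            if lookup[buffer] == 'EOF':\n",
    "                break  # everything after this is padding\n",
    "            result += lookup[buffer]\n",
    "            buffer = ''\n",
    "    return result\n",
    "\n",
    "decode_until_eof(encode_with_eof('happy hip hop'))  # 'happy hip hop'\n",
    "```"
   ]
  },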
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From here on we are going to write some code, so it is better to gather all of our imports in one place."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's implement it with Python."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we will create our NodeTree class. A leaf is represented by a plain string; internal nodes hold a left and a right subtree.\n",
    "\n",
    "```python\n",
    "class NodeTree(object):\n",
    "\n",
    "    def __init__(self, left=None, right=None):\n",
    "        self.left = left\n",
    "        self.right = right\n",
    "\n",
    "    def children(self):\n",
    "        return (self.left, self.right)\n",
    "\n",
    "    def __str__(self):\n",
    "        return '%s_%s' % (self.left, self.right)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to calculate the characters' frequencies.\n",
    "\n",
    "```python\n",
    "def calculate_frequency(data):\n",
    "    freq = {}\n",
    "    for c in data:\n",
    "        if c in freq:\n",
    "            freq[c] += 1\n",
    "        else:\n",
    "            freq[c] = 1\n",
    "    return freq\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After that we sort them by frequency in descending order, so that the two rarest characters always sit at the end of the list:\n",
    "\n",
    "```python\n",
    "freq = sorted(freq.items(), key=lambda x: x[1], reverse=True)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we need to create the function that merges our nodes:\n",
    "\n",
    "```python\n",
    "def merge(nodes):\n",
    "    while len(nodes) > 1:\n",
    "        (key1, c1) = nodes[-1]\n",
    "        (key2, c2) = nodes[-2]\n",
    "        nodes = nodes[:-2]\n",
    "        node = NodeTree(key1, key2)\n",
    "        nodes.append((node, c1 + c2))\n",
    "\n",
    "        nodes = sorted(nodes, key=lambda x: x[1], reverse=True)\n",
    "    return nodes\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we walk the finished tree and collect the codeword for each character into a dictionary:\n",
    "\n",
    "```python\n",
    "def huffman_code_tree(node, left=True, binString=''):\n",
    "    if type(node) is str:\n",
    "        return {node: binString}\n",
    "    (l, r) = node.children()\n",
    "    d = dict()\n",
    "    d.update(huffman_code_tree(l, True, binString + '0'))\n",
    "    d.update(huffman_code_tree(r, False, binString + '1'))\n",
    "    return d\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All combined, it should look something like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "class NodeTree(object):\n",
    "    # a leaf is a plain string; internal nodes hold two subtrees\n",
    "\n",
    "    def __init__(self, left=None, right=None):\n",
    "        self.left = left\n",
    "        self.right = right\n",
    "\n",
    "    def children(self):\n",
    "        return (self.left, self.right)\n",
    "\n",
    "    def __str__(self):\n",
    "        return '%s_%s' % (self.left, self.right)\n",
    "\n",
    "def calculate_frequency(data):\n",
    "    freq = {}\n",
    "    for c in data:\n",
    "        if c in freq:\n",
    "            freq[c] += 1\n",
    "        else:\n",
    "            freq[c] = 1\n",
    "    return freq\n",
    "\n",
    "def huffman_code_tree(node, left=True, binString=''):\n",
    "    # walk the tree: '0' for every left edge, '1' for every right edge\n",
    "    if type(node) is str:\n",
    "        return {node: binString}\n",
    "    (l, r) = node.children()\n",
    "    d = dict()\n",
    "    d.update(huffman_code_tree(l, True, binString + '0'))\n",
    "    d.update(huffman_code_tree(r, False, binString + '1'))\n",
    "    return d\n",
    "\n",
    "def merge(nodes):\n",
    "    # repeatedly combine the two lightest trees (at the tail of the\n",
    "    # descending-sorted list) until a single tree remains\n",
    "    while len(nodes) > 1:\n",
    "        (key1, c1) = nodes[-1]\n",
    "        (key2, c2) = nodes[-2]\n",
    "        nodes = nodes[:-2]\n",
    "        node = NodeTree(key1, key2)\n",
    "        nodes.append((node, c1 + c2))\n",
    "\n",
    "        nodes = sorted(nodes, key=lambda x: x[1], reverse=True)\n",
    "    return nodes\n",
    "\n",
    "def encode(text, table):\n",
    "    result = \"\"\n",
    "\n",
    "    for c in text:\n",
    "        result += table[c]\n",
    "    return result\n",
    "\n",
    "def decode(encoded_text, tree):\n",
    "    decoded_text = \"\"\n",
    "\n",
    "    buffer = \"\"\n",
    "\n",
    "    # the code is prefix-free, so the first codeword the buffer\n",
    "    # matches is always the right one\n",
    "    for i in encoded_text:\n",
    "        buffer += i\n",
    "        if buffer in tree.values():\n",
    "            value = str([k for k, v in tree.items() if v == buffer][0])\n",
    "            decoded_text += value\n",
    "            buffer = \"\"\n",
    "    return decoded_text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compress(string):\n",
    "    freq = calculate_frequency(string)\n",
    "\n",
    "    freq = sorted(freq.items(), key=lambda x: x[1], reverse=True)\n",
    "\n",
    "    nodes = freq\n",
    "\n",
    "    nodes = merge(nodes)\n",
    "\n",
    "    huffmanCode = huffman_code_tree(nodes[0][0])\n",
    "\n",
    "    return (huffmanCode, freq, encode(string, huffmanCode))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And of course a function to uncompress it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def uncompress(encoded_text, tree):\n",
    "    return decode(encoded_text, tree)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's also good to have a function to print the encoding tree."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def print_tree(huffmanCode, freq):\n",
    "    print(' Char | Huffman code ')\n",
    "    print('----------------------')\n",
    "    for (char, frequency) in freq:\n",
    "        print(' %-4r |%12s' % (char, huffmanCode[char]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's implement the data-reading function in advance, because we are going to use it a lot later on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_data(file):\n",
    "    # read the first line of the file\n",
    "    with open(\"data/\" + file + \".txt\") as data:\n",
    "        return data.readline()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = get_data(\"non_random_text\")\n",
    "(code, freq, encoding) = compress(text)\n",
    "uncompress(encoding, code) == text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the code above we see that our uncompress function works!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "text = get_data(\"non_random_text\")\n",
    "(code, freq, encoding) = compress(text)\n",
    "print_tree(code, freq)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we can see how the most common character, in this case \"e\", is encoded with fewer bits because its frequency is higher, while \"K\" is encoded with more bits because its frequency is lower. I wonder what would happen if we gave more \"random\" data to our algorithm?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "text = get_data(\"random_text\")\n",
    "(code, freq, encoding) = compress(text)\n",
    "print_tree(code, freq)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Interesting... I expected a more uneven distribution of the frequencies. I wonder if there is a correlation between the character count and the randomness of our data. We can easily calculate the randomness, but first let's look at the compressed/uncompressed ratio."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we need a function to count the number of bits in a specific code tree."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def calculate_bits(code, freq):\n",
    "    bits = 0\n",
    "    for (char, count) in freq:\n",
    "        bits += len(code[char]) * count\n",
    "    return bits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the function below we can calculate how much a particular code tree compresses the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def comparisson(file):\n",
    "    text = get_data(file)\n",
    "\n",
    "    (code, freq, encoding) = compress(text)\n",
    "\n",
    "    after_compression = calculate_bits(code, freq)\n",
    "\n",
    "    # size in bits of the file we were actually given\n",
    "    before_compression = len(bin(int.from_bytes(text.encode(), 'big')))\n",
    "\n",
    "    percentage = (after_compression / before_compression) * 100\n",
    "\n",
    "    print(str(before_compression) + \" bits before\")\n",
    "    print(str(after_compression) + \" bits after\")\n",
    "    print(\"%.2f\" % percentage + '%')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "comparisson(\"non_random_text\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That means the compressed file takes up $45.52\\%$ of the original file's space. Let's try that with the random data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "comparisson(\"random_text\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It really should not be surprising that here the ratio comes out to only $19.95\\%$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can definitely say that more unpredictable data compresses worse. We don't know why yet, but soon we might."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That is the file \"non_random_text\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "get_data(\"non_random_text\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And that is the file \"random_text\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "get_data(\"random_text\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What is entropy?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Entropy, or more specifically Shannon entropy, helps us calculate the \"randomness\" of a distribution, in our case of a text."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Information for an Event"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The intuition behind quantifying information is the idea of measuring how much surprise there is in an event. Those events that are rare (low probability) are more surprising and therefore have more information than those events that are common (high probability).\n",
    "\n",
    "In other words:\n",
    "\n",
    "1. **Low Probability Event**: High Information (surprising).\n",
    "2. **High Probability Event**: Low Information (unsurprising).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$h(p_i) = - \\log_{2}{p_i}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The log is base 2 because we want to measure our information in bits. The negative sign ensures that the result is always positive or zero. Consider a flip of a single fair coin: the probability of heads (and of tails) is 0.5. Let's make some examples with Python!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def information_for_p(p):\n",
    "    # h(p) = -log2(p), via the change-of-base formula\n",
    "    h = - np.log(p) / np.log(2)\n",
    "    print('p(x)=%.3f, information: %.3f bits' % (p, h))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "information_for_p(0.5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When the probability of our event is 0.5, the information content of that event is 1 bit. If we flip the coin, say, 100 times, the information for the whole sequence would be 100 bits. (That's called foreshadowing.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now consider that our coin is not fair and the chance of getting heads is 0.9."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "information_for_p(0.9)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here our event is more frequent and thus its information content is smaller."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's plot the function $-\\log_2{p_i}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "probs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]\n",
    "\n",
    "info = [-np.log(p) / np.log(2) for p in probs]\n",
    "\n",
    "plt.plot(probs, info, marker='.')\n",
    "plt.title('Probability vs Information')\n",
    "plt.xlabel('Probability')\n",
    "plt.ylabel('Information')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We should not be surprised by two things: first, that the graph is not linear, which is normal because our function is a logarithm; and second, the inverse correlation between information content and probability."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Calculate the Entropy for a Random Variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def entropy(probabilities, base):\n",
    "    # entropy is the expected information content over all outcomes\n",
    "    h = 0\n",
    "    for p in probabilities:\n",
    "        h += -np.log(p) / np.log(base) * p\n",
    "    return h"
   ]
  },
963  {963  {
964   "cell_type": "code",964   "cell_type": "code",
965   "execution_count": null,965   "execution_count": null,
966   "metadata": {},966   "metadata": {},
967   "outputs": [],967   "outputs": [],
968   "source": [968   "source": [
969    "p = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]\n",969    "p = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]\n",
970    "print('entropy: %.3f bits' % entropy(p, 2))"970    "print('entropy: %.3f bits' % entropy(p, 2))"
971   ]971   ]
972  },972  },
973  {973  {
974   "cell_type": "markdown",974   "cell_type": "markdown",
975   "metadata": {},975   "metadata": {},
976   "source": [976   "source": [
977    "Mathematically written it should look something like that \n",977    "Mathematically written it should look something like that \n",
978    "\n",978    "\n",
979    "$$H(x) = -\\sum_{i = 1}^n p_i {\\log_2{p_i}}$$"979    "$$H(x) = -\\sum_{i = 1}^n p_i {\\log_2{p_i}}$$"
980   ]980   ]
981  },981  },
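  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check of the die example above: every outcome has probability $\\frac{1}{6}$, so the sum collapses to $H(x) = -\\log_2{\\frac{1}{6}} = \\log_2{6} \\approx 2.585$ bits, which is what the function prints."
   ]
  },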
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are able to calculate the entropy of our distribution. We also need the character frequencies, which calculate_frequency from earlier already gives us, plus a function to calculate the probability of each single character."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def calculate_probability(frequencies, base_count):\n",
    "    prob = []\n",
    "    for f in frequencies.values():\n",
    "        prob.append(f/base_count)\n",
    "    return prob\n",
    "\n",
    "def entropy(data, base):\n",
    "    frequencies = calculate_frequency(data)\n",
    "    probabilities = calculate_probability(frequencies, len(data))\n",
    "\n",
    "    h = 0\n",
    "\n",
    "    for p in probabilities:\n",
    "        h += (np.log(p) / np.log(base)) * p\n",
    "\n",
    "    return -h"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "string = get_data(\"random_text\")\n",
    "base = 2\n",
    "\n",
    "entropy(string, base)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "string = get_data(\"non_random_text\")\n",
    "base = 2\n",
    "\n",
    "entropy(string, base)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For our random text the entropy is approximately 6.35, which means that on average we need 6.35 bits to represent a single character. With that said, high entropy comes with unpredictability, which is a synonym for randomness; small entropy, on the other hand, means more predictability, and you know the rest. Side note: at first I thought there would be a correlation between the character frequencies and the entropy of our data, but there is none."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have come to the conclusion that Huffman coding is an entropy encoding scheme."
   ]
  },
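  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can check this connection numerically with the functions defined above (a minimal sketch, assuming the data files from earlier): by Shannon's source coding theorem, the average number of bits per character that our Huffman code uses should sit at or just above the entropy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = get_data(\"non_random_text\")\n",
    "(code, freq, encoding) = compress(text)\n",
    "\n",
    "# average codeword length, weighted by character frequency\n",
    "avg_bits = calculate_bits(code, freq) / len(text)\n",
    "\n",
    "print('entropy:     %.3f bits/char' % entropy(text, 2))\n",
    "print('Huffman avg: %.3f bits/char' % avg_bits)"
   ]
  },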
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How can we get back the uncompressed data from the Huffman tree?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One problem we will encounter while trying to send the compressed data is that the receiver does not have the tree needed to recover the data. There are a couple of ways to resolve this. We could agree on the coding tree in advance, but for that we would need to know the frequencies of the letters in advance. The second option is to prefix the bit sequence with a header containing enough information to reconstruct the Huffman encoding tree. You can store that header at the start of the file in a human-readable format; reading it allows you to recreate the tree."
   ]
  },
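  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a rough sketch of the second option (the pack/unpack names and the header format are made up for illustration, not a real file format): we write the frequency table as a header line, and the receiver rebuilds exactly the same tree from it with the merge and huffman_code_tree functions from earlier. Since merge is deterministic, the same frequency list always produces the same tree."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "def pack(text):\n",
    "    # assumes the text itself contains no newline\n",
    "    freq = sorted(calculate_frequency(text).items(),\n",
    "                  key=lambda x: x[1], reverse=True)\n",
    "    code = huffman_code_tree(merge(freq)[0][0])\n",
    "    return json.dumps(freq) + '\\n' + encode(text, code)\n",
    "\n",
    "def unpack(payload):\n",
    "    header, bits = payload.split('\\n', 1)\n",
    "    freq = [tuple(pair) for pair in json.loads(header)]\n",
    "    code = huffman_code_tree(merge(freq)[0][0])\n",
    "    return decode(bits, code)\n",
    "\n",
    "unpack(pack('happy hip hop'))  # 'happy hip hop'"
   ]
  },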
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comparison with other lossless compression algorithms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to use an existing comparison; the file sizes are shown below. But first, let's define all the compression algorithms mentioned."
   ]
  },
1079  {1079  {
1080   "cell_type": "markdown",1080   "cell_type": "markdown",
1081   "metadata": {},1081   "metadata": {},
1082   "source": [1082   "source": [
1083    "**Run-length encoding (RLE)** is a form of lossless data compression in which runs of data (sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run. For example this string \"aaaaaeeeeeffff\" translates to \"5a5e4f\". As you can see this works only if we have consecutive characters."1083    "**Run-length encoding (RLE)** is a form of lossless data compression in which runs of data (sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run. For example this string \"aaaaaeeeeeffff\" translates to \"5a5e4f\". As you can see this works only if we have consecutive characters."
1084   ]1084   ]
1085  },1085  },
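  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick sketch of that transformation, using single-digit counts exactly like the example above (a real format would have to handle longer runs and digits appearing in the data):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from itertools import groupby\n",
    "\n",
    "def rle_encode(text):\n",
    "    # Each run of identical characters becomes count + character.\n",
    "    return ''.join(f'{len(list(run))}{char}' for char, run in groupby(text))\n",
    "\n",
    "def rle_decode(encoded):\n",
    "    # Assumes single-digit counts, like the example above.\n",
    "    return ''.join(encoded[i + 1] * int(encoded[i]) for i in range(0, len(encoded), 2))\n",
    "\n",
    "print(rle_encode('aaaaaeeeeeffff'))  # 5a5e4f\n",
    "print(rle_decode('5a5e4f'))         # aaaaaeeeeeffff"
   ]
  },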
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Shannon-Fano algorithm** is an entropy encoding technique for lossless data compression. Named after Claude Shannon and Robert Fano, it assigns a code to each symbol based on its probability of occurrence. It is a variable-length encoding scheme, that is, the codes assigned to the symbols will be of varying length."
   ]
  },
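  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the splitting step (our own simplified version): sort the symbols by frequency, cut the list where the totals of the two halves are closest to equal, give one half a 0 and the other a 1, and recurse."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def shannon_fano(symbols):\n",
    "    # symbols: list of (symbol, frequency) pairs.\n",
    "    codes = {}\n",
    "\n",
    "    def split(items, prefix):\n",
    "        if len(items) == 1:\n",
    "            codes[items[0][0]] = prefix or '0'\n",
    "            return\n",
    "        # Find the cut that best balances the two halves' total frequency.\n",
    "        total = sum(f for _, f in items)\n",
    "        running, best_i, best_diff = 0, 1, float('inf')\n",
    "        for i in range(1, len(items)):\n",
    "            running += items[i - 1][1]\n",
    "            diff = abs(total - 2 * running)\n",
    "            if diff < best_diff:\n",
    "                best_diff, best_i = diff, i\n",
    "        split(items[:best_i], prefix + '0')\n",
    "        split(items[best_i:], prefix + '1')\n",
    "\n",
    "    split(sorted(symbols, key=lambda p: -p[1]), '')\n",
    "    return codes\n",
    "\n",
    "# Classic example: a=00, b=01, c=10, d=110, e=111\n",
    "print(shannon_fano([('a', 15), ('b', 7), ('c', 6), ('d', 6), ('e', 5)]))"
   ]
  },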
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Adaptive Huffman coding** (also called **Dynamic Huffman coding**) is an adaptive technique based on Huffman coding, with the difference that it builds the tree as the symbols are being sent, with no prior knowledge of the characters' frequencies. It is used in live video and audio compression."
   ]
  },
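  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Real adaptive Huffman coders (FGK, Vitter) update the tree incrementally. The naive sketch below only illustrates the core idea: encoder and decoder keep identical running counts and rebuild the same tree after every symbol, so no tree ever has to be transmitted. Every symbol starts with a count of 1 here; real schemes use an escape code for unseen symbols instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import heapq\n",
    "from collections import Counter\n",
    "\n",
    "def build_codes(counts):\n",
    "    # Standard heap-based Huffman build over the current counts.\n",
    "    # Deterministic tie-breaking keeps encoder and decoder in sync.\n",
    "    # Assumes at least two distinct symbols.\n",
    "    heap = [(f, i, {s: ''}) for i, (s, f) in enumerate(sorted(counts.items()))]\n",
    "    heapq.heapify(heap)\n",
    "    tie = len(heap)\n",
    "    while len(heap) > 1:\n",
    "        f1, _, c1 = heapq.heappop(heap)\n",
    "        f2, _, c2 = heapq.heappop(heap)\n",
    "        merged = {s: '0' + c for s, c in c1.items()}\n",
    "        merged.update({s: '1' + c for s, c in c2.items()})\n",
    "        heapq.heappush(heap, (f1 + f2, tie, merged))\n",
    "        tie += 1\n",
    "    return heap[0][2]\n",
    "\n",
    "def adaptive_encode(message, alphabet):\n",
    "    counts = Counter({s: 1 for s in alphabet})\n",
    "    bits = []\n",
    "    for sym in message:\n",
    "        bits.append(build_codes(counts)[sym])  # code from the current tree\n",
    "        counts[sym] += 1                       # then update the counts\n",
    "    return ''.join(bits)\n",
    "\n",
    "def adaptive_decode(bits, alphabet):\n",
    "    counts = Counter({s: 1 for s in alphabet})\n",
    "    out, i = [], 0\n",
    "    while i < len(bits):\n",
    "        codes = {v: k for k, v in build_codes(counts).items()}\n",
    "        j = i + 1\n",
    "        while bits[i:j] not in codes:  # prefix-free, so this is unambiguous\n",
    "            j += 1\n",
    "        sym = codes[bits[i:j]]\n",
    "        out.append(sym)\n",
    "        counts[sym] += 1\n",
    "        i = j\n",
    "    return ''.join(out)\n",
    "\n",
    "msg = 'abracadabra'\n",
    "bits = adaptive_encode(msg, 'abcdr')\n",
    "print(adaptive_decode(bits, 'abcdr') == msg)  # True"
   ]
  },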
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Arithmetic coding** is a form of entropy encoding used in lossless data compression. Normally, a string of characters such as \"hello there\" is represented using a fixed number of bits per character, as in the ASCII code. Arithmetic coding instead encodes the entire message as a single number, an arbitrary-precision fraction in the interval $[0, 1)$."
   ]
  },
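  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A tiny sketch of that interval idea: each symbol narrows $[low, high)$ in proportion to its probability, and any number inside the final interval identifies the whole message. This toy version keeps exact fractions; real coders emit bits incrementally with finite precision."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from fractions import Fraction\n",
    "\n",
    "def arithmetic_encode(message, probs):\n",
    "    # Give each symbol a sub-interval of [0, 1) by cumulative probability.\n",
    "    ranges, cum = {}, Fraction(0)\n",
    "    for sym, p in probs:\n",
    "        ranges[sym] = (cum, cum + p)\n",
    "        cum += p\n",
    "    low, high = Fraction(0), Fraction(1)\n",
    "    for sym in message:\n",
    "        width = high - low\n",
    "        s_low, s_high = ranges[sym]\n",
    "        low, high = low + width * s_low, low + width * s_high\n",
    "    return low, high  # any number in [low, high) encodes the message\n",
    "\n",
    "def arithmetic_decode(x, probs, length):\n",
    "    # Repeatedly find which sub-interval x falls into, then rescale x.\n",
    "    out = []\n",
    "    for _ in range(length):\n",
    "        cum = Fraction(0)\n",
    "        for sym, p in probs:\n",
    "            if cum <= x < cum + p:\n",
    "                out.append(sym)\n",
    "                x = (x - cum) / p\n",
    "                break\n",
    "            cum += p\n",
    "    return ''.join(out)\n",
    "\n",
    "probs = [('h', Fraction(1, 4)), ('e', Fraction(1, 4)), ('l', Fraction(1, 2))]\n",
    "low, high = arithmetic_encode('hell', probs)\n",
    "print(low, high)                         # 7/64 1/8\n",
    "print(arithmetic_decode(low, probs, 4))  # hell"
   ]
  },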
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"imgs\\datasize.png\" style=\"width: 500px;\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"imgs\\comparison.png\" style=\"width: 700px;\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The overall behaviour of Shannon-Fano coding, Huffman coding and Adaptive Huffman coding is very similar to that of Arithmetic coding. Unlike Huffman coding, however, no code tree needs to be transmitted to the receiver, and encoding is done on a group of symbols rather than symbol by symbol, which leads to higher compression ratios."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
1139    "We have talked about what huffman coding is, how is generated. We made a wrong hypothesis that data with evenly distrubuted character counts is not random. We have talked about entrophy and how and why exactly we mesure one's data randomness. We made a test that confirms it. This is supposed to be a brief explanation about Huffman coding and it's properties.       "1139    "We have talked about what huffman coding is, how is generated. We made a wrong hypothesis that data with evenly distrubuted character counts is not random. We have talked about entrophy and how and why exactly we mesure one's data randomness. We made a test that confirms it. This is supposed to be a brief explanation about Huffman coding and it's properties.       "
1140   ]1140   ]
1141  },1141  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Resources used:\n",
    "\n",
    "NumPy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
1162    "1. [https://en.wikipedia.org/wiki/Huffman_coding](https://en.wikipedia.org/wiki/Huffman_coding)\n",1162    "1. [https://en.wikipedia.org/wiki/Huffman_coding](https://en.wikipedia.org/wiki/Huffman_coding)\n",
1163    "2. [https://www.youtube.com/watch?v=dM6us854Jk0](https://www.youtube.com/watch?v=dM6us854Jk0)\n",1163    "2. [https://www.youtube.com/watch?v=dM6us854Jk0](https://www.youtube.com/watch?v=dM6us854Jk0)\n",
1164    "3. [https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1126/handouts/220%20Huffman%20Encoding.pdf](https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1126/handouts/220%20Huffman%20Encoding.pdf)\n",1164    "3. [https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1126/handouts/220%20Huffman%20Encoding.pdf](https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1126/handouts/220%20Huffman%20Encoding.pdf)\n",
1165    "4. [http://160592857366.free.fr/joe/ebooks/ShareData/A%20Comparitive%20Study%20of%20Text%20Compression%20Algorithms.pdf](http://160592857366.free.fr/joe/ebooks/ShareData/A%20Comparitive%20Study%20of%20Text%20Compression%20Algorithms.pdf)\n",1165    "4. [http://160592857366.free.fr/joe/ebooks/ShareData/A%20Comparitive%20Study%20of%20Text%20Compression%20Algorithms.pdf](http://160592857366.free.fr/joe/ebooks/ShareData/A%20Comparitive%20Study%20of%20Text%20Compression%20Algorithms.pdf)\n",
1166    "5. [http://home.cse.ust.hk/faculty/golin/COMP271Sp03/Notes/MyL17.pdf](http://home.cse.ust.hk/faculty/golin/COMP271Sp03/Notes/MyL17.pdf)\n",1166    "5. [http://home.cse.ust.hk/faculty/golin/COMP271Sp03/Notes/MyL17.pdf](http://home.cse.ust.hk/faculty/golin/COMP271Sp03/Notes/MyL17.pdf)\n",
1167    "6. [https://www.programiz.com/dsa/huffman-coding](https://www.programiz.com/dsa/huffman-coding) - Huffman source code\n",1167    "6. [https://www.programiz.com/dsa/huffman-coding](https://www.programiz.com/dsa/huffman-coding) - Huffman source code\n",
1168    "7. [https://en.wikipedia.org/wiki/Prefix_code](https://en.wikipedia.org/wiki/Prefix_code)\n",1168    "7. [https://en.wikipedia.org/wiki/Prefix_code](https://en.wikipedia.org/wiki/Prefix_code)\n",
1169    "8. [https://www.tutorialspoint.com/huffman-codes-and-entropy-in-data-structure](https://www.tutorialspoint.com/huffman-codes-and-entropy-in-data-structure)\n",1169    "8. [https://www.tutorialspoint.com/huffman-codes-and-entropy-in-data-structure](https://www.tutorialspoint.com/huffman-codes-and-entropy-in-data-structure)\n",
1170    "9. [https://machinelearningmastery.com/what-is-information-entropy/](https://machinelearningmastery.com/what-is-information-entropy/)\n",1170    "9. [https://machinelearningmastery.com/what-is-information-entropy/](https://machinelearningmastery.com/what-is-information-entropy/)\n",
1171    "10. [https://en.wikipedia.org/wiki/Variable-length_code](https://en.wikipedia.org/wiki/Variable-length_code)\n",1171    "10. [https://en.wikipedia.org/wiki/Variable-length_code](https://en.wikipedia.org/wiki/Variable-length_code)\n",
1172    "11. [https://leimao.github.io/blog/Huffman-Coding/](https://leimao.github.io/blog/Huffman-Coding/)\n",1172    "11. [https://leimao.github.io/blog/Huffman-Coding/](https://leimao.github.io/blog/Huffman-Coding/)\n",
1173    "12. [https://www.princeton.edu/~cuff/ele201/kulkarni_text/information.pdf](https://www.princeton.edu/~cuff/ele201/kulkarni_text/information.pdf)"1173    "12. [https://www.princeton.edu/~cuff/ele201/kulkarni_text/information.pdf](https://www.princeton.edu/~cuff/ele201/kulkarni_text/information.pdf)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}