#!/usr/bin/env python
# coding: utf-8

# # Data Structures
#
# A _data structure_ is, as the name suggests, a way to store data ("information") in a structured way. For example, a Python ```list``` is a data structure: it stores a sequence of items. A ```string``` is also a data structure: it stores a sequence of characters.
#
# Data structures can be _static_ or _dynamic_. We say a data structure is static if you have to give it all the data up front and, once you create it, you can never change the data. A dynamic data structure is the opposite: it is one which supports changing the data. (Other words which have similar meanings to static and dynamic are ```immutable``` and ```mutable```, respectively.)
#
# ```list``` is an example of a dynamic data structure, since you can for example do ```L[i] = j``` to change elements in a ```list```. An example of a static data structure in Python would be ```string```: the last line of the cell below fails with an error.

# In[104]:


s = 'abc'
print(s)
s[0] = 'd'  # raises TypeError: 'str' object does not support item assignment


# Do not be confused by the example below though. Just because we can assign the *variable* ```s``` to a different string doesn't mean strings are dynamic. In the example below, we aren't changing the first string ```'abc'``` that we created. We are creating a brand new string and reassigning the variable ```s``` to this brand new string.

# In[11]:


s = 'abc'
print(s)
s = 'dbc'
print(s)


# When discussing a particular data structure, whether static or dynamic, we have to understand what we want from our data structure. In the case of both static and dynamic data structures, there is some set of _queries_ that we want the data structure $D$ to allow us to answer about the data. Additionally, for dynamic data structures, there are some types of _updates_ we want to support.
#
# For example, if ```L``` is a ```list``` data structure, then it stores a sequence of items and supports queries of the form ```query(i)```, which return the $i$th item in the sequence, counting from $0$ (in Python, the syntax for executing this query is to type ```L[i]```).
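# The difference between queries and updates can be seen in a short sketch. It is only an illustration: the variable names (```first```, ```second```, ```update_worked```) are made up for this example and do not appear elsewhere in these notes. A ```list``` answers queries and accepts updates, while a string answers queries but rejects updates.

```python
# Dynamic: a list supports both queries and updates.
L = [10, 20, 30]
first = L[0]      # query(0): read the item at index 0
L[1] = 99         # update: change an item in place
L.append(40)      # update: add a new item at the end

# Static: a string supports queries but not updates.
s = 'abc'
second = s[1]     # query(1) works fine
try:
    s[0] = 'd'    # attempted update
    update_worked = True
except TypeError:
    update_worked = False  # strings are immutable, so we end up here

print(L, first, second, update_worked)
```

# Catching the ```TypeError``` lets the rest of the code keep running while still showing that the string was never changed.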
# # The Dynamic Dictionary Problem
#
# One data structural problem that comes up a lot is the _dynamic dictionary problem_. In this problem our data is a set $S$ of $n$ items, where each item is a (key, value) pair: $(k_0,v_0),\ldots,(k_{n-1},v_{n-1})$. For example, the keys might be names of people (as strings), and the values might be the city where the person is from. We will want our data structure $D$ to support several operations:
#
# * ```update(k, v)```: if an item with key $k$ exists in $S$, change its value to $v$; otherwise insert a new item $(k,v)$ into $S$
# * ```remove(k)```: remove the item from $S$, if such an item exists, that has key $k$
# * ```query(k)```: return the value $v$ associated with key $k$ in $S$ (or report if no such item exists)
#
# Python has a built-in implementation of a dynamic dictionary data structure called ```dict```. There are multiple ways to create a dictionary data structure:

# In[106]:


D = dict() # here's one way to initialize an empty dict
D2 = {} # here's another way to do the same thing
print(D)
print(D2)
D['Gelila Demeke'] = 'Harar' # this is how to do "update" in Python, where k='Gelila Demeke', v='Harar'
print(D)
print("Let's do a query")
print(D['Gelila Demeke']) # this is how to do a query
D['Mikias Shimeles'] = 'Kombocha' # we can have more than one person in the dictionary
print(D)
D.pop('Gelila Demeke') # this is the remove operation
print(D)
#D.pop('Gelila Demeke') # in Python, if you try to remove a key that doesn't exist, there's an error
print("Let's try this again")
D.pop('Gelila Demeke', None) # to not get an error, you need to tell Python what to return if the key doesn't exist
D[50] = [1,2,3] # keys and values don't have to be strings
print(D)
print(D.keys()) # a function that gives you the keys; you can iterate over this return type
for x in D.keys():
    print(x)


# ### How are dynamic dictionaries implemented?
#
# The curious student may wonder: how did the creators of Python themselves implement ```dict```?
#
# * One way to implement a dynamic dictionary is as a ```list```: we can store the dictionary as [[$k_0,v_0$],[$k_1$,$v_1$],$\ldots$,[$k_{n-1},v_{n-1}$]]. Then all three operations take time $O(n)$. For example, here's how we could implement ```query``` and ```update```:

# In[32]:


def query(D, k):
    # scan every [key, value] pair until we find key k
    for item in D:
        if item[0] == k:
            return item[1]
    return None # return None if key k is not in the set of items


# In[107]:


def update(D, k, v):
    # if key k is already present, overwrite its value
    for item in D:
        if item[0] == k:
            item[1] = v
            return
    # otherwise add a new [key, value] pair at the end
    D.append([k,v])


# $O(n)$ time though is quite slow to query a dictionary of $n$ items, so one might ask: can we do better? If the keys are comparable (that means ```k < k'``` is a supported operation for two different keys ```k``` and ```k'```), then we could store the items in an array that is sorted by key. Then we could perform a query using binary search, taking only $O(\log n)$ time.

# In[108]:


# search for key k in D[a:b]; D must be sorted by key
def query(D, k, a=-1, b=-1):
    if a==-1:
        # first call: search the whole list
        return query(D, k, 0, len(D))
    elif a==b:
        # empty range: key k is not present
        return None
    else:
        mid = (a+b)//2
        if D[mid][0] == k:
            return D[mid][1]
        elif D[mid][0] < k:
            return query(D, k, mid+1, b)
        else:
            return query(D, k, a, mid)

D = [['Abraham', 16], ['Bemnet', 14]]
print(query(D, 'Abraham'))
print(query(D, 'Jelani'))


# Though, there is a problem with this solution. What is it?

# The problem is that it's not fast to make this solution dynamic! If you insert a new item $(k,v)$, you cannot just append it to the end of the list since that might violate the sorted order. You have to find where the new item goes in the sorted order and put it there. Even if binary search finds the position quickly, every item after that position has to shift over by one, so ```update``` would take $O(n)$ time! There is a way around this using a data structure called a _balanced binary search tree_, but we will not discuss that data structure in this class.
#
# What does Python actually do?
#
# Python uses _hash tables_!
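# Before moving on to hash tables, here is a minimal sketch of the sorted-array ```update``` described above. The function names ```find_position``` and ```sorted_update```, the list ```roster```, and the extra name ```'Aster'``` are made up for this illustration. Finding the position takes $O(\log n)$ time, but ```list.insert``` has to shift later items over, which is what makes the whole operation $O(n)$.

```python
def find_position(D, k):
    """Binary search: return the smallest index i with D[i][0] >= k,
    or len(D) if every key in D is smaller than k."""
    lo, hi = 0, len(D)
    while lo < hi:
        mid = (lo + hi) // 2
        if D[mid][0] < k:
            lo = mid + 1
        else:
            hi = mid
    return lo

def sorted_update(D, k, v):
    """Insert or overwrite (k, v) in a list of [key, value] pairs sorted by key."""
    i = find_position(D, k)       # fast: O(log n) comparisons
    if i < len(D) and D[i][0] == k:
        D[i][1] = v               # key already present: overwrite its value
    else:
        D.insert(i, [k, v])       # slow: shifts every later item, O(n)

roster = [['Abraham', 16], ['Bemnet', 14]]
sorted_update(roster, 'Jelani', 20)   # goes at the end
sorted_update(roster, 'Aster', 15)    # goes in the middle
sorted_update(roster, 'Bemnet', 17)   # overwrites an existing value
print(roster)
```

# The list stays sorted after every call, so the binary search ```query``` above keeps working on it.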
# # Hash tables

# __Idea:__ Design a function ```h(s)``` such that for every string ```s```, ```h(s)``` is a number between $0$ and $m-1$. Then we can have a list ```L``` of length $m$ such that ```L[h(s)]``` will be the city where student ```s``` is from.

# So, to get the city of the student with name ```name```, all we need to do is compute ```h(name)```, and then we can look up this student's city in only one step.

# __Problems:__
#
# * How do you find such a function?
# * What do you do if you have two different students with names ```name1``` and ```name2``` such that ```h(name1)==h(name2)```?
# * How small can we make $m$? Note that it costs us in computer _memory_ to make $m$ too big.

# Let's start with the first problem: we want to find a function $h$ that takes every string $s$ to a somewhat "random" number between $0$ and $m-1$ (in our case $m=152$).

# One simple function is the following: treat each letter as a number from $1$ to $26$ and add all the letters in the name modulo $m$.
#
# (Note that this is a simple function that works well sometimes but not always. In particular, it will always map "jelani" and "lanjei" to the same number, since both names contain the same letters. There are better "hash functions", though it turns out Python doesn't use those $\ldots$)

# In[42]:


def letter_to_number(c):
    # a space and a hyphen get their own numbers after the 26 letters
    if c==' ':
        return 27
    if c=='-':
        return 28
    return 1+ord(c)-ord('a')

letter_to_number('c')


# In[52]:


def h(s,m):
    # add up the numbers of all letters in s, then reduce modulo m
    res = 0
    for c in s.lower():
        res += letter_to_number(c)
    return res % m


# In[51]:


h("jelani nelson",152)


# In[110]:


h("timnit gebru",152)


# Let's test how well it works for the students in this class:

# In[58]:


students = [['Nigusu Solomon','Bahir Dar'],['Hailemichael Sileshi','Bahir Dar'],['Befker Gezahegn','Addis Ababa'],['Selamawit Dejene','Dukem'],['Simon Kebede','Dawro'],['Winner Admasu','Kambata Tembaru'],['Ashenafi Alemu','Hawassa'],['Milkias Henok','Hawassa'],['Tsegaye Dereje','East Wollega'],['Hermela Mesele','Arsi'],['Yekalfire Tesafa','Jimma'],['Nazerawit Wondimu','Gurage'],['Netsanet Wodajo','Bonga'],['Kaleb Kinde','Gurage'],['Dawit Tesfaye','Hawassa'],['Abdulahi Ararso','West Arsi'],['Alganesh Mikael','Abala'],['Habtamua Isaias','Jinka'],['Mekdes Burayu','Jimma'],['Eman Yesuf','Komborcha'],['Olensa Birhanu','Ambo'],['Ayansa Benti','Ambo'],['Nazreth Ketema','SNNPR'],['Epiphany Gedeno','Segen'],['Yonatan Yohanas','Semegn Shoa'],['Abrham Mulatu','Semegn Shoa'],['Diriba Nemomsa','Horo Guduru Welega'],['Milion Asfaw','West Shewa'],['Shuminnet Lemma','West Shewa'],['Chekol Ambelu','West Gojam'],['Chernet Solomon','Terja'],['Yekalfire Tesafa','Jimma'],['Yeabsira Birhanu','Borena'],['Ali Shami','Dubti'],['Henok Biadglign','Bahir Dar'],['Furno Mulu','Kawo'],['Mekdes Bune','Asela'],['Eyob Zebene','Dukem'],['Kenean Tilahun','Ginnir'],['Yeabsira Tsegaye','Awash'],['Betsegaw Yohannes','Dara'],['Tariku Wendwessen','Harar'],['Amir Shilega','Emidber'],['Natnael Mekonen','Adama'],['Demelka Darko','Arba Minch'],['Dawit Mekonnen','Bichana, Gojam'],['Yonas Mulu','Dejen, Gojam'],['Tariku Asefa','Baruda'],['Genet Tamrat','Dawro'],['Israel Abera','Kambata'],['Afomia Eshetu','Dara, Sidama'],['Ezedin Seid','Assaita'],['Mihiretab Tesfahun','Kambata'],['Rufeyda Ahmed','Harar'],['Gelila Demeke','Harar'],['Birhanu Merera','Gelan'],['Dagm Kassa','Eteya'],['Haileamlak Mirachew','Agana'],['Muluken Wale','Debre Markos'],['Maruf Mohamed','Jimma'],['Mikias Shimeles','Kombocha'],['Juhar Hussen','Werailu'],['Ali Hassen','Kemise'],['Addisu Seteye','South Wollo'],['Hussien Abdu','Dessie'],['Hassen Abdullahi','Harerge'],['Destaw Fentie','North Gondar'],['Mikiyas Gashaw','South Wollo'],['Dessalegn Amare','South Wollo'],['Mieso Shalo','Batu'],['Gojamnesh Asnakew','Dangla, Gojam'],['Semira Miruts','Addis Ababa'],['Hazrim Mohammed','East Harerge'],['Degnesh Mulat','Gondar'],['Kalkidan Dereje','Bale Robe'],['Tewodros Teka','Woldiya'],['Tamiru Haile','Bonga'],['Kidist Enyew','Amanuel, East Gojam'],['Betlehem Asmamaw','Feres Bet, West Gojam'],['Ries Tegegn','Debre Birhan'],['Fikirte Mehari','Bichena, East Gojam'],['Asrat Ayele','Bona'],['Birtukan Degu','Pawi'],['Sewnet Belay','Saja'],['Kuleni Geda','Welonciti, Misrak Shewa'],['Nardos Hailemeskel','Ejere, Misrak Shewa'],['Bezawit Assaye','Debrework'],['Bitewush Getie','Sedie'],['Dibora Anbesaw','Gondar'],['Mahlet Ababaw','Kolladiba'],['Tigist Teshale','Areka'],['Timotwos Natnael','Hawassa'],['Aminat Musa','Dessie'],['Ararsa Abebe','Tulubolo'],['Fante Teshome','Fiche'],['Mekdes Gashaw','Sekota'],['Melat Mesele','Kobo'],['Tsinat Woldeyohannes','Boditi'],['Firehiwot Dagne','Sendefa'],['Biftu Zenu','Jimma'],['Hikma Seifu','Jimma'],['Nemera Abera','Mojo'],['Amanuel Yosef','Konso'],['Tesfaye Adugna','Assela'],['Semira Abdu','Bati'],['Zelalem Alebel','Debre Markos'],['Michael Teshome','Addis Ababa'],['Munira Mohamed','Homi'],['Derartu Milkias','Kelem Wollega'],['Ekram Abdukaadir','East Harerge'],['Birtukan Birku','Gamo Gofa'],['Amena Sirnesa','Kelem Wollega'],['Demera Soboka','Western Wollega'],['Fasika Takele','Bedele'],['Ararso Mohammed','Harar'],['Addisu Tafete','Waghimra, Sekota'],['Iftu Anbese','Batu'],['Bora Murti','West Guji'],['Mideksa Girma','Abusera'],['Sisay Dinku','Adama'],['Abdurahman Bedri','Harar'],['Efrem Birhan','Kone'],['Natnael Mekuria','Adama'],['Jarso Abgudo','Borena'],['Shemsiya Hule','West Guji'],['Sabona Terefe','Burayu'],['Tesfaye Shifera','Legetafo'],['Telile Kenae','Illubabur'],['Zewdu Shumete','Finoteselam'],['Cherinet Abebe','Bedele'],['Abenezer Abera','Wolayta-Sodo'],['Sintayehu Asfaw','Bahir Dar'],['Birtukan Fitamo','Hadiya'],['Bizunesh Habte','Hadiya'],['Bethel Tedla','Addis Ababa'],['Melkamu Birhanu','Gedeo'],['Deme Demise','Holeta'],['Eyerusalem Abate','Butajira'],['Eyael Birhanu','Dila'],['Selamawit Abel','Segen'],['Sena Temesgen','Burayu'],['Melal Zewge','Wolliso'],['Ayantu Girma','Adama'],['Atitegeb Ashebit','Sagure'],['Solomon Gizaw','Wolliso'],['Rediet Girma','Alaba'],['Nuritu Sheisa','Alaba'],['Mekdes Diriba','Ambo'],['Samuel Assefa','Sebeta'],['Elias Meskelo','Metu'],['Gudina Mengistu','Nekemte'],['Geleta Niguse','Bulehora']]
print(students)
len(students)


# In[55]:


# run the helper notebook, which is expected to define integer_hist
get_ipython().run_line_magic('run', "'boaz_utils.ipynb'")
integer_hist([h(pair[0],152) for pair in students])


# We see that the function is "almost" good, in that most places only have one student matched to them, but several places have two students and a few even have seven.

# So now we can have a list ```cities_list``` of length 152, where for every ```i```, ```cities_list[i]``` will contain the list of all [name, city] pairs corresponding to the students with name ```s``` such that ```h(s)==i```.
#
# For every ```i```, the list ```cities_list[i]``` will have at most seven pairs.

# So, if we want to get the city of a student with name ```s```, we only need a few steps:
#
# * We let ```L=cities_list[h(s)]```
# * Then we scan this short list ```L``` (at most seven pairs) to find the pair of the form ```[s, t]```; the city is ```t```

# ## Summary of data structures
#
# Often the right data structure can make all the difference:
#
#
# |Data structure               |Query(key)     |Update(key)    |Other properties              |
# |-----------------------------|---------------|---------------|------------------------------|
# |Unsorted list                |$O(n)$         |$O(n)$         | Supports any objects         |
# |Sorted array                 |$O(\log n)$    |$O(n)$         | Supports range queries       |
# |Balanced binary search tree  |$O(\log n)$    |$O(\log n)$    | Supports range queries       |
# |Hash table                   |$<10$ steps    |$<10$ steps    | Supports non-comparable keys |
#
#
# __Note:__ Data structures are a _huge_ topic, and if you study more computer science you will hear about more concepts such as stacks, queues, heaps, and many more.

# # One application of dictionaries: memoization
#
# Rather than making the memory a ```list``` or a ```list``` of ```list```s, we could use a ```dict```.
# In[67]:


def fibMemo(n, mem):
    if n <= 1:
        return n
    elif n in mem:
        # already computed: look the answer up instead of recursing
        return mem[n]
    else:
        mem[n] = fibMemo(n-1, mem) + fibMemo(n-2, mem)
        return mem[n]

def fib(n):
    mem = dict()
    return fibMemo(n, mem)


# In[71]:


for i in range(10):
    print(fib(i))


# We haven't seen some of the below syntax yet, but it's even possible to automatically create a memoized version of a function!

# In[12]:


def memoize(func):
    mem = dict()
    def f(*params, calls=mem):
        # use the printed form of the arguments as the dictionary key
        key = repr(params)
        if key not in calls:
            calls[key] = func(*params)
        return calls[key]
    return f


# In[13]:


# recursive, non-memoized version of fibonacci
def fib(n):
    if n <= 1:
        return n
    else:
        return fib(n-1) + fib(n-2)


# In[5]:


fib(36)


# In[14]:


# after this line the name fib refers to the memoized version,
# so the recursive calls inside fib are memoized too
fib = memoize(fib)
fib(36)
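# Python's standard library also ships a ready-made memoizing decorator, ```functools.lru_cache```. The sketch below is a different technique from the hand-written ```memoize``` above, shown only for comparison; the function name ```fib_cached``` is made up for this example.

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # maxsize=None means the cache never evicts entries
def fib_cached(n):
    if n <= 1:
        return n
    # recursive calls go through the cached function as well
    return fib_cached(n - 1) + fib_cached(n - 2)

print(fib_cached(36))
print(fib_cached.cache_info())  # reports cache hits, misses and current size
```

# One difference from ```memoize```: ```lru_cache``` uses the arguments themselves as dictionary keys, so they must be hashable, whereas ```memoize``` uses ```repr(params)``` as the key.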