Data Structures

A data structure is, as the name suggests, a way to store data ("information") in a structured way. For example, a Python list is a data structure: it stores a sequence of items. A string is also a data structure. Data structures are also classified as static or dynamic. We say a data structure is static if you have to give it all the data up front and, once it is created, you can never change the data. A dynamic data structure is the opposite: it supports changing the data. (Other words with similar meanings to static and dynamic are immutable and mutable, respectively.)

list is an example of a dynamic data structure, since you can, for example, do L[i] = j to change an element of a list L. An example of a static data structure in Python is the string type, str.

In [104]:
s = 'abc'
print(s)
s[0] = 'd'
abc
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-104-6124c60bee6f> in <module>()
      1 s = 'abc'
      2 print(s)
----> 3 s[0] = 'd'

TypeError: 'str' object does not support item assignment

Do not be confused by the example below, though. Just because we can assign the variable s to a different string doesn't mean strings are dynamic. In the example below, we aren't changing the first string 'abc' that we created. We are creating a brand new string and reassigning the variable s to it.

In [11]:
s = 'abc'
print(s)
s = 'dbc'
print(s)
abc
dbc
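
The difference between reassigning a variable and changing the data itself is easier to see with a second variable pointing at the same object. The sketch below is only an illustration; the names t, L and M are not part of the original notes.

```python
# Reassignment: the name s moves to a new string; the old string is untouched.
s = 'abc'
t = s          # t refers to the same string object as s
s = 'dbc'      # s now refers to a brand new string
print(t)       # t still sees the original string 'abc'

# Mutation: the list object itself changes, so every name referring to it sees the change.
L = [1, 2, 3]
M = L          # M refers to the same list object as L
L[0] = 99      # change the list in place
print(M)       # M sees the change because M and L are the same list
```

Here t keeps the old string because strings cannot be changed, while M shows the new first element because the list was modified in place.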

When discussing a particular data structure, whether static or dynamic, we have to understand what we want from our data structure. In the case of both static and dynamic data structures, there is some set of queries that we want the data structure $D$ to allow us to answer about the data. Additionally for dynamic data structures, there are some types of updates we want to support.

For example, if L is a list data structure storing a sequence of items, then it supports queries of the form query(i), which returns the ith item in the sequence (in Python, the syntax for this query is L[i]).

The Dynamic Dictionary Problem

One data structural problem that comes up a lot is the dynamic dictionary problem. In this problem our data is a set $S$ of $n$ items, where each item is a (key, value) pair: $(k_0,v_0),\ldots,(k_{n-1},v_{n-1})$. For example, the keys might be names of people (as strings), and the values might be the city where the person is from. We will want our data structure $D$ to support several operations:

  • update(k, v): if an item with key $k$ exists in $S$, change its value to $v$; otherwise insert a new item $(k,v)$ into $S$
  • remove(k): remove the item from $S$, if such an item exists, that has key $k$
  • query(k): return the value $v$ associated with key $k$ in $S$ (or report if no such item exists)

Python has a built-in implementation of a dynamic dictionary data structure called dict. There are multiple ways to create a dictionary data structure:

In [106]:
D = dict() # here's one way to initialize an empty dict
D2 = {} # here's another way to do the same thing
print(D)
print(D2)

D['Gelila Demeke'] = 'Harar' # this is how to do "update" in Python, where k='Gelila Demeke', v='Harar'
print(D)

print("Let's do a query")
print(D['Gelila Demeke']) # this is how to do a query

D['Mikias Shimeles'] = 'Kombocha' # we can have more than one person in the dictionary
print(D)

D.pop('Gelila Demeke') # this is the remove operation
print(D)

#D.pop('Gelila Demeke') # in Python, if you try to remove a key that doesn't exist, there's an error

print("Let's try this again")
D.pop('Gelila Demeke', None) # to not get an error, you need to tell Python what to return if the key doesn't exist

D[50] = [1,2,3] # keys and values don't have to be strings
print(D)

print(D.keys()) # a function that gives you the keys; you can iterate over this return type
for x in D.keys():
    print(x)
{}
{}
{'Gelila Demeke': 'Harar'}
Let's do a query
Harar
{'Gelila Demeke': 'Harar', 'Mikias Shimeles': 'Kombocha'}
{'Mikias Shimeles': 'Kombocha'}
Let's try this again
{50: [1, 2, 3], 'Mikias Shimeles': 'Kombocha'}
dict_keys([50, 'Mikias Shimeles'])
50
Mikias Shimeles
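
The query operation was described as returning the value "or report if no such item exists". The cell above only queries a key that is present, so here is a small sketch of the ways a dict lets you handle a missing key. The default string 'unknown' is just an example value.

```python
D = {'Mikias Shimeles': 'Kombocha'}

# Membership test: reports whether a key is present without raising an error.
print('Mikias Shimeles' in D)
print('Gelila Demeke' in D)

# get returns a default (None unless you pass another one) when the key is missing.
print(D.get('Gelila Demeke'))
print(D.get('Gelila Demeke', 'unknown'))

# Indexing with a missing key raises KeyError, which we can catch.
try:
    D['Gelila Demeke']
except KeyError:
    print('no such key')
```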

How are dynamic dictionaries implemented?

The curious student may wonder: how did the creators of Python themselves implement dict?

  • One way to implement a dynamic dictionary is as a list: we can store the dictionary as [[$k_0,v_0$],[$k_1$,$v_1$],$\ldots$,[$k_{n-1},v_{n-1}$]]. Then all three operations take time $O(n)$. For example, here's how we could implement query:
In [32]:
def query(D, k):
    for item in D:
        if item[0] == k:
            return item[1]
    return None # return None if key k is not in the set of items
In [107]:
def update(D, k, v):
    for item in D:
        if item[0] == k:
            item[1] = v
            return
    D.append([k,v])
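
The third operation, remove, is not shown above. A minimal sketch for the same list-of-pairs representation could look like the following; like query and update, it scans the list, so it also takes $O(n)$ time.

```python
def remove(D, k):
    # scan for the first item whose key is k and delete it
    for i, item in enumerate(D):
        if item[0] == k:
            D.pop(i)  # deleting from the middle shifts the later items down
            return
    # if no item has key k, there is nothing to do

D = [['Abraham', 16], ['Bemnet', 14]]
remove(D, 'Abraham')
print(D)
remove(D, 'Jelani')  # key not present: D stays the same
print(D)
```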

$O(n)$ time is quite slow for querying a dictionary of $n$ items, though, so one might ask: can we do better? If the keys are comparable (that is, k < k' is a supported operation for two different keys k and k'), we could store the items in an array sorted by key. Then we could perform a query using binary search, taking only $O(\log n)$ time.

In [108]:
# search for key k in D[a:b]
def query(D, k, a=-1, b=-1):
    if a==-1:
        return query(D, k, 0, len(D))
    elif a==b:
        return None
    else:
        mid = (a+b)//2
        if D[mid][0] == k:
            return D[mid][1]
        elif D[mid][0] < k:
            return query(D, k, mid+1, b)
        else:
            return query(D, k, a, mid)
        
D = [['Abraham', 16], ['Bemnet', 14]]
print(query(D, 'Abraham'))
print(query(D, 'Jelani'))
16
None

There is a problem with this solution, though. What is it?

The problem is that it's not fast to make this solution dynamic! If you insert a new item $(k,v)$, you cannot just append it to the end of the list, since that might violate the sorted order. You can find where the new item belongs using binary search, but putting it there means shifting every later item over by one position, so update would take $O(n)$ time! There is a way around this using a data structure called a balanced binary search tree, but we will not discuss that data structure in this class.
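
To make the cost concrete, here is a sketch of update for the sorted-array representation. The helper name find_position and the example values are assumptions made for this illustration. Finding the position takes only $O(\log n)$ steps, but list.insert has to shift the later items, which is where the $O(n)$ comes from.

```python
def find_position(D, k):
    # binary search for the first index whose key is >= k
    a, b = 0, len(D)
    while a < b:
        mid = (a + b) // 2
        if D[mid][0] < k:
            a = mid + 1
        else:
            b = mid
    return a

def update(D, k, v):
    pos = find_position(D, k)
    if pos < len(D) and D[pos][0] == k:
        D[pos][1] = v          # key already present: overwrite its value
    else:
        D.insert(pos, [k, v])  # O(n): every later item moves one place to the right

D = [['Abraham', 16], ['Bemnet', 14]]
update(D, 'Jelani', 30)  # goes at the end
update(D, 'Abel', 15)    # goes at the front, so everything shifts
update(D, 'Bemnet', 17)  # existing key, value replaced
print(D)
```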

What does Python actually do?

Python uses hash tables!

Hash tables

Idea: Design a function h(s) such that for every string s, h(s) is a number between $0$ and $m-1$. Then we can have a list L of length $m$ such that L[h(s)] is the city where student s is from.

So, to get the city of the student with name name, all we need to do is compute h(name), and then we can look up L[h(name)] in only one step.

Problems:

  • How do you find such a function?
  • What do you do if you have two different students with names name1 and name2 such that h(name1)==h(name2)?
  • How small can we make $m$? Note that it costs us in computer memory to make $m$ too big.

Let's start with the first problem: we want to find a function $h$ that takes every string $s$ to a somewhat "random" number between $0$ and $m-1$ (in our case $m=152$).

One simple function is the following: treat each letter as a number from $1$ to $26$ and add up the numbers of all the letters in the name, modulo $m$.

(Note that this is a simple function that works well sometimes but not always; in particular, it will always map "jelani" and "lanjei" to the same number. There are better "hash functions", though it turns out Python doesn't use those $\ldots$)

In [42]:
def letter_to_number(c):
    if c==' ':
        return 27
    if c=='-':
        return 28
    return 1+ord(c)-ord('a')

letter_to_number('c')
Out[42]:
3
In [52]:
def h(s,m):
    res = 0
    for c in s.lower():
        res += letter_to_number(c)
    return res % m
In [51]:
h("jelani nelson",152)
Out[51]:
5
In [110]:
h("timnit gebru",152)
Out[110]:
13
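
The remark that "jelani" and "lanjei" always collide can be checked directly. The sketch below repeats letter_to_number and h so it runs on its own, and adds a hypothetical variant h_poly that also takes letter positions into account. The name h_poly and the multiplier 31 are arbitrary choices for illustration; this is not the function Python uses.

```python
def letter_to_number(c):
    if c == ' ':
        return 27
    if c == '-':
        return 28
    return 1 + ord(c) - ord('a')

def h(s, m):
    # sum of the letters, so the order of the letters does not matter
    res = 0
    for c in s.lower():
        res += letter_to_number(c)
    return res % m

def h_poly(s, m, base=31):
    # multiply the running total before adding each letter,
    # so the same letters in a different order usually give a different result
    res = 0
    for c in s.lower():
        res = (res * base + letter_to_number(c)) % m
    return res

print(h("jelani", 152) == h("lanjei", 152))            # same letters, same sum
print(h_poly("jelani", 152) == h_poly("lanjei", 152))  # order now matters
```

With $m=152$ the first comparison is true and the second is false. A position-sensitive function can still produce collisions, just not for every pair of rearranged names.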

Let's test how well it works for the students in this class:

In [58]:
students = [['Nigusu Solomon','Bahir Dar'],['Hailemichael Sileshi','Bahir Dar'],['Befker Gezahegn','Addis Ababa'],['Selamawit Dejene','Dukem'],['Simon Kebede','Dawro'],['Winner Admasu','Kambata Tembaru'],['Ashenafi Alemu','Hawassa'],['Milkias Henok','Hawassa'],['Tsegaye Dereje','East Wollega'],['Hermela Mesele','Arsi'],['Yekalfire Tesafa','Jimma'],['Nazerawit Wondimu','Gurage'],['Netsanet Wodajo','Bonga'],['Kaleb Kinde','Gurage'],['Dawit Tesfaye','Hawassa'],['Abdulahi Ararso','West Arsi'],['Alganesh Mikael','Abala'],['Habtamua Isaias','Jinka'],['Mekdes Burayu','Jimma'],['Eman Yesuf','Komborcha'],['Olensa Birhanu','Ambo'],['Ayansa Benti','Ambo'],['Nazreth Ketema','SNNPR'],['Epiphany Gedeno','Segen'],['Yonatan Yohanas','Semegn Shoa'],['Abrham Mulatu','Semegn Shoa'],['Diriba Nemomsa','Horo Guduru Welega'],['Milion Asfaw','West Shewa'],['Shuminnet Lemma','West Shewa'],['Chekol Ambelu','West Gojam'],['Chernet Solomon','Terja'],['Yekalfire Tesafa','Jimma'],['Yeabsira Birhanu','Borena'],['Ali Shami','Dubti'],['Henok Biadglign','Bahir Dar'],['Furno Mulu','Kawo'],['Mekdes Bune','Asela'],['Eyob Zebene','Dukem'],['Kenean Tilahun','Ginnir'],['Yeabsira Tsegaye','Awash'],['Betsegaw Yohannes','Dara'],['Tariku Wendwessen','Harar'],['Amir Shilega','Emidber'],['Natnael Mekonen','Adama'],['Demelka Darko','Arba Minch'],['Dawit Mekonnen','Bichana, Gojam'],['Yonas Mulu','Dejen, Gojam'],['Tariku Asefa','Baruda'],['Genet Tamrat','Dawro'],['Israel Abera','Kambata'],['Afomia Eshetu','Dara, Sidama'],['Ezedin Seid','Assaita'],['Mihiretab Tesfahun','Kambata'],['Rufeyda Ahmed','Harar'],['Gelila Demeke','Harar'],['Birhanu Merera','Gelan'],['Dagm Kassa','Eteya'],['Haileamlak Mirachew','Agana'],['Muluken Wale','Debre Markos'],['Maruf Mohamed','Jimma'],['Mikias Shimeles','Kombocha'],['Juhar Hussen','Werailu'],['Ali Hassen','Kemise'],['Addisu Seteye','South Wollo'],['Hussien Abdu','Dessie'],['Hassen Abdullahi','Harerge'],['Destaw Fentie','North Gondar'],['Mikiyas Gashaw','South Wollo'],
['Dessalegn Amare','South Wollo'],['Mieso Shalo','Batu'],['Gojamnesh Asnakew','Dangla, Gojam'],['Semira Miruts','Addis Ababa'],['Hazrim Mohammed','East Harerge'],['Degnesh Mulat','Gondar'],['Kalkidan Dereje','Bale Robe'],['Tewodros Teka','Woldiya'],['Tamiru Haile','Bonga'],['Kidist Enyew','Amanuel, East Gojam'],['Betlehem Asmamaw','Feres Bet, West Gojam'],['Ries Tegegn','Debre Birhan'],['Fikirte Mehari','Bichena, East Gojam'],['Asrat Ayele','Bona'],['Birtukan Degu','Pawi'],['Sewnet Belay','Saja'],['Kuleni Geda','Welonciti, Misrak Shewa'],['Nardos Hailemeskel','Ejere, Misrak Shewa'],['Bezawit Assaye','Debrework'],['Bitewush Getie','Sedie'],['Dibora Anbesaw','Gondar'],['Mahlet Ababaw','Kolladiba'],['Tigist Teshale','Areka'],['Timotwos Natnael','Hawassa'],['Aminat Musa','Dessie'],['Ararsa Abebe','Tulubolo'],['Fante Teshome','Fiche'],['Mekdes Gashaw','Sekota'],['Melat Mesele','Kobo'],['Tsinat Woldeyohannes','Boditi'],['Firehiwot Dagne','Sendefa'],['Biftu Zenu','Jimma'],['Hikma Seifu','Jimma'],['Nemera Abera','Mojo'],['Amanuel Yosef','Konso'],['Tesfaye Adugna','Assela'],['Semira Abdu','Bati'],['Zelalem Alebel','Debre Markos'],['Michael Teshome','Addis Ababa'],['Munira Mohamed','Homi'],['Derartu Milkias','Kelem Wollega'],['Ekram Abdukaadir','East Harerge'],['Birtukan Birku','Gamo Gofa'],['Amena Sirnesa','Kelem Wollega'],['Demera Soboka','Western Wollega'],['Fasika Takele','Bedele'],['Ararso Mohammed','Harar'],['Addisu Tafete','Waghimra, Sekota'],['Iftu Anbese','Batu'],['Bora Murti','West Guji'],['Mideksa Girma','Abusera'],['Sisay Dinku','Adama'],['Abdurahman Bedri','Harar'],['Efrem Birhan','Kone'],['Natnael Mekuria','Adama'],['Jarso Abgudo','Borena'],['Shemsiya Hule','West Guji'],['Sabona Terefe','Burayu'],['Tesfaye Shifera','Legetafo'],['Telile Kenae','Illubabur'],['Zewdu Shumete','Finoteselam'],['Cherinet Abebe','Bedele'],['Abenezer Abera','Wolayta-Sodo'],['Sintayehu Asfaw','Bahir Dar'],['Birtukan Fitamo','Hadiya'],['Bizunesh Habte','Hadiya'],
['Bethel Tedla','Addis Ababa'],['Melkamu Birhanu','Gedeo'],['Deme Demise','Holeta'],['Eyerusalem Abate','Butajira'],['Eyael Birhanu','Dila'],['Selamawit Abel','Segen'],['Sena Temesgen','Burayu'],['Melal Zewge','Wolliso'],['Ayantu Girma','Adama'],['Atitegeb Ashebit','Sagure'],['Solomon Gizaw','Wolliso'],['Rediet Girma','Alaba'],['Nuritu Sheisa','Alaba'],['Mekdes Diriba','Ambo'],['Samuel Assefa','Sebeta'],['Elias Meskelo','Metu'],['Gudina Mengistu','Nekemte'],['Geleta Niguse','Bulehora']]
print(students)
len(students)
[['Nigusu Solomon', 'Bahir Dar'], ['Hailemichael Sileshi', 'Bahir Dar'], ['Befker Gezahegn', 'Addis Ababa'], ['Selamawit Dejene', 'Dukem'], ['Simon Kebede', 'Dawro'], ['Winner Admasu', 'Kambata Tembaru'], ['Ashenafi Alemu', 'Hawassa'], ['Milkias Henok', 'Hawassa'], ['Tsegaye Dereje', 'East Wollega'], ['Hermela Mesele', 'Arsi'], ['Yekalfire Tesafa', 'Jimma'], ['Nazerawit Wondimu', 'Gurage'], ['Netsanet Wodajo', 'Bonga'], ['Kaleb Kinde', 'Gurage'], ['Dawit Tesfaye', 'Hawassa'], ['Abdulahi Ararso', 'West Arsi'], ['Alganesh Mikael', 'Abala'], ['Habtamua Isaias', 'Jinka'], ['Mekdes Burayu', 'Jimma'], ['Eman Yesuf', 'Komborcha'], ['Olensa Birhanu', 'Ambo'], ['Ayansa Benti', 'Ambo'], ['Nazreth Ketema', 'SNNPR'], ['Epiphany Gedeno', 'Segen'], ['Yonatan Yohanas', 'Semegn Shoa'], ['Abrham Mulatu', 'Semegn Shoa'], ['Diriba Nemomsa', 'Horo Guduru Welega'], ['Milion Asfaw', 'West Shewa'], ['Shuminnet Lemma', 'West Shewa'], ['Chekol Ambelu', 'West Gojam'], ['Chernet Solomon', 'Terja'], ['Yekalfire Tesafa', 'Jimma'], ['Yeabsira Birhanu', 'Borena'], ['Ali Shami', 'Dubti'], ['Henok Biadglign', 'Bahir Dar'], ['Furno Mulu', 'Kawo'], ['Mekdes Bune', 'Asela'], ['Eyob Zebene', 'Dukem'], ['Kenean Tilahun', 'Ginnir'], ['Yeabsira Tsegaye', 'Awash'], ['Betsegaw Yohannes', 'Dara'], ['Tariku Wendwessen', 'Harar'], ['Amir Shilega', 'Emidber'], ['Natnael Mekonen', 'Adama'], ['Demelka Darko', 'Arba Minch'], ['Dawit Mekonnen', 'Bichana, Gojam'], ['Yonas Mulu', 'Dejen, Gojam'], ['Tariku Asefa', 'Baruda'], ['Genet Tamrat', 'Dawro'], ['Israel Abera', 'Kambata'], ['Afomia Eshetu', 'Dara, Sidama'], ['Ezedin Seid', 'Assaita'], ['Mihiretab Tesfahun', 'Kambata'], ['Rufeyda Ahmed', 'Harar'], ['Gelila Demeke', 'Harar'], ['Birhanu Merera', 'Gelan'], ['Dagm Kassa', 'Eteya'], ['Haileamlak Mirachew', 'Agana'], ['Muluken Wale', 'Debre Markos'], ['Maruf Mohamed', 'Jimma'], ['Mikias Shimeles', 'Kombocha'], ['Juhar Hussen', 'Werailu'], ['Ali Hassen', 'Kemise'], ['Addisu Seteye', 'South Wollo'], ['Hussien Abdu', 
'Dessie'], ['Hassen Abdullahi', 'Harerge'], ['Destaw Fentie', 'North Gondar'], ['Mikiyas Gashaw', 'South Wollo'], ['Dessalegn Amare', 'South Wollo'], ['Mieso Shalo', 'Batu'], ['Gojamnesh Asnakew', 'Dangla, Gojam'], ['Semira Miruts', 'Addis Ababa'], ['Hazrim Mohammed', 'East Harerge'], ['Degnesh Mulat', 'Gondar'], ['Kalkidan Dereje', 'Bale Robe'], ['Tewodros Teka', 'Woldiya'], ['Tamiru Haile', 'Bonga'], ['Kidist Enyew', 'Amanuel, East Gojam'], ['Betlehem Asmamaw', 'Feres Bet, West Gojam'], ['Ries Tegegn', 'Debre Birhan'], ['Fikirte Mehari', 'Bichena, East Gojam'], ['Asrat Ayele', 'Bona'], ['Birtukan Degu', 'Pawi'], ['Sewnet Belay', 'Saja'], ['Kuleni Geda', 'Welonciti, Misrak Shewa'], ['Nardos Hailemeskel', 'Ejere, Misrak Shewa'], ['Bezawit Assaye', 'Debrework'], ['Bitewush Getie', 'Sedie'], ['Dibora Anbesaw', 'Gondar'], ['Mahlet Ababaw', 'Kolladiba'], ['Tigist Teshale', 'Areka'], ['Timotwos Natnael', 'Hawassa'], ['Aminat Musa', 'Dessie'], ['Ararsa Abebe', 'Tulubolo'], ['Fante Teshome', 'Fiche'], ['Mekdes Gashaw', 'Sekota'], ['Melat Mesele', 'Kobo'], ['Tsinat Woldeyohannes', 'Boditi'], ['Firehiwot Dagne', 'Sendefa'], ['Biftu Zenu', 'Jimma'], ['Hikma Seifu', 'Jimma'], ['Nemera Abera', 'Mojo'], ['Amanuel Yosef', 'Konso'], ['Tesfaye Adugna', 'Assela'], ['Semira Abdu', 'Bati'], ['Zelalem Alebel', 'Debre Markos'], ['Michael Teshome', 'Addis Ababa'], ['Munira Mohamed', 'Homi'], ['Derartu Milkias', 'Kelem Wollega'], ['Ekram Abdukaadir', 'East Harerge'], ['Birtukan Birku', 'Gamo Gofa'], ['Amena Sirnesa', 'Kelem Wollega'], ['Demera Soboka', 'Western Wollega'], ['Fasika Takele', 'Bedele'], ['Ararso Mohammed', 'Harar'], ['Addisu Tafete', 'Waghimra, Sekota'], ['Iftu Anbese', 'Batu'], ['Bora Murti', 'West Guji'], ['Mideksa Girma', 'Abusera'], ['Sisay Dinku', 'Adama'], ['Abdurahman Bedri', 'Harar'], ['Efrem Birhan', 'Kone'], ['Natnael Mekuria', 'Adama'], ['Jarso Abgudo', 'Borena'], ['Shemsiya Hule', 'West Guji'], ['Sabona Terefe', 'Burayu'], ['Tesfaye Shifera', 'Legetafo'], 
['Telile Kenae', 'Illubabur'], ['Zewdu Shumete', 'Finoteselam'], ['Cherinet Abebe', 'Bedele'], ['Abenezer Abera', 'Wolayta-Sodo'], ['Sintayehu Asfaw', 'Bahir Dar'], ['Birtukan Fitamo', 'Hadiya'], ['Bizunesh Habte', 'Hadiya'], ['Bethel Tedla', 'Addis Ababa'], ['Melkamu Birhanu', 'Gedeo'], ['Deme Demise', 'Holeta'], ['Eyerusalem Abate', 'Butajira'], ['Eyael Birhanu', 'Dila'], ['Selamawit Abel', 'Segen'], ['Sena Temesgen', 'Burayu'], ['Melal Zewge', 'Wolliso'], ['Ayantu Girma', 'Adama'], ['Atitegeb Ashebit', 'Sagure'], ['Solomon Gizaw', 'Wolliso'], ['Rediet Girma', 'Alaba'], ['Nuritu Sheisa', 'Alaba'], ['Mekdes Diriba', 'Ambo'], ['Samuel Assefa', 'Sebeta'], ['Elias Meskelo', 'Metu'], ['Gudina Mengistu', 'Nekemte'], ['Geleta Niguse', 'Bulehora']]
Out[58]:
152
In [55]:
% run 'boaz_utils.ipynb'
integer_hist([h(pair[0],152) for pair in students])
Using matplotlib backend: TkAgg
Populating the interactive namespace from numpy and matplotlib

We see that the function is "almost" good, in that most places have only one student matched to them, but several places have two students and a few even have seven.

So now we can have a list cities_list of length 152, where for every i, cities_list[i] will contain the list of all pairs corresponding to the students with name s such that h(s)==i.

For every i, the list cities_list[i] will have at most seven pairs.

So, if we want to get the city of a student with name s, we need at most eight steps (one lookup plus a scan of at most seven pairs):

  • We let L=cities_list[h(s)]
  • Then we scan this short list L to find the pair of the form [s, t], where t is the student's city
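
The two steps above, together with the matching update and remove operations, can be sketched as follows. This is a minimal illustration of the idea and not how Python's dict is actually written. The block repeats h so it runs on its own and fills the table with only three of the students.

```python
def letter_to_number(c):
    if c == ' ':
        return 27
    if c == '-':
        return 28
    return 1 + ord(c) - ord('a')

def h(s, m):
    res = 0
    for c in s.lower():
        res += letter_to_number(c)
    return res % m

m = 152
cities_list = [[] for _ in range(m)]  # one (initially empty) short list per hash value

def update(table, k, v):
    bucket = table[h(k, len(table))]
    for pair in bucket:
        if pair[0] == k:
            pair[1] = v       # key already stored: change its value
            return
    bucket.append([k, v])     # new key: add it to its short list

def query(table, k):
    for pair in table[h(k, len(table))]:  # scan only the one short list
        if pair[0] == k:
            return pair[1]
    return None               # no such key

def remove(table, k):
    bucket = table[h(k, len(table))]
    for i, pair in enumerate(bucket):
        if pair[0] == k:
            bucket.pop(i)
            return

for name, city in [['Gelila Demeke', 'Harar'], ['Mikias Shimeles', 'Kombocha'], ['Ali Shami', 'Dubti']]:
    update(cities_list, name, city)

print(query(cities_list, 'Ali Shami'))
print(query(cities_list, 'Jelani Nelson'))  # not in the table
```

Two names with the same hash value simply end up in the same short list, which is why each operation compares the full name and not just the hash value.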

Summary of data structures

Often the right data structure can make all the difference:

| Data structure | Query(key) | Update(key) | Other properties |
| --- | --- | --- | --- |
| Unsorted list | $n$ | $n$ | Supports any objects |
| Sorted array | $\log n$ | $n$ | Supports range queries |
| Balanced binary search tree | $\log n$ | $\log n$ | Supports range queries |
| Hash table | $<10$ | $<10$ | Supports non-comparable keys |
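
To get a feel for the gap between $n$ and $\log n$ in the query column, the sketch below counts how many keys each method looks at in the worst case. The function names linear_steps and binary_steps and the choice of 1000 keys are assumptions for illustration only.

```python
def linear_steps(keys, k):
    # count how many keys a left-to-right scan looks at before finding k
    steps = 0
    for x in keys:
        steps += 1
        if x == k:
            break
    return steps

def binary_steps(keys, k):
    # count how many keys binary search looks at in a sorted list
    a, b, steps = 0, len(keys), 0
    while a < b:
        steps += 1
        mid = (a + b) // 2
        if keys[mid] == k:
            break
        elif keys[mid] < k:
            a = mid + 1
        else:
            b = mid
    return steps

keys = list(range(1000))       # 1000 sorted keys
print(linear_steps(keys, 999)) # the scan looks at every key
print(binary_steps(keys, 999)) # binary search looks at about log2(1000) keys
```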

Note: Data structures are a huge topic, and if you study more computer science you will hear about more concepts such as stacks, queues, heaps, and many more.

One application of dictionaries: memoization

Rather than making the memory a list or a list of lists, we could use a dict.

In [67]:
def fibMemo(n, mem):
    if n <= 1:
        return n
    elif n in mem:
        return mem[n]
    else:
        mem[n] = fibMemo(n-1, mem) + fibMemo(n-2, mem)
        return mem[n]

def fib(n):
    mem = dict()
    return fibMemo(n, mem)
In [71]:
for i in range(10):
    print(fib(i))
0
1
1
2
3
5
8
13
21
34

We haven't seen some of the below syntax yet, but it's even possible to automatically create a memoized version of a function!

In [12]:
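def memoize(func):
    mem = dict()  # cache shared by every call to the returned function
    def f(*params, calls=mem):
        key = repr(params)  # use the printed form of the arguments as the dictionary key
        if key not in calls:
            calls[key] = func(*params)  # first time with these arguments: compute and store
        return calls[key]
    return f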
In [13]:
# recursive, non-memoized version of fibonacci
def fib(n):
    if n <= 1:
        return n
    else:
        return fib(n-1) + fib(n-2)
In [5]:
fib(36)
Out[5]:
14930352
In [14]:
fib = memoize(fib)
fib(36)
Out[14]:
14930352
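
Python's standard library also provides a ready-made memoizer, functools.lru_cache, which can be applied as a decorator. The sketch below shows it as an alternative to the hand-written memoize above. The decorator syntax with @ is another piece of syntax not covered in these notes.

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # maxsize=None means the cache never throws entries away
def fib(n):
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(36))
```

Because the decorator replaces the name fib, the recursive calls inside the function also go through the cache, just as with fib = memoize(fib).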