Exercise 2.1, Depth-first search revisited

This iPython notebook revisits material from the first week of the University of Twente Module 'Web Science', and further explains the Python syntax. With this notebook, you can check if you are up-to-date with the course material. This exercise is not mandatory. If you think you do not need this exercise, either quickly finish it, or skip it all together and go directly to Exercise 2.2.

Read the Wikipedia page on Depth-first search, in particlar the example that contains the following undirected graph:

Example graph

In Python, an easy way to represent a graph is with the following data structure, in which the graph is represented as a dictionary with a key for each node, where the value of each key is a list of connected nodes. This results in the following representation of our example graph in Python: (Note that each edge is encoded twice in the representation. For instance the edge A -- B is encoded by having 'B' in the list for the key 'A', and also having 'A' in the list for the key 'B'.)

In [1]:
graph = {'A': ['B', 'C', 'E'],
         'B': ['A', 'D', 'F'],
         'C': ['A', 'G'],
         'D': ['B'],
         'E': ['A', 'F'],
         'F': ['B', 'E'],
         'G': ['C'] }

Exercise 2.1.1: the recursive implementation

From the pseudo code on Wikipedia, implement the recursive function visited_nodes_dfs(G, start, discovered) that will be called with a graph G, a start node start, and a list of nodes that are already discovered (discovered).

In [2]:
def visited_nodes_dfs(G, start, discovered):
    #
    # Put your solution here
    #
    return discovered

So, depth-first search starting at node 'A' is given by the following function call: (Tip: check the order of the nodes in the returned list with the correct order mentioned on Wikipedia.)

In [3]:
visited_nodes_dfs(graph, 'A', [])
Out[3]:
['A', 'B', 'D', 'F', 'E', 'C', 'G']

Exercise 2.1.2: the iterative implementation

From the pseudo code on Wikipedia, implement the iterative function visited_nodes_dfs_iterative(G, start, discovered) that uses a stack. Note that Python lists can be easily used as a stack too, as they implement the special method pop() as follows stack.pop(), whereas the standard methods append() can be used for push as follows: stack.append().

In [4]:
def visited_nodes_dfs_iterative(G, start, discovered):
    #
    # Put your solution here
    #
    return discovered


visited_nodes_dfs_iterative(graph, 'A', [])
Out[4]:
['A', 'E', 'F', 'B', 'D', 'C', 'G']

Introducing NetworkX

Standard functions that are often needed, like depth-first search, are provided in standard python modules. The module networkx implements many standard graph operations. You can save days of programming by spending a few hours reading the NetworkX documentation. (However, programming depth-first search is for educational purposes!). We import the module using the following command: (Tip: with dir(networkx) you get a list of all functions implemented by the module.)

In [5]:
import networkx

Of course, a standard module should also use a standard format of graphs. Such a standard format enables us to exchange graphs with our friends, or share them on-line. One such a standard format for graphs is DOT. Let's use Python to write a DOT file of the example graph from Exercise 2.1.1, as follows:

Tip: Strings can be surrounded in a pair of matching triple-quotes: """ or '''. End of lines do not need to be escaped when using triple-quotes, but they will be included in the string.

In [6]:
dot_graph = """
strict graph G {
  A -- B; A -- C; A -- E;
  B -- D; B -- F;
  C -- G;
  E -- F;
}
"""

f = open('figure-example.dot', 'w') # open file for writing  
f.write(dot_graph)
f.close()

Now, let's open the DOT file with NetworkX, and assign it to the variable G:

Tip: Read the NetworkX Basics completely before starting.

Beware: *NetworkX supports 4 graph types: Graph, DiGraph, MultiGraph, and MultiDiGraph. Our graph should use a Digraph (a directed graph, a single edge between two nodes). The function networkx.read_dot() uses the MultiGraph or MultiDiGraph as its standard representation. To force the system in using a simple Graph, put the read_dot() statement inside the networkx.Graph() statement as follows:

In [7]:
G = networkx.Graph(networkx.read_dot('figure-example.dot'))

Let's introduce another module: IPython, which contains amongst others Python functions for displaying images. To display our graph we first have to transform it to another representation and assign it to the variable viz (In fact, this other representation is used by yet another the Python module, pygraphviz, but we do not need to worry about that).

In [8]:
import IPython

viz = networkx.to_agraph(G)
viz.layout(prog='dot')
viz.draw('figure-example.png')

IPython.display.Image('figure-example.png')
Out[8]:

Is this graph the same graph as the graph from the figure from Wikipedia? Explain your answer.

Exercise 2.1.3: the NetworkX implementation

Search the NetworkX documentation (see the link above) for the depth-first search function. Use the preorder version of depth-first search on the graph G starting at node 'A'. Insert your answer below:

In [9]:
#
# Put your solution here
#
Out[9]:
<generator object <genexpr> at 0x7f76581c10a0>

If all is well, you see a message like the following message:

<generator object <genexpr> at 0x7fa5a14a9f00>

where 0x7fa5a14a9f00 is some random (hexadecimal) number.

About Python generators

The people that programmed the NetworkX module have put a lot of effort in making its functions efficient, also if the graphs are really big. Most NetworkX functions therefore do not produce lists or graphs directly, but instead they produce generator objects.

Generator objects do not actually contain a full data structure like a full list or a full graph. Instead, they contain code that will produce the next node of a list or a graph, one node at a time, each time we ask the generator object for the next node. This way, the system does not need to hold the full list in memory, and in many cases, the full list does not need to be generated, ever.

Suppose we want to know if there is a path between node 'A' and 'G' in a graph. We might execute a depth first search, and loop over the result of depth first search (starting at 'A') until we find node 'G'. If indeed there is a path from 'A' to 'G', we can stop the algorithm and return True, otherwise we will return False. In pseudo code:

1  procedure pathExists(G, v1, v2):
2      for all nodes n in G.dfs(v1) do
3          if n equals v2 then
4              return True
5      return False

to be called as pathExists(G, 'A', 'G'). The full depth first search (dfs) result is never created if G.dfs() produces a generator object (and there is a path between 'A' and 'G'). If G.dfs() produces the actual full list of nodes, then the algorithm still works (be it would be slower if there is indeed a path from 'A' to 'G'). Python objects that contain all data, and that allow looping over all items are called iterators.

Exercise 2.1.4: iterators and generators

To produce the actual data structure (list) from a generator, simply loop over the results using a for statement to produce the data structure (list). For advanced Python users: try to use list comprehensions to obtain compact and elegant code. Insert your answer below:

In [10]:
#
# Put your solution here
#
Out[10]:
['A', 'C', 'G', 'B', 'D', 'F', 'E']

Beware: Once you looped over the results of a generator object, the results are no longer available. Whereas an iterator object can be re-used (we can iterate over it again and again), a generator object cannot be re-used.

Exercise 2.1.5: copy by reference vs. deep copy

When you copy data in Python by assigning one variable to another, you are often not copying the data itself: You just copy the reference to the data. So, if you first copy the data (by reference), and then add something to the copy of the data (say, add an edge to a graph), then you are actually (also) adding the edge to the original data. Let's see this at work. Remove the # character below, so the edge A -- Z is added to the new graph newG:

In [11]:
newG = G
#newG.add_edge('A', 'Z')

Now go to the figure just above Exercise 2.1.3, and re-run the IPython notebook cell by clicking control-shift.

OUCH! The figure for graph G changed too! To make a real copy that we can change without affecting the old graph, use the method copy of G to get a deep copy. Try it out, and check whether the figure above still changes or not.

Tip: From the drop down menu of Jupyter use Cell and then Run All to rerun all statements in your notebook.