A Review of Basic Algorithms and Data Structures in Python: Graph Algorithms

Diogo Ribeiro
Apr 29

Introduction

Recently, while reviewing basic graph algorithms, I decided to write down my study notes as an article in case someone else finds them useful.

To verify my understanding, I wrote minimal implementations of the algorithms in Python which make up the bulk of this article.

Simple unit tests accompany the code.

The unit tests can also be used as examples of using the code.

I’m hoping to write at least a few follow-up posts, focusing on combinatorial algorithms, string algorithms, and maybe even one on computational geometry.

Most of the code was written to be easy to understand without having to reference much else (with a few exceptions, for example, Kruskal’s algorithm uses the disjoint set structure defined in another section).

This results in some duplication, especially in the unit tests.

I consider this acceptable, given that the code is meant as educational material rather than production code that needs day-to-day maintenance.

One last thing before we start: I wrote the article and all the code relatively quickly.

Mistakes and bugs are definitely possible.

Corrections are appreciated; please comment below if you find any.

Table Of Contents

Algorithms and data structures in this article:

Disjoint Set (Union-Find)
Kruskal’s Minimum Spanning Tree (MST)
Depth First Search (DFS)
Breadth First Search (BFS)
Kahn’s Topological Sort Algorithm
Dijkstra’s Shortest Path Algorithm
Bellman-Ford Shortest Path Algorithm

Disjoint Set (Union-Find)

The disjoint set structure is used to keep track of a partitioning of a set of objects into subsets.

The main question it needs to answer is “do X and Y belong to the same subset?” and the main operation it needs to support is joining two subsets so that elements in either of the subsets will belong to the same larger subset afterward.

A quick, minimal implementation is provided below.

It uses a forest to keep track of the subsets in the partition.

Each tree in the forest is one subset, and the root of the tree is the “representative” element of the subset.

To check if two elements belong to the same subset, we check if they have the same representative element.

Noting that the ideal tree in this implementation is a star (this minimizes the number of recursive find calls), we "compress" the paths on each call to find.

That is, as the recursive calls return, we set the parent of every element on the path to the representative itself.
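To see what compression does to a concrete tree, here is a small standalone sketch. It is not the article's class: the free function `find` and the bare `parent` list are stand-ins chosen for illustration, but the recursion is the same idea applied to a four-element chain.

```python
def find(parent, x):
    """Return the representative of x, compressing the path along the way."""
    if parent[x] != x:
        # Re-point x directly at the representative found by the recursion.
        parent[x] = find(parent, parent[x])
    return parent[x]


# A chain 0 <- 1 <- 2 <- 3: each element's parent is the previous element.
parent = [0, 0, 1, 2]
root = find(parent, 3)

print(root)    # 0
print(parent)  # [0, 0, 0, 0]
```

After a single call on the deepest element, every element on the path points straight at the root, so the chain has become a star.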

    class DisjointSet(object):
        def __init__(self, n):
            """Initializes a disjoint set structure consisting of n disjoint sets."""
            self.parent = list(range(n))

        def find(self, x):
            """Returns the representative element of the set x belongs to."""
            if self.parent[x] != x:
                # Path compression: point x directly at its representative.
                self.parent[x] = self.find(self.parent[x])
            return self.parent[x]

        def union(self, x, y):
            """Joins the sets containing x and y."""
            self.parent[self.find(x)] = self.find(y)

And the accompanying unit test:

    import unittest

    from union_find import DisjointSet


    class DisjointSetTest(unittest.TestCase):
        def test_initialized_state(self):
            d = DisjointSet(3)
            self.assertEqual(d.find(0), 0)
            self.assertEqual(d.find(1), 1)
            self.assertEqual(d.find(2), 2)

        def test_basic_union(self):
            d = DisjointSet(3)
            d.union(0, 1)
            self.assertEqual(d.find(0), d.find(1))
            self.assertNotEqual(d.find(1), d.find(2))

        def test_basic_union_idempotent(self):
            d = DisjointSet(2)
            d.union(0, 1)
            d.union(0, 1)
            self.assertEqual(d.find(0), d.find(1))

        def test_union_all(self):
            d = DisjointSet(100)
            for i in range(1, 100):
                d.union(i - 1, i)
            for i in range(1, 100):
                self.assertEqual(d.find(0), d.find(i))

Kruskal’s Minimum Spanning Tree (MST)

Kruskal’s minimum spanning tree algorithm is a good example of a greedy algorithm.

Starting with a forest consisting of individual disjoint vertices, at each step we pick the next best edge (one with minimal weight) provided it does not introduce a cycle into the forest, and continue until the forest becomes a tree.

It’s rather easy to prove that the resulting tree is a minimum spanning tree.

Using the disjoint set structure shown above to keep track of the minimum spanning forest, the implementation below is very simple:

    from collections import namedtuple

    from union_find import DisjointSet

    # Putting weight as the first element means Edges will sort by weight first,
    # then source and target (lexicographically).
    Edge = namedtuple('Edge', ['weight', 'source', 'target'])


    def kruskal_mst(n, edges):
        """
        Given a positive integer n (number of vertices) and a collection of Edge
        namedtuple objects representing the undirected edges of a graph, returns
        a list of edges forming a minimum spanning tree of the graph.

        Assumes the vertices are numbers in the range 0 to n - 1. Also assumes
        input is a valid connected undirected graph and that for two vertices v
        and w only one of (v, w) or (w, v) is an edge in the input. Output is
        undefined if these assumptions are not satisfied.
        """
        d = DisjointSet(n)
        mst_tree = []
        for edge in sorted(edges):
            # Only take the edge if its endpoints are in different trees,
            # i.e. it does not close a cycle.
            if d.find(edge.source) != d.find(edge.target):
                mst_tree.append(edge)
                if len(mst_tree) == n - 1:
                    break
                d.union(edge.source, edge.target)
        return mst_tree

And the accompanying unit test:

    import unittest

    from kruskal import kruskal_mst, Edge


    class KruskalMSTTest(unittest.TestCase):
        def test_single_vertex_graph(self):
            self.assertEqual(kruskal_mst(1, []), [])

        def test_single_edge_graph(self):
            edges = [Edge(source=0, target=1, weight=10)]
            self.assertEqual(kruskal_mst(2, edges), edges)

        def test_cycle_5(self):
            edges = [
                Edge(source=0, target=1, weight=50),
                Edge(source=1, target=2, weight=30),
                Edge(source=2, target=3, weight=60),
                Edge(source=3, target=4, weight=20),
                Edge(source=4, target=0, weight=10),
            ]
            # Everything except the heaviest edge. Output sorted by weight.
            self.assertEqual(kruskal_mst(5, edges), [
                Edge(source=4, target=0, weight=10),
                Edge(source=3, target=4, weight=20),
                Edge(source=1, target=2, weight=30),
                Edge(source=0, target=1, weight=50),
            ])

        def test_complete_graph_4(self):
            edges = [
                Edge(source=0, target=1, weight=10),
                Edge(source=0, target=2, weight=30),
                Edge(source=0, target=3, weight=40),
                Edge(source=1, target=2, weight=20),
                Edge(source=1, target=3, weight=50),
                Edge(source=2, target=3, weight=60),
            ]
            self.assertEqual(kruskal_mst(4, edges), [
                Edge(source=0, target=1, weight=10),
                Edge(source=1, target=2, weight=20),
                Edge(source=0, target=3, weight=40),
            ])

Depth First Search (DFS)

Depth-first search is arguably the simplest graph traversal algorithm.

It’s a simple recursive algorithm that just needs to keep track of which vertices have already been processed.

In fact, many other recursive algorithms can be thought of as a DFS on some underlying graph (e.g. binary search is a guided DFS on the binary search tree).

DFS can be used to determine if there is a path from a vertex to another and to visit every vertex starting from a source vertex.

Variations of DFS can be used for determining connected components and doing topological sorting.
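As one illustration of such a variation, here is a minimal sketch that finds connected components. It assumes an undirected graph in which every edge is listed in both directions, and it uses an explicit stack instead of recursion; the function name `connected_components` is made up for this example.

```python
def connected_components(graph):
    """Group the vertices of an undirected graph into components.

    Assumes graph[v] lists the neighbours of v and that every edge appears
    in both directions (a convention chosen for this sketch).
    """
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        # Every vertex reached from start belongs to the same component.
        component = set()
        stack = [start]
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            component.add(v)
            stack.extend(graph[v])
        components.append(component)
    return components


graph = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3], 5: []}
print([sorted(c) for c in connected_components(graph)])
# [[0, 1, 2], [3, 4], [5]]
```

The explicit stack visits vertices in a depth-first order and avoids Python's recursion limit on long paths.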

The code below simply uses DFS to return all vertices reachable from a starting vertex.

    def dfs(graph, source):
        """
        Given a directed graph (format described below), and a source vertex,
        returns a set of vertices reachable from source.

        The graph parameter is expected to be a dictionary mapping each vertex
        to a list of vertices indicating outgoing edges. For example if vertex v
        has outgoing edges to u and w we have graph[v] = [u, w].
        """
        visited = set()

        def _recurse(v):
            if v in visited:
                return
            visited.add(v)
            for w in graph[v]:
                _recurse(w)

        _recurse(source)
        return visited

And the accompanying unit test:

    import unittest

    from dfs import dfs


    class DFSTest(unittest.TestCase):
        def test_single_vertex(self):
            graph = {0: []}
            self.assertEqual(dfs(graph, 0), {0})

        def test_single_vertex_with_loop(self):
            graph = {0: [0]}
            self.assertEqual(dfs(graph, 0), {0})

        def test_two_vertices_no_path(self):
            graph = {
                0: [],
                1: [],
            }
            self.assertEqual(dfs(graph, 0), {0})
            self.assertEqual(dfs(graph, 1), {1})

        def test_two_vertices_with_simple_path(self):
            graph = {
                0: [1],
                1: [],
            }
            self.assertEqual(dfs(graph, 0), {0, 1})
            self.assertEqual(dfs(graph, 1), {1})

        def test_complete_graph(self):
            def _complete_graph(n):
                return {v: list(set(range(n)) - {v}) for v in range(n)}

            for n in range(2, 10):
                graph = _complete_graph(n)
                for v in range(n):
                    self.assertEqual(dfs(graph, v), set(range(n)))

        def test_cycle_5(self):
            graph = {
                0: [1],
                1: [2],
                2: [3],
                3: [4],
                4: [0],
            }
            for v in range(5):
                self.assertEqual(dfs(graph, v), {0, 1, 2, 3, 4})

Breadth First Search (BFS)

BFS is one of the simplest graph algorithms and a good algorithm to understand prior to Dijkstra’s, which is covered later in this article.

It can be used to simply traverse a graph and visit every vertex, to search for a particular vertex, or find the shortest path (assuming edges don’t have weights) to every vertex starting from a single vertex.
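The last use case, shortest unweighted distances from a single source to every reachable vertex, can be sketched as follows. This is a minimal illustration rather than the article's implementation; the name `bfs_distances` is hypothetical, and the graph uses the same dictionary-of-lists format as the DFS section.

```python
from collections import deque


def bfs_distances(graph, source):
    """Return {vertex: number of edges on a shortest path from source}."""
    distance = {source: 0}
    q = deque([source])
    while q:
        v = q.popleft()
        for w in graph[v]:
            # The first time we reach w is along a shortest path to it.
            if w not in distance:
                distance[w] = distance[v] + 1
                q.append(w)
    return distance


# An undirected cycle on 5 vertices, with each edge listed in both directions.
graph = {0: [4, 1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
print(bfs_distances(graph, 0))  # {0: 0, 4: 1, 1: 1, 3: 2, 2: 2}
```

Vertices that are unreachable from the source simply do not appear in the returned dictionary.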

    from collections import deque


    def bfs(graph, source, target):
        """
        Given a directed graph (format described below), and source and target
        vertices, returns a shortest unweighted path as a list of vertices going
        from source to target, or None if no such path exists.

        Returned path will not include the source vertex in it.

        The graph parameter is expected to be a dictionary mapping each vertex
        to a list of vertices indicating outgoing edges. For example if vertex v
        has outgoing edges to u and w we have graph[v] = [u, w].
        """
        q = deque([source])
        # previous_vertex[v] holds the immediate vertex before v in the shortest
        # path from source to v. This dictionary also acts as our "visited" set
        # since we set previous_vertex[v] as soon as the vertex enters our queue.
        previous_vertex = {source: source}
        while q:
            v = q.popleft()
            if v == target:
                return _construct_path(previous_vertex, source, target)
            for w in graph[v]:
                if w not in previous_vertex:
                    previous_vertex[w] = v
                    q.append(w)
        return None


    def _construct_path(previous_vertex, source, target):
        if source == target:
            return []
        return _construct_path(previous_vertex, source,
                               previous_vertex[target]) + [target]

And the accompanying unit test:

    import unittest

    from bfs import bfs


    class BFSTest(unittest.TestCase):
        def test_single_vertex(self):
            graph = {0: []}
            self.assertEqual(bfs(graph, 0, 0), [])

        def test_single_vertex_with_loop(self):
            graph = {0: [0]}
            self.assertEqual(bfs(graph, 0, 0), [])

        def test_two_vertices_no_path(self):
            graph = {
                0: [],
                1: [],
            }
            self.assertEqual(bfs(graph, 0, 1), None)

        def test_two_vertices_with_simple_path(self):
            graph = {
                0: [1],
                1: [],
            }
            self.assertEqual(bfs(graph, 0, 1), [1])

        def test_complete_graph(self):
            def _complete_graph(n):
                return {v: list(set(range(n)) - {v}) for v in range(n)}

            for n in range(2, 10):
                graph = _complete_graph(n)
                for v in range(n):
                    for w in range(n):
                        self.assertEqual(bfs(graph, v, w), [] if v == w else [w])

        def test_cycle_5(self):
            graph = {
                0: [4, 1],
                1: [0, 2],
                2: [1, 3],
                3: [2, 4],
                4: [3, 0],
            }
            self.assertEqual(bfs(graph, 0, 2), [1, 2])
            self.assertEqual(bfs(graph, 0, 3), [4, 3])

Kahn’s Topological Sort Algorithm

Given a directed acyclic graph (DAG) representing a set of, say, tasks and their dependencies, topological sorting is the problem of finding an order of task execution that will satisfy all the dependencies.

This problem arises in a variety of applications.

Examples include task scheduling, build systems (e.g. Bazel), parallel pipelines (e.g. Hadoop), and formula evaluation (e.g. in spreadsheets).

While a variation of DFS can be used for topological sorting, my personal favorite algorithm for doing topological sorts is Kahn’s algorithm, due to its intuitiveness.

The idea behind the algorithm is simple: start with vertices with no incoming edges, process them, and then remove them and all their outgoing edges from the graph and continue until there’s nothing left in the graph.

In the code below, instead of returning a particular topological sort, the algorithm assigns a “sequence” number to each vertex, such that every edge from v to w satisfies sequence[v] < sequence[w]; sorting the vertices by sequence number therefore yields a valid topological sort of the graph.

This simplifies unit testing, and also allows for easier use of the output in cases where parallelization is possible (since all tasks with the same sequence number can be executed in parallel).
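To make the parallelization remark concrete, here is a small sketch that groups vertices by sequence number. The helper name `parallel_batches` is hypothetical, and the input is a hand-written mapping of the kind described above (two independent chains 0 -> 1 -> 2 and 3 -> 4 -> 5), not output captured from a real run.

```python
from collections import defaultdict


def parallel_batches(sequence):
    """Group vertices by sequence number, lowest sequence number first."""
    batches = defaultdict(list)
    for vertex, level in sequence.items():
        batches[level].append(vertex)
    # Vertices in one batch do not depend on each other, so each batch could
    # be run in parallel once all earlier batches have finished.
    return [sorted(batches[level]) for level in sorted(batches)]


sequence = {0: 0, 3: 0, 1: 1, 4: 1, 2: 2, 5: 2}
print(parallel_batches(sequence))  # [[0, 3], [1, 4], [2, 5]]
```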

    from collections import deque, namedtuple

    Vertex = namedtuple('Vertex', ['name', 'incoming', 'outgoing'])


    def build_doubly_linked_graph(graph):
        """
        Given a graph with only outgoing edges, build a graph with incoming and
        outgoing edges. The returned graph will be a dictionary mapping vertex
        to a Vertex namedtuple with sets of incoming and outgoing vertices.
        """
        g = {v: Vertex(name=v, incoming=set(), outgoing=set(o))
             for v, o in graph.items()}
        # Iterate over a snapshot, since vertices that only appear as targets
        # are added to g inside the loop.
        for v in list(g.values()):
            for w in v.outgoing:
                if w in g:
                    g[w].incoming.add(v.name)
                else:
                    g[w] = Vertex(name=w, incoming={v.name}, outgoing=set())
        return g


    def kahn_top_sort(graph):
        """
        Given an acyclic directed graph (format described below), returns a
        dictionary mapping vertex to sequence such that sorting by the sequence
        component will result in a topological sort of the input graph. Output
        is undefined if input is not a valid DAG.

        The graph parameter is expected to be a dictionary mapping each vertex
        to a list of vertices indicating outgoing edges. For example if vertex v
        has outgoing edges to u and w we have graph[v] = [u, w].
        """
        g = build_doubly_linked_graph(graph)
        # Every edge v -> w ends up with sequence[v] < sequence[w], so sorting
        # by sequence gives a topological sort.
        q = deque(v.name for v in g.values() if not v.incoming)
        sequence = {v: 0 for v in q}
        while q:
            v = q.popleft()
            for w in g[v].outgoing:
                g[w].incoming.remove(v)
                if not g[w].incoming:
                    sequence[w] = sequence[v] + 1
                    q.append(w)
        return sequence

And the accompanying unit test:

    import unittest

    from kahn import kahn_top_sort


    class KahnTopSortTest(unittest.TestCase):
        def test_single_vertex(self):
            graph = {
                0: [],
            }
            self.assertEqual(kahn_top_sort(graph), {
                0: 0,
            })

        def test_total_order_2(self):
            graph = {
                0: [1],
                1: [],
            }
            self.assertEqual(kahn_top_sort(graph), {
                0: 0,
                1: 1,
            })

        def test_total_order_3(self):
            graph = {
                0: [1],
                1: [2],
                2: [],
            }
            self.assertEqual(kahn_top_sort(graph), {
                0: 0,
                1: 1,
                2: 2,
            })

        def test_two_independent_total_orders(self):
            # 0 -> 1 -> 2
            # 3 -> 4 -> 5
            graph = {
                0: [1],
                1: [2],
                2: [],
                3: [4],
                4: [5],
                5: [],
            }
            self.assertEqual(kahn_top_sort(graph), {
                0: 0, 3: 0,
                1: 1, 4: 1,
                2: 2, 5: 2,
            })

        def test_simple_dag_1(self):
            # 0 -> 1 -> 2
            # 0 -> 3 -> 1
            graph = {
                0: [1, 3],
                1: [2],
                2: [],
                3: [1],
            }
            self.assertEqual(kahn_top_sort(graph), {
                0: 0,
                3: 1,
                1: 2,
                2: 3,
            })

Dijkstra’s Shortest Path Algorithm

Dijkstra’s shortest path algorithm is very similar to BFS, except a priority queue is used instead of a regular queue.

A proper implementation would use a priority queue with an “update key” operation which would reduce the redundant items in the queue.

The implementation below, for the sake of simplicity, uses the standard library's PriorityQueue, which does not support "update key".
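A common way to live without "update key" is to leave outdated entries in the queue and skip them when they are popped. The sketch below shows that idea using the standard library's `heapq` module rather than `PriorityQueue`. It is a different technique from a true decrease-key operation, and the name `dijkstra_distances` and the (target, weight) tuple format are assumptions made for this example only.

```python
import heapq


def dijkstra_distances(graph, source):
    """Return {vertex: shortest distance from source}.

    Assumes graph[v] is a list of (target, weight) pairs and that all
    weights are non-negative.
    """
    distance = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > distance[v]:
            # Stale entry: a shorter path to v was already processed.
            continue
        for w, weight in graph[v]:
            alt = d + weight
            if alt < distance.get(w, float('inf')):
                distance[w] = alt
                heapq.heappush(heap, (alt, w))
    return distance


graph = {
    'a': [('b', 10), ('c', 30)],
    'b': [('a', 10), ('c', 10)],
    'c': [('a', 30), ('b', 30)],
}
print(dijkstra_distances(graph, 'a'))  # {'a': 0, 'b': 10, 'c': 20}
```

In this run the entry (30, 'c') is still in the heap when 'c' is finalized at distance 20, and it is discarded by the stale-entry check when it is eventually popped.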

The invariant in the algorithm is that the first time we get a vertex from the queue, we know that we already have the shortest path from source to it (this is where the guarantee of non-negative weights is key, as this invariant can fail if we have negative weights).

    from collections import namedtuple, defaultdict
    from queue import PriorityQueue  # Python 3; the module is named Queue in Python 2.

    Edge = namedtuple('Edge', ['target', 'weight'])


    def dijkstra(graph, source, target):
        """
        Given a directed graph (format described below), and source and target
        vertices, returns a (distance, path) tuple where path is a shortest path
        as a list of vertices going from source to target and distance is its
        length, or None if no such path exists.

        Returned path will not include the source vertex in it.
        Assumes non-negative weights.

        The graph parameter is expected to be a dictionary mapping each vertex
        to a list of Edge named tuples indicating the vertex's outgoing edges.
        For example if vertex v has outgoing edges to u and w with weights 10
        and 20 respectively, we have graph[v] = [Edge(u, 10), Edge(w, 20)].
        """
        q = PriorityQueue()
        q.put((0, source))
        # previous_vertex[v] holds the immediate vertex before v in the shortest
        # path from source to v found so far.
        previous_vertex = {source: source}
        # Arguably not the best way to represent infinity but it works for the
        # sake of learning the algorithm.
        shortest_distance = defaultdict(lambda: float('inf'))
        shortest_distance[source] = 0
        while not q.empty():
            (distance, v) = q.get()
            if v == target:
                return (distance, _construct_path(previous_vertex, source, target))
            for edge in graph[v]:
                alt_distance = edge.weight + distance
                if alt_distance < shortest_distance[edge.target]:
                    shortest_distance[edge.target] = alt_distance
                    q.put((alt_distance, edge.target))
                    previous_vertex[edge.target] = v
        return None


    def _construct_path(previous_vertex, source, target):
        if source == target:
            return []
        return _construct_path(previous_vertex, source,
                               previous_vertex[target]) + [target]

And the accompanying unit test:

    import unittest

    from dijkstra import dijkstra, Edge


    class DijkstraTest(unittest.TestCase):
        def test_single_vertex(self):
            graph = {0: []}
            self.assertEqual(dijkstra(graph, 0, 0), (0, []))

        def test_two_vertices_no_path(self):
            graph = {
                0: [],
                1: [],
            }
            self.assertEqual(dijkstra(graph, 0, 1), None)

        def test_two_vertices_with_path(self):
            graph = {
                0: [Edge(target=1, weight=10)],
                1: [],
            }
            self.assertEqual(dijkstra(graph, 0, 1), (10, [1]))

        def test_cycle_3(self):
            graph = {
                0: [Edge(target=1, weight=10), Edge(target=2, weight=30)],
                1: [Edge(target=0, weight=10), Edge(target=2, weight=10)],
                2: [Edge(target=0, weight=30), Edge(target=1, weight=30)],
            }
            self.assertEqual(dijkstra(graph, 0, 2), (20, [1, 2]))

        def test_clrs_example(self):
            graph = {
                's': [Edge(target='t', weight=3), Edge(target='y', weight=5)],
                't': [Edge(target='x', weight=6), Edge(target='y', weight=2)],
                'y': [Edge(target='t', weight=1), Edge(target='z', weight=6)],
                'x': [Edge(target='z', weight=2)],
                'z': [Edge(target='x', weight=7), Edge(target='s', weight=3)],
            }
            distance, path = dijkstra(graph, 's', 'z')
            self.assertEqual(distance, 11)
            # Three different paths from s to z have total weight 11.
            self.assertIn(path, [
                ['y', 'z'],
                ['t', 'y', 'z'],
                ['t', 'x', 'z'],
            ])
            distance, path = dijkstra(graph, 's', 'x')
            self.assertEqual(distance, 9)
            self.assertEqual(path, ['t', 'x'])

Bellman-Ford Shortest Path Algorithm

Bellman-Ford is another single-source shortest path algorithm.

It’s very easy to implement but has a worse running time than Dijkstra’s.

While in Dijkstra’s we relax edges greedily based on the next closest vertex to the source, in Bellman-Ford we relax every edge exactly n-1 times.

As long as no negative cycle is reachable from the source, each such iteration is guaranteed to settle the shortest path of at least one more vertex (if any remain unsettled), and hence after n-1 iterations, we have the shortest path to every vertex.
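To make the repeated relaxation concrete, here is a minimal sketch on a flat list of (source, target, weight) edges that includes one negative edge but no negative cycle. The helper `relax_all`, the edge-list format, and the early exit when a pass changes nothing are choices made for this illustration and are not part of the implementation discussed in this article.

```python
def relax_all(edges, distance):
    """Run one pass over every edge; return True if any distance improved."""
    changed = False
    for u, v, w in edges:
        if distance[u] + w < distance[v]:
            distance[v] = distance[u] + w
            changed = True
    return changed


# s -> a costs 4 directly, but s -> b -> a costs 5 + (-3) = 2.
edges = [('s', 'a', 4), ('s', 'b', 5), ('b', 'a', -3), ('a', 'c', 2)]
vertices = ['s', 'a', 'b', 'c']
distance = {v: float('inf') for v in vertices}
distance['s'] = 0

for _ in range(len(vertices) - 1):
    if not relax_all(edges, distance):
        break  # Optional shortcut: a pass with no change means we are done.

print(distance)  # {'s': 0, 'a': 2, 'b': 5, 'c': 4}
```

With this edge order everything settles in the first pass and the second pass only confirms that nothing changes; a less favourable edge order can need up to n-1 passes.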

We then do a final loop over all the edges and try to relax further.

If we succeed, we know a negative cycle exists.

This is the key advantage of Bellman-Ford as compared to Dijkstra’s (Dijkstra’s algorithm does not work if negative weights exist).

Here’s a basic implementation:

    from collections import namedtuple, defaultdict

    Edge = namedtuple('Edge', ['target', 'weight'])


    def bellman_ford(graph, source, target):
        """
        Given a directed graph (format described below), and source and target
        vertices, returns a (distance, path) tuple where path is a shortest path
        as a list of vertices going from source to target and distance is its
        length, or None if no such path exists and -1 if a negative loop is
        found.

        Returned path will not include the source vertex in it.
        Negative edge weights are allowed.

        The graph parameter is expected to be a dictionary mapping each vertex
        to a list of Edge named tuples indicating the vertex's outgoing edges.
        For example if vertex v has outgoing edges to u and w with weights 10
        and 20 respectively, we have graph[v] = [Edge(u, 10), Edge(w, 20)].
        """
        # previous_vertex[v] holds the immediate vertex before v in the shortest
        # path from source to v found so far.
        previous_vertex = {source: source}
        # Arguably not the best way to represent infinity but it works for the
        # sake of learning the algorithm.
        shortest_distance = defaultdict(lambda: float('inf'))
        shortest_distance[source] = 0
        # Run n - 1 times. We start by knowing the shortest path to 1 vertex
        # (source itself) and each iteration below increases the vertices for
        # which we have the shortest path to by at least one. This means at the
        # end we have the shortest path to 1 + (n - 1) = n vertices.
        for i in range(len(graph) - 1):
            for v in graph:
                for edge in graph[v]:
                    alt_distance = shortest_distance[v] + edge.weight
                    if alt_distance < shortest_distance[edge.target]:
                        shortest_distance[edge.target] = alt_distance
                        previous_vertex[edge.target] = v
        # Final loop over all edges to check for negative loops. If at this
        # point we find a shorter alternative path it means a negative loop
        # exists.
        for v in graph:
            for edge in graph[v]:
                alt_distance = shortest_distance[v] + edge.weight
                if alt_distance < shortest_distance[edge.target]:
                    return -1
        if shortest_distance[target] < float('inf'):
            return (shortest_distance[target],
                    _construct_path(previous_vertex, source, target))
        return None


    def _construct_path(previous_vertex, source, target):
        if source == target:
            return []
        return _construct_path(previous_vertex, source,
                               previous_vertex[target]) + [target]

And as before, the accompanying unit test, which is a copy of the one used for Dijkstra’s, with an additional test for negative cycles:

    import unittest

    from bellman import bellman_ford, Edge


    class BellmanFordTest(unittest.TestCase):
        def test_single_vertex(self):
            graph = {0: []}
            self.assertEqual(bellman_ford(graph, 0, 0), (0, []))

        def test_two_vertices_no_path(self):
            graph = {
                0: [],
                1: [],
            }
            self.assertEqual(bellman_ford(graph, 0, 1), None)

        def test_two_vertices_with_path(self):
            graph = {
                0: [Edge(target=1, weight=10)],
                1: [],
            }
            self.assertEqual(bellman_ford(graph, 0, 1), (10, [1]))

        def test_cycle_3(self):
            graph = {
                0: [Edge(target=1, weight=10), Edge(target=2, weight=30)],
                1: [Edge(target=0, weight=10), Edge(target=2, weight=10)],
                2: [Edge(target=0, weight=30), Edge(target=1, weight=30)],
            }
            self.assertEqual(bellman_ford(graph, 0, 2), (20, [1, 2]))

        def test_negative_cycle_3(self):
            graph = {
                0: [Edge(target=1, weight=10), Edge(target=2, weight=30)],
                1: [Edge(target=0, weight=10), Edge(target=2, weight=10)],
                2: [Edge(target=0, weight=-30), Edge(target=1, weight=30)],
            }
            self.assertEqual(bellman_ford(graph, 0, 2), -1)

        def test_clrs_example(self):
            graph = {
                's': [Edge(target='t', weight=3), Edge(target='y', weight=5)],
                't': [Edge(target='x', weight=6), Edge(target='y', weight=2)],
                'y': [Edge(target='t', weight=1), Edge(target='z', weight=6)],
                'x': [Edge(target='z', weight=2)],
                'z': [Edge(target='x', weight=7), Edge(target='s', weight=3)],
            }
            distance, path = bellman_ford(graph, 's', 'z')
            self.assertEqual(distance, 11)
            # Three different paths from s to z have total weight 11.
            self.assertIn(path, [
                ['y', 'z'],
                ['t', 'y', 'z'],
                ['t', 'x', 'z'],
            ])
            distance, path = bellman_ford(graph, 's', 'x')
            self.assertEqual(distance, 9)
            self.assertEqual(path, ['t', 'x'])