The big D with big O

A value is said to be asymptotic if it approaches a limit ever more closely but never actually reaches it.

When we say that we are doing asymptotic analysis of space and time complexity, we’re saying that we are describing how the worst possible runtime or space needed to complete some operation grows as the input grows.

How do we calculate Big O notation?

Let’s apply this knowledge to problem solving in the context of writing algorithms.

An algorithm is just the process used in solving a problem.

It’s the function you write that takes some input and outputs the solution.

Programming is a double-edged sword when it comes to problem solving.

Oftentimes there’s more than one way to get the correct output, which can be really empowering when you get it working for the first time, but really frustrating when you go into code review.

This is where analyzing the bottleneck ahead of time, before you put in a Pull Request, can help.

Let’s consider a function for a feature I wrote that takes a set of data from all of your Github repositories.

The goal of using this data was to create a graph representing the amount of code I’ve written in Github per language, to display next to my resume.

A graph of languages (tutorial coming soon!)

So the data needed to be, for each language, its name and the sum of all its bytes across all of my repositories, along with the sum of all bytes of all languages across all of the repositories.

The data I got back was by repository, and then by language.

So I needed to write an algorithm to reduce the data to a single array of languages and a total.
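To make that concrete, here’s a rough sketch of the input and the output I was after. The field names here are my own stand-ins for illustration, not necessarily what the Github API returns:

```javascript
// Hypothetical input shape: one entry per repository, each listing its languages.
const repositories = [
  { name: 'repo-one', languages: [{ name: 'JavaScript', bytes: 5000 }, { name: 'CSS', bytes: 1200 }] },
  { name: 'repo-two', languages: [{ name: 'JavaScript', bytes: 3000 }, { name: 'HTML', bytes: 800 }] },
];

// Desired output shape: one entry per unique language, plus a grand total.
// {
//   languages: [
//     { name: 'JavaScript', bytes: 8000 },
//     { name: 'CSS', bytes: 1200 },
//     { name: 'HTML', bytes: 800 },
//   ],
//   total: 10000,
// }
```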

It’s impossible to know the complexity until we actually work through the problem’s solution, but we should start thinking about it immediately, so we can consider how to reduce it if it’s not optimal, or worse, if it’s horrible.

If we sketch a high-level iteration of our problem, we can establish the runtime we need to try to work down from.

The first step is creating an array.

This is very fast, and we’re only doing it once.

This takes constant time, so its time complexity is O(1).

It’s never going to be our complexity bottleneck.

We know we’re going to have an array of size r (for repositories), which we will have to loop through at least one time, depending on how we code.

Its time complexity then is linear, so looping through each repository takes O(r) time.

Each repository has an array of size l (for languages) which is the number of languages it contains.

We have to loop through each of these one time, and perform an operation of placing it in the new array.

While this sounds like O(l + 1), we will have to match each language to a unique language in the new array, whose size we don’t actually know, and we are completing each sum operation l times.

Let’s say that the new array is a size of s (for summed languages).

In order to find the matching language, we’ll need to loop over the new array, and then we can perform the summing operation.

We are completing s × 1 operations for each l, so we have to multiply the two terms.

For each l, and for each s, complete 1 summing operation.

So the complexity of this operation is O(l × s × 1), which simplifies to O(ls).

Next, we need the sum of all of the summed language sizes, which we can get by looping through the new array, and adding the sum of each language to a running total sum.

This will then have a complexity of O(s).
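Putting those steps together, a naive version might look something like this sketch (built on the hypothetical shape from earlier, so treat the field names as assumptions):

```javascript
// Naive version: for every repository, for every language, scan the summed
// array for a match — roughly r × l × s operations, plus one more pass for the total.
function summarizeLanguagesNaive(repositories) {
  const summed = []; // O(1): create the new array

  for (const repo of repositories) { // O(r): loop through each repository
    for (const language of repo.languages) { // O(l): loop through each language
      // O(s): scan the summed array for a matching language
      const match = summed.find((entry) => entry.name === language.name);
      if (match) {
        match.bytes += language.bytes;
      } else {
        summed.push({ name: language.name, bytes: language.bytes });
      }
    }
  }

  // O(s): loop through the new array for the running total
  let total = 0;
  for (const entry of summed) {
    total += entry.bytes;
  }

  return { languages: summed, total };
}
```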

So what is the total worst-case complexity of this theoretical algorithm? We have O(1) + O(r) × O(ls) + O(s).

We multiply the second and third terms because we are doing l × s operations per r.

Here’s where things get imprecise and hand-wavy.

We don’t actually care about l or s.

Why? Because we only care about how the complexity compares to our input.

How does our runtime or space requirement grow as the number of our repositories increases? Since it grows with the repositories, we only care about the size of r.

In notation, we don’t need to be specific about the variable name, so we just refer to the input as n.

So now we can change this to O(1) + O(n) × O(n²) + O(n).

Breakdown of each term

If we were to graph this, we would say: for each n, increase the number of operations by n times n.

We are also increasing linearly, but we can ignore the lower-order terms, as we know they are negligible in this case.

Why? Again, we care only about the growth of our time or space requirements compared to our input.

We can consider n³ and n³ + n + 1 to be roughly equal because n + 1 has very little impact on a cubed value.

Let’s say n = 10.

10³ = 1000.

10³ + 10 + 1 = 1011.

See? The leading-order term then, the part with the biggest size, is n³.

So our Big O is O(n³).

We’ll write n ∈ O(n³).

∈ just means “element of”.

Like repository ∈ repositories.

We don’t need to be precise, we just want to know roughly how the runtime (or space requirements) will increase as our input gets larger.

Exponential increase in operations per element

If you’re thinking that this looks pretty horrible, you’d be right.

We can see that we probably don’t want to code the algorithm the way we’ve described, and we’ll want to consider whether or not we can reduce our complexity before we write any of it.

Sometimes it’s not possible to avoid this type of worst-case complexity, depending on the data you get in and what you need to get out.

This solution is not scalable, but for a really small n, does it need to scale? No.

However, in this case, we can get an unknown number of repositories.

Let’s see if we can reduce.

So we’ve identified at least one place where we can reduce n³. While we identified a second spot, we’re going to focus first on the line that we know we can change.

How can we change it? The key is in the first question we asked.

How else can we hold a list of data? If we know we are going to compare and compute a running sum, is there any data type that will help us? If you guessed a hash map, you’d be spot on! Our third loop can then be done away with, and we can use our previous loop to build and access it.

Even though moving the last step into a previous loop won’t change the Big O notation, we should still do it.

If it’s an optimization that doesn’t trade off readability or much space, that’s a choice we can make.

Be aware, though, that the variable we create will be living for the entirety of the first loop, which has a few operations, so this may not be the right decision for every runtime optimization.

Let’s calculate the new complexity.

1. We create a hash map: O(1).

2. We loop through each repository: O(n).

a. We loop through each language: O(n).
 — Add the name as a key to the hash map and the size as its value; if the key already exists, add the size to the current value: O(1).
 — Add the value to a running sum of the total size: O(1).

Now you can see we only multiply n times n once, and again, we drop the O(1) terms, and we can see that we have a big O of O(n²).
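For concreteness, here’s a minimal sketch of that reduced version, reusing the hypothetical input shape from earlier (the field names are still assumptions, not the literal Github API response):

```javascript
// Reduced version: a hash map (a Map here) replaces the third loop, and the
// running total is kept inside the same pass instead of a separate loop.
function summarizeLanguages(repositories) {
  const byLanguage = new Map(); // O(1): create the hash map
  let total = 0;

  for (const repo of repositories) { // O(n): loop through each repository
    for (const language of repo.languages) { // O(n): loop through each language
      // O(1): add the name as a key, or add the size to the current value
      const current = byLanguage.get(language.name) || 0;
      byLanguage.set(language.name, current + language.bytes);
      total += language.bytes; // O(1): keep the running sum in the same pass
    }
  }

  return {
    languages: [...byLanguage.entries()].map(([name, bytes]) => ({ name, bytes })),
    total,
  };
}
```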

This is undoubtedly the best upper bound complexity we are going to get, but there’s a catch.

We know that there’s a fixed number of languages, so when our input is small, our number of operations may actually be higher than n², and once we go above that fixed number of languages, our number of operations will be lower than n².

So what does this mean? Well, the worst-case complexity is still O(n²), and we don’t really care how precise that is, but maybe we can still optimize our code.

We could find the number of unique languages available on Github, sure, but we would drop any coefficient of n² we added anyway (because a constant factor is negligible as n grows), so it’s better just to look to optimize our code knowing this piece of information.

We can’t really improve within the scope of this function without changing the shape of our input data, so we will just leave it at n ∈ O(n²) until we can change the bottleneck causing the quadratic complexity.

O can be Big, but can it ever be little?

What about cases where we can optimize in scope, and the runtime or space requirements will never reach their Big O notation? Let’s consider an example: a naive initial approach would be to take the difference (d) between x and y, square it, and loop through d² + d iterations, giving us a big O of O(n²).

But are all of those iterations necessary? Half of them, plus d, are just the inverse of an earlier iteration! So we can iterate on ½n².
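The example above is only loosely specified, so here is just a sketch of the general idea, with an assumed pair-comparison task: if comparing (a, b) tells us everything that comparing (b, a) would, the inner loop can start one past the outer index and skip the mirror-image half of the iterations.

```javascript
// Hypothetical sketch: compare every pair of values in a list.
// Starting the inner loop at i + 1 skips the mirror-image pairs, doing
// roughly ½n² iterations instead of n² — the Big O is unchanged, though.
function comparePairs(values) {
  const results = [];
  for (let i = 0; i < values.length; i++) {
    for (let j = i + 1; j < values.length; j++) {
      results.push([values[i], values[j]]);
    }
  }
  return results;
}
```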

This is an optimization we won’t express in big O, because the difference between going from n to n² and going from n to ½n² is negligible as you approach infinity.

Even though say, 1 trillion is a lot more than 500 billion, they are both astronomically larger than 1.

1 trillion is 1 trillion times larger than 1, and 500 billion is half a trillion times larger than 1.

Our upper bound worst-case complexity won’t change, so we’ll still drop the coefficient of ½.

We can, however, express that our function will never run at n² with little o notation.

We can just write o(n²) and that will communicate that it will explicitly never reach the theoretical limit of n², without having to get precise.

What about Omega and Theta?

Ω (Big Omega): the best-case run time for an input of size n, a lower bound.

Let’s consider our Github languages problem.

Let’s say we’re really particular about our languages and for some reason we only write javascript files.

No html files, no CSS files, nothing else.

Just bytes and bytes of javascript code, in every repository.

What’s the best runtime if we don’t have to run a second loop? Well that’s just n × 1, or n.

But what if there’s only one repository and one language?
