How Git-diff algorithm works? — With F#

It’s a pretty complex algorithm.

My intent here, isn’t to explain how exactly git-diff works, but how it could be implemented using F# (which is a really good programming language, but it’s topic for a different post).

We’ll use the LCS (longest common subsequence) algorithm to compare two different texts, the original one and the edited.

You can check the full implementation here, but I strongly recommend you to read all the explanation before check it out on Github.

Let’s take a social media comment to understand how the algorithm works:The movie on Saturday was great!After posting it, our random social media friend realize that the movie was actually on Friday, then, he edit the post to:The movie on Friday was great!When applying the algorithm the result needs to be the longest common subsequence, for instance:The movie on ___day was great!After got this result, we are able to create new comparisons (not included in this post) to find which part of the text was deleted and which one is new.

We can know which part of the original text was deleted by comparing the longest subsequence (The movie on day was great!)with the original string (The movie on Saturday was great!).

By doing the same thing with the edited string we can figure out what is new (The movie on Friday was great!).

Simple, right?As you probably already noticed, the longest common subsequence doesn’t require necessarily to be continuous, it’s just the longest, ok?Let’s take another example:firstString = "BMOAL"secondString = "BLOA"With these strings we can find out “BL” as result, but the correct answer is actually “BOA”, because it need to be the longest one.

Ok, now you know how the problem works, but how we can solve it?The answer can sound a little scary for some developers (don’t be this guy): recursion.

We need to compare char by char, so, it can be decomposed in a recursion solution easily.

There’s a few cases that we need to solve.

The first case is: comparing the same char, this case is pretty simple, you just need to increase the compared char to the result, then continue to compare, by skipping this char in both strings.

The LCS of “BMOAL” and “BLOA” is equivalent to “B” + LCS of “MOAL” and “LOA”.

Let’s start to code!We will create a function called “longest Common Subsequence Solve”, this function is responsible to solve this algorithm as a whole.

In F# we need use the rec keyword, to describe a function as recursive, besides that, we need to receive two different parameters, the strings.

As we discussed before, we need to compare char by char, so, the function also need to receive the index of char that will be compared.

Here, we can create two different solutions.

The first one is simply adding more parameters to our function, but this isn’t the best one, because you would be passing the responsability of controlling indexes to the consumer, to the start, at least.

The best scenario is don’t require the consumer pass the index parameters, as it’s used just for internal control.

We can create a nested function to accomplish this requirement.

By doing this, the main function will just bypass the problem for the nested one, with the correct initial indexes (0 and 0).

Before we start to solve the problem itself, we’ll create two enhancements to make the solution more easy.

The first enhancement is to create a derivative version of the solve function by already passing the strings as parameters.

We could choose to omit the string parameters and get them by scope, but in order to avoid any kind of side effect it is better to keep the functions pure.

The second enhancement is to create an operator to concatenate a char with a string by calling the sprintf native function.

Now, let’s implement the first case, when the two chars are the same.

We’ll do it with a simple pattern matching syntax:And that’s it.

Simple, right?Let’s continue with the next (and obvious) case, two different chars.

This case is a little bit harder, because the solution starts to branch.

Let’s take our example used before:"B" + LCS "MOAL" "LOA"In order to solve it, we need to follow the two possibles paths:Skipping the char of the first stringSkipping the char of the second stringAfter that, we compare the two obtained results and continue to solve the algorithm by keeping the longest result and discarding the other.

"B" + LCS "MOAL" "LOA" ="B" MAX_LENGTH(LCS "MOAL, LCS "LOA")Now, let’s implement it with a new nested function called “solve when different chars”, this function needs to call the derivative solve function twice, one for each possible path, then returns the longest result.

You can use the “≥” comparison as well, it’s just a definition question that isn’t necessarily related with the answer is right or wrong.

Now, let’s include a new branch on our pattern matching:The last case is the simplest one, we need to stop when the algorithm reach the end of string.

We can simply compare the indexes with the string lenght:And it’s ready!.Let’s test!And it’s looks right, isn’t?Now, let’s try with a real function, like Git-diff:… And it works?Maybe… But we aren’t able to see, because it is TOO SLOW.

How we can improve it?By using dynamic programming, a fancy name to a simple thing.

Ok, it’s not THAT simple, but we can understand dynamic programming as a process that store the intermediate results to avoid repeating the same computations.

We’ll doing this by implementing a cache matrix, yes, it need to be a mutable value, but let me tell you a secret, this isn’t necessarily wrong, in some scenarios (like this one), is totally acceptable to create a mutable values to increase performance.

This isn’t a rule, always avoid mutable values, right?They aren’t forbidden, but they do are problematic.

How exactly this cache matrix will work?.Simple, we already have two different indexes indexOriginal and indexEdited .

We’ll use they as positions in the matrix.

We also need to create this matrix internally and pass as parameter to solve function as code bellow.

Now, we need to create a new function to cache the result into the matrix and update the pattern matching to call it:Finally, we’ll add a new pattern that check the cache before compute a LCS of two chars:Yeah!.Now it’s done!Let’s try again:If your solution were developed correctly the result is shown almost instantly.

This isn’t the best solution yet, because it can throws a StackOverflowException with large texts.

We can solve it, but this is enough for now!See you in the next post!.

. More details

Leave a Reply