When to Use Each of the Git Diff Algorithms

 git diff is a Git command to show changes between the working tree and the index or a tree, changes between the index and a tree, changes between two trees, changes resulting from a merge, changes between two blob objects, or changes between two files on disk (source).

As of time of writing, the latest version of git is 2.28.0, and it supports a total of 4 diff algorithms, namely myers, minimal, patience, and histogram. In the following sections, I give my take on when each of these algorithms should be used. This post does not cover a breakdown of how the algorithm works, and/or its complexity – you can find these via a quick search.

Before diving into the details, one caveat is that YMMV based on the 1) types of files being compared, 2) changes within the files, and possibly 3) package/version of git you are using. It is possible for one out of a plethora of changes to impact the effectiveness of the diff algorithm(s). I recommend to treat this post as a starter guide to figure out which algorithm suits your needs the best.

‐‐diff-algorithm={myers|default}

The default greedy algorithm, originally developed by Eugene W. Myers in 1986, is used when myers, default or no explicit algorithm is specified. It works well for the majority of scenarios, providing a good readable diff, with reasonable size and runtime.

When to use myers:

  • Works well for most types of files and changes
  • If you are unsure which diff algorithm to use

‐‐diff-algorithm=minimal

The myers algorithm uses a heuristic to ensure a decent trade-off between execution time and input size. However, there are cases where one might favor having the smallest diff size over the execution time. In the Linux diff utility, one can pass in a flag to disable the heuristic and run the full set of algorithm iterations. The following paragraph about the Linux diff command sums it up well:

The basic algorithm is described in: “An O(ND) Difference Algorithm and its Variations”, Eugene Myers, Algorithmica Vol. 1 No. 2, 1986, pp. 251-266; … Unless the –minimal option is specified, this code uses the TOO_EXPENSIVE heuristic, by Paul Eggert, to limit the cost to O(N**1.5 log N) at the price of producing suboptimal output for large inputs with many differences.

https://wiki.c2.com/?DiffAlgorithm

The minimal diff algorithm is the equivalent in Git context.

We can see a comparison of diff changes between myers vs minimal by using a large enough input size. In the example below, I compare the number of changes between two sets of Wikipedia articles when using each algorithm:

# Prepare the data for diff'ing
$ curl {article-A,article-B,article-C,article-D} > wikipediaSet1
$ curl {article-E,article-F,article-G,article-H} > wikipediaSet2

# Diff with myers algorithm
$ git diff --diff-algorithm=myers --shortstat wikipediaSet1 wikipediaSet2
 1 file changed, 7990 insertions(+), 4463 deletions(-)

# Diff with minimal algorithm
~$ git diff --diff-algorithm=minimal --shortstat wikipediaSet1 wikipediaSet2
 1 file changed, 7712 insertions(+), 4185 deletions(-)

Note that its possible to use the minimal diff algorithm with a small input size, though the output is likely to be identical to myers as the problem space to iterate over is within the heuristic.

When to use minimal:

  • Comparing many changes between large files
  • Do not mind spending more time to generate a smaller set of diffs/changes

‐‐diff-algorithm=patience

The patience algorithm was invented by Bram Cohen (creator of BitTorrent), and the key difference is that it matches similar lines between two files differently. This paragraph from James Coglan’s post sums it up nicely:

The first thing to note is that patience diff is not a diff algorithm in and of itself. What it really is is a method of matching up lines in two versions of a document in order to break them into smaller pieces, before using an actual diff algorithm like Myers on those pieces.

https://blog.jcoglan.com/2017/09/19/the-patience-diff-algorithm/

To demonstrate how patience can produce better diffs than myers, let’s first look at the diff produced by myers based on two pseudo-code Java files (for raw files refer to this post’s appendices):

$ git diff --diff-algorithm=myers File1 File2

diff --git a/File1 b/File2
index 38ee31b..2624ba3 100644
--- a/File1
+++ b/File2
@@ -1,20 +1,24 @@
 public class File1 {
 
-  public int add (int a, int b)
+  public int sub (int a, int b)
   {
+    // TOOD: JIRA1234
+    if ( isNull(a, b) )
+    {
+        return null
+    }
     log();
-    return a + b;
+    return a - b;
   }
 
-  public int sub (int a, int b)
+  public int mul (int a, int b)
   {
-    if (a == b)
+    if ( isNull(a, b) )
     {
-        return 0;
+        return null;
     }
     log();
-    return a - b;
-    // TOOD: JIRA1234
+    return a * b;
   }
 
 }

We can observe that the myers diff :

  • Matched lines with just brackets between the two files (not useful to understand changes)
  • Added and removed the sub() method in the same file (in actual fact it was just reordered)

It is not easy to understand at a diff like this, as almost every line is a change. The reviewer can easily miss edge cases and/or bugs in the logic.

Let’s now see the diff produced by patience:

$ git diff --diff-algorithm=patience File1 File2

diff --git a/File1 b/File2
index 38ee31b..2624ba3 100644
--- a/File1
+++ b/File2
@@ -1,20 +1,24 @@
 public class File1 {
 
- public int add (int a, int b) - { - log(); - return a + b; - } -
public int sub (int a, int b) {
- if (a == b) - { - return 0; - } - log(); - return a - b;
// TOOD: JIRA1234
+ if ( isNull(a, b) ) + { + return null + } + log(); + return a - b; + } + + public int mul (int a, int b) + { + if ( isNull(a, b) ) + { + return null; + } + log(); + return a * b;
} }

With this diff produced by patience, its much easier to observe the changes:

  • add() method was removed
  • mul() method was added
  • sub() method has some logic changes

When to use patience:

  • Reordering code/content and myers diff matches trivial lines
  • The same lines are added and removed in the same file

‐‐diff-algorithm=histogram

The histogram diff algorithm is an improvement on patience, which originated from jgit. This stackoverflow question explains it well, and links to the jgit HistogramDiff class that contains a detailed explanation:

By always selecting a LCS position with the lowest occurrence count, this algorithm behaves exactly like Bram Cohen’s patience diff whenever there is a unique common element available between the two sequences. When no unique elements exist, the lowest occurrence element is chosen instead. This offers more readable diffs than simply falling back on the standard Myers’ O(ND) algorithm would produce.

https://github.com/eclipse/jgit/blob/ebfd62433a58d23af221adfdffed56d9274f4268/org.eclipse.jgit/src/org/eclipse/jgit/diff/HistogramDiff.java#L65-L70

To better understand why this approach is better, let’s see one of the earlier changes when using the patience diff algorithm:

$ git diff --diff-algorithm=patience File1 File2
[...]

   public int sub (int a, int b)
   {
- if (a == b) - { - return 0; - } - log(); - return a - b;
// TOOD: JIRA1234
+ if ( isNull(a, b) ) + { + return null + } + log(); + return a - b; + }
[...]

As part of the logic changes within the sub() method, I moved the TODO comment from the last to first line. Unfortunately, patience seems to think that the comment remained unchanged between the two files while the original lines were removed.

However, this representation is not incorrect per se, but it does suggest that all original logic was thrown out (including the log() and return statements) and a new set inserted.

We can show the changes better with historgram:

$ git diff --diff-algorithm=histogram File1 File2

[..]

   public int sub (int a, int b)
   {
-    if (a == b)
+    // TOOD: JIRA1234
+    if ( isNull(a, b) )
     {
-        return 0;
+        return null
     }
     log();
     return a - b;
-    // TOOD: JIRA1234
+  }

[...]

Now we can clearly see that the two changes are: 1) reordering of the comment, and 2) a new input sanitization check.

When to use histogram:

  • Comparing code changes, according to this fairly recent research done by Nugroho et al (2019)
  • When patience seems to have trouble representing some of the changes made

Summary

Given the set of all possible changes across various types of files (e.g. source code, YAML, documentation), each diff algorithm of the git diff utility is better suited for a specific subset. Although the default myers algorithm works well most of the time, there would be times where a change can be better represented using another approach.

I believe there will be a set of changes/files where none of the provided diff algorithms will work perfectly. In such cases, either go with a sub-optimal diff output, or look outside the git diff utility.

Citations

[1] Nugroho, Y.S., Hata, H. & Matsumoto, K. How different are different diff algorithms in Git?. Empir Software Eng 25, 790–823 (2020). https://doi.org/10.1007/s10664-019-09772-z

Appendices

File1

public class File1 {

  public int add (int a, int b)
  {
    log();
    return a + b;
  }

  public int sub (int a, int b)
  {
    if (a == b)
    {
        return 0;
    }
    log();
    return a - b;
    // TOOD: JIRA1234
  }

}

File2

public class File1 {

  public int sub (int a, int b)
  {
    // TOOD: JIRA1234
    if ( isNull(a, b) )
    {
        return null
    }
    log();
    return a - b;
  }

  public int mul (int a, int b)
  {
    if ( isNull(a, b) )
    {
        return null;
    }
    log();
    return a * b;
  }

}

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s