Oto Šťáva
Professional problem solver
September 12, 2025

Diffing Word and Writer documents with Git

TL;DR: Skip my blabbing and go directly to the configuration

Diff of a DOCX file with a review comment in Forgejo

At my day job, we have recently been evaluating Forgejo—obviously primarily for hosting of our software sources with Git—but we have also been pondering other possible use-cases. Since the company is pretty small, we want to keep the number of tools we use rather low, so as to keep the system-administrative workload at a minimum.

One such pondered use-case for Forgejo is the review of non-technical legal and policy documents. These are generally created with Microsoft Word (docx) and/or LibreOffice Writer (odt), from templates, complete with the company’s standardized visual identity. To emphasize, the goal of this endeavour cannot be that my non-technical colleagues will magically switch from Word to Markdown, reStructuredText, or LaTeX. That may be my perfectionist-programmer-ass pipe dream, but we live in the real world.

So the question is this: can we get Git and Forgejo to show meaningful differences between two versions of a Word/Writer document? Spoiler: the answer is of course yes, but first, let’s look at what failed (or you can always go TL;DR).

What does not work

At this point I think it is pretty common knowledge amongst the tech-savvy crowd that both docx and odt are “just” specifically structured ZIP archives containing a bunch of XML files and assets (like images) that comprise the document. My initial idea then was very simple – could we get Word and Writer to open and edit documents unpacked? That way, we could simply keep the whole structure in a Git repository, making diffing and even merging pretty easy, since XML is a text-based format.

Unfortunately, Writer only supports working with properly packaged documents, and a cursory internet search indicated that Word is the same in this regard. I checked LibreOffice both in the GUI, which offers no way of opening entire directories, and from the terminal. Simply running libreoffice . in a directory containing an unpacked document results in LibreOffice not even starting – it shows the splash screen, then exits with code 0 as if everything was cool and dandy.

While I was at it, I also took a look at an ODT’s XML files in a text editor. I really should not be surprised, but apart from the nicely formatted manifest.rdf, all of the other XML documents are simply dumped on a single line. This would make diffing quite the nightmare. Even if I did some automated reformatting, the OOXML format used by Office would still be quite unreadable due to historical cruft. Not to mention that I probably cannot expect non-techies to dig through XML syntax while reviewing document changes.

Again, none of this is particularly surprising. The files are not really meant to be perused by humans. But it further drives the point home that this approach is not really feasible.

Pandoc to the rescue

So, as a surprise to virtually no-one, Git is really powerful. One source of this power is that you can hook arbitrary programs into many parts of its machinery, modifying the outcome. Today, we are going to use diff drivers to hook programs into Git’s diffing mechanism. Diff drivers are declared by adding [diff "driver-name"] sections to your Git configuration, allowing us to change git diff’s behaviour quite substantially – we can even completely substitute the diffing program this way. What we are most interested in right now, though, is the textconv property.

In simple terms, textconv allows Git to call upon an arbitrary program to do a one-way conversion from an arbitrary file format to a preferably plain-text representation. Emphasis on one-way, because this is really only useful for reviewing diffs. You cannot use this to patch files.
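To make the idea more concrete, here is a rough sketch of what a textconv-based diff amounts to when done by hand. The docdiff function is a hypothetical helper made up for illustration, not something Git or pandoc ships; Git manages its own temporary files and diff output, so treat this as an approximation of the mechanism rather than a description of Git's internals. It uses the same pandoc invocation as the configuration later in this post.

```shell
# Hypothetical helper (not part of Git or pandoc): approximates a textconv diff by hand.
# Usage: docdiff OLD_DOCUMENT NEW_DOCUMENT
docdiff() {
    old_txt=$(mktemp)
    new_txt=$(mktemp)
    # One-way conversion of each version to Markdown; pandoc guesses the
    # input format from the file extension (.docx or .odt).
    pandoc --sandbox --to=markdown "$1" > "$old_txt"
    pandoc --sandbox --to=markdown "$2" > "$new_txt"
    # Ordinary text diff of the two conversions; diff exits with 1 if they differ.
    diff -u "$old_txt" "$new_txt"
    status=$?
    rm -f "$old_txt" "$new_txt"
    return "$status"
}
```

Calling docdiff old.docx new.docx prints a unified diff of the Markdown renderings. The original documents are never touched, which is also why such a diff cannot be applied back to them as a patch.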

How to actually configure this

First, make sure you have pandoc installed. It is available in most of the major Linux distributions’ repositories, so just call upon your apt, pacman, or whatever floats your distro’s boat.

Update (2025-09-12): Shortly after publishing the post, it was brought to my attention that pandoc may access external files when run on certain types of untrusted inputs. Based on my understanding of Pandoc’s official note on security, this should not be a problem for odt/docx inputs and Markdown outputs, but other combinations might be risky. It is recommended, especially when dealing with untrusted files, to use the --sandbox option when calling Pandoc like this. The examples in this post were amended to include the --sandbox option to be on the safe side.

Git

Once pandoc is installed, define this diff driver in your Git configuration (for example in ~/.gitconfig, or in a single repository’s .git/config):

[diff "pandoc"]
	textconv = pandoc --sandbox --to=markdown

Finally, to actually use the driver, you also need to configure Git attributes for odt and docx files. You may do that either globally in your ~/.config/git/attributes file, or in a .gitattributes file inside of your repositories:

*.odt diff=pandoc
*.docx diff=pandoc

Note that pandoc only supports the newer docx format and not doc. Hopefully nobody is using that old thing anymore, but just so you know.
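If you would rather try this out before touching your real configuration, the following sketch sets up the same driver and attributes in a throwaway repository and asks Git which diff driver it would pick. The temporary repository and the file names contract.docx, policy.odt, and notes.txt are placeholders of my choosing; the files do not even need to exist for git check-attr to answer.

```shell
# Scratch repository so that no real configuration is modified.
repo=$(mktemp -d)
git init --quiet "$repo"

# Same driver as above, stored in the repository-local config.
git -C "$repo" config diff.pandoc.textconv "pandoc --sandbox --to=markdown"

# Same attribute lines as above, in the repository's .gitattributes.
printf '*.odt diff=pandoc\n*.docx diff=pandoc\n' > "$repo/.gitattributes"

# Ask Git which diff driver applies to a few example paths.
attr_check=$(git -C "$repo" check-attr diff -- contract.docx policy.odt notes.txt)
echo "$attr_check"

# Read back the command Git will run for the "pandoc" driver.
driver_cmd=$(git -C "$repo" config --get diff.pandoc.textconv)
echo "$driver_cmd"
```

The two document paths should be reported with diff: pandoc, while notes.txt comes back as diff: unspecified, meaning Git falls back to its regular text diff for it.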

Forgejo

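If you want to do this same transformation in Forgejo’s diffs, make sure pandoc is installed on the Forgejo host, then put this [git.config] section in your app.ini. The core.attributesFile path is a placeholder; point it at a file containing the same attribute lines as above: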

[git.config]
core.attributesFile = /path/to/git/attributes
diff.pandoc.textconv = pandoc --sandbox --to=markdown

Bonus round: Document previews in Forgejo

Additionally, you can use pandoc in a very similar fashion to preview documents (when not diffing) in Forgejo without downloading them. Mind you, it will not be perfect – only properly semantically formatted text will work well, the document will not be split into pages, and you won’t see any images embedded in the document. But for basic previews to make sure you are looking at the right document, or to quickly check some facts in said document, I think this is pretty neat.

To do this, you can use [markup.*] sections in the app.ini file, one per file format:

[markup.docx]
ENABLED = true
NEED_POSTPROCESS = false
FILE_EXTENSIONS = .docx
RENDER_COMMAND = pandoc --sandbox --from=docx --to=html

[markup.odt]
ENABLED = true
NEED_POSTPROCESS = false
FILE_EXTENSIONS = .odt
RENDER_COMMAND = pandoc --sandbox --from=odt --to=html

The main difference here is that Forgejo (as opposed to Git) does not provide the file name to Pandoc. It instead provides the file contents on the standard input, meaning that Pandoc cannot determine the file type from the name. To remedy this, we need to specify the --from= argument to let Pandoc know what the source file format is. We are also converting directly to HTML instead of Markdown, because we don’t need the extra step – Forgejo would need to convert from Markdown to HTML to display the file to the user anyway.
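You can reproduce this standard-input behaviour locally before changing app.ini. The sketch below defines preview_docx, a hypothetical helper named just for this example, which feeds a document to pandoc on standard input the way the paragraph above describes; it is only an approximation of how Forgejo invokes RENDER_COMMAND, not a copy of its internals.

```shell
# Hypothetical helper: pipes a document into pandoc via stdin, mimicking the
# way the render command receives file contents instead of a file name.
# Usage: preview_docx FILE.docx > preview.html
preview_docx() {
    # No file name reaches pandoc here, so the input format must be
    # given explicitly with --from=docx.
    pandoc --sandbox --from=docx --to=html < "$1"
}
```

Running preview_docx on one of your documents and opening the resulting HTML in a browser gives a fair impression of what the Forgejo preview will look like. For odt files, swap in --from=odt, as in the [markup.odt] section above.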