Git's Data Model (and Other Document Updates)

Hello! This past fall, I decided to take some time to work on Git’s documentation. I’ve been thinking about working on open source docs for a long time – usually if I think the documentation for something could be improved, I’ll write a blog post or a zine or something. But this time I wondered: could I instead make a few improvements to the official documentation?

So Marie and I made a few changes to the Git documentation!

a data model for Git

After a while working on the documentation, we noticed that Git uses the terms “object”, “reference”, or “index” in its documentation a lot, but that it didn’t have a great explanation of what those terms mean or how they relate to other core concepts like “commit” and “branch”. So we wrote a new “data model” document!

You can read the data model here for now. I assume at some point (after the next release?) it’ll also be on the Git website.

I’m excited about this because understanding how Git organizes its commit and branch data has really helped me reason about how Git works over the years, and I think it’s important to have a short (1600 words!) version of the data model that’s accurate.

The “accurate” part turned out to not be that easy: I knew the basics of how Git’s data model worked, but during the review process I learned some new details and had to make quite a few changes (for example how merge conflicts are stored in the staging area).

updates to `git push`, `git pull`, and more

I also worked on updating the introduction to some of Git’s core man pages. I quickly realized that “just try to improve it according to my best judgement” was not going to work: why should the maintainers believe me that my version is better?

I’ve seen a problem a lot when discussing open source documentation changes where 2 expert users of the software argue about whether an explanation is clear or not (“I think X would be a good way to explain it! Well, I think Y would be better!”)

I don’t think this is very productive (expert users of a piece of software are notoriously bad at being able to tell if an explanation will be clear to non-experts), so I needed to find a way to identify problems with the man pages that was a little more evidence-based.

getting test readers to identify problems

I asked for test readers on Mastodon to read the current version of documentation and tell me what they find confusing or what questions they have. About 80 test readers left comments, and I learned so much!

People left a huge amount of great feedback, for example:

terminology they didn’t understand (what’s a pathspec? what does “reference” mean? does “upstream” have a specific meaning in Git?)
specific confusing sentences
suggestions of things things to add (“I do X all the time, I think it should be included here”)
inconsistencies (“here it implies X is the default, but elsewhere it implies Y is the default”)

Most of the test readers had been using Git for at least 5-10 years, which I think worked well – if a group of test readers who have been using Git regularly for 5+ years find a sentence or term impossible to understand, it makes it easy to argue that the documentation should be updated to make it clearer.

I thought this “get users of the software to comment on the existing documentation and then fix the problems they find” pattern worked really well and I’m excited about potentially trying it again in the future.

the man page changes

We ended updating these 4 man pages:

git add (before, after)
git checkout (before, after)
git push (before, after)
git pull (before, after)

The git push and git pull changes were the most interesting to me: in addition to updating the intro to those pages, we also ended up writing:

a section describing what the term “upstream branch” means (which previously wasn’t really explained)
a cleaned-up description of what a “push refspec” is

Making those changes really gave me an appreciation for how much work it is to maintain open source documentation: it’s not easy to write things that are both clear and true, and sometimes we had to make compromises, for example the sentence “git push may fail if you haven’t set an upstream for the current branch, depending on what push.default is set to.” is a little vague, but the exact details of what “depending” means are really complicated and untangling that is a big project.

on the process for contributing to Git

It took me a while to understand Git’s development process. I’m not going to try to describe it here (that could be a whole other post!), but a few quick notes:

Git has a Discord server with a “my first contribution” channel for help with getting started contributing. I found people to be very welcoming on the Discord.
I used GitGitGadget to make all of my contributions. This meant that I could make a GitHub pull request (a workflow I’m comfortable with) and GitGitGadget would convert my PRs into the system the Git developers use (emails with patches attached). GitGitGadget worked great and I was very grateful to not have to learn how to send patches by email with Git.
Otherwise I used my normal email client (Fastmail’s web interface) to reply to emails, wrapping my text to 80 character lines since that’s the mailing list norm.

I also found the mailing list archives on lore.kernel.org hard to navigate, so I hacked together my own git list viewer to make it easier to read the long mailing list threads.

Many people helped me navigate the contribution process and review the changes: thanks to Emily Shaffer, Johannes Schindelin (the author of GitGitGadget), Patrick Steinhardt, Ben Knoble, Junio Hamano, and more.

(I’m experimenting with comments on Mastodon, you can see the comments here)

a data model for Git

updates to git push, git pull, and more

getting test readers to identify problems

the man page changes

on the process for contributing to Git

updates to `git push`, `git pull`, and more