Creating Clean Patches

From PostgreSQL wiki

Revision as of 20:28, 12 June 2011 by Gsmith (Talk | contribs)

Jump to: navigation, search

Examples of creating better formatted patches from Greg Smith, PostgreSQL Major Contributor and proud formatting curmudgeon

If you surveyed people about how important they felt "hygiene" was to being a successful programmer, you may not get very flattering responses. But good coders are notorious for being picky about code hygiene, particularly formatting. There are some bad ways to learn this. Submitting something that's formatted badly to an open-source project with strong standards is one. Until you've collaborated with someone by exchanging patches, it's really not obvious what you should do though.

Watching new patch submitters struggle to learn good practice inspired this article. It provides a tour of the sort of boring trivia that you need to go through in order to make a good patch. In this case the target is meeting the coding standards for PostgreSQL (see "What's the formatting style used in PostgreSQL source code?" in the Developer FAQ) using git as your version control tool. But the basic concepts for what makes a good patch should apply to any project that exchanges code in the form of source code patches.

Patch coding

We need a patch to tweak here first, so here's a trivial but useful example. There can be some bottlenecks in PostgreSQL when adding new data to a table, what the database calls "relation extension". The idea is that if you allocate new space in a table or index (two common types of relations), the database sometimes makes the relation a database page (8K) bigger. A first step toward experimenting with this code would be to add some logging that confirms it's happening where I expect in the source code; that's the patch I'll explore here. I wrote a quick patch, borrowing pieces from nearby code, and the result looks like this when I populate a database using the pgbench command:

INFO:  extending relation base/16494/16507 to add block 16373
CONTEXT:  COPY pgbench_accounts, line 998754: "998754	10	0	"
INFO:  extending relation base/16494/16507 to add block 16374
CONTEXT:  COPY pgbench_accounts, line 998815: "998815	10	0	"

That's a start. I can see exactly how fast the file is growing, decode the relation filename if I want to, and I can even estimate how many rows of input data are fitting into a database page with this logging. After making the code changes, I run "git diff" to get a patch out.

Respect your reader

This is where some people make their first mistake: since the code worked, they "ship" the patch as a submission with no further work. The new code you wrote is not the final product here; the patch diff is. You should review that patch as carefully as you did the new code, look at everything it touches, before submitting it to a project. This can catch formatting errors, and fixing some of those things will make your patch smaller. A smaller patch means less work for a reviewer who is trying to decide if it's worthwhile, and less work for a committer to apply. (In smaller projects that may be the same person) Minimizing how long it takes someone to review your patch highly increases the odds it will be accepted into an open-source project.

Reducing patch diff size

Let's see what comes out of my trivial logging patch:

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f96685d..a257b9d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -420,11 +420,14 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
        bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
 
        if (isExtend)
-       {
+    { 
                /* new buffers are zero-filled */
                MemSet((char *) bufBlock, 0, BLCKSZ);
                smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
-       }
+        elog(INFO, "extending relation %s to add block %u",
+            relpath(smgr->smgr_rnode, forkNum),
+            blockNum);
+    }
        else
        {
                /*

What's all that garbage doing there? What I was hoping for is the three new "elog" lines I added to be the only thing here. But there's all this extra junk. How come? Well, I intentionally made a mistake many people make accidentally. The lines containing braces: { } I touched while developing the patch, and my editor mangled them even though the final content looks the same in the file. PostgreSQL convention says that indentation should be done with 4 character tabs. My editor changed them to spaces. This is why you see a "-" sign removing a line with the opening brace, then a changed line after it adding it right back in. To fix this, I need to go back and reformat all the lines I touched, changing the spaces before the braces to the right size tabs. I do a round of that, then get another diff:

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f96685d..a240c79 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -420,10 +420,13 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
        bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
 
        if (isExtend)
-       {
+       { 
                /* new buffers are zero-filled */
                MemSet((char *) bufBlock, 0, BLCKSZ);
                smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
+               elog(INFO, "extending relation %s to add block %u",
+                       relpath(smgr->smgr_rnode, forkNum),
+                       blockNum);
        }
        else
        {

That's better...but the remaining problem is even more subtle. Why in the world is it deleting the opening brace here and then putting it right back again? Well, in this case the error I introduced was adding some trailing whitespace. You can't see it here, but the line I touched is actually "{ ", where the original was just "{". Try highlighting it in your web browser, you may be able to see it there. That's making this diff bigger than it needs to be, as well as violating code conventions.

Invisible whitespace

At this point you might be thinking "how am I supposed to find invisible whitespace to remove it?", which is an excellent question. One option is to use a color terminal and execute the diff like this:

git diff --color

In that mode, the unwanted space I added jumps right out. Edit to fix, new diff:

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f96685d..eec261b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -424,6 +424,9 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
                /* new buffers are zero-filled */
                MemSet((char *) bufBlock, 0, BLCKSZ);
                smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
+               elog(INFO, "extending relation %s to add block %u",
+                       relpath(smgr->smgr_rnode, forkNum),
+                       blockNum);
        }
        else
        {

Now this is what a patch diff you can be proud to submit should look like. It's really obvious exactly what was changed. There are no unrelated changes confusing things. Tabs and whitespace are used correctly so that all of the new code lines up with the existing lines.

A clean patch submission is more likely to be considered for its code merits, where a messy one wastes time for the reviewer and automatically biases them against you from the start. Go back and look at what a mess the original patch was to read compared to this one. If you'd never seen this code change before, exactly what is modified is clear here. In the earlier versions, all the whitespace changes made it much harder to extract that meaning. This is despite the fact that there is no difference in the actual code here! The compiler will do exactly the same thing in both cases. This is about making the patch readable to a person though, and following the coding standards for a project improves your odds.

Squashing fixes with rebasing

Getting even more specific to git (most of the ideas here apply to any version control system), regularly rebasing is useful for several aspects to cleaner patch creation. Let's say you didn't notice the whitespace error until after you'd committed the change in your local branch. You can fix that in an extra commit, but you're going to have a messier code history. If you didn't send your local tree anywhere else yet, there's one way you can deal with this easily though. Commit your whitespace fix as a new commit, then run:

git rebase -i origin/master

Where you substitute the name of your origin branch, the one you want to create a diff against, for the last part there. You'll see a history like this:

pick 57c7bfe Awesome new feature
pick 08dcf68 Whitespace cleanup

What you can do now is edit this so that the second patch is "squashed" into the first, by changing this to look like:

pick 57c7bfe Awesome new feature
s 08dcf68 Whitespace cleanup

Save that file, and you'll get a second editor screen to adjust the combined commit message. Erase the one related to the whitespace cleanup, save that, and now this little whitespace mistake is history. I'll sometimes do that shuffle several times, each time compacting a bit more of the whitespace or code down:

git diff --color HEAD^ HEAD
vi <file>
git commit -a
git rebase -i origin/master

This cycle is worth practicing on test code: commit, diff, fix whitespace, rebase squash, new diff. It's unfortunately easy to corrupt your work using rebase, so be careful with it. I will normally save a copy of the initial patch, just out of paranoia, each time through this loop. It's useful as a backup if you make a mistake, and you can use tools like "wc" to track improvements in reducing its size during each loop. (There's certainly a faster way to accomplish this series of steps using git, but this approach is simple to understand and remember)

Learning how to use rebase well is also important if you developed your patch over a long enough period of time that the master tree you're working against has changed significantly. You can move earlier commits so they are the most recent ones in the commit history with it, which is also critical to get a clean diff that applies to a project you're submitting to.

Patient diffs

Sometimes diff produces output that, while clean, isn't the best way to represent a new addition. The most common case is where a code block using braces ends up out of sync by one line, where your new code is shown as re-using an opening or closing brace from existing code in an odd way. There is an alternate method diff can used, called "patience", that often does better in this situation:

git diff --color HEAD^ HEAD --patience

The description and example at Patience Diff Advantages show how this might work. If your code is offset strangely because it re-uses existing lines that happen to be in your addition, you should try patience mode and see if it produces a cleaner patch.

Code reading and patch diff creation

Reading your code changes as a patch is a great exercise for lots of reasons. I've found plenty of errors by reviewing my own code this way. Sometimes seeing the diff makes me think of ways I can reduce the code footprint of what I've changed, which can lead to a smaller and cleaner patch. For a substantial patch, successful patch contributors can easily spend an hour or more doing this sort of work before submitting to a project, tweaking the formatting to shrink the patch to the bare minimum. It makes your code better, and it makes it more likely to be read, accepted, and committed. When I am merging submissions from contributors to one of the projects I lead, or reviewing submissions to PostgreSQL, seeing basic formatting errors or a sloppy diff means I start out assuming the rest of the code isn't likely to be clean, either. That's not the first impression you want to leave.

You might think this is all boring grunt work, but getting the details of your code perfect is one of the habits that great programmers cultivate. Minimizing the size of your diff, and making sure it's as clean as possible, is one of the most important skills you can achieve if you want to contribute code to any software project. The patch is not just the output from a coding process: it's a product in its own right, one that you should carefully and directly monitor for good quality.

Personal tools