It’s wonderful… except for one flaw.
Forking in git (without GitHub)
One the best features of git (and also mercurial/hg) is it’s almost painless to clone/fork, make changes and continue to merge in code from other forks as you develop. Git also makes it nearly painless to share your changes and for others to merge your changes in back in. This is why it’s called a distributed source control.
Before git and mercurial, merging was a nightmare in most monolithic source control systems and something that was to be avoided. The only real solution was have everyone work in one repo and work on a single version at a time which doesn’t scale.
Even though git makes it easy to fork code and share changes, you can end up with a lot of good code lurking around that never makes its way back upstream or into the hands of others who could use your changes. If I fork and fix a bug that is useful to others, it can be a lot of effort on my part to share my changes. It’s advantageous for me to get my code back upstream so that I don’t have continue to support the changes I made, nor do I have to have to worry about potential conflicts when an upstream codebase I follow makes changes later.
While git makes it easier to share, it’s still a lot of work to actually get the word about your changes. Depending on the project, the options you have vary. You might have post a bug report with a patch, send an email out to the other developers, post your own fork up somewhere, etc. It’s all a little hit or miss if anyone will see it and it’s often more trouble then it’s worth.
GitHub gives forks visibility
Github does an amazing job of solving this problem. It gives forks a centralized hub to find other forks and increases code visibility. In a way, it creates a real social community of code.
This is why I love GitHub so much. Their tools give me a unique value that I can’t get anywhere else. It’s not like SourceForge or another project host since it provides real value and that couldn’t provide on my own by just by getting my own server. It’s the community that makes it valuable.
There is still plenty of room for improvement around forking on GitHub though.
The problem with Github’s approach
The entire problem can be summed up into one statement:
Not all forks are equal and no one repo is necessarily any more important than any other (including the original repo).
Forking is easy as hell on GitHub. It takes just a single click and you have a publicly visible hosted fork of a project for free. Why you forked and what you are doing with that fork is not visible to anyone else though. You could be making magic or just experimenting.
Not every fork is equal
I fork for a lot of different reasons.
- I want to add a feature or fix a bug but I want to share the changes upstream or with other people that may find it actually useful.
- I fixed a bug for my own environment. These changes may break everyone else and so no body should probably use this fork.
- I want to lock a version of a project away in a safe place that I know won’t change or break later and may use it as a point to send changes back up later. (This partially due to some design issues with git submodules.)
- I want to experiment. My changes are probably interesting but not ready for primetime, but if it works out it maybe come something fruitful.
On Github, every fork is just listed in the network of forks. There is no clear way to figure why someone forked.
When I come across a fork, I have no idea if the fork is important and if someone should be really looking at it at all to find anything of value. I’m forced to look at the commits and make a judgement on my own.
Bitbucket solves this partially by allowing you to give your fork a label.
Github gives too much value to the repo that you forked from
There are few problems with this. Github treats forks differently than the original source. They call this repo the “root”.
This is horribly broken. It’s impossible to say my version isn’t more important than the guy I forked from. If you watch Linus’s Google Talk on Git, this goes against one of the values he talked about with Git.
Here is what network membership pages look like right now:
Github displays all the projects in a hierarchy of forks.
Contrast that with the design bitbucket uses for their membership page.
Bitbucket gives details as to why the project was forked, when it was created, when it was last updated, and how popular the fork is.
The current way Github handles this fails for several reasons.
- When I pull changes from a project, there are times I’m not always pulling from one upstream location. Who I chose to fork from is irrelevant.
- My version my be a fundamental new project on it’s own. (like a port for a specific platform)
- Upstream is no longer being maintained and I’ve moved the project on.
Rather then have a “root repo”, every repo should just have piers. I wish they would turn this off and treat forks in a uniform way (more on that later).
This design is broken.
My co-worker Michael at SeatMe has a project that has over 330 forks on a project that he created but hasn’t maintained at all since 2009. There are several big forks but there is one big one has taken over the main development of his old project. Unfortunately everyone still follows, forks, and pushes pull requests to Michael’s repo and not the other forks. They have no idea the other fork exists.
At SeatMe, we maintain a fork of AFNetworking with a number of improvements that solve bugs. Early on it started out where we had made some minor changes to fix some critical issues to get our code running.
Our changes never made it upstream because the upstream maintainer didn’t feel it was necessary saying the fix seemed to “complicated for something that does not appear to be an issue for a vast majority of people”. Since we can’t get upstream to take our changes it means that we need to continue to maintain the changes in our fork to fix the bug.
As we continue to innovate new features on top of those changes to support our app, our fork is growing more and more divergent. While we do try to send as many changes as we can upstream that we know may not conflict, we are the road to having a hard fork that others maybe interested in.
We also have a fork of Chameleon that targets iOS 5 features. Upstream doesn’t want to merge in changes right now because it will break their current release schedule but others are continuing to fork them without being able to find the latest and greatest code. Often this leads to the same bugs people are having getting fixed in several different forks.
This is all really suboptimal.
No real work around
One work around is to upload the project as brand new repo, but you get a warning lightbox on the create repo page.
Doing this also breaks the other features of the site like being see a list of other forks and it prevents being able to send pull requests to other repos that you share code with. If you already had the fork up and you want to push a new repo still, you will also loose all the people watching the project as well.
I asked Github what to do in this situation in support request and got this response.
From: xxx (GitHub Staff) Subject: Contact “Forked from” line on pages
There are two options:
Unfork you fork by deleting it, creating a fresh independent repo and pushing your local copy again to it
Ask the parent repo to ask us to make yours the new network root
Neither of these is viable.
Ideas to improve GitHub
Stop treating the “root” special.
GitHub needs to get rid of the link to repo that the current repo was forked from. Maybe replace it with a link to network tab instead.
Bitbucket shows the description instead of the link everywhere. When you fork, it changes the description default text of the fork so links back by to the repo you forked from. You are free to change this text if I want at any time to say what the fork is about.
This is probably arguable but I believe that implying the root repo is any more special is against the spirt behind git. Linus once said that people pull from his git repo but the who the hell am I? You can do what you want with your version and if people like your changes, they can use them.
Through out the site (like the dashboard for example) GitHub should not make a distinction if my code is the original source or a fork. More than 80% of my public code is in the form of a fork and not an original repo. For that I don’t get nearly as much credit. Over and over, people that visit my repo pages will people skip my fork they landed on and go right to that link instead and follow the upstream.
I was able to test this with my hackathon project, CSSApply, where we started in one repo under a user and instead of moving it we created a fork of it in the organization before we realized we couldn’t undo it without loosing all the watchers or followers. People still follow the root project following that root link on the fork even though the latest code is in the CSSApply organization.
Replace the “source” and “parent” link in the repo API
When you hit the GitHub api, you get a source and parent node for the repo.
Replace this with repo network api to list find associated forks. Change the graph to be a bunch of loose association instead of a hierarchy.
We need need better visibility into the forks.
We need place to list out forks and show them off with everyone else that has a fork. Github us post details about our fork and give a summary of the changes we are making (this could be in the description possibly).
This should be the first thing you see on the “Network” tab on each repo instead of the graph. Branches in each repo also need to be treated as separate forks for this list.
Maybe give us the ability to sort the list of forks by followers, number of differences, or last commit date.
Forks with no changes need to be hidden.
Forks are almost to easy to create. Forks get created constantly and go no where.
Forks with no changes are more like tagged placeholders at a version. They don’t need to be listed to anyone else.
The network graph page solves this but the membership page has no filter currently.
I would love if GitHub supported a model where if I forked a repo at a version and made no changes, it treated it like a private repo. It shouldn’t be visible to anyone except me (unless someone hits the url directly) until I push my first commit that is different than the upstream. At that point it should flip to public. This would clean up some of the fork soup we see on pages. (This may be partly why bitbucket allows private repos for free).
GitHub needs a way to link projects together and they use the fork button as their mechanism for doing so.
However this can be done in a much better way. We have a way to detect common ancestries in git object graphs already (used to merge changes) that could be extended and scaled up to automatically find links across repos. If GitHub is saving space by only storing objects once through out the site (and I hope they are), all they need to do is add back references. Each object is already has SHA1 hash, so in theory, you cloud use a big key-value store to build a simple graph of links between projects. You could build links progressively on each push.
This can’t even be more clear than this case where the Doom3GPL was forked, fixed to compile on MacOSX, and then updated again to a new repo. As a hackernews commenter points out tonight, It’s now not possible to see the project as a fork of the other project in the network graph.
Pull requests need more options
Some projects don’t want pull requests and it would be nice to opt-out. The project could be an unmaintained fork or maybe a mirror. Linus mentioned this when he was asked about his thoughts on Github.
Some projects have a policy on how to send changes upstream and require copyright assignment is given.
A project should be able to require that pull requests assign copyright or they require the person sending the pull request does so under a specific or compatible license.
If GitHub can change this model it would dramatically improve the experience around forking.
Even given all these issues, I still love GitHub.