Blog entries tagged git :: mwop.net

Advent 2023: (n)vim Plugins: vim-fugitive

contact@mwop.net (Matthew Weier O'Phinney) — Mon, 18 Dec 2023 15:34:00 -0600

Because I've spent most of my professional life coding, I've also spent a lot of time using source control. I've been using git specifically for many years (even pre-dating the Zend Framework migration from Subversion). While I typically use a terminal multiplexer (for me, that's tmux; for others, that might be screen), and can move to another pane or create one quickly in order to run source control commands, doing so interrupts flow.

That's where vim-fugitive comes into play.

What does it solve?

Fugitive integrates with git, plain and simple. It exposes a number of commands and functions that allow you to do common operations quickly, but also has some deeper bindings to allow doing more complex things such as viewing a file from previous commits, or performing a diff between the staged and working version, or using git blame within vim.

How do I use it?

Admittedly, I use a very small subset of what Fugitive provides.

On a daily basis, I use :Gwrite to stage changes, and :G to view the status of the working tree. When in the status view, I often use cc to commit changes, which splits open a pane for writing the commit message. I also use :GRemove when I want to remove a file from the tree.

Something else that has come in handy when reviewing code with others: :GBrowse can open the file in the canonical repository, using the visual selection as the line range, allowing you to quickly share a link to specific code to review.

Final Thoughts

This plugin does exactly what it says on the tin. I love the fact that it integrates with the underlying git command, as that follows the Unix Philosophy of doing one thing well, and piping out to other processes to perform complex behavior. For me, the fact that I can stay directly within my editor and still get full access to git when needed is tremendously powerful.

Advent 2023: (n)vim Plugins: vim-fugitive was originally published 18 December 2023 on https://mwop.net by Matthew Weier O'Phinney.

Splitting the ZF2 Components

contact@mwop.net (Matthew Weier O'Phinney) — Fri, 15 May 2015 19:30:00 -0500

Today we accomplished one of the major goals towards Zend Framework 3: splitting the various components into their own repositories. This proved to be a huge challenge, due to the amount of history in our repository (the git repository has history going back to 2009, around the time ZF 1.8 was released!), and the goals we had for what component repositories should look like. This is the story of how we made it happen.

Why split them at all?

"But you can already install components individually!"

True, but if you knew how that occurs, you'd cringe. We've tried a variety of solutions, and every single one has failed us at some point or another, typically when we move to a new minor version of the framework, but occasionally even on trivial bugfix releases. We've tried filter-branch with subdirectory-filter, we've tried subtree split, and even subsplit. We've used manual scripts that rsync the contents of each commit and create a reference commit. Our current version is a combination of several approaches, but we've found we must run it manually and verify the results before pushing, as we've had a number of situations, as recently as the 2.4.0 release, where contents were not correct.

On top of all this, there's another concern: why do all components get bumped in version, even when no changes are present? As an example, a number of components have had zero new features since the 2.0 release; they're either stable, or have smaller user bases. It doesn't make sense to bump their versions, but they get bumped regardless whenever we do a new release of the framework. When we start considering a new major version of the framework, it doesn't necessarily make sense to bump such components, as there will be literally zero breaking changes, and, in many cases, no new features.

In other cases, such as the EventManager, ServiceManager, and a handful of other components, we know that these will require major versions due to necessary architectural changes. However, as long as we're still developing minor release branches of the framework, we cannot have meaningful development on those features due to the complexities of keeping changes in sync between branches.

In short, we'd like to be able to version the individual components separately, in their own cycles.

On top of that, when we look at maintenance, having a monolithic repository poses a challenge: we have to limit the number of developers with commit rights to ensure that those who can commit are aware of the impact a change might have across the framework. This means that a number of developers with time and energy to spend on improving a single component or small subset of components are hampered by how quickly their changes can be reviewed by the maintainers.

Splitting the components gives us the opportunity to expand the number of contributors with commit access. The framework itself can pin to specific versions of components, and maintainers with commit access to the framework can review and change those versions based on integration and smoke tests. In the meantime, a larger set of contributors can be gradually improving the individual components, and users can selectively adopt those new versions into their applications, on their own review cycles.

In the end:

We get components that follow Semantic Versioning properly.
We get accelerated development in components that need it.
We expand the number of active, able maintainers.
We enable users to adopt new features at their own pace.
We retain framework stability.

The Goal

Since we branched ZF2 development, our repository has looked something like the following:

.coveralls.yml
.gitattributes
.gitignore
.php_cs
.travis.yml
bin/
CHANGELOG.md
composer.json
CONTRIBUTING.md
demos/
INSTALL.md
library/
    Zend/
        {component directories}
LICENSE.txt
README-GIT.md
README.md
resources/
tests/
    _autoload.php
    Bootstrap.php
    phpunit.xml.dist
    run-tests.php
    run-tests.sh
    TestConfiguration.php.dist
    TestConfiguration.php.travis
    ZendTest/
        {component directories}

The structure follows PSR-0, with each component below the library/Zend/ directory.

The goal is to have individual component repositories, each with the following structure:

.coveralls.yml
.gitattributes
.gitignore
.php_cs
.travis.yml
composer.json
CONTRIBUTING.md
src/
LICENSE.txt
phpunit.xml.dist
phpunit.xml.travis
README.md
test/
    bootstrap.php
    {component test cases}

In the above structure, note the following differences:

Source and unit test files now follow PSR-4, and can be found directly beneath the new src/ and test/ directories (which replace library/ and tests/, respectively), without any directory nesting based on namespace (unless any subnamespaces are present).
The README.md file will need to be specific to the component. Additionally, it can incorporate what was in the INSTALL.md file originally.
The composer.json file will need to be for the component, not the framework. Additionally, we don't currently list dev/testing dependencies in our component repos, so those will need to be added.
The TestConfiguration.php.* files define constants referenced by the unit tests; those can be migrated to the phpunit.xml.* files — which we can move to the project root to simplify testing.
The .travis.yml file can be streamlined, as we're now only testing one component.
Most testing infrastructure can be removed, as it's around simplifying running tests for individual components within the larger framework. The Bootstrap.php gets renamed to bootstrap.php to avoid being confused with unit test files.
README-GIT.md gets replaced with a lengthier CONTRIBUTING.md file.

On top of all this, we had the following requirements:

The components MUST have full history from 2.0.0rc7 forward. This is so those working on the components can see the why and who behind commits.
Commit messages MUST reference original issues and pull requests on the ZF2 repository; again, this is to facilitate the why behind changes.
Ideally, history should contain only history for the given component.
The directory structure in each commit, including (and especially!) tags, MUST follow the proposed structure.

How we got there

One of the huge benefits to using Git is the ability to rewrite history. (It's also one of its scariest features.) It provides a number of facilities for doing so, from rebase to grafts to subtree to filter-branch. In our component split research, we evaluated several solutions.

Grafts

Grafts provide a way to merge two different lines of history together, but, for our purposes, also allow us to prune history. Why would we do this? Because we don't really need history prior to 2.0.0 development at this point. In large part, this is because it's irrelevant; files were moved around and changed so much between forking from the 1.X tree and 2.0 that tracing the history is quite difficult.

I eventually found a methodology for pruning that looks like this:

$ echo bb50be26b24a9e0e62a8f4abecce53259d707b61 > .git/info/grafts
$ git filter-branch --tag-name-filter cat -- --all
$ git reflog expire --expire=now --all
$ git gc --prune=now --aggressive
$ rm .git/info/grafts

This is supposed to remove history before the given sha1. In practice, running it by itself changed little beyond the repository size; I could still reach earlier commits. However, when coupled with the final techniques we used, it meant that we effectively saw no commits prior to this point.

subtree

git subtree is a "contributed" git command; it's not available in default distributions of git, but often available as an add-on package; if you install git from source, it's in the contrib tree, where you can compile and install it. Subtree provides a rich set of functionality around dealing with repository subtrees, allowing you to split them off, add subtrees from other projects, and even push commits back and forth between them.

At first blush, it seems like an ideal, simple solution:

Split each of the library/ and tests/ component subtrees into their own branches.
Create a new repository, and add each of the above as subtrees.

$ git clone zendframework/zf2
$ git init zend-http
$ cd zf2
$ git subtree split --prefix=library/Zend/Http -b src
$ git subtree split --prefix=tests/ZendTest/Http -b test
$ cd ../zend-http
$ # add in basic assets, and create initial commit
$ git remote add zf2 ../zf2
$ git subtree add --prefix=src/ zf2 src
$ git subtree add --prefix=test/ zf2 test

Indeed, if you do the above, when done, the directory looks exactly like it should! However, the history is all wrong; if you check out any tags, you get the full ZF2 tree for the tag. As such, subtree fails one of the most important criteria right off the bat: that each commit and tag represent only the component.

subdirectory-filter

subdirectory-filter is one of the git filter-branch strategies. It operates similarly to subtree, but also rewrites history. We used this approach when splitting the various "service" (API wrapper) components from the main repository prior to the first ZF2 stable release.

The basic idea is similar to that of subtree; the difference is that you have to begin with separate checkouts for each of the source and tests.

$ git clone zendframework/zf2 zend-http-src
$ git clone zendframework/zf2 zend-http-test
$ cd zend-http-src
$ git filter-branch --subdirectory-filter library/Zend/Http --tag-name-filter cat -- --all
$ cd ../zend-http-test
$ git filter-branch --subdirectory-filter tests/ZendTest/Http --tag-name-filter cat -- --all
$ cd ..
$ git init zend-http
$ cd zend-http
$ # add in basic assets, and create initial commit
$ git remote add -f src ../zend-http-src
$ git remote add -f test ../zend-http-test
$ git merge -s ours --no-commit src/master
$ git read-tree -u --prefix=src/ src/master
$ git commit -m 'Merging src tree'
$ git merge -s ours --no-commit test/master
$ git read-tree -u --prefix=test/ test/master
$ git commit -m 'Merging test tree'

Again, this looks great at first blush; all the contents for the given component are rewritten perfectly. But when you start looking at previous tags and commits, you see an interesting picture: based on the commit and which remote you added first, you'll see a completely different directory structure. Like subtree, this fails our criteria that the repo be in a usable state at any given commit.

tree-filter

Like subdirectory-filter, tree-filter is a filter-branch strategy. tree-filter allows you to rewrite the tree contents any way you want, while retaining the commit message and metadata. This turned out to be what we were looking for!
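To make the idea concrete, here is a minimal shell sketch of what a tree filter might do for one component. It is not the script used for the actual split (that one was written in PHP and is not reproduced in this post); the function name restructure_tree and the move-then-delete approach are assumptions for illustration, and only the directory names come from the layouts shown earlier. filter-branch evaluates the tree filter once per commit, with the working directory set to the root of that commit's checked-out tree.

```shell
#!/usr/bin/env bash
# Hypothetical tree filter for a single component (illustration only).
# Must be run from the root of a checked-out tree.
restructure_tree() {
    local component="$1"
    local stash
    stash="$(mktemp -d)" || return 1

    # Set aside the component's source and tests, if this commit has them
    if [ -d "library/Zend/$component" ]; then
        mv "library/Zend/$component" "$stash/src"
    fi
    if [ -d "tests/ZendTest/$component" ]; then
        mv "tests/ZendTest/$component" "$stash/test"
    fi

    # Drop everything else from the tree, dotfiles included (but never .git)
    find . -mindepth 1 -maxdepth 1 ! -name .git -exec rm -rf {} +

    # Put the component back under the new src/ and test/ layout
    if [ -d "$stash/src" ]; then
        mv "$stash/src" src
    fi
    if [ -d "$stash/test" ]; then
        mv "$stash/test" test
    fi
    rm -rf "$stash"
}
```

A real filter would also have to write the component-specific files (composer.json, README.md, and so on) into each rewritten tree; this sketch only covers the directory shuffle.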

However, there were a few more pieces we needed to address:

Rewriting commit messages referencing issues and pull requests to link to the main ZF2 repository.
Pruning empty commits.
Ensuring tags contain the expected tree.

Fortunately, filter-branch has other strategies for just these purposes:

msg-filter allows you to rewrite commit messages.
commit-filter provides tools for detecting and removing empty commits.
tag-name-filter ensures that tag references are rewritten when the parent commits change or are removed.

So, what we ended up with was something like the following:

git filter-branch -f \
    --tree-filter "php /path/to/tree-filter.php" \
    --msg-filter "sed -re 's/(^|[^a-zA-Z])(\#[1-9][0-9]*)/\1zendframework\/zf2\2/g'" \
    --commit-filter 'git_commit_non_empty_tree "$@"' \
    --tag-name-filter cat \
    -- --all

/path/to/tree-filter.php is a script that contains the logic for re-arranging the directory structure, as well as rewriting the contents of files as necessary (e.g., rewriting the contents of composer.json, or filling in the name of the component in the CONTRIBUTING.md). The msg-filter looks for issue and pull request identifiers (a # character followed by one or more digits), and rewrites them to reference the repository. The commit-filter checks to see if the repository contents have changed in this commit, and, if not, instructs git to ignore the commit (and, since tree-filter always executes before commit-filter, the comparison is always between rewritten trees). The tag-name-filter MUST be present, and essentially just ensures that the tag is rewritten; if absent, tags are not rewritten, and refer to the original contents!
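Because a full run takes hours, it is worth piping a few sample commit messages through the msg-filter expression on its own first. The sketch below assumes the intent described above: keep the issue number and whatever preceded it, and insert the repository name in between via the \1 and \2 backreferences. The function name rewrite_refs is hypothetical, and sed -E is used here as the more portable spelling of GNU sed's -r.

```shell
#!/usr/bin/env bash
# Prefix bare issue/PR references (#123) with the original repository name.
# \1 keeps whatever preceded the reference; \2 keeps the reference itself.
rewrite_refs() {
    sed -E 's/(^|[^a-zA-Z])(#[1-9][0-9]*)/\1zendframework\/zf2\2/g'
}

printf '%s\n' 'Fixes #123' '#45: hotfix' 'See #7 and #8' | rewrite_refs
# Fixes zendframework/zf2#123
# zendframework/zf2#45: hotfix
# See zendframework/zf2#7 and zendframework/zf2#8
```

Note that a reference directly preceded by a letter (for example abc#9) is left untouched, which is what the [^a-zA-Z] guard is for.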

Stumbling blocks

We had a few stumbling blocks getting the above to work. The first was that, for purposes of testing, we had to specify a commit range, instead of -- --all. This was necessary because of the size of the repo; at ~27k commits, running over every single commit can take between 5 and 12 hours, depending on git version, HDD vs ramdisk, speed of I/O, etc. For small subsets, we could get consistent results. When we expanded the range, we started seeing strange errors, such as some tags not getting written.

To compound the situation, we also made a last minute change to only do history from the 2.0.0rc7 tag forward, and this is when things completely fell apart. A large number of tags would not get rewritten, the set of malformed tags varied between components, and we couldn't figure out why.

At a certain point, I recalled that git stores commits as a graph of parent links rather than a single line, and that's when I realized what was happening: when we specified a commit range, we were essentially specifying a specific path through the commits. If a tag was made on a branch falling outside that path, it would not get rewritten.
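One way to spot the problem before a long run is to list the tags that cannot be reached from the tip of the range being rewritten. This is a minimal sketch, assuming the range ends at a branch tip such as master; tags_outside is a hypothetical helper name, and git tag --no-merged requires a reasonably recent git.

```shell
#!/usr/bin/env bash
# List tags whose commits cannot be reached from the given tip.
# A filter-branch run limited to a range ending at that tip never visits them.
tags_outside() {
    local tip="$1"
    git tag --no-merged "$tip"
}
```

Running tags_outside master inside a clone prints one tag name per line; any output means a range-limited rewrite would leave those tags pointing at the original, unfiltered commits.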

This meant that the only way to get consistent results that met our criteria was to run a test over the full history. Fortunately, sometime around that point, a community member, Renato, suggested I try a run using a tmpfs filesystem — essentially a ramdisk. This sped up runs by a factor of 2, and I was able to validate my hypothesis within an evening.

Another stumbling block was empty commits. We originally used filter-branch's --prune-empty switch, but found it was generally unreliable when used with tree-filter. The solution to this problem is the commit-filter as listed above; it did a stellar job.

Empty merge commits

There was one lingering issue, however: when inspecting the filtered repository, we still had a large number of empty merge commits that had nothing to do with the component. After a lot of searching, I found this gem:

$ git filter-branch -f \
> --commit-filter '
>    if [ z$1 = z`git rev-parse $3^{tree}` ];then
>        skip_commit "$@";
>    else
>        git commit-tree "$@";
> fi' \
> --tag-name-filter cat -- --all
$ git reflog expire --expire=now --all
$ git gc --prune=now --aggressive

The above uses a commit-filter that compares the tree of the commit being rewritten ($1) with the tree of its first parent ($3, resolved via rev-parse). If the two match, the commit changes nothing in the filtered repository, so it is skipped (removed); otherwise it is committed as usual. The reflog expire and gc commands clean up and remove any objects in the repository that are now no longer reachable.

Final Solution

With a working graft, tree-filter, and commit-filter in place, we could finally proceed. We created a repository containing all scripts we needed, as well as the assets necessary for rewriting the component repository trees. We then had a tool that could be executed as simply as:

$ ./bin/split.sh -c Authentication 2>&1 | tee authentication.log

And with that, we could sit back and watch the component get split, and push the results when done.

You can see the work in our component-split repository.

But what about the speed?

"But didn't you say it takes between 5 and 12 hours to run per component? And aren't there something like 50 components? That would take weeks!"

You're quite astute! And for that, we had a secret weapon: a community contributor, Gianluca Arbezzano, who works for Corley, an AWS partner. Corley sponsored splitting all components in parallel, allowing us to complete the entire effort in a single day. I'll let others tell that story, though!

The results

I'm quite pleased with the results. The ZF2 repository has ~27k commits, 67 releases, and over 700 contributors; a clean checkout is around 150MB. As a contrast, the rewritten zend-http component repository ended up with ~1.7k commits, 50 releases, ~160 contributors, and a clean checkout clocks in at 5.4MB! So the individual components are substantially leaner! Additionally, they contain all the QA tooling necessary to start developing against for those wanting to patch issues or create features, making development a simpler process.

The lessons learned:

tree-filter is your friend, if your restructuring involves more than one directory and/or adding or removing files.
tag-name-filter MUST be used anytime you use filter-branch; otherwise your tags may end up invalid!
filter-branch should be used on ranges sparingly, and ideally only if you're not worried about tags. In most cases, you want to run over the entire history.
commit-filter is your best option for ensuring empty commits of any type are stripped, particularly if you're using tree-filter; the --prune-empty flag is not terribly reliable.
Always do a full test run. It's tempting to use a commit range to verify that your filters work, but the results will differ from running over the entire history. Which leads to:
Schedule plenty of time, particularly if your repository is large. Those full test runs will take time, and, if you follow the scientific process and make one change at a time, you may need quite a few iterations to get your scripts right.

All-in-all, this was a stressful, time-consuming, thankless task. But I am quite happy with the results; our components look like they are and were always developed as first-class components, and have a rich history referencing their original development as part of the encompassing framework.

Kudos!

I cannot thank Gianluca and Corley enough for their generous efforts! What looked like a task that would take days and/or weeks happened literally overnight, allowing us to complete a major task in Zend Framework 3 development, and setting the stage for a ton of new features. Grazie!

Splitting the ZF2 Components was originally published 15 May 2015 on https://mwop.net by Matthew Weier O'Phinney.

Automatic deployment with git and gitolite

contact@mwop.net (Matthew Weier O'Phinney) — Sun, 24 Jun 2012 21:50:00 -0500

I read a post recently by Sean Coates about deploy on push. The concept is nothing new: you set up a hook that listens for commits on specific branches or tags, and it then deploys your site from that revision.

Except I'd not done it myself. This is how I got there.

Sean's approach uses Github webhooks, which are a fantastic concept. Basically, once your commit completes, Github will send a JSON-encoded payload to a specific URI. Sean uses this to trigger an API call to a specific page in his website, which will then trigger a deployment activity.

Awesome, this should be easy; I already have a deploy script written that I trigger manually.

One small problem: my site, while in Git, is not on Github. I maintain it on my own Gitolite repository. Which means I needed to write my own hooks.

I originally went down the route of using a post-receive hook. However, I had problems determining what branch the given commit was on, despite a variety of advice I found on the subject on StackOverflow and git mailing lists. I ended up finding a great example using post-update, which was actually perfect for my needs.

In order to keep the post-update script non-blocking when I commit, I made it do very little: It simply determines what branch the commit was on, and if it was the master branch, it touches a specific file on the filesystem and finishes. The entire hook looks like this:

#!/bin/bash
branch=$(git rev-parse --symbolic --abbrev-ref "$1")
echo "Commit was for branch $branch"
if [[ "$branch" == "master" ]];then
    echo "Preparing to deploy"
    echo "1" > /var/local/mwop.net.update
fi
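Git passes post-update the name of every ref updated by the push, so when a single push updates several branches, master may arrive in a position other than $1. Here is a minimal sketch of a check that scans all of the arguments instead; the function name should_deploy is an assumption, and the flag file path in the comment is the one used by the hook above.

```shell
#!/usr/bin/env bash
# Succeeds if any updated ref is the master branch.
# post-update receives full ref names such as refs/heads/master.
should_deploy() {
    local ref
    for ref in "$@"; do
        if [ "$ref" = "refs/heads/master" ]; then
            return 0
        fi
    done
    return 1
}

# In the hook itself this could be used along the lines of:
#   if should_deploy "$@"; then echo "1" > /var/local/mwop.net.update; fi
```

Comparing against the full ref name also avoids confusing a tag named master with the branch.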

Now I needed something to detect such a push, and act on it.

I considered using cron for this; it'd be relatively easy to have it fire up once a minute, and simply act on it. But I decided instead to write a simple little daemon using perl. Perl daemons are trivially easy to write, and if you use a module such as Proc::Daemon and follow a few trivial defensive coding practices, you can keep memory leaks contained (or at least minimal). Besides, it gave me a chance to dust off my perl chops.

I decided I'd have it check for the file in 30 second intervals, simply sleeping if no changes were detected. If the file was found, however, it should attempt to deploy. Additionally, I wanted it to quit if it was unable to remove the file (as this could lead to multiple deploy attempts), and log success and failure status of the deploy. The full script looks like this:

#!/usr/bin/perl
use strict;
use warnings;
use Proc::Daemon;

Proc::Daemon::Init;

my $continue = 1;
$SIG{TERM} = sub { $continue = 0 };

my $updateFile   = "/var/local/mwop.net.update";
my $updateScript = "/home/matthew/bin/deploy-mwop";
my $logFile      = "/var/local/mwop.net-deploy.log";
while ($continue) {
    # 30s intervals between iterations
    sleep 30;

    # Check for update file, and restart loop if not found
    unless (-e $updateFile) {
        next;
    }

    # Remove update file
    if (!unlink($updateFile)) {
        # If unable to unlink, we need to quit
        system('echo "' . time() . ': Failed to REMOVE ' . $updateFile . '" >> ' . $logFile);
        $continue = 0;
        next;
    }

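    # Deploy
    system($updateScript);
    if ( $? == -1 ) {
        # The deploy script could not be executed at all
        system('echo "' . time() . ': FAILED to deploy: ' . $! . '" >> ' .  $logFile);
    } elsif ( $? != 0 ) {
        # The deploy script ran, but reported a failure
        system('echo "' . time() . ': FAILED to deploy: wait status ' . $? . '" >> ' . $logFile);
    } else {
        system('echo "' . time() . ': Successfully DEPLOYED" >> ' . $logFile);
    }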
}

The system() calls for logging could have been done using Perl, but I didn't want to deal with additional error handling and file pointers; simply proxying to the system seemed reasonable and expedient.
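For comparison, the cron route would need only the body of that loop. The following is a minimal sketch of a single iteration as a shell function, not what runs on the site; the function name deploy_if_flagged and its three arguments (flag file, deploy command, log file) are assumptions for illustration.

```shell
#!/usr/bin/env bash
# One polling iteration: deploy only if a push left the flag file behind.
deploy_if_flagged() {
    local flag_file="$1" deploy_cmd="$2" log_file="$3"

    # Nothing to do unless the flag file exists
    [ -e "$flag_file" ] || return 0

    # Remove the flag first, so that one push triggers only one deploy
    if ! rm -- "$flag_file"; then
        echo "$(date +%s): Failed to REMOVE $flag_file" >> "$log_file"
        return 1
    fi

    # Run the deploy command and record the outcome
    if "$deploy_cmd"; then
        echo "$(date +%s): Successfully DEPLOYED" >> "$log_file"
    else
        echo "$(date +%s): FAILED to deploy (exit status $?)" >> "$log_file"
        return 1
    fi
}
```

Called once a minute from cron, this trades the 30 second polling interval for not having a long-running process to look after.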

When all was ready, I started the above listener, which automatically daemonizes itself. I then installed the post-update hook into my bare repository, and tested it out. And it runs! When I push to master, my site is automatically deployed, typically within 15-20 seconds from completion.

Caveats

This solution, of course, relies on a daemonized process. If that process were to terminate, I'd have no idea until I discovered my site didn't refresh after the most recent push. Clearly, some sort of monitor checking for the status of the daemon should be in place.
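Such a monitor can be very small. As a minimal sketch, assuming the daemon were changed to record its process ID in a file (the script above does not do this, and both the file and the function name daemon_alive are hypothetical), a check could look like this:

```shell
#!/usr/bin/env bash
# Succeeds if the PID recorded in the given file belongs to a running
# process that the current user is allowed to signal.
daemon_alive() {
    local pid_file="$1" pid
    [ -r "$pid_file" ] || return 1
    pid="$(cat "$pid_file")"
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;   # empty, or not a plain number
    esac
    # Signal 0 delivers nothing; it only reports whether the process exists
    kill -0 "$pid" 2>/dev/null
}
```

Run periodically, a failing check could send a notification or simply restart the listener.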

Also, note that I'm having this update on changes to the master branch; you may need to adapt it for your own needs, depending on your branching strategy.

Finally, this approach does not address issues that might require a roll-back. Ideally, the script should probably log what revision was current prior to the deployment, allowing roll-back to the previous state. Alternately, the deployment script should create a new clone of the site and swap symlinks to allow quick roll-back when required.
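The symlink variant could be sketched roughly as follows. This is an illustration under assumed names, not the deploy script in use: the releases/, current, and previous paths under a deployment root, and both function names, are hypothetical.

```shell
#!/usr/bin/env bash
# Point the "current" symlink at a release directory, remembering the
# previous target so that it can be restored later.
activate_release() {
    local root="$1" release="$2"
    if [ -L "$root/current" ]; then
        readlink "$root/current" > "$root/previous"
    fi
    ln -sfn "$root/releases/$release" "$root/current"
}

# Point "current" back at whatever it referenced before the last activation.
rollback_release() {
    local root="$1"
    [ -s "$root/previous" ] || return 1
    ln -sfn "$(cat "$root/previous")" "$root/current"
}
```

The web server would then serve from the current symlink, so a roll-back is a matter of re-pointing it rather than re-deploying.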

Automatic deployment with git and gitolite was originally published 24 June 2012 on https://mwop.net by Matthew Weier O'Phinney.

git-svn Tip: don't use core.autocrlf

contact@mwop.net (Matthew Weier O'Phinney) — Wed, 24 Sep 2008 12:16:27 -0500

I've been playing around with Git in the past couple months, and have been really enjoying it. Paired with subversion, I get the best of all worlds — distributed source control when I want it (working on new features or trying out performance tuning), and non-distributed source control for my public commits.

Github suggests that when working with remote repositories, you turn on the autocrlf option, which ensures that changes in line endings do not get accounted for when pushing to and pulling from the remote repo. However, when working with git-svn, this actually causes issues. After turning this option on, I started getting the error "Delta source ended unexpectedly" from git-svn. After a bunch of aimless tinkering, I finally asked myself the questions, "When did this start happening?" and, "Have I changed anything with Git lately?" Once I'd backed out the config change, all started working again.

In summary: don't use git config --global core.autocrlf true when using git-svn.
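If other repositories still benefit from the global setting, it can be overridden for just the git-svn clone instead of being removed everywhere. Here is a minimal sketch using standard git config invocations; the helper name disable_autocrlf_here is hypothetical, and I have only the experience described above to suggest that turning the option off is what resolves the error.

```shell
#!/usr/bin/env bash
# Turn core.autocrlf off for the repository in the given directory only,
# leaving any global setting in place for other repositories.
disable_autocrlf_here() {
    git -C "$1" config core.autocrlf false
}

# To remove the global setting entirely instead:
#   git config --global --unset core.autocrlf
```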

git-svn Tip: don't use core.autocrlf was originally published 24 September 2008 on https://mwop.net by Matthew Weier O'Phinney.