Release Management Tooling: Past, Present, and Future

While interviewing a potential intern for the summer of 2015, I realized I had just outlined all our major tools and the next enhancement each could use, and that none of this was well documented anywhere else yet.

Having come to Release Management from my beginnings as a Release Engineer, I’ve been part of seeing our overall release automation improve across the whole spectrum of what it takes to put out packaged software for multiple platforms. We’ve come a long way, so this post is intended both to capture how the main tools we use got to their current state and to share where they are heading.

Ship-It

Past: The Release Manager on point for a release sent an email to the Release-Drivers mailing list with an hg changeset, a version, and a build number, and this was the “go” for Release Engineering to take over and execute a combination of automated and manual steps (there was even a time when it was only said in IRC; email became the constant when Joduinn pushed for consistency and a traceable trail of events). Release Engineers would update config files and locale changes, get them attached to a bug, approved, and uplifted, then go reconfigure the build machines so they could kick off the release build automation.

Present: Ship-It is an app developed by Release Engineering (bhearsum) that allows a Release Manager to input the configurations needed (changeset, version, build number, partials to be created, l10n changesets) all in one place. On submit, the build automation picks up the change from a database, reconfigures the build machines, and triggers builds. When all goes well, there are zero human hands between the “go” and the availability of builds to QA.
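To make that handoff concrete, here is a minimal sketch of the polling half of the flow. The table layout and status values are invented for illustration; Ship-It’s actual schema isn’t shown here.

```python
import sqlite3
import time

def trigger_build(version, build_number, changeset, l10n_changesets):
    # Stand-in for reconfiguring the build masters and starting automation.
    print(f"Triggering {version} build {build_number} from {changeset}")

def poll_for_releases(db_path="shipit.db", interval=60):
    conn = sqlite3.connect(db_path)
    while True:
        # Pick up any release submissions the automation hasn't handled yet.
        rows = conn.execute(
            "SELECT version, build_number, changeset, l10n_changesets "
            "FROM releases WHERE status = 'submitted'"
        ).fetchall()
        for version, build_number, changeset, l10n in rows:
            trigger_build(version, build_number, changeset, l10n)
            conn.execute(
                "UPDATE releases SET status = 'building' "
                "WHERE version = ? AND build_number = ?",
                (version, build_number),
            )
            conn.commit()
        time.sleep(interval)
```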

Future: In two parts:
1. A simple app that takes a list of bug numbers and checks that each has landed on {branch} (where branch is Beta, Release, or ESR); once all the listed bugs have landed, it checks Treeherder for green status on that last changeset and submits to Ship-It if builds are successful (a rough sketch follows this list). Benefits: hands off even sooner, knowing that all the important fixes are on the branch in question, and that the tree is totally green prior to build (sometimes we “go” without all the results because of human timing needs).
2. A complete end-to-end release checklist, dynamically updated to show what stage a release job is at and who has the ball in their court. This should track from the buglist being added (for the final landings an RM is waiting on) all the way until the release notes are live and QA signs off on updates for the general release being in the wild.
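For the first item, here is a rough sketch written under stated assumptions: the Bugzilla query uses its public REST API, while latest_changeset, is_tree_green, and submit_to_shipit are hypothetical stand-ins for the hg pushlog, Treeherder, and Ship-It integrations.

```python
import requests

BUGZILLA_BUG = "https://bugzilla.mozilla.org/rest/bug/{}"

def bug_is_fixed(bug_id):
    # RESOLVED FIXED is used here as a proxy for "landed"; a real checker
    # would confirm the landing on the specific branch (e.g. status flags).
    bug = requests.get(BUGZILLA_BUG.format(bug_id)).json()["bugs"][0]
    return bug["resolution"] == "FIXED"

def latest_changeset(branch):
    # Hypothetical helper: a real version would query the hg pushlog.
    return "abcdef123456"

def is_tree_green(branch, changeset):
    # Hypothetical helper: a real version would query Treeherder's results.
    return True

def submit_to_shipit(branch, changeset):
    # Hypothetical helper: a real version would POST to Ship-It.
    print(f"Submitting {branch} tip {changeset} to Ship-It")

def check_and_go(bug_ids, branch):
    """Submit to Ship-It only when every listed bug has landed and the
    tree is green on the latest changeset."""
    if not all(bug_is_fixed(b) for b in bug_ids):
        return False
    tip = latest_changeset(branch)
    if not is_tree_green(branch, tip):
        return False
    submit_to_shipit(branch, tip)
    return True
```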

Nucleus (aka Release Note App)

Past: Oh dear, you probably don’t even want to know how our release notes used to be made; it’s worse than sausage. There was a sqlite db file and a script that pulled from that db and generated HTML based on templates, and then the Release Manager had to manually re-order the HTML to get the desired appearance on the final pages. All of this was then committed to SVN, and with that came the power to completely break mozilla.org properties. Fun stuff. Really. Also, once Release Management was more than just one person, we shared this sqlite db over Dropbox, which had some fun quirks, like clobbering your changes if two people had the file open at the same time. Nowhere to go but up from here!

Present: Thanks to the web production team (jgmize, hoosteeno, craigcook, jbertsch), we got a new Django app in place that gives us a proper database that’s redundant, production quality, and not in our hands. We add in release notes as well as releases and can publish notes to both staging and production without any more commits to SVN. There’s also an API that can be scripted against.

Future: The future’s so bright in this area, let me get my shades. We have a relnote-firefox flag in Bugzilla that gets set to ? when something is nominated; when we decide to take a bug on as a release note, we set it to {versionNum}+. With a little tweaking on the Bugzilla side we could either have a dedicated “release-note text” field or parse the text out of a comment syntax (though that’s going to be more prone to user error, so I prefer the former), and then automatically grab all the release notes for a version, create the release in Nucleus, add the notes, publish to staging, and email the link around for feedback without any manual interference. This also means we could dynamically adjust release notes using Bugzilla (and yes, this would need to be done really cautiously), and it makes sure that our recent convention of connecting every release note to a bug persists and becomes the standard.
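As a thought experiment, that pipeline could look like the sketch below. The Bugzilla query follows its REST API’s advanced-search parameters, but the exact flag value and the Nucleus endpoints are assumptions, not the real API.

```python
import requests

BUGZILLA_SEARCH = "https://bugzilla.mozilla.org/rest/bug"

def fetch_relnote_bugs(version):
    # Advanced-search style query for bugs whose relnote-firefox flag was
    # plussed for this version; treat the exact flag value as an assumption.
    params = {
        "f1": "flagtypes.name",
        "o1": "substring",
        "v1": f"relnote-firefox{version}+",
        "include_fields": "id,summary",
    }
    return requests.get(BUGZILLA_SEARCH, params=params).json()["bugs"]

def publish_release(version, bugs, nucleus_url, api_key):
    # Hypothetical Nucleus endpoints; substitute the app's real API.
    headers = {"Authorization": f"Token {api_key}"}
    release = requests.post(
        f"{nucleus_url}/releases/",
        json={"version": version, "channel": "release"},
        headers=headers,
    ).json()
    for bug in bugs:
        requests.post(
            f"{nucleus_url}/notes/",
            json={"release": release["id"], "bug": bug["id"],
                  "note": bug["summary"]},
            headers=headers,
        )
```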

Release Dash

Past: Our only way to visualize the work we were doing was a spreadsheet, and graphs generated from it, of how many crasher bugs were tracked for a version, how many bugs were tracked/fixed over the course of a version’s 18 weeks, and not much else. We also pay attention to the crash rate at ship time and whether we had to do a dot release or chemspill, but any other release-version-specific issues are sort of lost in the fray once we’re a couple of weeks out from a release. This means we don’t have a great sense of our own history, of what we’re doing that generates a more stable/successful release, or of whether a release is in fact ready to go out the door. It’s a gamble, and we take it every 6 weeks.

Present: We have a dashboard in place that is supposed to let us view current crash data and select Talos (performance) data, run custom bug queries, and compare a release coming down the pipe to previous releases. We don’t use this dashboard yet because it has been a side project for the past year and a half, primarily created and improved upon by fabulous – yet short-term – interns at Mozilla. The dashboard relies on ElasticSearch for Bugzilla data, and the cluster it points to is not always up. The dash is written in PHP, which is no one’s strong suit on our current team; our last intern did his work by creating a Python Flask app that would work into the current dash. The present situation is basically: we need to work on this.
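For a flavor of that Flask direction, a minimal endpoint that counts tracked bugs out of an ElasticSearch index of Bugzilla data might look like this; the index, field mapping, and cluster location are illustrative, not the dashboard’s actual setup.

```python
from flask import Flask, jsonify
import requests

app = Flask(__name__)
ES_SEARCH = "http://localhost:9200/bugs/_search"  # assumed cluster/index

@app.route("/tracked/<version>")
def tracked_count(version):
    # Count bugs tracked for a given Firefox version; the field name follows
    # Bugzilla's cf_tracking_firefoxNN convention but the mapping is invented.
    query = {"query": {"term": {f"cf_tracking_firefox{version}": "+"}},
             "size": 0}  # we only want the hit count, not the documents
    resp = requests.post(ES_SEARCH, json=query).json()
    return jsonify(version=version, tracked=resp["hits"]["total"])

if __name__ == "__main__":
    app.run(debug=True)
```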

Future: This dashboard will be robust, reliable, production-quality (and supported), and able to go up on Mozilla office screens in the dashboard rotation, where it will make clear to any viewer:
* Where we are in the current release cycle
* What blockers remain for release
* Whether our stability is over or under acceptable rates
* If we’re meeting performance expectations
And hopefully more. We have to find more ways to get visibility into issues a release might hit once it’s with the larger population. I’d love to see us get more of our Beta users’ feedback by asking for it on specific features/fixes, grow a broader Beta audience that is more reflective of our overall release population (by hardware, location, language, and user type), and then grow their ability to report issues well. Then we can find ways to get that front and center too – including to developers, because they are great at confirming whether something unusual is happening.

What Else?

Well, we used to have an automated script that reminded teams of their open and tracked bugs on Beta/Aurora/Nightly in order to provide a priority order that was visible to devs and their managers. It’s a finicky script that breaks often, and I’d like to see it replaced with something that’s not just a cronjob on my personal VPS. We’re also this close to not needing to update product-details (still in SVN) on every release. The fact that the Release Management team can accidentally take down all mozilla.org properties with a mistaken svn propedit is neither desirable nor necessary. We should get the heck away from that asap.
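For illustration, a sturdier replacement for that reminder script could be as small as the sketch below; the channel-to-version map, recipient addresses, and exact Bugzilla search parameters are assumptions.

```python
import smtplib
from email.mime.text import MIMEText

import requests

BUGZILLA_SEARCH = "https://bugzilla.mozilla.org/rest/bug"
CHANNELS = {"Nightly": "37", "Aurora": "36", "Beta": "35"}  # example mapping

def open_tracked_bugs(version):
    # Tracking fields follow Bugzilla's cf_tracking_firefoxNN convention;
    # resolution "---" restricts the search to open bugs.
    params = {
        f"cf_tracking_firefox{version}": "+",
        "resolution": "---",
        "include_fields": "id,summary,assigned_to",
    }
    return requests.get(BUGZILLA_SEARCH, params=params).json()["bugs"]

def send_reminder(recipient="release-mgmt@example.com"):
    lines = []
    for channel, version in CHANNELS.items():
        for bug in open_tracked_bugs(version):
            lines.append(f"[{channel}] bug {bug['id']}: {bug['summary']}")
    body = "\n".join(lines) or "Nothing open and tracked. Nice work!"
    msg = MIMEText(body)
    msg["Subject"] = "Open tracked bugs by channel"
    msg["From"] = "reminder-bot@example.com"
    msg["To"] = recipient
    with smtplib.SMTP("localhost") as server:
        server.send_message(msg)

if __name__ == "__main__":
    send_reminder()
```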

We’ll have more discussions of this in Portland, especially with the teams we work closely with. Sylvestre and I will be talking up our process and future goals at FOSDEM in 2015 as well, following it with a work week in Paris where we can put our heads down and code. Next summer we get an intern again, so we’ll have another set of skilled hands to put on tooling and web service improvements.

Always improving. Always automating. These are the things that make me excited for the next year of Release Management.

Why isn’t Autoland working?

This question comes up enough that I figured a quick blog post/status update would be helpful.

What does Autoland do?

Poll individual Bugzilla bugs for an autoland token (currently using whiteboard tags; a future Autoland has an extension and webservice that makes polling the entirety of Bugzilla unnecessary). When an autoland request is found, the service does an automated landing to try for you of all non-obsolete patches attached to the bug, provided they can be landed cleanly on the tip of mozilla-central and either the patch author or a feedback provider has appropriate hg permissions; otherwise it reports back to the bug what the issues were. Upon completion of the try run, a comment is left in the bug stating the results, and if a final repo destination (or destinations) was specified (the hg permissions must match up between requester/reviewer and the destination repo(s)), the service can continue on to autolanding the patch(es) to the destination repo. A comment would be left on the bug when the push to the final destination(s) is done. There would be no reporting back of final build results; that would be handed back over to human eyes on TBPL.
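In runnable-sketch form, the polling half of that flow might look like the following; the whiteboard token, search parameters, and the landing steps (reduced to a print here) are stand-ins for the real implementation.

```python
import time

import requests

BUGZILLA = "https://bugzilla.mozilla.org/rest"

def bugs_requesting_autoland():
    # Whiteboard search for a token like "[autoland]"; the token itself
    # is an assumption for this sketch.
    params = {"whiteboard": "[autoland]", "include_fields": "id"}
    resp = requests.get(f"{BUGZILLA}/bug", params=params).json()
    return [bug["id"] for bug in resp["bugs"]]

def active_patches(bug_id):
    # Collect the non-obsolete patch attachments on the bug.
    resp = requests.get(f"{BUGZILLA}/bug/{bug_id}/attachment").json()
    return [a for a in resp["bugs"][str(bug_id)]
            if a["is_patch"] and not a["is_obsolete"]]

def poll_forever(interval=300):
    while True:
        for bug_id in bugs_requesting_autoland():
            patches = active_patches(bug_id)
            # Placeholder for the real work: verify hg permissions for the
            # author or a feedback provider, apply cleanly to the tip of
            # mozilla-central, push to try, then comment results on the bug.
            print(f"bug {bug_id}: would push {len(patches)} patch(es) to try")
        time.sleep(interval)
```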

What is Autoland’s status now?

In April of 2012, right before Marc’s second internship ended, we launched a very experimental public-facing version of Autoland and announced it a bit so we could get more people testing it. This had varying degrees of success. We got more bugs ironed out but also discovered that Autoland’s daemons for hg pushing and Bugzilla polling tend to fall over a bit too often. When we moved Autoland off its staging VM to a more permanent home, we lost the status page that would tell us (and developers) what the modules were doing, and that made the workings of Autoland quite opaque. With Marc leaving at the end of April and my switch over to help with Release Management in February 2012, I had kept meeting regularly with Marc and driving the project to completion as much as possible, but I hadn’t been able to pull my weight on coding for the last 3 months of our time together. This left us with an Autoland that stopped working and no one available to continue to massage it into the robust system we needed it to be. I took mention of Autoland out of our trychooser page and try server docs, and have generally tried to downplay its existence while still keeping a plan on the back burner for how to resurrect it as soon as there is some time.

What does Autoland need to be publicly usable again?

  • A status page that can show which modules are running or down, display what’s in the queue, and give users a quick visual of whether Autoland is up or down as a result.
  • Nagios alerts on the Autoland modules that let me (and other people interested in helping to maintain Autoland) know when things fall over (a minimal heartbeat check is sketched after this list).
  • At least one person, if not several, who can access the Autoland master VM and ‘kick’ it as needed.
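As one example of the monitoring piece, here’s a minimal heartbeat check in the Nagios plugin style. It assumes each Autoland module touches a file as it runs; the module names and paths are invented.

```python
import os
import sys
import time

HEARTBEAT_DIR = "/var/run/autoland"  # assumed location
MODULES = ["bugzilla_poller", "hgpusher", "message_queue"]
MAX_AGE = 600  # seconds before a module counts as down

def stale_modules():
    # A module is considered down if its heartbeat file is missing or stale.
    now = time.time()
    down = []
    for module in MODULES:
        path = os.path.join(HEARTBEAT_DIR, f"{module}.heartbeat")
        if not os.path.exists(path) or now - os.path.getmtime(path) > MAX_AGE:
            down.append(module)
    return down

if __name__ == "__main__":
    down = stale_modules()
    if down:
        # Exit code 2 is CRITICAL in the Nagios plugin convention.
        print("CRITICAL: modules down: " + ", ".join(down))
        sys.exit(2)
    print("OK: all Autoland modules reporting")
    sys.exit(0)
```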

This is what’s needed for a short-term solution. I know that we have some bugs in our hg pusher module, as well as some trickiness in our message queue, that once fixed would make the overall system more robust. We need people using the system to be able to catch more of those bugs, though, so in the meantime having as close to one-click restoration of the system as possible would be a huge win here.

What does Autoland need to be truly ‘production’ ready?

  • Security review of the code and the BMO extension so that we can move away from whiteboard tags and let people use the BMO extension instead — this gives Autoland much cleaner input about what is being requested.
  • More VMs to run hgpusher modules on so that Autoland can handle a larger load. Each VM can run 2 hgpushers max so we’d want to be able to grow our pushing farm as the usage of the system increases.
  • Being able to push to repos other than Try.

There is no clear plan for getting the system beyond Try landings, but I still see automated try landings as a huge help, so I’d be super happy just to see that part get back to a working state. This project is no longer a RelEng priority now that I’ve permanently moved to Release Management and Marc has gone on to other internships and more schooling. I can’t promise anything time-wise, but I wanted to provide some clarity into what’s needed and put out the “patches welcome” call. I see Autoland as a great option for a community-managed project, and I want to keep working on it when time permits. If you are looking to become a Mozilla contributor and are interested in automation and web APIs – this might be a good starter project for you. Please get in touch.


Want to help? Encouraging community contributions

In a timely confluence with Mozilla’s new Steward initiative, I’m preparing to get some community contributors engaged with the projects we work on in Release Engineering. A fair amount of our production infrastructure has to be locked behind VPN and sekrit passwords (we have 400+ million users to protect), but there are more and more RelEng side projects: tools for the larger developer community, and interesting scalability challenges in our unique (and massive) automation systems, that any interested person can work on in their own local test environment and then integrate into our /build repos. My personal goal is to get 2 or 3 regular community contributors to come work with us on tackling these.

In order to solicit contributions, I have been working with David Boswell. We added Release Engineering to the mozilla.org/contribute ‘areas of interest’ page, and I have created the beginnings of a RelEng-specific contribution page. The first two areas that I think would be a great introduction to working with RelEng code and tools are the TryChooser and our upcoming Autoland system. For the latter, our intern Marc Jessome is sticking around this fall as a contributor to carry on the amazing work he put into this system over the summer. He’ll continue to debug the code and improve its portability so that we can get it into a beta testing stage by the end of October. As that work is being done, we also need someone to help us write the API functionality that will allow sheriffs and developers to write tools that utilize this new hands-off landing queue. We’d also be happy to have people work on the issues that come up when we take Autoland to the next level – auto-landing on a production branch. To do this we’ll want automated backouts, bisection, and the ability to wait on getting patches reviewed before continuing.

Another great area for someone interested in helping out Firefox developers is working on the TryChooser syntax and features. There is a whole tracking bug dedicated to try_enhancements, and most of those bugs can be worked on in a local staging environment. It’s a chance to get your feet wet with buildbot and our custom scheduling setup. Some of these smaller bugs would be short on time commitment and high on developer appreciation if you fix them. That can be a winning combination for a new contributor – I speak from experience on that 🙂

So, if you’re reading this post and you (or someone you know) are interested in dipping your toes into becoming a Mozilla contributor, and these projects make you curious, come find me and we’ll get you set up with a staging environment so that you can start fixing real-world tools and automation bugs in no time.