I was so thrilled to discover Splunk that I installed it on one of the buildbot masters – qm-rhel02 – without realizing that Splunk quickly eats up disk space and hogs memory. Yesterday afternoon some Talos boxes started to go down because of this, and once I stopped the Splunk server everything started to right itself.
Do not play with the buildbot master.
Do not look directly at the buildbot master.
Do not taunt the buildbot master.
So today’s tasks include getting access to the samba share that was set up, creating a cron job that will rsync the buildbot master logs to said share, and then finding a safe place to set up Splunk again.
We really need a way to look at data from the buildbot master over a long period of time – otherwise filing bugs on these intermittent failures is just a shot in the dark. Take yesterday, for example: qm-win2k3-pgo01 is being “unreliable” and had the same reftest errors for two consecutive builds. I file a bug, and the response is “grab me a copy of the offending objdir so we can poke at it”. Wouldn’t you know it, the very next build doesn’t have the same error output – this time it has mochitest issues that are seemingly unrelated. This morning I check again and it’s had a compile failure, an exception (the most hideous purple), and then a completely green run.
Intermittent failures == needle in a haystack