Why did BBC iPlayer go down?
James R | On 23, Jul 2014
Fans of Formula 1 and EastEnders would have noticed over the weekend that BBC iPlayer wasn’t working properly. As we reported this week, the BBC’s flagship VOD service experienced some severe downtime: problems first surfaced on the morning of Saturday 19th July and plagued the platform all the way through until Monday.
So what happened? Why did BBC iPlayer break?
Richard Cooper, Controller of Digital Distribution at BBC Future Media, has written a lengthy blog post to explain exactly what went wrong.
It starts with some pretty detailed technical language:
“We have a system comprising 58 application servers and 10 database servers that provides programme and clip metadata.”
Translation: The BBC has a computer system.
It continues:
“This data powers various BBC iPlayer applications for the devices that we support (which is over 1200 and counting) as well as modules of programme information and clips on many sites across BBC Online. This system is split across two data centres in a “hot-hot” configuration (both running at the same time), with the expectation that we can run at any time from either one of those data centres.”
Translation: The BBC iPlayer service streams a lot of video to a lot of devices, so they’ve got the thing running through two systems to make sure it works all the time.
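For the curious, this is roughly what a "hot-hot" setup buys you, as a minimal Python sketch. Everything named here (`fetch_from_any`, `healthy_centre`, `broken_centre`, `DataCentreDown`) is invented for illustration and is not the BBC's code. The assumption is simply that both data centres are live, so a request that fails at one can be answered by the other.

```python
from typing import Callable, Dict, List


class DataCentreDown(Exception):
    """Raised by a (hypothetical) data centre that cannot answer."""


def fetch_from_any(programme_id: str,
                   data_centres: List[Callable[[str], Dict]]) -> Dict:
    """Ask each live data centre in turn and return the first good answer."""
    last_error = None
    for fetch in data_centres:
        try:
            return fetch(programme_id)
        except DataCentreDown as err:
            last_error = err  # remember the failure and try the next one
    raise DataCentreDown("all data centres failed") from last_error


# Two stand-in data centres: one healthy, one overloaded.
def healthy_centre(programme_id: str) -> Dict:
    return {"id": programme_id, "title": "EastEnders"}


def broken_centre(programme_id: str) -> Dict:
    raise DataCentreDown("database overloaded")


# The broken centre is tried first; the healthy one picks up the request.
result = fetch_from_any("example-id", [broken_centre, healthy_centre])
print(result["title"])
```

If every data centre in the list fails, the function gives up and raises, which is about as far as redundancy alone can take you.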
So what happened?
“At 9.30 on Saturday morning (19th July 2014) the load on the database went through the roof, meaning that many requests for metadata to the application servers started to fail.”
Translation: Bad things happened.
“The immediate impact of this depended on how each product uses that data. In many cases the metadata is cached at the product level, and can continue to serve content while attempting to revalidate. In some cases (mostly older applications), the metadata is used directly, and so those products started to fail.”
Translation: Things broke, but the BBC has a cache to keep things running. So everything’s ok.
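What "continue to serve content while attempting to revalidate" can look like, in a minimal Python sketch. The class name `StaleOnErrorCache`, its `ttl_seconds` parameter and the injected `fetch` function are assumptions made up for this example, not details from Cooper's post. The idea: when a cached entry is too old, try the metadata service again, and if that fails, hand back the old copy rather than nothing.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnErrorCache:
    """Product-level cache that serves stale data when the backend fails."""

    def __init__(self, fetch: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # call to the (hypothetical) metadata service
        self._ttl = ttl_seconds      # how long an entry counts as fresh
        self._clock = clock          # injectable so the cache is easy to test
        self._store: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # still fresh: no backend call needed
        try:
            value = self._fetch(key)  # entry missing or expired: revalidate
        except Exception:
            if entry is not None:
                return entry[1]      # backend is down: serve the stale copy
            raise                    # nothing cached at all: the failure shows
        self._store[key] = (now, value)
        return value
```

A product that reads the metadata directly, with no cache in between, is in the position of the final `raise`: when the backend fails, so does the product.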
“At almost the same time we had a second problem…”
Translation: Bad things got worse.
How?
“We use a caching layer in front of most of the products on BBC Online, and one of the pools failed. The products managed by that pool include BBC iPlayer and the BBC homepage, and the failure made all of those products inaccessible. That opened up a major incident at the same time on a second front.”
Translation: That cache? Yeah, it failed. This is Major Trouble Central right here.
“The failure was a complex one (we’re still doing the forensics on it), and it has repeated a number of times.”
Translation: We’re still trying to work out how to stop it happening again.
“It was this failure that resulted in us switching the homepage to its emergency mode (“Due to technical problems, we are displaying a simplified version of the BBC Homepage”). We used the emergency page a number of times during the weekend, eventually leaving it up until we were confident that we had completely stabilised the cache.”
Translation: FUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU.
“Restoring the metadata service was complex. Isolating the source of the additional load proved to be far from straightforward, and restoring the service itself is not as simple as rebooting it (turning it off and on again is the ultimate solution to most problems). Performance of the system remained sufficiently poor that in the end we decided to do some significant remedial work on Saturday afternoon, which ran on until the evening. During that period, BBC iPlayer was effectively not useable.”
Translation: They hadn’t tried turning it off and on again.
From Sunday on, Cooper says that iPlayer was in a “walking wounded state”, with near-normal operation. The problem? Sunday night is the busiest time of the week, so turning it off and on again would have meant no iPlayer for everyone, rather than iPlayer for some people.
The reboot occurred on Monday morning, which left the service running as usual for everyone.
To sum up? High demand made things break. The back-up things broke. So the BBC did the only thing that is guaranteed to get a flagship video on-demand service working again: turn it off and turn it on again.
What does that mean for programmes people may have missed? “I’m afraid we can’t simply turn back the clock,” says Cooper, “and as such the availability for you to watch some programmes in the normal seven day catch-up window was reduced.”
Many titles, though, are still available now, he points out: it is “essentially programmes aired on Saturday 12th July and Sunday 13th July” that are now not available.
“It’s small consolation but that was the weekend of the World Cup Final, Scottish Open, Women’s Open and other live sporting events which are less likely to be viewed on catch-up,” he notes, correctly.
Nonetheless, he is right to point out that “instances like this are incredibly rare”. VOD failures have been more common on other platforms, such as Sky’s NOW, which went down during live streams of several Game of Thrones episodes, as well as the final day of the Premier League season. Even during the World Cup, the BBC’s live streaming only suffered lag, compared to ITV’s temporary, full-blown drop-out. For a free product, which is used by millions, this is the first major problem that BBC iPlayer has encountered. Luckily, we now know that if it ever happens again (the Beeb is trying to ensure it will not), the on-off switch is always handy.
He concluded:
“We’re sorry for the inconvenience.”
Translation: We’re sorry.