Recovery of Pavlovia services and data following the OVH fire

You can now access your pre-fire data. For further information about getting to your data see this post

For a recap of developments and advice for researchers, see here.

For information about the fire itself, see this Reuters report: “Millions of websites offline after fire at French cloud services firm”.
Unfortunately, it looks like it will take some time to resume full service. The main Pavlovia server is, we believe, undamaged, but its building currently has no power, and OVH have stated that it will take several days to restore the building’s power source (it’s a 20 kilovolt supply).
We are already rebuilding the server, at another data center, from our off-site backups (see info below about backups), but copying over the terabytes of data will take some time as well. We are exploring the option to bring the server back, live, while data continues to arrive from backup (fancy!).

What was our backup policy (and Disaster Recovery Plan)?

  1. use a RAID array, instead of a single disk, to make it possible to recover all data if a disk fails within the server
  2. duplicate data to a Secondary server in case an entire server is destroyed: the servers can be flipped relatively quickly with just a bit of time needed to test everything.
  3. duplicate data to a disk (not server) in a separate location in case the entire datacenter is destroyed. Rebuilding from this takes time because very significant amounts of data must be copied to a new location.

Now, on this occasion, the Secondary server was in the datacenter that was burnt to the ground (SBG2), and our Primary server was in a different building. One would have thought that was safe enough but, alas, the electricity was shut down across the whole site, not just in the burning building. As a consequence both our servers are down.
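To make the reasoning about failure scope concrete, here is a minimal sketch of the three copies described above and what stays reachable under different failures. It is only an illustration, not how Pavlovia is actually managed: the class names, the `reachable` helper, and the labels `other-building`, `elsewhere` and the host names are all hypothetical; only SBG2 and the Strasbourg site come from the posts above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    site: str      # whole campus, sharing one power supply
    building: str  # individual datacenter building
    server: str    # individual machine or disk (hypothetical labels)


@dataclass(frozen=True)
class Failure:
    level: str   # "server", "building" or "site"
    target: str  # which server, building or site is down


def is_affected(copy: Copy, failure: Failure) -> bool:
    """A copy is down if it sits inside the failed server, building or site."""
    if failure.level not in ("server", "building", "site"):
        raise ValueError(f"unknown failure level: {failure.level}")
    return getattr(copy, failure.level) == failure.target


def reachable(copies, failures):
    """Names of the copies untouched by every failure in the list."""
    return [c.name for c in copies
            if not any(is_affected(c, f) for f in failures)]


# Assumed layout: two servers on the same site, one disk somewhere else.
COPIES = [
    Copy("primary", site="strasbourg", building="other-building", server="primary-host"),
    Copy("secondary", site="strasbourg", building="SBG2", server="secondary-host"),
    Copy("offsite-disk", site="elsewhere", building="offsite-building", server="offsite-disk"),
]

fire_only = [Failure("building", "SBG2")]
fire_and_power_cut = fire_only + [Failure("site", "strasbourg")]

print(reachable(COPIES, fire_only))           # ['primary', 'offsite-disk']
print(reachable(COPIES, fire_and_power_cut))  # ['offsite-disk']
```

With the fire alone, the primary would still have been reachable. Once the power cut takes out the whole site, only the off-site disk is left, which is the slow-to-restore copy.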

Did we choose a poor-quality datacenter?

We don’t think so. OVH is the largest cloud server provider in Europe. The OVH data centers (including the one that housed Pavlovia) have ISO 27001 certification, which covers a wide range of measures, including fire safety.

Should we have chosen to host on cloud servers such as Amazon (where the data are not hosted at any particular location but mirrored automatically anywhere)? When rare major events like this occur they can affect large networks as well (Microsoft email sometimes has outages). How do institutions feel about non-localised data (e.g. for GDPR purposes)? As I say, we will look at it again.

How long will it take to get back online?

The original server is not damaged (to the best of our knowledge) but the power is currently cut off and won’t be back for a few days. We are currently rebuilding the server from backups. To speed the process up, we will be testing tomorrow whether we can allow the server to start running studies before all the data have arrived from the backups. Since the most recent data are currently only stored in Strasbourg, we will need to wait for the power to be reconnected before getting the complete datasets.

If you can delay data collection for a couple of days we would strongly recommend you do that. If you have urgent needs due to particular populations or data that must be collected on a particular day then get in touch with us to check the options.

What will we learn from this?

Events like this one, however regrettable, are also an opportunity to learn. I think of it as a painful form of education for us!

We previously estimated that one on-site replica and one off-site backup would be sufficient. This morning’s extremely rare event showed that this was not necessarily the case. Going forward, we will use a replica server off-site (at another datacenter, away from our primary server). We will of course continue to keep a third backup (on disk) as well. Rest assured that our first and greatest priority is for experiment data to be safe.
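As a rough way to compare the old and new arrangements, the sketch below counts how many copies would survive the worst single-site outage. The function name and the site labels (`site-A`, `site-B`, `site-C`) are made up for illustration; the only fact taken from this thread is that the old primary and replica shared a site while the backup disk did not.

```python
from collections import Counter


def min_copies_after_site_outage(plan):
    """plan maps copy name -> site label.

    Returns the fewest copies left after losing any one whole site,
    i.e. the total minus the size of the most crowded site.
    """
    per_site = Counter(plan.values())
    return len(plan) - max(per_site.values())


# Old arrangement: primary and replica on the same site, disk elsewhere.
old_plan = {"primary": "site-A", "replica": "site-A", "backup-disk": "site-B"}

# Planned arrangement: every copy on its own site (labels are hypothetical).
new_plan = {"primary": "site-A", "replica": "site-C", "backup-disk": "site-B"}

print(min_copies_after_site_outage(old_plan))  # 1
print(min_copies_after_site_outage(new_plan))  # 2
```

Under these assumptions the old layout could be reduced to the slow disk backup by one site-wide outage, whereas the planned layout always keeps at least two copies, one of which can be a live server.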

In the end, this was a piece of rather extreme bad luck - most companies never have to test their Disaster Recovery Plans in such dramatic circumstances. I’m sorry for the inconvenience caused to you, the busy scientists, but I’m also immensely grateful to the team, especially Alain, who have been working to mitigate the effects on you in various ways. We’ll continue to work hard in the next few days to get things back up as quickly and smoothly as possible. And we’ll obviously post updates here as often as we can.

best wishes all,
Jon

Thank you all for your hard work on this.

Thank you for the updates! Much appreciated!

OK, I have some positive news on progress.

During the night we have managed to copy over the majority of the data from backup (a server snapshot with all data* up until the 3rd Jan 2021) to a replacement Pavlovia server and we are hoping to have that new Pavlovia instance live for you later today, allowing you to re-sync experiments and continue testing for any urgent projects.

Not all experiments/data will be there initially - data added after 3rd Jan will be populated incrementally through the next week or so - but don’t worry, we are confident at this point that the data will all be recovered eventually.

We will be asking/recommending that, if your study isn’t urgent, you hold off using Pavlovia until we let you know that the full migration of the newer data is complete. That will make the merging of old and new files less difficult for all, but we do understand that won’t be possible for everyone and we hope this solution will get essential studies up and running, as I say, hopefully later today.

Thanks for your patience
Jon

* by “data” I mean all user data, experiment code and participant data

We are currently testing the new Pavlovia, after restoring from the January snapshot. Looks like it’s going well (but we aren’t quite ready for you to log in yet)

Hello Jon,
I urgently need the data from the three experiments that I was running; they are needed for a thesis. Is it possible to get them somehow? There aren’t many, about twenty for the three experiments. They were collected in the last two weeks.
Thanks for everything

Hi A_M, I’m afraid we aren’t in control of that

Okay, let’s wait for all the data to be loaded. At that point they should be downloadable from the panel, correct? Thank you

Hey A_M, that’s correct. The data is there, but it could take a while to get access again.

Hi folks, the newly rebuilt Pavlovia (with data from the January snapshot to start with) is now up and running. We would still recommend that you hold off running studies that aren’t urgent: merging the data when we get the rest back should be possible but might be a slight hassle. If you wait until all the data are back in place (hopefully in a week or so) life should be easier.

But if you do need to run, you can now do so.

Congratulations and a huge thank you to the team for pulling through this massive hurdle!

Two quick clarifications:

  1. So even though power will (hopefully) be back on Monday at OVH, the old full-data Pavlovia won’t be back up for a few days after that?
  2. If we have a study that was in piloting but is ready to go, and we want to start collecting data this week, can we sync that and run it, or are you asking us to hold off on that as well?

Thanks again for all your hard work on this these past few days!

If anyone needs a sign, this literally just happened outside my window:

Thanks Jon - that’s brilliant news and thanks for the hard work to get things back as much as possible.

Just to clarify before I send out comms to researchers: when you say “all data are back in place hopefully in a week or so”, do you mean restored up to the point where things stood when the fire happened, or to some earlier point? (If so, how far before?)
