Author Topic: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018

Offline Chris Bergin

So the internet people (namely Intel) have noticed "the internet" has messed up again and there's a new vulnerability that requires hosts to fix things. Apparently most websites will have to go through this, and most sites won't bother to tell you: they'll go down, come back, shrug their shoulders and say "oh well, sorry?" ;D - not us.

We're on three hosts: IBM Softlayer, Cloudflare and Digital Ocean (a good reason we need L2 support to pay for all that). IBM are always on the ball - to the point they send daily "we're improving things - oh, our phone lines now have a new hold song!" e-mails like some needy ex-girlfriend trying to keep in touch, heh - but they are always first to act, so their datacenters are going to apply a fix first and we have two servers with them. IBM is our main host.

We could be down for a fair amount of time on Saturday morning, but it's at the least busy time of day, and thank goodness the SpaceX launch jumped from a day before to a day after, as it's the same window! (And then it slipped a lot more.) There's a Chinese launch close to this, but several hours before, so we'll be OK.

Planned Virtual Host Reboot Event:

First one:
Dallas, Texas - Start Date: Saturday 25-Aug-2018 03:00:00 UTC

Second one:
Dallas, Texas - Start Date: Saturday 25-Aug-2018 05:01:00 UTC

Quote
The below Virtual Server Instances (VSIs) have been scheduled for maintenance in accordance with the Event. Customers should plan for the VSIs listed below to be inaccessible for the duration of the two hour maintenance window (Chris edit: Oh, thanks a bunch!), although some VSIs may return to service before others. IBM Cloud Engineers will be working to ensure all VSIs return during the maintenance window indicated above.

So I reckon the absolute MAX we'll be down is from 3am UTC to 7am UTC - and then Mark cleans up any errors a reboot can sometimes cause after that. Best case is we're down for 30 mins in the first window and something like that in the second window and the site comes back without any reboot errors - but hey ho, has to be done. Bottom line is Mark will have us ship shape in the morning UTC.
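For anyone who wants to check the sums, here is a rough sketch of the worst case. It only uses the two start times above and the "two hour maintenance window" from the IBM notice; the function name is made up for illustration, and it assumes each window runs its full length.

```python
from datetime import datetime, timedelta, timezone

# "Two hour maintenance window" as quoted in the IBM notice.
WINDOW_LENGTH = timedelta(hours=2)

# The two Dallas start times listed in the post (UTC).
window_starts = [
    datetime(2018, 8, 25, 3, 0, tzinfo=timezone.utc),
    datetime(2018, 8, 25, 5, 1, tzinfo=timezone.utc),
]


def worst_case_outage(starts, length):
    """Return (earliest start, latest end), assuming every window runs full length."""
    earliest = min(starts)
    latest = max(start + length for start in starts)
    return earliest, latest


earliest, latest = worst_case_outage(window_starts, WINDOW_LENGTH)
print(earliest.strftime("%H:%M"), "to", latest.strftime("%H:%M"), "UTC")
print("Worst-case span:", latest - earliest)
```

Strictly speaking the second window closes at 07:01 UTC rather than 7am, since it opens at 05:01.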

We're waiting to hear what the plan is for Digital Ocean, but that's mainly the news site and caching may avoid too much drama there. Cloudflare have said nothing yet so maybe they are clever clogs and can hotpatch like IBM *usually* do. And now I don't know what I'm talking about but Mark does and he's aware of it all.

Anyway! Hey, longer term members will remember when we used to fall over the second someone sneezed "SpaceX launch", so while this is unavoidable and required, the hosts have been great for the past few years.

I'll update this thread as we go.
« Last Edit: 08/24/2018 01:52 am by Chris Bergin »

Offline Lar

  • Fan boy at large
  • Global Moderator
  • Senior Member
  • *****
  • Posts: 11671
  • Saw Gemini live on TV
  • A large LEGO storage facility ... in Michigan
  • Liked: 8823
  • Likes Given: 7399
Thank you for giving IBM some of your server business. This IBMer appreciates it.

Some of the servers I use for internal stuff are also hosted in IBM Cloud... They went through this fix, but did it in a rolling way so that service was lessened but not lost. You have two servers, I think, so maybe talk to your IBM team about doing this in a rolling way next time. Probably too late now but next time?

The above is not an official position of the IBM corporation[1] and should not be construed as such, or relied upon without verification; contact your support team for official support.

1 - IBM knows better than to make ME an official spokesperson...
"I think it would be great to be born on Earth and to die on Mars. Just hopefully not at the point of impact." -Elon Musk
"We're a little bit like the dog who caught the bus" - Musk after CRS-8 S1 successfully landed on ASDS OCISLY

Offline Chris Bergin

This is what all major hosts are working against. Everyone patched, but the reboots are to install a proper fix.


Offline Chris Bergin

Note, the news site will remain up throughout as that's not on the IBM set up. The "active discussions" tab on the news site will likely show an "error" while the forum is down, and the "read more" forum links in articles obviously won't work during the downtime window, but everything else on the news site will work. :)

Offline Chris Bergin

15 mins to the opening of the window.

Obviously during the downtime I won't be able to update the reboot thread, but remember, it may be deeper into the window, it may come back and then go down again (per two reboots) and it may come back with errors (I remember once the forum came back from a reboot, but with a big "database error" sign in the middle, which is no use to anyone), but Mark will clear that error in the morning at the end of the second window.

Feels like a Rocket Lab countdown. "I never want to reboot again".  ;D
« Last Edit: 08/25/2018 03:50 am by Chris Bergin »

Offline Chris Bergin

Ok, so reboot 1 was pretty painless! 20 mins down and back. Only thing not right is the preview of images isn't showing (that can happen, Mark will fix that in the morning). Download link is there, it's just the preview image in posts.

Remember, second one is in a window that opens in 90 minutes.
« Last Edit: 08/25/2018 03:38 am by Chris Bergin »

Offline catdlr

  • Member
  • Senior Member
  • *****
  • Posts: 6027
  • Viewed launches since the Redstones
  • Marina del Rey, California, USA
  • Liked: 2530
  • Likes Given: 1914
Looking good Chris from Los Angeles, CA.  Good Job. :)
Tony De La Rosa

Offline catdlr

Chris,

I'm noticing that I'm unable to see/download attachments - getting 404's.  Here is one example:

https://forum.nasaspaceflight.com/index.php?topic=28104.msg1850049#msg1850049
« Last Edit: 08/25/2018 03:59 am by catdlr »
Tony De La Rosa

Offline Chris Bergin

Quote
Chris,

I'm noticing that I'm unable to see/download attachments - getting 404's.  Here is one example:

https://forum.nasaspaceflight.com/index.php?topic=28104.msg1850049#msg1850049

Thanks. Yeah, I thought it was just the preview image, but clearly, it's the attachments full stop. Mark will get that sorted out after the second reboot that's coming up shortly. Right now, being able to view and post is a bonus, given we're in between the two reboot windows :)

Offline gongora

  • Global Moderator
  • Senior Member
  • *****
  • Posts: 4585
  • US
  • Liked: 4153
  • Likes Given: 2361
Hmmm, my avatar is gone, but still see them on other people's posts.  Weird.

Offline Chris Bergin

Everyone keep listing these things, but especially after the second reboot so Mark can round them up :)

PS: the avatar issue could come down to the difference between uploaded and linked avatars (not sure, just an example), and it looks related to the attachments problem, as this is the error report:

"The attachments upload directory is not writable. Your attachment or avatar cannot be saved."
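For the technically curious, that error simply means the web server could not create files in the attachments folder. A minimal sketch of that kind of check is below. It is not what the forum software or Mark actually runs; the function name is hypothetical and the path in the usage note is only a placeholder.

```python
import os
import tempfile


def is_writable_dir(path):
    """Return True if `path` is an existing directory we can create a file in."""
    if not os.path.isdir(path):
        return False
    try:
        # Actually try to create (and automatically delete) a temporary file,
        # which is more reliable than only inspecting permission bits.
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False
```

Calling `is_writable_dir("/path/to/attachments")` (placeholder path) as the web server's user would return `False` in the situation the error describes.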

Offline NSF Webmaster

  • Administrator
  • Full Member
  • *****
  • Posts: 339
  • Almere, The Netherlands
    • NASASpaceFlight.com
  • Liked: 156
  • Likes Given: 15
All attachments and avatars are working again.

Mark

Offline Chris Bergin

Thanks Mark! All maintenance and reboots are now complete and NSF's forums are back to normal. Thanks for your patience.

Now I can go to bed. ;D

(These things are like your own child going in for an operation to me. NSF is literally my own child).

Please note any issues we may have missed in this thread.

Offline Chris Bergin

We had a 15 minute forum freeze there. No warning, none of the hosts showed anything amiss, so could be some 'leftover' from the previous reboots. Will let Mark know, but it all came back by itself and with no errors, so that's good.

Very quiet time for the site (the time of day/night/Sunday), so not many of you will have noticed it, but we always note such things.

Offline Chris Bergin

Reddit appears to be staggering their reboots - going through their "lots of" servers with this requirement. Except they are less subtle :o!

Yeah, blame someone else! I should have tried that. ;)

Offline Lar

I think if you use that exact image they will know that you're faking... Maybe use the NSF logo instead LOL

Offline Chris Bergin

Quote
I think if you use that exact image they will know that you're faking... Maybe use the NSF logo instead LOL

I think you jinxed it as the forum did the same as last night. Mark is checking into it again. And it's your IBM gang at fault ;)

Offline Chris Bergin

Yeah, it's IBM for sure. But that's better than some local misconfig where it's just us and Mark has to hammer something. Each of the three times the site has become inaccessible, it has come back automatically after 10 mins, so if I'm not around, just give it 10 mins and try again.

Offline Lar

No bueno. Sorry man.

Offline Chris Bergin

Still getting the occasional freeze. It always returns automatically 5-10 mins later, so if you do happen to see it, just give it that long. IBM are aware, but seem rather useless (two different engineers not reading each other's updates). We'll keep banging their doors down until they sort it out. It's pretty much 10-15 or so minutes in the day in total, but that's 10-15 minutes more than it should be.

Offline Chris Bergin

Quote
Still getting the occasional freeze. It always returns automatically 5-10 mins later, so if you do happen to see it, just give it that long. IBM are aware, but seem rather useless (two different engineers not reading each other's updates). We'll keep banging their doors down until they sort it out. It's pretty much 10-15 or so minutes in the day in total, but that's 10-15 minutes more than it should be.

Never happened again after this point, so whatever it was has been solved. I knew if I posted this after 24 hours with no issues, I'd jinx it, so waited until over four days with no freeze. :)

Locking the thread.
« Last Edit: 09/03/2018 12:50 pm by Chris Bergin »
