NASASpaceFlight.com Forum

NSF Landing Page (Site Rules, Overviews Development, Feedback) => NASASpaceflight.com and NSF Forum Site Rules/News => Topic started by: Chris Bergin on 08/21/2018 12:30 am

Title: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: Chris Bergin on 08/21/2018 12:30 am
So the internet people (namely Intel) have noticed "the internet" has messed up again and there's a new vulnerability that requires hosts to fix things. Apparently most websites will have to go through this and most sites won't bother to tell you, go down, come back and shrug their shoulders and say "oh well, sorry?" ;D - not us.

We're on three hosts: IBM Softlayer, Cloudflare and Digital Ocean (a good reason we need L2 support to pay for all that). IBM are always on the ball - to the point they send daily "we're improving things - oh, our phone lines now have a new hold song!" e-mails like some needy ex-girlfriend trying to keep in touch, heh - but they are always first to act, so their datacenters are going to apply a fix first, and we have two servers with them. IBM is our main host.

We could be down for a fair amount of time on Saturday morning, but it's at the least busy time of day, and thank goodness the SpaceX launch jumped from a day before to a day after, as it's the same window! (And then it slipped a lot more.) There's a Chinese launch close to this, but several hours before, so we'll be OK.

Planned Virtual Host Reboot Event:

First one:
Dallas, Texas - Start Date: Saturday 25-Aug-2018 03:00:00 UTC

Second one:
Dallas, Texas - Start Date: Saturday 25-Aug-2018 05:01:00 UTC

Quote
The below Virtual Server Instances (VSIs) have been scheduled for maintenance in accordance with the Event. Customers should plan for the VSIs listed below to be inaccessible for the duration of the two hour maintenance window (Chris edit: Oh, thanks a bunch!), although some VSIs may return to service before others. IBM Cloud Engineers will be working to ensure all VSIs return during the maintenance window indicated above.

So I reckon the absolute MAX we'll be down is from 3am UTC to 7am UTC - and then Mark cleans up any errors a reboot can sometimes cause after that. Best case is we're down for 30 mins in the first window and something like that in the second window and the site comes back without any reboot errors - but hey ho, has to be done. Bottom line is Mark will have us ship shape in the morning UTC.
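
As a rough check of that arithmetic, here is a minimal Python sketch. It assumes each of the two events gets its own two hour window, which is how the IBM notice reads; the names WINDOW_STARTS, WINDOW_LENGTH and worst_case_outage are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Start times taken from the IBM notice quoted above.
WINDOW_STARTS = [
    datetime(2018, 8, 25, 3, 0, tzinfo=timezone.utc),
    datetime(2018, 8, 25, 5, 1, tzinfo=timezone.utc),
]
# Assumption: every event has the same two hour maintenance window.
WINDOW_LENGTH = timedelta(hours=2)


def worst_case_outage(starts, length):
    """Return (earliest start, latest end) across all maintenance windows."""
    ends = [start + length for start in starts]
    return min(starts), max(ends)


begin, end = worst_case_outage(WINDOW_STARTS, WINDOW_LENGTH)
print(begin.strftime("%H:%M"), "to", end.strftime("%H:%M"), "UTC")
print("Worst case:", end - begin)
```

This prints 03:00 to 07:01 UTC and a worst case of 4:01:00, which agrees with the "3am UTC to 7am UTC" estimate to within a minute.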

We're waiting to hear what the plan is for Digital Ocean, but that's mainly the news site and caching may avoid too much drama there. Cloudflare have said nothing yet so maybe they are clever clogs and can hotpatch like IBM *usually* do. And now I don't know what I'm talking about but Mark does and he's aware of it all.

Anyway! Hey, longer term members will remember when we used to fall over the second someone sneezed "SpaceX launch", so while this is unavoidable and required, the hosts have been great for the past few years.

I'll update this thread as we go.
Title: Re: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: Lar on 08/21/2018 02:29 pm
Thank you for giving IBM some of your server business. This IBMer appreciates it.

Some of the servers I use for internal stuff are also hosted in IBM Cloud... They went through this fix, but did it in a rolling way so that service was lessened but not lost. You have two servers, I think, so maybe talk to your IBM team about doing this in a rolling way next time. Probably too late now but next time?

The above is not an official position of the IBM corporation[1] and should not be construed as such, or relied upon without verification. Contact your support team for official support.

1 - IBM knows better than to make ME an official spokesperson...
Title: Re: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: Chris Bergin on 08/24/2018 01:44 am
This is what all major hosts are working against. Everyone patched, but the reboots are to install a proper fix.

https://www.youtube.com/watch?v=kBOsVt0iXE4
Title: Re: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: Chris Bergin on 08/24/2018 04:05 pm
Note, the news site will remain up throughout as that's not on the IBM set up. The "active discussions" tab on the news site will likely show an "error" while the forum is down, and the "read more" forum links in articles obviously won't work during the downtime window, but everything else on the news site will work. :)
Title: Re: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: Chris Bergin on 08/25/2018 02:47 am
15 mins to the opening of the window.

Obviously during the downtime I won't be able to update the reboot thread. But remember: the downtime may start deeper into the window, the site may come back and then go down again (per the two reboots), and it may come back with errors (I remember once the forum came back from a reboot, but with a big "database error" sign in the middle, which is no use to anyone). Mark will clear any such error in the morning at the end of the second window.

Feels like a Rocket Lab countdown. "I never want to reboot again".  ;D
Title: Re: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: Chris Bergin on 08/25/2018 03:35 am
Ok, so reboot 1 was pretty painless! 20 mins down and back. Only thing not right is the preview of images isn't showing (that can happen, Mark will fix that in the morning). Download link is there, it's just the preview image in posts.

Remember, second one is in a window that opens in 90 minutes.
Title: Re: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: catdlr on 08/25/2018 03:40 am
Looking good Chris from Los Angeles, CA.  Good Job. :)
Title: Re: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: catdlr on 08/25/2018 03:58 am
Chris,

I'm noticing that I'm unable to see/download attachments - getting 404's.  Here is one example:

https://forum.nasaspaceflight.com/index.php?topic=28104.msg1850049#msg1850049
Title: Re: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: Chris Bergin on 08/25/2018 04:19 am
Quote from: catdlr on 08/25/2018 03:58 am
Chris,

I'm noticing that I'm unable to see/download attachments - getting 404's.  Here is one example:

https://forum.nasaspaceflight.com/index.php?topic=28104.msg1850049#msg1850049

Thanks. Yeah, I thought it was just the preview image, but clearly, it's the attachments full stop. Mark will get that sorted out after the second reboot that's coming up shortly. Right now, being able to view and post is a bonus, as we're in the middle of the two reboot windows :)
Title: Re: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: gongora on 08/25/2018 04:20 am
Hmmm, my avatar is gone, but still see them on other people's posts.  Weird.
Title: Re: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: Chris Bergin on 08/25/2018 04:26 am
Everyone keep listing these things, but especially after the second reboot so Mark can round them up :)

PS The avatar problem could be down to the difference between uploaded and linked avatars (not sure, just an example), as it looks related to the attachments issue. This is the error report:

"The attachments upload directory is not writable. Your attachment or avatar cannot be saved."
Title: Re: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: NSF Webmaster on 08/25/2018 06:18 am
All attachments and avatars are working again.

Mark
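
For anyone who runs into the same "attachments upload directory is not writable" report, a quick way to confirm it from the server side is to test the directory directly. The thread does not say what the underlying cause or the fix was, so this is only a minimal Python sketch: check_upload_dir is a made-up helper, and the real attachments path would have to be substituted for the temporary directory used in the demo.

```python
import os
import tempfile


def check_upload_dir(path):
    """Return a short problem description, or None if the directory looks usable."""
    if not os.path.isdir(path):
        return "missing: " + path
    # os.access tests against the real uid/gid of the process running this,
    # so run it as the same user as the web server for a meaningful answer.
    if not os.access(path, os.W_OK | os.X_OK):
        return "not writable: " + path
    return None


# Demo against a temporary directory (a stand-in for the real upload path).
with tempfile.TemporaryDirectory() as demo_dir:
    print(check_upload_dir(demo_dir))
    print(check_upload_dir(os.path.join(demo_dir, "no-such-dir")))
```

The first call reports no problem, and the second reports the directory as missing.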
Title: Re: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: Chris Bergin on 08/25/2018 06:53 am
Thanks Mark! All maintenance and reboots are now complete and NSF's forums are back to normal. Thanks for your patience.

Now I can go to bed. ;D

(These things are like your own child going in for an operation to me. NSF is literally my own child).

Please note any issues we may have missed in this thread.
Title: Re: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: Chris Bergin on 08/27/2018 02:21 am
We had a 15 minute forum freeze there. No warning, none of the hosts showed anything amiss, so could be some 'leftover' from the previous reboots. Will let Mark know, but it all came back by itself and with no errors, so that's good.

Very quiet time for the site (the time of day/night/Sunday), so not many of you will have noticed it, but we always note such things.
Title: Re: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: Chris Bergin on 08/27/2018 06:36 pm
Reddit appears to be staggering their reboots - going through their "lots of" servers with this requirement. Except they are less subtle :o!

Yeah, blame someone else! I should have tried that. ;)
Title: Re: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: Lar on 08/27/2018 11:31 pm
I think if you use that exact image they will know that you're faking... Maybe use the NSF logo instead LOL
Title: Re: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: Chris Bergin on 08/27/2018 11:56 pm
Quote from: Lar on 08/27/2018 11:31 pm
I think if you use that exact image they will know that you're faking... Maybe use the NSF logo instead LOL

I think you jinxed it as the forum did the same as last night. Mark is checking into it again. And it's your IBM gang at fault ;)
Title: Re: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: Chris Bergin on 08/28/2018 12:19 am
Yeah, it's IBM for sure. But that's better than some local misconfig where it's just us and Mark has to hammer something. Each of the three times the site has become inaccessible, it has come back automatically after 10 mins, so if I'm not around, just give it 10 mins and try again.
Title: Re: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: Lar on 08/28/2018 04:46 am
No bueno. Sorry man.
Title: Re: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: Chris Bergin on 08/29/2018 02:57 pm
Still getting the occasional freeze. It always returns automatically 5-10 mins later, so if you do happen to see it, just give it that long. IBM is aware, but seem rather useless (two different engineers not reading each other's updates). We'll keep banging their doors down until they sort it out. It's pretty much 10-15 or so minutes in the day in total, but that's 10-15 minutes more than it should be.
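
The "give it that long and try again" advice can be written down as a small wait-and-retry loop. This is a minimal Python sketch, not anything the site actually runs: wait_until_up and is_up are hypothetical names, and in real use is_up would wrap something like an HTTP request to the forum. The 600 second default is the upper end of the 5-10 mins mentioned above.

```python
import time


def wait_until_up(is_up, wait_seconds=600, max_attempts=3, sleep=time.sleep):
    """Call is_up() up to max_attempts times, sleeping between failed tries.

    Returns the number of the attempt that succeeded, or None if none did.
    """
    for attempt in range(1, max_attempts + 1):
        if is_up():
            return attempt
        if attempt < max_attempts:
            sleep(wait_seconds)
    return None


# Demo with a fake check that fails twice and then succeeds.
# The sleep is stubbed out so the demo finishes immediately.
responses = iter([False, False, True])
print(wait_until_up(lambda: next(responses), sleep=lambda seconds: None))  # prints 3
```

Passing the check and the sleep in as arguments keeps the loop testable without touching the network or actually waiting ten minutes.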
Title: Re: NSF downtime for host reboots - Saturday morning UTC - August 25, 2018
Post by: Chris Bergin on 09/03/2018 12:50 pm
Quote from: Chris Bergin on 08/29/2018 02:57 pm
Still getting the occasional freeze. It always returns automatically 5-10 mins later, so if you do happen to see it, just give it that long. IBM is aware, but seem rather useless (two different engineers not reading each other's updates). We'll keep banging their doors down until they sort it out. It's pretty much 10-15 or so minutes in the day in total, but that's 10-15 minutes more than it should be.

Never happened again after this point, so whatever it was has been solved. I knew if I posted this after 24 hours with no issues, I'd jinx it, so I waited until over four days with no freeze. :)

Locking the thread.