Service Outage

Postby Zancarius » Thu Jul 18, 2013 1:32 pm

As you may have already discovered if you've visited the forums since this morning, they've been inaccessible for a few hours. While we're still investigating the cause, a brief synopsis is as follows.

At around 9:30am MDT, several processes began triggering the OOM (Out-of-Memory) killer. This rarely happens except under high load, and only when all available virtual memory (including swap) is exhausted. While not ideal, the OOM killer attempts to kill off the highest memory-consuming processes on the machine in an effort to recover it to a point where it can continue running. Unfortunately, because the OOM killer uses SIGKILL (kill the process immediately), killed processes aren't given a chance to shut down gracefully. This is by design: a system that has run out of memory won't have enough left for the killed processes to allocate what they might need during shutdown.

Because of how the OOM killer selects processes, the MySQL database process (which powers the forums and some of the other 'goon sites) was among those terminated. MySQL was probably in the middle of a write at the time, because several tables were corrupted and the server was left unable to start. This required me to restore the database from a mix of backups and MySQL's binary logs. As a result, some posts between September 4th and 5th of last year, and anything posted after 8:00am MDT this morning (July 18th), may be missing from the restored database. I apologize if your post is one of the missing ones.

Initial investigation suggests that something triggered a very rapid spawn of httpd processes, which would be consistent with a huge traffic spike. The source of that spike is still unknown, as I haven't had much of a chance to examine the logs. Moreover, since syslog was one of the processes selected for termination, any logging data that wasn't buffered by the kernel (which basically means anything that wasn't part of the kernel log) wasn't stored.

I'll keep you posted.
I gave that lich a phylactery shard. Liches love phylactery shards.
Zancarius
Site Admin
 
Posts: 3907
Joined: Wed Jul 05, 2006 3:06 pm
Location: New Mexico
Gender: Male

Re: Service Outage

Postby Zancarius » Thu Jul 18, 2013 2:11 pm

Everything should be restored at this point.

The Minecraft server will remain offline for a while until we determine for certain what happened. (That, and I don't yet have the time to verify that it is fully functioning.)

Re: Service Outage

Postby Grimblast » Thu Jul 18, 2013 3:20 pm

Image
Guild Wars 2 Characters
Turalia Gearspark - Asuran Engineer ----------- Turus Gearspark - Asuran Guardian
Thelena Turusian - Norn Warrior ---------------- Jake Turusian - Human Thief
Dililah Turusian - Norn Necromancer ------------ Rahl Braincrusher - Charr Mesmer
Star Earthbreaker - Sylvari Elementalist -------- Rylo Preystalker - Charr Ranger
Grimblast
Site Admin
 
Posts: 2513
Joined: Wed Jul 05, 2006 3:21 pm
Location: Alamogordo, New Mexico
Gender: Male

Re: Service Outage

Postby Zancarius » Thu Jul 18, 2013 3:29 pm

I'm starting to think that might've been the case.

Re: Service Outage

Postby Zancarius » Thu Jul 18, 2013 3:33 pm

Here's a small sample of the hundreds of httpd processes that were being killed off and respawning. It's almost like watching a textual Doom match from the 1990s.

Code:
Jul 18 09:31:55 localhost kernel: [28421373.999321] [ 5870]    33  5870    58047     1960   1       0             0 httpd
Jul 18 09:31:55 localhost kernel: [28421373.999323] [ 5891]    33  5891    58047     1961   1       0             0 httpd
Jul 18 09:31:55 localhost kernel: [28421373.999326] [ 5893]    33  5893    57791     1766   1       0             0 httpd
Jul 18 09:31:55 localhost kernel: [28421373.999329] [ 5894]    33  5894    58047     1961   0       0             0 httpd
Jul 18 09:31:55 localhost kernel: [28421373.999331] [ 5899]    33  5899    58047     1956   0       0             0 httpd
Jul 18 09:31:55 localhost kernel: [28421373.999334] [ 5900]    33  5900    58047     1961   1       0             0 httpd
Jul 18 09:31:55 localhost kernel: [28421373.999336] [ 5908]    33  5908    58047     1957   0       0             0 httpd
Jul 18 09:31:55 localhost kernel: [28421373.999339] [ 5909]    33  5909    58047     1961   1       0             0 httpd
Jul 18 09:31:55 localhost kernel: [28421373.999341] [ 5931]    33  5931    57734     1669   1       0             0 httpd
Jul 18 09:31:55 localhost kernel: [28421373.999344] [ 5932]    33  5932    57983     1897   1       0             0 httpd
Jul 18 09:31:55 localhost kernel: [28421373.999346] [ 5933]    33  5933    57983     1898   1       0             0 httpd
Jul 18 09:31:55 localhost kernel: [28421373.999349] [ 5934]    33  5934    57989     1769   0       0             0 httpd
Jul 18 09:31:55 localhost kernel: [28421373.999351] [ 5935]    33  5935    57918     1733   0       0             0 httpd
Jul 18 09:31:55 localhost kernel: [28421373.999354] [ 5936]    33  5936    57918     1733   1       0             0 httpd
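
For anyone wanting to tally a dump like this rather than eyeball it, here is a rough Python sketch. It assumes the columns follow the kernel's OOM task-dump layout from kernels of that era (pid, uid, tgid, total_vm, rss, cpu, oom_adj, oom_score_adj, name); that layout is an assumption, since the header row isn't part of the sample above. The `summarize` helper is a made-up name for the example.

```python
import re
from collections import Counter

# Assumed column layout (the header row is not in the sample above):
# [pid] uid tgid total_vm rss cpu oom_adj oom_score_adj name
TASK_LINE = re.compile(
    r"\[\s*(?P<pid>\d+)\]\s+(?P<uid>\d+)\s+(?P<tgid>\d+)\s+"
    r"(?P<total_vm>\d+)\s+(?P<rss>\d+)\s+(?P<cpu>\d+)\s+"
    r"(?P<oom_adj>-?\d+)\s+(?P<oom_score_adj>-?\d+)\s+(?P<name>\S+)\s*$"
)

def summarize(lines):
    """Return {process name: (process count, total rss in pages)}."""
    counts = Counter()
    rss_pages = Counter()
    for line in lines:
        match = TASK_LINE.search(line)
        if match is None:
            continue  # not a task-dump row; skip it
        counts[match["name"]] += 1
        rss_pages[match["name"]] += int(match["rss"])
    return {name: (counts[name], rss_pages[name]) for name in counts}

if __name__ == "__main__":
    sample = [
        "Jul 18 09:31:55 localhost kernel: [28421373.999321] [ 5870]    33  5870    58047     1960   1       0             0 httpd",
        "Jul 18 09:31:55 localhost kernel: [28421373.999323] [ 5891]    33  5891    58047     1961   1       0             0 httpd",
    ]
    for name, (count, pages) in summarize(sample).items():
        print(name, count, pages)
```

If the rss column is in pages and the page size is 4 KiB (both assumptions here), each httpd in the sample holds roughly 7.7 MiB resident, so a few hundred of them add up quickly.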

Re: Service Outage

Postby Zancarius » Sat Jul 20, 2013 3:13 pm

Apologies again for the service disruption today. Several updates were in dire need of being applied, so the server was offline while they were installed.

There'll be some service disruptions to the goon site in the coming days as we look at shifting to nginx (away from Apache), but TeamSpeak will remain untouched.

Also, sorry Josh...

Code:
$ uptime
13:53:54 up 332 days, 34 min,  1 user,  load average: 0.00, 0.01, 0.05