NS27 Server Offline

11-12-2016

3:30am: Linux server CPANEL27 suddenly stopped responding to remote requests.

4:30am: After a complete analysis and hardware diagnostics, a fault was identified in the RAM modules. The server was un-racked, the modules were replaced, and the server was placed back in the rack.

5:30am: Unfortunately, the server did not reboot correctly even after the RAM replacement and is now failing to mount the root partition.

6:30am: We are actively working with the hardware and OS teams to have the server back online as soon as possible, and will post more information as it becomes available.

7:30am: The OS team is actively working to rescue the server and its file system. This is a time-consuming process.

8:30am: OS team is still working on rescue.

9:50am: The server was repeatedly crashing on mount during OS recovery. The issue has been passed back to the hardware team for deeper analysis of the server and RAID hardware.

11:20am: The issue is still ongoing and the hardware team is still actively working on this.

12:40pm: The RAID hardware and all cables were replaced, but this did not help. We have also started cloning the data to another server and are continuing to work on the issue.

2:40PM: Cloning to a new drive on a new server has completed. We are now running repairs on the clone.

4pm: The filesystem check / repair is still running. Unfortunately, this does not show a progress percentage or report. A sample can be seen at this link.

5:40pm: The server briefly came up but is crashing again repeatedly on boot attempts. We are continuing to investigate this.

6:30pm: Server came online but started showing file system errors again. We are continuing to check this.

7pm: Unfortunately, at this point we have to consider this filesystem unreliable and corrupted beyond repair. We are provisioning a new server and will have to start a restore from backup. Regrettably, this will take several more hours, but we are working around the clock to get services up for all clients as soon as possible.

11:45PM: The old server has been mounted in rescue mode and a new server has been set up to restore data. A full backup from December 1st will be restored first to ensure a valid cPanel and data structure. Following that, the latest data for websites, email and databases will be synced from the old server. The data restore is currently in progress. If your account has been restored, please try to keep account activity to a minimum at this time.

3AM: Restore is about 70% done at this time. Once this completes, we will run a sync to update files.

6AM: The full backups available from December 1st, 4th and 10th were restored. We did some test runs to sync data and MySQL from the old server, but MySQL was unable to start with the data from the old server. Unfortunately, that data appears to be corrupted beyond repair. We will check further in a test environment to see whether the MySQL databases from the 11th can be recovered, and will provide these on an as-needed basis. It would therefore not be advisable to overwrite restored, working data by syncing from the old server. We will handle requests to copy and test data from the old server on a case-by-case basis.

12:30PM: Emails have been synced to latest data from the old disk.

We sincerely regret and apologize for this inconvenience and outage. The reason we had to roll back to older backups is that the crash occurred while a backup was being taken on Sunday morning. This cleared the previous daily backup and wrote out an incomplete new daily backup. While this is an extremely rare occurrence, we will be looking into implementing an additional fail-safe to avoid such a situation going forward.
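For readers curious what such a fail-safe might look like, the following is a minimal sketch only, not a description of our actual backup tooling. It assumes backups are plain directories; the names `rotate_daily_backup`, `daily`, `daily.incoming` and `daily.previous`, and the `build` and `verify` callbacks, are all hypothetical. The idea is to write the new backup into a staging directory and only swap it into place once it is complete and verified, so a crash mid-backup leaves the last good daily backup untouched.

```python
import os
import shutil
from pathlib import Path
from typing import Callable


def rotate_daily_backup(
    root: Path,
    build: Callable[[Path], None],
    verify: Callable[[Path], bool],
) -> bool:
    """Replace root/daily only after a new backup is fully written and verified.

    All directory names and both callbacks are hypothetical placeholders.
    """
    current = root / "daily"            # last known-good daily backup
    staging = root / "daily.incoming"   # new backup is written here first
    previous = root / "daily.previous"  # one older generation kept as a spare

    # Discard leftovers from an earlier interrupted run; 'current' is untouched.
    shutil.rmtree(staging, ignore_errors=True)
    staging.mkdir(parents=True)

    try:
        build(staging)        # write the new backup into the staging directory
        ok = verify(staging)  # e.g. check file counts or checksums
    except Exception:
        ok = False

    if not ok:
        # Failed or incomplete backup: drop it and keep the existing daily.
        shutil.rmtree(staging, ignore_errors=True)
        return False

    # Only now retire the old backup, keeping it as 'previous' rather than
    # deleting it, then move the verified backup into place.
    shutil.rmtree(previous, ignore_errors=True)
    if current.exists():
        os.rename(current, previous)
    os.rename(staging, current)
    return True
```

With this ordering, a hard crash during `build` leaves only a partial `daily.incoming` directory, which the next run discards, while `daily` still holds the last complete backup.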

