Ubuntu 22.04 - Server Reboot Corrupts Boot Partition

Hi all, I am having a very strange issue - it first happened 2 weeks ago when I last rebooted my server.

Essentially, I performed sudo reboot via SSH.

My server never came back up - upon logging into my server via an IPMI Java applet, I could see the server wasn’t booting from its boot partition. I quickly rebooted my server into rescue mode, where I had to manually rebuild the boot partition - this took me a LONG time since I’m not all that familiar with Linux commands.

All had been well until today, when I rebooted my server again via the sudo reboot command.

This time, it appeared my server booted straight into the GRUB command-line console. I tried setting the root, the kernel (linux /vmlinuz), and the initrd (initrd.img), but unfortunately that still didn’t work.

I will need to boot back into rescue mode, mount the server partition, and try to repair the boot partition once again - though this will have to be done tomorrow morning as it’s getting late now.

My question to the community would be - what is happening?

I am running on an OVH dedicated server with 2 x 2TB HDDs in RAID1.

When I checked my /boot/ folder last time and compared it to my other CyberPanel install, some files appeared to be missing. Is the sudden reboot of the server causing the boot partition to become corrupted? Or do I perhaps have a bigger problem at hand - failing disks?
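Rather than comparing /boot/ against another install by eye, the listing can be checked for kernels that lack a kernel image or an initramfs. The snippet below is only a rough sketch: it assumes the usual Ubuntu naming (vmlinuz-<version>, initrd.img-<version>), and the function name missing_boot_files is made up for illustration.

```python
import re
from collections import defaultdict

# File types a kernel normally needs in /boot to be bootable
# (assumption: standard Ubuntu naming, e.g. vmlinuz-5.15.0-71-generic).
REQUIRED_KINDS = {"vmlinuz", "initrd.img"}


def missing_boot_files(filenames):
    """Map each kernel version found in the listing to the required kinds it lacks."""
    found = defaultdict(set)
    for name in filenames:
        match = re.match(r"^(vmlinuz|initrd\.img|config|System\.map)-(.+)$", name)
        if match:
            kind, version = match.groups()
            found[version].add(kind)
    # Keep only versions where at least one required kind is absent.
    return {
        version: sorted(REQUIRED_KINDS - kinds)
        for version, kinds in found.items()
        if REQUIRED_KINDS - kinds
    }
```

On the server (or inside the rescue chroot) it could be called as missing_boot_files(os.listdir("/boot")) after importing os. An empty result only means the files exist, not that their contents are intact.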

I know this might not be the place to ask this, but I did see someone else had a similar issue on this forum with their Ubuntu setup, where their boot partition vanished or became corrupted. So I am hoping someone can help point me in the right direction with regard to what I need to look out for.

Other than the boot partition (/boot/) failing, I haven’t had any other issues with CyberPanel; things have been running smoothly. I am a bit worried about losing my data.

Please share if you get this resolved.

Unfortunately, I’ve got no idea what caused the /boot/ partition to become corrupt or lose files, so I am still uncertain as to the root cause of the issue.

However, after A LOT of trial and error, and after jumping in and out of recovery mode on my server MULTIPLE times (took 10+ hours), I finally narrowed it down to a faulty Kernel.

I was able to boot into Ubuntu with the 5.15.0-71-generic kernel, but it would NOT boot with the 73-generic kernel.

After force-reinstalling the 73-generic kernel in rescue/recovery mode (which required mounting the partitions and chroot’ing into them), I was able to resolve the issue.
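For anyone who ends up in the same spot, the mount, chroot and reinstall sequence described above can be written out as a dry-run plan before touching anything. This is a sketch under assumptions, not the exact commands that were run: the device names are placeholders that depend on the server's layout, the helper name chroot_reinstall_plan is made up, and the update-initramfs and update-grub steps are a common follow-up rather than something confirmed in this thread.

```python
def chroot_reinstall_plan(root_dev, boot_dev, kernel_version, mount_point="/mnt"):
    """Return, in order, the shell commands for a rescue-mode kernel reinstall.

    Nothing is executed here; the list is meant to be reviewed first.
    root_dev and boot_dev are placeholders for the real root and /boot devices.
    """
    chroot = f"chroot {mount_point}"
    plan = [
        f"mount {root_dev} {mount_point}",
        f"mount {boot_dev} {mount_point}/boot",
    ]
    # Bind the virtual filesystems so apt and GRUB tools work inside the chroot.
    for fs in ("dev", "proc", "sys"):
        plan.append(f"mount --bind /{fs} {mount_point}/{fs}")
    plan += [
        f"{chroot} apt-get install --reinstall linux-image-{kernel_version}",
        # Assumed follow-up steps: rebuild the initramfs and the GRUB menu.
        f"{chroot} update-initramfs -u -k {kernel_version}",
        f"{chroot} update-grub",
    ]
    return plan
```

For example, print("\n".join(chroot_reinstall_plan("/dev/md3", "/dev/md2", "5.15.0-73-generic"))) lists the steps for hypothetical md devices, assuming the full version string of the kernel abbreviated above as 73-generic is 5.15.0-73-generic.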

However, that’s not to say the issue won’t recur in the future if something underlying is causing it.

Thank you for sharing how you resolved this.
Looping in @shoaibkk and @usmannasir.

I might add that I took additional steps, such as checking drive health and server health (RAM, CPU), to make sure no hardware-related issue was causing the /boot/ partition to corrupt/fail (for the second time), but there is no indication of one on the server side.

It is running software RAID1; however, the RAID checks out fine, with no issues after a long scan.
