I recently upgraded my home server from Ubuntu Karmic to Lucid. It did not go well.
The actual apt-get dist-upgrade went fine, with only minor problems that were easy enough to fix. The trouble came when I rebooted. The boot started fine, but then got stuck at the purple boot page, which showed “Ubuntu 10.04” and five dots that cycled between white and red. It never got past that point.
The usual thing to do at this point is to reboot in single user mode and start debugging startup scripts. Unfortunately I found that single user mode with Ubuntu Lucid was not useful as it doesn’t start a shell until after a huge pile of other things are started. In my case a ‘single’ boot got stuck at the same point. Getting rid of the quiet and splash options, and adding nomodeset also didn’t help.
I found that if I booted an older kernel (2.6.31-19) then the system came up OK. That pointed to a likely driver issue. I could have just settled for that older kernel, but part of the reason for going to Lucid was to get a newer ALSA with better support for HDMI audio, so I didn’t really want to stick to an older kernel. I also wanted to know why the problem was happening.
I was also able to get a shell using the latest kernel by using the init=/bin/bash trick, but that doesn’t help to actually debug the problem. To debug startup problems you need to be able to watch the startup process in action, to see what is waiting. This is much harder these days with the new upstart init system now used in Ubuntu, as startup is much more parallel than it used to be. Adding some echo lines to init scripts used to be a useful technique, but it is much harder to get anything sensible out of that when using upstart.
To try to debug the problem I initially had a look for any startup debugging options. I found some promising options in /etc/default/rcS, and tried setting VERBOSE=yes and SULOGIN=yes. I found that the VERBOSE=yes option was somewhat useful, as it gave me some information on what jobs were started/waiting, but it didn’t really allow me to pin down the problem. The parallelism in upstart again made interpreting the output hard. When it says that a job is waiting it doesn’t say what it is waiting on, so you have no idea what the underlying problem really is.
Despite the promising name, and the nice description in the rcS(5) manpage, the SULOGIN=yes option didn’t seem to do anything at all. A grep for SULOGIN in the startup scripts didn’t find any hits, so I suspect it isn’t actually implemented.
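A search like that can be sketched roughly as follows. The function name find_refs and the directory list in the usage comment are assumptions for illustration; the directories are only a guess at where the startup scripts live on a Lucid system.

```shell
# List the files under the given directories that mention NAME.
# Usage: find_refs NAME DIR...
find_refs() {
    name=$1
    shift
    # -r: recurse into directories
    # -l: print only the names of matching files
    # -s: stay quiet about directories that do not exist
    grep -rls -- "$name" "$@"
}

# Example (directory list is an assumption, adjust to taste):
# find_refs SULOGIN /etc/init /etc/init.d /etc/rcS.d
```

If this prints nothing and returns a non-zero status, no file in those directories mentions the variable.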
As usual, the real key to solving the problem was a hack. I added the following to /etc/default/rcS:
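(
# wait 10 seconds before touching the network
/bin/sleep 10
# bring up eth0 by hand with a static address
/sbin/ifconfig eth0 192.168.2.10 up
# start the ssh daemon so a remote login is possible
/usr/sbin/sshd
) > /dev/null 2>&1 &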
The idea behind this hack was to allow me to log in with ssh from my laptop during the startup process and watch what was going on. This worked really well and meant that I was finally able to debug the startup process with the most recent Lucid kernel.
I rebooted again, logged into the system with ssh from my laptop, and started poking around with ps and initctl to see what was going on. I had assumed that “initctl list” would give me the information I needed. It does show what jobs are waiting, but as with the VERBOSE=yes messages it doesn’t tell you what it is waiting on.
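To cut the “initctl list” output down to the interesting jobs, a small filter along these lines can help. It is only a sketch: the function name not_running is made up, and it assumes each line of the listing has the form “job goal/state”, with healthy services shown as start/running.

```shell
# Print only the jobs that are NOT in the start/running state.
# Reads an "initctl list" style listing on stdin, so it also works
# on a copy of the output saved to a file.
not_running() {
    grep -v ' start/running'
}

# Example, on the machine being debugged:
# initctl list | not_running
```

This only narrows down the list. It still does not say what a given job is waiting on.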
Poking around some more I saw 3 things that were suspicious:
1) cryptdisks-enable was shown as “waiting”. I don’t have any encrypted disks on this system, so why should it be waiting?
2) dmesg showed a segfault in plymouth, which is the process that asks for user input during startup (it also does splash screens). This could be linked to why cryptdisks was waiting, as it’s possible that cryptdisks wanted a passphrase (for what disk though? I don’t have any encrypted disks).
3) dmesg also showed a lot of warnings from the dvb-usb-cxusb driver.
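Picking lines like these out of a long dmesg can be done with a filter such as the one below. This is a sketch only: the function name and the two search patterns are assumptions chosen to match the items above, not the exact commands used at the time.

```shell
# Keep only kernel log lines that mention a segfault or the cxusb driver.
# Reads the log on stdin; matching is case-insensitive.
suspicious_kernel_lines() {
    grep -i -e 'segfault' -e 'cxusb'
}

# Example:
# dmesg | suspicious_kernel_lines
```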
As I was running low on time I decided to try the triple whammy of removing the cryptsetup package, removing the dvb-usb and dvb-usb-cxusb drivers (by moving them out of /lib/modules and running depmod) and removing the plymouth-theme-ubuntu-text package to try to simplify plymouth. This did the trick and my system now boots fine.
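For anyone wanting to try the module part of that, here is a rough sketch of moving a driver out of the module tree so that it can be put back later. The function name stash_module and the backup directory are hypothetical, and it assumes the modules are plain .ko files somewhere under /lib/modules for the running kernel.

```shell
# Move every file named MODULE.ko found under TREE into BACKUP.
# Usage: stash_module TREE BACKUP MODULE
stash_module() {
    tree=$1
    backup=$2
    module=$3
    mkdir -p "$backup" || return 1
    # -name matches the whole file name, so "dvb-usb" only matches
    # dvb-usb.ko and not dvb-usb-cxusb.ko
    find "$tree" -name "$module.ko" -exec mv {} "$backup"/ \;
}

# Example, run as root (backup path is an assumption):
# stash_module "/lib/modules/$(uname -r)" /root/stashed-modules dvb-usb-cxusb
# stash_module "/lib/modules/$(uname -r)" /root/stashed-modules dvb-usb
# depmod -a    # rebuild the module dependency files afterwards
```

Keeping the files in a backup directory rather than deleting them makes it easy to restore one driver at a time and find out which change actually mattered.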
I still have the puzzle as to what is really causing the problem (and thus which of the changes matter), but I can leave that for another day. I thought it would be worthwhile sharing the ssh debug hack in case other people are also trying to debug upstart startup problems.