I recently upgraded my home server from Ubuntu Karmic to Lucid. It did not go well.
The actual apt-get dist-upgrade went fine, with only minor problems that were easy enough to fix. The problem came when I rebooted. The boot started fine, but then got stuck at the purple boot page, which showed “Ubuntu 10.04” and 5 dots which cycled between white and red. It never got past that point.
The usual thing to do at this point is to reboot in single user mode and start debugging startup scripts. Unfortunately I found that single user mode with Ubuntu Lucid was not useful as it doesn’t start a shell until after a huge pile of other things are started. In my case a ‘single’ boot got stuck at the same point. Getting rid of the quiet and splash options, and adding nomodeset also didn’t help.
I found that if I booted an older kernel (2.6.31-19) then the system came up OK. That pointed to a likely driver issue. I could have just settled for that older kernel, but part of the reason for going to Lucid was to get a newer ALSA with better support for HDMI audio, so I didn’t really want to stick to an older kernel. I also wanted to know why the problem was happening.
I was also able to get a shell using the latest kernel by using the init=/bin/bash trick, but that doesn’t help to actually debug the problem. To debug startup problems you need to be able to watch the startup process in action, to see what is waiting. This is much harder these days with the new upstart init system now used in Ubuntu, as startup is much more parallel than it used to be. Adding some echo lines to init scripts used to be a useful technique, but it is much harder to get anything sensible out of that when using upstart.
To try to debug the problem I initially had a look for any startup debugging options. I found some promising options in /etc/default/rcS, and tried setting VERBOSE=yes and SULOGIN=yes. I found that the VERBOSE=yes option was somewhat useful, as it gave me some information on what jobs were started/waiting, but it didn’t really allow me to pin down the problem. The parallelism in upstart again made interpreting the output hard. When it says that a job is waiting it doesn’t say what it is waiting on, so you have no idea what the underlying problem really is.
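For reference, this is roughly how the two settings look in /etc/default/rcS. The file is a plain shell fragment of variable assignments; the other settings in it are left out here.

```shell
# Fragment of /etc/default/rcS: only the two settings discussed above.
VERBOSE=yes   # print more detail about jobs as they start or wait
SULOGIN=yes   # documented in rcS(5)
```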
Despite the promising name, and the nice description in the rcS(5) manpage, the SULOGIN=yes option didn’t seem to do anything at all. A grep for SULOGIN in the startup scripts didn’t find any hits, so I suspect it isn’t actually implemented.
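A check like that grep can be sketched as a small helper. The function name and the directories in the usage comment are assumptions for illustration, since the post doesn't say exactly which directories were searched.

```shell
# List files under the given directories that mention a setting name.
# Usage: find_setting_users NAME DIR...
# (find_setting_users is a hypothetical helper name, not a standard tool.)
find_setting_users() {
    name=$1
    shift
    # -r: recurse, -l: print file names only, -w: match whole words
    grep -rlw -- "$name" "$@" 2>/dev/null | sort
}

# Example (the directory list is an assumption):
#   find_setting_users SULOGIN /etc/init /etc/init.d /lib/init
```

An empty result means nothing in those directories refers to the setting.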
As usual, the real key to solving the problem was a hack. I added the following to /etc/default/rcS:
# Debug hack: after a short delay, bring up eth0 and start sshd in the background
(
/bin/sleep 10
/sbin/ifconfig eth0 192.168.2.10 up
/usr/sbin/sshd
) > /dev/null 2>&1 &
The idea behind this hack was to allow me to login with ssh from my laptop during the startup process and watch what was going on. This worked really well and meant that I was finally able to debug the startup process with the most recent Lucid kernel.
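One drawback of sending everything to /dev/null is that a failure of ifconfig or sshd is invisible. The following is a sketch of a variant that keeps a log instead; it is not what was actually run here. The function name, the DEBUG_LOG, DEBUG_DELAY, IFCONFIG and SSHD variables, and the default log path are all assumptions, and the log has to live somewhere that is writable this early in boot.

```shell
# Variant of the hack that records whether each step worked.
# All names below (start_debug_sshd, DEBUG_LOG, DEBUG_DELAY, IFCONFIG, SSHD)
# are illustrative; the interface and address are the ones from the hack above.
start_debug_sshd() {
    log=${DEBUG_LOG:-/dev/.debug-sshd.log}   # assumed to be writable early in boot
    (
        sleep "${DEBUG_DELAY:-10}"
        if "${IFCONFIG:-/sbin/ifconfig}" eth0 192.168.2.10 up; then
            echo "ifconfig: ok"
        else
            echo "ifconfig: failed with status $?"
        fi
        if "${SSHD:-/usr/sbin/sshd}"; then
            echo "sshd: started"
        else
            echo "sshd: failed with status $?"
        fi
    ) >"$log" 2>&1 &
}

# In /etc/default/rcS one would then simply call:
#   start_debug_sshd
```

Any error text the commands print ends up in the same log, which can be read later from a shell started with init=/bin/bash.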
I rebooted again, logged into the system with ssh from my laptop, and started poking around with ps and initctl to see what was going on. I had assumed that “initctl list” would give me the information I needed. It does show what jobs are waiting, but as with the VERBOSE=yes messages it doesn’t tell you what it is waiting on.
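To cut the listing down to jobs in one particular state, a filter along these lines can help. It is a sketch that assumes each line of “initctl list” looks like “job goal/state” optionally followed by “, process PID”, with an optional instance name in parentheses after the job name; jobs_in_state is a made-up helper name.

```shell
# Print the jobs from `initctl list`-style input that are in a given state.
# Usage: initctl list | jobs_in_state waiting
jobs_in_state() {
    awk -v want="$1" '{
        for (i = 2; i <= NF; i++) {
            # the goal/state field looks like "start/running," or "stop/waiting"
            if ($i ~ /^(start|stop)\/[a-z-]+,?$/) {
                status = $i
                sub(/,$/, "", status)
                split(status, gs, "/")
                if (gs[2] == want) {
                    # everything before the goal/state field is the job name
                    name = $1
                    for (j = 2; j < i; j++) name = name " " $j
                    print name " (" status ")"
                }
                break
            }
        }
    }'
}
```

This only narrows the list; it still cannot say what a job is waiting on.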
Poking around some more I saw 3 things that were suspicious:
1) cryptdisks-enable was shown as “waiting”. I don’t have any encrypted disks on this system, so why should it be waiting?
2) dmesg showed a segfault in plymouth, which is the process that asks for user input during startup (it also does splash screens). This could be linked to why cryptdisks was waiting, as it’s possible that cryptdisks wanted a passphrase (for what disk though? I don’t have any encrypted disks)
3) dmesg also showed a lot of warnings from the dvb-usb-cxusb driver
As I was running low on time I decided to try the triple whammy of removing the cryptsetup package, removing the dvb-usb and dvb-usb-cxusb drivers (by moving them out of /lib/modules and running depmod) and removing the plymouth-theme-ubuntu-text package to try to simplify plymouth. This did the trick and my system now boots fine.
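The module part of that can be sketched as below. The helper name and the stash directory are assumptions, and the helper locates the .ko files with find rather than hard-coding where they sit under /lib/modules.

```shell
# Move the named kernel modules out of a module tree so they can't be loaded.
# Usage: stash_modules MODULE_ROOT STASH_DIR MODULE...
# (stash_modules is a hypothetical helper; STASH_DIR must be outside MODULE_ROOT.)
stash_modules() {
    root=$1
    stash=$2
    shift 2
    mkdir -p "$stash" || return 1
    for mod in "$@"; do
        # -name matches the exact file name, so dvb-usb does not catch dvb-usb-cxusb
        find "$root" -type f -name "$mod.ko" -exec mv {} "$stash"/ \;
    done
}

# Example, as root (the stash directory is an assumption):
#   stash_modules "/lib/modules/$(uname -r)" /root/disabled-modules dvb-usb dvb-usb-cxusb
#   depmod -a
#   apt-get remove cryptsetup plymouth-theme-ubuntu-text
```

Moving the files rather than deleting them makes it easy to put a module back once the real culprit is known.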
I still have the puzzle as to what is really causing the problem (and thus which of the changes matter), but I can leave that for another day. I thought it would be worthwhile sharing the ssh debug hack in case other people are also trying to debug upstart startup problems.
Thanks for the post Tridge. Welcome to Ubuntu…
-c
…I probably should add something a little more substantial and less sarcastic.
I think part of the problem with this sort of thing is that Ubuntu abstracts away the back end to “make things easier” but in reality this adds additional complexity, especially when things go wrong. There’s also the issue of trying to make everything “just work” which means installing everything on there that any user or system might possibly want. Recall the Wacom tablet service which would be installed and started on every Ubuntu system whether it has one or not. Seems like a waste of time and resources to me.
I’d love to see a Linux distro which adapts itself to your very machine over time. For example – no printing enabled by default yet when you plug in a printer it automatically installs _required_ software and configures it. Blam. (Fedora actually already does this, but it’s something that I’d love to see expanded to other devices.)
Plug in a scanner, web cam, tv tuner, etc and you get a similar dialog box. Of course during installation if you already have these things it will automatically configure them.
Aside from hardware, it could do things like recognise your most used applications and do things like pre-cache them, offer suggestions on how to improve them, perform maintenance on them, etc.
This way we can have a slim, lightweight, custom operating system that grows with you and your hardware over time.
-c
P.S. Not that I can offer much useful, but cryptdisk looks like the culprit to me – removing that package is probably the only one which would have re-built your initramfs, after which point your problem went away. In addition, removing splash from the kernel line probably would have disabled Plymouth anyway. But then, you’re the guru!
-c
Hi Chris,
I also thought it would be cryptdisk, although I have now changed my mind and think it’s the dvb-usb-cxusb driver. If I try to load that manually then modprobe hangs and the system becomes very slow. Once I can get a terminal to respond, an rmmod on the module does recover things, but I still think this may be sufficient for the module loading at startup to cause the startup problems I saw.
I’ll debug the module when I have some time.
Cheers, Tridge
I’m in exactly the same situation: the boot does not finish and I can’t get a prompt.
I followed your procedure changing /etc/default/rcS.
If I use the init=/bin/bash and then “. /etc/default/rcS”, I can log in using ssh. When I use the “normal” boot process, I can’t log in. I receive a message from putty that says “Network error: Connection refused”, which suggests the sshd server isn’t up, but if I “ping” the IP I put in the “ifconfig” line, I receive an answer. For me, this means that the code in rcS was executed, but sshd failed to start, and now I don’t know why and I can’t think of another way to analyze the problem.
I also tried something I saw on another site, pressing “m”, but all I received was a message: “Spawning maintenance shell” and nothing more (for some minutes now).
Update: I couldn’t find a way to analyze the problem, but I have found a solution for my problem at http://handypenguin.blogspot.com/2010/04/upgrading-to-lucid-with-virtualbox-usb.html