Archive for July, 2010

debugging startup problems on Ubuntu

Thursday, July 29th, 2010

I recently upgraded my home server from Ubuntu Karmic to Lucid. It did not go well.

The actual apt-get dist-upgrade went fine, with only minor problems that were easy enough to fix. The trouble came when I rebooted. The boot started fine, but then got stuck at the purple boot page, which showed “Ubuntu 10.04” and 5 dots that cycled between white and red. It never got past that point.

The usual thing to do at this point is to reboot in single-user mode and start debugging startup scripts. Unfortunately I found that single-user mode in Ubuntu Lucid was not useful, as it doesn’t start a shell until after a huge pile of other things have started. In my case a ‘single’ boot got stuck at the same point. Removing the quiet and splash options and adding nomodeset didn’t help either.

I found that if I booted an older kernel (2.6.31-19) then the system came up OK. That pointed to a likely driver issue. I could have just settled for that older kernel, but part of the reason for going to Lucid was to get a newer ALSA with better support for HDMI audio, so I didn’t really want to stick to an older kernel. I also wanted to know why the problem was happening.

I was also able to get a shell using the latest kernel by using the init=/bin/bash trick, but that doesn’t help to actually debug the problem. To debug startup problems you need to be able to watch the startup process in action, to see what is waiting. This is much harder these days with the new upstart init system now used in Ubuntu, as startup is much more parallel than it used to be. Adding some echo lines to init scripts used to be a useful technique, but it is much harder to get anything sensible out of that when using upstart.

To try to debug the problem I initially had a look for any startup debugging options. I found some promising options in /etc/default/rcS, and tried setting VERBOSE=yes and SULOGIN=yes. I found that the VERBOSE=yes option was somewhat useful, as it gave me some information on what jobs were started/waiting, but it didn’t really allow me to pin down the problem. The parallelism in upstart again made interpreting the output hard. When it says that a job is waiting it doesn’t say what it is waiting on, so you have no idea what the underlying problem really is.

Despite the promising name, and the nice description in the rcS(5) manpage, the SULOGIN=yes option didn’t seem to do anything at all. A grep for SULOGIN in the startup scripts didn’t find any hits, so I suspect it isn’t actually implemented.
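A check like that grep can be wrapped in a small helper. This is only a sketch: the function name find_var_users is made up for illustration, and the directories in the usage line are an assumption about where Lucid keeps its startup scripts, not a list taken from the post.

```shell
# List the files under the given directories that mention a variable name.
# The name find_var_users is hypothetical, chosen for this example.
find_var_users() {
    var=$1
    shift
    # -r: recurse, -l: print file names only, -w: match whole words only
    grep -rlw -- "$var" "$@" 2>/dev/null
}

# Example (directories are an assumption):
#   find_var_users SULOGIN /etc/init /etc/init.d /lib/init
```

The function prints nothing and returns non-zero when no file mentions the variable, which is the "no hits" result described above.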

As usual, the real key to solving the problem was a hack. I added the following to /etc/default/rcS:

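# Debug hack: bring up the network and sshd in the background during boot
(
/bin/sleep 10                        # wait 10 seconds before doing anything
/sbin/ifconfig eth0 192.168.2.10 up  # give eth0 a static address
/usr/sbin/sshd                       # start sshd so I can log in remotely
) > /dev/null 2>&1 &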

The idea behind this hack was to allow me to login with ssh from my laptop during the startup process and watch what was going on. This worked really well and meant that I was finally able to debug the startup process with the most recent Lucid kernel.
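On the laptop side, it can be handy to keep retrying the connection until sshd comes up. The following is a minimal sketch; the retry function is a made-up helper, and the attempt count and timeout in the usage line are arbitrary choices.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# The name retry is hypothetical; it pauses one second between attempts.
retry() {
    attempts=$1
    shift
    n=1
    while :; do
        "$@" && return 0
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep 1
    done
}

# Example: keep trying for a while after the server is rebooted
#   retry 120 ssh -o ConnectTimeout=2 192.168.2.10
```

Note that a login session that ends with a non-zero exit status would count as a failure here and trigger another attempt.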

I rebooted again, logged into the system with ssh from my laptop, and started poking around with ps and initctl to see what was going on. I had assumed that “initctl list” would give me the information I needed. It does show which jobs are waiting, but as with the VERBOSE=yes messages it doesn’t tell you what they are waiting on.
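One way to narrow down the “initctl list” output is to hide the jobs that are in a resting state. The sketch below assumes the usual output format of a job name, an optional instance in parentheses, then goal/state, and it treats start/running and stop/waiting as the two resting states. The function name stuck_jobs is made up. It still cannot say what a job is waiting on.

```shell
# Read "initctl list" style lines on stdin and print only the jobs whose
# goal/state is neither start/running nor stop/waiting.
# The name stuck_jobs is hypothetical.
stuck_jobs() {
    awk '{
        # the goal/state field is the first field that starts with start/ or stop/
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^(start|stop)\//) {
                state = $i
                sub(/,$/, "", state)   # drop the comma before ", process PID"
                if (state != "start/running" && state != "stop/waiting")
                    print $0
                break
            }
        }
    }'
}

# Example, run as root on the machine being debugged:
#   initctl list | stuck_jobs
```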

Poking around some more I saw 3 things that were suspicious:

1) cryptdisks-enable was shown as “waiting”. I don’t have any encrypted disks on this system, so why should it be waiting?

2) dmesg showed a segfault in plymouth, which is the process that asks for user input during startup (it also does splash screens). This could be linked to why cryptdisks was waiting, as it’s possible that cryptdisks wanted a passphrase (though for what disk, given that I don’t have any encrypted disks?).

3) dmesg also showed a lot of warnings from the dvb-usb-cxusb driver.

As I was running low on time I decided to try the triple whammy of removing the cryptsetup package, removing the dvb-usb and dvb-usb-cxusb drivers (by moving them out of /lib/modules and running depmod) and removing the plymouth-theme-ubuntu-text package to try to simplify plymouth. This did the trick and my system now boots fine.
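The module part of that can be sketched as a helper that moves named modules out of a kernel's module tree so they can be put back later. The function name park_modules and the destination directory are made up for illustration, and it assumes the modules are plain .ko files.

```shell
# Move the named kernel modules out of a module tree into a holding directory.
# The name park_modules is hypothetical.
park_modules() {
    moddir=$1
    dest=$2
    shift 2
    mkdir -p "$dest" || return 1
    for name in "$@"; do
        # find the module file wherever it sits in the tree and move it aside
        find "$moddir" -type f -name "$name.ko" -exec mv {} "$dest"/ \;
    done
}

# Example, as root (the destination path is an assumption), followed by
# depmod to rebuild the module dependency list:
#   park_modules "/lib/modules/$(uname -r)" /root/parked-modules dvb-usb dvb-usb-cxusb
#   depmod -a
```

Moving the files rather than deleting them makes it easy to restore one module at a time and find out which change actually mattered.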

I still don’t know what was really causing the problem (and thus which of the changes mattered), but I can leave that for another day. I thought it would be worthwhile sharing the ssh debug hack in case other people are also trying to debug upstart startup problems.

Going solar

Tuesday, July 20th, 2010

I’ve just signed up for a 29.7kW grid-connected solar system to be installed in August. We have a large north-facing roof with almost no shade, and it really seemed like a shame to be wasting all that potential!

The system will consist of 132 Sunpower 225W panels and 6 Xantrex GT5 inverters. It is being installed on a colorbond roof, using a landscape layout of the panels.
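The system size follows directly from the panel count. As a quick check of the arithmetic (the even split across inverters is an assumption, since the post doesn't say how the panels are divided):

```shell
panels=132
watts_each=225
inverters=6

# total rated output of the array in watts: 132 * 225 = 29700 W = 29.7kW
total_w=$((panels * watts_each))
echo "array: ${total_w} W"

# assuming an even split: 22 panels and 4950 W per inverter
echo "per inverter: $((panels / inverters)) panels, $((total_w / inverters)) W"
```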

I got quotes from a lot of installers, and found a big variation in the level of knowledge that the various installers showed when they quoted for the system. I ended up getting quotes from 2 good Canberra-based installers (Armada and Enviro-Friendly) and one national installer (Clear Solar). The Clear Solar quote was the one I went with because of the really good mounting system they were able to show me. The rails are mounted vertically up the roof face, and the mounting brackets are shaped to the contours of the colorbond roofing, giving a strong mounting system that allows for airflow on the back of the panels, which should keep them a bit cooler (the efficiency of PV panels drops quite quickly if they get too hot). It should also stop leaves and other rubbish from building up behind the rails, which would tend to happen if the rails were mounted horizontally.

I’m getting a data logger and Paul Wilson from Clear has sent me the manual for the protocol to talk to the inverters, so I should be able to write some python code pretty quickly to analyze the performance of the system and watch for any degradation.

It will be fun to be producing more electricity than I use!