OpenSolaris snv130 hangs on booting in VMware ESXi 4.0

Discussion:

Gavin Maltby

2010-01-13 01:27:51 UTC

Hi,

[cc'd fm-discuss]

NOTICE: cmi_hdl_create: chipid 15 coreid 0 strandid 0 handle already allocated!
WARNING: There will be no MCA support on chip 15 core 0 strand 0 (cmi_hdl_create returned NULL)

If the above is the cause of the hang (quite likely, see below) you can
avoid it by adding

set cmi_no_init=1

to /etc/system. You lose nothing on a virtualized guest, anyway. You may
need to beadm mount the 130 BE from a working BE and edit ./etc/system
within the 130 BE.

This worked in Nevada snv_117.

Has ESXi changed at all since then - patches etc?

As each cpu starts up (virtual in this case, as presented by VMware) Solaris
reads the APIC id for the cpu and decomposes it into (chip, core, thread)
components. The error above tells us we had a collision between two
cpus when decomposed in this way. There have been some logic errors in the
past that do this, but it's more likely I think that ESXi is presenting
some inconsistent view of the hardware (*) such as a "number of cores per chip"
that is not a power-of-2. Those values, read via CPUID instructions
during startup for each cpu, are used in figuring out the number of
bits of APIC id that pertain to chip number, core number, thread within
core and getting those wrong leads to collisions.
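
To make the decomposition concrete, here is a minimal Python sketch. It is
not the Solaris code: the function names and the sample APIC ids are
hypothetical, and naive_decompose is only one assumed way that a
non-power-of-2 core count could be mishandled. It shows how splitting the
APIC id into bit fields works, and how two different APIC ids can end up
with the same (chip, core, strand) tuple.

```python
from collections import defaultdict

def field_width(count):
    """Bits needed to number `count` items, i.e. ceil(log2(count))."""
    return max(count - 1, 0).bit_length()

def decompose(apic_id, cores_per_chip, threads_per_core):
    """Split an APIC id into (chip, core, strand) using rounded-up widths."""
    thread_bits = field_width(threads_per_core)
    core_bits = field_width(cores_per_chip)
    strand = apic_id & ((1 << thread_bits) - 1)
    core = (apic_id >> thread_bits) & ((1 << core_bits) - 1)
    chip = apic_id >> (thread_bits + core_bits)
    return (chip, core, strand)

def naive_decompose(apic_id, cores_per_chip, threads_per_core):
    """Hypothetical fragile variant: uses (count - 1) as a mask and
    floor(log2(count)) as a shift. Only correct for powers of 2."""
    thread_shift = threads_per_core.bit_length() - 1
    core_shift = cores_per_chip.bit_length() - 1
    strand = apic_id & (threads_per_core - 1)
    core = (apic_id >> thread_shift) & (cores_per_chip - 1)
    chip = apic_id >> (thread_shift + core_shift)
    return (chip, core, strand)

def find_collisions(apic_ids, decode, cores_per_chip, threads_per_core):
    """Return {(chip, core, strand): [apic ids]} for tuples seen more than once."""
    seen = defaultdict(list)
    for apic_id in apic_ids:
        seen[decode(apic_id, cores_per_chip, threads_per_core)].append(apic_id)
    return {key: ids for key, ids in seen.items() if len(ids) > 1}

# Two chips reporting 3 cores each, 1 thread per core; ids are made up.
ids = [0, 1, 2, 4, 5, 6]
print(find_collisions(ids, decompose, 3, 1))
print(find_collisions(ids, naive_decompose, 3, 1))
```

With a power-of-2 core count the two decoders agree; with 3 cores per chip
the naive one maps APIC ids 0 and 1 to the same tuple, which is the kind of
duplicate that would produce a "handle already allocated" notice.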

Of course having detected the collision we should fail gracefully. The
code is trying but there's a longstanding and as yet undiagnosed (tried
and failed) bug here that can lead to a horrible panic or hang instead.

Assuming cmi_no_init=1 works for you it would be interesting to see
the output of 'kstat -m cpu_info' from the 130 system after boot.
On the 117 BE you could also grab that output for contrast, as
well as the output of "echo ::cmihdl | pfexec mdb -k" on 117.

(*) That is based on a past workaround added for VMware. But there has been
a recent change (10947:2ecbb0a4d189) that made changes to cmi_hdl_create
which could explain this. Could you include real/physical cpu details,
as well as the output of 'psrinfo -vp' from the Solaris guest under 117
and 130.

Thanks

Gavin

Maurice Volaski

2010-01-13 07:03:25 UTC

I still had to disable pcieb, though, as per the instructions in the
release notes (add "-B disable-pcieb=true" to the kernel$ GRUB
command line, then, after the installation, edit the permanent
menu.lst likewise).

Ah, I didn't know this.

Also Settings->Options->Advanced->CPU/MMU Virtualization is on the
third option - Use Intel VT/x etc., not EPT.

It was set to automatic, so I forced it to that setting.

When I try to boot from the snv130 iso at GenUnix in an ESXi 4.0
guest set to Solaris 64-bit, I get the grub screen, then it
NOTICE: cmi_hdl_create: chipid 15 coreid 0 strandid 0 handle
already allocated!
WARNING: There will be no MCA support on chip 15 core 0 strand 0
(cmi_hdl_create returned NULL)

If the above is the cause of the hang (quite likely, see below) you can
avoid it by adding
set cmi_no_init=1

(I'm booting from the DVD installer, so setting this involved passing
-kd to grub and then entering cmi_no_init/W1)

This does NOT appear to be the cause of the hang. It can boot if I
don't set it, but it may not boot if I do. It seems somewhat
random. I estimate I have a 50% chance of the DVD booting!

This worked in Nevada snv_117.

Oops, it was b114. Now I do see the cmi_hdl_create/no MCA support
message twice, but it never hangs.

Has ESXi changed at all since then - patches etc?

No, and it should be a moot point since I'm testing them side by side.

Assuming cmi_no_init=1 works for you it would be interesting to see

On the lucky occasion when I got it booted, I did have that set.

the output of 'kstat -m cpu_info' from the 130 system after boot.
On the 117 BE you could also grab that output for contrast, as

This is pretty much identical except for
current_clock_Hz/supported_frequencies_Hz, which is 2933250901 in
b114.

module: cpu_info instance: 0
name: cpu_info0 class: misc
brand Intel(r) Xeon(r) CPU X5570 @ 2.93GHz
chip_id 0
clock_MHz 2933
clog_id 0
core_id 0
cpu_type i386
crtime 47.379098502
current_clock_Hz 2933284363
current_cstate 1
family 6
fpu_type i387 compatible
implementation x86 (chipid 0x0 GenuineIntel 106A5 family 6 model 26 step 5 clock 2933 MHz)
model 26
ncore_per_chip 1
ncpu_per_chip 1
pkg_core_id 0
snaptime 151.877646117
socket_type Unknown
state on-line
state_begin 1263365686
stepping 5
supported_frequencies_Hz 2933284363
supported_max_cstates 1
vendor_id GenuineIntel
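
For anyone wanting to compare the 'kstat -m cpu_info' output of two builds
without eyeballing it, a rough Python sketch follows. It assumes a single
cpu_info instance per capture and one "statistic value" pair per line, as
kstat prints them; the helper names and the default ignore list are my own
and not part of any Solaris tool.

```python
def parse_kstat(text):
    """Parse one instance of `kstat -m cpu_info` output into a dict.

    Assumes each statistic sits on a single line as "name value";
    the "module:" and "name:" header lines are skipped.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or parts[0] in ("module:", "name:"):
            continue
        stats[parts[0]] = parts[1].strip()
    return stats

def diff_stats(old, new, ignore=("crtime", "snaptime", "state_begin")):
    """Return {name: (old value, new value)} for statistics that differ.

    Timestamps listed in `ignore` change on every boot, so they are dropped.
    """
    names = (set(old) | set(new)) - set(ignore)
    return {n: (old.get(n), new.get(n))
            for n in sorted(names) if old.get(n) != new.get(n)}

# Shortened sample captures; values taken from this thread.
b114 = """module: cpu_info instance: 0
name: cpu_info0 class: misc
chip_id 0
current_clock_Hz 2933250901
snaptime 100.0
"""
b130 = """module: cpu_info instance: 0
name: cpu_info0 class: misc
chip_id 0
current_clock_Hz 2933284363
snaptime 151.877646117
"""
print(diff_stats(parse_kstat(b114), parse_kstat(b130)))
```

On a real system the two strings would come from files saved on each BE,
for example by redirecting the kstat output to a file and reading it back.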

well as the output of "echo ::cmihdl | pfexec mdb -k" on 117.

ffffff03da8b0c48 2 0 0/0/0 S N cpu.generic -
ffffff03e244e308 2 2 0/1/0 S N cpu.generic -

which could explain this. Could you include real/physical cpu details,

There are two X5570s and I've allocated 4 vCPUs to b114 and b130.

as well as the output of 'psrinfo -vp' from the Solaris guest under 117
and 130.

They are the same for both.
The physical processor has 2 cores and 4 virtual processors (0-3)
The core has 2 virtual processors (0 1)
The core has 2 virtual processors (2 3)
x86 (GenuineIntel 106A5 family 6 model 26 step 5 clock 2933 MHz)

I do not get that message at all. I am using ESX4i Update 1 (build 208167

Perhaps I shall try this.

--
Maurice Volaski, maurice.volaski-z7/***@public.gmane.org
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University

Maurice Volaski

2010-01-15 07:44:35 UTC

NOTICE: cmi_hdl_create: chipid 15 coreid 0 strandid 0
handle already allocated!

I do not get that message at all. I am using ESX4i Update 1 (build
208167) on two Xeon W3520s with virtualization enabled.

Indeed, the update to ESXi 4 (build 219382) eliminates this message.

However, it still hangs, so it looks like this is a new bug in
OpenSolaris; presumably something environmental about my setup is
triggering it.

--
Maurice Volaski, maurice.volaski-z7/***@public.gmane.org
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University