Gavin Maltby
2010-01-13 01:27:51 UTC
Hi,
[cc'd fm-discuss]
NOTICE: cmi_hdl_create: chipid 15 coreid 0 strandid 0 handle already allocated!
WARNING: There will be no MCA support on chip 15 core 0 strand 0 (cmi_hdl_create returned NULL)
If the above is the cause of the hang (quite likely, see below) you can
avoid it by adding
set cmi_no_init=1
to /etc/system. You lose nothing on a virtualized guest, anyway. You may
need to beadm mount the 130 BE from a working BE and edit ./etc/system
within the 130 BE.
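For illustration only, here is a minimal Python sketch of making that edit idempotently once the 130 BE is mounted. The helper name ensure_setting and the mount point /mnt in the usage note are assumptions made for the example, not anything from the thread; editing the file by hand works just as well.

```python
from pathlib import Path

# The tunable suggested above; everything else here is a hypothetical helper.
SETTING = "set cmi_no_init=1"


def ensure_setting(system_file, setting=SETTING):
    """Append `setting` to the file unless an identical line is already there.

    Returns True if the file was modified, False if nothing needed doing.
    """
    path = Path(system_file)
    text = path.read_text() if path.exists() else ""
    # Compare whole lines so a commented-out copy does not count as present.
    if any(line.strip() == setting for line in text.splitlines()):
        return False
    # Make sure the existing content ends with a newline before appending.
    if text and not text.endswith("\n"):
        text += "\n"
    path.write_text(text + setting + "\n")
    return True
```

Assuming the 130 BE were mounted at /mnt, the call would be ensure_setting("/mnt/etc/system"). Running it a second time leaves the file untouched.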
This worked in Nevada snv_117.
Has ESXi changed at all since then - patches etc?
As each cpu starts up (virtual in this case, as presented by VMware) Solaris
reads the APIC id for the cpu and decomposes it into (chip, core, thread)
components. The error above tells us we had a collision between two
cpus when decomposed in this way. There have been some logic errors in the
past that do this, but it's more likely I think that ESXi is presenting
some inconsistent view of the hardware (*) such as a "number of cores per chip"
that is not a power-of-2. Those values, read via CPUID instructions
during startup for each cpu, are used in figuring out the number of
bits of APIC id that pertain to chip number, core number, and thread within
core; getting those wrong leads to collisions.
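To make the failure mode concrete, here is a toy Python model of that decomposition. It is a sketch under assumptions, not the Solaris code: the function names, the chip_shift parameter and the bit widths are all invented for illustration. It shows that when the core and thread widths do not add up to the shift used for the chip number, some APIC id bits are ignored and two distinct ids decompose to the same (chip, core, thread).

```python
def decompose(apic_id, chip_shift, core_bits, thread_bits):
    """Split an APIC id into (chip, core, thread) using fixed bit widths.

    Hypothetical layout: thread in the lowest bits, core above it, and
    the chip number taken from everything at or above chip_shift.
    """
    thread = apic_id & ((1 << thread_bits) - 1)
    core = (apic_id >> thread_bits) & ((1 << core_bits) - 1)
    chip = apic_id >> chip_shift
    return (chip, core, thread)


def find_collisions(apic_ids, chip_shift, core_bits, thread_bits):
    """Return (first_id, second_id, ids) for every repeated decomposition."""
    seen = {}
    collisions = []
    for apic_id in apic_ids:
        key = decompose(apic_id, chip_shift, core_bits, thread_bits)
        if key in seen:
            collisions.append((seen[key], apic_id, key))
        else:
            seen[key] = apic_id
    return collisions


# Consistent widths: 2 core bits, 0 thread bits, chip number above them.
consistent = find_collisions(range(8), chip_shift=2, core_bits=2, thread_bits=0)

# Inconsistent widths: the chip shift still skips 2 low bits, but only 1 bit
# is treated as core number, so bit 1 of the APIC id is silently dropped.
inconsistent = find_collisions(range(8), chip_shift=2, core_bits=1, thread_bits=0)
```

With the consistent widths no two ids collide. With the inconsistent ones, APIC ids 0 and 2 both come out as chip 0, core 0, thread 0, which is the kind of duplicate the "handle already allocated" notice is complaining about.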
Of course having detected the collision we should fail gracefully. The
code is trying but there's a longstanding and as yet undiagnosed (tried
and failed) bug here that can lead to a horrible panic or hang instead.
Assuming cmi_no_init=1 works for you it would be interesting to see
the output of 'kstat -m cpu_info' from the 130 system after boot.
On the 117 BE you could also grab that output for contrast, as
well as the output of "echo ::cmihdl | pfexec mdb -k" on 117.
(*) That is based on a past workaround added for VMware. But there has been
a recent change (10947:2ecbb0a4d189) that modified cmi_hdl_create
which could explain this. Could you include real/physical cpu details,
as well as the output of 'psrinfo -vp' from the Solaris guest under 117
and 130?
Thanks
Gavin