From: Laszlo Ersek <lersek@redhat.com>
To: "Dong, Eric" <eric.dong@intel.com>,
	"edk2-devel@lists.01.org" <edk2-devel@lists.01.org>
Cc: "Ni, Ruiyu" <ruiyu.ni@intel.com>
Subject: Re: [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.
Date: Thu, 19 Jul 2018 19:01:17 +0200
Message-ID: <3ec340cf-3bf1-ad22-3b7b-aa1b2c1fcaa8@redhat.com>
In-Reply-To: <ED077930C258884BBCB450DB737E66224AC53B66@shsmsx102.ccr.corp.intel.com>

Hi Eric,

apologies for the delay.

On 07/18/18 14:59, Dong, Eric wrote:
> Hi Laszlo,
>
> I finally succeeded in setting up an OVMF platform that can verify the
> boot failure issue.  But on my platform, if I use an image built with
> the command below (I assume it is used to enable SMM), the system can't
> boot to the OS (host OS is Fedora 25 and guest OS is Ubuntu 18.04). It
> hangs in the OS boot phase after the ExitBootServices point (I can see
> the console log that should be printed at the ExitBootServices point,
> so I think the hang occurs after this point).
> 	build -a IA32 -a X64 -p OvmfPkg/OvmfPkgIa32X64.dsc -t VS2015x86 -b NOOPT -D SMM_REQUIRE -D SECURE_BOOT_ENABLE -D TLS_ENABLE
>
> If I use the command below to build the image, the system can boot to
> the OS.
> 	build -a IA32 -a X64 -p OvmfPkg\OvmfPkgIa32X64.dsc -t VS2015x86 -b NOOPT
>
> Does my OVMF environment still have a problem?
>
>
> When I did the above test, I did not include my two patches.

Yes, I think this host environment is still problematic. Namely, the
latest QEMU version shipped in Fedora 25 is QEMU-2.7:

  https://koji.fedoraproject.org/koji/buildinfo?buildID=918114

and QEMU-2.7 does not have a feature that is important for SMM
stability. This feature is called "SMI broadcast".

In OVMF, the "OvmfPkg/SmmControl2Dxe" runtime driver implements
EFI_SMM_CONTROL2_PROTOCOL (which is a runtime protocol). The Trigger()
member function raises an SMI, by writing to IO port 0xB2
(ICH9_APM_CNT).

Originally, QEMU would raise the SMI synchronously only on the sole VCPU
that called Trigger(). Then, the edk2 SMM driver stack would have to
pull the other processors explicitly into SMM (via APIC accesses, if I
remember correctly). This was extremely slow (the processor first
raising the SMI would wait for a long time for the other processors to
show up in SMM, before it would decide to pull them in with APIC
writes). Also, when we switched the edk2 SMM sync mode to "relaxed", the
results remained very unstable. We decided that edk2 supported the
"traditional" SMM sync mode much better, and so we implemented "SMI
broadcast" in QEMU, to satisfy that sync mode.

(My memories are a bit fuzzy at this point; you can read more in the
following RH Bugzilla entries:

  https://bugzilla.redhat.com/show_bug.cgi?id=1412327 [QEMU]
  https://bugzilla.redhat.com/show_bug.cgi?id=1412313 [OVMF])

The idea of "SMI broadcast" is that, regardless of which VCPU triggers
the SMI, QEMU raises the SMI immediately on all VCPUs. This made a
*huge* difference for the performance and the stability of the edk2 SMM
driver stack, used in OVMF and on QEMU/KVM.

Now, in order to be able to use old OVMF on new QEMU and vice versa,
this feature is runtime-negotiated between "OvmfPkg/SmmControl2Dxe" and
QEMU. (The feature is not enabled by default, and without "SMI
broadcast", the "relaxed" sync method is slightly less broken than the
"traditional" method, so OVMF defaults to "relaxed". With the feature
enabled, the "traditional" mode is better -- that config is the absolute
best of all four possible combinations.)
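
As a rough summary of those four combinations, here is a minimal sketch
that only encodes the preference described above. The function name and
the string values are made up for illustration; they are not edk2 or
QEMU identifiers.

```python
def preferred_sync_mode(smi_broadcast: bool) -> str:
    """Return the SMM sync mode that behaves better, per the text above.

    Hypothetical helper: it models the recommendation only, not how
    OVMF actually selects the mode.
    """
    # With SMI broadcast, every VCPU enters SMM at once, which is what
    # the "traditional" sync mode expects.
    # Without it, "relaxed" is the slightly less broken choice.
    return "traditional" if smi_broadcast else "relaxed"


if __name__ == "__main__":
    for broadcast in (False, True):
        mode = preferred_sync_mode(broadcast)
        print(f"SMI broadcast={broadcast}: prefer {mode!r}")
```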

More precisely, on the QEMU side, the feature is not tied to a QEMU
release, but to Q35 *machine type versions*. Therefore, in order to
benefit from the feature, you need all of the following:

- a recent enough OVMF,
- a recent enough QEMU release,
- a recent enough Q35 machine type, specified on the QEMU command line.

The particular minimum machine type is "pc-q35-2.9" (which is clearly
only provided by QEMU-2.9 and later). The machine type requirement is
automatically satisfied if you use QEMU-2.9+, and just request the "q35"
machine type. (Without an explicit machtype version number, the highest
one supported by the QEMU release will be picked.)
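
The QEMU-side requirement can be sketched as a small check. This is an
illustration only: the helper name is hypothetical, and it assumes that
the unversioned "q35" alias resolves to the machine type whose version
equals the QEMU release version. Downstream machine type names (for
example distro-specific ones) are not handled.

```python
import re

# Minimum Q35 machine type version for SMI broadcast ("pc-q35-2.9").
MIN_Q35_VERSION = (2, 9)


def q35_supports_smi_broadcast(machine_type, qemu_version):
    """Hypothetical helper: can this machine type / QEMU release pair
    offer SMI broadcast?

    machine_type: "q35" or a versioned name such as "pc-q35-2.8"
    qemu_version: (major, minor) of the QEMU release, e.g. (2, 9)
    """
    if machine_type == "q35":
        # Assumption: the alias picks the newest Q35 type of this release.
        effective = qemu_version
    else:
        match = re.fullmatch(r"pc-q35-(\d+)\.(\d+)", machine_type)
        if match is None:
            return False  # not an upstream versioned Q35 machine type
        effective = (int(match.group(1)), int(match.group(2)))
        if effective > qemu_version:
            return False  # a release cannot provide a newer machine type
    return effective >= MIN_Q35_VERSION


if __name__ == "__main__":
    print(q35_supports_smi_broadcast("q35", (2, 7)))         # Fedora 25 case
    print(q35_supports_smi_broadcast("q35", (2, 9)))         # Fedora 26 case
    print(q35_supports_smi_broadcast("pc-q35-2.8", (2, 9)))  # pinned old type
```

Note that this covers only the QEMU half; a recent enough OVMF is still
needed for the negotiation to happen.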

The lack of this feature in your environment is confirmed by your OVMF
log:

> NegotiateSmiFeatures: SMI feature negotiation unavailable

If the feature is available, you will see the following two messages
instead:

  NegotiateSmiFeatures: using SMI broadcast
  [...]
  AppendFwCfgBootScript: SMI feature negotiation boot script saved

(The second message only appears if you have S3 enabled -- at S3 resume,
the feature has to be re-enabled, so SmmControl2Dxe saves a boot script
fragment for that.)
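
If you want to check a captured OVMF debug log for this automatically,
something like the following should do. It is just a sketch that looks
for the two NegotiateSmiFeatures messages quoted above; the function
name and the return values are my own invention.

```python
def smi_broadcast_status(log_text):
    """Classify an OVMF debug log by its NegotiateSmiFeatures message.

    Returns "enabled", "unavailable", or "unknown" (neither message
    found, e.g. a build without SMM or a truncated log).
    """
    if "NegotiateSmiFeatures: using SMI broadcast" in log_text:
        return "enabled"
    if "NegotiateSmiFeatures: SMI feature negotiation unavailable" in log_text:
        return "unavailable"
    return "unknown"


if __name__ == "__main__":
    sample = "NegotiateSmiFeatures: SMI feature negotiation unavailable\n"
    print(smi_broadcast_status(sample))
```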

Therefore, please upgrade the host to Fedora 26. In Fedora 26, QEMU 2.9
is shipped:

  https://koji.fedoraproject.org/koji/buildinfo?buildID=986762

... It's even better if you can upgrade to Fedora 27, as Fedora 27 is
the oldest Fedora release still supported at this point. The following
article describes the recommended upgrade method:

  https://fedoraproject.org/wiki/DNF_system_upgrade

> Then I included my patches and built the image with SMM enabled, and I
> found that I can't reproduce the issue you encountered. I can find the
> "MpInitChangeApLoopCallback done!" message in the console log. The
> console log is attached.

Yes, I can see "MpInitChangeApLoopCallback() done" in the log.

> Can you help verify the OVMF image built on my side?

Your firmware image (SHA1: a11169ef30ab4d0182dbe2c3fc072b0b2e98c06a)
reproduces, on my end, the same issue that I reported. Out of 10
consecutive attempts, it booted the OS successfully only 3 times
(attempts #1, #8 and #10). In the failed cases, the log always ends like
this:

  MpInitChangeApLoopCallback :: Processor 8, Enabled Processor 8!
  RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 3 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 4 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 5 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 6 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 1 Enter... MwaitSupport = 0!
  <HANG>

That is, one of the APs fails to show up. Which AP is missing varies
from run to run; for example, another failure:

  MpInitChangeApLoopCallback :: Processor 8, Enabled Processor 8!
  RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 7 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 4 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 6 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 3 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 5 Enter... MwaitSupport = 0!
  <HANG>

My test laptop has 1 socket, 4 cores, and 2 threads per core.
This is the same VCPU configuration that I use for the guest (hence the
1 BSP + 7 AP config seen above). I got the idea that perhaps the host
was slightly over-subscribed (= more VCPU work than the physical
processors can serve in "near real time"), and so I changed the guest
config to 1 socket, 2 cores, and 2 threads (= 1 BSP + 3 APs).
Unfortunately, the issue reproduced in this config as well, at the 4th
try:

  MpInitChangeApLoopCallback :: Processor 4, Enabled Processor 4!
  RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 1 Enter... MwaitSupport = 0!
  <HANG>

Just to be sure, I tested a fresh build (without the patches); that
booted the OS fine (10 out of 10).
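
For picking the missing AP out of failing logs like the ones above, here
is a small sketch. It assumes that the number printed by
MpInitChangeApLoopCallback is the total processor count, and that the
APs are numbered 1 through count-1 (with the BSP as 0); that matches the
logs above, but it is my reading of them, not something I verified in
the code. The function name is hypothetical.

```python
import re

CALLBACK_RE = re.compile(r"MpInitChangeApLoopCallback :: Processor (\d+),")
ENTER_RE = re.compile(r"RelocateApLoop :: Processor (\d+) Enter")


def missing_aps(log_text):
    """Return the sorted AP numbers that never logged 'RelocateApLoop ... Enter'.

    Assumption: APs are numbered 1..(total - 1), where total is the
    number reported on the MpInitChangeApLoopCallback line.
    """
    total = None
    entered = set()
    for line in log_text.splitlines():
        match = CALLBACK_RE.search(line)
        if match:
            total = int(match.group(1))
            entered.clear()  # start over at each callback invocation
            continue
        match = ENTER_RE.search(line)
        if match:
            entered.add(int(match.group(1)))
    if total is None:
        return []  # callback line not found; nothing to compare against
    return sorted(set(range(1, total)) - entered)


if __name__ == "__main__":
    log = """\
  MpInitChangeApLoopCallback :: Processor 4, Enabled Processor 4!
  RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 1 Enter... MwaitSupport = 0!
"""
    print(missing_aps(log))
```

Under that numbering assumption, the three failing logs quoted above are
missing AP 7, AP 1 and AP 3, respectively.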

I think something in the code is sensitive to timing, or lacks some kind
of synchronization. One of the APs may sometimes be missed. I guess it's
possible that the SMI broadcast feature, when enabled, helps expose the
problem.

Thanks,
Laszlo

