From: "Dong, Eric" <eric.dong@intel.com>
To: Laszlo Ersek <lersek@redhat.com>,
"edk2-devel@lists.01.org" <edk2-devel@lists.01.org>
Cc: "Ni, Ruiyu" <ruiyu.ni@intel.com>
Subject: Re: [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.
Date: Fri, 20 Jul 2018 06:53:07 +0000
Message-ID: <ED077930C258884BBCB450DB737E66224AC554A1@shsmsx102.ccr.corp.intel.com>
In-Reply-To: <3ec340cf-3bf1-ad22-3b7b-aa1b2c1fcaa8@redhat.com>
Hi Laszlo,
> -----Original Message-----
> From: Laszlo Ersek [mailto:lersek@redhat.com]
> Sent: Friday, July 20, 2018 1:01 AM
> To: Dong, Eric <eric.dong@intel.com>; edk2-devel@lists.01.org
> Cc: Ni, Ruiyu <ruiyu.ni@intel.com>
> Subject: Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant
> parameter.
>
> Hi Eric,
>
> apologies about the delay.
>
> On 07/18/18 14:59, Dong, Eric wrote:
> > Hi Laszlo,
> >
> > I finally succeeded in setting up an OVMF platform that can verify the
> > boot failure issue. But on my platform, if I use an image built with the
> > command below (I assume it is used to enable SMM), the system can't boot
> > to the OS (host OS is Fedora 25 and guest OS is Ubuntu 18.04). It hangs in
> > the OS boot phase after the ExitBootServices point (I can see the console
> > log that should be printed at ExitBootServices, so I think the hang
> > happens after this point).
> > build -a IA32 -a X64 -p OvmfPkg/OvmfPkgIa32X64.dsc -t VS2015x86 -b
> > NOOPT -D SMM_REQUIRE -D SECURE_BOOT_ENABLE -D TLS_ENABLE
> >
> > If I use the command below to build the image, the system can boot to the OS.
> > build -a IA32 -a X64 -p OvmfPkg\OvmfPkgIa32X64.dsc -t VS2015x86 -b
> > NOOPT
> >
> > Does my OVMF environment still have a problem?
> >
> >
> > When I ran the above test, I did not include my two patches.
>
> Yes, I think this host environment is still problematic. Namely, the latest
> QEMU version shipped in Fedora 25 is QEMU-2.7:
>
> https://koji.fedoraproject.org/koji/buildinfo?buildID=918114
>
> and QEMU-2.7 does not have a feature that is important for SMM stability.
> This feature is called "SMI broadcast".
>
> In OVMF, the "OvmfPkg/SmmControl2Dxe" runtime driver implements
> EFI_SMM_CONTROL2_PROTOCOL (which is a runtime protocol). The Trigger()
> member function raises an SMI, by writing to IO port 0xB2 (ICH9_APM_CNT).
>
> Originally, QEMU would raise the SMI synchronously only on the sole VCPU
> that called Trigger(). Then, the edk2 SMM driver stack would have to pull the
> other processors explicitly into SMM (via APIC accesses, if I remember
> correctly). This was extremely slow (the processor first raising the SMI would
> wait for a long time for the other processors to show up in SMM, before it
> would decide to pull them in with APIC writes). Also when we switched the
> edk2 SMM sync mode to "relaxed", the results remained very unstable. We
> decided that edk2 supported the "traditional" SMM sync mode much better,
> and so we implemented "SMI broadcast" in QEMU, to satisfy that sync mode.
>
> (My memories are a bit fuzzy at this point; you can read more in the following
> RH Bugzilla entries:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1412327 [QEMU]
> https://bugzilla.redhat.com/show_bug.cgi?id=1412313 [OVMF])
>
> The idea of "SMI broadcast" is that, regardless of which VCPU triggers the
> SMI, QEMU raises the SMI immediately on all VCPUs. This made a
> *huge* difference for the performance and the stability of the edk2 SMM
> driver stack, used in OVMF and on QEMU/KVM.
>
> Now, in order to be able to use old OVMF on new QEMU and vice versa, this
> feature is runtime-negotiated between "OvmfPkg/SmmControl2Dxe" and
> QEMU. (The feature is not enabled by default, and without "SMI broadcast",
> the "relaxed" sync method is slightly less broken than the "traditional"
> method, so OVMF defaults to that. With the feature enabled, the "traditional"
> mode is better -- that config is the absolute best of all four possible
> combinations.)
>
> More precisely, on the QEMU side, the feature is not tied to a QEMU release,
> but to Q35 *machine type versions*. Therefore, in order to benefit from the
> feature, you need all of the following:
>
> - a recent enough OVMF,
> - a recent enough QEMU release,
> - a recent enough Q35 machine type, specified on the QEMU command line.
>
> The particular minimum machine type is "pc-q35-2.9" (which is clearly only
> provided by QEMU-2.9 and later). The machine type requirement is
> automatically satisfied if you use QEMU-2.9+, and just request the "q35"
> machine type. (Without an explicit machtype version number, the highest one
> supported by the QEMU release will be picked.)
>
> The lack of this feature in your environment is confirmed by your OVMF
> log:
>
> > NegotiateSmiFeatures: SMI feature negotiation unavailable
>
> If the feature is available, you will see the following two messages
> instead:
>
> NegotiateSmiFeatures: using SMI broadcast
> [...]
> AppendFwCfgBootScript: SMI feature negotiation boot script saved
>
> (The second message only appears if you have S3 enabled -- at S3 resume, the
> feature has to be re-enabled, so SmmControl2Dxe saves a boot script
> fragment for that.)
>
> Therefore, please upgrade the host to Fedora 26. In Fedora 26, QEMU 2.9 is
> shipped:
>
> https://koji.fedoraproject.org/koji/buildinfo?buildID=986762
>
> ... It's even better if you can upgrade to Fedora 27, as Fedora 27 is the oldest
> Fedora release still supported at this point. The following article describes the
> recommended upgrade method:
>
> https://fedoraproject.org/wiki/DNF_system_upgrade
>
I updated the system to Fedora 28, but it failed to boot. :( So I borrowed an existing Fedora 27 DVD and installed that instead. With this OS, I can reproduce the issue now. It is intermittent: I booted 5 times and hit it. I'm looking into it.
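
In case it helps anyone else checking a host setup: the negotiation result can be read directly from the OVMF debug log, using the three messages quoted above. The following is only a minimal Python sketch, not part of OVMF or QEMU. The helper name smi_broadcast_status and the choice to pass the whole log as one string are assumptions made for illustration; only the message strings come from this thread.

```python
# Message strings copied from the OVMF log excerpts quoted in this thread.
UNAVAILABLE = "NegotiateSmiFeatures: SMI feature negotiation unavailable"
BROADCAST = "NegotiateSmiFeatures: using SMI broadcast"
S3_SCRIPT = "AppendFwCfgBootScript: SMI feature negotiation boot script saved"


def smi_broadcast_status(log_text):
    """Return (state, s3_script_saved) for an OVMF debug log.

    Hypothetical helper: state is "broadcast", "unavailable", or
    "unknown" when neither NegotiateSmiFeatures message is present.
    """
    if BROADCAST in log_text:
        state = "broadcast"
    elif UNAVAILABLE in log_text:
        state = "unavailable"
    else:
        state = "unknown"
    # The boot script message is only expected when S3 is enabled.
    return state, S3_SCRIPT in log_text
```

A result of ("unavailable", False) would match the Fedora 25 / QEMU-2.7 situation described above; ("broadcast", ...) is what a new enough OVMF, QEMU and Q35 machine type should give.
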
> > Then I included my patches and built the image with SMM enabled, and I
> > found that I can't reproduce the issue you hit. I can find the
> > "MpInitChangeApLoopCallback done!" message in the console log.
> > I have attached the console log.
>
> Yes, I can see "MpInitChangeApLoopCallback() done" in the log.
>
> > Can you help to verify the OVMF image build from my side?
>
> Your firmware image (SHA1: a11169ef30ab4d0182dbe2c3fc072b0b2e98c06a)
> reproduces the same issue that I reported, on my end. Out of 10 subsequent
> attempts, it only succeeded to boot the OS 3 times (attempts #1, #8 and #10).
> In the failed cases, the log always ends like this:
>
> MpInitChangeApLoopCallback :: Processor 8, Enabled Processor 8!
> RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 3 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 4 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 5 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 6 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 1 Enter... MwaitSupport = 0!
> <HANG>
>
> That is, one of the APs fails to show up. It always changes which one is missing;
> for example, another failure:
>
> MpInitChangeApLoopCallback :: Processor 8, Enabled Processor 8!
> RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 7 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 4 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 6 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 3 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 5 Enter... MwaitSupport = 0!
> <HANG>
>
> My laptop that I use for testing has 1 socket, 4 cores, and 2 threads.
> This is the same VCPU configuration that I use for the guest (hence the
> 1 BSP + 7 AP config seen above). I got the idea that perhaps the host was
> slightly over-subscribed (= more VCPU work than the physical processors can
> serve in "near real time"), and so I changed the guest config to 1 socket, 2
> cores, and 2 threads (= 1 BSP + 3 APs).
> Unfortunately, the issue reproduced in this config as well, at the 4th
> try:
>
> MpInitChangeApLoopCallback :: Processor 4, Enabled Processor 4!
> RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 1 Enter... MwaitSupport = 0!
> <HANG>
>
> Just to be sure, I tested a fresh build (without the patches); that booted the OS
> fine (10 out of 10).
>
> I think something in the code is sensitive to timing, or lacks some kind of
> synchronization. One of the APs may sometimes be missed. I guess it's
> possible that the SMI broadcast feature, when enabled, helps expose the
> problem.
>
Good information, thanks. I'm investigating this issue and will get back to you once I have found the root cause.
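
To compare many failed runs quickly, the missing AP can be worked out from the log lines quoted above. This is a rough Python sketch under two assumptions that are not confirmed in this thread: that N in "Enabled Processor N" is the total processor count including the BSP, and that the APs are numbered 1 to N-1 (this matches the three excerpts, where 7, 1 and 3 are absent). The function name missing_aps is made up for illustration.

```python
import re

# Patterns matching the debug lines quoted in this thread.
CALLBACK_RE = re.compile(
    r"MpInitChangeApLoopCallback :: Processor (\d+), Enabled Processor (\d+)!"
)
ENTER_RE = re.compile(r"RelocateApLoop :: Processor (\d+) Enter")


def missing_aps(log_text):
    """Return the sorted AP numbers that never logged 'RelocateApLoop ... Enter'.

    Assumption: APs are numbered 1..N-1, where N is the enabled
    processor count from the last MpInitChangeApLoopCallback line.
    """
    callbacks = list(CALLBACK_RE.finditer(log_text))
    if not callbacks:
        raise ValueError("no MpInitChangeApLoopCallback line found in log")
    last = callbacks[-1]
    enabled = int(last.group(2))
    # Only look at AP messages logged after the callback line.
    seen = {int(n) for n in ENTER_RE.findall(log_text[last.end():])}
    return sorted(set(range(1, enabled)) - seen)
```

An empty list means every expected AP reported in; a non-empty list names the AP(s) that did not show up before the hang. Because the patterns are searched for anywhere in a line, mail quote prefixes such as "> " in front of the log lines do not matter.
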
> Thanks,
> Laszlo