public inbox for devel@edk2.groups.io
From: "Dong, Eric" <eric.dong@intel.com>
To: Laszlo Ersek <lersek@redhat.com>,
	"edk2-devel@lists.01.org" <edk2-devel@lists.01.org>
Cc: "Ni, Ruiyu" <ruiyu.ni@intel.com>
Subject: Re: [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.
Date: Fri, 20 Jul 2018 06:53:07 +0000
Message-ID: <ED077930C258884BBCB450DB737E66224AC554A1@shsmsx102.ccr.corp.intel.com>
In-Reply-To: <3ec340cf-3bf1-ad22-3b7b-aa1b2c1fcaa8@redhat.com>

Hi Laszlo,


> -----Original Message-----
> From: Laszlo Ersek [mailto:lersek@redhat.com]
> Sent: Friday, July 20, 2018 1:01 AM
> To: Dong, Eric <eric.dong@intel.com>; edk2-devel@lists.01.org
> Cc: Ni, Ruiyu <ruiyu.ni@intel.com>
> Subject: Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant
> parameter.
> 
> Hi Eric,
> 
> apologies about the delay.
> 
> On 07/18/18 14:59, Dong, Eric wrote:
> > Hi Laszlo,
> >
> > I finally succeeded in setting up an OVMF platform on which I can verify
> > the boot failure issue. But on my platform, if I use an image built with
> > the command below (I assume it is used to enable SMM), the system can't
> > boot to the OS (host OS is Fedora 25, guest OS is Ubuntu 18.04). It hangs
> > in the OS boot phase after ExitBootServices (I can see the console log
> > that should be printed at the ExitBootServices point, so I think the hang
> > happens after this point).
> > 	build -a IA32 -a X64 -p OvmfPkg/OvmfPkgIa32X64.dsc -t VS2015x86 -b
> > NOOPT -D SMM_REQUIRE -D SECURE_BOOT_ENABLE -D TLS_ENABLE
> >
> > If I use the command below to build the image, the system can boot to the OS.
> > 	build -a IA32 -a X64 -p OvmfPkg\OvmfPkgIa32X64.dsc -t VS2015x86 -b
> > NOOPT
> >
> > Does my OVMF environment still have a problem?
> >
> >
> > When doing the above tests, I didn't include my two patches.
> 
> Yes, I think this host environment is still problematic. Namely, the latest
> QEMU version shipped in Fedora 25 is QEMU-2.7:
> 
>   https://koji.fedoraproject.org/koji/buildinfo?buildID=918114
> 
> and QEMU-2.7 does not have a feature that is important for SMM stability.
> This feature is called "SMI broadcast".
> 
> In OVMF, the "OvmfPkg/SmmControl2Dxe" runtime driver implements
> EFI_SMM_CONTROL2_PROTOCOL (which is a runtime protocol). The Trigger()
> member function raises an SMI, by writing to IO port 0xB2 (ICH9_APM_CNT).
> 
> Originally, QEMU would raise the SMI synchronously only on the sole VCPU
> that called Trigger(). Then, the edk2 SMM driver stack would have to pull the
> other processors explicitly into SMM (via APIC accesses, if I remember
> correctly). This was extremely slow (the processor first raising the SMI would
> wait for a long time for the other processors to show up in SMM, before it
> would decide to pull them in with APIC writes). Also when we switched the
> edk2 SMM sync mode to "relaxed", the results remained very unstable. We
> decided that edk2 supported the "traditional" SMM sync mode much better,
> and so we implemented "SMI broadcast" in QEMU, to satisfy that sync mode.
> 
> (My memories are a bit fuzzy at this point; you can read more in the following
> RH Bugzilla entries:
> 
>   https://bugzilla.redhat.com/show_bug.cgi?id=1412327 [QEMU]
>   https://bugzilla.redhat.com/show_bug.cgi?id=1412313 [OVMF])
> 
> The idea of "SMI broadcast" is that, regardless of which VCPU triggers the
> SMI, QEMU raises the SMI immediately on all VCPUs. This made a
> *huge* difference for the performance and the stability of the edk2 SMM
> driver stack, used in OVMF and on QEMU/KVM.
> 
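
To make the difference concrete, here is a toy model in C. It is not QEMU or edk2 code; the function name `deliver_smi` and the `MAX_VCPUS` limit of 8 are hypothetical. It only shows which VCPUs enter SMM immediately when one VCPU writes to port 0xB2, with and without SMI broadcast.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_VCPUS 8 /* hypothetical limit for this toy model */

/*
 * Toy model (hypothetical, not QEMU or edk2 code) of VCPU 'trigger_vcpu'
 * writing to the APM control port (0xB2). Fills in_smm[] and returns how
 * many VCPUs entered SMM immediately. Without broadcast only the writer
 * enters SMM, and the SMM core must pull the rest in by other means.
 * With broadcast, every VCPU enters SMM at once.
 */
static int deliver_smi(int trigger_vcpu, int vcpu_count, bool broadcast,
                       bool in_smm[MAX_VCPUS])
{
    assert(vcpu_count > 0 && vcpu_count <= MAX_VCPUS);
    assert(trigger_vcpu >= 0 && trigger_vcpu < vcpu_count);

    int entered = 0;
    for (int i = 0; i < vcpu_count; i++) {
        in_smm[i] = broadcast || i == trigger_vcpu;
        if (in_smm[i]) {
            entered++;
        }
    }
    return entered;
}
```

For example, `deliver_smi(3, 8, false, in_smm)` returns 1, while `deliver_smi(3, 8, true, in_smm)` returns 8. In the first case the other seven VCPUs still have to be brought into SMM separately, which is the slow path described above.
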
> Now, in order to be able to use old OVMF on new QEMU and vice versa, this
> feature is runtime-negotiated between "OvmfPkg/SmmControl2Dxe" and
> QEMU. (The feature is not enabled by default, and without "SMI broadcast",
> the "relaxed" sync method is slightly less broken than the "traditional"
> method, so OVMF defaults to that. With the feature enabled, the "traditional"
> mode is better -- that config is the absolute best of all four possible
> combinations.)
> 
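
The sync mode decision described in the parenthesis above boils down to a one-line rule. The enum and function below are hypothetical and illustrative only. As far as I understand, edk2 carries the actual setting in the `PcdCpuSmmSyncMode` PCD of UefiCpuPkg, but this sketch does not model that mechanism.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
typedef enum {
    SMM_SYNC_TRADITIONAL,
    SMM_SYNC_RELAXED
} smm_sync_mode;

static smm_sync_mode choose_sync_mode(bool smi_broadcast_negotiated)
{
    /*
     * With SMI broadcast, "traditional" gives the best of the four
     * combinations. Without it, "relaxed" is slightly less broken,
     * so that is the default.
     */
    return smi_broadcast_negotiated ? SMM_SYNC_TRADITIONAL : SMM_SYNC_RELAXED;
}
```
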
> More precisely, on the QEMU side, the feature is not tied to a QEMU release,
> but to Q35 *machine type versions*. Therefore, in order to benefit from the
> feature, you need all of the following:
> 
> - a recent enough OVMF,
> - a recent enough QEMU release,
> - a recent enough Q35 machine type, specified on the QEMU command line.
> 
> The particular minimum machine type is "pc-q35-2.9" (which is clearly only
> provided by QEMU-2.9 and later). The machine type requirement is
> automatically satisfied if you use QEMU-2.9+, and just request the "q35"
> machine type. (Without an explicit machine type version number, the highest one
> supported by the QEMU release will be picked.)
> 
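
The machine type rule above can be expressed as a small check. This is a minimal sketch, assuming the caller passes only the machine type name (for example "q35" or "pc-q35-2.9", without any ",option=value" suffixes) together with the QEMU release version. The helper names are hypothetical, and QEMU itself has no such function.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if major.minor >= want_major.want_minor. */
static bool version_at_least(int major, int minor, int want_major, int want_minor)
{
    return major > want_major || (major == want_major && minor >= want_minor);
}

/*
 * Hypothetical helper: does this Q35 machine type allow SMI feature
 * negotiation (pc-q35-2.9 or later)? Plain "q35" aliases the newest Q35
 * type of the QEMU release, so it qualifies only on QEMU 2.9 or later.
 */
static bool q35_supports_smi_broadcast(const char *machine,
                                       int qemu_major, int qemu_minor)
{
    int major, minor;

    if (strcmp(machine, "q35") == 0) {
        return version_at_least(qemu_major, qemu_minor, 2, 9);
    }
    if (sscanf(machine, "pc-q35-%d.%d", &major, &minor) == 2) {
        return version_at_least(major, minor, 2, 9);
    }
    return false; /* not a Q35 machine type */
}
```

The explicit versioned form is compared numerically, so "pc-q35-2.10" correctly counts as newer than "pc-q35-2.9".
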
> The lack of this feature in your environment is confirmed by your OVMF
> log:
> 
> > NegotiateSmiFeatures: SMI feature negotiation unavailable
> 
> If the feature is available, you will see the following two messages
> instead:
> 
>   NegotiateSmiFeatures: using SMI broadcast
>   [...]
>   AppendFwCfgBootScript: SMI feature negotiation boot script saved
> 
> (The second message only appears if you have S3 enabled -- at S3 resume, the
> feature has to be re-enabled, so SmmControl2Dxe saves a boot script
> fragment for that.)
> 
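
Checking a captured OVMF debug log for these messages is easy to automate. The sketch below assumes the whole log has been read into a single NUL-terminated string. The enum and function names are hypothetical; only the three message strings come from this thread.

```c
#include <stdbool.h>
#include <string.h>

typedef enum {
    SMI_BROADCAST_UNKNOWN,     /* neither negotiation message found */
    SMI_BROADCAST_UNAVAILABLE, /* QEMU/machine type lacks the feature */
    SMI_BROADCAST_IN_USE       /* feature negotiated successfully */
} smi_broadcast_status;

/*
 * Classify an OVMF log by the messages quoted above. *s3_script_saved
 * reports whether the S3 boot script fragment was saved (this only
 * happens when S3 is enabled).
 */
static smi_broadcast_status classify_ovmf_log(const char *log,
                                              bool *s3_script_saved)
{
    *s3_script_saved = strstr(log,
        "AppendFwCfgBootScript: SMI feature negotiation boot script saved") != NULL;

    if (strstr(log, "NegotiateSmiFeatures: using SMI broadcast") != NULL) {
        return SMI_BROADCAST_IN_USE;
    }
    if (strstr(log, "NegotiateSmiFeatures: SMI feature negotiation unavailable") != NULL) {
        return SMI_BROADCAST_UNAVAILABLE;
    }
    return SMI_BROADCAST_UNKNOWN;
}
```
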
> Therefore, please upgrade the host to Fedora 26. In Fedora 26, QEMU 2.9 is
> shipped:
> 
>   https://koji.fedoraproject.org/koji/buildinfo?buildID=986762
> 
> ... It's even better if you can upgrade to Fedora 27, as Fedora 27 is the oldest
> Fedora release still supported at this point. The following article describes the
> recommended upgrade method:
> 
>   https://fedoraproject.org/wiki/DNF_system_upgrade
> 

I updated the system to Fedora 28, but it failed to boot. :( So I borrowed an existing Fedora 27 DVD and installed it. With this OS, I can now reproduce the issue. The issue is intermittent: I booted 5 times and hit it. I'm looking into it.

> > Then I included my patches and built the image with SMM enabled, and I
> > could not reproduce the issue you hit. I can find the
> > "MpInitChangeApLoopCallback done!" message in the console log.
> > The console log is attached.
> 
> Yes, I can see "MpInitChangeApLoopCallback() done" in the log.
> 
> > Can you help verify the OVMF image built on my side?
> 
> Your firmware image (SHA1: a11169ef30ab4d0182dbe2c3fc072b0b2e98c06a)
> reproduces the same issue that I reported, on my end. Out of 10 consecutive
> attempts, it succeeded in booting the OS only 3 times (attempts #1, #8 and #10).
> In the failed cases, the log always ends like this:
> 
>   MpInitChangeApLoopCallback :: Processor 8, Enabled Processor 8!
>   RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 3 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 4 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 5 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 6 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 1 Enter... MwaitSupport = 0!
>   <HANG>
> 
> That is, one of the APs fails to show up. Which one is missing varies between runs;
> for example, another failure:
> 
>   MpInitChangeApLoopCallback :: Processor 8, Enabled Processor 8!
>   RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 7 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 4 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 6 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 3 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 5 Enter... MwaitSupport = 0!
>   <HANG>
> 
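
With many VCPUs it is tedious to spot by eye which AP is absent from such a log. The following hypothetical helper (not part of edk2) assumes the BSP is processor 0 and the APs are numbered 1 through N-1, as in the logs quoted here. It reports which APs never printed a "RelocateApLoop :: Processor N Enter" line.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_CPUS 64 /* hypothetical upper bound for this helper */

/*
 * Return how many APs (1..enabled_count-1) never logged
 * "RelocateApLoop :: Processor N Enter", storing their numbers in
 * missing[] in ascending order.
 */
static int find_missing_aps(const char *log, int enabled_count,
                            int missing[MAX_CPUS])
{
    static const char marker[] = "RelocateApLoop :: Processor ";
    bool seen[MAX_CPUS] = { false };
    const char *p = log;
    int n = 0;

    while ((p = strstr(p, marker)) != NULL) {
        int cpu;
        p += sizeof marker - 1; /* skip past the marker text */
        if (sscanf(p, "%d", &cpu) == 1 && cpu >= 0 && cpu < MAX_CPUS) {
            seen[cpu] = true;
        }
    }
    for (int cpu = 1; cpu < enabled_count && cpu < MAX_CPUS; cpu++) {
        if (!seen[cpu]) {
            missing[n++] = cpu;
        }
    }
    return n;
}
```

Given the first failing log above and an enabled count of 8, it reports processor 7. For the second log it reports processor 1.
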
> My laptop that I use for testing has 1 socket, 4 cores, and 2 threads.
> This is the same VCPU configuration that I use for the guest (hence the
> 1 BSP + 7 AP config seen above). I got the idea that perhaps the host was
> slightly over-subscribed (= more VCPU work than the physical processors can
> serve in "near real time"), and so I changed the guest config to 1 socket, 2
> cores, and 2 threads (= 1 BSP + 3 APs).
> Unfortunately, the issue reproduced in this config as well, at the 4th
> try:
> 
>   MpInitChangeApLoopCallback :: Processor 4, Enabled Processor 4!
>   RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 1 Enter... MwaitSupport = 0!
>   <HANG>
> 
> Just to be sure, I tested a fresh build (without the patches); that booted the OS
> fine (10 out of 10).
> 
> I think something in the code is sensitive to timing, or lacks some kind of
> synchronization. One of the APs may sometimes be missed. I guess it's
> possible that the SMI broadcast feature, when enabled, helps expose the
> problem.
> 

Thanks for the information. I'm investigating this issue and will get back to you once I have root-caused it.

> Thanks,
> Laszlo


Thread overview: 10+ messages
2018-06-29  3:20 [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter Eric Dong
2018-06-29 12:14 ` Laszlo Ersek
2018-07-18 12:59   ` Dong, Eric
2018-07-19 17:01     ` Laszlo Ersek
2018-07-20  6:53       ` Dong, Eric [this message]
2018-07-20 16:30         ` Laszlo Ersek
2018-07-25  3:50           ` Dong, Eric
2018-07-25 10:13             ` Laszlo Ersek
2018-07-25 11:35               ` Dong, Eric
2018-07-25 15:35                 ` Laszlo Ersek
