From: Laszlo Ersek
To: "Dong, Eric" , "edk2-devel@lists.01.org"
Cc: "Ni, Ruiyu"
Date: Thu, 19 Jul 2018 19:01:17 +0200
Message-ID: <3ec340cf-3bf1-ad22-3b7b-aa1b2c1fcaa8@redhat.com>
References: <20180629032047.6340-1-eric.dong@intel.com> <2eac3f3f-972f-9844-6567-5503a0403a85@redhat.com>
Subject: Re: [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.
List-Id: EDK II Development

Hi Eric,

apologies about the delay.

On 07/18/18 14:59, Dong, Eric wrote:
> Hi Laszlo,
>
> I finally succeed to setup the OVMF platform which can verify the boot
> failure issue. But on my platform, if I use image build with below
> command (I assume it is used to enable SMM), the system can't boot to
> OS (host OS is fedora 25 and guest OS is Ubuntu 18.04). It hang at OS
> boot phase after ExitBootService point (I can see the console log
> which should been printed at ExitBootService point, so I think hang
> should after this point).
> build -a IA32 -a X64 -p OvmfPkg/OvmfPkgIa32X64.dsc -t VS2015x86 -b NOOPT -D SMM_REQUIRE -D SECURE_BOOT_ENABLE -D TLS_ENABLE
>
> If I use below command to build the image, the system can boot to OS.
> build -a IA32 -a X64 -p OvmfPkg\OvmfPkgIa32X64.dsc -t VS2015x86 -b NOOPT
>
> Does my OVMF environment still has problem?
>
>
> When do the above test, I don't include my two patches.

Yes, I think this host environment is still problematic.

Namely, the latest QEMU version shipped in Fedora 25 is QEMU-2.7:

  https://koji.fedoraproject.org/koji/buildinfo?buildID=918114

and QEMU-2.7 does not have a feature that is important for SMM
stability. This feature is called "SMI broadcast".

In OVMF, the "OvmfPkg/SmmControl2Dxe" runtime driver implements
EFI_SMM_CONTROL2_PROTOCOL (which is a runtime protocol). The Trigger()
member function raises an SMI, by writing to IO port 0xB2
(ICH9_APM_CNT).

Originally, QEMU would raise the SMI synchronously only on the sole VCPU
that called Trigger(). Then, the edk2 SMM driver stack would have to
pull the other processors explicitly into SMM (via APIC accesses, if I
remember correctly).
This was extremely slow (the processor first raising the SMI would wait
for a long time for the other processors to show up in SMM, before it
would decide to pull them in with APIC writes). Also when we switched
the edk2 SMM sync mode to "relaxed", the results remained very unstable.

We decided that edk2 supported the "traditional" SMM sync mode much
better, and so we implemented "SMI broadcast" in QEMU, to satisfy that
sync mode.

(My memories are a bit fuzzy at this point; you can read more in the
following RH Bugzilla entries:

  https://bugzilla.redhat.com/show_bug.cgi?id=1412327 [QEMU]
  https://bugzilla.redhat.com/show_bug.cgi?id=1412313 [OVMF])

The idea of "SMI broadcast" is that, regardless of which VCPU triggers
the SMI, QEMU raises the SMI immediately on all VCPUs. This made a
*huge* difference for the performance and the stability of the edk2 SMM
driver stack, used in OVMF and on QEMU/KVM.

Now, in order to be able to use old OVMF on new QEMU and vice versa,
this feature is runtime-negotiated between "OvmfPkg/SmmControl2Dxe" and
QEMU. (The feature is not enabled by default, and without "SMI
broadcast", the "relaxed" sync method is slightly less broken than the
"traditional" method, so OVMF defaults to that. With the feature
enabled, the "traditional" mode is better -- that config is the absolute
best of all four possible combinations.)

More precisely, on the QEMU side, the feature is not tied to a QEMU
release, but to Q35 *machine type versions*. Therefore, in order to
benefit from the feature, you need all of the following:

- a recent enough OVMF,
- a recent enough QEMU release,
- a recent enough Q35 machine type, specified on the QEMU command line.

The particular minimum machine type is "pc-q35-2.9" (which is clearly
only provided by QEMU-2.9 and later). The machine type requirement is
automatically satisfied if you use QEMU-2.9+, and just request the "q35"
machine type.
(Without an explicit machine type version number, the highest one
supported by the QEMU release will be picked.)

The lack of this feature in your environment is confirmed by your OVMF
log:

> NegotiateSmiFeatures: SMI feature negotiation unavailable

If the feature is available, you will see the following two messages
instead:

NegotiateSmiFeatures: using SMI broadcast
[...]
AppendFwCfgBootScript: SMI feature negotiation boot script saved

(The second message only appears if you have S3 enabled -- at S3 resume,
the feature has to be re-enabled, so SmmControl2Dxe saves a boot script
fragment for that.)

Therefore, please upgrade the host to Fedora 26. In Fedora 26, QEMU 2.9
is shipped:

  https://koji.fedoraproject.org/koji/buildinfo?buildID=986762

... It's even better if you can upgrade to Fedora 27, as Fedora 27 is
the oldest Fedora release still supported at this point. The following
article describes the recommended upgrade method:

  https://fedoraproject.org/wiki/DNF_system_upgrade

> Then I include my patches and build the image with SMM enabled, I
> found I can't reproduce the issue you met. I can find the
> "MpInitChangeApLoopCallback done!" message in the console log.
> Attached the console log.

Yes, I can see "MpInitChangeApLoopCallback() done" in the log.

> Can you help to verify the OVMF image build from my side?

Your firmware image (SHA1: a11169ef30ab4d0182dbe2c3fc072b0b2e98c06a)
reproduces the same issue that I reported, on my end. Out of 10
consecutive attempts, it booted the OS only 3 times (attempts #1, #8
and #10). In the failed cases, the log always ends like this:

MpInitChangeApLoopCallback :: Processor 8, Enabled Processor 8!
RelocateApLoop :: Processor 2 Enter...
MwaitSupport = 0!
RelocateApLoop :: Processor 3 Enter...
MwaitSupport = 0!
RelocateApLoop :: Processor 4 Enter...
MwaitSupport = 0!
RelocateApLoop :: Processor 5 Enter...
MwaitSupport = 0!
RelocateApLoop :: Processor 6 Enter...
MwaitSupport = 0!
RelocateApLoop :: Processor 1 Enter...
MwaitSupport = 0!

That is, one of the APs fails to show up. It always changes which one is
missing; for example, another failure:

MpInitChangeApLoopCallback :: Processor 8, Enabled Processor 8!
RelocateApLoop :: Processor 2 Enter...
MwaitSupport = 0!
RelocateApLoop :: Processor 7 Enter...
MwaitSupport = 0!
RelocateApLoop :: Processor 4 Enter...
MwaitSupport = 0!
RelocateApLoop :: Processor 6 Enter...
MwaitSupport = 0!
RelocateApLoop :: Processor 3 Enter...
MwaitSupport = 0!
RelocateApLoop :: Processor 5 Enter...
MwaitSupport = 0!

My laptop that I use for testing has 1 socket, 4 cores, and 2 threads
per core. This is the same VCPU configuration that I use for the guest
(hence the 1 BSP + 7 AP config seen above).

I got the idea that perhaps the host was slightly over-subscribed (=
more VCPU work than the physical processors can serve in "near real
time"), and so I changed the guest config to 1 socket, 2 cores, and 2
threads per core (= 1 BSP + 3 APs). Unfortunately, the issue reproduced
in this config as well, at the 4th try:

MpInitChangeApLoopCallback :: Processor 4, Enabled Processor 4!
RelocateApLoop :: Processor 2 Enter...
MwaitSupport = 0!
RelocateApLoop :: Processor 1 Enter...
MwaitSupport = 0!

Just to be sure, I tested a fresh build (without the patches); that
booted the OS fine (10 out of 10).

I think something in the code is sensitive to timing, or lacks some kind
of synchronization. One of the APs may sometimes be missed. I guess it's
possible that the SMI broadcast feature, when enabled, helps expose the
problem.

Thanks,
Laszlo
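
PS: in case it helps with comparing failed boots, here is a minimal
sketch of how the missing AP could be picked out of a saved log
mechanically. The message formats are copied from the excerpts above;
the function name missing_aps() is made up for illustration, and the
sketch assumes (judging from the logs only, not from the MpInitLib
source) that with N enabled processors the APs are numbered 1 through
N-1.

```python
import re

# Patterns copied from the log excerpts quoted in this message.
CALLBACK_RE = re.compile(
    r"MpInitChangeApLoopCallback :: Processor \d+, Enabled Processor (\d+)!"
)
ENTER_RE = re.compile(r"RelocateApLoop :: Processor (\d+) Enter\.\.\.")


def missing_aps(log_text):
    """Return the sorted AP numbers that never logged "Enter...".

    Returns None if the log has no MpInitChangeApLoopCallback line.
    Assumption (not confirmed by MpInitLib): with N enabled processors,
    the APs are numbered 1 .. N-1.
    """
    callbacks = list(CALLBACK_RE.finditer(log_text))
    if not callbacks:
        return None
    last = callbacks[-1]
    enabled = int(last.group(1))
    # Only consider messages logged after the last callback line.
    seen = {int(n) for n in ENTER_RE.findall(log_text, last.end())}
    return sorted(set(range(1, enabled)) - seen)


if __name__ == "__main__":
    # The 1 BSP + 3 AP failure quoted above.
    sample = """\
MpInitChangeApLoopCallback :: Processor 4, Enabled Processor 4!
RelocateApLoop :: Processor 2 Enter...
MwaitSupport = 0!
RelocateApLoop :: Processor 1 Enter...
MwaitSupport = 0!
"""
    print(missing_aps(sample))
```

An empty list would mean every AP entered RelocateApLoop, so a
non-empty result across several runs shows which AP was missed each
time.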