From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ml01.01.org (Postfix) with ESMTPS id 2BB0281E36 for ; Mon, 14 Nov 2016 17:38:44 -0800 (PST) Received: from orsmga003.jf.intel.com ([10.7.209.27]) by fmsmga105.fm.intel.com with ESMTP; 14 Nov 2016 17:38:48 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.31,640,1473145200"; d="scan'208";a="901411867" Received: from fmsmsx103.amr.corp.intel.com ([10.18.124.201]) by orsmga003.jf.intel.com with ESMTP; 14 Nov 2016 17:38:47 -0800 Received: from fmsmsx112.amr.corp.intel.com (10.18.116.6) by FMSMSX103.amr.corp.intel.com (10.18.124.201) with Microsoft SMTP Server (TLS) id 14.3.248.2; Mon, 14 Nov 2016 17:38:46 -0800 Received: from shsmsx152.ccr.corp.intel.com (10.239.6.52) by FMSMSX112.amr.corp.intel.com (10.18.116.6) with Microsoft SMTP Server (TLS) id 14.3.248.2; Mon, 14 Nov 2016 17:38:46 -0800 Received: from shsmsx102.ccr.corp.intel.com ([169.254.2.239]) by SHSMSX152.ccr.corp.intel.com ([169.254.6.138]) with mapi id 14.03.0248.002; Tue, 15 Nov 2016 09:38:43 +0800 From: "Fan, Jeff" To: Laszlo Ersek , Paolo Bonzini CC: "edk2-devel@ml01.01.org" , "Yao, Jiewen" Thread-Topic: [edk2] [PATCH v2 0/3] Put AP into safe hlt-loop code on S3 path Thread-Index: AQHSO97k043lSFBUy0mbX04flay71KDTq3eAgAMz0ICAAMHjAIAACSgAgAAeoYCAAAhqAIAABPsAgAAJRwCAAGZ0AIAAAaeAgABf0gCAABmQgIAAiNPw Date: Tue, 15 Nov 2016 01:38:42 +0000 Message-ID: <542CF652F8836A4AB8DBFAAD40ED192A4A2DD199@shsmsx102.ccr.corp.intel.com> References: <20161111054545.19616-1-jeff.fan@intel.com> <542CF652F8836A4AB8DBFAAD40ED192A4A2DB4F5@shsmsx102.ccr.corp.intel.com> <00b6828b-78c5-af4f-ab98-de4460b1b8ec@redhat.com> <4dc14e5c-9b43-4338-c7a5-9750e8a9547a@redhat.com> <3e61ffc4-9eaf-0015-11a7-e2d698624acb@redhat.com> <648314a4-6c17-7c88-7e47-98c4de95fe2d@redhat.com> <7bca5070-be51-aa99-bdb6-9fcc03086430@redhat.com> <9d695359-5b0d-2f86-9dc9-1986ee425a55@redhat.com> In-Reply-To: <9d695359-5b0d-2f86-9dc9-1986ee425a55@redhat.com> Accept-Language: zh-CN, en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsMyIsImlkIjoiMTUwYjU3ZWQtNzM2Yi00NWU0LWJmZDQtMmRmZDRkNDVhNDZhIiwicHJvcHMiOlt7Im4iOiJDVFBDbGFzc2lmaWNhdGlvbiIsInZhbHMiOlt7InZhbHVlIjoiQ1RQX0lDIn1dfV19LCJTdWJqZWN0TGFiZWxzIjpbXSwiVE1DVmVyc2lvbiI6IjE1LjkuNi42IiwiVHJ1c3RlZExhYmVsSGFzaCI6IjZ4bzNWV2dPTnMxWUh3WkJOVVZCUm1yQUwzWERTVjNtUHpMa1RTQkI1Slk9In0= x-ctpclassification: CTP_IC x-originating-ip: [10.239.127.40] MIME-Version: 1.0 Subject: Re: [PATCH v2 0/3] Put AP into safe hlt-loop code on S3 path X-BeenThere: edk2-devel@lists.01.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: EDK II Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 15 Nov 2016 01:38:44 -0000 Content-Language: en-US Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Laszlo, That's great! Thanks your details updating. Jeff -----Original Message----- From: Laszlo Ersek [mailto:lersek@redhat.com]=20 Sent: Tuesday, November 15, 2016 9:28 AM To: Paolo Bonzini; Fan, Jeff Cc: edk2-devel@ml01.01.org; Yao, Jiewen Subject: Re: [edk2] [PATCH v2 0/3] Put AP into safe hlt-loop code on S3 pat= h On 11/15/16 00:56, Laszlo Ersek wrote: > On 11/14/16 19:13, Paolo Bonzini wrote: >> >> >> On 14/11/2016 19:07, Laszlo Ersek wrote: >>> On 11/14/16 13:00, Paolo Bonzini wrote: >>>> >>>> >>>> On 14/11/2016 12:27, Laszlo Ersek wrote: >>>>> Well... >>>>> >>>>> http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.h >>>>> tml=20 >>>>> http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00125.h >>>>> tml=20 >>>>> http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00563.h >>>>> tml >>>>> >>>>> Are you suggesting that I resurrect this patch? That would be my=20 >>>>> pleasure. Please say yes. >>>> >>>> It's hard to say no when someone has written the code already. :) >>> >>> Thanks. I refreshed both patches (OVMF and QEMU -- no code changes=20 >>> just more precise commit messages). Unfortunately, quite a few=20 >>> things seem broken, although these patches worked a year ago. >>> >>> My QEMU base commit is current master 83c83f9a5266. My host kernel=20 >>> is 3.10.0-514.el7.x86_64. >>> >>> *** So, when I test these two patches, based on edk2 master (no=20 >>> on-list patches), Ia32 target, my boot hangs (spins) with the log endin= g in: >>> >>>> SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0 >>> >>> That is, MpInitChangeApLoopCallback() is entered, but it never finishes= . >>> "info cpus" prints: >>> >>> * CPU #0: pc=3D0x000000007f1f7763 thread_id=3D17395 >>> CPU #1: pc=3D0x000000007f2ce01e (halted) thread_id=3D17396 >>> CPU #2: pc=3D0x000000007f2ce01e (halted) thread_id=3D17397 >>> CPU #3: pc=3D0x00000000fffffff0 thread_id=3D17398 >>> >>> and I've also seen a case where all the APs were stuck at the reset=20 >>> vector (0x00000000fffffff0), *not* halted, like VCPU#3 above. They=20 >>> don't spin, they're just stuck. The spinning comes from CPU#0,=20 >>> apparently in MpInitChangeApLoopCallback. >>> >>> *** I flipped the AP sync mode to traditional (considering the=20 >>> relaxed mode shouldn't be required with the broadcast SMIs). This=20 >>> time the log ends with: >>> >>>> SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0 >>>> MpInitChangeApLoopCallback() done! >>> >>> but then QEMU abort()s: >>> >>>> kvm_io_ioeventfd_add: error adding ioeventfd: File exists >>>> 2016-11-14 17:00:41.405+0000: shutting down, reason=3Dcrashed >>> >>> I see some ioeventfd stuff in the recent QEMU history; do you think=20 >>> it's related? >> >> Yes, just try 2.7 for now or disable vhost. >=20 > (1) I think I have some new results. >=20 > I used the gdbserver built into QEMU, and I (sort of) single-stepped=20 > the > MpInitChangeApLoopCallback() function in=20 > "UefiCpuPkg/Library/MpInitLib/DxeMpLib.c", and whatever else it called. > For this I used the Ia32 1x2x2 configuration. Also, the broadcast SMI=20 > patches were applied to both QEMU and OVMF. >=20 > I set a breakpoint on RelocateApLoop(), so that when an AP would start=20 > up, gdb would switch to it automatically (and it did in fact). >=20 > I liberally used "info cpus" from a separate terminal, and also "info=20 > thread" from within gdb (which gives an incredibly cool insight into=20 > the VCPU states!) >=20 > (2) Here's a few interesting results (strictly empirically): >=20 > * when I was stepping through the SendInitSipiSipiAllExcludingSelf() > function *real slow*, manually, suddenly things worked >=20 > * I noticed that, while stepping through the INIT-SIPI-SIPI sequence in > the above-mentioned BSP function, the APs switched from "halted" to > "running" after the *first* SIPI. Not the second SIPI, the first one. >=20 > Then I went to the KVM code, and looked at arch/x86/kvm/lapic.c. My=20 > explanation is terribly inexact, but: >=20 > * __apic_accept_irq() translates the LAPIC writes (APIC_DM_INIT and > APIC_DM_STARTUP) into pending events (KVM_APIC_INIT and=20 > KVM_APIC_SIPI) >=20 > * this function also calls kvm_vcpu_kick() >=20 > * the kvm_apic_accept_events() function processes the pending events. > The KVM_APIC_INIT event, if pending, causes kvm_vcpu_reset() to be > called, and the AP to be moved to KVM_MP_STATE_INIT_RECEIVED state. >=20 > And, in that state, a pending KVM_APIC_SIPI event moves the AP to > KVM_MP_STATE_RUNNABLE. This confirms my surprising empirical result > that only one SIPI is awaited after the INIT before launching the APs. >=20 > * the kvm_vcpu_kick() function in "virt/kvm/kvm_main.c" boils down to a > smp_send_reschedule() call, if the kicker and the "kickee" are > different processors (for example, when the BSP kicks the AP) >=20 > This seemed important because it suggested that host kernel > scheduling jitter could delay the delivery (reception) of the INIT on > the AP after the BSP sent it. If the BSP sent the first (and second) > SIPI really fast after the INIT, then those SIPIs could be missed, > and the AP would remain stuck in KVM_MP_STATE_INIT_RECEIVED state. >=20 > And this state (remember kvm_vcpu_reset() from above!) is consistent > with the RIP being stuck at the reset vector address, with the AP > neither running nor being halted. >=20 > (3) So, I looked at the Intel SDM. It says in the sysprog book (yes,=20 > yes, I should be using the humongous fused edition), >=20 > 8.4.4.1 Typical BSP Initialization Sequence >=20 > 15. Broadcasts an INIT-SIPI-SIPI IPI sequence to the APs to wake them > up and initialize them: >=20 > MOV ESI, ICR_LOW; Load address of ICR low dword into ESI. > MOV EAX, 000C4500H; Load ICR encoding for broadcast INIT IPI > ; to all APs into EAX. > MOV [ESI], EAX; Broadcast INIT IPI to all APs >=20 > ; 10-millisecond delay loop. >=20 > MOV EAX, 000C46XXH; Load ICR encoding for broadcast SIPI IP > ; to all APs into EAX, where xx is the vector > ; computed in step 10. > MOV [ESI], EAX; Broadcast SIPI IPI to all APs >=20 > ; 200-microsecond delay loop >=20 > MOV [ESI], EAX; Broadcast second SIPI IPI to all APs >=20 > ; 200-microsecond delay loop >=20 > In the SendInitSipiSipiAllExcludingSelf(), we have a >=20 > MicroSecondDelay (PcdGet32(PcdCpuInitIpiDelayInMicroSeconds)); >=20 > between the INIT and the first SIPI, and a >=20 > MicroSecondDelay (200); >=20 > between both SIPIs. PcdCpuInitIpiDelayInMicroSeconds is set (in > UefiCpuPkg.dec) to 10000, which matches the above 10ms recommendation. >=20 > I think this could be too low for KVM. (It's telling that the value is=20 > a PCD in the first place.) >=20 > (4) The circumstances where the "AP is lost" -- due to the missed > SIPI(s) I believe -- vary, interestingly. These results are from my=20 > testing with the Ia32 SMM build of OVMF, CPU topology 1x2x2, no XD=20 > support. Jeff's v2 (=3D=3D this series) is applied invariably as a basis= =20 > (because we agree that it fixes bugs). >=20 > * The "AP lost" issue persists at S3 resume even if I raise > PcdCpuInitIpiDelayInMicroSeconds to 1/10th of a second. (Reproduced > at the 43th resume.) >=20 > If I set the kernel's "cpu_init_udelay" parameter similarly in > addition, then that seems to make the symptom more frequent, not > less, which is completely counter-intuitive :( >=20 > * If I apply the broadcast SMI patch to OVMF (on top of Jeff's v2), > then I can't even boot; the AP (or some APs) regularly lose the SIPI > and remain stuck in "INIT received" (I think) when started from > MpInitChangeApLoopCallback(). Hence the boot never completes. >=20 > Raising PcdCpuInitIpiDelayInMicroSeconds as described above fixes the > boot. (And, the original goal of the broadcast SMI patches is > achieved, "efibootmgr" is pretty fast even when bound to VCPU#1.) > However, an AP still gets stuck during S3 occasionally :( >=20 > And, cpu_init_udelay=3D100000 for the guest kernel makes it only worse. >=20 > Ultimately, I couldn't find a way to make S3 work reliably in the Ia32=20 > SMM build. This v2 series doesn't help, broadcast SMIs don't help,=20 > raising the INIT<->SIPI delay in the firmware doesn't help (it just=20 > mitigates the bad effects of the broadcast SMIs :/), and raising the=20 > same delay in the guest kernel only makes things worse. Scratch that, I did find a good configuration: - Jeff's v2, plus - broadcast SMI, plus - flipping back the AP sync mode to 0x00 (traditional) -- this is what I missed in the above tests - no guest kernel cmdline params - messing with PcdCpuInitIpiDelayInMicroSeconds is not necessary, my tests succeeded (*) both with its original value (10ms) and its tenfold increased value (100ms) (*) "Success" means 100 successful S3 suspend/resume cycles, separately wit= h and without modifying PcdCpuInitIpiDelayInMicroSeconds (in total, 200 successful cycles), without losing APs. This is unprecedentedly good for OVMF's Ia32 SMM build! I'll soon submit some patches. I'll also drop the PcdCpuSmmApSyncTimeout ov= erride, because the default is longer (hence safer), and with the tradition= al sync method, in practice we don't have to wait for the BSP on the AP at = all (tested with the "taskset -c 1 efibootmgr" command). ... Okay, repeated the test, with the successful configuration, for Ia32X64 as well: - Jeff's v2, plus - broadcast SMI, plus - flipping back the AP sync mode to 0x00 (traditional), - no guest kernel cmdline params, - no PcdCpuInitIpiDelayInMicroSeconds changes relative to the DEC, - no PcdCpuSmmApSyncTimeout changes relative to the DEC, - 64-bit Fedora guest tested with 100 cycles of S3, no emulation failures, = no lost APs - 64-bit Windows 8.1 guest tested with 33 cycles of S3, no emulation failur= es, no lost APs Series Tested-by: Laszlo Ersek Thanks! Laszlo > Jeff, I think you should go ahead and commit this series. Paolo=20 > reviewed patch #3 and I hope Mike or Jiewen can review the first two=20 > patches. For OVMF they improve things (no more emulation failures),=20 > and I guess we can figure out the lost AP issue later. >=20 > Thanks > Laszlo > _______________________________________________ > edk2-devel mailing list > edk2-devel@lists.01.org > https://lists.01.org/mailman/listinfo/edk2-devel >=20