Re: [PATCH v2 0/3] Put AP into safe hlt-loop code on S3 path

From: "Fan, Jeff" <jeff.fan@intel.com>
To: Laszlo Ersek <lersek@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>
Cc: "edk2-devel@ml01.01.org" <edk2-devel@ml01.01.org>,
	"Yao, Jiewen" <jiewen.yao@intel.com>
Subject: Re: [PATCH v2 0/3] Put AP into safe hlt-loop code on S3 path
Date: Tue, 15 Nov 2016 01:04:55 +0000	[thread overview]
Message-ID: <542CF652F8836A4AB8DBFAAD40ED192A4A2DCE84@shsmsx102.ccr.corp.intel.com> (raw)
In-Reply-To: <deb3e226-a572-b307-b8b5-bffdd2ac6a91@redhat.com>

Sure. Please add it. :-)

-----Original Message-----
From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Laszlo Ersek
Sent: Tuesday, November 15, 2016 9:03 AM
To: Fan, Jeff; Paolo Bonzini
Cc: edk2-devel@ml01.01.org; Yao, Jiewen
Subject: Re: [edk2] [PATCH v2 0/3] Put AP into safe hlt-loop code on S3 path

On 11/15/16 01:47, Fan, Jeff wrote:
> Laszlo,
> 
> I file https://bugzilla.tianocore.org/show_bug.cgi?id=230 to handle AP lost issue.
> 
> I will push the v2 serial of patches, Paolo has already added r-b signature on the first 2 patches in v1 serials.

Please add my

Tested-by: Laszlo Ersek <lersek@redhat.com>

to the first and the third patch.

If you can wait a bit, I might be able to extend my T-b to the second patch as well.

(More details to follow soon.)

Thanks!
Laszlo

> -----Original Message-----
> From: Laszlo Ersek [mailto:lersek@redhat.com]
> Sent: Tuesday, November 15, 2016 7:56 AM
> To: Paolo Bonzini; Fan, Jeff
> Cc: edk2-devel@ml01.01.org; Yao, Jiewen
> Subject: Re: [edk2] [PATCH v2 0/3] Put AP into safe hlt-loop code on 
> S3 path
> 
> On 11/14/16 19:13, Paolo Bonzini wrote:
>>
>>
>> On 14/11/2016 19:07, Laszlo Ersek wrote:
>>> On 11/14/16 13:00, Paolo Bonzini wrote:
>>>>
>>>>
>>>> On 14/11/2016 12:27, Laszlo Ersek wrote:
>>>>> Well...
>>>>>
>>>>> http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.h
>>>>> t
>>>>> ml
>>>>> http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00125.h
>>>>> t
>>>>> ml
>>>>> http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00563.h
>>>>> t
>>>>> ml
>>>>>
>>>>> Are you suggesting that I resurrect this patch? That would be my 
>>>>> pleasure. Please say yes.
>>>>
>>>> It's hard to say no when someone has written the code already. :)
>>>
>>> Thanks. I refreshed both patches (OVMF and QEMU -- no code changes 
>>> just more precise commit messages). Unfortunately, quite a few 
>>> things seem broken, although these patches worked a year ago.
>>>
>>> My QEMU base commit is current master 83c83f9a5266. My host kernel 
>>> is 3.10.0-514.el7.x86_64.
>>>
>>> *** So, when I test these two patches, based on edk2 master (no 
>>> on-list patches), Ia32 target, my boot hangs (spins) with the log ending in:
>>>
>>>> SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0
>>>
>>> That is, MpInitChangeApLoopCallback() is entered, but it never finishes.
>>> "info cpus" prints:
>>>
>>> * CPU #0: pc=0x000000007f1f7763 thread_id=17395
>>>   CPU #1: pc=0x000000007f2ce01e (halted) thread_id=17396
>>>   CPU #2: pc=0x000000007f2ce01e (halted) thread_id=17397
>>>   CPU #3: pc=0x00000000fffffff0 thread_id=17398
>>>
>>> and I've also seen a case where all the APs were stuck at the reset 
>>> vector (0x00000000fffffff0), *not* halted, like VCPU#3 above. They 
>>> don't spin, they're just stuck. The spinning comes from CPU#0, 
>>> apparently in MpInitChangeApLoopCallback.
>>>
>>> *** I flipped the AP sync mode to traditional (considering the 
>>> relaxed mode shouldn't be required with the broadcast SMIs). This 
>>> time the log ends with:
>>>
>>>> SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0
>>>> MpInitChangeApLoopCallback() done!
>>>
>>> but then QEMU abort()s:
>>>
>>>> kvm_io_ioeventfd_add: error adding ioeventfd: File exists
>>>> 2016-11-14 17:00:41.405+0000: shutting down, reason=crashed
>>>
>>> I see some ioeventfd stuff in the recent QEMU history; do you think 
>>> it's related?
>>
>> Yes, just try 2.7 for now or disable vhost.
> 
> (1) I think I have some new results.
> 
> I used the gdbserver built into QEMU, and I (sort of) single-stepped 
> the
> MpInitChangeApLoopCallback() function in "UefiCpuPkg/Library/MpInitLib/DxeMpLib.c", and whatever else it called.
> For this I used the Ia32 1x2x2 configuration. Also, the broadcast SMI patches were applied to both QEMU and OVMF.
> 
> I set a breakpoint on RelocateApLoop(), so that when an AP would start up, gdb would switch to it automatically (and it did in fact).
> 
> I liberally used "info cpus" from a separate terminal, and also "info 
> thread" from within gdb (which gives an incredibly cool insight into 
> the VCPU states!)
> 
> (2) Here's a few interesting results (strictly empirically):
> 
> * when I was stepping through the SendInitSipiSipiAllExcludingSelf()
>   function *real slow*, manually, suddenly things worked
> 
> * I noticed that, while stepping through the INIT-SIPI-SIPI sequence in
>   the above-mentioned BSP function, the APs switched from "halted" to
>   "running" after the *first* SIPI. Not the second SIPI, the first one.
> 
> Then I went to the KVM code, and looked at arch/x86/kvm/lapic.c. My explanation is terribly inexact, but:
> 
> * __apic_accept_irq() translates the LAPIC writes (APIC_DM_INIT and
>   APIC_DM_STARTUP) into pending events (KVM_APIC_INIT and 
> KVM_APIC_SIPI)
> 
> * this function also calls kvm_vcpu_kick()
> 
> * the kvm_apic_accept_events() function processes the pending events.
>   The KVM_APIC_INIT event, if pending, causes kvm_vcpu_reset() to be
>   called, and the AP to be moved to KVM_MP_STATE_INIT_RECEIVED state.
> 
>   And, in that state, a pending KVM_APIC_SIPI event moves the AP to
>   KVM_MP_STATE_RUNNABLE. This confirms my surprising empirical result
>   that only one SIPI is awaited after the INIT before launching the APs.
> 
> * the kvm_vcpu_kick() function in "virt/kvm/kvm_main.c" boils down to a
>   smp_send_reschedule() call, if the kicker and the "kickee" are
>   different processors (for example, when the BSP kicks the AP)
> 
>   This seemed important because it suggested that host kernel
>   scheduling jitter could delay the delivery (reception) of the INIT on
>   the AP after the BSP sent it. If the BSP sent the first (and second)
>   SIPI really fast after the INIT, then those SIPIs could be missed,
>   and the AP would remain stuck in KVM_MP_STATE_INIT_RECEIVED state.
> 
>   And this state (remember kvm_vcpu_reset() from above!) is consistent
>   with the RIP being stuck at the reset vector address, with the AP
>   neither running nor being halted.
> 
> (3) So, I looked at the Intel SDM. It says in the sysprog book (yes, 
> yes, I should be using the humongous fused edition),
> 
> 8.4.4.1 Typical BSP Initialization Sequence
> 
>   15. Broadcasts an INIT-SIPI-SIPI IPI sequence to the APs to wake them
>       up and initialize them:
> 
>       MOV ESI, ICR_LOW; Load address of ICR low dword into ESI.
>       MOV EAX, 000C4500H; Load ICR encoding for broadcast INIT IPI
>                         ; to all APs into EAX.
>       MOV [ESI], EAX; Broadcast INIT IPI to all APs
> 
>       ; 10-millisecond delay loop.
> 
>       MOV EAX, 000C46XXH; Load ICR encoding for broadcast SIPI IP
>                         ; to all APs into EAX, where xx is the vector
>                         ; computed in step 10.
>       MOV [ESI], EAX; Broadcast SIPI IPI to all APs
> 
>       ; 200-microsecond delay loop
> 
>       MOV [ESI], EAX; Broadcast second SIPI IPI to all APs
> 
>      ; 200-microsecond delay loop
> 
> In the SendInitSipiSipiAllExcludingSelf(), we have a
> 
>   MicroSecondDelay (PcdGet32(PcdCpuInitIpiDelayInMicroSeconds));
> 
> between the INIT and the first SIPI, and a
> 
>   MicroSecondDelay (200);
> 
> between both SIPIs. PcdCpuInitIpiDelayInMicroSeconds is set (in
> UefiCpuPkg.dec) to 10000, which matches the above 10ms recommendation.
> 
> I think this could be too low for KVM. (It's telling that the value is 
> a PCD in the first place.)
> 
> (4) The circumstances where the "AP is lost" -- due to the missed
> SIPI(s) I believe -- vary, interestingly. These results are from my testing with the Ia32 SMM build of OVMF, CPU topology 1x2x2, no XD support. Jeff's v2 (== this series) is applied invariably as a basis (because we agree that it fixes bugs).
> 
> * The "AP lost" issue persists at S3 resume even if I raise
>   PcdCpuInitIpiDelayInMicroSeconds to 1/10th of a second. (Reproduced
>   at the 43th resume.)
> 
>   If I set the kernel's "cpu_init_udelay" parameter similarly in
>   addition, then that seems to make the symptom more frequent, not
>   less, which is completely counter-intuitive :(
> 
> * If I apply the broadcast SMI patch to OVMF (on top of Jeff's v2),
>   then I can't even boot; the AP (or some APs) regularly lose the SIPI
>   and remain stuck in "INIT received" (I think) when started from
>   MpInitChangeApLoopCallback(). Hence the boot never completes.
> 
>   Raising PcdCpuInitIpiDelayInMicroSeconds as described above fixes the
>   boot. (And, the original goal of the broadcast SMI patches is
>   achieved, "efibootmgr" is pretty fast even when bound to VCPU#1.)
>   However, an AP still gets stuck during S3 occasionally :(
> 
>   And, cpu_init_udelay=100000 for the guest kernel makes it only worse.
> 
> Ultimately, I couldn't find a way to make S3 work reliably in the Ia32 SMM build. This v2 series doesn't help, broadcast SMIs don't help, raising the INIT<->SIPI delay in the firmware doesn't help (it just mitigates the bad effects of the broadcast SMIs :/), and raising the same delay in the guest kernel only makes things worse.
> 
> Jeff, I think you should go ahead and commit this series. Paolo reviewed patch #3 and I hope Mike or Jiewen can review the first two patches. For OVMF they improve things (no more emulation failures), and I guess we can figure out the lost AP issue later.
> 
> Thanks
> Laszlo
> 

_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel