From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <lersek@redhat.com>
Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by ml01.01.org (Postfix) with ESMTPS id DB31D81E30
 for <edk2-devel@ml01.01.org>; Mon, 14 Nov 2016 17:03:23 -0800 (PST)
Received: from int-mx14.intmail.prod.int.phx2.redhat.com
 (int-mx14.intmail.prod.int.phx2.redhat.com [10.5.11.27])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by mx1.redhat.com (Postfix) with ESMTPS id 35A9083F45;
 Tue, 15 Nov 2016 01:03:28 +0000 (UTC)
Received: from lacos-laptop-7.usersys.redhat.com (ovpn-116-50.phx2.redhat.com
 [10.3.116.50])
 by int-mx14.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id
 uAF13Qec011175; Mon, 14 Nov 2016 20:03:26 -0500
To: "Fan, Jeff" <jeff.fan@intel.com>, Paolo Bonzini <pbonzini@redhat.com>
References: <20161111054545.19616-1-jeff.fan@intel.com>
 <b55d1e04-296f-f855-dba9-4905b7bf5e88@redhat.com>
 <542CF652F8836A4AB8DBFAAD40ED192A4A2DB4F5@shsmsx102.ccr.corp.intel.com>
 <00b6828b-78c5-af4f-ab98-de4460b1b8ec@redhat.com>
 <4dc14e5c-9b43-4338-c7a5-9750e8a9547a@redhat.com>
 <fe6d7e69-96a2-22dc-4266-58676aa3baaf@redhat.com>
 <a4d62280-41be-c53e-11a9-75539439b716@redhat.com>
 <3e61ffc4-9eaf-0015-11a7-e2d698624acb@redhat.com>
 <648314a4-6c17-7c88-7e47-98c4de95fe2d@redhat.com>
 <e49f4039-e54e-c2db-12b5-8a8e3f284cde@redhat.com>
 <7bca5070-be51-aa99-bdb6-9fcc03086430@redhat.com>
 <d5813a75-c54a-1cd9-e3e2-1db158e817ff@redhat.com>
 <542CF652F8836A4AB8DBFAAD40ED192A4A2DCDC0@shsmsx102.ccr.corp.intel.com>
Cc: "edk2-devel@ml01.01.org" <edk2-devel@ml01.01.org>,
 "Yao, Jiewen" <jiewen.yao@intel.com>
From: Laszlo Ersek <lersek@redhat.com>
Message-ID: <deb3e226-a572-b307-b8b5-bffdd2ac6a91@redhat.com>
Date: Tue, 15 Nov 2016 02:03:25 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
 Thunderbird/45.4.0
MIME-Version: 1.0
In-Reply-To: <542CF652F8836A4AB8DBFAAD40ED192A4A2DCDC0@shsmsx102.ccr.corp.intel.com>
X-Scanned-By: MIMEDefang 2.68 on 10.5.11.27
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16
 (mx1.redhat.com [10.5.110.27]); Tue, 15 Nov 2016 01:03:28 +0000 (UTC)
Subject: Re: [PATCH v2 0/3] Put AP into safe hlt-loop code on S3 path
X-BeenThere: edk2-devel@lists.01.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: EDK II Development  <edk2-devel.lists.01.org>
List-Unsubscribe: <https://lists.01.org/mailman/options/edk2-devel>,
 <mailto:edk2-devel-request@lists.01.org?subject=unsubscribe>
List-Archive: <http://lists.01.org/pipermail/edk2-devel/>
List-Post: <mailto:edk2-devel@lists.01.org>
List-Help: <mailto:edk2-devel-request@lists.01.org?subject=help>
List-Subscribe: <https://lists.01.org/mailman/listinfo/edk2-devel>,
 <mailto:edk2-devel-request@lists.01.org?subject=subscribe>
X-List-Received-Date: Tue, 15 Nov 2016 01:03:24 -0000
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit

On 11/15/16 01:47, Fan, Jeff wrote:
> Laszlo,
> 
> I file https://bugzilla.tianocore.org/show_bug.cgi?id=230 to handle AP lost issue.
> 
> I will push the v2 serial of patches, Paolo has already added r-b signature on the first 2 patches in v1 serials.

Please add my

Tested-by: Laszlo Ersek <lersek@redhat.com>

to the first and the third patch.

If you can wait a bit, I might be able to extend my T-b to the second
patch as well.

(More details to follow soon.)

Thanks!
Laszlo

> -----Original Message-----
> From: Laszlo Ersek [mailto:lersek@redhat.com] 
> Sent: Tuesday, November 15, 2016 7:56 AM
> To: Paolo Bonzini; Fan, Jeff
> Cc: edk2-devel@ml01.01.org; Yao, Jiewen
> Subject: Re: [edk2] [PATCH v2 0/3] Put AP into safe hlt-loop code on S3 path
> 
> On 11/14/16 19:13, Paolo Bonzini wrote:
>>
>>
>> On 14/11/2016 19:07, Laszlo Ersek wrote:
>>> On 11/14/16 13:00, Paolo Bonzini wrote:
>>>>
>>>>
>>>> On 14/11/2016 12:27, Laszlo Ersek wrote:
>>>>> Well...
>>>>>
>>>>> http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.ht
>>>>> ml 
>>>>> http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00125.ht
>>>>> ml 
>>>>> http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00563.ht
>>>>> ml
>>>>>
>>>>> Are you suggesting that I resurrect this patch? That would be my 
>>>>> pleasure. Please say yes.
>>>>
>>>> It's hard to say no when someone has written the code already. :)
>>>
>>> Thanks. I refreshed both patches (OVMF and QEMU -- no code changes 
>>> just more precise commit messages). Unfortunately, quite a few things 
>>> seem broken, although these patches worked a year ago.
>>>
>>> My QEMU base commit is current master 83c83f9a5266. My host kernel is 
>>> 3.10.0-514.el7.x86_64.
>>>
>>> *** So, when I test these two patches, based on edk2 master (no 
>>> on-list patches), Ia32 target, my boot hangs (spins) with the log ending in:
>>>
>>>> SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0
>>>
>>> That is, MpInitChangeApLoopCallback() is entered, but it never finishes.
>>> "info cpus" prints:
>>>
>>> * CPU #0: pc=0x000000007f1f7763 thread_id=17395
>>>   CPU #1: pc=0x000000007f2ce01e (halted) thread_id=17396
>>>   CPU #2: pc=0x000000007f2ce01e (halted) thread_id=17397
>>>   CPU #3: pc=0x00000000fffffff0 thread_id=17398
>>>
>>> and I've also seen a case where all the APs were stuck at the reset 
>>> vector (0x00000000fffffff0), *not* halted, like VCPU#3 above. They 
>>> don't spin, they're just stuck. The spinning comes from CPU#0, 
>>> apparently in MpInitChangeApLoopCallback.
>>>
>>> *** I flipped the AP sync mode to traditional (considering the 
>>> relaxed mode shouldn't be required with the broadcast SMIs). This 
>>> time the log ends with:
>>>
>>>> SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0
>>>> MpInitChangeApLoopCallback() done!
>>>
>>> but then QEMU abort()s:
>>>
>>>> kvm_io_ioeventfd_add: error adding ioeventfd: File exists
>>>> 2016-11-14 17:00:41.405+0000: shutting down, reason=crashed
>>>
>>> I see some ioeventfd stuff in the recent QEMU history; do you think 
>>> it's related?
>>
>> Yes, just try 2.7 for now or disable vhost.
> 
> (1) I think I have some new results.
> 
> I used the gdbserver built into QEMU, and I (sort of) single-stepped the
> MpInitChangeApLoopCallback() function in "UefiCpuPkg/Library/MpInitLib/DxeMpLib.c", and whatever else it called.
> For this I used the Ia32 1x2x2 configuration. Also, the broadcast SMI patches were applied to both QEMU and OVMF.
> 
> I set a breakpoint on RelocateApLoop(), so that when an AP would start up, gdb would switch to it automatically (and it did in fact).
> 
> I liberally used "info cpus" from a separate terminal, and also "info thread" from within gdb (which gives an incredibly cool insight into the VCPU states!)
> 
> (2) Here's a few interesting results (strictly empirically):
> 
> * when I was stepping through the SendInitSipiSipiAllExcludingSelf()
>   function *real slow*, manually, suddenly things worked
> 
> * I noticed that, while stepping through the INIT-SIPI-SIPI sequence in
>   the above-mentioned BSP function, the APs switched from "halted" to
>   "running" after the *first* SIPI. Not the second SIPI, the first one.
> 
> Then I went to the KVM code, and looked at arch/x86/kvm/lapic.c. My explanation is terribly inexact, but:
> 
> * __apic_accept_irq() translates the LAPIC writes (APIC_DM_INIT and
>   APIC_DM_STARTUP) into pending events (KVM_APIC_INIT and KVM_APIC_SIPI)
> 
> * this function also calls kvm_vcpu_kick()
> 
> * the kvm_apic_accept_events() function processes the pending events.
>   The KVM_APIC_INIT event, if pending, causes kvm_vcpu_reset() to be
>   called, and the AP to be moved to KVM_MP_STATE_INIT_RECEIVED state.
> 
>   And, in that state, a pending KVM_APIC_SIPI event moves the AP to
>   KVM_MP_STATE_RUNNABLE. This confirms my surprising empirical result
>   that only one SIPI is awaited after the INIT before launching the APs.
> 
> * the kvm_vcpu_kick() function in "virt/kvm/kvm_main.c" boils down to a
>   smp_send_reschedule() call, if the kicker and the "kickee" are
>   different processors (for example, when the BSP kicks the AP)
> 
>   This seemed important because it suggested that host kernel
>   scheduling jitter could delay the delivery (reception) of the INIT on
>   the AP after the BSP sent it. If the BSP sent the first (and second)
>   SIPI really fast after the INIT, then those SIPIs could be missed,
>   and the AP would remain stuck in KVM_MP_STATE_INIT_RECEIVED state.
> 
>   And this state (remember kvm_vcpu_reset() from above!) is consistent
>   with the RIP being stuck at the reset vector address, with the AP
>   neither running nor being halted.
> 
> (3) So, I looked at the Intel SDM. It says in the sysprog book (yes, yes, I should be using the humongous fused edition),
> 
> 8.4.4.1 Typical BSP Initialization Sequence
> 
>   15. Broadcasts an INIT-SIPI-SIPI IPI sequence to the APs to wake them
>       up and initialize them:
> 
>       MOV ESI, ICR_LOW; Load address of ICR low dword into ESI.
>       MOV EAX, 000C4500H; Load ICR encoding for broadcast INIT IPI
>                         ; to all APs into EAX.
>       MOV [ESI], EAX; Broadcast INIT IPI to all APs
> 
>       ; 10-millisecond delay loop.
> 
>       MOV EAX, 000C46XXH; Load ICR encoding for broadcast SIPI IP
>                         ; to all APs into EAX, where xx is the vector
>                         ; computed in step 10.
>       MOV [ESI], EAX; Broadcast SIPI IPI to all APs
> 
>       ; 200-microsecond delay loop
> 
>       MOV [ESI], EAX; Broadcast second SIPI IPI to all APs
> 
>      ; 200-microsecond delay loop
> 
> In the SendInitSipiSipiAllExcludingSelf(), we have a
> 
>   MicroSecondDelay (PcdGet32(PcdCpuInitIpiDelayInMicroSeconds));
> 
> between the INIT and the first SIPI, and a
> 
>   MicroSecondDelay (200);
> 
> between both SIPIs. PcdCpuInitIpiDelayInMicroSeconds is set (in
> UefiCpuPkg.dec) to 10000, which matches the above 10ms recommendation.
> 
> I think this could be too low for KVM. (It's telling that the value is a PCD in the first place.)
> 
> (4) The circumstances where the "AP is lost" -- due to the missed
> SIPI(s) I believe -- vary, interestingly. These results are from my testing with the Ia32 SMM build of OVMF, CPU topology 1x2x2, no XD support. Jeff's v2 (== this series) is applied invariably as a basis (because we agree that it fixes bugs).
> 
> * The "AP lost" issue persists at S3 resume even if I raise
>   PcdCpuInitIpiDelayInMicroSeconds to 1/10th of a second. (Reproduced
>   at the 43th resume.)
> 
>   If I set the kernel's "cpu_init_udelay" parameter similarly in
>   addition, then that seems to make the symptom more frequent, not
>   less, which is completely counter-intuitive :(
> 
> * If I apply the broadcast SMI patch to OVMF (on top of Jeff's v2),
>   then I can't even boot; the AP (or some APs) regularly lose the SIPI
>   and remain stuck in "INIT received" (I think) when started from
>   MpInitChangeApLoopCallback(). Hence the boot never completes.
> 
>   Raising PcdCpuInitIpiDelayInMicroSeconds as described above fixes the
>   boot. (And, the original goal of the broadcast SMI patches is
>   achieved, "efibootmgr" is pretty fast even when bound to VCPU#1.)
>   However, an AP still gets stuck during S3 occasionally :(
> 
>   And, cpu_init_udelay=100000 for the guest kernel makes it only worse.
> 
> Ultimately, I couldn't find a way to make S3 work reliably in the Ia32 SMM build. This v2 series doesn't help, broadcast SMIs don't help, raising the INIT<->SIPI delay in the firmware doesn't help (it just mitigates the bad effects of the broadcast SMIs :/), and raising the same delay in the guest kernel only makes things worse.
> 
> Jeff, I think you should go ahead and commit this series. Paolo reviewed patch #3 and I hope Mike or Jiewen can review the first two patches. For OVMF they improve things (no more emulation failures), and I guess we can figure out the lost AP issue later.
> 
> Thanks
> Laszlo
>