Re: [PATCH 0/2] Put AP into safe hlt-loop code on S3 path

From: Laszlo Ersek <lersek@redhat.com>
To: "Yao, Jiewen" <jiewen.yao@intel.com>, "Fan, Jeff" <jeff.fan@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	"edk2-devel@ml01.01.org" <edk2-devel@ml01.01.org>
Subject: Re: [PATCH 0/2] Put AP into safe hlt-loop code on S3 path
Date: Thu, 10 Nov 2016 21:45:25 +0100
Message-ID: <ea028fd4-5859-19a6-9c2a-2b2503c07ba2@redhat.com>
In-Reply-To: <170e5527-057f-9bff-af43-5b131fc93335@redhat.com>

On 11/10/16 13:08, Laszlo Ersek wrote:
> On 11/10/16 12:17, Yao, Jiewen wrote:
>> Hi Laszlo
>>
>> Thanks for testing this for us.
>>
>> Are you saying Jeff’s patch introduces a new issue?
>>
>> Or is this a previous issue but just not fixed by Jeff’s patch?
> 
> With your v2 series applied, Jeff's patches replace the crash /
> emulation failure symptoms during S3 resume with less intrusive
> symptoms, namely that some of the APs cannot be brought up by the OS,
> occasionally.
> 
> Without your v2 series applied, Jeff's patches seem to present the same
> symptoms (OS cannot bring up some APs), although much less frequently.
> However, I cannot say definitively whether or not this exact same issue
> exists, on Ia32X64, with none of the patch sets applied. I haven't seen
> it before (on Ia32X64), but maybe I just haven't tried hard enough.
> 
> I guess I should try harder and see if the "lost AP" issue exists
> without either patch set applied.

With none of the patch sets applied, the "lost AP" issue doesn't exist;
instead, the emulation failure is experienced.

So, for Ia32X64,

                         Jeff's not applied           Jeff's applied
                         ---------------------------  -----------------
Jiewen's v2 not applied  emulation failure, rare      AP lost, rare
Jiewen's v2 applied      emulation failure, frequent  AP lost, frequent

Thanks
Laszlo

>>
>> Thank you
>>
>> Yao Jiewen
>>
>> *From:*Laszlo Ersek [mailto:lersek@redhat.com]
>> *Sent:* Thursday, November 10, 2016 6:41 PM
>> *To:* Fan, Jeff <jeff.fan@intel.com>
>> *Cc:* edk2-devel@ml01.01.org; Yao, Jiewen <jiewen.yao@intel.com>; Paolo
>> Bonzini <pbonzini@redhat.com>
>> *Subject:* Re: [edk2] [PATCH 0/2] Put AP into safe hlt-loop code on S3 path
>>
>> On 11/10/16 07:07, Jeff Fan wrote:
>>> On the S3 path, we wake up the APs in the PiSmmCpuDxeSmm driver to restore
>>> CPU context. If an NMI or SMI happens, the APs may exit from the hlt state
>>> and execute the instruction after the HLT instruction.
>>>
>>> But the APs are not running safe code at that point, which makes OVMF S3
>>> boot unstable.
>>>
>>> https://bugzilla.tianocore.org/show_bug.cgi?id=216
>>>
>>> I tested on a real platform with 64-bit DXE.
>>>
>>> Jeff Fan (2):
>>>   UefiCpuPkg/PiSmmCpuDxeSmm: Put AP into safe hlt-loop code on S3 path
>>>   UefiCpuPkg/PiSmmCpuDxeSmm: Place AP to 32bit protected mode on S3 path
>>>
>>>  UefiCpuPkg/PiSmmCpuDxeSmm/CpuS3.c             | 31 ++++++++++++++
>>>  UefiCpuPkg/PiSmmCpuDxeSmm/Ia32/SmmFuncsArch.c | 25 ++++++++++++
>>>  UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.h    | 13 ++++++
>>>  UefiCpuPkg/PiSmmCpuDxeSmm/X64/SmmFuncsArch.c  | 59 +++++++++++++++++++++++++++
>>>  4 files changed, 128 insertions(+)
>>>
>>
>> I applied this on top of Jiewen's v2, for testing.
>>
>> This series (with my addition for patch #1) doesn't fix the boot failure in case 8. (See "case 8" in <https://lists.01.org/pipermail/edk2-devel/2016-November/004316.html>.) I don't think the series aims to do that at all, but since it modifies the Ia32/SmmFuncsArch.c file, I thought I'd give it a shot.
>>
>> The series (with my addition for patch #1) changed the behavior of S3 resume, in case 13. There seem to be no crashes / emulation failures now. However, in some of the tries, the resume seems to include a several second long busy loop, and after that -- although the guest OS does come back up --, I cannot access *some* of the APs from within the OS:
>>
>> # this works, quickly
>> taskset -c 0 efibootmgr 
>>
>> # this fails
>> taskset -c 1 efibootmgr
>> taskset: failed to set pid 0's affinity: Invalid argument
>>
>> # these work again, albeit more slowly (as expected)
>> taskset -c 2 efibootmgr
>> taskset -c 3 efibootmgr
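
The per-CPU check above can also be scripted. What follows is only a minimal sketch, not something taken from the thread: probe_cpus and its set_affinity parameter are hypothetical names, and the setter is passed in so the logic can be tried without touching the real scheduler. On Linux, the real setter would be os.sched_setaffinity, which rejects an offline CPU with EINVAL, the same "Invalid argument" that taskset reports.

```python
import errno


def probe_cpus(cpu_ids, set_affinity):
    """Try to pin the calling process to each CPU in turn.

    Returns {cpu_id: None} for success, or {cpu_id: errno name} for failure.
    `set_affinity` is a hypothetical injected callable with the same
    (pid, cpu_set) signature as os.sched_setaffinity.
    """
    results = {}
    for cpu in cpu_ids:
        try:
            set_affinity(0, {cpu})  # pid 0 means "the calling process"
            results[cpu] = None
        except OSError as exc:
            # An offline CPU is rejected with EINVAL ("Invalid argument").
            results[cpu] = errno.errorcode.get(exc.errno, str(exc.errno))
    return results
```

In the guest, probe_cpus(range(4), os.sched_setaffinity) would presumably report a failure for CPU 1 only, matching the taskset output. Note that this changes the affinity of the calling process, so the original mask (os.sched_getaffinity(0)) should be saved and restored around the call.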
>>
>> I've seen this symptom ("AP goes lost during S3 resume") with the Ia32 SMM build before (without Jiewen's v2 series applied).
>>
>> If I run the "info cpus" QEMU command, I get:
>>
>> * CPU #0: pc=0xffffffff8105eb26 (halted) thread_id=22745
>>   CPU #1: pc=0x00000000fffffff0 thread_id=22746
>>   CPU #2: pc=0xffffffff8105eb26 (halted) thread_id=22747
>>   CPU #3: pc=0xffffffff8105eb26 (halted) thread_id=22748
>>
>> The halted status for #0, #2 and #3 is fine; that's just Linux at work. CPU#1 is strange -- not halted, but somehow stuck in the reset vector (0xfffffff0)?
>>
>> The guest kernel dmesg contains the following messages:
>>
>>> [   55.805153] PM: Restoring platform NVS memory
>>> [   55.805153] Enabling non-boot CPUs ...
>>> [   55.805153] x86: Booting SMP configuration:
>>> [   55.805516] smpboot: Booting Node 0 Processor 1 APIC 0x1
>>> [   65.816049] smpboot: do_boot_cpu failed(-1) to wakeup CPU#1 <- HERE
>>> [   65.816738] Error taking CPU1 up: -5
>>> [   65.817050] smpboot: Booting Node 0 Processor 2 APIC 0x2
>>> [   65.817029] kvm-clock: cpu 2, msr 1:7ffd6081, secondary cpu clock
>>> [   65.817029] kvm: enabling virtualization on CPU2
>>> [   65.832296] KVM setup async PF for cpu 2
>>> [   65.832607] kvm-stealtime: cpu 2, msr 17fd0e100
>>> [   65.833031] CPU2 is up
>>> [   65.833242] smpboot: Booting Node 0 Processor 3 APIC 0x3
>>> [   65.833229] kvm-clock: cpu 3, msr 1:7ffd60c1, secondary cpu clock
>>> [   65.833229] kvm: enabling virtualization on CPU3
>>> [   65.848594] KVM setup async PF for cpu 3
>>> [   65.848940] kvm-stealtime: cpu 3, msr 17fd8e100
>>> [   65.849393] CPU3 is up
>>> [   65.849722] ACPI: Waking up from system sleep state S3
>>
>> Note the 10 second gap where I put the marker (and the error message itself, too).
>>
>> Here's an excerpt from the KVM trace:
>>
>>>  CPU-23509 [002]  8406.908787: kvm_enter_smm:        vcpu 1: entering SMM, smbase 0x30000
>>>  CPU-23509 [002]  8406.908836: kvm_enter_smm:        vcpu 1: leaving SMM, smbase 0x7ffb3000
>>>  CPU-23510 [003]  8406.908850: kvm_enter_smm:        vcpu 2: entering SMM, smbase 0x30000
>>>  CPU-23510 [003]  8406.908881: kvm_enter_smm:        vcpu 2: leaving SMM, smbase 0x7ffb5000
>>>  CPU-23511 [001]  8406.908908: kvm_enter_smm:        vcpu 3: entering SMM, smbase 0x30000
>>>  CPU-23511 [001]  8406.908941: kvm_enter_smm:        vcpu 3: leaving SMM, smbase 0x7ffb7000
>>>  CPU-23508 [005]  8406.908951: kvm_enter_smm:        vcpu 0: entering SMM, smbase 0x30000
>>>  CPU-23508 [005]  8406.908989: kvm_enter_smm:        vcpu 0: leaving SMM, smbase 0x7ffb1000
>>>  CPU-23511 [001]  8406.920215: kvm_enter_smm:        vcpu 3: entering SMM, smbase 0x7ffb7000
>>>  CPU-23509 [002]  8406.920225: kvm_enter_smm:        vcpu 1: entering SMM, smbase 0x7ffb3000
>>>  CPU-23510 [003]  8406.920225: kvm_enter_smm:        vcpu 2: entering SMM, smbase 0x7ffb5000
>>>  CPU-23508 [005]  8406.920227: kvm_enter_smm:        vcpu 0: entering SMM, smbase 0x7ffb1000
>>>  CPU-23508 [005]  8406.920262: kvm_enter_smm:        vcpu 0: leaving SMM, smbase 0x7ffb1000
>>>  CPU-23511 [001]  8406.920263: kvm_enter_smm:        vcpu 3: leaving SMM, smbase 0x7ffb7000
>>>  CPU-23508 [005]  8407.020292: kvm_enter_smm:        vcpu 0: entering SMM, smbase 0x7ffb1000
>>>  CPU-23509 [006]  8407.020338: kvm_enter_smm:        vcpu 1: leaving SMM, smbase 0x7ffb3000
>>>  CPU-23510 [003]  8407.020338: kvm_enter_smm:        vcpu 2: leaving SMM, smbase 0x7ffb5000
>>>  CPU-23508 [005]  8407.020338: kvm_enter_smm:        vcpu 0: leaving SMM, smbase 0x7ffb1000
>>
>> It seems that VCPU#0 still leaves (and then re-enters) SMM while VCPU#1 and VCPU#2 are firmly in SMM.
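
To make that observation easier to check on longer traces, the enter/leave events can be paired up per VCPU. This is a rough sketch only: smm_intervals and longest_stay are hypothetical helper names, and the regular expression assumes the trace lines look exactly like the excerpt above.

```python
import re
from decimal import Decimal

# Matches kvm_enter_smm trace lines in the format shown in the excerpt.
EVENT = re.compile(
    r"(\d+\.\d+): kvm_enter_smm:\s+vcpu (\d+): (entering|leaving) SMM"
)


def smm_intervals(lines):
    """Return {vcpu: [(enter_time, leave_time), ...]} from trace lines.

    A vcpu still inside SMM at the end of the input gets leave_time None.
    """
    open_since = {}  # vcpu -> timestamp of the unmatched "entering" event
    intervals = {}   # vcpu -> list of (enter, leave) pairs
    for line in lines:
        match = EVENT.search(line)
        if match is None:
            continue  # skip quoting noise and unrelated trace events
        stamp, vcpu, action = Decimal(match[1]), int(match[2]), match[3]
        if action == "entering":
            open_since[vcpu] = stamp
        elif vcpu in open_since:
            intervals.setdefault(vcpu, []).append((open_since.pop(vcpu), stamp))
    for vcpu, stamp in open_since.items():
        intervals.setdefault(vcpu, []).append((stamp, None))
    return intervals


def longest_stay(intervals):
    """Return {vcpu: longest completed SMM residency, in seconds}."""
    return {
        vcpu: max((end - start for start, end in spans if end is not None),
                  default=Decimal(0))
        for vcpu, spans in intervals.items()
    }
```

Fed the excerpt above, longest_stay(smm_intervals(lines)) gives about 0.1 s for VCPU#1 and VCPU#2, versus well under a millisecond for VCPU#0 and VCPU#3.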
>>
>> So this series is a clear improvement, but something else remains amiss.
>>
>> If I remove Jiewen's v2 series, and apply only this one, then the symptom shows up much less frequently, but it does exist:
>> - With (Jiewen's v2 + this one), testing case 13, I hit the symptom on the second resume,
>> - With just this set applied, I hit the symptom (= one AP disappearing from Linux after resume) only on the 24th resume.
>>
>> Thanks
>> Laszlo
>>
> 