From: Laszlo Ersek <lersek@redhat.com>
To: "Yao, Jiewen" <jiewen.yao@intel.com>, "Fan, Jeff" <jeff.fan@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
"edk2-devel@ml01.01.org" <edk2-devel@ml01.01.org>
Subject: Re: [PATCH 0/2] Put AP into safe hlt-loop code on S3 path
Date: Thu, 10 Nov 2016 21:45:25 +0100
Message-ID: <ea028fd4-5859-19a6-9c2a-2b2503c07ba2@redhat.com>
In-Reply-To: <170e5527-057f-9bff-af43-5b131fc93335@redhat.com>
On 11/10/16 13:08, Laszlo Ersek wrote:
> On 11/10/16 12:17, Yao, Jiewen wrote:
>> Hi Laszlo
>>
>> Thanks for testing this for us.
>>
>>
>>
>> Are you saying Jeff’s patch introduces a new issue?
>>
>> Or is this a previous issue but just not fixed by Jeff’s patch?
>
> With your v2 series applied, Jeff's patches replace the crash /
> emulation failure symptoms during S3 resume with less intrusive
> symptoms, namely that some of the APs cannot be brought up by the OS,
> occasionally.
>
> Without your v2 series applied, Jeff's patches seem to present the same
> symptoms (OS cannot bring up some APs), although much less frequently.
> However, I cannot say definitively whether this exact same issue
> exists, on Ia32X64, with none of the patch sets applied. I haven't seen
> it before (on Ia32X64), but maybe I just haven't tried hard enough.
>
> I guess I should try harder and see if the "lost AP" issue exists
> without either patch set applied.

With none of the patch sets applied, the "lost AP" issue doesn't exist;
instead, the emulation failure is experienced.

So, for Ia32X64,

                         Jeff's not applied           Jeff's applied
                         ---------------------------  -----------------
Jiewen's v2 not applied  emulation failure, rare      AP lost, rare
Jiewen's v2 applied      emulation failure, frequent  AP lost, frequent

Thanks
Laszlo

>>
>>
>>
>>
>>
>> Thank you
>>
>> Yao Jiewen
>>
>>
>>
>> *From:*Laszlo Ersek [mailto:lersek@redhat.com]
>> *Sent:* Thursday, November 10, 2016 6:41 PM
>> *To:* Fan, Jeff <jeff.fan@intel.com>
>> *Cc:* edk2-devel@ml01.01.org; Yao, Jiewen <jiewen.yao@intel.com>; Paolo
>> Bonzini <pbonzini@redhat.com>
>> *Subject:* Re: [edk2] [PATCH 0/2] Put AP into safe hlt-loop code on S3 path
>>
>>
>>
>> On 11/10/16 07:07, Jeff Fan wrote:
>>> On S3 path, we will wake up APs to restore CPU context in PiSmmCpuDxeSmm
>>> driver. In case, one NMI or SMI happens, APs may exit from hlt state and
>>> execute the instruction after HLT instruction.
>>>
>>> But APs are not running on safe code, it leads OVMF S3 boot unstable.
>>>
>>> https://bugzilla.tianocore.org/show_bug.cgi?id=216
>>>
>>> I tested real platform with 64bit DXE.
>>>
>>> Jeff Fan (2):
>>> UefiCpuPkg/PiSmmCpuDxeSmm: Put AP into safe hlt-loop code on S3 path
>>> UefiCpuPkg/PiSmmCpuDxeSmm: Place AP to 32bit protected mode on S3 path
>>>
>>> UefiCpuPkg/PiSmmCpuDxeSmm/CpuS3.c | 31 ++++++++++++++
>>> UefiCpuPkg/PiSmmCpuDxeSmm/Ia32/SmmFuncsArch.c | 25 ++++++++++++
>>> UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.h | 13 ++++++
>>> UefiCpuPkg/PiSmmCpuDxeSmm/X64/SmmFuncsArch.c | 59 +++++++++++++++++++++++++++
>>> 4 files changed, 128 insertions(+)
>>>
>>
>> I applied this on top of Jiewen's v2, for testing.
>>
>> This series (with my addition for patch #1) doesn't fix the boot failure in case 8. (See "case 8" in <https://lists.01.org/pipermail/edk2-devel/2016-November/004316.html>.) I don't think the series aims to do that at all, but since it modifies the Ia32/SmmFuncsArch.c file, I thought I'd give it a shot.
>>
>> The series (with my addition for patch #1) changed the behavior of S3 resume, in case 13. There seem to be no crashes / emulation failures now. However, in some of the tries, the resume seems to include a several second long busy loop, and after that -- although the guest OS does come back up --, I cannot access *some* of the APs from within the OS:
>>
>> # this works, quickly
>> taskset -c 0 efibootmgr
>>
>> # this fails
>> taskset -c 1 efibootmgr
>> taskset: failed to set pid 0's affinity: Invalid argument
>>
>> # these work again, albeit more slowly (as expected)
>> taskset -c 2 efibootmgr
>> taskset -c 3 efibootmgr
>>
>> I've seen this symptom ("AP goes lost during S3 resume") with the Ia32 SMM build before (without Jiewen's v2 series applied).
>>
>> If I run the "info cpus" QEMU command, I get:
>>
>> * CPU #0: pc=0xffffffff8105eb26 (halted) thread_id=22745
>> CPU #1: pc=0x00000000fffffff0 thread_id=22746
>> CPU #2: pc=0xffffffff8105eb26 (halted) thread_id=22747
>> CPU #3: pc=0xffffffff8105eb26 (halted) thread_id=22748
>>
>> The halted status for #0, #2 and #3 is fine; that's just Linux at work. CPU#1 is strange -- not halted, but somehow stuck in the reset vector (0xfffffff0)?
>>
>> The guest kernel dmesg contains the following messages:
>>
>>> [ 55.805153] PM: Restoring platform NVS memory
>>> [ 55.805153] Enabling non-boot CPUs ...
>>> [ 55.805153] x86: Booting SMP configuration:
>>> [ 55.805516] smpboot: Booting Node 0 Processor 1 APIC 0x1
>>> [ 65.816049] smpboot: do_boot_cpu failed(-1) to wakeup CPU#1 <- HERE
>>> [ 65.816738] Error taking CPU1 up: -5
>>> [ 65.817050] smpboot: Booting Node 0 Processor 2 APIC 0x2
>>> [ 65.817029] kvm-clock: cpu 2, msr 1:7ffd6081, secondary cpu clock
>>> [ 65.817029] kvm: enabling virtualization on CPU2
>>> [ 65.832296] KVM setup async PF for cpu 2
>>> [ 65.832607] kvm-stealtime: cpu 2, msr 17fd0e100
>>> [ 65.833031] CPU2 is up
>>> [ 65.833242] smpboot: Booting Node 0 Processor 3 APIC 0x3
>>> [ 65.833229] kvm-clock: cpu 3, msr 1:7ffd60c1, secondary cpu clock
>>> [ 65.833229] kvm: enabling virtualization on CPU3
>>> [ 65.848594] KVM setup async PF for cpu 3
>>> [ 65.848940] kvm-stealtime: cpu 3, msr 17fd8e100
>>> [ 65.849393] CPU3 is up
>>> [ 65.849722] ACPI: Waking up from system sleep state S3
>>
>> Note the 10 second gap where I put the marker (and the error message itself, too).
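
A gap like this can also be spotted mechanically by comparing consecutive dmesg timestamps. The following is only a minimal sketch, not something used in the testing above: the helper name find_gaps and the one-second threshold are assumptions made for illustration, and the sample lines are copied from the log above without the marker.

```python
import re

# Matches the "[   65.816049] message" prefix that the kernel prints.
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]\s*(.*)")


def find_gaps(lines, threshold=1.0):
    """Return (gap_seconds, message) for each line that follows a long pause.

    threshold is a hypothetical cut-off in seconds; lines without a
    timestamp are skipped.
    """
    gaps = []
    prev = None
    for line in lines:
        m = STAMP.search(line)
        if not m:
            continue
        stamp = float(m.group(1))
        # dmesg timestamps are not strictly monotonic across CPUs, so a
        # small negative difference is simply ignored here.
        if prev is not None and stamp - prev >= threshold:
            gaps.append((round(stamp - prev, 6), m.group(2)))
        prev = stamp
    return gaps


log = """\
[ 55.805516] smpboot: Booting Node 0 Processor 1 APIC 0x1
[ 65.816049] smpboot: do_boot_cpu failed(-1) to wakeup CPU#1
[ 65.816738] Error taking CPU1 up: -5
[ 65.817050] smpboot: Booting Node 0 Processor 2 APIC 0x2
""".splitlines()

for gap, msg in find_gaps(log):
    print(f"{gap:.3f}s before: {msg}")
```

On the excerpt, the only line reported is the do_boot_cpu failure, which follows a pause of about ten seconds.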
>>
>> Here's an excerpt from the KVM trace:
>>
>>> CPU-23509 [002] 8406.908787: kvm_enter_smm: vcpu 1: entering SMM, smbase 0x30000
>>> CPU-23509 [002] 8406.908836: kvm_enter_smm: vcpu 1: leaving SMM, smbase 0x7ffb3000
>>> CPU-23510 [003] 8406.908850: kvm_enter_smm: vcpu 2: entering SMM, smbase 0x30000
>>> CPU-23510 [003] 8406.908881: kvm_enter_smm: vcpu 2: leaving SMM, smbase 0x7ffb5000
>>> CPU-23511 [001] 8406.908908: kvm_enter_smm: vcpu 3: entering SMM, smbase 0x30000
>>> CPU-23511 [001] 8406.908941: kvm_enter_smm: vcpu 3: leaving SMM, smbase 0x7ffb7000
>>> CPU-23508 [005] 8406.908951: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x30000
>>> CPU-23508 [005] 8406.908989: kvm_enter_smm: vcpu 0: leaving SMM, smbase 0x7ffb1000
>>> CPU-23511 [001] 8406.920215: kvm_enter_smm: vcpu 3: entering SMM, smbase 0x7ffb7000
>>> CPU-23509 [002] 8406.920225: kvm_enter_smm: vcpu 1: entering SMM, smbase 0x7ffb3000
>>> CPU-23510 [003] 8406.920225: kvm_enter_smm: vcpu 2: entering SMM, smbase 0x7ffb5000
>>> CPU-23508 [005] 8406.920227: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x7ffb1000
>>> CPU-23508 [005] 8406.920262: kvm_enter_smm: vcpu 0: leaving SMM, smbase 0x7ffb1000
>>> CPU-23511 [001] 8406.920263: kvm_enter_smm: vcpu 3: leaving SMM, smbase 0x7ffb7000
>>> CPU-23508 [005] 8407.020292: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x7ffb1000
>>> CPU-23509 [006] 8407.020338: kvm_enter_smm: vcpu 1: leaving SMM, smbase 0x7ffb3000
>>> CPU-23510 [003] 8407.020338: kvm_enter_smm: vcpu 2: leaving SMM, smbase 0x7ffb5000
>>> CPU-23508 [005] 8407.020338: kvm_enter_smm: vcpu 0: leaving SMM, smbase 0x7ffb1000
>>
>> It seems that VCPU#0 still leaves (and then re-enters) SMM while VCPU#1 and VCPU#2 are firmly in SMM.
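
To make that reading of the trace easier to check, the entering/leaving pairs can be turned into per-VCPU SMM residency times. This is a rough sketch under stated assumptions: the function name smm_stays is made up for illustration, the regular expression only covers the line format seen in the excerpt above, and the sample is the second half of that excerpt.

```python
import re

# Only the fields needed here: timestamp, vcpu index, and direction.
EVENT = re.compile(
    r"(\d+\.\d+): kvm_enter_smm: vcpu (\d+): (entering|leaving) SMM"
)


def smm_stays(lines):
    """Return {vcpu: [seconds spent in SMM for each completed stay]}."""
    entered = {}  # vcpu -> timestamp of the pending "entering" event
    stays = {}
    for line in lines:
        m = EVENT.search(line)
        if not m:
            continue
        stamp, vcpu, action = float(m.group(1)), int(m.group(2)), m.group(3)
        if action == "entering":
            entered[vcpu] = stamp
        elif vcpu in entered:
            # A "leaving" event closes the pending stay for this vcpu.
            stays.setdefault(vcpu, []).append(stamp - entered.pop(vcpu))
    return stays


trace = """\
CPU-23511 [001] 8406.920215: kvm_enter_smm: vcpu 3: entering SMM, smbase 0x7ffb7000
CPU-23509 [002] 8406.920225: kvm_enter_smm: vcpu 1: entering SMM, smbase 0x7ffb3000
CPU-23510 [003] 8406.920225: kvm_enter_smm: vcpu 2: entering SMM, smbase 0x7ffb5000
CPU-23508 [005] 8406.920227: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x7ffb1000
CPU-23508 [005] 8406.920262: kvm_enter_smm: vcpu 0: leaving SMM, smbase 0x7ffb1000
CPU-23511 [001] 8406.920263: kvm_enter_smm: vcpu 3: leaving SMM, smbase 0x7ffb7000
CPU-23508 [005] 8407.020292: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x7ffb1000
CPU-23509 [006] 8407.020338: kvm_enter_smm: vcpu 1: leaving SMM, smbase 0x7ffb3000
CPU-23510 [003] 8407.020338: kvm_enter_smm: vcpu 2: leaving SMM, smbase 0x7ffb5000
CPU-23508 [005] 8407.020338: kvm_enter_smm: vcpu 0: leaving SMM, smbase 0x7ffb1000
""".splitlines()

for vcpu, durations in sorted(smm_stays(trace).items()):
    print(f"vcpu {vcpu}: longest stay {max(durations) * 1000:.3f} ms")
```

On this excerpt, VCPU#1 and VCPU#2 each show a single stay of roughly 100 ms, while VCPU#0 and VCPU#3 only show stays well under a millisecond.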
>>
>> So this series is a clear improvement, but something else remains amiss.
>>
>> If I remove Jiewen's v2 series, and apply only this one, then the symptom shows up much less frequently, but it does exist:
>> - With (Jiewen's v2 + this one), testing case 13, I hit the symptom on the second resume,
>> - With just this set applied, I hit the symptom (= one AP disappearing from Linux after resume) only on the 24th resume.
>>
>> Thanks
>> Laszlo
>>