Re: [PATCH 0/2] Put AP into safe hlt-loop code on S3 path

From: Laszlo Ersek <lersek@redhat.com>
To: "Yao, Jiewen" <jiewen.yao@intel.com>, "Fan, Jeff" <jeff.fan@intel.com>
Cc: "edk2-devel@ml01.01.org" <edk2-devel@ml01.01.org>,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [PATCH 0/2] Put AP into safe hlt-loop code on S3 path
Date: Thu, 10 Nov 2016 13:08:25 +0100
Message-ID: <170e5527-057f-9bff-af43-5b131fc93335@redhat.com>
In-Reply-To: <74D8A39837DF1E4DA445A8C0B3885C50386CE428@shsmsx102.ccr.corp.intel.com>

On 11/10/16 12:17, Yao, Jiewen wrote:
> Hi Laszlo
> 
> Thanks for testing this for us.
> 
> Are you saying Jeff’s patch introduces a new issue?
> 
> Or is this a previous issue but just not fixed by Jeff’s patch?

With your v2 series applied, Jeff's patches replace the crash /
emulation failure symptoms during S3 resume with less intrusive
symptoms, namely that some of the APs cannot be brought up by the OS,
occasionally.

Without your v2 series applied, Jeff's patches seem to present the same
symptoms (OS cannot bring up some APs), although much less frequently.
However, I cannot say definitively whether or not this exact same issue
exists, on Ia32X64, with none of the patch sets applied. I haven't seen
it before (on Ia32X64), but maybe I just haven't tried hard enough.

I guess I should try harder and see if the "lost AP" issue exists
without either patch set applied.

Thanks
Laszlo


> Thank you
> 
> Yao Jiewen
> 
> *From:*Laszlo Ersek [mailto:lersek@redhat.com]
> *Sent:* Thursday, November 10, 2016 6:41 PM
> *To:* Fan, Jeff <jeff.fan@intel.com>
> *Cc:* edk2-devel@ml01.01.org; Yao, Jiewen <jiewen.yao@intel.com>; Paolo
> Bonzini <pbonzini@redhat.com>
> *Subject:* Re: [edk2] [PATCH 0/2] Put AP into safe hlt-loop code on S3 path
> 
>  
> 
> On 11/10/16 07:07, Jeff Fan wrote:
>> On the S3 path, we wake up the APs to restore CPU context in the
>> PiSmmCpuDxeSmm driver. If an NMI or SMI occurs, the APs may exit from
>> the hlt state and execute the instruction after the HLT instruction.
>> 
>> But the APs are not running safe code at that point, which makes OVMF
>> S3 boot unstable.
>> 
>> https://bugzilla.tianocore.org/show_bug.cgi?id=216
>> 
>> I tested on a real platform with 64-bit DXE.
>> 
>> Jeff Fan (2):
>>   UefiCpuPkg/PiSmmCpuDxeSmm: Put AP into safe hlt-loop code on S3 path
>>   UefiCpuPkg/PiSmmCpuDxeSmm: Place AP to 32bit protected mode on S3 path
>> 
>>  UefiCpuPkg/PiSmmCpuDxeSmm/CpuS3.c             | 31 ++++++++++++++
>>  UefiCpuPkg/PiSmmCpuDxeSmm/Ia32/SmmFuncsArch.c | 25 ++++++++++++
>>  UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.h    | 13 ++++++
>>  UefiCpuPkg/PiSmmCpuDxeSmm/X64/SmmFuncsArch.c  | 59 +++++++++++++++++++++++++++
>>  4 files changed, 128 insertions(+)
>> 
> 
> I applied this on top of Jiewen's v2, for testing.
> 
> This series (with my addition for patch #1) doesn't fix the boot failure in case 8. (See "case 8" in <https://lists.01.org/pipermail/edk2-devel/2016-November/004316.html>.) I don't think the series aims to do that at all, but since it modifies the Ia32/SmmFuncsArch.c file, I thought I'd give it a shot.
> 
> The series (with my addition for patch #1) changed the behavior of S3 resume, in case 13. There seem to be no crashes / emulation failures now. However, in some of the tries, the resume seems to include a several second long busy loop, and after that -- although the guest OS does come back up --, I cannot access *some* of the APs from within the OS:
> 
> # this works, quickly
> taskset -c 0 efibootmgr 
> 
> # this fails
> taskset -c 1 efibootmgr
> taskset: failed to set pid 0's affinity: Invalid argument
> 
> # these work again, albeit more slowly (as expected)
> taskset -c 2 efibootmgr
> taskset -c 3 efibootmgr
> 
> I've seen this symptom ("AP goes lost during S3 resume") with the Ia32 SMM build before (without Jiewen's v2 series applied).
> 
> If I run the "info cpus" QEMU command, I get:
> 
> * CPU #0: pc=0xffffffff8105eb26 (halted) thread_id=22745
>   CPU #1: pc=0x00000000fffffff0 thread_id=22746
>   CPU #2: pc=0xffffffff8105eb26 (halted) thread_id=22747
>   CPU #3: pc=0xffffffff8105eb26 (halted) thread_id=22748
> 
> The halted status for #0, #2 and #3 is fine; that's just Linux at work. CPU#1 is strange -- not halted, but somehow stuck in the reset vector (0xfffffff0)?
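The check described above (a CPU that is not halted and whose pc sits at the reset vector) can be sketched as a small parser. This is only an illustration: the regular expression matches the "info cpus" line format shown in this message and may not fit other QEMU versions, and the helper names are made up for the example.

```python
import re

# Sample copied from the "info cpus" output quoted above.
INFO_CPUS = """\
* CPU #0: pc=0xffffffff8105eb26 (halted) thread_id=22745
  CPU #1: pc=0x00000000fffffff0 thread_id=22746
  CPU #2: pc=0xffffffff8105eb26 (halted) thread_id=22747
  CPU #3: pc=0xffffffff8105eb26 (halted) thread_id=22748
"""

# x86 reset vector, as mentioned in the text (0xfffffff0).
RESET_VECTOR = 0xFFFFFFF0

# Assumption: this pattern fits only the output format shown above.
LINE_RE = re.compile(
    r"CPU #(\d+): pc=(0x[0-9a-fA-F]+)( \(halted\))? thread_id=(\d+)"
)


def parse_info_cpus(text):
    """Return one dict per CPU line found in the monitor output."""
    cpus = []
    for m in LINE_RE.finditer(text):
        cpus.append({
            "index": int(m.group(1)),
            "pc": int(m.group(2), 16),
            "halted": m.group(3) is not None,
            "thread_id": int(m.group(4)),
        })
    return cpus


def stuck_at_reset_vector(cpus):
    """Indices of CPUs that are running (not halted) at the reset vector."""
    return [c["index"] for c in cpus
            if not c["halted"] and c["pc"] == RESET_VECTOR]


print(stuck_at_reset_vector(parse_info_cpus(INFO_CPUS)))
```

On the sample above this flags only CPU #1, matching the observation in the text.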
> 
> The guest kernel dmesg contains the following messages:
> 
>> [   55.805153] PM: Restoring platform NVS memory
>> [   55.805153] Enabling non-boot CPUs ...
>> [   55.805153] x86: Booting SMP configuration:
>> [   55.805516] smpboot: Booting Node 0 Processor 1 APIC 0x1
>> [   65.816049] smpboot: do_boot_cpu failed(-1) to wakeup CPU#1 <- HERE
>> [   65.816738] Error taking CPU1 up: -5
>> [   65.817050] smpboot: Booting Node 0 Processor 2 APIC 0x2
>> [   65.817029] kvm-clock: cpu 2, msr 1:7ffd6081, secondary cpu clock
>> [   65.817029] kvm: enabling virtualization on CPU2
>> [   65.832296] KVM setup async PF for cpu 2
>> [   65.832607] kvm-stealtime: cpu 2, msr 17fd0e100
>> [   65.833031] CPU2 is up
>> [   65.833242] smpboot: Booting Node 0 Processor 3 APIC 0x3
>> [   65.833229] kvm-clock: cpu 3, msr 1:7ffd60c1, secondary cpu clock
>> [   65.833229] kvm: enabling virtualization on CPU3
>> [   65.848594] KVM setup async PF for cpu 3
>> [   65.848940] kvm-stealtime: cpu 3, msr 17fd8e100
>> [   65.849393] CPU3 is up
>> [   65.849722] ACPI: Waking up from system sleep state S3
> 
> Note the 10 second gap where I put the marker (and the error message itself, too).
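A gap like this can also be located mechanically by comparing consecutive dmesg timestamps. The following is a rough sketch, not a tool used in this thread: the sample is a subset of the log quoted above (with the " <- HERE" marker dropped), and the function name and the one-second threshold are arbitrary choices for the example.

```python
import re

# Subset of the guest dmesg quoted above.
DMESG = """\
[   55.805153] Enabling non-boot CPUs ...
[   55.805153] x86: Booting SMP configuration:
[   55.805516] smpboot: Booting Node 0 Processor 1 APIC 0x1
[   65.816049] smpboot: do_boot_cpu failed(-1) to wakeup CPU#1
[   65.816738] Error taking CPU1 up: -5
[   65.817050] smpboot: Booting Node 0 Processor 2 APIC 0x2
"""

# "[   55.805153] message" -> (seconds, message)
STAMP_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s(.*)$")


def find_gaps(text, threshold=1.0):
    """Return (previous message, next message, delta in seconds) for each
    pair of consecutive lines whose timestamps differ by >= threshold."""
    gaps = []
    prev = None
    for line in text.splitlines():
        m = STAMP_RE.match(line)
        if m is None:
            continue  # skip lines without a timestamp prefix
        stamp, msg = float(m.group(1)), m.group(2)
        if prev is not None and stamp - prev[0] >= threshold:
            gaps.append((prev[1], msg, stamp - prev[0]))
        prev = (stamp, msg)
    return gaps


for before, after, delta in find_gaps(DMESG):
    print(f"{delta:.3f}s between {before!r} and {after!r}")
```

On this sample the only gap reported is the roughly 10 second one that ends at the do_boot_cpu failure for CPU#1.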
> 
> Here's an excerpt from the KVM trace:
> 
>>  CPU-23509 [002]  8406.908787: kvm_enter_smm:        vcpu 1: entering SMM, smbase 0x30000
>>  CPU-23509 [002]  8406.908836: kvm_enter_smm:        vcpu 1: leaving SMM, smbase 0x7ffb3000
>>  CPU-23510 [003]  8406.908850: kvm_enter_smm:        vcpu 2: entering SMM, smbase 0x30000
>>  CPU-23510 [003]  8406.908881: kvm_enter_smm:        vcpu 2: leaving SMM, smbase 0x7ffb5000
>>  CPU-23511 [001]  8406.908908: kvm_enter_smm:        vcpu 3: entering SMM, smbase 0x30000
>>  CPU-23511 [001]  8406.908941: kvm_enter_smm:        vcpu 3: leaving SMM, smbase 0x7ffb7000
>>  CPU-23508 [005]  8406.908951: kvm_enter_smm:        vcpu 0: entering SMM, smbase 0x30000
>>  CPU-23508 [005]  8406.908989: kvm_enter_smm:        vcpu 0: leaving SMM, smbase 0x7ffb1000
>>  CPU-23511 [001]  8406.920215: kvm_enter_smm:        vcpu 3: entering SMM, smbase 0x7ffb7000
>>  CPU-23509 [002]  8406.920225: kvm_enter_smm:        vcpu 1: entering SMM, smbase 0x7ffb3000
>>  CPU-23510 [003]  8406.920225: kvm_enter_smm:        vcpu 2: entering SMM, smbase 0x7ffb5000
>>  CPU-23508 [005]  8406.920227: kvm_enter_smm:        vcpu 0: entering SMM, smbase 0x7ffb1000
>>  CPU-23508 [005]  8406.920262: kvm_enter_smm:        vcpu 0: leaving SMM, smbase 0x7ffb1000
>>  CPU-23511 [001]  8406.920263: kvm_enter_smm:        vcpu 3: leaving SMM, smbase 0x7ffb7000
>>  CPU-23508 [005]  8407.020292: kvm_enter_smm:        vcpu 0: entering SMM, smbase 0x7ffb1000
>>  CPU-23509 [006]  8407.020338: kvm_enter_smm:        vcpu 1: leaving SMM, smbase 0x7ffb3000
>>  CPU-23510 [003]  8407.020338: kvm_enter_smm:        vcpu 2: leaving SMM, smbase 0x7ffb5000
>>  CPU-23508 [005]  8407.020338: kvm_enter_smm:        vcpu 0: leaving SMM, smbase 0x7ffb1000
> 
> It seems that VCPU#0 still leaves (and then re-enters) SMM while VCPU#1 and VCPU#2 are firmly in SMM.
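To make that reading of the trace easier to check, the enter/leave events can be replayed while tracking which VCPUs are inside SMM after each event. This is a minimal sketch under a few assumptions: it uses only the second half of the excerpt above (starting where all four VCPUs enter at the relocated SMBASE, after each had completed its initial enter/leave pair), the regular expression fits only the line format shown here, and the function name is invented for the example.

```python
import re

# Second half of the KVM trace excerpt quoted above.
TRACE = """\
 CPU-23511 [001]  8406.920215: kvm_enter_smm:        vcpu 3: entering SMM, smbase 0x7ffb7000
 CPU-23509 [002]  8406.920225: kvm_enter_smm:        vcpu 1: entering SMM, smbase 0x7ffb3000
 CPU-23510 [003]  8406.920225: kvm_enter_smm:        vcpu 2: entering SMM, smbase 0x7ffb5000
 CPU-23508 [005]  8406.920227: kvm_enter_smm:        vcpu 0: entering SMM, smbase 0x7ffb1000
 CPU-23508 [005]  8406.920262: kvm_enter_smm:        vcpu 0: leaving SMM, smbase 0x7ffb1000
 CPU-23511 [001]  8406.920263: kvm_enter_smm:        vcpu 3: leaving SMM, smbase 0x7ffb7000
 CPU-23508 [005]  8407.020292: kvm_enter_smm:        vcpu 0: entering SMM, smbase 0x7ffb1000
 CPU-23509 [006]  8407.020338: kvm_enter_smm:        vcpu 1: leaving SMM, smbase 0x7ffb3000
 CPU-23510 [003]  8407.020338: kvm_enter_smm:        vcpu 2: leaving SMM, smbase 0x7ffb5000
 CPU-23508 [005]  8407.020338: kvm_enter_smm:        vcpu 0: leaving SMM, smbase 0x7ffb1000
"""

EVENT_RE = re.compile(
    r"(\d+\.\d+): kvm_enter_smm:\s+vcpu (\d+): (entering|leaving) SMM"
)


def replay_smm(trace):
    """Replay the events in order; for each one, record
    (timestamp, vcpu, action, sorted list of VCPUs inside SMM afterwards)."""
    inside = set()
    events = []
    for line in trace.splitlines():
        m = EVENT_RE.search(line)
        if m is None:
            continue
        stamp, vcpu, action = m.group(1), int(m.group(2)), m.group(3)
        if action == "entering":
            inside.add(vcpu)
        else:
            inside.discard(vcpu)
        events.append((stamp, vcpu, action, sorted(inside)))
    return events


for stamp, vcpu, action, inside in replay_smm(TRACE):
    print(f"{stamp} vcpu {vcpu} {action:8s} -> in SMM: {inside}")
```

Replaying the excerpt shows VCPU#0 leaving at 8406.920262 and re-entering at 8407.020292 while VCPU#1 and VCPU#2 stay inside SMM the whole time.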
> 
> So this series is a clear improvement, but something else remains amiss.
> 
> If I remove Jiewen's v2 series, and apply only this one, then the symptom shows up much less frequently, but it does exist:
> - With (Jiewen's v2 + this one), testing case 13, I hit the symptom on the second resume,
> - With just this set applied, I hit the symptom (= one AP disappearing from Linux after resume) only on the 24th resume.
> 
> Thanks
> Laszlo
>
