Re: [PATCH 0/2] Put AP into safe hlt-loop code on S3 path

From: Laszlo Ersek <lersek@redhat.com>
To: Jeff Fan <jeff.fan@intel.com>
Cc: edk2-devel@ml01.01.org, Jiewen Yao <jiewen.yao@intel.com>,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [PATCH 0/2] Put AP into safe hlt-loop code on S3 path
Date: Thu, 10 Nov 2016 11:41:27 +0100
Message-ID: <0528a12e-3755-99cb-861a-ac927d484ec1@redhat.com>
In-Reply-To: <20161110060708.13932-1-jeff.fan@intel.com>

On 11/10/16 07:07, Jeff Fan wrote:
> On S3 path, we will wake up APs to restore CPU context in PiSmmCpuDxeSmm
> driver. In case, one NMI or SMI happens, APs may exit from hlt state and
> execute the instruction after HLT instruction.
> 
> But APs are not running on safe code, it leads OVMF S3 boot unstable.
> 
> https://bugzilla.tianocore.org/show_bug.cgi?id=216
> 
> I tested real platform with 64bit DXE.
> 
> Jeff Fan (2):
>   UefiCpuPkg/PiSmmCpuDxeSmm: Put AP into safe hlt-loop code on S3 path
>   UefiCpuPkg/PiSmmCpuDxeSmm: Place AP to 32bit protected mode on S3 path
> 
>  UefiCpuPkg/PiSmmCpuDxeSmm/CpuS3.c             | 31 ++++++++++++++
>  UefiCpuPkg/PiSmmCpuDxeSmm/Ia32/SmmFuncsArch.c | 25 ++++++++++++
>  UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.h    | 13 ++++++
>  UefiCpuPkg/PiSmmCpuDxeSmm/X64/SmmFuncsArch.c  | 59 +++++++++++++++++++++++++++
>  4 files changed, 128 insertions(+)
> 

I applied this on top of Jiewen's v2, for testing.

This series (with my addition for patch #1) doesn't fix the boot failure in case 8. (See "case 8" in <https://lists.01.org/pipermail/edk2-devel/2016-November/004316.html>.) I don't think the series aims to do that at all, but since it modifies the Ia32/SmmFuncsArch.c file, I thought I'd give it a shot.

The series (with my addition for patch #1) changed the behavior of S3 resume in case 13. There seem to be no crashes or emulation failures now. However, in some of the tries, the resume seems to include a busy loop lasting several seconds, and after that, although the guest OS does come back up, I cannot access *some* of the APs from within the OS:

# this works, quickly
taskset -c 0 efibootmgr 

# this fails
taskset -c 1 efibootmgr
taskset: failed to set pid 0's affinity: Invalid argument

# these work again, albeit more slowly (as expected)
taskset -c 2 efibootmgr
taskset -c 3 efibootmgr
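
The same per-CPU probe can be scripted rather than typed by hand. The following is only a minimal sketch: `unreachable_cpus` and its `pin` parameter are names invented for this illustration. It relies on the fact that Python's `os.sched_setaffinity()` (Linux-only) raises `OSError` when a CPU cannot be used, which corresponds to the "Invalid argument" error printed by taskset above.

```python
import os


def unreachable_cpus(cpu_ids, pin=None):
    """Return the CPUs from cpu_ids that the current process cannot be pinned to.

    `pin` is a hypothetical hook for substituting the pinning call (for
    example in a dry run); by default the real Linux syscall wrapper is used.
    """
    use_real = pin is None
    if use_real:
        saved = os.sched_getaffinity(0)  # remember the original mask
        pin = lambda cpu: os.sched_setaffinity(0, {cpu})
    failed = []
    try:
        for cpu in cpu_ids:
            try:
                pin(cpu)
            except OSError:
                # e.g. EINVAL when the CPU did not come back online
                failed.append(cpu)
    finally:
        if use_real:
            os.sched_setaffinity(0, saved)  # restore the original mask
    return failed
```

Calling `unreachable_cpus(range(4))` inside the guest after a resume would be expected to return the list of lost APs, assuming the guest has four VCPUs as in this test.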

I've seen this symptom ("AP gets lost during S3 resume") with the Ia32 SMM build before (without Jiewen's v2 series applied).

If I run the "info cpus" QEMU command, I get:

* CPU #0: pc=0xffffffff8105eb26 (halted) thread_id=22745
  CPU #1: pc=0x00000000fffffff0 thread_id=22746
  CPU #2: pc=0xffffffff8105eb26 (halted) thread_id=22747
  CPU #3: pc=0xffffffff8105eb26 (halted) thread_id=22748

The halted status for #0, #2 and #3 is fine; that's just Linux at work. CPU#1 is strange -- not halted, but somehow stuck in the reset vector (0xfffffff0)?
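
For repeated test runs, the check above (a CPU that is not halted and sits at the reset vector) can be automated. This is a rough sketch that assumes the "info cpus" output has exactly the line format shown above; `stuck_cpus` and `CPU_LINE` are names made up for this example, and other QEMU versions may print the lines differently.

```python
import re

# Matches lines of the form shown above, e.g.
#   "* CPU #0: pc=0xffffffff8105eb26 (halted) thread_id=22745"
CPU_LINE = re.compile(
    r"^\*?\s*CPU #(?P<idx>\d+): pc=(?P<pc>0x[0-9a-fA-F]+)"
    r"(?P<halted> \(halted\))? thread_id=(?P<tid>\d+)$"
)

RESET_VECTOR = 0xFFFFFFF0  # the pc value reported for CPU #1 above


def stuck_cpus(info_cpus_text):
    """Return indices of CPUs that are not halted and sit at the reset vector."""
    stuck = []
    for line in info_cpus_text.splitlines():
        m = CPU_LINE.match(line.strip())
        if m is None:
            continue  # ignore anything that is not a CPU status line
        if m.group("halted") is None and int(m.group("pc"), 16) == RESET_VECTOR:
            stuck.append(int(m.group("idx")))
    return stuck


SAMPLE = """\
* CPU #0: pc=0xffffffff8105eb26 (halted) thread_id=22745
  CPU #1: pc=0x00000000fffffff0 thread_id=22746
  CPU #2: pc=0xffffffff8105eb26 (halted) thread_id=22747
  CPU #3: pc=0xffffffff8105eb26 (halted) thread_id=22748
"""

print(stuck_cpus(SAMPLE))  # prints [1]
```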

The guest kernel dmesg contains the following messages:

> [   55.805153] PM: Restoring platform NVS memory
> [   55.805153] Enabling non-boot CPUs ...
> [   55.805153] x86: Booting SMP configuration:
> [   55.805516] smpboot: Booting Node 0 Processor 1 APIC 0x1
> [   65.816049] smpboot: do_boot_cpu failed(-1) to wakeup CPU#1 <- HERE
> [   65.816738] Error taking CPU1 up: -5
> [   65.817050] smpboot: Booting Node 0 Processor 2 APIC 0x2
> [   65.817029] kvm-clock: cpu 2, msr 1:7ffd6081, secondary cpu clock
> [   65.817029] kvm: enabling virtualization on CPU2
> [   65.832296] KVM setup async PF for cpu 2
> [   65.832607] kvm-stealtime: cpu 2, msr 17fd0e100
> [   65.833031] CPU2 is up
> [   65.833242] smpboot: Booting Node 0 Processor 3 APIC 0x3
> [   65.833229] kvm-clock: cpu 3, msr 1:7ffd60c1, secondary cpu clock
> [   65.833229] kvm: enabling virtualization on CPU3
> [   65.848594] KVM setup async PF for cpu 3
> [   65.848940] kvm-stealtime: cpu 3, msr 17fd8e100
> [   65.849393] CPU3 is up
> [   65.849722] ACPI: Waking up from system sleep state S3

Note the 10 second gap where I put the marker (and the error message itself, too).
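
Such a gap can also be located mechanically when the log is long. The following is a small sketch, assuming the usual "[ seconds.microseconds] message" dmesg prefix; `find_gaps` and its one-second default threshold are arbitrary choices for this illustration.

```python
import re

# Captures the timestamp and the message of a dmesg line; a leading
# quote prefix such as "> " is tolerated because search() is used.
TS = re.compile(r"\[\s*(\d+\.\d+)\]\s*(.*)")


def find_gaps(dmesg_text, threshold=1.0):
    """Return (gap_seconds, previous_message, next_message) for large jumps."""
    gaps = []
    prev = None
    for line in dmesg_text.splitlines():
        m = TS.search(line)
        if m is None:
            continue
        ts, msg = float(m.group(1)), m.group(2)
        # Timestamps from different CPUs may go slightly backwards;
        # those negative differences never exceed the threshold.
        if prev is not None and ts - prev[0] > threshold:
            gaps.append((round(ts - prev[0], 6), prev[1], msg))
        prev = (ts, msg)
    return gaps


SAMPLE = """\
[   55.805153] x86: Booting SMP configuration:
[   55.805516] smpboot: Booting Node 0 Processor 1 APIC 0x1
[   65.816049] smpboot: do_boot_cpu failed(-1) to wakeup CPU#1
[   65.816738] Error taking CPU1 up: -5
[   65.817050] smpboot: Booting Node 0 Processor 2 APIC 0x2
[   65.817029] kvm-clock: cpu 2, msr 1:7ffd6081, secondary cpu clock
"""

for gap, before, after in find_gaps(SAMPLE):
    print(f"{gap:.2f}s between {before!r} and {after!r}")
```

On the excerpt above, this reports a single gap of roughly ten seconds between the "Booting Node 0 Processor 1" line and the do_boot_cpu failure.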

Here's an excerpt from the KVM trace:

>  CPU-23509 [002]  8406.908787: kvm_enter_smm:        vcpu 1: entering SMM, smbase 0x30000
>  CPU-23509 [002]  8406.908836: kvm_enter_smm:        vcpu 1: leaving SMM, smbase 0x7ffb3000
>  CPU-23510 [003]  8406.908850: kvm_enter_smm:        vcpu 2: entering SMM, smbase 0x30000
>  CPU-23510 [003]  8406.908881: kvm_enter_smm:        vcpu 2: leaving SMM, smbase 0x7ffb5000
>  CPU-23511 [001]  8406.908908: kvm_enter_smm:        vcpu 3: entering SMM, smbase 0x30000
>  CPU-23511 [001]  8406.908941: kvm_enter_smm:        vcpu 3: leaving SMM, smbase 0x7ffb7000
>  CPU-23508 [005]  8406.908951: kvm_enter_smm:        vcpu 0: entering SMM, smbase 0x30000
>  CPU-23508 [005]  8406.908989: kvm_enter_smm:        vcpu 0: leaving SMM, smbase 0x7ffb1000
>  CPU-23511 [001]  8406.920215: kvm_enter_smm:        vcpu 3: entering SMM, smbase 0x7ffb7000
>  CPU-23509 [002]  8406.920225: kvm_enter_smm:        vcpu 1: entering SMM, smbase 0x7ffb3000
>  CPU-23510 [003]  8406.920225: kvm_enter_smm:        vcpu 2: entering SMM, smbase 0x7ffb5000
>  CPU-23508 [005]  8406.920227: kvm_enter_smm:        vcpu 0: entering SMM, smbase 0x7ffb1000
>  CPU-23508 [005]  8406.920262: kvm_enter_smm:        vcpu 0: leaving SMM, smbase 0x7ffb1000
>  CPU-23511 [001]  8406.920263: kvm_enter_smm:        vcpu 3: leaving SMM, smbase 0x7ffb7000
>  CPU-23508 [005]  8407.020292: kvm_enter_smm:        vcpu 0: entering SMM, smbase 0x7ffb1000
>  CPU-23509 [006]  8407.020338: kvm_enter_smm:        vcpu 1: leaving SMM, smbase 0x7ffb3000
>  CPU-23510 [003]  8407.020338: kvm_enter_smm:        vcpu 2: leaving SMM, smbase 0x7ffb5000
>  CPU-23508 [005]  8407.020338: kvm_enter_smm:        vcpu 0: leaving SMM, smbase 0x7ffb1000

It seems that VCPU#0 still leaves (and then re-enters) SMM while VCPU#1 and VCPU#2 are firmly in SMM.
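
To make that reading of the trace easier to reproduce, here is a sketch that replays the kvm_enter_smm events and records which VCPUs are inside SMM after each one. It assumes the trace line format shown above; `smm_timeline` is a name invented for this example, and the sample is the second half of the excerpt.

```python
import re

# Despite the tracepoint name, kvm_enter_smm logs both directions.
EVENT = re.compile(
    r"(?P<ts>\d+\.\d+): kvm_enter_smm:\s+vcpu (?P<vcpu>\d+): "
    r"(?P<dir>entering|leaving) SMM"
)


def smm_timeline(trace_text):
    """Return (timestamp, vcpu, direction, vcpus_in_smm_afterwards) per event."""
    in_smm = set()
    timeline = []
    for line in trace_text.splitlines():
        m = EVENT.search(line)
        if m is None:
            continue
        vcpu = int(m.group("vcpu"))
        if m.group("dir") == "entering":
            in_smm.add(vcpu)
        else:
            in_smm.discard(vcpu)
        timeline.append((m.group("ts"), vcpu, m.group("dir"), sorted(in_smm)))
    return timeline


SAMPLE = """\
 CPU-23511 [001]  8406.920215: kvm_enter_smm:        vcpu 3: entering SMM, smbase 0x7ffb7000
 CPU-23509 [002]  8406.920225: kvm_enter_smm:        vcpu 1: entering SMM, smbase 0x7ffb3000
 CPU-23510 [003]  8406.920225: kvm_enter_smm:        vcpu 2: entering SMM, smbase 0x7ffb5000
 CPU-23508 [005]  8406.920227: kvm_enter_smm:        vcpu 0: entering SMM, smbase 0x7ffb1000
 CPU-23508 [005]  8406.920262: kvm_enter_smm:        vcpu 0: leaving SMM, smbase 0x7ffb1000
 CPU-23511 [001]  8406.920263: kvm_enter_smm:        vcpu 3: leaving SMM, smbase 0x7ffb7000
 CPU-23508 [005]  8407.020292: kvm_enter_smm:        vcpu 0: entering SMM, smbase 0x7ffb1000
 CPU-23509 [006]  8407.020338: kvm_enter_smm:        vcpu 1: leaving SMM, smbase 0x7ffb3000
 CPU-23510 [003]  8407.020338: kvm_enter_smm:        vcpu 2: leaving SMM, smbase 0x7ffb5000
 CPU-23508 [005]  8407.020338: kvm_enter_smm:        vcpu 0: leaving SMM, smbase 0x7ffb1000
"""

for ts, vcpu, direction, remaining in smm_timeline(SAMPLE):
    print(f"{ts} vcpu {vcpu} {direction:8} -> in SMM: {remaining}")
```

In this replay, VCPU#1 and VCPU#2 stay in the "in SMM" set from 8406.920225 until 8407.020338, while VCPU#0 leaves at 8406.920262 and re-enters at 8407.020292.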

So this series is a clear improvement, but something else remains amiss.

If I remove Jiewen's v2 series and apply only this one, then the symptom shows up much less frequently, but it still occurs:
- With (Jiewen's v2 + this one), testing case 13, I hit the symptom on the second resume.
- With just this set applied, I hit the symptom (= one AP disappearing from Linux after resume) only on the 24th resume.

Thanks
Laszlo
