From: Laszlo Ersek
To: "Yao, Jiewen", "Fan, Jeff"
Cc: "edk2-devel@ml01.01.org", Paolo Bonzini
Subject: Re: [PATCH 0/2] Put AP into safe hlt-loop code on S3 path
Date: Thu, 10 Nov 2016 13:08:25 +0100
Message-ID: <170e5527-057f-9bff-af43-5b131fc93335@redhat.com>
In-Reply-To: <74D8A39837DF1E4DA445A8C0B3885C50386CE428@shsmsx102.ccr.corp.intel.com>
References: <20161110060708.13932-1-jeff.fan@intel.com> <0528a12e-3755-99cb-861a-ac927d484ec1@redhat.com> <74D8A39837DF1E4DA445A8C0B3885C50386CE428@shsmsx102.ccr.corp.intel.com>
List-Id: EDK II Development

On 11/10/16 12:17,
Yao, Jiewen wrote:
> Hi Laszlo
>
> Thanks to test for us.
>
> Are you saying Jeff’s patch introduces a new issue?
>
> Or is this a previous issue but just not fixed by Jeff’s patch?

With your v2 series applied, Jeff's patches replace the crash / emulation
failure symptoms during S3 resume with less intrusive symptoms, namely
that some of the APs cannot be brought up by the OS, occasionally.

Without your v2 series applied, Jeff's patches seem to present the same
symptoms (OS cannot bring up some APs), although much less frequently.

However, I cannot say definitively whether or not this exact same issue
exists, on Ia32X64, with none of the patch sets applied. I haven't seen
it before (on Ia32X64), but maybe I just haven't tried hard enough.

I guess I should try harder and see if the "lost AP" issue exists
without either patch set applied.

Thanks
Laszlo

>
> Thank you
>
> Yao Jiewen
>
> *From:* Laszlo Ersek [mailto:lersek@redhat.com]
> *Sent:* Thursday, November 10, 2016 6:41 PM
> *To:* Fan, Jeff
> *Cc:* edk2-devel@ml01.01.org; Yao, Jiewen; Paolo Bonzini
> *Subject:* Re: [edk2] [PATCH 0/2] Put AP into safe hlt-loop code on S3 path
>
> On 11/10/16 07:07, Jeff Fan wrote:
>> On S3 path, we will wake up APs to restore CPU context in PiSmmCpuDxeSmm
>> driver. In case, one NMI or SMI happens, APs may exit from hlt state and
>> execute the instruction after HLT instruction.
>>
>> But APs are not running on safe code, it leads OVMF S3 boot unstable.
>>
>> https://bugzilla.tianocore.org/show_bug.cgi?id=216
>>
>> I tested real platform with 64bit DXE.
>>
>> Jeff Fan (2):
>>   UefiCpuPkg/PiSmmCpuDxeSmm: Put AP into safe hlt-loop code on S3 path
>>   UefiCpuPkg/PiSmmCpuDxeSmm: Place AP to 32bit protected mode on S3 path
>>
>>  UefiCpuPkg/PiSmmCpuDxeSmm/CpuS3.c             | 31 ++++++++++++++
>>  UefiCpuPkg/PiSmmCpuDxeSmm/Ia32/SmmFuncsArch.c | 25 ++++++++++++
>>  UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.h    | 13 ++++++
>>  UefiCpuPkg/PiSmmCpuDxeSmm/X64/SmmFuncsArch.c  | 59 +++++++++++++++++++++++++++
>>  4 files changed, 128 insertions(+)
>>
>
> I applied this on top of Jiewen's v2, for testing.
>
> This series (with my addition for patch #1) doesn't fix the boot
> failure in case 8. (See "case 8" in .) I don't think the series aims to
> do that at all, but since it modifies the Ia32/SmmFuncsArch.c file, I
> thought I'd give it a shot.
>
> The series (with my addition for patch #1) changed the behavior of S3
> resume, in case 13. There seem to be no crashes / emulation failures
> now. However, in some of the tries, the resume seems to include a
> several second long busy loop, and after that -- although the guest OS
> does come back up --, I cannot access *some* of the APs from within
> the OS:
>
>   # this works, quickly
>   taskset -c 0 efibootmgr
>
>   # this fails
>   taskset -c 1 efibootmgr
>   taskset: failed to set pid 0's affinity: Invalid argument
>
>   # these work again, albeit more slowly (as expected)
>   taskset -c 2 efibootmgr
>   taskset -c 3 efibootmgr
>
> I've seen this symptom ("AP goes lost during S3 resume") with the Ia32
> SMM build before (without Jiewen's v2 series applied).
>
> If I run the "info cpus" QEMU command, I get:
>
>   * CPU #0: pc=0xffffffff8105eb26 (halted) thread_id=22745
>     CPU #1: pc=0x00000000fffffff0 thread_id=22746
>     CPU #2: pc=0xffffffff8105eb26 (halted) thread_id=22747
>     CPU #3: pc=0xffffffff8105eb26 (halted) thread_id=22748
>
> The halted status for #0, #2 and #3 is fine; that's just Linux at work.
> CPU#1 is strange -- not halted, but somehow stuck in the reset vector
> (0xfffffff0)?
>
> The guest kernel dmesg contains the following messages:
>
>> [ 55.805153] PM: Restoring platform NVS memory
>> [ 55.805153] Enabling non-boot CPUs ...
>> [ 55.805153] x86: Booting SMP configuration:
>> [ 55.805516] smpboot: Booting Node 0 Processor 1 APIC 0x1
>> [ 65.816049] smpboot: do_boot_cpu failed(-1) to wakeup CPU#1 <- HERE
>> [ 65.816738] Error taking CPU1 up: -5
>> [ 65.817050] smpboot: Booting Node 0 Processor 2 APIC 0x2
>> [ 65.817029] kvm-clock: cpu 2, msr 1:7ffd6081, secondary cpu clock
>> [ 65.817029] kvm: enabling virtualization on CPU2
>> [ 65.832296] KVM setup async PF for cpu 2
>> [ 65.832607] kvm-stealtime: cpu 2, msr 17fd0e100
>> [ 65.833031] CPU2 is up
>> [ 65.833242] smpboot: Booting Node 0 Processor 3 APIC 0x3
>> [ 65.833229] kvm-clock: cpu 3, msr 1:7ffd60c1, secondary cpu clock
>> [ 65.833229] kvm: enabling virtualization on CPU3
>> [ 65.848594] KVM setup async PF for cpu 3
>> [ 65.848940] kvm-stealtime: cpu 3, msr 17fd8e100
>> [ 65.849393] CPU3 is up
>> [ 65.849722] ACPI: Waking up from system sleep state S3
>
> Note the 10 second gap where I put the marker (and the error message
> itself, too).
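
The 10 second gap in the dmesg excerpt can also be located mechanically. The following is only a sketch: the function name `find_gaps`, the one-second threshold, and the assumption that every relevant line carries a `[ seconds.micros]` timestamp (optionally behind ">" quote markers) are illustrative choices, not part of any tool mentioned in this thread.

```python
import re

# Matches a kernel timestamp such as "[ 65.816049] message".
# Leading ">" quote markers and whitespace are tolerated (assumption: the
# log may be pasted straight out of a quoted email).
_TS = re.compile(r"^[>\s]*\[\s*(\d+\.\d+)\]\s*(.*)$")


def find_gaps(lines, threshold=1.0):
    """Return (gap_seconds, previous_message, next_message) for every pair of
    consecutive timestamped lines that are more than `threshold` seconds apart."""
    gaps = []
    prev = None
    for line in lines:
        m = _TS.match(line)
        if not m:
            continue  # skip lines without a timestamp
        ts, msg = float(m.group(1)), m.group(2)
        if prev is not None and ts - prev[0] > threshold:
            gaps.append((round(ts - prev[0], 6), prev[1], msg))
        prev = (ts, msg)
    return gaps


# Four lines copied from the dmesg excerpt quoted above.
sample = """\
[ 55.805153] x86: Booting SMP configuration:
[ 55.805516] smpboot: Booting Node 0 Processor 1 APIC 0x1
[ 65.816049] smpboot: do_boot_cpu failed(-1) to wakeup CPU#1
[ 65.816738] Error taking CPU1 up: -5
""".splitlines()

for gap, before, after in find_gaps(sample):
    print(f"{gap:.3f}s between {before!r} and {after!r}")
```

On the sample lines this reports a single gap, between the "Booting Node 0 Processor 1" message and the do_boot_cpu failure.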
>
> Here's an excerpt from the KVM trace:
>
>> CPU-23509 [002] 8406.908787: kvm_enter_smm: vcpu 1: entering SMM, smbase 0x30000
>> CPU-23509 [002] 8406.908836: kvm_enter_smm: vcpu 1: leaving SMM, smbase 0x7ffb3000
>> CPU-23510 [003] 8406.908850: kvm_enter_smm: vcpu 2: entering SMM, smbase 0x30000
>> CPU-23510 [003] 8406.908881: kvm_enter_smm: vcpu 2: leaving SMM, smbase 0x7ffb5000
>> CPU-23511 [001] 8406.908908: kvm_enter_smm: vcpu 3: entering SMM, smbase 0x30000
>> CPU-23511 [001] 8406.908941: kvm_enter_smm: vcpu 3: leaving SMM, smbase 0x7ffb7000
>> CPU-23508 [005] 8406.908951: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x30000
>> CPU-23508 [005] 8406.908989: kvm_enter_smm: vcpu 0: leaving SMM, smbase 0x7ffb1000
>> CPU-23511 [001] 8406.920215: kvm_enter_smm: vcpu 3: entering SMM, smbase 0x7ffb7000
>> CPU-23509 [002] 8406.920225: kvm_enter_smm: vcpu 1: entering SMM, smbase 0x7ffb3000
>> CPU-23510 [003] 8406.920225: kvm_enter_smm: vcpu 2: entering SMM, smbase 0x7ffb5000
>> CPU-23508 [005] 8406.920227: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x7ffb1000
>> CPU-23508 [005] 8406.920262: kvm_enter_smm: vcpu 0: leaving SMM, smbase 0x7ffb1000
>> CPU-23511 [001] 8406.920263: kvm_enter_smm: vcpu 3: leaving SMM, smbase 0x7ffb7000
>> CPU-23508 [005] 8407.020292: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x7ffb1000
>> CPU-23509 [006] 8407.020338: kvm_enter_smm: vcpu 1: leaving SMM, smbase 0x7ffb3000
>> CPU-23510 [003] 8407.020338: kvm_enter_smm: vcpu 2: leaving SMM, smbase 0x7ffb5000
>> CPU-23508 [005] 8407.020338: kvm_enter_smm: vcpu 0: leaving SMM, smbase 0x7ffb1000
>
> It seems that VCPU#0 still leaves (and then re-enters) SMM while VCPU#1
> and VCPU#2 are firmly in SMM.
>
> So this series is a clear improvement, but something else remains amiss.
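
The observation about VCPU#0 leaving SMM while other VCPUs are still inside can be extracted from such a trace with a few lines of code. This is a rough sketch only: the regular expression is modeled on the trace lines quoted in this mail, and the function name `leaves_while_others_in_smm` is made up for illustration. It lists the overlaps; it does not decide whether a given overlap is a problem.

```python
import re

# Modeled on the quoted lines, e.g.
# "CPU-23508 [005] 8406.920262: kvm_enter_smm: vcpu 0: leaving SMM, smbase 0x7ffb1000"
_EVENT = re.compile(
    r"(\d+\.\d+):\s+kvm_enter_smm:\s+vcpu\s+(\d+):\s+(entering|leaving)\s+SMM"
)


def leaves_while_others_in_smm(lines):
    """Return (timestamp, vcpu, others) for every 'leaving SMM' event that
    happens while at least one other vCPU is still inside SMM."""
    in_smm = set()
    overlaps = []
    for line in lines:
        m = _EVENT.search(line)
        if not m:
            continue
        ts, vcpu, action = float(m.group(1)), int(m.group(2)), m.group(3)
        if action == "entering":
            in_smm.add(vcpu)
        else:
            in_smm.discard(vcpu)
            if in_smm:
                overlaps.append((ts, vcpu, sorted(in_smm)))
    return overlaps


# The last ten events of the trace excerpt quoted above.
trace = """\
CPU-23511 [001] 8406.920215: kvm_enter_smm: vcpu 3: entering SMM, smbase 0x7ffb7000
CPU-23509 [002] 8406.920225: kvm_enter_smm: vcpu 1: entering SMM, smbase 0x7ffb3000
CPU-23510 [003] 8406.920225: kvm_enter_smm: vcpu 2: entering SMM, smbase 0x7ffb5000
CPU-23508 [005] 8406.920227: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x7ffb1000
CPU-23508 [005] 8406.920262: kvm_enter_smm: vcpu 0: leaving SMM, smbase 0x7ffb1000
CPU-23511 [001] 8406.920263: kvm_enter_smm: vcpu 3: leaving SMM, smbase 0x7ffb7000
CPU-23508 [005] 8407.020292: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x7ffb1000
CPU-23509 [006] 8407.020338: kvm_enter_smm: vcpu 1: leaving SMM, smbase 0x7ffb3000
CPU-23510 [003] 8407.020338: kvm_enter_smm: vcpu 2: leaving SMM, smbase 0x7ffb5000
CPU-23508 [005] 8407.020338: kvm_enter_smm: vcpu 0: leaving SMM, smbase 0x7ffb1000
""".splitlines()

for ts, vcpu, others in leaves_while_others_in_smm(trace):
    print(f"{ts:.6f}: vcpu {vcpu} left SMM while vcpus {others} were still inside")
```

The first reported event is VCPU#0 leaving at 8406.920262 with VCPU#1, #2 and #3 still in SMM, which is the overlap described in the text.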
>
> If I remove Jiewen's v2 series, and apply only this one, then the
> symptom shows up much less frequently, but it does exist:
> - With (Jiewen's v2 + this one), testing case 13, I hit the symptom on
>   the second resume,
> - With just this set applied, I hit the symptom (= one AP disappearing
>   from Linux after resume) only on the 24th resume.
>
> Thanks
> Laszlo
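
Counting resumes until an AP disappears is easier with an automated check in the guest after each resume. The sketch below mirrors the `taskset -c N` test from the mail by calling `sched_setaffinity` directly; an offline CPU is rejected with EINVAL, the same "Invalid argument" that taskset reports. The function name `probe_cpus` and the injectable `set_affinity` / `get_affinity` parameters are illustrative assumptions (the latter exist only to make the function testable without a broken guest). It is Linux-only.

```python
import errno
import os


def probe_cpus(cpu_ids, set_affinity=None, get_affinity=None):
    """Try to pin the current process to each CPU in `cpu_ids` in turn.

    Returns {cpu: True/False}. False means the kernel rejected the CPU with
    EINVAL, which is what happens for a CPU that is not online."""
    set_affinity = set_affinity or os.sched_setaffinity
    get_affinity = get_affinity or os.sched_getaffinity
    original = get_affinity(0)  # remember the mask so it can be restored
    results = {}
    try:
        for cpu in cpu_ids:
            try:
                set_affinity(0, {cpu})
                results[cpu] = True
            except OSError as exc:
                if exc.errno != errno.EINVAL:
                    raise  # unexpected failure, do not hide it
                results[cpu] = False
    finally:
        set_affinity(0, original)  # restore the original affinity mask
    return results
```

In the 4-VCPU guest described above, `probe_cpus(range(4))` would be expected to return False for CPU 1 after a bad resume. The CPU ids are passed explicitly because `os.cpu_count()` counts only online CPUs on Linux, so it would undercount exactly when an AP is lost.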