From: "Yao, Jiewen"
To: Laszlo Ersek, "Fan, Jeff"
CC: "edk2-devel@ml01.01.org", Paolo Bonzini
Date: Thu, 10 Nov 2016 11:17:50 +0000
Message-ID: <74D8A39837DF1E4DA445A8C0B3885C50386CE428@shsmsx102.ccr.corp.intel.com>
References: <20161110060708.13932-1-jeff.fan@intel.com> <0528a12e-3755-99cb-861a-ac927d484ec1@redhat.com>
In-Reply-To: <0528a12e-3755-99cb-861a-ac927d484ec1@redhat.com>
Subject: Re: [PATCH 0/2] Put AP into safe hlt-loop code on S3 path
List-Id: EDK II Development

Hi Laszlo

Thanks for testing this for us.

Are you saying Jeff's patch introduces a new issue? Or is this a pre-existing issue that Jeff's patch simply does not fix?

Thank you
Yao Jiewen

From: Laszlo Ersek [mailto:lersek@redhat.com]
Sent: Thursday, November 10, 2016 6:41 PM
To: Fan, Jeff
Cc: edk2-devel@ml01.01.org; Yao, Jiewen; Paolo Bonzini
Subject: Re: [edk2] [PATCH 0/2] Put AP into safe hlt-loop code on S3 path

On 11/10/16 07:07, Jeff Fan wrote:
> On S3 path, we will wake up APs to restore CPU context in PiSmmCpuDxeSmm
> driver. In case, one NMI or SMI happens, APs may exit from hlt state and
> execute the instruction after HLT instruction.
>
> But APs are not running on safe code, it leads OVMF S3 boot unstable.
>
> https://bugzilla.tianocore.org/show_bug.cgi?id=216
>
> I tested real platform with 64bit DXE.
>
> Jeff Fan (2):
>   UefiCpuPkg/PiSmmCpuDxeSmm: Put AP into safe hlt-loop code on S3 path
>   UefiCpuPkg/PiSmmCpuDxeSmm: Place AP to 32bit protected mode on S3 path
>
>  UefiCpuPkg/PiSmmCpuDxeSmm/CpuS3.c             | 31 ++++++++++++++
>  UefiCpuPkg/PiSmmCpuDxeSmm/Ia32/SmmFuncsArch.c | 25 ++++++++++++
>  UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.h    | 13 ++++++
>  UefiCpuPkg/PiSmmCpuDxeSmm/X64/SmmFuncsArch.c  | 59 +++++++++++++++++++++++++++
>  4 files changed, 128 insertions(+)
>

I applied this on top of Jiewen's v2, for testing.

This series (with my addition for patch #1) doesn't fix the boot failure in case 8. (See "case 8" in .) I don't think the series aims to do that at all, but since it modifies the Ia32/SmmFuncsArch.c file, I thought I'd give it a shot.

The series (with my addition for patch #1) changed the behavior of S3 resume, in case 13.
There seem to be no crashes / emulation failures now. However, in some of the tries, the resume seems to include a busy loop that lasts several seconds, and after that -- although the guest OS does come back up --, I cannot access *some* of the APs from within the OS:

  # this works, quickly
  taskset -c 0 efibootmgr

  # this fails
  taskset -c 1 efibootmgr
  taskset: failed to set pid 0's affinity: Invalid argument

  # these work again, albeit more slowly (as expected)
  taskset -c 2 efibootmgr
  taskset -c 3 efibootmgr

I've seen this symptom ("AP goes lost during S3 resume") with the Ia32 SMM build before (without Jiewen's v2 series applied).

If I run the "info cpus" QEMU command, I get:

* CPU #0: pc=0xffffffff8105eb26 (halted) thread_id=22745
  CPU #1: pc=0x00000000fffffff0 thread_id=22746
  CPU #2: pc=0xffffffff8105eb26 (halted) thread_id=22747
  CPU #3: pc=0xffffffff8105eb26 (halted) thread_id=22748

The halted status for #0, #2 and #3 is fine; that's just Linux at work. CPU #1 is strange -- not halted, but somehow stuck in the reset vector (0xfffffff0)?

The guest kernel dmesg contains the following messages:

> [ 55.805153] PM: Restoring platform NVS memory
> [ 55.805153] Enabling non-boot CPUs ...
> [ 55.805153] x86: Booting SMP configuration:
> [ 55.805516] smpboot: Booting Node 0 Processor 1 APIC 0x1
> [ 65.816049] smpboot: do_boot_cpu failed(-1) to wakeup CPU#1 <- HERE
> [ 65.816738] Error taking CPU1 up: -5
> [ 65.817050] smpboot: Booting Node 0 Processor 2 APIC 0x2
> [ 65.817029] kvm-clock: cpu 2, msr 1:7ffd6081, secondary cpu clock
> [ 65.817029] kvm: enabling virtualization on CPU2
> [ 65.832296] KVM setup async PF for cpu 2
> [ 65.832607] kvm-stealtime: cpu 2, msr 17fd0e100
> [ 65.833031] CPU2 is up
> [ 65.833242] smpboot: Booting Node 0 Processor 3 APIC 0x3
> [ 65.833229] kvm-clock: cpu 3, msr 1:7ffd60c1, secondary cpu clock
> [ 65.833229] kvm: enabling virtualization on CPU3
> [ 65.848594] KVM setup async PF for cpu 3
> [ 65.848940] kvm-stealtime: cpu 3, msr 17fd8e100
> [ 65.849393] CPU3 is up
> [ 65.849722] ACPI: Waking up from system sleep state S3

Note the 10 second gap where I put the marker (and the error message itself, too).

Here's an excerpt from the KVM trace:

> CPU-23509 [002] 8406.908787: kvm_enter_smm: vcpu 1: entering SMM, smbase 0x30000
> CPU-23509 [002] 8406.908836: kvm_enter_smm: vcpu 1: leaving SMM, smbase 0x7ffb3000
> CPU-23510 [003] 8406.908850: kvm_enter_smm: vcpu 2: entering SMM, smbase 0x30000
> CPU-23510 [003] 8406.908881: kvm_enter_smm: vcpu 2: leaving SMM, smbase 0x7ffb5000
> CPU-23511 [001] 8406.908908: kvm_enter_smm: vcpu 3: entering SMM, smbase 0x30000
> CPU-23511 [001] 8406.908941: kvm_enter_smm: vcpu 3: leaving SMM, smbase 0x7ffb7000
> CPU-23508 [005] 8406.908951: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x30000
> CPU-23508 [005] 8406.908989: kvm_enter_smm: vcpu 0: leaving SMM, smbase 0x7ffb1000
> CPU-23511 [001] 8406.920215: kvm_enter_smm: vcpu 3: entering SMM, smbase 0x7ffb7000
> CPU-23509 [002] 8406.920225: kvm_enter_smm: vcpu 1: entering SMM, smbase 0x7ffb3000
> CPU-23510 [003] 8406.920225: kvm_enter_smm: vcpu 2: entering SMM, smbase 0x7ffb5000
> CPU-23508 [005] 8406.920227: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x7ffb1000
> CPU-23508 [005] 8406.920262: kvm_enter_smm: vcpu 0: leaving SMM, smbase 0x7ffb1000
> CPU-23511 [001] 8406.920263: kvm_enter_smm: vcpu 3: leaving SMM, smbase 0x7ffb7000
> CPU-23508 [005] 8407.020292: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x7ffb1000
> CPU-23509 [006] 8407.020338: kvm_enter_smm: vcpu 1: leaving SMM, smbase 0x7ffb3000
> CPU-23510 [003] 8407.020338: kvm_enter_smm: vcpu 2: leaving SMM, smbase 0x7ffb5000
> CPU-23508 [005] 8407.020338: kvm_enter_smm: vcpu 0: leaving SMM, smbase 0x7ffb1000

It seems that VCPU#0 still leaves (and then re-enters) SMM while VCPU#1 and VCPU#2 are firmly in SMM.

So this series is a clear improvement, but something else remains amiss.

If I remove Jiewen's v2 series, and apply only this one, then the symptom shows up much less frequently, but it does exist:

- With (Jiewen's v2 + this one), testing case 13, I hit the symptom on the second resume,
- With just this set applied, I hit the symptom (= one AP disappearing from Linux after resume) only on the 24th resume.

Thanks
Laszlo
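
The enter/leave pairing in the KVM trace excerpt above can also be checked mechanically. What follows is only a minimal Python sketch, assuming the trace lines keep the kvm_enter_smm format shown in the excerpt. The function name smm_visit_durations_us and the embedded sample (the last ten lines of the excerpt) are illustrative choices, not part of the series or of any KVM tooling.

```python
import re

# Sample input: the last ten lines of the KVM trace excerpt quoted above.
TRACE = """\
CPU-23511 [001] 8406.920215: kvm_enter_smm: vcpu 3: entering SMM, smbase 0x7ffb7000
CPU-23509 [002] 8406.920225: kvm_enter_smm: vcpu 1: entering SMM, smbase 0x7ffb3000
CPU-23510 [003] 8406.920225: kvm_enter_smm: vcpu 2: entering SMM, smbase 0x7ffb5000
CPU-23508 [005] 8406.920227: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x7ffb1000
CPU-23508 [005] 8406.920262: kvm_enter_smm: vcpu 0: leaving SMM, smbase 0x7ffb1000
CPU-23511 [001] 8406.920263: kvm_enter_smm: vcpu 3: leaving SMM, smbase 0x7ffb7000
CPU-23508 [005] 8407.020292: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x7ffb1000
CPU-23509 [006] 8407.020338: kvm_enter_smm: vcpu 1: leaving SMM, smbase 0x7ffb3000
CPU-23510 [003] 8407.020338: kvm_enter_smm: vcpu 2: leaving SMM, smbase 0x7ffb5000
CPU-23508 [005] 8407.020338: kvm_enter_smm: vcpu 0: leaving SMM, smbase 0x7ffb1000
"""

# Timestamp (seconds.microseconds), vCPU index, and direction of the event.
EVENT = re.compile(
    r"(\d+)\.(\d{6}): kvm_enter_smm: vcpu (\d+): (entering|leaving) SMM"
)


def smm_visit_durations_us(trace):
    """Return {vcpu: [duration_us, ...]} for every completed SMM visit."""
    entered = {}    # vcpu -> timestamp (us) of the pending "entering" event
    durations = {}  # vcpu -> list of completed visit lengths in microseconds
    for line in trace.splitlines():
        match = EVENT.search(line)
        if match is None:
            continue  # skip anything that is not a kvm_enter_smm event
        sec, usec, vcpu, action = match.groups()
        now = int(sec) * 1_000_000 + int(usec)  # integer math, no float drift
        vcpu = int(vcpu)
        if action == "entering":
            entered[vcpu] = now
        elif vcpu in entered:
            durations.setdefault(vcpu, []).append(now - entered.pop(vcpu))
    return durations


if __name__ == "__main__":
    for vcpu, spans in sorted(smm_visit_durations_us(TRACE).items()):
        print(f"vcpu {vcpu}: SMM visits (us): {spans}")
```

On these ten lines, vcpu 1 and vcpu 2 each show a single visit of roughly 100 ms, while vcpu 0 and vcpu 3 complete their visits in tens of microseconds. The regular expression uses search(), so the same function also accepts lines that still carry the "> " quote prefix.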