From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by ml01.01.org (Postfix) with ESMTPS id C5C7381C97
 for ; Thu, 10 Nov 2016 12:45:24 -0800 (PST)
Received: from int-mx13.intmail.prod.int.phx2.redhat.com (int-mx13.intmail.prod.int.phx2.redhat.com [10.5.11.26])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by mx1.redhat.com (Postfix) with ESMTPS id 1D4E98EA48;
 Thu, 10 Nov 2016 20:45:28 +0000 (UTC)
Received: from lacos-laptop-7.usersys.redhat.com (ovpn-116-106.phx2.redhat.com [10.3.116.106])
 by int-mx13.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id uAAKjQVu017171;
 Thu, 10 Nov 2016 15:45:26 -0500
To: "Yao, Jiewen" , "Fan, Jeff"
References: <20161110060708.13932-1-jeff.fan@intel.com>
 <0528a12e-3755-99cb-861a-ac927d484ec1@redhat.com>
 <74D8A39837DF1E4DA445A8C0B3885C50386CE428@shsmsx102.ccr.corp.intel.com>
 <170e5527-057f-9bff-af43-5b131fc93335@redhat.com>
Cc: Paolo Bonzini , "edk2-devel@ml01.01.org"
From: Laszlo Ersek
Message-ID:
Date: Thu, 10 Nov 2016 21:45:25 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0
MIME-Version: 1.0
In-Reply-To: <170e5527-057f-9bff-af43-5b131fc93335@redhat.com>
X-Scanned-By: MIMEDefang 2.68 on 10.5.11.26
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.28]); Thu, 10 Nov 2016 20:45:28 +0000 (UTC)
Subject: Re: [PATCH 0/2] Put AP into safe hlt-loop code on S3 path
X-BeenThere: edk2-devel@lists.01.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: EDK II Development
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Thu, 10 Nov 2016 20:45:24 -0000
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 8bit

On 11/10/16 13:08, Laszlo Ersek wrote:
> On
11/10/16 12:17, Yao, Jiewen wrote:
>> Hi Laszlo
>>
>> Thanks for testing this for us.
>>
>> Are you saying Jeff’s patch introduces a new issue?
>>
>> Or is this a previous issue but just not fixed by Jeff’s patch?
>
> With your v2 series applied, Jeff's patches replace the crash /
> emulation failure symptoms during S3 resume with less intrusive
> symptoms, namely that some of the APs cannot be brought up by the OS,
> occasionally.
>
> Without your v2 series applied, Jeff's patches seem to present the same
> symptoms (OS cannot bring up some APs), although much less frequently.
> However, I cannot say definitively whether or not this exact same issue
> exists, on Ia32X64, with none of the patch sets applied. I haven't seen
> it before (on Ia32X64), but maybe I just haven't tried hard enough.
>
> I guess I should try harder and see if the "lost AP" issue exists
> without either patch set applied.

With none of the patch sets applied, the "lost AP" issue doesn't exist;
instead, the emulation failure is experienced.

So, for Ia32X64,

                         Jeff's not applied           Jeff's applied
                         ---------------------------  -----------------
Jiewen's v2 not applied  emulation failure, rare      AP lost, rare
Jiewen's v2 applied      emulation failure, frequent  AP lost, frequent

Thanks
Laszlo

>>
>> Thank you
>>
>> Yao Jiewen
>>
>> *From:*Laszlo Ersek [mailto:lersek@redhat.com]
>> *Sent:* Thursday, November 10, 2016 6:41 PM
>> *To:* Fan, Jeff
>> *Cc:* edk2-devel@ml01.01.org; Yao, Jiewen ; Paolo
>> Bonzini
>> *Subject:* Re: [edk2] [PATCH 0/2] Put AP into safe hlt-loop code on S3 path
>>
>> On 11/10/16 07:07, Jeff Fan wrote:
>>> On S3 path, we will wake up APs to restore CPU context in PiSmmCpuDxeSmm
>>> driver. In case an NMI or SMI happens, APs may exit from hlt state and
>>> execute the instruction after HLT instruction.
>>>
>>> But the APs are not running safe code, which makes OVMF S3 boot unstable.
>>>
>>> https://bugzilla.tianocore.org/show_bug.cgi?id=216
>>>
>>> I tested on a real platform with 64-bit DXE.
>>>
>>> Jeff Fan (2):
>>>   UefiCpuPkg/PiSmmCpuDxeSmm: Put AP into safe hlt-loop code on S3 path
>>>   UefiCpuPkg/PiSmmCpuDxeSmm: Place AP to 32bit protected mode on S3 path
>>>
>>>  UefiCpuPkg/PiSmmCpuDxeSmm/CpuS3.c             | 31 ++++++++++++++
>>>  UefiCpuPkg/PiSmmCpuDxeSmm/Ia32/SmmFuncsArch.c | 25 ++++++++++++
>>>  UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.h    | 13 ++++++
>>>  UefiCpuPkg/PiSmmCpuDxeSmm/X64/SmmFuncsArch.c  | 59 +++++++++++++++++++++++++++
>>>  4 files changed, 128 insertions(+)
>>>
>>
>> I applied this on top of Jiewen's v2, for testing.
>>
>> This series (with my addition for patch #1) doesn't fix the boot
>> failure in case 8. (See "case 8" in .) I don't think the series aims
>> to do that at all, but since it modifies the Ia32/SmmFuncsArch.c file,
>> I thought I'd give it a shot.
>>
>> The series (with my addition for patch #1) changed the behavior of S3
>> resume, in case 13. There seem to be no crashes / emulation failures
>> now. However, in some of the tries, the resume seems to include a
>> several second long busy loop, and after that -- although the guest OS
>> does come back up --, I cannot access *some* of the APs from within
>> the OS:
>>
>> # this works, quickly
>> taskset -c 0 efibootmgr
>>
>> # this fails
>> taskset -c 1 efibootmgr
>> taskset: failed to set pid 0's affinity: Invalid argument
>>
>> # these work again, albeit more slowly (as expected)
>> taskset -c 2 efibootmgr
>> taskset -c 3 efibootmgr
>>
>> I've seen this symptom ("AP goes lost during S3 resume") with the Ia32
>> SMM build before (without Jiewen's v2 series applied).
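The per-CPU taskset check quoted above can also be scripted, which helps when the symptom only shows up after many resume cycles. What follows is a minimal sketch in Python, assuming a Linux guest where os.sched_setaffinity is available. The name probe_cpus and its injectable set_affinity / get_affinity parameters are hypothetical, introduced here only for illustration and for testing without real CPUs. The sketch tries to pin the current process to each CPU in turn and treats EINVAL (the "Invalid argument" that taskset reported for CPU 1) as a sign that the CPU is not usable:

```python
import errno
import os


def probe_cpus(cpu_ids, set_affinity=None, get_affinity=None):
    """Return {cpu: usable} by trying to pin this process to each CPU.

    set_affinity / get_affinity default to the os.sched_* calls; they are
    parameters only so the logic can be exercised with stand-ins
    (hypothetical hook, not part of any tool used in this thread).
    """
    set_affinity = set_affinity or os.sched_setaffinity
    get_affinity = get_affinity or os.sched_getaffinity

    original = get_affinity(0)  # remember the mask so it can be restored
    usable = {}
    try:
        for cpu in cpu_ids:
            try:
                set_affinity(0, {cpu})
                usable[cpu] = True
            except OSError as exc:
                # EINVAL: the mask contains no CPU that is currently online
                if exc.errno != errno.EINVAL:
                    raise
                usable[cpu] = False
    finally:
        set_affinity(0, original)
    return usable
```

Run in the guest after each resume, for example as probe_cpus(range(4)), it would be expected to report CPU 1 as unusable in the failing case described above.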
>>
>> If I run the "info cpus" QEMU command, I get:
>>
>> * CPU #0: pc=0xffffffff8105eb26 (halted) thread_id=22745
>>   CPU #1: pc=0x00000000fffffff0 thread_id=22746
>>   CPU #2: pc=0xffffffff8105eb26 (halted) thread_id=22747
>>   CPU #3: pc=0xffffffff8105eb26 (halted) thread_id=22748
>>
>> The halted status for #0, #2 and #3 is fine; that's just Linux at
>> work. CPU#1 is strange -- not halted, but somehow stuck in the reset
>> vector (0xfffffff0)?
>>
>> The guest kernel dmesg contains the following messages:
>>
>>> [ 55.805153] PM: Restoring platform NVS memory
>>> [ 55.805153] Enabling non-boot CPUs ...
>>> [ 55.805153] x86: Booting SMP configuration:
>>> [ 55.805516] smpboot: Booting Node 0 Processor 1 APIC 0x1
>>> [ 65.816049] smpboot: do_boot_cpu failed(-1) to wakeup CPU#1 <- HERE
>>> [ 65.816738] Error taking CPU1 up: -5
>>> [ 65.817050] smpboot: Booting Node 0 Processor 2 APIC 0x2
>>> [ 65.817029] kvm-clock: cpu 2, msr 1:7ffd6081, secondary cpu clock
>>> [ 65.817029] kvm: enabling virtualization on CPU2
>>> [ 65.832296] KVM setup async PF for cpu 2
>>> [ 65.832607] kvm-stealtime: cpu 2, msr 17fd0e100
>>> [ 65.833031] CPU2 is up
>>> [ 65.833242] smpboot: Booting Node 0 Processor 3 APIC 0x3
>>> [ 65.833229] kvm-clock: cpu 3, msr 1:7ffd60c1, secondary cpu clock
>>> [ 65.833229] kvm: enabling virtualization on CPU3
>>> [ 65.848594] KVM setup async PF for cpu 3
>>> [ 65.848940] kvm-stealtime: cpu 3, msr 17fd8e100
>>> [ 65.849393] CPU3 is up
>>> [ 65.849722] ACPI: Waking up from system sleep state S3
>>
>> Note the 10 second gap where I put the marker (and the error message
>> itself, too).
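Spotting such a stall by eye gets tedious across many resume cycles. As a rough sketch, assuming dmesg lines that carry the usual "[ seconds.micros]" timestamp prefix as in the excerpt above, the following Python helper lists every pair of consecutive messages separated by at least a given number of seconds. The function name find_gaps and the one-second default threshold are hypothetical choices made for illustration:

```python
import re

# Matches the "[ 65.816049] message" form seen in the dmesg excerpt.
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]\s*(.*)")


def find_gaps(lines, threshold=1.0):
    """Return (previous_message, next_message, gap_seconds) for each pair
    of consecutive timestamped lines at least `threshold` seconds apart."""
    gaps = []
    prev = None  # (timestamp, message) of the last timestamped line
    for line in lines:
        match = STAMP.search(line)
        if match is None:
            continue  # skip lines without a timestamp
        stamp = float(match.group(1))
        message = match.group(2).strip()
        if prev is not None and stamp - prev[0] >= threshold:
            gaps.append((prev[1], message, round(stamp - prev[0], 6)))
        prev = (stamp, message)
    return gaps
```

Fed the lines quoted above, it should single out the step from "Booting Node 0 Processor 1" to the "do_boot_cpu failed" message, which is the roughly 10 second gap at the marker.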
>>
>> Here's an excerpt from the KVM trace:
>>
>>> CPU-23509 [002] 8406.908787: kvm_enter_smm: vcpu 1: entering SMM, smbase 0x30000
>>> CPU-23509 [002] 8406.908836: kvm_enter_smm: vcpu 1: leaving SMM, smbase 0x7ffb3000
>>> CPU-23510 [003] 8406.908850: kvm_enter_smm: vcpu 2: entering SMM, smbase 0x30000
>>> CPU-23510 [003] 8406.908881: kvm_enter_smm: vcpu 2: leaving SMM, smbase 0x7ffb5000
>>> CPU-23511 [001] 8406.908908: kvm_enter_smm: vcpu 3: entering SMM, smbase 0x30000
>>> CPU-23511 [001] 8406.908941: kvm_enter_smm: vcpu 3: leaving SMM, smbase 0x7ffb7000
>>> CPU-23508 [005] 8406.908951: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x30000
>>> CPU-23508 [005] 8406.908989: kvm_enter_smm: vcpu 0: leaving SMM, smbase 0x7ffb1000
>>> CPU-23511 [001] 8406.920215: kvm_enter_smm: vcpu 3: entering SMM, smbase 0x7ffb7000
>>> CPU-23509 [002] 8406.920225: kvm_enter_smm: vcpu 1: entering SMM, smbase 0x7ffb3000
>>> CPU-23510 [003] 8406.920225: kvm_enter_smm: vcpu 2: entering SMM, smbase 0x7ffb5000
>>> CPU-23508 [005] 8406.920227: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x7ffb1000
>>> CPU-23508 [005] 8406.920262: kvm_enter_smm: vcpu 0: leaving SMM, smbase 0x7ffb1000
>>> CPU-23511 [001] 8406.920263: kvm_enter_smm: vcpu 3: leaving SMM, smbase 0x7ffb7000
>>> CPU-23508 [005] 8407.020292: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x7ffb1000
>>> CPU-23509 [006] 8407.020338: kvm_enter_smm: vcpu 1: leaving SMM, smbase 0x7ffb3000
>>> CPU-23510 [003] 8407.020338: kvm_enter_smm: vcpu 2: leaving SMM, smbase 0x7ffb5000
>>> CPU-23508 [005] 8407.020338: kvm_enter_smm: vcpu 0: leaving SMM, smbase 0x7ffb1000
>>
>> It seems that VCPU#0 still leaves (and then re-enters) SMM while
>> VCPU#1 and VCPU#2 are firmly in SMM.
>>
>> So this series is a clear improvement, but something else remains amiss.
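The observation about VCPU#0 leaving SMM while others remain inside can be checked mechanically on longer traces. Below is a minimal sketch in Python, assuming trace lines in exactly the kvm_enter_smm format quoted above; the function name early_leavers is hypothetical. It replays the enter/leave events in order, tracks which VCPUs are currently in SMM, and records every "leaving" event that happens while at least one other VCPU is still inside:

```python
import re

# Matches e.g. "8406.920262: kvm_enter_smm: vcpu 0: leaving SMM, smbase ..."
EVENT = re.compile(
    r"(\d+\.\d+): kvm_enter_smm: vcpu (\d+): (entering|leaving) SMM"
)


def early_leavers(trace_lines):
    """Return (timestamp, vcpu, others_still_in_smm) for every 'leaving'
    event that occurs while other VCPUs are still in SMM."""
    in_smm = set()  # VCPUs currently inside SMM, per the replayed events
    found = []
    for line in trace_lines:
        match = EVENT.search(line)
        if match is None:
            continue  # not a kvm_enter_smm line
        stamp, vcpu, action = match.group(1), int(match.group(2)), match.group(3)
        if action == "entering":
            in_smm.add(vcpu)
        else:
            in_smm.discard(vcpu)
            if in_smm:
                found.append((stamp, vcpu, sorted(in_smm)))
    return found
```

Because it only replays the order in which events appear in the trace, events with identical timestamps (such as the three at 8407.020338) are taken in file order; that is an assumption of this sketch, not something the trace format guarantees.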
>>
>> If I remove Jiewen's v2 series, and apply only this one, then the
>> symptom shows up much less frequently, but it does exist:
>> - With (Jiewen's v2 + this one), testing case 13, I hit the symptom on
>>   the second resume,
>> - With just this set applied, I hit the symptom (= one AP disappearing
>>   from Linux after resume) only on the 24th resume.
>>
>> Thanks
>> Laszlo
>>
>
> _______________________________________________
> edk2-devel mailing list
> edk2-devel@lists.01.org
> https://lists.01.org/mailman/listinfo/edk2-devel
>