From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <jiewen.yao@intel.com>
Received: from mga05.intel.com (mga05.intel.com [192.55.52.43])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by ml01.01.org (Postfix) with ESMTPS id 3EB0781D70
 for <edk2-devel@ml01.01.org>; Thu, 10 Nov 2016 02:41:50 -0800 (PST)
Received: from orsmga001.jf.intel.com ([10.7.209.18])
 by fmsmga105.fm.intel.com with ESMTP; 10 Nov 2016 02:41:53 -0800
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.31,618,1473145200"; 
 d="scan'208,217";a="1057795432"
Received: from fmsmsx107.amr.corp.intel.com ([10.18.124.205])
 by orsmga001.jf.intel.com with ESMTP; 10 Nov 2016 02:41:52 -0800
Received: from fmsmsx112.amr.corp.intel.com (10.18.116.6) by
 fmsmsx107.amr.corp.intel.com (10.18.124.205) with Microsoft SMTP Server (TLS)
 id 14.3.248.2; Thu, 10 Nov 2016 02:41:52 -0800
Received: from shsmsx152.ccr.corp.intel.com (10.239.6.52) by
 FMSMSX112.amr.corp.intel.com (10.18.116.6) with Microsoft SMTP Server (TLS)
 id 14.3.248.2; Thu, 10 Nov 2016 02:41:50 -0800
Received: from shsmsx102.ccr.corp.intel.com ([169.254.2.239]) by
 SHSMSX152.ccr.corp.intel.com ([169.254.6.138]) with mapi id 14.03.0248.002;
 Thu, 10 Nov 2016 18:41:48 +0800
From: "Yao, Jiewen" <jiewen.yao@intel.com>
To: Laszlo Ersek <lersek@redhat.com>
CC: "Tian, Feng" <feng.tian@intel.com>, "edk2-devel@ml01.01.org"
 <edk2-devel@ml01.01.org>, "Kinney, Michael D" <michael.d.kinney@intel.com>,
 Paolo Bonzini <pbonzini@redhat.com>, "Fan, Jeff" <jeff.fan@intel.com>,
 "Zeng, Star" <star.zeng@intel.com>
Thread-Topic: [edk2] [PATCH V2 0/6] Enable SMM page level protection.
Thread-Index: AQHSNn5Npc+l51c2NUCmigdj/tAIBqDNyfkAgAJjohCAAHPkAIABR3PQ
Date: Thu, 10 Nov 2016 10:41:48 +0000
Message-ID: <74D8A39837DF1E4DA445A8C0B3885C50386CE375@shsmsx102.ccr.corp.intel.com>
References: <1478251854-14660-1-git-send-email-jiewen.yao@intel.com>
 <08406bf5-4377-63a1-8dd9-34479c015d4b@redhat.com>
 <74D8A39837DF1E4DA445A8C0B3885C50386C0CB8@shsmsx102.ccr.corp.intel.com>
 <e230a3ea-dfa9-4537-149d-07ceb465bc23@redhat.com>
In-Reply-To: <e230a3ea-dfa9-4537-149d-07ceb465bc23@redhat.com>
Accept-Language: zh-CN, en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-originating-ip: [10.239.127.40]
MIME-Version: 1.0
X-Content-Filtered-By: Mailman/MimeDel 2.1.21
Subject: Re: [PATCH V2 0/6] Enable SMM page level protection.
X-BeenThere: edk2-devel@lists.01.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: EDK II Development  <edk2-devel.lists.01.org>
List-Unsubscribe: <https://lists.01.org/mailman/options/edk2-devel>,
 <mailto:edk2-devel-request@lists.01.org?subject=unsubscribe>
List-Archive: <http://lists.01.org/pipermail/edk2-devel/>
List-Post: <mailto:edk2-devel@lists.01.org>
List-Help: <mailto:edk2-devel-request@lists.01.org?subject=help>
List-Subscribe: <https://lists.01.org/mailman/listinfo/edk2-devel>,
 <mailto:edk2-devel-request@lists.01.org?subject=subscribe>
X-List-Received-Date: Thu, 10 Nov 2016 10:41:50 -0000
Content-Language: en-US
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

Thanks to report case 3 issue on bugzillar.

Let's focus on Case 8.
It seems another random failure issue.

I did more test.

1)      I tested some other our internal real platform for UEFI32 OS boot. =
I cannot reproduce the ASSERT.

2)      I wrote a small test app to call ExitBootServices and send SMI. I r=
un it on current my windows QEMU but I still cannot reproduce the ASSERT.

It seem your env is the only way to repo the issue. I am trying to follow y=
our step by step to install OS on QEMU/KVM. I haven't finish all thing yet,=
 because of some network proxy issue. :(

Your information and analysis is great. It gives us some clue.

I guess the same thing as you and checked: InitializeSpinLock (mSmmMpSyncDa=
ta->CpuData[CpuIndex].Busy);

This address is initialized in InitializeMpSyncData(), with gSmmCpuPrivate-=
>SmmCoreEntryContext.NumberOfCpus which is got from MpServices->GetNumberOf=
Processors().
I do not know why this address is zero.

I also did not quite understand below log.

* CPU #0: pc=3D0x00000000c1844555 (halted) thread_id=3D7835
  CPU #1: pc=3D0x00000000c1844555 (halted) thread_id=3D7836
  CPU #2: pc=3D0x00000000c1844555 (halted) thread_id=3D7837
  CPU #3: pc=3D0x000000007ffd17ca thread_id=3D7838

As I recall, writing to B2 only cause BSP get SMI on OVMF. AP does not ente=
r SMM mode.
So why #3 can enter SMM mode? Is that expected behavior? Or unexpected beha=
vior?
If this is expected, how this happened? Does OS send SendSmiIpiAllExcluding=
Self, after ExitBootServices()?

I will see if I can finish QEMU/KVM installation tomorrow.

If you have some idea on why and how #3 enter SMM, please let us know.


Thank you
Yao Jiewen


From: Laszlo Ersek [mailto:lersek@redhat.com]
Sent: Thursday, November 10, 2016 4:46 AM
To: Yao, Jiewen <jiewen.yao@intel.com>
Cc: Tian, Feng <feng.tian@intel.com>; edk2-devel@ml01.01.org; Kinney, Micha=
el D <michael.d.kinney@intel.com>; Paolo Bonzini <pbonzini@redhat.com>; Fan=
, Jeff <jeff.fan@intel.com>; Zeng, Star <star.zeng@intel.com>
Subject: Re: [edk2] [PATCH V2 0/6] Enable SMM page level protection.

On 11/09/16 07:25, Yao, Jiewen wrote:
> Hi Laszlo
> I will fix DEBUG message issue in V3 patch.
>
> Below is rest issues:
>
>
> l  Case 13: S3 fails randomly.
> A good news: I worked with Jeff Fan to root-cause the S3 resume issue. He=
re is detail.
>
>
> 1)      We believe the dead CPU is AP. Not BSP.
> The reason is that:
>
> 1.1)   The BSP already transfer control to OS waking vector. The GDT/IDT/=
CR3 should be set by OS.
>
> 1.2)   The current dead CPU still has GDT/IDT point to a BIOS reserved me=
mory. The CS/DS/SS is typical BIOS X64 mode setting.
>
> 1.3)   The current dead CPU still has CR3 in SMM. (Which is obvious wrong=
)
>
>
> 2)      Based upon the 1), we reviewed S3 resume AP flow.
> Current BSP will wake up AP in SMRAM, for security consideration. At that=
 time, we are using SMM mode CR3. It is OK for BSP because BSP is NOT in SM=
M mode yet. Even after SMM rebase, we can still use it because SMRR is not =
set in first SMM rebase.
> Current BSP just uses its own context to initialize AP. So that AP takes =
BSP CR3, which is SMM CR3, unfortunately.
> After BSP initialized APs, the AP is put to HALT-LOOP in X64 mode. It is =
the last straw, because X64 mode halt still need paging.
>
>
> 3)      The error happen, once the AP receives an interrupt (for whatever=
 reason), AP starts executing code. However, that that time the AP might no=
t be in SMM mode. It means SMM CR3 is not available. And then we see this.
>
>
> 4)      I guess we did not see the error, or this is RANDOM issue, becaus=
e it depends on if AP receives an interrupt before BSP send INIT-SIPI-SIPI.
>
>
> 5)      The fix, I think, should be below:
> We should always put AP to protected mode, so that no paging is needed.
> We should put AP in above 1M reserved memory, instead of <1M memory, beca=
use <1M memory is restored.
>
>
> Would you please file a bugzillar? I think we need assign CPU owner to fi=
x that critical issue.
>
> There is no need to do more investigation. Thanks for your great help on =
that. :)

Thank you for your help!

I filed <https://bugzilla.tianocore.org/show_bug.cgi?id=3D216>. The title i=
s

    BSP exits SMM and closes SMRAM on the S3 resume path before
    meeting with AP(s)

I hope the title is mostly right. I didn't add any other details (I
haven't gone through the thread in detail yet, and without that I can't
even write up a semi-reasonable report myself). Instead, I referenced
this message of yours in the report, and I also linked Paolo's analysis
from elsewhere in the thread. I hope this will do for the report.

(Also, thank you Paolo, from the amazing analysis -- I haven't digested
it yet, but I can already tell it's amazing! :))

> l  Case 17 - I do not think it is a real issue, because SMM is out of res=
ource.
>
>
> l  Case 8 - that is a very weird issue. I talk with Jeff again. I do not =
have a clear clue yet.
>> ASSERT MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(73): S=
pinLock !=3D ((void *) 0)
> Here is code. We do not know why there is some code need InitializeSpinLo=
ck after ExitBootServices.
> SPIN_LOCK *
> EFIAPI
> InitializeSpinLock (
>   OUT      SPIN_LOCK                 *SpinLock
>   )
> {
>   ASSERT (SpinLock !=3D NULL);
>
>   _ReadWriteBarrier();
>   *SpinLock =3D SPIN_LOCK_RELEASED;
>   _ReadWriteBarrier();
>
>   return SpinLock;
> }
>
> If you can have a quick check on below, that would be great.
>
> 1)      Which processor triggers this ASSERT? BSP or AP.
>
> 2)      Which module triggers this ASSERT? Which module contains current =
RIP value?

First, one additional piece of info I have learned is that the issue
does not always present itself. Sometimes the boot just works fine,
other times the assert fires.

Using the QEMU monitor, I managed to get the following information with
the "info cpus" command:

* CPU #0: pc=3D0x00000000c1844555 (halted) thread_id=3D7835
  CPU #1: pc=3D0x00000000c1844555 (halted) thread_id=3D7836
  CPU #2: pc=3D0x00000000c1844555 (halted) thread_id=3D7837
  CPU #3: pc=3D0x000000007ffd17ca thread_id=3D7838

VCPU#3 is an AP (the last AP), I think. The instruction pointer seems to
point into SMRAM again.

In the OVMF log, I see

Loading SMM driver at 0x0007FFC9000 EntryPoint=3D0x0007FFC9253
PiSmmCpuDxeSmm.efi

So the offset into PiSmmCpuDxeSmm.efi is 0x87CA (or, relative to the
entry point, 0x8577, 0x253 bytes less).

Running

  objdump -x -S Build/OvmfIa32/DEBUG_GCC48/IA32/PiSmmCpuDxeSmm.debug

first I see confirmation that

  start address 0x00000253

and then

000087bd <CpuDeadLoop>:
VOID
EFIAPI
CpuDeadLoop (
  VOID
  )
{
    87bd:       55                      push   %ebp
    87be:       89 e5                   mov    %esp,%ebp
    87c0:       83 ec 10                sub    $0x10,%esp
  volatile UINTN  Index;

  for (Index =3D 0; Index =3D=3D 0;);
    87c3:       c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%ebp)
    87ca:       8b 45 fc                mov    -0x4(%ebp),%eax  <-- HERE
    87cd:       85 c0                   test   %eax,%eax
    87cf:       74 f9                   je     87ca <CpuDeadLoop+0xd>
}
    87d1:       c9                      leave
    87d2:       c3                      ret

This seems consistent with an assertion failure.

I searched UefiCpuPkg/PiSmmCpuDxeSmm/ for InitializeSpinLock(), and the
SmiRendezvous() function [UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c] looks
like a possible caller:

      //
      // The BUSY lock is initialized to Released state. This needs to
      // be done early enough to be ready for BSP's SmmStartupThisAp()
      // call. E.g., with Relaxed AP flow, SmmStartupThisAp() may be
      // called immediately after AP's present flag is detected.
      //
      InitializeSpinLock (mSmmMpSyncData->CpuData[CpuIndex].Busy);

Just a guess, of course.

> At same time, all my OS test is on real platform. I have not setup OVMF e=
nv to run an OS yet.
> If you can share a step by step to me, that would be great.

(1) Grab a host computer with a CPU that supports VMX and EPT.

(2) Download and install Fedora 24 (for example):

https://getfedora.org/en/workstation/download/
http://docs.fedoraproject.org/install-guide

(3) Install the "qemu-system-x86" package with DNF

dnf install qemu-system-x86

(4) clone edk2 with git

(5) embed OpenSSL optionally (for secure boot); see
"CryptoPkg/Library/OpensslLib/Patch-HOWTO.txt"

(6) build OVMF:

source edksetup.sh
make -C "$EDK_TOOLS_PATH"

# Ia32
build \
  -a IA32 \
  -p OvmfPkg/OvmfPkgIa32.dsc \
  -D SMM_REQUIRE -D SECURE_BOOT_ENABLE \
  -t GCC5 -b DEBUG

# Ia32X64
build \
  -a IA32 -a X64 \
  -p OvmfPkg/OvmfPkgIa32X64.dsc \
  -D SMM_REQUIRE -D SECURE_BOOT_ENABLE \
  -t GCC5 -b DEBUG

(7) Create disk images:

qemu-img create -f qcow2 -o compat=3D1.1 -o cluster_size=3D65536 \
  -o preallocation=3Dmetadata -o lazy_refcounts=3Don disk-ia32.img 100G

qemu-img create -f qcow2 -o compat=3D1.1 -o cluster_size=3D65536 \
  -o preallocation=3Dmetadata -o lazy_refcounts=3Don disk-ia32x64.img 100G

(8) For a 64-bit guest OS, you can again use the Fedora 24 Workstation
that you downloaded already (the ISO image).

For 32-bit guest OS, this one used to work:

https://www.happyassassin.net/fedlet-a-fedora-remix-for-bay-trail-tablets/

minimally the 20141209 release. Hm... actually, I think the maintainer
of that image has discontinued the downloadable files :(

So, I don't know what 32-bit UEFI OS to recommend for testing.

32-bit Windows doesn't boot on OVMF (I looked into that earlier, several
times, with some help from a Microsoft developer, but we couldn't solve
it), so I can't recommend Windows as an alternative.

Perhaps you can use

https://linuxiumcomau.blogspot.com/2016/10/running-ubuntu-on-intel-bay-trai=
l-and.html

as a 32-bit guest OS, I never tried.

(9) Anyway, once you have an installer ISO, set the "ISO" environment
variable to the ISO image's full pathname, and then run QEMU like this:

# Settings for Ia32 only:

ISO=3D...
DISK=3D.../disk-ia32.img
FW=3D.../Build/OvmfIa32/DEBUG_GCC5/FV/OVMF_CODE.fd
TEMPLATE=3D.../Build/OvmfIa32/DEBUG_GCC5/FV/OVMF_VARS.fd
VARS=3Dvars-32.fd
QEMU_COMMAND=3D"qemu-system-i386 -cpu coreduo,-nx"
DEBUG=3Ddebug-32.log

# Settings for Ia32X64 only:

ISO=3D...
DISK=3D.../disk-ia32x64.img
FW=3D.../Build/Ovmf3264/DEBUG_GCC5/FV/OVMF_CODE.fd
TEMPLATE=3D.../Build/Ovmf3264/DEBUG_GCC5/FV/OVMF_VARS.fd
VARS=3Dvars-3264.fd
QEMU_COMMAND=3Dqemu-system-x86_64
DEBUG=3Ddebug-3264.log

# Common commands for both target arches:

# create variable store from varstore template
# if the former doesn't exist yet
if ! [ -e "$VARS" ]; then
  cp -- "$TEMPLATE" "$VARS"
fi

$QEMU_COMMAND \
  -machine q35,smm=3Don,accel=3Dkvm \
  -m 4096 \
  -smp sockets=3D1,cores=3D2,threads=3D2 \
  -global driver=3Dcfi.pflash01,property=3Dsecure,value=3Don \
  -drive if=3Dpflash,format=3Draw,unit=3D0,file=3D${FW},readonly=3Don \
  -drive if=3Dpflash,format=3Draw,unit=3D1,file=3D${VARS} \
  \
  -chardev file,id=3Ddebugfile,path=3D$DEBUG \
  -device isa-debugcon,iobase=3D0x402,chardev=3Ddebugfile \
  \
  -chardev stdio,id=3Dchar0,signal=3Doff,mux=3Don \
  -mon chardev=3Dchar0,mode=3Dreadline,default \
  -serial chardev:char0 \
  \
  -drive id=3Diso,if=3Dnone,format=3Draw,readonly,file=3D$ISO \
  -drive id=3Ddisk,if=3Dnone,format=3Dqcow2,file=3D$DISK \
  \
  -device virtio-scsi-pci,id=3Dscsi0 \
  -device scsi-cd,drive=3Diso,bus=3Dscsi0.0,bootindex=3D2 \
  -device scsi-hd,drive=3Ddisk,bus=3Dscsi0.0,bootindex=3D1 \
  \
  -device VGA

This will capture the OVMF debug output in the $DEBUG file. Also, the
terminal where you run the command can be switched between the guest's
serial console and the QEMU monitor with [Ctrl-A C].

Thanks
Laszlo

>
> Thank you
> Yao Jiewen
>
> From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of La=
szlo Ersek
> Sent: Tuesday, November 8, 2016 9:22 AM
> To: Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com>>
> Cc: Tian, Feng <feng.tian@intel.com<mailto:feng.tian@intel.com>>; edk2-de=
vel@ml01.01.org<mailto:edk2-devel@ml01.01.org>; Kinney, Michael D <michael.=
d.kinney@intel.com<mailto:michael.d.kinney@intel.com>>; Paolo Bonzini <pbon=
zini@redhat.com<mailto:pbonzini@redhat.com>>; Fan, Jeff <jeff.fan@intel.com=
<mailto:jeff.fan@intel.com>>; Zeng, Star <star.zeng@intel.com<mailto:star.z=
eng@intel.com>>
> Subject: Re: [edk2] [PATCH V2 0/6] Enable SMM page level protection.
>
> On 11/04/16 10:30, Jiewen Yao wrote:
>> =3D=3D=3D=3D below is V2 description =3D=3D=3D=3D
>> 1) PiSmmCpu: resolve OVMF multiple processors boot hang issue.
>> 2) PiSmmCpu: Add debug info on StartupAp() fails.
>> 3) PiSmmCpu: Add ASSERT for AllocatePages().
>> 4) PiSmmCpu: Add protection detail in commit message.
>> 5) UefiCpuPkg.dsc: Add page table footprint info in commit message.
>>
>> =3D=3D=3D=3D below is V1 description =3D=3D=3D=3D
>> This series patch enables SMM page level protection.
>> Features are:
>> 1) PiSmmCore reports SMM PE image code/data information
>> in EdkiiPiSmmMemoryAttributeTable, if the SMM image is page aligned.
>> 2) PiSmmCpu consumes EdkiiPiSmmMemoryAttributeTable
>> and set XD for data page and RO for code page.
>> 3) PiSmmCpu enables Static Paging for X64 according to
>> PcdCpuSmmStaticPageTable. If it is true, 1G paging for above 4G
>> is used as long as it is supported.
>> 4) PiSmmCpu sets importance data structure to be read only,
>> such as Gdt, Idt, SmmEntrypoint, and PageTable itself.
>>
>> tested platform:
>> 1) Intel internal platform (X64).
>> 2) EDKII Quark IA32
>> 3) EDKII Vlv2  X64
>> 4) EDKII OVMF IA32 and IA32X64. (with -smp 8)
>>
>> Cc: Jeff Fan <jeff.fan@intel.com<mailto:jeff.fan@intel.com<mailto:jeff.f=
an@intel.com%3cmailto:jeff.fan@intel.com>>>
>> Cc: Feng Tian <feng.tian@intel.com<mailto:feng.tian@intel.com<mailto:fen=
g.tian@intel.com%3cmailto:feng.tian@intel.com>>>
>> Cc: Star Zeng <star.zeng@intel.com<mailto:star.zeng@intel.com<mailto:sta=
r.zeng@intel.com%3cmailto:star.zeng@intel.com>>>
>> Cc: Michael D Kinney <michael.d.kinney@intel.com<mailto:michael.d.kinney=
@intel.com<mailto:michael.d.kinney@intel.com%3cmailto:michael.d.kinney@inte=
l.com>>>
>> Cc: Laszlo Ersek <lersek@redhat.com<mailto:lersek@redhat.com<mailto:lers=
ek@redhat.com%3cmailto:lersek@redhat.com>>>
>> Contributed-under: TianoCore Contribution Agreement 1.0
>> Signed-off-by: Jiewen Yao <jiewen.yao@intel.com<mailto:jiewen.yao@intel.=
com<mailto:jiewen.yao@intel.com%3cmailto:jiewen.yao@intel.com>>>
>
> I have new test results. Let's start with the table again:
>
> Legend:
>
> - "untested" means the test was not executed because the same test
>   failed or proved unreliable in a less demanding configuration already,
>
> - "n/a" means a setting or test case was impossible,
>
> - "fail" and "unreliable" (lower case) are outside the scope of this
>   series; they either capture the pre-series status, or are expected
>   even with the series applied due to the pre-series status,
>
> - "FAIL" and "UNRELIABLE" mean regressions caused (or exposed) by the
>   series.
>
> In all cases, 36 bits were used as address width in the CPU HOB (--> up
> to 64GB guest-phys address space).
>
>    series  OVMF                                                          =
    VCPU     boot       S3 resume
>  # applied platform PcdCpuMaxLogicalProcessorNumber PcdCpuSmmStaticPageTa=
ble topology result     result
> -- ------- -------- ------------------------------- ---------------------=
--- -------- ------     ---------
>  1 no      Ia32      64                             n/a                  =
    1x2x2    pass       unreliable
>  2 no      Ia32     255                             n/a                  =
    52x2x2   pass       untested
>  3 no      Ia32     255                             n/a                  =
    53x2x2   unreliable untested
>  4 no      Ia32X64   64                             n/a                  =
    1x2x2    pass       unreliable
>  5 no      Ia32X64  255                             n/a                  =
    52x2x2   pass       untested
>  6 no      Ia32X64  255                             n/a                  =
    54x2x2   fail       n/a
>  7 v2      Ia32      64                             FALSE                =
    1x2x2    pass       untested
>  8 v2      Ia32      64                             TRUE                 =
    1x2x2    FAIL       untested
>  9 v2      Ia32     255                             FALSE                =
    52x2x2   pass       untested
> 10 v2      Ia32     255                             FALSE                =
    53x2x2   untested   untested
> 11 v2      Ia32     255                             TRUE                 =
    52x2x2   untested   untested
> 12 v2      Ia32     255                             TRUE                 =
    53x2x2   untested   untested
> 13 v2      Ia32X64   64                             FALSE                =
    1x2x2    pass       unreliable
> 14 v2      Ia32X64   64                             TRUE                 =
    1x2x2    pass       untested
> 15 v2      Ia32X64  255                             FALSE                =
    52x2x2   pass       untested
> 16 v2      Ia32X64  255                             FALSE                =
    54x2x2   untested   untested
> 17 v2      Ia32X64  255                             TRUE                 =
    52x2x2   FAIL       untested
> 18 v2      Ia32X64  255                             TRUE                 =
    54x2x2   untested   untested
>
> * Case 8: this test case failed with v2 as well, but this time with
>   different symptoms:
>
>> FSOpen: Open '\EFI\fedora\grubia32.efi' Success
>> InstallProtocolInterface: [EfiLoadedImageProtocol] 7E4037A8
>> Loading driver at 0x0007DA7F000 EntryPoint=3D0x0007DA7F400
>> InstallProtocolInterface: [EfiLoadedImageDevicePathProtocol] 7E403A90
>> PixelBlueGreenRedReserved8BitPerColor
>> ConvertPages: Incompatible memory types
>> PixelBlueGreenRedReserved8BitPerColor
>> ConvertPages: Incompatible memory types
>> SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0
>> MpInitExitBootServicesCallback() done!
>> ASSERT MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(73): S=
pinLock !=3D ((void *) 0)
>
>   I didn't try to narrow this down.
>
> * Case 13 (the "unreliable S3 resume" case): Here the news are both bad
>   and good. The good news is for Jiewen: this patch series does not
>   cause the unreliability, it "only" amplifies it severely. The bad news
>   is correspondingly for everyone else: S3 resume is actually unreliable
>   even in case 4, that is, without this series applied, it's just the
>   failure rate is much-much lower.
>
>   Namely, in my new testing, in case 13, S3 resume failed 8 times out of
>   21 tries. (I stopped testing at the 8th failure.)
>
>   Whereas in case 4, S3 resume failed with *identical symptoms* (e.g.,
>   exact same RIP=3D000000000009f0fd), 1 time out of 55 tries. (It was try
>   #12 that failed; I continued testing and aborted the test after the
>   55th try.)
>
>   So, while the series hugely amplifies the failure rate, the failure
>   does exist without the series. Which is why I modified the case 4
>   results in the table, and also lower-cased the word "unreliable" in
>   case 13.
>
>   Below I will return to this problem separately; let's go over the rest
>   of the table first.
>
> * Case 17: I guess this is not a real failure, I'm just including it for
>   completeness, as PcdCpuSmmStaticPageTable=3D=3DTRUE is known to present
>   additional SMRAM demand (see the commit message on patch V2 4/6). This
>   case fails with
>
>> SmmLockBox Command - 4
>> SmmLockBoxSmmLib SetLockBoxAttributes - Enter
>> SmmLockBoxSmmLib SetLockBoxAttributes - Exit (Success)
>> SmmLockBox SmmLockBoxHandler Exit
>> SmmLockBoxDxeLib SetLockBoxAttributes - Exit (Success)
>> SmmInstallProtocolInterface: [EfiSmmReadyToLockProtocol] 0
>> ASSERT UefiCpuPkg/PiSmmCpuDxeSmm/CpuS3.c(892): mGdtForAp !=3D ((void *) =
0)
>
>   which is an SMRAM allocation failure. If I lower the VCPU count to
>   50x2x2, then the guest boots fine.
>
> ----*----
>
> Before I get to the S3 resume problem (which, again, reproduces without
> this series, although much less frequently), I'd like to comment on the
> removal of the ASSERT(), from v1 to v2, in the FlushTlbForAll()
> function, on the return value of SmmBlockingStartupThisAp(). This change
> allows v2 to proceed past that point; however, I'm seeing a whole lot of
>
>> !mSmmMpSyncData->CpuData[1].Present
>> !mSmmMpSyncData->CpuData[2].Present
>> !mSmmMpSyncData->CpuData[3].Present
>> ...
>
> messages in the OVMF boot log, interspersed with
>
>> ConvertPageEntryAttribute 0x7F92B067->0x7F92B065
>
> style messages. (That is, one error message for each AP, per
> ConvertPageEntryAttribute() message.)
>
> Is this okay / intentional? The number of these messages can go up to
> several thousands and that sort of drowns out everything else in the
> log.
>
> It's also not easy to mask the message, because it's logged on the
> DEBUG_ERROR level.
>
> ----*----
>
> * Okay, so the S3 problem. Last time I suspected that the failure point
>   (RIP=3D9f0fd) was in the Linux guest's S3 wakeup vector (which starts a=
t
>   9A1D0, according to the OVMF log). In order to test this idea, I
>   exercised this series with S3 against a Windows 8.1 guest (--> case 13
>   again). The failure reproduced on the second S3 resume, with identical
>   RIP, despite the Windows wakeup vector being located elsewhere (at
>   0x1000).
>
>   Quoting the OVMF log leading up to the resume:
>
>> Call AsmDisablePaging64() to return to S3 Resume in PEI Phase
>> Install PPI: [PeiPostScriptTablePpi]
>> Install PPI: [EfiEndOfPeiSignalPpi]
>> Notify: PPI Guid: [EfiEndOfPeiSignalPpi], Peim notify entry point: 85789=
5
>> PeiMpInitLib: CpuMpEndOfPeiCallback () invoked
>> Transfer to 16bit OS waking vector - 1000
>
>   QEMU log (same as before):
>
>> KVM internal error. Suberror: 1
>> KVM internal error. Suberror: 1
>> emulation failure
>> emulation failure
>> RAX=3D0000000000000002 RBX=3D0000000000000000 RCX=3D000000007ffdb168 RDX=
=3D000000007fedd058
>> RSI=3D0000000000000004 RDI=3D000000007fedd040 RBP=3D0000000000000000 RSP=
=3D000000007e1a7000
>> R8 =3D0000000000000000 R9 =3D0000000000000000 R10=3D0000000000000000 R11=
=3D0000000000000000
>> R12=3D0000000000000000 R13=3D0000000000000000 R14=3D0000000000000000 R15=
=3D0000000000000000
>> RIP=3D000000000009f0fd RFL=3D00010006 [-----P-] CPL=3D0 II=3D0 A20=3D1 S=
MM=3D0 HLT=3D0
>> ES =3D0030 0000000000000000 ffffffff 00c09300 DPL=3D0 DS   [-WA]
>> CS =3D0038 0000000000000000 ffffffff 00a09b00 DPL=3D0 CS64 [-RA]
>> SS =3D0030 0000000000000000 ffffffff 00c09300 DPL=3D0 DS   [-WA]
>> DS =3D0030 0000000000000000 ffffffff 00c09300 DPL=3D0 DS   [-WA]
>> FS =3D0018 0000000000000000 ffffffff 00c09300 DPL=3D0 DS   [-WA]
>> GS =3D0018 0000000000000000 ffffffff 00c09300 DPL=3D0 DS   [-WA]
>> LDT=3D0000 0000000000000000 0000ffff 00008200 DPL=3D0 LDT
>> TR =3D0000 0000000000000000 0000ffff 00008b00 DPL=3D0 TSS64-busy
>> GDT=3D     000000007f294000 00000047
>> IDT=3D     000000007f294048 00000fff
>> CR0=3De0000011 CR2=3D0000000000000000 CR3=3D000000007ff7f000 CR4=3D00000=
220
>> DR0=3D0000000000000000 DR1=3D0000000000000000 DR2=3D0000000000000000 DR3=
=3D0000000000000000
>> DR6=3D00000000ffff0ff0 DR7=3D0000000000000400
>> EFER=3D0000000000000500
>> Code=3D?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> =
?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? =
?? ?? ?? ??
>> RAX=3D0000000000000001 RBX=3D0000000000000000 RCX=3D000000007ffdb168 RDX=
=3D000000007fedd070
>> RSI=3D0000000000000004 RDI=3D000000007fedd058 RBP=3D0000000000000000 RSP=
=3D000000007e19f000
>> R8 =3D0000000000000000 R9 =3D0000000000000000 R10=3D0000000000000000 R11=
=3D0000000000000000
>> R12=3D0000000000000000 R13=3D0000000000000000 R14=3D0000000000000000 R15=
=3D0000000000000000
>> RIP=3D000000000009f0fd RFL=3D00010006 [-----P-] CPL=3D0 II=3D0 A20=3D1 S=
MM=3D0 HLT=3D0
>> ES =3D0030 0000000000000000 ffffffff 00c09300 DPL=3D0 DS   [-WA]
>> CS =3D0038 0000000000000000 ffffffff 00a09b00 DPL=3D0 CS64 [-RA]
>> SS =3D0030 0000000000000000 ffffffff 00c09300 DPL=3D0 DS   [-WA]
>> DS =3D0030 0000000000000000 ffffffff 00c09300 DPL=3D0 DS   [-WA]
>> FS =3D0018 0000000000000000 ffffffff 00c09300 DPL=3D0 DS   [-WA]
>> GS =3D0018 0000000000000000 ffffffff 00c09300 DPL=3D0 DS   [-WA]
>> LDT=3D0000 0000000000000000 0000ffff 00008200 DPL=3D0 LDT
>> TR =3D0000 0000000000000000 0000ffff 00008b00 DPL=3D0 TSS64-busy
>> GDT=3D     000000007f294000 00000047
>> IDT=3D     000000007f294048 00000fff
>> CR0=3De0000011 CR2=3D0000000000000000 CR3=3D000000007ff7f000 CR4=3D00000=
220
>> DR0=3D0000000000000000 DR1=3D0000000000000000 DR2=3D0000000000000000 DR3=
=3D0000000000000000
>> DR6=3D00000000ffff0ff0 DR7=3D0000000000000400
>> EFER=3D0000000000000500
>> Code=3D?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> =
?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? =
?? ?? ?? ??
>
>   So, we can exclude the suspicion that the problem is guest OS
>   dependent.
>
> * Then I looked for the base address of the page containing the
>   RIP=3D9f0fd address, in earlier parts of the OVMF log, on the hunch tha=
t
>   some firmware component might have allocated that area actually. Here
>   we go:
>
>> Loading PEIM at 0x000008552C0 EntryPoint=3D0x000008554E0 CpuMpPei.efi
>> AP Loop Mode is 1
>> WakeupBufferStart =3D 9F000, WakeupBufferSize =3D 1000
>
>   That is, the failure hits (when it hits -- not always) in the area
>   where the CpuMpPei driver *borrows* memory for the startup vector of
>   the APs, for the purposes of the MP service PPI. ("Wakeup" is an
>   overloaded word here; the "wakeup buffer" has nothing to do with S3
>   resume, it just serves for booting the APs temporarily in PEI, for
>   implementing the MP service PPI.)
>
>   When exiting the PEI phase (on the S3 resume path), CpuMpPei restores
>   the original contents of this area. This occurs just before
>   transfering control to the guest OS wakeup vector: see the
>   "EfiEndOfPeiSignalPpi" and "CpuMpEndOfPeiCallback" strings just above,
>   quoted from the OVMF log.
>
>   I documented (parts of) this logic in OVMF commit
>
>     https://github.com/tianocore/edk2/commit/e3e3090a959a0
>
>   (see the code comments as well).
>
> * At that time, I thought to have identified a memory management bug in
>   CpuMpPei; see the following discussion and bug report for details:
>
>     https://www.mail-archive.com/edk2-devel@lists.01.org/msg13892.html
>     https://bugzilla.tianocore.org/show_bug.cgi?id=3D67
>
>   However, with the extraction / introduction of MpInitLib, this issue
>   has been fixed: GetWakeupBuffer() now calls
>   CheckOverlapWithAllocatedBuffer(), so that "memory management bug" is
>   no longer; we shouldn't be looking there for the root cause.
>
> * Either way, I don't understand why anything would want to execute code
>   in the one page that happens to host the MP services PPI startup
>   buffer for APs during PEI.
>
>   Not understanding the "why", I looked at the "what", and resorted to
>   tracing KVM. Because the problem readily reproduces with this series
>   applied (case 13), it wasn't hard to start the tracing while the guest
>   was suspended, and capture just the actions that led from the
>   KVM-level wakeup to the failure.
>
>   The QEMU state dumps are visible above in the email. I've also
>   uploaded the compressed OVMF log and the textual KVM trace here:
>
>     http://people.redhat.com/lersek/s3-crash-8d1dfed7-ca92-4e25-8d2b-b1c9=
ac2a53db/
>
>   I sincerely hope that Paolo will have a field day with the KVM trace
>   :) I managed to identify the following curiosities (remember this is
>   all on the S3 resume path):
>
>   * First, the VCPUs (there are four of them) enter and leave SMM in a
>     really funky pattern:
>
>       vcpu#0  vcpu#1  vcpu#2  vcpu#3
>       ------  ------  ------  ------
>               enter
>                |
>               leave
>
>                       enter
>                         |
>                       leave
>
>                               enter
>                                 |
>                               leave
>
>       enter
>         |
>       leave
>
>               enter           enter
>        enter    |     enter     |
>          |      |       |       |
>        leave    |       |       |
>                 |       |       |
>        enter    |       |       |
>          |      |       |       |
>        leave  leave   leave   leave
>
>     That is, first we have each VCPU enter and leave SMM in complete
>     isolation (1, 2, 3, 0). Then VCPUs 1 and 3 enter SMM together, soon
>     followed by VPCUS 0 and 2, also together. VCPU#0 drops out of SMM
>     temporarily (it comes back in later), while the other three remain
>     in SMM. Finally all four of them leave SMM together.
>
>     After which the problem occurs.
>
>   * Second, the instruction that causes things to blow up is <0f aa>,
>     i.e., RSM. I have absolutely no clue why RSM is executed:
>
>     (a) in the area that used to host the AP startup routine for the MP
>     services PPI -- note that we also have "Transfer to 16bit OS waking
>     vector" in the log, so CpuMpEndOfPeiCallback() restores the borrowed
>     area completeley! --,
>
>     (b) and why *after* all four VCPUs have just left SMM, together.
>
>   * The RSM instruction is handled successfully elsewhere, for example
>     when all four VCPUs leave SMM, at the bottom of the diagram above:
>
>> CPU-24447 [002] 39841.982810: kvm_emulate_insn:     0:7ffbf179: 0f aa
>> CPU-24446 [000] 39841.982810: kvm_emulate_insn:     0:7ffbd179: 0f aa
>> CPU-24445 [005] 39841.982810: kvm_emulate_insn:     0:7ffbb179: 0f aa
>> CPU-24444 [006] 39841.982811: kvm_emulate_insn:     0:7ffb9179: 0f aa
>
>   * The guest-phys address 7ff7f000 that we see just before the error:
>
>> CPU-24447 [002] 39841.982825: kvm_page_fault:       address 7ff7f000 err=
or_code 83
>> CPU-24446 [000] 39841.982825: kvm_page_fault:       address 7ff7f000 err=
or_code 83
>> CPU-24447 [002] 39841.982826: kvm_emulate_insn:     0:9f0fd: 0f aa
>> CPU-24444 [006] 39841.982827: kvm_exit:             reason EXTERNAL_INTE=
RRUPT rip 0xffffffff813a954f info 0 800000fc
>> CPU-24447 [002] 39841.982827: kvm_emulate_insn:     0:9f0fd: 0f aa FAIL
>> CPU-24447 [002] 39841.982827: kvm_userspace_exit:   reason KVM_EXIT_INTE=
RNAL_ERROR (17)
>
>     can be found higher up in the trace; namely, it is written to CR3
>     several times. It's the root of the page tables.
>
>   * The 7F80_1000..7FFF_FFFF guest-phys addresses are all in SMRAM.
>
> * I also tried the "info tlb" monitor command, via "virsh
>   qemu-monitor-command --hmp", while the guest was auto-paused after the
>   crash.
>
>   I cannot provide results: QEMU appeared to return a message that would
>   be longer than 16MB after encoding by libvirt, and libvirt rejected
>   that ("Unable to encode message payload", see VIR_NET_MESSAGE_MAX).
>
>   Anyway, the KVM trace, and the QEMU register dump, look consistent
>   with what Paolo said about "Code=3D?? ?? ??...":
>
>     The question marks usually mean that the page tables do not map a
>     page at that address.
>
>   CR3=3D000000007ff7f000 points into SMRAM, but we are outside of SMM
>   (SMM=3D0). We can't translate *any* guest-virtual address, as we can't
>   even begin walking the page tables.
>
> Thanks
> Laszlo
> _______________________________________________
> edk2-devel mailing list
> edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org<mailto:edk2-devel@=
lists.01.org%3cmailto:edk2-devel@lists.01.org>>
> https://lists.01.org/mailman/listinfo/edk2-devel
> _______________________________________________
> edk2-devel mailing list
> edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>
> https://lists.01.org/mailman/listinfo/edk2-devel
>