From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <jiewen.yao@intel.com>
Received: from mga02.intel.com (mga02.intel.com [134.134.136.20])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by ml01.01.org (Postfix) with ESMTPS id 8088081D78
 for <edk2-devel@ml01.01.org>; Tue,  8 Nov 2016 22:25:16 -0800 (PST)
Received: from orsmga002.jf.intel.com ([10.7.209.21])
 by orsmga101.jf.intel.com with ESMTP; 08 Nov 2016 22:25:19 -0800
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.31,465,1473145200"; d="scan'208,217";a="1601988"
Received: from fmsmsx106.amr.corp.intel.com ([10.18.124.204])
 by orsmga002.jf.intel.com with ESMTP; 08 Nov 2016 22:25:19 -0800
Received: from fmsmsx124.amr.corp.intel.com (10.18.125.39) by
 FMSMSX106.amr.corp.intel.com (10.18.124.204) with Microsoft SMTP Server (TLS)
 id 14.3.248.2; Tue, 8 Nov 2016 22:25:18 -0800
Received: from shsmsx104.ccr.corp.intel.com (10.239.4.70) by
 fmsmsx124.amr.corp.intel.com (10.18.125.39) with Microsoft SMTP Server (TLS)
 id 14.3.248.2; Tue, 8 Nov 2016 22:25:17 -0800
Received: from shsmsx102.ccr.corp.intel.com ([169.254.2.206]) by
 SHSMSX104.ccr.corp.intel.com ([169.254.5.209]) with mapi id 14.03.0248.002;
 Wed, 9 Nov 2016 14:25:15 +0800
From: "Yao, Jiewen" <jiewen.yao@intel.com>
To: Laszlo Ersek <lersek@redhat.com>
CC: "Tian, Feng" <feng.tian@intel.com>, "edk2-devel@ml01.01.org"
 <edk2-devel@ml01.01.org>, "Kinney, Michael D" <michael.d.kinney@intel.com>,
 Paolo Bonzini <pbonzini@redhat.com>, "Fan, Jeff" <jeff.fan@intel.com>,
 "Zeng, Star" <star.zeng@intel.com>
Thread-Topic: [edk2] [PATCH V2 0/6] Enable SMM page level protection.
Thread-Index: AQHSNn5Npc+l51c2NUCmigdj/tAIBqDNyfkAgAJjohA=
Date: Wed, 9 Nov 2016 06:25:15 +0000
Message-ID: <74D8A39837DF1E4DA445A8C0B3885C50386C0CB8@shsmsx102.ccr.corp.intel.com>
References: <1478251854-14660-1-git-send-email-jiewen.yao@intel.com>
 <08406bf5-4377-63a1-8dd9-34479c015d4b@redhat.com>
In-Reply-To: <08406bf5-4377-63a1-8dd9-34479c015d4b@redhat.com>
Accept-Language: zh-CN, en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-originating-ip: [10.239.127.40]
MIME-Version: 1.0
X-Content-Filtered-By: Mailman/MimeDel 2.1.21
Subject: Re: [PATCH V2 0/6] Enable SMM page level protection.
X-BeenThere: edk2-devel@lists.01.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: EDK II Development  <edk2-devel.lists.01.org>
List-Unsubscribe: <https://lists.01.org/mailman/options/edk2-devel>,
 <mailto:edk2-devel-request@lists.01.org?subject=unsubscribe>
List-Archive: <http://lists.01.org/pipermail/edk2-devel/>
List-Post: <mailto:edk2-devel@lists.01.org>
List-Help: <mailto:edk2-devel-request@lists.01.org?subject=help>
List-Subscribe: <https://lists.01.org/mailman/listinfo/edk2-devel>,
 <mailto:edk2-devel-request@lists.01.org?subject=subscribe>
X-List-Received-Date: Wed, 09 Nov 2016 06:25:16 -0000
Content-Language: en-US
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

Hi Laszlo
I will fix DEBUG message issue in V3 patch.

Below is rest issues:


l  Case 13: S3 fails randomly.
A good news: I worked with Jeff Fan to root-cause the S3 resume issue. Here=
 is detail.


1)      We believe the dead CPU is AP. Not BSP.
The reason is that:

1.1)   The BSP already transfer control to OS waking vector. The GDT/IDT/CR=
3 should be set by OS.

1.2)   The current dead CPU still has GDT/IDT point to a BIOS reserved memo=
ry. The CS/DS/SS is typical BIOS X64 mode setting.

1.3)   The current dead CPU still has CR3 in SMM. (Which is obvious wrong)


2)      Based upon the 1), we reviewed S3 resume AP flow.
Current BSP will wake up AP in SMRAM, for security consideration. At that t=
ime, we are using SMM mode CR3. It is OK for BSP because BSP is NOT in SMM =
mode yet. Even after SMM rebase, we can still use it because SMRR is not se=
t in first SMM rebase.
Current BSP just uses its own context to initialize AP. So that AP takes BS=
P CR3, which is SMM CR3, unfortunately.
After BSP initialized APs, the AP is put to HALT-LOOP in X64 mode. It is th=
e last straw, because X64 mode halt still need paging.


3)      The error happen, once the AP receives an interrupt (for whatever r=
eason), AP starts executing code. However, that that time the AP might not =
be in SMM mode. It means SMM CR3 is not available. And then we see this.


4)      I guess we did not see the error, or this is RANDOM issue, because =
it depends on if AP receives an interrupt before BSP send INIT-SIPI-SIPI.


5)      The fix, I think, should be below:
We should always put AP to protected mode, so that no paging is needed.
We should put AP in above 1M reserved memory, instead of <1M memory, becaus=
e <1M memory is restored.


Would you please file a bugzillar? I think we need assign CPU owner to fix =
that critical issue.

There is no need to do more investigation. Thanks for your great help on th=
at. :)


l  Case 17 - I do not think it is a real issue, because SMM is out of resou=
rce.


l  Case 8 - that is a very weird issue. I talk with Jeff again. I do not ha=
ve a clear clue yet.
> ASSERT MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(73): Sp=
inLock !=3D ((void *) 0)
Here is code. We do not know why there is some code need InitializeSpinLock=
 after ExitBootServices.
SPIN_LOCK *
EFIAPI
InitializeSpinLock (
  OUT      SPIN_LOCK                 *SpinLock
  )
{
  ASSERT (SpinLock !=3D NULL);

  _ReadWriteBarrier();
  *SpinLock =3D SPIN_LOCK_RELEASED;
  _ReadWriteBarrier();

  return SpinLock;
}

If you can have a quick check on below, that would be great.

1)      Which processor triggers this ASSERT? BSP or AP.

2)      Which module triggers this ASSERT? Which module contains current RI=
P value?

At same time, all my OS test is on real platform. I have not setup OVMF env=
 to run an OS yet.
If you can share a step by step to me, that would be great.

Thank you
Yao Jiewen

From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Lasz=
lo Ersek
Sent: Tuesday, November 8, 2016 9:22 AM
To: Yao, Jiewen <jiewen.yao@intel.com>
Cc: Tian, Feng <feng.tian@intel.com>; edk2-devel@ml01.01.org; Kinney, Micha=
el D <michael.d.kinney@intel.com>; Paolo Bonzini <pbonzini@redhat.com>; Fan=
, Jeff <jeff.fan@intel.com>; Zeng, Star <star.zeng@intel.com>
Subject: Re: [edk2] [PATCH V2 0/6] Enable SMM page level protection.

On 11/04/16 10:30, Jiewen Yao wrote:
> =3D=3D=3D=3D below is V2 description =3D=3D=3D=3D
> 1) PiSmmCpu: resolve OVMF multiple processors boot hang issue.
> 2) PiSmmCpu: Add debug info on StartupAp() fails.
> 3) PiSmmCpu: Add ASSERT for AllocatePages().
> 4) PiSmmCpu: Add protection detail in commit message.
> 5) UefiCpuPkg.dsc: Add page table footprint info in commit message.
>
> =3D=3D=3D=3D below is V1 description =3D=3D=3D=3D
> This series patch enables SMM page level protection.
> Features are:
> 1) PiSmmCore reports SMM PE image code/data information
> in EdkiiPiSmmMemoryAttributeTable, if the SMM image is page aligned.
> 2) PiSmmCpu consumes EdkiiPiSmmMemoryAttributeTable
> and set XD for data page and RO for code page.
> 3) PiSmmCpu enables Static Paging for X64 according to
> PcdCpuSmmStaticPageTable. If it is true, 1G paging for above 4G
> is used as long as it is supported.
> 4) PiSmmCpu sets importance data structure to be read only,
> such as Gdt, Idt, SmmEntrypoint, and PageTable itself.
>
> tested platform:
> 1) Intel internal platform (X64).
> 2) EDKII Quark IA32
> 3) EDKII Vlv2  X64
> 4) EDKII OVMF IA32 and IA32X64. (with -smp 8)
>
> Cc: Jeff Fan <jeff.fan@intel.com<mailto:jeff.fan@intel.com>>
> Cc: Feng Tian <feng.tian@intel.com<mailto:feng.tian@intel.com>>
> Cc: Star Zeng <star.zeng@intel.com<mailto:star.zeng@intel.com>>
> Cc: Michael D Kinney <michael.d.kinney@intel.com<mailto:michael.d.kinney@=
intel.com>>
> Cc: Laszlo Ersek <lersek@redhat.com<mailto:lersek@redhat.com>>
> Contributed-under: TianoCore Contribution Agreement 1.0
> Signed-off-by: Jiewen Yao <jiewen.yao@intel.com<mailto:jiewen.yao@intel.c=
om>>

I have new test results. Let's start with the table again:

Legend:

- "untested" means the test was not executed because the same test
  failed or proved unreliable in a less demanding configuration already,

- "n/a" means a setting or test case was impossible,

- "fail" and "unreliable" (lower case) are outside the scope of this
  series; they either capture the pre-series status, or are expected
  even with the series applied due to the pre-series status,

- "FAIL" and "UNRELIABLE" mean regressions caused (or exposed) by the
  series.

In all cases, 36 bits were used as address width in the CPU HOB (--> up
to 64GB guest-phys address space).

   series  OVMF                                                            =
  VCPU     boot       S3 resume
 # applied platform PcdCpuMaxLogicalProcessorNumber PcdCpuSmmStaticPageTabl=
e topology result     result
-- ------- -------- ------------------------------- -----------------------=
- -------- ------     ---------
 1 no      Ia32      64                             n/a                    =
  1x2x2    pass       unreliable
 2 no      Ia32     255                             n/a                    =
  52x2x2   pass       untested
 3 no      Ia32     255                             n/a                    =
  53x2x2   unreliable untested
 4 no      Ia32X64   64                             n/a                    =
  1x2x2    pass       unreliable
 5 no      Ia32X64  255                             n/a                    =
  52x2x2   pass       untested
 6 no      Ia32X64  255                             n/a                    =
  54x2x2   fail       n/a
 7 v2      Ia32      64                             FALSE                  =
  1x2x2    pass       untested
 8 v2      Ia32      64                             TRUE                   =
  1x2x2    FAIL       untested
 9 v2      Ia32     255                             FALSE                  =
  52x2x2   pass       untested
10 v2      Ia32     255                             FALSE                  =
  53x2x2   untested   untested
11 v2      Ia32     255                             TRUE                   =
  52x2x2   untested   untested
12 v2      Ia32     255                             TRUE                   =
  53x2x2   untested   untested
13 v2      Ia32X64   64                             FALSE                  =
  1x2x2    pass       unreliable
14 v2      Ia32X64   64                             TRUE                   =
  1x2x2    pass       untested
15 v2      Ia32X64  255                             FALSE                  =
  52x2x2   pass       untested
16 v2      Ia32X64  255                             FALSE                  =
  54x2x2   untested   untested
17 v2      Ia32X64  255                             TRUE                   =
  52x2x2   FAIL       untested
18 v2      Ia32X64  255                             TRUE                   =
  54x2x2   untested   untested

* Case 8: this test case failed with v2 as well, but this time with
  different symptoms:

> FSOpen: Open '\EFI\fedora\grubia32.efi' Success
> InstallProtocolInterface: [EfiLoadedImageProtocol] 7E4037A8
> Loading driver at 0x0007DA7F000 EntryPoint=3D0x0007DA7F400
> InstallProtocolInterface: [EfiLoadedImageDevicePathProtocol] 7E403A90
> PixelBlueGreenRedReserved8BitPerColor
> ConvertPages: Incompatible memory types
> PixelBlueGreenRedReserved8BitPerColor
> ConvertPages: Incompatible memory types
> SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0
> MpInitExitBootServicesCallback() done!
> ASSERT MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(73): Sp=
inLock !=3D ((void *) 0)

  I didn't try to narrow this down.

* Case 13 (the "unreliable S3 resume" case): Here the news are both bad
  and good. The good news is for Jiewen: this patch series does not
  cause the unreliability, it "only" amplifies it severely. The bad news
  is correspondingly for everyone else: S3 resume is actually unreliable
  even in case 4, that is, without this series applied, it's just the
  failure rate is much-much lower.

  Namely, in my new testing, in case 13, S3 resume failed 8 times out of
  21 tries. (I stopped testing at the 8th failure.)

  Whereas in case 4, S3 resume failed with *identical symptoms* (e.g.,
  exact same RIP=3D000000000009f0fd), 1 time out of 55 tries. (It was try
  #12 that failed; I continued testing and aborted the test after the
  55th try.)

  So, while the series hugely amplifies the failure rate, the failure
  does exist without the series. Which is why I modified the case 4
  results in the table, and also lower-cased the word "unreliable" in
  case 13.

  Below I will return to this problem separately; let's go over the rest
  of the table first.

* Case 17: I guess this is not a real failure, I'm just including it for
  completeness, as PcdCpuSmmStaticPageTable=3D=3DTRUE is known to present
  additional SMRAM demand (see the commit message on patch V2 4/6). This
  case fails with

> SmmLockBox Command - 4
> SmmLockBoxSmmLib SetLockBoxAttributes - Enter
> SmmLockBoxSmmLib SetLockBoxAttributes - Exit (Success)
> SmmLockBox SmmLockBoxHandler Exit
> SmmLockBoxDxeLib SetLockBoxAttributes - Exit (Success)
> SmmInstallProtocolInterface: [EfiSmmReadyToLockProtocol] 0
> ASSERT UefiCpuPkg/PiSmmCpuDxeSmm/CpuS3.c(892): mGdtForAp !=3D ((void *) 0=
)

  which is an SMRAM allocation failure. If I lower the VCPU count to
  50x2x2, then the guest boots fine.

----*----

Before I get to the S3 resume problem (which, again, reproduces without
this series, although much less frequently), I'd like to comment on the
removal of the ASSERT(), from v1 to v2, in the FlushTlbForAll()
function, on the return value of SmmBlockingStartupThisAp(). This change
allows v2 to proceed past that point; however, I'm seeing a whole lot of

> !mSmmMpSyncData->CpuData[1].Present
> !mSmmMpSyncData->CpuData[2].Present
> !mSmmMpSyncData->CpuData[3].Present
> ...

messages in the OVMF boot log, interspersed with

> ConvertPageEntryAttribute 0x7F92B067->0x7F92B065

style messages. (That is, one error message for each AP, per
ConvertPageEntryAttribute() message.)

Is this okay / intentional? The number of these messages can go up to
several thousands and that sort of drowns out everything else in the
log.

It's also not easy to mask the message, because it's logged on the
DEBUG_ERROR level.

----*----

* Okay, so the S3 problem. Last time I suspected that the failure point
  (RIP=3D9f0fd) was in the Linux guest's S3 wakeup vector (which starts at
  9A1D0, according to the OVMF log). In order to test this idea, I
  exercised this series with S3 against a Windows 8.1 guest (--> case 13
  again). The failure reproduced on the second S3 resume, with identical
  RIP, despite the Windows wakeup vector being located elsewhere (at
  0x1000).

  Quoting the OVMF log leading up to the resume:

> Call AsmDisablePaging64() to return to S3 Resume in PEI Phase
> Install PPI: [PeiPostScriptTablePpi]
> Install PPI: [EfiEndOfPeiSignalPpi]
> Notify: PPI Guid: [EfiEndOfPeiSignalPpi], Peim notify entry point: 857895
> PeiMpInitLib: CpuMpEndOfPeiCallback () invoked
> Transfer to 16bit OS waking vector - 1000

  QEMU log (same as before):

> KVM internal error. Suberror: 1
> KVM internal error. Suberror: 1
> emulation failure
> emulation failure
> RAX=3D0000000000000002 RBX=3D0000000000000000 RCX=3D000000007ffdb168 RDX=
=3D000000007fedd058
> RSI=3D0000000000000004 RDI=3D000000007fedd040 RBP=3D0000000000000000 RSP=
=3D000000007e1a7000
> R8 =3D0000000000000000 R9 =3D0000000000000000 R10=3D0000000000000000 R11=
=3D0000000000000000
> R12=3D0000000000000000 R13=3D0000000000000000 R14=3D0000000000000000 R15=
=3D0000000000000000
> RIP=3D000000000009f0fd RFL=3D00010006 [-----P-] CPL=3D0 II=3D0 A20=3D1 SM=
M=3D0 HLT=3D0
> ES =3D0030 0000000000000000 ffffffff 00c09300 DPL=3D0 DS   [-WA]
> CS =3D0038 0000000000000000 ffffffff 00a09b00 DPL=3D0 CS64 [-RA]
> SS =3D0030 0000000000000000 ffffffff 00c09300 DPL=3D0 DS   [-WA]
> DS =3D0030 0000000000000000 ffffffff 00c09300 DPL=3D0 DS   [-WA]
> FS =3D0018 0000000000000000 ffffffff 00c09300 DPL=3D0 DS   [-WA]
> GS =3D0018 0000000000000000 ffffffff 00c09300 DPL=3D0 DS   [-WA]
> LDT=3D0000 0000000000000000 0000ffff 00008200 DPL=3D0 LDT
> TR =3D0000 0000000000000000 0000ffff 00008b00 DPL=3D0 TSS64-busy
> GDT=3D     000000007f294000 00000047
> IDT=3D     000000007f294048 00000fff
> CR0=3De0000011 CR2=3D0000000000000000 CR3=3D000000007ff7f000 CR4=3D000002=
20
> DR0=3D0000000000000000 DR1=3D0000000000000000 DR2=3D0000000000000000 DR3=
=3D0000000000000000
> DR6=3D00000000ffff0ff0 DR7=3D0000000000000400
> EFER=3D0000000000000500
> Code=3D?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?=
? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?=
? ?? ?? ??
> RAX=3D0000000000000001 RBX=3D0000000000000000 RCX=3D000000007ffdb168 RDX=
=3D000000007fedd070
> RSI=3D0000000000000004 RDI=3D000000007fedd058 RBP=3D0000000000000000 RSP=
=3D000000007e19f000
> R8 =3D0000000000000000 R9 =3D0000000000000000 R10=3D0000000000000000 R11=
=3D0000000000000000
> R12=3D0000000000000000 R13=3D0000000000000000 R14=3D0000000000000000 R15=
=3D0000000000000000
> RIP=3D000000000009f0fd RFL=3D00010006 [-----P-] CPL=3D0 II=3D0 A20=3D1 SM=
M=3D0 HLT=3D0
> ES =3D0030 0000000000000000 ffffffff 00c09300 DPL=3D0 DS   [-WA]
> CS =3D0038 0000000000000000 ffffffff 00a09b00 DPL=3D0 CS64 [-RA]
> SS =3D0030 0000000000000000 ffffffff 00c09300 DPL=3D0 DS   [-WA]
> DS =3D0030 0000000000000000 ffffffff 00c09300 DPL=3D0 DS   [-WA]
> FS =3D0018 0000000000000000 ffffffff 00c09300 DPL=3D0 DS   [-WA]
> GS =3D0018 0000000000000000 ffffffff 00c09300 DPL=3D0 DS   [-WA]
> LDT=3D0000 0000000000000000 0000ffff 00008200 DPL=3D0 LDT
> TR =3D0000 0000000000000000 0000ffff 00008b00 DPL=3D0 TSS64-busy
> GDT=3D     000000007f294000 00000047
> IDT=3D     000000007f294048 00000fff
> CR0=3De0000011 CR2=3D0000000000000000 CR3=3D000000007ff7f000 CR4=3D000002=
20
> DR0=3D0000000000000000 DR1=3D0000000000000000 DR2=3D0000000000000000 DR3=
=3D0000000000000000
> DR6=3D00000000ffff0ff0 DR7=3D0000000000000400
> EFER=3D0000000000000500
> Code=3D?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?=
? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?=
? ?? ?? ??

  So, we can exclude the suspicion that the problem is guest OS
  dependent.

* Then I looked for the base address of the page containing the
  RIP=3D9f0fd address, in earlier parts of the OVMF log, on the hunch that
  some firmware component might have allocated that area actually. Here
  we go:

> Loading PEIM at 0x000008552C0 EntryPoint=3D0x000008554E0 CpuMpPei.efi
> AP Loop Mode is 1
> WakeupBufferStart =3D 9F000, WakeupBufferSize =3D 1000

  That is, the failure hits (when it hits -- not always) in the area
  where the CpuMpPei driver *borrows* memory for the startup vector of
  the APs, for the purposes of the MP service PPI. ("Wakeup" is an
  overloaded word here; the "wakeup buffer" has nothing to do with S3
  resume, it just serves for booting the APs temporarily in PEI, for
  implementing the MP service PPI.)

  When exiting the PEI phase (on the S3 resume path), CpuMpPei restores
  the original contents of this area. This occurs just before
  transfering control to the guest OS wakeup vector: see the
  "EfiEndOfPeiSignalPpi" and "CpuMpEndOfPeiCallback" strings just above,
  quoted from the OVMF log.

  I documented (parts of) this logic in OVMF commit

    https://github.com/tianocore/edk2/commit/e3e3090a959a0

  (see the code comments as well).

* At that time, I thought to have identified a memory management bug in
  CpuMpPei; see the following discussion and bug report for details:

    https://www.mail-archive.com/edk2-devel@lists.01.org/msg13892.html
    https://bugzilla.tianocore.org/show_bug.cgi?id=3D67

  However, with the extraction / introduction of MpInitLib, this issue
  has been fixed: GetWakeupBuffer() now calls
  CheckOverlapWithAllocatedBuffer(), so that "memory management bug" is
  no longer; we shouldn't be looking there for the root cause.

* Either way, I don't understand why anything would want to execute code
  in the one page that happens to host the MP services PPI startup
  buffer for APs during PEI.

  Not understanding the "why", I looked at the "what", and resorted to
  tracing KVM. Because the problem readily reproduces with this series
  applied (case 13), it wasn't hard to start the tracing while the guest
  was suspended, and capture just the actions that led from the
  KVM-level wakeup to the failure.

  The QEMU state dumps are visible above in the email. I've also
  uploaded the compressed OVMF log and the textual KVM trace here:

    http://people.redhat.com/lersek/s3-crash-8d1dfed7-ca92-4e25-8d2b-b1c9ac=
2a53db/

  I sincerely hope that Paolo will have a field day with the KVM trace
  :) I managed to identify the following curiosities (remember this is
  all on the S3 resume path):

  * First, the VCPUs (there are four of them) enter and leave SMM in a
    really funky pattern:

      vcpu#0  vcpu#1  vcpu#2  vcpu#3
      ------  ------  ------  ------
              enter
               |
              leave

                      enter
                        |
                      leave

                              enter
                                |
                              leave

      enter
        |
      leave

              enter           enter
       enter    |     enter     |
         |      |       |       |
       leave    |       |       |
                |       |       |
       enter    |       |       |
         |      |       |       |
       leave  leave   leave   leave

    That is, first we have each VCPU enter and leave SMM in complete
    isolation (1, 2, 3, 0). Then VCPUs 1 and 3 enter SMM together, soon
    followed by VPCUS 0 and 2, also together. VCPU#0 drops out of SMM
    temporarily (it comes back in later), while the other three remain
    in SMM. Finally all four of them leave SMM together.

    After which the problem occurs.

  * Second, the instruction that causes things to blow up is <0f aa>,
    i.e., RSM. I have absolutely no clue why RSM is executed:

    (a) in the area that used to host the AP startup routine for the MP
    services PPI -- note that we also have "Transfer to 16bit OS waking
    vector" in the log, so CpuMpEndOfPeiCallback() restores the borrowed
    area completeley! --,

    (b) and why *after* all four VCPUs have just left SMM, together.

  * The RSM instruction is handled successfully elsewhere, for example
    when all four VCPUs leave SMM, at the bottom of the diagram above:

> CPU-24447 [002] 39841.982810: kvm_emulate_insn:     0:7ffbf179: 0f aa
> CPU-24446 [000] 39841.982810: kvm_emulate_insn:     0:7ffbd179: 0f aa
> CPU-24445 [005] 39841.982810: kvm_emulate_insn:     0:7ffbb179: 0f aa
> CPU-24444 [006] 39841.982811: kvm_emulate_insn:     0:7ffb9179: 0f aa

  * The guest-phys address 7ff7f000 that we see just before the error:

> CPU-24447 [002] 39841.982825: kvm_page_fault:       address 7ff7f000 erro=
r_code 83
> CPU-24446 [000] 39841.982825: kvm_page_fault:       address 7ff7f000 erro=
r_code 83
> CPU-24447 [002] 39841.982826: kvm_emulate_insn:     0:9f0fd: 0f aa
> CPU-24444 [006] 39841.982827: kvm_exit:             reason EXTERNAL_INTER=
RUPT rip 0xffffffff813a954f info 0 800000fc
> CPU-24447 [002] 39841.982827: kvm_emulate_insn:     0:9f0fd: 0f aa FAIL
> CPU-24447 [002] 39841.982827: kvm_userspace_exit:   reason KVM_EXIT_INTER=
NAL_ERROR (17)

    can be found higher up in the trace; namely, it is written to CR3
    several times. It's the root of the page tables.

  * The 7F80_1000..7FFF_FFFF guest-phys addresses are all in SMRAM.

* I also tried the "info tlb" monitor command, via "virsh
  qemu-monitor-command --hmp", while the guest was auto-paused after the
  crash.

  I cannot provide results: QEMU appeared to return a message that would
  be longer than 16MB after encoding by libvirt, and libvirt rejected
  that ("Unable to encode message payload", see VIR_NET_MESSAGE_MAX).

  Anyway, the KVM trace, and the QEMU register dump, look consistent
  with what Paolo said about "Code=3D?? ?? ??...":

    The question marks usually mean that the page tables do not map a
    page at that address.

  CR3=3D000000007ff7f000 points into SMRAM, but we are outside of SMM
  (SMM=3D0). We can't translate *any* guest-virtual address, as we can't
  even begin walking the page tables.

Thanks
Laszlo
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>
https://lists.01.org/mailman/listinfo/edk2-devel