Re: [PATCH V2 0/6] Enable SMM page level protection.

From: Laszlo Ersek <lersek@redhat.com>
To: Paolo Bonzini <pbonzini@redhat.com>,
	"Yao, Jiewen" <jiewen.yao@intel.com>
Cc: "Kinney, Michael D" <michael.d.kinney@intel.com>,
	"Tian, Feng" <feng.tian@intel.com>,
	"edk2-devel@ml01.01.org" <edk2-devel@ml01.01.org>,
	"Zeng, Star" <star.zeng@intel.com>,
	"Fan, Jeff" <jeff.fan@intel.com>
Subject: Re: [PATCH V2 0/6] Enable SMM page level protection.
Date: Wed, 9 Nov 2016 23:28:30 +0100	[thread overview]
Message-ID: <0e1380dc-85e6-324b-4614-10785d24f499@redhat.com> (raw)
In-Reply-To: <3be2f1bf-8c0a-e470-a5c0-a6130b159da5@redhat.com>

On 11/09/16 16:54, Paolo Bonzini wrote:
> 
> 
> On 09/11/2016 16:01, Yao, Jiewen wrote:
>> 1)      CpuS3.c – EarlyInitializeCpu()
>> 2)      CpuS3.c – SmmRelocateBases()
>> 3)      CpuS3.c – InitializeCpu()
>> 4)      S3Resume.c – SendSmiIpiAllExcludingSelf()
>>
>> I believe we can guarantee 1/2/3 is good, because I found we check BSP
>> check mNumberToFinish.
>>
>> 4 is a risk, because there is no AP finish check. If the AP is in below
>> 1M with CR3 in SMRAM, it will be a trouble.
>>
>> Once the AP executes RSM and return to non-SMM, the CR3 is no longer
>> valid and AP must be crashed immediately. WoW!
>>
>> The fix, I believe, is same.
>>
>> We should make 1) AP is in above 1M reserved memory,
> 
> Is this because of the NMI case?
> 
>> and 2) AP is in protected mode with paging disabled.
> 
> It is not clear to me what the (4) SIPI done is there for,

After reading through your great analysis with a keen focus :), I wanted
to ask the exact same thing. I managed to follow / recall the control
flow mostly, but when I saw that SMI, I didn't (and don't) understand
that it was (is) good for.

After all, we're not setting up any request parameters etc. for the
processors to handle in SMM. What's happening there?

Another question I have -- and I feel I should really know it, but I
don't... -- is *why* the APs are executing code from the page at
0x9f000. When the BSP exits SMM, replays the S3 boot script, and finally
finishes off the PEI phase and restores the page at 0x9f000, the APs
seem to be affected -- but why do they care about that page at all? That
page never belonged to PiSmmCpuSmmDxe, it belongs CpuMpPei.

I do understand that the CR3 registers for the APs point into SMRAM,
while they wait for the BSP in SMM. Thus, the BSP closing/locking down
SMRAM, in S3ResumeExecuteBootScript(), breaks the APs -- that's
understandable.

What I don't get is, again:
(1) why S3ResumeExecuteBootScript() raises SMIs at all, before locking
down SMRAM,
(2) what the AP SMM routine (from PiSmmCpuDxeSmm) has to do with the
Wakeup buffer that is allocated and used *solely* by CpuMpPei.

I could be utterly and inexcusably wrong, but I think that the
RIP=0x9f0fd symptom is a red herring. I wrote,

>       vcpu#0  vcpu#1  vcpu#2  vcpu#3
>       ------  ------  ------  ------
>               enter
>                |
>               leave
>
>                       enter
>                         |
>                       leave
>
>                               enter
>                                 |
>                               leave
>
>       enter
>         |
>       leave
>
>               enter           enter
>        enter    |     enter     |
>          |      |       |       |
>        leave    |       |       |
>             <--------------------------- BAD
>        enter    |       |       |
>          |      |       |       |
>        leave  leave   leave   leave

Thanks to Paolo's analysis, we now know where that gap comes from and
what it does (so I marked it with BAD now) -- in the gap, the BSP leaves
SMM alone, closes/locks SMRAM, finishes off the PEI phase, restores the
contents of the borrowed wakeup buffer of CpuMpPei, and even transfers
control to Linux's S3 resume vector.

I don't understand why we don't get horrible faults on the APs
*immediately* when the BSP closes/locks down SMRAM. Everything in SMRAM,
page tables, executable code, everything, will read as 0xff on QEMU. How
can the APs continue in SMM long enough to

(a) time out and pull the BSP back into SMM,
(b) complete the rendezvous and exit SMM?

... Anyway, I think I do have an idea for question (2). Namely, when the
BSP starts executing S3ResumeExecuteBootScript(), in
"UefiCpuPkg/Universal/Acpi/S3Resume2Pei/S3Resume.c" -- for which the cue
is ultimately given by the DXE IPL PEIM, as the last action in PEI --,
CpuMpPei has been dispatched already! And, CpuMpPei has placed all the
APs into their comfy HLT loops, so that the MP services PPI could serve
multiprocessing requests.

Thus, the APs are executing code (the HLT loop) from CpuMpPei's wakeup
buffer on page 0x9f000 as *normal business*. That is where the SMI,
raised by the BSP in S3ResumeExecuteBootScript(), rips them out of. And
that's also where KVM tries to return them to, once they finish in SMM
and execute RSM. Too bad by the time KVM returns them there, the wakeup
page has been restored by the BSP.

In other words, the address RIP=0x9f0fd *is* a red herring, that's
simply where the APs happened to be when the SMI was raised, and where
KVM remembers to return the APs to, once the APs execute RSM.

I think I sort of answered question (2). (Apologies if Paolo and Jiewen
explained the exact same thing before; I had to spell it out for
myself.) That leaves question (1) open. Why enter SMM in
S3ResumeExecuteBootScript() at all?

Anyway, I think if the BSP and the APs are properly synchronized around
the SMI injections in S3ResumeExecuteBootScript(), then this bug is
fixed. In that case, the APs' RSMs will restore the full context for the
APs, including their sleep in the HLT instruction, in CpuMpPei's wakeup
buffer. The BSP will proceed, exit PEI (restoring the CpuMpPei wakeup
buffer -- but the APs will sleep on), and then Linux will bring up the
APs, after taking control.

Thanks
Laszlo

> and why it is
> triggered in S3Resume.c rather than CpuS3.c.  And why does it take so
> much for APs to complete it?
> 
> That said, by the time you close and lock SMRAM, you aren't even sure
> that you have reached the cli;hlt loop in the rendezvous funnel.  In
> practice you will be there, but there is still a theoretical race.
> 
> InterlockedDecrement (&mNumberToFinish) should be moved from
> EarlyMPRendezvousProcedure/MPRendezvousProcedure to GoToSleep, and
> GoToSleep should leave 64-bit mode before doing it.  This will fix the
> S3 bug as well.  It's only needed for 64-bit mode, but it is doable for
> the Ia32 version as well.
> 
> Perhaps EarlyMPRendezvousProcedure and MPRendezvousProcedure can return
> &mNumberToFinish; what do you think?
> 
> Paolo
> _______________________________________________
> edk2-devel mailing list
> edk2-devel@lists.01.org
> https://lists.01.org/mailman/listinfo/edk2-devel
>