public inbox for devel@edk2.groups.io
 help / color / mirror / Atom feed
* [PATCH] UefiPayloadPkg: Always split page table entry to 4K if it covers stack.
@ 2022-05-31  5:39 Zhiguang Liu
  2022-05-31  7:45 ` [edk2-devel] " Gerd Hoffmann
  0 siblings, 1 reply; 8+ messages in thread
From: Zhiguang Liu @ 2022-05-31  5:39 UTC (permalink / raw)
  To: devel; +Cc: Zhiguang Liu, Guo Dong, Ray Ni, Maurice Ma, Benjamin You,
	Sean Rhodes

There is a concern case that stack and a proteced DXE memory range is in
the same 2M Page Table entry, and somehow CPU doesn't flash the page
table entry cache for stack, and causes Page fault when using stack.
Always split the page table entry to 4K if it covers stack to avoid this
issue.

Cc: Guo Dong <guo.dong@intel.com>
Cc: Ray Ni <ray.ni@intel.com>
Cc: Maurice Ma <maurice.ma@intel.com>
Cc: Benjamin You <benjamin.you@intel.com>
Cc: Sean Rhodes <sean@starlabs.systems>
Signed-off-by: Zhiguang Liu <zhiguang.liu@intel.com>
---
 UefiPayloadPkg/UefiPayloadEntry/X64/VirtualMemory.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/UefiPayloadPkg/UefiPayloadEntry/X64/VirtualMemory.c b/UefiPayloadPkg/UefiPayloadEntry/X64/VirtualMemory.c
index ac0d58e685..74b667a62a 100644
--- a/UefiPayloadPkg/UefiPayloadEntry/X64/VirtualMemory.c
+++ b/UefiPayloadPkg/UefiPayloadEntry/X64/VirtualMemory.c
@@ -218,16 +218,8 @@ ToSplitPageTable (
     return TRUE;
   }
 
-  if (PcdGetBool (PcdCpuStackGuard)) {
-    if ((StackBase >= Address) && (StackBase < (Address + Size))) {
-      return TRUE;
-    }
-  }
-
-  if (PcdGetBool (PcdSetNxForStack)) {
-    if ((Address < StackBase + StackSize) && ((Address + Size) > StackBase)) {
-      return TRUE;
-    }
+  if ((Address < StackBase + StackSize) && ((Address + Size) > StackBase)) {
+    return TRUE;
   }
 
   if (GhcbBase != 0) {
-- 
2.16.2.windows.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [edk2-devel] [PATCH] UefiPayloadPkg: Always split page table entry to 4K if it covers stack.
  2022-05-31  5:39 [PATCH] UefiPayloadPkg: Always split page table entry to 4K if it covers stack Zhiguang Liu
@ 2022-05-31  7:45 ` Gerd Hoffmann
  2022-05-31  8:03   ` Ni, Ray
  0 siblings, 1 reply; 8+ messages in thread
From: Gerd Hoffmann @ 2022-05-31  7:45 UTC (permalink / raw)
  To: devel, zhiguang.liu
  Cc: Guo Dong, Ray Ni, Maurice Ma, Benjamin You, Sean Rhodes

On Tue, May 31, 2022 at 01:39:37PM +0800, Zhiguang Liu wrote:
> There is a concern case that stack and a proteced DXE memory range is in
> the same 2M Page Table entry, and somehow CPU doesn't flash the page
> table entry cache for stack, and causes Page fault when using stack.

Can you clarify the "somehow" please?  Are we discussing a workaround
for a cpu bug here?  If not this sounds like a tlbflush instruction is
missing somewhere ...

take care,
  Gerd


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [edk2-devel] [PATCH] UefiPayloadPkg: Always split page table entry to 4K if it covers stack.
  2022-05-31  7:45 ` [edk2-devel] " Gerd Hoffmann
@ 2022-05-31  8:03   ` Ni, Ray
  2022-05-31 11:21     ` Gerd Hoffmann
  0 siblings, 1 reply; 8+ messages in thread
From: Ni, Ray @ 2022-05-31  8:03 UTC (permalink / raw)
  To: Gerd Hoffmann, devel@edk2.groups.io, Liu, Zhiguang
  Cc: Dong, Guo, You, Benjamin, Rhodes, Sean

Gerd,
We saw page fault in the following situation:
* a 2M page entry (with present bit set) points to some memory [p, p+2M)
* Firmware code wants to mark [p, p+4k) as read-only
* Firstly [p, p+2M) is split to 512 page-entries with each pointing to 4K memory (with present bit set still)
* Secondly, the R/W bit in first page entry is cleared

The code is in UefiCpuPkg/CpuDxe/CpuPageTable.c:

    //
    // Split 2M to 4K
    //
    ASSERT (SplitAttribute == Page4K);
    if (SplitAttribute == Page4K) {
      NewPageEntry = AllocatePagesFunc (1);
      DEBUG ((DEBUG_VERBOSE, "Split - 0x%x\n", NewPageEntry));
      if (NewPageEntry == NULL) {
        return RETURN_OUT_OF_RESOURCES;
      }

      BaseAddress = *PageEntry & ~AddressEncMask & PAGING_2M_ADDRESS_MASK_64;
      for (Index = 0; Index < SIZE_4KB / sizeof (UINT64); Index++) {
        NewPageEntry[Index] = (BaseAddress + SIZE_4KB * Index) | AddressEncMask | ((*PageEntry) & PAGE_PROGATE_BITS);
      }

      (*PageEntry) = (UINT64)(UINTN)NewPageEntry | AddressEncMask | ((*PageEntry) & PAGE_ATTRIBUTE_BITS);

Page fault exception happens just after the above assignment.
We observed that the instruction causing the exception is accessing the stack and stack is within [p, p+2M) range.

To be frank, we are still trying to understand whether a CR3 flush or INVLPG should be performed immediately after the above assignment.

Before that's fully understood, we think the page table split for stack does no harm to the functionality and code complexity. That's why we choose this fix first.

I am not quite sure how Linux handles such case?

Thanks,
Ray

> -----Original Message-----
> From: Gerd Hoffmann <kraxel@redhat.com>
> Sent: Tuesday, May 31, 2022 3:45 PM
> To: devel@edk2.groups.io; Liu, Zhiguang <zhiguang.liu@intel.com>
> Cc: Dong, Guo <guo.dong@intel.com>; Ni, Ray <ray.ni@intel.com>; Maurice Ma <maurice.ma@intel.com>; You, Benjamin
> <benjamin.you@intel.com>; Rhodes, Sean <sean@starlabs.systems>
> Subject: Re: [edk2-devel] [PATCH] UefiPayloadPkg: Always split page table entry to 4K if it covers stack.
> 
> On Tue, May 31, 2022 at 01:39:37PM +0800, Zhiguang Liu wrote:
> > There is a concern case that stack and a proteced DXE memory range is in
> > the same 2M Page Table entry, and somehow CPU doesn't flash the page
> > table entry cache for stack, and causes Page fault when using stack.
> 
> Can you clarify the "somehow" please?  Are we discussing a workaround
> for a cpu bug here?  If not this sounds like a tlbflush instruction is
> missing somewhere ...
> 
> take care,
>   Gerd


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [edk2-devel] [PATCH] UefiPayloadPkg: Always split page table entry to 4K if it covers stack.
  2022-05-31  8:03   ` Ni, Ray
@ 2022-05-31 11:21     ` Gerd Hoffmann
  2022-05-31 13:24       ` Ni, Ray
  0 siblings, 1 reply; 8+ messages in thread
From: Gerd Hoffmann @ 2022-05-31 11:21 UTC (permalink / raw)
  To: devel, ray.ni; +Cc: Liu, Zhiguang, Dong, Guo, You, Benjamin, Rhodes, Sean

  Hi,

> I am not quite sure how Linux handles such case?

Oh, lovely.  CPU bugs lurking indeed.  linux has this longish comment
(see mm/huge_memory.c, in the middle of the __split_huge_pmd_locked()
function):

        /*
         * Up to this point the pmd is present and huge and userland has the
         * whole access to the hugepage during the split (which happens in
         * place). If we overwrite the pmd with the not-huge version pointing
         * to the pte here (which of course we could if all CPUs were bug
         * free), userland could trigger a small page size TLB miss on the
         * small sized TLB while the hugepage TLB entry is still established in
         * the huge TLB. Some CPU doesn't like that.
         * See http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf, Erratum
         * 383 on page 105. Intel should be safe but is also warns that it's
         * only safe if the permission and cache attributes of the two entries
         * loaded in the two TLB is identical (which should be the case here).
         * But it is generally safer to never allow small and huge TLB entries
         * for the same virtual address to be loaded simultaneously. So instead
         * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
         * current pmd notpresent (atomically because here the pmd_trans_huge
         * must remain set at all times on the pmd until the split is complete
         * for this pmd), then we flush the SMP TLB and finally we write the
         * non-huge version of the pmd entry with pmd_populate.
         */

So linux goes 2M -> not present -> 4K instead of direct 2M -> 4K (and
does the tlb flush in the not present state), which apparently is needed
on some CPUs to avoid confusing the tlb cache.

> Before that's fully understood, we think the page table split for
> stack does no harm to the functionality and code complexity. That's
> why we choose this fix first.

So this basically splits the page right from the start instead of doing
it later when page attributes are changed.  Which probably avoids the
huge page landing in the tlb cache, which in turn avoids triggering the
issues outlined above.

I think doing a linux-style page split will be the more robust solution.

take care,
  Gerd


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [edk2-devel] [PATCH] UefiPayloadPkg: Always split page table entry to 4K if it covers stack.
  2022-05-31 11:21     ` Gerd Hoffmann
@ 2022-05-31 13:24       ` Ni, Ray
  2022-05-31 15:32         ` Gerd Hoffmann
  0 siblings, 1 reply; 8+ messages in thread
From: Ni, Ray @ 2022-05-31 13:24 UTC (permalink / raw)
  To: Gerd Hoffmann, devel@edk2.groups.io
  Cc: Liu, Zhiguang, Dong, Guo, You, Benjamin, Rhodes, Sean

> > I am not quite sure how Linux handles such case?
> 
> Oh, lovely.  CPU bugs lurking indeed.  linux has this longish comment
> (see mm/huge_memory.c, in the middle of the __split_huge_pmd_locked()
> function):
> 
>         /*
>          * Up to this point the pmd is present and huge and userland has the
>          * whole access to the hugepage during the split (which happens in
>          * place). If we overwrite the pmd with the not-huge version pointing
>          * to the pte here (which of course we could if all CPUs were bug
>          * free), userland could trigger a small page size TLB miss on the
>          * small sized TLB while the hugepage TLB entry is still established in
>          * the huge TLB. Some CPU doesn't like that.
>          * See http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf, Erratum
>          * 383 on page 105. Intel should be safe but is also warns that it's
>          * only safe if the permission and cache attributes of the two entries
>          * loaded in the two TLB is identical (which should be the case here).
>          * But it is generally safer to never allow small and huge TLB entries
>          * for the same virtual address to be loaded simultaneously. So instead
>          * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
>          * current pmd notpresent (atomically because here the pmd_trans_huge
>          * must remain set at all times on the pmd until the split is complete
>          * for this pmd), then we flush the SMP TLB and finally we write the
>          * non-huge version of the pmd entry with pmd_populate.
>          */
> 
> So linux goes 2M -> not present -> 4K instead of direct 2M -> 4K (and
> does the tlb flush in the not present state), which apparently is needed
> on some CPUs to avoid confusing the tlb cache.
> 
> > Before that's fully understood, we think the page table split for
> > stack does no harm to the functionality and code complexity. That's
> > why we choose this fix first.
> 
> So this basically splits the page right from the start instead of doing
> it later when page attributes are changed.  Which probably avoids the
> huge page landing in the tlb cache, which in turn avoids triggering the
> issues outlined above.

yes:) Actually there is no split at all. The 4K page table is created in the very beginning(before setting to cr3).
So, no TLB cache issue at all.

> 
> I think doing a linux-style page split will be the more robust solution.

Thanks for explaining the linux behavior.

Intel's SDM also contain below wordings:
* As noted in Section 4.10.2, the TLBs may subsequently contain multiple translations for the address range if
* software modifies the paging structures so that the page size used for a 4-KByte range of linear addresses
* changes. A reference to a linear address in the address range may use any of these translations.
* Software wishing to prevent this uncertainty should not write to a paging-structure entry in a way that would
* change, for any linear address, both the page size and either the page frame, access rights, or other attributes.
* It can instead use the following algorithm: first clear the P flag in the relevant paging-structure entry (e.g.,
* PDE); then invalidate any translations for the affected linear addresses (see above); and then modify the
* relevant paging-structure entry to set the P flag and establish modified translation(s) for the new page size.

But I still have some doubts about using linux-style page split.
Because it's marked as not present:
1. Active code should not access data in the 2M region (stack is in the 2M region in our case)
2. Active code should not in the 2M region (how to guarantee that?)

How does Linux guarantee the above two points?

Thanks,
Ray

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [edk2-devel] [PATCH] UefiPayloadPkg: Always split page table entry to 4K if it covers stack.
  2022-05-31 13:24       ` Ni, Ray
@ 2022-05-31 15:32         ` Gerd Hoffmann
  2022-06-08  0:49           ` Ni, Ray
  0 siblings, 1 reply; 8+ messages in thread
From: Gerd Hoffmann @ 2022-05-31 15:32 UTC (permalink / raw)
  To: Ni, Ray
  Cc: devel@edk2.groups.io, Liu, Zhiguang, Dong, Guo, You, Benjamin,
	Rhodes, Sean

  Hi,

> yes:) Actually there is no split at all. The 4K page table is created in the very beginning(before setting to cr3).
> So, no TLB cache issue at all.

> > I think doing a linux-style page split will be the more robust solution.
> 
> Thanks for explaining the linux behavior.
> 
> Intel's SDM also contain below wordings:
> * As noted in Section 4.10.2, the TLBs may subsequently contain multiple translations for the address range if
> * software modifies the paging structures so that the page size used for a 4-KByte range of linear addresses
> * changes. A reference to a linear address in the address range may use any of these translations.

This is probably the section the "only safe if [ ... ] the two entries
[ ... ] identical" part refers to.

> * Software wishing to prevent this uncertainty should not write to a paging-structure entry in a way that would
> * change, for any linear address, both the page size and either the page frame, access rights, or other attributes.
> * It can instead use the following algorithm: first clear the P flag in the relevant paging-structure entry (e.g.,
> * PDE); then invalidate any translations for the affected linear addresses (see above); and then modify the
> * relevant paging-structure entry to set the P flag and establish modified translation(s) for the new page size.

So linux basically implements this recommendation.

> But I still have some doubts about using linux-style page split.
> Because it's marked as not present:
> 1. Active code should not access data in the 2M region (stack is in the 2M region in our case)
> 2. Active code should not in the 2M region (how to guarantee that?)
> 
> How does Linux guarantee the above two points?

Easy.  It's kernel code changing mappings for userspace, so no need to
worry about code removing its own mappings in the first place.  It's a
different story for edk2 though ...

Can this be covered by the page fault handler?  Update the entry like
the current code does, except for clearing the present bit, then flush
tlb, then set the present bit.  In case we take a page fault the only
action the handler must do is enable the present bit, which might even
be possible to do without additional state tracking.

Linux most likely has something simliar in the page fault handler.
Linux needs it for a different reason, it must handle SMP races.  When
temporary clearing the present bit linux might get a page fault on
*another* cpu which runs userspace code touching the page being updated.

take care,
  Gerd


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [edk2-devel] [PATCH] UefiPayloadPkg: Always split page table entry to 4K if it covers stack.
  2022-05-31 15:32         ` Gerd Hoffmann
@ 2022-06-08  0:49           ` Ni, Ray
  2022-06-08  6:52             ` Gerd Hoffmann
  0 siblings, 1 reply; 8+ messages in thread
From: Ni, Ray @ 2022-06-08  0:49 UTC (permalink / raw)
  To: devel@edk2.groups.io, kraxel@redhat.com
  Cc: Liu, Zhiguang, Dong, Guo, You, Benjamin, Rhodes, Sean

>   Hi,
> 
> > yes:) Actually there is no split at all. The 4K page table is created in the very beginning(before setting to cr3).
> > So, no TLB cache issue at all.
> 
> > > I think doing a linux-style page split will be the more robust solution.
> >
> > Thanks for explaining the linux behavior.
> >
> > Intel's SDM also contain below wordings:
> > * As noted in Section 4.10.2, the TLBs may subsequently contain multiple translations for the address range if
> > * software modifies the paging structures so that the page size used for a 4-KByte range of linear addresses
> > * changes. A reference to a linear address in the address range may use any of these translations.
> 
> This is probably the section the "only safe if [ ... ] the two entries
> [ ... ] identical" part refers to.
> 
> > * Software wishing to prevent this uncertainty should not write to a paging-structure entry in a way that would
> > * change, for any linear address, both the page size and either the page frame, access rights, or other attributes.
> > * It can instead use the following algorithm: first clear the P flag in the relevant paging-structure entry (e.g.,
> > * PDE); then invalidate any translations for the affected linear addresses (see above); and then modify the
> > * relevant paging-structure entry to set the P flag and establish modified translation(s) for the new page size.

> 
> So linux basically implements this recommendation.
> 
> > But I still have some doubts about using linux-style page split.
> > Because it's marked as not present:
> > 1. Active code should not access data in the 2M region (stack is in the 2M region in our case)
> > 2. Active code should not in the 2M region (how to guarantee that?)
> >
> > How does Linux guarantee the above two points?
> 
> Easy.  It's kernel code changing mappings for userspace, so no need to
> worry about code removing its own mappings in the first place.  It's a
> different story for edk2 though ...
> 
> Can this be covered by the page fault handler?  Update the entry like
> the current code does, except for clearing the present bit, then flush
> tlb, then set the present bit.  In case we take a page fault the only
> action the handler must do is enable the present bit, which might even
> be possible to do without additional state tracking.

It's a bit heavy from my perspective.
I prefer to split the page to 4K in the very beginning of stack setup.
It also guarantees such issue doesn't appear.

> 
> Linux most likely has something simliar in the page fault handler.
> Linux needs it for a different reason, it must handle SMP races.  When
> temporary clearing the present bit linux might get a page fault on
> *another* cpu which runs userspace code touching the page being updated.
> 
> take care,
>   Gerd
> 
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [edk2-devel] [PATCH] UefiPayloadPkg: Always split page table entry to 4K if it covers stack.
  2022-06-08  0:49           ` Ni, Ray
@ 2022-06-08  6:52             ` Gerd Hoffmann
  0 siblings, 0 replies; 8+ messages in thread
From: Gerd Hoffmann @ 2022-06-08  6:52 UTC (permalink / raw)
  To: Ni, Ray
  Cc: devel@edk2.groups.io, Liu, Zhiguang, Dong, Guo, You, Benjamin,
	Rhodes, Sean

  Hi,

> > > But I still have some doubts about using linux-style page split.
> > > Because it's marked as not present:
> > > 1. Active code should not access data in the 2M region (stack is in the 2M region in our case)
> > > 2. Active code should not in the 2M region (how to guarantee that?)
> > >
> > > How does Linux guarantee the above two points?
> > 
> > Easy.  It's kernel code changing mappings for userspace, so no need to
> > worry about code removing its own mappings in the first place.  It's a
> > different story for edk2 though ...
> > 
> > Can this be covered by the page fault handler?  Update the entry like
> > the current code does, except for clearing the present bit, then flush
> > tlb, then set the present bit.  In case we take a page fault the only
> > action the handler must do is enable the present bit, which might even
> > be possible to do without additional state tracking.
> 
> It's a bit heavy from my perspective.
> I prefer to split the page to 4K in the very beginning of stack setup.
> It also guarantees such issue doesn't appear.

Yes, avoiding hugepages being created in the first place will surely fix
that specific case.  The commit message should describe the problem
better, otherwise I'm fine with the patch.

But I think there are more cases where edk2 splits huge pages.
HeapGuard comes to mind for example.  So I'm wondering whenever there
are more simliar problems in the code base.

take care,
  Gerd


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-06-08  6:53 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-05-31  5:39 [PATCH] UefiPayloadPkg: Always split page table entry to 4K if it covers stack Zhiguang Liu
2022-05-31  7:45 ` [edk2-devel] " Gerd Hoffmann
2022-05-31  8:03   ` Ni, Ray
2022-05-31 11:21     ` Gerd Hoffmann
2022-05-31 13:24       ` Ni, Ray
2022-05-31 15:32         ` Gerd Hoffmann
2022-06-08  0:49           ` Ni, Ray
2022-06-08  6:52             ` Gerd Hoffmann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox