From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 31 May 2022 13:21:47 +0200
From: "Gerd Hoffmann" <kraxel@redhat.com>
To: devel@edk2.groups.io, ray.ni@intel.com
Cc: "Liu, Zhiguang", "Dong, Guo", "You, Benjamin", "Rhodes, Sean"
Subject: Re: [edk2-devel] [PATCH] UefiPayloadPkg: Always split page table entry to 4K if it covers stack.
Message-ID: <20220531112147.pvy4d6vetsgsqduu@sirius.home.kraxel.org>
References: <20220531053937.19696-1-zhiguang.liu@intel.com> <20220531074513.fciegyxkrgiwwqem@sirius.home.kraxel.org>

  Hi,

> I am not quite sure how Linux handles such case?

Oh, lovely.  CPU bugs lurking indeed.

Linux has this longish comment (see mm/huge_memory.c, in the middle of
the __split_huge_pmd_locked() function):

	/*
	 * Up to this point the pmd is present and huge and userland has the
	 * whole access to the hugepage during the split (which happens in
	 * place). If we overwrite the pmd with the not-huge version pointing
	 * to the pte here (which of course we could if all CPUs were bug
	 * free), userland could trigger a small page size TLB miss on the
	 * small sized TLB while the hugepage TLB entry is still established in
	 * the huge TLB. Some CPU doesn't like that.
	 * See http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf, Erratum
	 * 383 on page 105. Intel should be safe but is also warns that it's
	 * only safe if the permission and cache attributes of the two entries
	 * loaded in the two TLB is identical (which should be the case here).
	 * But it is generally safer to never allow small and huge TLB entries
	 * for the same virtual address to be loaded simultaneously. So instead
	 * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
	 * current pmd notpresent (atomically because here the pmd_trans_huge
	 * must remain set at all times on the pmd until the split is complete
	 * for this pmd), then we flush the SMP TLB and finally we write the
	 * non-huge version of the pmd entry with pmd_populate.
	 */

So Linux goes 2M -> not present -> 4K instead of a direct 2M -> 4K
transition (and does the TLB flush while the entry is in the
not-present state), which apparently is needed on some CPUs to avoid
confusing the TLB.

> Before that's fully understood, we think the page table split for
> stack does no harm to the functionality and code complexity. That's
> why we choose this fix first.

So this basically splits the page right from the start instead of
doing it later, when the page attributes are changed.  That probably
avoids the huge page landing in the TLB in the first place, which in
turn avoids triggering the issues outlined above.

I think doing a Linux-style page split would be the more robust
solution.

take care,
  Gerd