From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from NAM04-CO1-obe.outbound.protection.outlook.com (NAM04-CO1-obe.outbound.protection.outlook.com [40.107.69.57]) by mx.groups.io with SMTP id smtpd.web12.3000.1618258459940885712 for ; Mon, 12 Apr 2021 13:14:20 -0700 Authentication-Results: mx.groups.io; dkim=fail reason="body hash did not verify" header.i=@amd.com header.s=selector1 header.b=jxKiRFpq; spf=permerror, err=parse error for token &{10 18 %{i}._ip.%{h}._ehlo.%{d}._spf.vali.email}: invalid domain name (domain: amd.com, ip: 40.107.69.57, mailfrom: brijesh.singh@amd.com) ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=Oi3V9edseQ4qwORilLBwNcjBgNH4mZ1ITsmA0yWf3P3OwtvTIHY7pZnEGGm7eaQXb/PHVvyReahWrBps+zCERD37TylFTpOyTtl9mlILWaebPt+bqjuz7+Em62MRmORRSPnQZnSZFnIB1twh9M5WT/Lpf/tSRtdOjV94aHg+BGipxUm9kcrUwPPk8kP1uxKO5arYiYWhDEdN3QBsE5aw9Aw9kOSIpZ+A4bbt6InLc7qKQG+JOmEKujjAi3c7FW9co0XhvsPZuttxLO8SYizR+TTtEdsiCu2E1Dk0ZIZACjrys9DJtVCSYXNfNFgSESx5w3zFV5D0MqpO3KtCLLGK5g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=16K6iPsREX3wO9hWuMn/TyCrDxu1IWsEdOzCHvvF01g=; b=bNGlrVBKJNeXlvPRWnnm0Kn77IvALddXFz/8/XRPtvVzXRq/yrHt4g7g0kKOrX5vS6Yo4E3UGxFwhmYDpoHAlY18KNfVWHAgKsmjwOicI4Or+q9Y7WPriL4aeSovKl1b73VNrse79E5qhuJPyQi+Marc6eEd4ebBMovPtMJacc4iv7FTn0n7wxYGWA/fohTrORDVo9olSj5c3udliCzyMEswX8f0cIlDRKZhWy+a99nsqBhydzijYIP7UzQpHLOhieUmB1ye46ExzKjyDNBi6X5Z+hzg1qrauwlfbLn+L3w/Bfo9WdpoWqWm8xLg8/phbNqf7hPtKRlEU1VLoMbaMw== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=amd.com; dmarc=pass action=none header.from=amd.com; dkim=pass header.d=amd.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amd.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=16K6iPsREX3wO9hWuMn/TyCrDxu1IWsEdOzCHvvF01g=; b=jxKiRFpqUGCtviqV2OawclbQikU2F+a9fOWxvBrcrs6YlJymo/KW6masBe9FNRmkWMDm75h+VM8639EyyOc+yF8+eV4Ch5/dFu/EmLB137GidRkfS/uWUtd7R8mk8oz8RYk66zIPAtccFLCiM5IJOzlvjSV790aw+vqmt/B+D58= Authentication-Results: kernel.org; dkim=none (message not signed) header.d=none;kernel.org; dmarc=none action=none header.from=amd.com; Received: from SN6PR12MB2718.namprd12.prod.outlook.com (2603:10b6:805:6f::22) by SN6PR12MB2784.namprd12.prod.outlook.com (2603:10b6:805:68::33) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.4020.18; Mon, 12 Apr 2021 20:14:17 +0000 Received: from SN6PR12MB2718.namprd12.prod.outlook.com ([fe80::9898:5b48:a062:db94]) by SN6PR12MB2718.namprd12.prod.outlook.com ([fe80::9898:5b48:a062:db94%6]) with mapi id 15.20.4020.022; Mon, 12 Apr 2021 20:14:17 +0000 CC: brijesh.singh@amd.com, James Bottomley , Min Xu , Jiewen Yao , Tom Lendacky , Jordan Justen , Ard Biesheuvel Subject: Re: [edk2-devel] [RFC PATCH 00/19] Add AMD Secure Nested Paging (SEV-SNP) support To: Laszlo Ersek , devel@edk2.groups.io References: <20210324153215.17971-1-brijesh.singh@amd.com> <34a50603-32b9-1476-04a5-8476dd810fe4@redhat.com> <213ce382-0b04-16f4-dd77-f9bb2cc32698@amd.com> <2cb456d3-c407-ff00-c274-af30d4385749@amd.com> <867ca68c-0a55-7466-43c2-cf6845fb8d75@redhat.com> From: "Brijesh Singh" Message-ID: Date: Mon, 12 Apr 2021 15:14:15 -0500 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Thunderbird/78.9.0 In-Reply-To: <867ca68c-0a55-7466-43c2-cf6845fb8d75@redhat.com> X-Originating-IP: [70.112.153.56] X-ClientProxiedBy: SA0PR11CA0209.namprd11.prod.outlook.com (2603:10b6:806:1bc::34) To SN6PR12MB2718.namprd12.prod.outlook.com (2603:10b6:805:6f::22) Return-Path: brijesh.singh@amd.com MIME-Version: 1.0 X-MS-Exchange-MessageSentRepresentingType: 1 Received: from Brijeshs-MacBook-Pro.local (70.112.153.56) by SA0PR11CA0209.namprd11.prod.outlook.com (2603:10b6:806:1bc::34) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.4020.22 via Frontend Transport; Mon, 12 Apr 2021 20:14:16 +0000 X-MS-PublicTrafficType: Email X-MS-Office365-Filtering-Correlation-Id: 6e6f3389-d83f-4c0c-7dd8-08d8fdef8ca4 X-MS-TrafficTypeDiagnostic: SN6PR12MB2784: X-MS-Exchange-Transport-Forked: True X-Microsoft-Antispam-PRVS: X-MS-Oob-TLC-OOBClassifiers: OLM:7219; X-MS-Exchange-SenderADCheck: 1 X-Microsoft-Antispam: BCL:0; X-Microsoft-Antispam-Message-Info: nBbmHcNrfusBo4BgpLzKeOHjJ1S736Vzdzzx1n3aAj8BwAuZkna0BMaiXv4paZ/018fXWrMQwWITjToetQInX79jtlouQi3MkI+Isy+CVrtUTVBJQTAo0eFJdAXuP73lnZmyBkvjKG3oSiF55jVcN6RWTsl+r6LhjfKctcJ48BUBbZRH5TSg23BV0DSZProcuMNlcoB7b1MZadq4rZXshajEfzUKG74Xt326kRTpVmTcWZJWB8Yih5ktq9M4onsCDuqLLe2Nn3M8Q4pAyx5jKHS9XsACamhHT9JOYUf9qEZ7V4TdGjmsFWHDvEOCzH98qTUIu/sgpiDOklbVrY68JozcPdChy8Xn6SkuDVlZMnpDgLlADYblgEAgEk9/mpq9VardTCYEF0v1zPwVXC9jbolECWhROUMiQpnvxa4Ll/zLhxNLtCj3OFHbXI3WG/jVccRp2vY8aORUgMSz6kHYH7gDAIXLx2j/NVpSq4NnsjxQv/QLSQF2Tqc+6iOCZUYkK2WvIvUQ+Av5l67/4t8f3RB4ye1fLxYiVbj7RwPiapoq/gJXDV1Yqq5oNh3VdTeWhRIDjcl+6JTmpaeXJgMF9aBvppjF1r1Dtkmtql4Q8cAVpqo3hZoeYrYxHlTe3ITXKJ/mW8SlAgqFViCpXxI9eJhOt4eCrwgDrnENu/tzAs3DbpTl4NGY4DtlCDT8A+5LbeJBt4lJhqTfxhGhDQpV74CC7dBgtet3BCcmjwijSIiKS8f7rLtCLQg7rL6i96egzYhGTIsIATyW8+G9XF++PAQfo3D//hjkHUHvrTDzV8dxRlgVqEuVVJQ6qGWCUuA5 X-Forefront-Antispam-Report: CIP:255.255.255.255;CTRY:;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:SN6PR12MB2718.namprd12.prod.outlook.com;PTR:;CAT:NONE;SFS:(4636009)(376002)(396003)(136003)(346002)(366004)(39860400002)(966005)(6512007)(38350700002)(86362001)(38100700002)(4326008)(478600001)(31696002)(5660300002)(36756003)(6486002)(66556008)(53546011)(30864003)(66476007)(66946007)(316002)(2906002)(8936002)(16526019)(956004)(18074004)(44832011)(52116002)(26005)(6506007)(54906003)(31686004)(83380400001)(8676002)(186003)(2616005)(45980500001)(43740500002);DIR:OUT;SFP:1101; X-MS-Exchange-AntiSpam-MessageData: =?us-ascii?Q?0WQf0pGmy/4BkuUNACyxGiLPRDK7OZW06Hca7+Jw+lPaYnyHUV1zZUHtNwyf?= =?us-ascii?Q?UDWmkZAJfhloHuo3KnkA86ZsZpEaUprM0KH6pWRdX7g9d488QyjOkV1MDSDK?= =?us-ascii?Q?yb8Waw8MijBCgq50w31pUDBfCMQVfMC7aS9rzMsoPZokBc6rEdTwE4brJ4rl?= =?us-ascii?Q?JhsxF4nlKfaCJFv4XQO38L32XJ5poj68PSorHAlIFfLGL5N6bX1HpGXj4DdB?= =?us-ascii?Q?8ANmmqZR8OPQu/1RM16b09+ekSWF4ZGhJXYQfL/OVGvAV3Un0kPAMyCSkXMV?= =?us-ascii?Q?SbS581lTDMBfFkgHIqvcWUM070zNH3Ma3F5JxdZaPAnHvTDGKXeBC1kqh5aD?= =?us-ascii?Q?lOpAdtMhcBNt/4+uu4T9yTQyV+O5CgU+zj5LDKFZ35B/rjTPrp5e73/29O0O?= =?us-ascii?Q?uTJQLaVCHM3sRzCO9XVkQkGDaHRP9EmZ3Wsvz1x4Iayqhs7hoJAHh4yiBd8R?= =?us-ascii?Q?yNOQxxwOXli61ubFNDS/ZDumiAB368BPaHS4SYBPPGas/ocDBd7p75oYGZ/S?= =?us-ascii?Q?dpMaxxflDulD62HPDg3dB8SVR3Z438ZCZDKKXW3av/zBzooWeEICpHi+aVFU?= =?us-ascii?Q?Dnf+yrm5i7o33Dou8cmS7VMjkOJLoI6BNBFAx6cpB9rGeIMaTjeaMtIQFu5H?= =?us-ascii?Q?8ijz8IaBzqOXLktNx5Fy2fk/VxtRAGnw36Zv0GaeeQWvnqVd4l5rs6OJPyNp?= =?us-ascii?Q?2ksFIhXU+HnwJKoZIytxink1GrWe9Z077zG3EUy0oPsWfe476OLw0S8NRxGP?= =?us-ascii?Q?Jc3tpTeYxDMcRqDcjMTLGPvnMO8rxvhMe8qrhBiqOstDAiUt+K8uiczXjwXk?= =?us-ascii?Q?hdWIkVVlm/KsMXaO/EuMafapcbP5xKrbrt5qHzOsc82hDZuzNidCl1lLGgX0?= =?us-ascii?Q?Ax9Aq6/a4D/2ZAB30+AcjX/kW5DEXcHqR59uhkg+93fNHpTwSW6OKQo6tvyX?= =?us-ascii?Q?1dSHNDaaAzcGDr4dqvcPZkyYZD1Z+MBXfwFatMVB4OnptehCGCnjMGbry42P?= =?us-ascii?Q?P82WYHQGINd9N1ZYJUV4dSO6CsOz+qb7E5Jcsd8y7OsWTfyxz+jjIgg2A/sx?= =?us-ascii?Q?MAe55hpBA3mC3faT+VgSclajfCrlfaXi5hv8CuXHDNXDGxNG7CDtksl699m3?= =?us-ascii?Q?tHjpNJifSrTrP2Bd5mpuiNp7S+60rWKOTuL8Ixo14z4N/2f04JrobNYwk1gL?= =?us-ascii?Q?12sKebc0RVTqaMPGJZtmksTH5wL6z5mvZaW2fLu+0yOD3bAvzEvrURean301?= =?us-ascii?Q?18NTOslPUZ71y6T+g5Soevo22TX+rQ6ICW89oHpvuPJ944ouVb6eLTqMegSt?= =?us-ascii?Q?Ixnx5MSoDvO+s3nz6SnLGHoI?= X-OriginatorOrg: amd.com X-MS-Exchange-CrossTenant-Network-Message-Id: 6e6f3389-d83f-4c0c-7dd8-08d8fdef8ca4 X-MS-Exchange-CrossTenant-AuthSource: SN6PR12MB2718.namprd12.prod.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Internal X-MS-Exchange-CrossTenant-OriginalArrivalTime: 12 Apr 2021 20:14:17.3385 (UTC) X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d X-MS-Exchange-CrossTenant-MailboxType: HOSTED X-MS-Exchange-CrossTenant-UserPrincipalName: MfF5JzB/xtB6i0L3hhL0fbhc5R8VJ1W06ljKi7x0FQYBGQUEI/Qqq+ezuvMwUOyUBdIvaeoqzwgb6fLVqtwYmw== X-MS-Exchange-Transport-CrossTenantHeadersStamped: SN6PR12MB2784 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Content-Language: en-US On 4/12/21 11:23 AM, Laszlo Ersek wrote: > On 04/10/21 00:43, Brijesh Singh wrote: >> On 4/9/21 7:24 AM, Laszlo Ersek wrote: >>> On 04/08/21 13:59, Brijesh Singh wrote: >>>> On 4/8/21 4:58 AM, Laszlo Ersek wrote: >>>>> On 03/24/21 16:31, Brijesh Singh wrote: >>>>>> At this time we only support the pre-validation. OVMF detects all >>>>>> the available system RAM in the PEI phase. When SEV-SNP is >>>>>> enabled, the memory is validated before it is made available to >>>>>> the EDK2 core. >>>>> Can you describe this in a bit more detail, before I look at the >>>>> individual patches? Specifically, what existing logic in the PEI >>>>> phase was taken, and extended, and how? >>>> One of the key requirement is that the guest private pages much be >>>> validated before the access. If guest tries to access the pages >>>> before the validation then it will result in #VC >>>> (page-not-validated) exception. To avoid the #VC, we propose the >>>> validating the memory before the access. We will incrementally add >>>> the support to lazy validate (i.e validate on access). >>> What is the primary threat (in simple terms please) that validation >>> is supposed to prevent? >> >> To protect against the memory re-mapping attack the guest pages must >> be validated. The idea is that every guest page can map only to a >> single physical memory page at one time. > I don't understand. How can a given guest page frame map to multiple > host page frames? > > Do you mean that the situation to prevent is when multiple guest page > frames (GPAs) map to the same host page frame, as set up by the > hypervisor? > A malicious hypervisor may attempt to remap validated gpa to a different host frame. The PVALIDATE is designed to protect against those type of attacks. See https://static.sched.com/hosted_files/lsseu2019/65/SEV-SNP%20Slides%20Nov%2= 01%202019.pdf (slide 13). You can also find more information about it in the SEV-SNP whitepaper. >>> And, against that particular threat, does pre-validation offer some >>> protection, or will that only come with lazy validation? >> For the hardware it does not matter how the memory was validated -- >> lazy vs prevalidate. Both approaches use the PVALIDATE instruction to >> validate the page. >> >> In the case of pre-validation, the memory is validated before the >> access. Whereas in the lazy validation, the access will cause a fault >> and fault handler should validate the page and retry the access. Its >> similar to a page fault handler, OS can populate the page table or >> back the pages on demand. >> >> The only downside of pre-validation is, we will take a hit on the boot >> time. The GHCB spec provides method by which we can batch multiple >> requests at once to minimize the context switches. > If pre-validation is simpler, and the only drawback is a one-time hit > during boot, then wouldn't it be better to stick with pre-validation > forever? I expect that would be a lot simpler for (a) the #VC handlers, > (b) the tracking of the "already validated" information. This could be an issue for the larger VMs. The boot delay will vary based on the VM size. In addition to the boot delay,=C2=A0 the PVALIDATE instruction requires that there is a valid entry for the page in the nested page table. If the entry does not exist in the NPT then HV will get #NPF and will fill the NPT with the backing host page. By using the lazy-validation we can push allocation of the backing pages to only when required and thus release the memory pressure on the VMM. > > >>>> Similar to SEV, the OVMF_CODE.fd is encrypted through the SNP >>>> firmware before the launch. The SNP firmware also validates the >>>> memory page after encrypting. This allows us to boot the initial >>>> entry code without guest going through the validation process. >>>> >>>> The OVMF reset vector uses few data pages (e.g page table, early Sec >>>> stack). Access to these data pages will result in #VC. There are two >>>> approaches we can take to validate these data pages: >>>> >>>> 1. Ask SNP firmware to pre-validate it -- SNP firmware provides an >>>> special command that can be used to pre-validate the pages without >>>> affecting the measurement. >>> This means the two pflash chips, right? >> >> This does not need to be two pflash chips. The SNP firmware command >> does not know anything about the ROM or pflash chip. The command >> accepts the system physical address that need to be validated by the >> firmware. In patch #2, OVMF provides a range of data pages that need >> to be validated by the SNP firmware before booting the guest. > I guess my question related to the guest code that executes from pflash, > and/or accesses data from pflash, before the guest itself could ask for > any validation. Such as, the reset vector (which runs from pflash). > Then, some non-volatile UEFI variable data that resides in the other > pflash chip, and is accessed (read-only) during the PEI phase (the > memory type information variable namely). I understood your comments as > QEMU pre-validating those areas before guest launch. Is that correct? Yes that is correct. All the OVMF pflash0 memory will be pre-validated before the guest is launched. IIRC, the variables which resides on the other pflash chip are accessed unencrypted in the guest. Only the encrypted memory access (aka C=3D1) need to be validated. The host MMIO regions does not need to be validated. > > Anyway, I guess at this point I should just start reviewing the series. > > Some more comments below. > > >>> - We need more areas made accessible in SEC than just the >>> decompression buffer; for example the temporary SEC/PEI heap and >>> stack, and (IIRC) various special SEV-ES areas laid out via MEMFD. >>> Where are all of those handled / enumerated? >> >> Sorry, I simplified my response by saying just decompression. You are >> right that its more than the decompression buffer. In my current >> patch, the SEC phase validates all the memory up to >> PcdOvmfDecompressionScratchEnd (which includes more than just output >> buffer). See patch #13. > That sounds good (will check the patch out of course). > > >>>> One of the important thing is we should *never* validate the pages >>>> twice. >>> What are the symptoms / consequences of: >>> >>> - the guest accessing an unvalidated page (I understand it causes a >>> #VC, but what is the direct result of that, when this series is >>> applied?), >> >> After the series it applied, an access to an unvalidated page will >> result in #VC. The series does not extend the VC handler to handle the >> page-not-validated error code. The current VC handler >> [InternalVmgExitHandleVc()] will inject a #GP to terminate the guest >> if #VC is raised for unvalidated page. > OK. > > >>> - doubly-validating a page? >>> >>> The first question is relevant because we should crash as soon as we >>> touch a page we forgot to validate (we shouldn't let any corruption >>> or similar spread out until we finally realize there's a problem). >>> >>> The second question is relevant for security I guess. What attacks >>> become possible, and/or what symptoms are produced, if we >>> doubly-validate a page? >> Assuming that guest validates its memory, this guarantees the >> injective mapping between GPAs and SPAs. >> >> As I noted that the page validation process consist of two steps #1 >> RMPUPDATE and 2#PVALIDATE. The RMPUPDATE is a hypervisor instruction >> and PVALIDATE is a guest instruction. A malicious hypervisor can remap >> gpa to different physical address in RMP table using the RMPUPDATE >> just before the second PVALIDATE instruction is executed. The guest >> need to ensure that it does not leave the security vulnerability that >> can be exploited by a malicious hypervisor. The SNP white paper >> recommends that a guest OS should keep track of what is has validate >> so that it can avoid getting into these situation. > Hm, OK. I think this answers my top-most question. > > >>> Furthermore, IIRC, we have separate #VC handlers for the different >>> firmware phases; do they behave consistently / identicall when a >>> #VC(page-not-validated) occurs, when this patch set is applied? >>> >>> My first question is basically asking whether we can *exclusively* >>> rely on #VC(page-not-validated) faults to tell us that we missed >>> validating a particular page. If we can do that, then the job is a >>> bit easier, because from the GPA, we can more or less also derive >>> *when and where* we should pre-validate the page (at least until >>> validation is done completely lazily). >> Yes, we should be able to rely #VC to tell us that we missed >> validating the page. In my this series so far I have not seen a #VC >> due to page-not-validated because we have carefully validated the >> pages before they are accessed. I am thinking that >> #VC(page-not-validated) should be treated as an error in the first >> phase. Incrementally we will add the lazy validation, in which the >> #VC(page-not-validated) will provide a hint as to what has not been >> validated. > I'm OK with the current handling of #VC(page-not-validated) -- i.e., > fatal error. > > I'm not yet convinced that lazy validation will be useful at all, but > that could be just my lack of understanding. See if my above comment makes sense on why we may need the lazy validation. IMO, we should not be going with the lazy validation in OVMF. To minimize the boot delay and avoid the complexity of the code in the best case OVMF validated lower 4GB memory and mark rest of the memory as "unaccepted" in the UEFI map. David from google has started a thread about lazy validation in kernel https://www.spinics.net/lists/linux-mm/msg243954.html. We would like to find a method which works for both Intel TDX and AMD SEV. We will build these support incrementally both in the kernel and OVMF. > > >>>> The MemEncryptSevSnpValidateRam() uses a interval search tree to >>>> keep the record of what has been validated. Before validating the >>>> range, it lookup in its tree and if it finds that range is already >>>> validated then do nothing. If it detects an overlap then it will >>>> validate only non overlapping regions -- see patch #14. hat data >>>> structure is used for implementing the interval tree? >>> I'm not necessarily looking for a data structure with "nice" >>> asymptotic time complexity. With pre-validation especially, I think >>> simplicity (ease of review) is more important for the data structure >>> than performance. If it's not an actual "tree", we shouldn't call it >>> a "tree". (An "interval tree" is usually an extension of a Red-Black >>> Tree, and that's not the simplest data structure in existence; >>> although edk2 does offer an rbtree library.) >> >> I am using binary search tree. Currently, we don't have many nodes to >> track. Most of the guest memory is private and are validated before we >> exit from the PEI phase. > My preliminary comments about MemEncryptSevSnpValidateSystemRam(): > > (1) The function prototype (patch#12) promises that the function can > never fail. In patch#14 however, I see some AddRangeToIntervalTree() > calls whose return values are not checked. I think error checking should > be strict, and if an error is detected (such as memory allocation > failure), a plain ASSERT (FALSE) is not sufficient. Both ASSERT() and > CpuDeadLoop() should be used explicitly, for the sake of RELEASE builds. > And the function prototype (the leading comment) should highlight that > the function either succeeds, or doesn't return. > > > (2) I don't like the recursive nature of the functions. That's always > convenient first, and almost always a mess to get out of later. However, > this is not a request to complicate the tree's implementation, > because... > > > (3) ... you mention we don't have many nodes to track. Is a tree > structure, with dynamically allocated nodes, justified at all? > > > (4) Regarding repeating dynamic allocation -- that's never a great thing > to do in the SEC / PEI phases. Can we use a fixed size, once-allocated, > or even static, array somewhere? > > In particular, patch#14 uses AllocatePool() for SNP_VALIDATED_RANGE, in > the PEI phase, that results in EFI_HOB_TYPE_MEMORY_POOL HOBs; see: > > AllocatePool() [MdePkg/Library/PeiMemoryAllocationLib/Memo= ryAllocationLib.c] > PeiServicesAllocatePool() [MdePkg/Library/PeiServicesLib/PeiServicesL= ib.c] > PeiAllocatePool() [MdeModulePkg/Core/Pei/Memory/MemoryService= s.c] > > Now, dependent on where you call this AllocatePool() from, it could > occur still on the temporary SEC/PEI RAM -- i.e., before permanent PEI > RAM is installed. That means that such HOBs would be subject to the HOB > list migration, after permanent PEI RAM is installed. The PI spec writes > about EFI_HOB_MEMORY_POOL: "The HOB consumer phase should be able to > ignore these HOBs". The PI spec writes about AllocatePool(): "The early > allocations from temporary memory will be migrated to permanent memory > when permanent main memory is installed; this migration shall occur when > the HOB list is migrated to permanent memory". > > So, whenever we use AllocatePool() in the PEI phase, we need an "address > stability proof", namely that the AllocatePool() call, and any later > access to the allocated area (the HOB), are never separated by the > installation of "gEfiPeiMemoryDiscoveredPpiGuid" in the PPI database, by > the PEI Core. If we can't guarantee that, then an access later on could > occur to the original allocation address, which has by then been > invalidated by the temporary RAM migration (=3D the allocation HOB has > been moved). > > AllocatePages() is "stable" however, in this sense; no proof like the > one described above is necessary. > > > (5) The "mRootNode" variable is defined in the file > "OvmfPkg/Library/BaseMemEncryptSevLib/X64/PeiSnpSystemRamValidate.c" > (patch #14). That means that every single module (i.e., PEIM) which > consumes this library instance will own a separate "mRootNode" variable, > with a validation-tracking tree that's correspondingly private to that > PEIM. > > But the validation state is global to the entire firmware. I think we > should somehow clarify, or even enforce, that the tracking data > structure is firmware-global. > > Perhaps this can be solved by (a) moving the ownership of the data > structure to a particular firmware module, with sole responsibility for > pre-validation, and (b) updating the validation APIs to take this data > structure explicitly. > > Alternatively, if multiple modules are expected to inter-operate on the > tracker data structure (by which I mean inter-operation *in sequence*, > not concurrently in a multiprocessing sense), then we might need a > dynamic PCD for inter-driver understanding. > > > (NB: both points (4) and (5) seem to conflict with *existent* code in > BaseMemEncryptSevLib. However: > > - concerning (4), the current allocations are all performed with > AllocateAlignedPages(), which are unaffected by temporary-to-permanent > RAM migration, > > - and concerning (5), namely the existent "mPageTablePool" global > variable, is a *known bug* -- see commit b721aa749b86a and > TianoCore#847. This issue (namely, multiple modules messing with the > page tables independently) remains unsolved, and I'd like to avoid > introducing *more* paging-related allocations that are module-owned, > and not firmware-level.) > > > All in all, here's what I'd prefer (and please correct me if it's > unrealistic): > > - make the SNP validation functions take a simple, *unresizeable* > tracker structure for input/output, > > - allocate that data structure with a single AllocatePages() call in the > proper PEIM, > > - if the SNP validation functions run out of space in the structure, > hang hard, > > - avoid recursion, and use a simple linear search through the array, > > - evaluate whether graceful handling of overlaps (=3D range splitting, > merging) is actually a requirement. See more on this below. Yes this will work fine. The primary reason for me adding the interval tree was to detect the overlap condition before we finalize the pre-validation in PEI. The validation flow works like this: - Qemu validated few pages - SEC validates few more page - When we detect the memory in the PEI we need to exclude the pages validated by the Qemu and SEC. The page ranges validated using the Qemu and SEC are accessible through a fixed PCDs. I can do simple overlap check with those fixed PCDs and skip the pre-validated ranges. I don't have any strong reason to use the interval tree to detect the overlap condition. There is no call to toggle the C-bit before we finalize the validation.=C2=A0 IMO, there is no need for tracking the page state after we finish the pre-validation. After the pre-validtion is completed, the caller will use either clear the C-bit or set the C-bit. If it attempts to set the C-bit without clearing it first then its a bug -- which will be caught by the page state change internal function. We can use the output of CF flag from the PVALIDATE to determine where caller is trying to do doubly validation or invalidation and abort it. > > (6) Can you please confirm that the order in which > MemEncryptSevSnpValidateSystemRam()'s declaration, call sites, and > implementations are introduced, is sensible? I can't tell immediately, > but I'd like to be sure that the tree at least builds at every stage (no > linkage errors at any stage). I can certainly say the order in which they are introduced and the call sites are a sensible. I don't expect any linkage failure but since it was an RFC so I didn't go through making sure that it builds in every stage. I wanted to get the overall direction on where community would like to go before making the final changes. You have been asking some very good question and that is certainly helping us to reduce some of the complexity in the patch. > >>> Furthermore, what you describe above is called idempotency. No matter >>> how many times we attempt to validate a range, it may (or may not >>> even) cause an actual change in the first action only. Is this >>> property (=3Didempotency) an inherent requirement of the technology, or >>> is it a simplification of the implementation? Put differently: if you >>> called CpuDeadLoop() in the validation function any time an >>> overlapping validation request were received, would that hugely >>> complicate the call sites? >> From the technology point of view you can validate the same page again >> and again. Hardware does not force any restriction. To protect against >> time-based remap attack a guest also need to ensure that it does not >> blindly validates the pages otherwise guest is taking risk. Avoiding a >> double validation is strong recommendation. In current implementation, >> the validation burden is not put on the caller. Caller calls standard >> APIs to toggle the encryption attribute. Under the hood, if we see >> that page is already validated then no need to issue the PVALIDATE. If >> the page is getting changed from C=3D1 -> 0 then unvalidate it. > There remains a huge gap in my understanding here. (The rest of my > earlier questions did concern the relationship between the C bit and > validation, but I might as well comment right at this spot.) > > "Invalidation in connection to C-bit toggling" means that there's going > to be significant "validation churn" in the DXE phase. For example when > fw_cfg is used (which relies on the edk2 IOMMU protocol), or when virtio > transfers are performed (which also relies on the edk2 IOMMU protocol, > although less directly -- through PciIo, namely). > > That, however, brings up several problems for me: > > - Is the tracker data structure scalable? I don't see any balancing in > it. (And I wouldn't want a new, complex data structure for it anyway.) > > - In fact, I don't see any code related to invalidation -- more > precisely, I don't see where and how the C-bit changes consult the > tracker structure *at all*. > > - Given issue (5), I don't see how the tracker structure would be > inherited from the PEI phase to the DXE phase. > > In particular, in patch#18, the flipping of the C-bit results in calls > to SetPageStateInternal(). But SetPageStateInternal() does not go > *through* the tracker structure. Even in patch#14, > SevSnpValidateSystemRam() calls SetPageStateInternal() *in addition to* > AddRangeToIntervalTree(). > > In other words, I don't see where validity is tracked in the DXE phase, > in response to the C-bit toggling. As I highlighted above there is actually no need to track the page state after we successfully complete the pre-validation. All these guest page validation or invalidation are applicable to the system RAM. But AmdSevDxe driver calls to clear the C-bit of the MMIO range. These range are not a system and have never been validated so we invalidating it will cause an issue. Since I had the data structures available in PEI phase for the tracking the page state hence made those available to DXE to verify that we are called to invalidate the SYSTEM_RAM and not MMIO. IMO, we should either extend the MemEncryptSev{Set,Clear}PageEncMask() to pass either a new flag to hint where its system RAM or non-RAM. Thoughts ? > >>> I'm kind of "obsessing" about idempotency because you say we must >>> *never* doubly-validate a page, so the *difference* between: >>> - explicitly crashing on such an attempt, >>> - and silently ignoring such an attempt, >>> may be meaningful. >> >> I said we should never doubly validate to avoid creating a >> vulnerability that can later be exploited by the malicious hypervisor. >> Currently we lookup address in our tree to ensure that we don't do a >> double validation. The PVALIDATE instruction also can be used to >> determine if we are attempting to validate an already validated page. >> In current implementation after the PVALIDATE completion, I check the >> rflags.CF to ensure that its not a double validation condition. If its >> double validation then abort the guest. Its bug in the guest. >> >>> It's kind of the difference between "oops this is a call site *bug*, >>> but we patched it up", vs. "this is expected of call sites, we should >>> just handle it internally". >> Our attempt should be to patch to avoid the double validation. I would >> still check the status of PVALIDATE, if instruction detects it as a >> double validation then we have a bug in our tracking and should abort >> the guest. > Let me rephrase. My question concerns the *abstraction level* at which > we should catch and handle double-validation attempts. > > If we silently split and merge ranges in the tracker structure, then the > guest (as a whole) will not double-validate, which is good. However, > double-validation attempts will potentially still *exist*, in > higher-level modules in the firmware. My question is whether it is > *right* to paper over such overlaps, coming from the high-level code, in > the tracker structure. I suspect that it's not right. Yes we should bail out if such a request comes from the high level module. This is why I was actually not checking the tracking structure when toggling the C-bit. The complete memory much have been validated before we are asked to toggle the C-bit. If a module is performing a doubly validation or invalidation then it should be aborted to avoid any future exploits. > > In other words, the spot where we should abort, due to > "double-validation detected", is not when the PVALIDATE instruction > reports a problem in the tracker structure. Instead, the spot where we > should abort is when we detect an overlapping request *before* issuing > PVALIDATE. I'd like the tracker to be as simple and dumb as possible, > and I'd like the actual sites that request validation to blow up > immediately, if they issue an overlapping request. Is my expectation > wrong? > If we go with the tracking only until we complete the pre-validation then we can use the hardware PVALIDATE help to detect the doubly validation and abort it. There is no need to check the overlap conditions. >>>> The patch #18 extend the MemEncrypt{Set,Clear}PageEncMask() to call >>>> the SNP page state change during the C-bit toggle. >>> - When exactly do we invalidate? >> Basically before the page is transition from C=3D1 -> 0 in the page >> table. >> >>> - Does the time and place of invalidation depend on whether we >>> perform pre-validation, or lazy validation? >> It does not matter. Both type of validation update the Validated bit >> in the RMP table. The important thing is that invalidation can happen >> only for a already validated private page. If the page is shared then >> invalidation will result in #GP. >>> - Is page invalidation idempotent too? >> We should treat this as idempotent too. As I said, tacking the page >> state is strongly recommended for the security reason. The SNP white >> paper has a section dedicated for the validation and may able to >> clarify things which is not covered in my response. >>> - What is the consequence (security impact) if we forget >>> invalidation? >> I can't think of a security implication of it but if we forget to >> invalidate then it can lead issues in the tracking data structure when >> it comes to validation next time. e.g a page was mapped as C=3D1 and we >> changed it to C=3D0 without invalidating it. Now if the same page is >> being changed from C=3D0 -> 1 then tracking data structure may think >> that page is already validated and will skip the validation process >> and thus access will result in #VC (page-not-validated). > As I said above, I don't see where the validation tracking and the C-bit > flipping "meet". Especially across the PEI and DXE phases. > > Basically I'm missing the core concept for maintaining a firmware-global > structure, for tracking the validation status of GPA intervals. > > Up to and *excluding* the DXE phase, the validation tracking should be > trivial, naive, and unforgiving. This is because, in the PEI phase, > validation is one-shot. I would expect the tracker structure to live in > the PlatformPei module. > > Starting with the DXE phase, where we need (in)validation to closely > accompany C-bit toggling, we need a different, performant data structure > for tracking validation status. This data structure should be primed > from the status last set up in the PEI phase. The tracking should still > be unforgiving. I believe *this* tracker structure may have to live in > the IOMMU driver. > > BaseMemEncryptSevLib may offer helper functions, but even if it does, it > should never attempt to *own* tracking information, via an internal > global variable. Furthermore, those helper functions that only make > sense for a particular firmware phase, should have such implementations > that hang hard, in library instances that match the *other* firmware > phases. For example, SevSnpValidateSystemRam() should have a > DXE-specific implementation that hangs hard. > > It's entirely possible that my points above are complete garbage. I > don't have a grasp on the core concept, for the firmware-wide tracking > of the validation status. > > ... Ugh, wait. I've just noticed something. In patch#19, > "SEC_SEV_ES_WORK_AREA.SnpSystemRamValidatedRootAddress" is introduced, > which seems (?) to be implementing the kind of global (inter-driver, > inter-phase) tracking that I've been missing. However, even if my guess > (?) is correct about that, this idea definitely doesn't belong in the > last patch, presented as a "bugfix". Even if it were a bugfix, it should > be incorporated into one of the earlier patches (so that we preferably > not introduce the bug in the first place). However, my argument is that > it's not a bugfix -- to me that seems to be the core concept. That's > what the whole series needs to be built upon. (In all of my thinking, > before this particular paragraph, I *never* looked at patch#19. There > was no reason to.) It was my bad that I should have highlighted this before and not left it to the last patch. As you can see the only reason pass the data structure from PEI to DXE is to catch those non SYSTEM invalidation request coming from the AmdSev Driver. I am too happy about the solution I have in this RFC. I want to go with option to change the call sites to tell us whether its a system RAM or non existence memory type.=C2=A0 > >>> - There are four page states I can imagine: >>> - encrypted for host access, valid for guest access >>> - encrypted for host access, invalid for guest access >>> - decrypted for host access, valid for guest access >>> - decrypted for host access, invalid for guest access >>> >>> Does a state exist, from these four, that's never used? (Potentially >>> caught by the hardware?) >> I would not mix the page state with the host access. Basically there >> is two page state exist >> >> - Guest-Valid >> >> - Guest-Invalid >> >> By default all the guest memory is in Guest-Invalid state on boot. The >> PVALIDATE instruction is used to transition from Valid->Invalid. > Do you mean "Invalid->Valid" transition? Ah yes that is correct. > > >> Guest can use the PVALIDATE to change the state from Valid to Invalid. >> >> A guest private page (i.e page mapped C=3D1) must be in Guest Valid >> state. If hardware see that a private access page is not in the >> Guest-Valid state then it will trigger #VC. >> >> >>> Do the patches highlight / explain the validity transitions, in >>> comments? My understanding is that the C-bit toggle and the guest >>> access valid/invalid toggle are separate actions, so "middle" states >>> do exist, but perhaps only temporarily. >> I have tried to document the steps in the patch description and I can >> add more comments as required. When SEV-SNP is active then we need to >> invalidate the page before toggling the C-bit from C=3D1 -> 0 and >> similarly after the page is toggled from C=3D0 -> 1 we must validate it. >> >> >>> I'm curious how it works, for example, with variousvirtio transfers >>> (bus master read, bus master write, bus master common buffer). In the >>> IoMmuDxe driver minimally, we access memory both through PTEs with >>> the C-bit set and through PTEs with the C-bit clear, meaning that >>> "encrypted, valid", and "decrypted, valid" are both required states. >>> But that seems to conflict with the notion that "C-bit toggle" be >>> directly coupled with a "validity toggle". >> That should not be an issue. Each page have entry in the RMP table. >> So, you can have one page as guest invalid and another page as a guest >> valid. >> >>> Put differently, I'm happy that modifying IoMmuDxe appears >>> unnecessary, but then that tells me I'm missing something about the >>> state transitions between the above *four* states. >> >> Only thing which I would propose changing in IoMmuDxe driver is to use >> a pre-allocated buffer and avoid toggling the C-bit on every >> read/write. In past the C-bit toggle was all contained inside the >> guest. But when SNP is enabled, each C-bit toggle causes two >> additional steps #1 RMPUPDATE and #2 PVALIDATE. The RMPUPDATE is >> requested using the VMEXIT so it will cause a VM context switch. I >> would prefer to avoid it. >> >> My current patch set does not have this optimization yet. > Please let's stick with the absolute minimum in this series, yes. > > I realize my email is complete chaos. That's because my understanding of > this work is complete chaos. > > Can we approach this feature as follows: > > - Make "validation tracking" the core tenet (central principle) of this > feature. > > - Design the validation *patterns* that are required in each firmware > phase. > > - Identify the *sole* owner module, for validation tracking, in each > firmware phase. If we can't readily do this, we might have to > introduce new drivers (PEIMs, DXE drivers?) and new interfaces (PPIs, > protocols?) > > - Design the hand-off between the owner modules (on firmware phase > boundaries). > > - Add only helper functions to BaseMemEncryptSevLib, without any > ownership of tracking info. The helpers should be as restricted as > possible. > > I'd like to avoid "smearing" page validation responsibility over a bunch > of modules. (See again: TianoCore#847.) I'd like responsibilities > strictly centralized. > > If you feel that my points make no sense, I can start going over the > individual patches. However, even that way, I'd have found patch#19 only > in the end, and that doesn't seem right. And I wouldn't like to go > "numb" on the actual code, being distracted by style issues and so on, > before we have some agreement on the core approach. Certainly your points makes sense. Let me know what you think about removing the tracking data structure and using a liner array of pre-valiated range (build with a Fixed PCD) works ? I was hoping you to just glance over it and provide me high level feedback. I will go through styles and comments in great details on non RFC patches. I still have some TODO items (e.g Secrets, CPUID etc) before we take off the RFC tag. Thank you so much for your detail feedback. > > Another tip: if you can add content to MdePkg that is governed by AMD > specifications, such as patches #3, #5, #9, I would recommend splitting > that off to a separate (pre-requisite) series. That content seems > unaffected by whatever we do in OvmfPkg, and it would help decrease the > size of the OvmfPkg series. It's not like we're going to abandon the > SEV-SNP feature, so I don't think there's a risk in merging the MdePkg > series first, while the OvmfPkg series is not ready. The MdePkg stuff > will not remain unused indefinitely. Ah, sure I can do that.=C2=A0 -Brijsh