From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [216.205.24.124]) by mx.groups.io with SMTP id smtpd.web11.8753.1618318828816365034 for ; Tue, 13 Apr 2021 06:00:29 -0700 Authentication-Results: mx.groups.io; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=YGtZ0XnJ; spf=pass (domain: redhat.com, ip: 216.205.24.124, mailfrom: lersek@redhat.com) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1618318828; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=QL6bNqpKeor4g1PDFRf5jy9bqXe53c6hmMCavB5w3xk=; b=YGtZ0XnJN7PehsRbk2qndhMaHmr1sHqWs9OY8t6iCCHrzNwW+/viE3IMRNvHLGl/vT01Rw eAfWKh2N6B43zk2fk7nNkLDVm5vTow3VScKPNArtaLY+RS5rugZPYBt8O+ReON6Zys4HWp 8cYnHtEmDutLi+s5dfJXITfN3f6wehc= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-3-AK93LUseMnOUoCkEBMoPJw-1; Tue, 13 Apr 2021 09:00:23 -0400 X-MC-Unique: AK93LUseMnOUoCkEBMoPJw-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 3D6188030A0; Tue, 13 Apr 2021 13:00:22 +0000 (UTC) Received: from lacos-laptop-7.usersys.redhat.com (ovpn-115-199.ams2.redhat.com [10.36.115.199]) by smtp.corp.redhat.com (Postfix) with ESMTP id 66EDE60C04; Tue, 13 Apr 2021 13:00:19 +0000 (UTC) Subject: Re: [edk2-devel] [RFC PATCH 00/19] Add AMD Secure Nested Paging (SEV-SNP) support To: devel@edk2.groups.io, brijesh.singh@amd.com Cc: James Bottomley , Min Xu , Jiewen Yao , Tom Lendacky , Jordan Justen , Ard Biesheuvel References: <20210324153215.17971-1-brijesh.singh@amd.com> <34a50603-32b9-1476-04a5-8476dd810fe4@redhat.com> <213ce382-0b04-16f4-dd77-f9bb2cc32698@amd.com> <2cb456d3-c407-ff00-c274-af30d4385749@amd.com> <867ca68c-0a55-7466-43c2-cf6845fb8d75@redhat.com> From: "Laszlo Ersek" Message-ID: <8002d119-f772-3f8c-ee07-7973ef97428d@redhat.com> Date: Tue, 13 Apr 2021 15:00:18 +0200 MIME-Version: 1.0 In-Reply-To: X-Scanned-By: MIMEDefang 2.79 on 10.5.11.12 Authentication-Results: relay.mimecast.com; auth=pass smtp.auth=CUSA124A263 smtp.mailfrom=lersek@redhat.com X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit On 04/12/21 22:14, Brijesh Singh wrote: > > On 4/12/21 11:23 AM, Laszlo Ersek wrote: >> On 04/10/21 00:43, Brijesh Singh wrote: >>> On 4/9/21 7:24 AM, Laszlo Ersek wrote: >>>> What is the primary threat (in simple terms please) that validation >>>> is supposed to prevent? >>> >>> To protect against the memory re-mapping attack the guest pages must >>> be validated. The idea is that every guest page can map only to a >>> single physical memory page at one time. >> I don't understand. How can a given guest page frame map to multiple >> host page frames? >> >> Do you mean that the situation to prevent is when multiple guest page >> frames (GPAs) map to the same host page frame, as set up by the >> hypervisor? >> > A malicious hypervisor may attempt to remap validated gpa to a > different host frame. The PVALIDATE is designed to protect against > those type of attacks. See > https://static.sched.com/hosted_files/lsseu2019/65/SEV-SNP%20Slides%20Nov%201%202019.pdf > (slide 13). You can also find more information about it in the SEV-SNP > whitepaper. Thanks for the link. Page 7 confirms my understanding: """ Memory Aliasing Example attack: Map two guest pages to same DRAM page """ (On the other hand, slide 12 seems to be buggy, "Page validation guarantees that each GPA only maps to one valid SPA at any given time" -- the GPA:SPA relationship is n:1, so any given GPA could never ever map to multiple SPAs, regardless of SEV-SNP. The multiplicity is possible in the other direction (a single SPA is exposed as multiple GPAs, maybe in a single guest, maybe in multiple guests), and that's what SEV-SNP is supposed to prevent. The GPA->SPA mapping must be injective, but that's not what the words on the slide actually express.) The page remapping slide (13) does not necessarily break injectivity, and even if the hypervisor does that, the same GPA *still* corresponds to a single SPA, at this new ("one") time. It's just that the correspondence is not what the guest has approved. Expressing this correctly would require a temporal logic formula of sorts. Anyway... at least now I (believe I) understand that these are two separate attacks (aliasing vs. remapping). >> If pre-validation is simpler, and the only drawback is a one-time hit >> during boot, then wouldn't it be better to stick with pre-validation >> forever? I expect that would be a lot simpler for (a) the #VC >> handlers, (b) the tracking of the "already validated" information. > > This could be an issue for the larger VMs. The boot delay will vary > based on the VM size. In addition to the boot delay, the PVALIDATE > instruction requires that there is a valid entry for the page in the > nested page table. If the entry does not exist in the NPT then HV will > get #NPF and will fill the NPT with the backing host page. By using > the lazy-validation we can push allocation of the backing pages to > only when required and thus release the memory pressure on the VMM. Ugh... OK, I guess. I'd need to digest this a lot more, for a fully informed agreement, but I'm happy with this answer (and my rudimentary understanding thereof) for now. >> I guess my question related to the guest code that executes from >> pflash, and/or accesses data from pflash, before the guest itself >> could ask for any validation. Such as, the reset vector (which runs >> from pflash). Then, some non-volatile UEFI variable data that resides >> in the other pflash chip, and is accessed (read-only) during the PEI >> phase (the memory type information variable namely). I understood >> your comments as QEMU pre-validating those areas before guest launch. >> Is that correct? > > > Yes that is correct. All the OVMF pflash0 memory will be pre-validated > before the guest is launched. IIRC, the variables which resides on the > other pflash chip are accessed unencrypted in the guest. Only the > encrypted memory access (aka C=1) need to be validated. The host MMIO > regions does not need to be validated. Good point. Thanks for the explanation / reminder! > See if my above comment makes sense on why we may need the lazy > validation. IMO, we should not be going with the lazy validation in > OVMF. To minimize the boot delay and avoid the complexity of the code > in the best case OVMF validated lower 4GB memory and mark rest of the > memory as "unaccepted" in the UEFI map. I'm going to violently support any idea that keeps things simple for OVMF -- is this "lower 4GB" what the current patch set does already? Or is it the proposed next step? > David from google has started a thread about lazy validation in kernel > https://www.spinics.net/lists/linux-mm/msg243954.html. We would like > to find a method which works for both Intel TDX and AMD SEV. We will > build these support incrementally both in the kernel and OVMF. Ah, OK. That explains it. Next step then. Hmm, yes, patch#15 seems to pre-validate everything now. OK. > Yes this will work fine. The primary reason for me adding the interval > tree was to detect the overlap condition before we finalize the > pre-validation in PEI. The validation flow works like this: > > - Qemu validated few pages > > - SEC validates few more page > > - When we detect the memory in the PEI we need to exclude the pages > validated by the Qemu and SEC. > > The page ranges validated using the Qemu and SEC are accessible > through a fixed PCDs. I can do simple overlap check with those fixed > PCDs and skip the pre-validated ranges. I don't have any strong reason > to use the interval tree to detect the overlap condition. There is no > call to toggle the C-bit before we finalize the validation. Thank you. This is a *huge* relief. > IMO, there is no need for tracking the page state after we finish the > pre-validation. After the pre-validtion is completed, the caller will > use either clear the C-bit or set the C-bit. If it attempts to set the > C-bit without clearing it first then its a bug -- which will be caught > by the page state change internal function. We can use the output of > CF flag from the PVALIDATE to determine where caller is trying to do > doubly validation or invalidation and abort it. Sounds good. Even in case the final PVALIDATE action proves incorrect, if we catch that immediately via CF, and refuse to do anything else (i.e. we won't go ahead and use the page(s) in question), things should be OK. >> (6) Can you please confirm that the order in which >> MemEncryptSevSnpValidateSystemRam()'s declaration, call sites, and >> implementations are introduced, is sensible? I can't tell >> immediately, but I'd like to be sure that the tree at least builds at >> every stage (no linkage errors at any stage). > > I can certainly say the order in which they are introduced and the > call sites are a sensible. I don't expect any linkage failure but > since it was an RFC so I didn't go through making sure that it builds > in every stage. I wanted to get the overall direction on where > community would like to go before making the final changes. You have > been asking some very good question and that is certainly helping us > to reduce some of the complexity in the patch. Thanks. I've been very stressed as to whether my questions or "proposals" make any sense. And yes, it's a desirable trait for a patch set to build at every stage (bisectability). > As I highlighted above there is actually no need to track the page > state after we successfully complete the pre-validation. Again, I welcome this very much. > All these guest page > validation or invalidation are applicable to the system RAM. But > AmdSevDxe driver calls to clear the C-bit of the MMIO range. These > range are not a system and have never been validated so we > invalidating it will cause an issue. Understood. > Since I had the data structures available in PEI > phase for the tracking the page state hence made those available to > DXE to verify that we are called to invalidate the SYSTEM_RAM and not > MMIO. IMO, we should either extend the > MemEncryptSev{Set,Clear}PageEncMask() to pass either a new flag to > hint where its system RAM or non-RAM. Thoughts ? I think I'd prefer a new function in the lib class, one that's dedicated to clearing the C bit on MMIO without attempting invalidation. It's a special-case function, and the dedicated name helps with two things: - no need to mess with existing call sites except in AmdSevDxe, - the new function is easier to grep for. BTW, there would be at least one other approach for this, I think -- the PEI instance of the library could consult the HOB list, and the DXE instance of the library could consult the GCD memory space map, to decide whether the page being C-bit-flipped is MMIO or not. But that's a lot of complexity (and likely some performance hit too), when in fact we know at the call sites whether the area being encrypted/decrypted is MMIO or RAM. Note also that (IIUC) in AmdSevDxe, we only *decrypt*, so we only need *one* new function, "MemEncryptSevClearMmioPageEncMask" or similar. BTW, in the "MemEncryptSevLib.h" header, the documentation of the MemEncryptSev{Set,Clear}PageEncMask function declarations should be updated -- the leading comments should explain that, in case SEV-SNP is found active, then page (in)validation will occur too (as appropriate). > Yes we should bail out if such a request comes from the high level > module. This is why I was actually not checking the tracking structure > when toggling the C-bit. The complete memory much have been validated > before we are asked to toggle the C-bit. If a module is performing a > doubly validation or invalidation then it should be aborted to avoid > any future exploits. Great, thank you. > If we go with the tracking only until we complete the pre-validation > then we can use the hardware PVALIDATE help to detect the doubly > validation and abort it. There is no need to check the overlap > conditions. Again, great. In the pre-validation step, where you explicitly carve out the PCD-described ranges that have been handled already by QEMU and earlier phases of the firmware, you can still check the PVALIDATE result (CF), just to be sure. > Certainly your points makes sense. Let me know what you think about > removing the tracking data structure and using a liner array of > pre-valiated range (build with a Fixed PCD) works ? I couldn't have hoped for such a great resolution to this conundrum. Yes, please do that. > I was hoping you to just glance over it and provide me high level > feedback. I will go through styles and comments in great details on > non RFC patches. I still have some TODO items (e.g Secrets, CPUID etc) > before we take off the RFC tag. Thank you so much for your detail > feedback. Well, this is the first email on the SEV-SNP topic where what I feel can be described as "relief" :) Thank you! Laszlo