From: Laszlo Ersek
To: "Yao, Jiewen", Brijesh Singh, "Zeng, Star", edk2-devel-01
Cc: "Dong, Eric"
Date: Thu, 7 Sep 2017 13:46:48 +0200
Subject: Re: [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices()

On 09/07/17 06:46, Yao, Jiewen wrote:
> Thanks for sharing, Brijesh.
>
> "If a page was marked as "shared", then it is the guest's responsibility to make it "private" after it is done communicating with the hypervisor."
>
> I believe I have the same understanding -- a *guest* should guarantee that.
>
> My question is: during the *transition* from firmware to OS, *which guest* should make the shared buffer private again? Is it the "guest firmware" or the "guest OS"?
>
> Maybe I can ask two specific questions to make it clearer.
>
> 1) If the common DMA buffer is not unmapped at ExitBootServices(), is that treated as an issue?
>
> 2) If the read/write DMA buffer is not unmapped at ExitBootServices(), is that treated as an issue?

Very good questions, totally to the point. On the authoritative answer, I defer to Brijesh.
( My personal opinion is that both situations (#1 and #2) are problems, because they break the *practical* security invariant for SEV guests:

- most memory should be encrypted at all times, *and*

- any memory that is decrypted must have an owner that is responsible for re-encrypting the memory eventually.

Therefore, *either* the firmware has to re-encrypt all remaining DMA buffers at ExitBootServices(), *or* a new information channel must be designed, from firmware to OS, to carry over the decryption status.

I strongly prefer the first option, for the following reason: the same questions apply to all EDK2 IOMMU protocol interfaces, not just the one exported by the SEV driver. )

Thanks,
Laszlo

> From: Brijesh Singh [mailto:brijesh.singh@amd.com]
> Sent: Wednesday, September 6, 2017 11:40 PM
> To: Laszlo Ersek; Yao, Jiewen; Zeng, Star; edk2-devel-01
> Cc: brijesh.singh@amd.com; Dong, Eric
> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices()
>
> On 09/06/2017 07:14 AM, Laszlo Ersek wrote:
>> On 09/06/17 06:37, Yao, Jiewen wrote:
>>> Thanks for the clarification. Comments inline.
>>>
>>> From: Laszlo Ersek [mailto:lersek@redhat.com]
>>> Sent: Wednesday, September 6, 2017 1:57 AM
>>> To: Yao, Jiewen; Zeng, Star; edk2-devel-01
>>> Cc: Dong, Eric; Brijesh Singh
>>> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices()
>>
>>>> Then after ExitBootServices, the OS will take control of CR3 and set the correct encryption bit.
>>>
>>> This is true, the guest OS will set up new page tables, and in those PTEs, the C-bit ("encrypt") will likely be set by default.
>>>
>>> However, the guest OS will not *rewrite* all of the memory, with the C-bit set. This means that any memory that the firmware didn't actually re-encrypt (in the IOMMU driver) will still expose stale data to the hypervisor.
>>>
>>> [Jiewen] That is an interesting question. Is there any security concern associated?
>>
>> I can't tell for sure. Answering this question is up to the proponents of SEV, who have designed the threat model for SEV.
>>
>> My operating assumption is that we should keep memory as tightly encrypted as possible at the firmware --> OS control transfer. It could be an exaggerated expectation from my side; I'd just rather be safe than sorry :)
>
> Let me give a brief intro to SEV (Secure Encrypted Virtualization) and then we can discuss a security model (if needed).
>
> SEV is an extension to the AMD-V architecture which supports running encrypted virtual machines (VMs) under the control of a hypervisor. Encrypted VMs have their pages (code and data) secured such that only the guest itself has access to the unencrypted version. Each encrypted VM is associated with a unique encryption key; if its data is accessed by a different entity using a different key, the encrypted guest data will be incorrectly decrypted, leading to unintelligible data. You can also find more detail on SEV in the white paper [1].
>
> SEV-encrypted VMs have the concept of private and shared memory. The private memory is encrypted with the guest-specific key, while shared memory may be encrypted with the hypervisor's key. SEV guest VMs can choose which pages they would like to be private. But the instruction pages and the guest page tables are always treated as private by the hardware.
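(For concreteness: the "C-bit" that the discussion above and below keeps returning to is a physical-address bit whose position is not fixed by the architecture; the guest discovers it at run time. Below is a minimal, self-contained sketch of that discovery. The CPUID leaf, the status MSR and the bit meanings are taken from AMD's public documentation and should be double-checked against the APM; the function itself is illustrative and is not OVMF's actual MemEncryptSevLib code.)

#include <Uefi.h>
#include <Library/BaseLib.h>

#define CPUID_MEM_ENCRYPT_INFO  0x8000001F   // AMD memory encryption leaf
#define MSR_SEV_STATUS          0xC0010131   // SEV status MSR

UINT64
GetSevEncryptionMask (
  VOID
  )
{
  UINT32  Eax;
  UINT32  Ebx;
  UINT64  SevStatus;
  UINT8   CBitPosition;

  //
  // EAX bit 0: SME supported, bit 1: SEV supported.
  // EBX bits 5:0: position of the C-bit in a page table entry.
  //
  AsmCpuid (CPUID_MEM_ENCRYPT_INFO, &Eax, &Ebx, NULL, NULL);
  if ((Eax & BIT1) == 0) {
    return 0;                            // SEV not supported at all
  }

  //
  // Bit 0 of the SEV status MSR tells the guest whether SEV is active
  // for this very VM.
  //
  SevStatus = AsmReadMsr64 (MSR_SEV_STATUS);
  if ((SevStatus & BIT0) == 0) {
    return 0;                            // SEV not enabled for this guest
  }

  CBitPosition = (UINT8)(Ebx & 0x3F);
  return LShiftU64 (1, CBitPosition);    // e.g. bit 47 on current parts
}

(The mask returned above is what the firmware ORs into, or strips from, its own page table entries to make a given range private or shared. Note that clearing it alone does not expose data, and setting it alone does not hide data -- the contents must also be rewritten through the new mapping, which is exactly the point made above about the guest OS not rewriting all of its memory.)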
> The DMA operation inside the guest must be performed on shared pages -- this is mainly because, in the virtualization world, the device DMA needs some assistance from the hypervisor.
>
> The security model provided by SEV ensures that the hypervisor is no longer able to inspect or alter any guest code or data. The guest OS controls what it wants to share with the outside world (in this case, with the hypervisor).
>
> In the software implementation we took the approach of encrypting everything possible, starting early in boot. In this approach, whenever OVMF wants to perform certain DMA operations it allocates a shared page, issues the command, and frees the shared page after the command is completed (in other words, we use a software bounce buffer to complete the DMA operation).
>
> We have implemented an IOMMU driver to provide the following functions:
>
> AllocateBuffer():
> -----------------
> It allocates private pages. Per the UEFI spec, the driver will map the buffer allocated by this routine using BusMasterCommonBuffer, hence we allocate extra stash pages in addition to the requested pages.
>
> Map():
> ------
> If the requested operation is BusMasterRead or BusMasterWrite, then we allocate a shared buffer, and DeviceAddress points to the shared buffer.
>
> If the requested operation is BusMasterCommonBuffer, then we perform in-place decryption of the contents and update the page table to clear the C-bit (basically, make it a shared page).
>
> Unmap():
> --------
> If the operation was BusMasterRead or BusMasterWrite, then we complete the unmapping and free the shared buffer allocated in Map().
>
> If the operation was BusMasterCommonBuffer, then we perform in-place encryption and set the C-bit (basically, make the page private again).
>
> FreeBuffer():
> -------------
> Frees the pages.
>
> Let's run with the below scenario:
>
> 1) the guest marks a page as "shared" and gives its physical address to the HV (e.g. DMA read)
> 2) the hypervisor completes the requested operation
> 3) the hypervisor caches the shared physical address and monitors it for information leaks
> 4) sometime later, if the guest writes data at its "shared" physical address, then the hypervisor can easily read it without the guest's knowledge.
>
> SEV hardware can protect us against attacks where someone tries to inject or alter guest code. As I noted above, any instruction fetch is forced to be private, hence if an attacker writes some code into a shared buffer and points RIP at his/her code, then the instruction fetch will try to decrypt it and get unintelligible opcodes. If a page was marked as "shared", then it is the guest's responsibility to make it "private" after it is done communicating with the hypervisor.
>
> [1] http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf
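(To make the four routines above more concrete, here is a rough sketch of the Map()/Unmap() flow just described. This is not the actual OvmfPkg/IoMmuDxe code: MAP_INFO, the operation names and the overall structure are simplified stand-ins, and the page table edits that actually flip the C-bit are only indicated by comments.)

#include <Uefi.h>
#include <Library/BaseMemoryLib.h>
#include <Library/MemoryAllocationLib.h>

typedef enum {
  OpBusMasterRead,            // the device reads from guest memory
  OpBusMasterWrite,           // the device writes to guest memory
  OpBusMasterCommonBuffer     // bidirectional, long-lived mapping
} DMA_OPERATION;

typedef struct {
  DMA_OPERATION    Operation;
  VOID             *HostBuffer;     // the caller's private (encrypted) buffer
  VOID             *SharedBuffer;   // bounce buffer with the C-bit cleared
  UINTN            NumberOfBytes;
} MAP_INFO;

EFI_STATUS
SketchMap (
  IN OUT MAP_INFO  *Info,
  OUT    VOID      **DeviceAddress
  )
{
  if (Info->Operation == OpBusMasterCommonBuffer) {
    //
    // In-place conversion: preserve the contents while the pages are
    // switched to "shared" (clear the C-bit in the page tables, flush,
    // and rewrite the data through the decrypted mapping -- omitted).
    // No memory is allocated on this path.
    //
    *DeviceAddress = Info->HostBuffer;
    return EFI_SUCCESS;
  }

  //
  // Read/Write operations bounce through a temporary shared buffer.
  //
  Info->SharedBuffer = AllocatePages (EFI_SIZE_TO_PAGES (Info->NumberOfBytes));
  if (Info->SharedBuffer == NULL) {
    return EFI_OUT_OF_RESOURCES;
  }
  // ... clear the C-bit on Info->SharedBuffer's pages here ...

  if (Info->Operation == OpBusMasterRead) {
    // The device will read: it must see the caller's data in the clear.
    CopyMem (Info->SharedBuffer, Info->HostBuffer, Info->NumberOfBytes);
  }
  *DeviceAddress = Info->SharedBuffer;
  return EFI_SUCCESS;
}

EFI_STATUS
SketchUnmap (
  IN OUT MAP_INFO  *Info
  )
{
  if (Info->Operation == OpBusMasterCommonBuffer) {
    //
    // In-place conversion back to "private": set the C-bit and rewrite
    // the contents through the encrypted mapping (omitted). Again, no
    // memory is allocated or freed, which is what makes this path
    // callable from an ExitBootServices() handler.
    //
    return EFI_SUCCESS;
  }

  if (Info->Operation == OpBusMasterWrite) {
    // The device has written: copy the result back to the private buffer.
    CopyMem (Info->HostBuffer, Info->SharedBuffer, Info->NumberOfBytes);
  }
  // ... scrub the bounce buffer and set its C-bit again here ...
  FreePages (Info->SharedBuffer, EFI_SIZE_TO_PAGES (Info->NumberOfBytes));
  return EFI_SUCCESS;
}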
>>> If the C-bit is cleared for a read/write buffer, is the content in the read/write buffer also exposed to the hypervisor?
>>
>> Not immediately. Only when the guest also rewrites the memory through the appropriate virtual address.
>>
>> Basically, in the virtual machine, the C-bit in a PTE that covers a particular guest virtual address (GVA) controls whether a guest write through that GVA will result in memory encryption, and whether a guest read through that GVA will result in memory decryption.
>>
>> The current state of the C-bit doesn't matter for the hypervisor; what matters is the last guest write through a GVA whose PTE had the C-bit set, or clear. If the C-bit was clear when the guest last wrote the area, then the hypervisor can read the data. If the C-bit was set, then the hypervisor can only read garbage.
>>
>>> I mean, if there is a security concern, the concern should apply to both the common buffer and the read/write buffer. Then we have to figure out a way to resolve both buffers.
>>
>> Yes, this is correct. The PciIo.Unmap operation, as currently implemented in OvmfPkg/IoMmuDxe/, handles the encryption/decryption correctly for all operations, but it can only guarantee *not* freeing memory for CommonBuffer. So Unmap()ing CommonBuffer at ExitBootServices is safe, while Unmap()ing Read/Write is not. The encryption would be re-established just fine for Read/Write as well, but we would change the UEFI memory map.
>>
>> In OVMF, we currently manage this problem by making all asynchronous DMA mappings CommonBuffer, even if they could otherwise be Read or Write. We use Read and Write only if the DMA operation completes before the higher-level protocol function returns (so we can immediately Unmap the operation, and the ExitBootServices() handler doesn't have to care).
>>
>>> To be honest, that is exactly my biggest question on this patch: why do we only handle the common buffer specially?
>>
>> For the following reason:
>>
>> - Given that CommonBuffer mappings take a separate AllocateBuffer() / FreeBuffer() call-pair, around Map/Unmap, we can *reasonably ask* PciIo implementors -- beyond what the UEFI spec requires -- to keep *all* dynamic memory management out of Map/Unmap. If they need dynamic memory management, we ask them to do it in AllocateBuffer/FreeBuffer instead. This way Unmap is safe in ExitBootServices handlers.
>>
>> - We cannot *reasonably ask* PciIo implementors to provide the same guarantee for Read and Write mappings, simply because there are no other APIs that could manage the dynamic memory for Map and Unmap -- AllocateBuffer and FreeBuffer are not required for Read and Write mappings. PciIo implementors couldn't agree to our request even if they wanted to. Therefore, Unmapping Read/Write operations is inherently unsafe in EBS handlers.
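(The subject of this thread -- a PCI host controller driver unmapping its long-lived common buffer when ExitBootServices() is signaled -- then looks roughly like the sketch below. This is an illustration of the pattern, not the actual MdeModulePkg patches; HC_PRIVATE_DATA is a hypothetical private data layout, and the controller-halting step is only indicated by a comment. The crucial property is that the notification function calls nothing but Unmap(), so it neither allocates nor frees memory.)

#include <Uefi.h>
#include <Library/UefiBootServicesTableLib.h>
#include <Protocol/PciIo.h>
#include <Guid/EventGroup.h>

typedef struct {
  EFI_PCI_IO_PROTOCOL  *PciIo;
  VOID                 *CommonBufferMapping;   // from PciIo->Map (...CommonBuffer...)
  EFI_EVENT            ExitBootEvent;
} HC_PRIVATE_DATA;

STATIC
VOID
EFIAPI
HcExitBootServices (
  IN EFI_EVENT  Event,
  IN VOID       *Context
  )
{
  HC_PRIVATE_DATA  *Private;

  Private = Context;

  //
  // 1. Halt the controller's DMA engines here (driver specific, omitted),
  //    so that no in-flight transfer can target the buffer once it has
  //    been re-encrypted.
  //
  // 2. Unmap the common buffer. The pages themselves are *not* freed;
  //    the call merely re-establishes encryption (and tears down any
  //    IOMMU translation). This relies on the extra guarantee argued
  //    for above: Unmap() of a CommonBuffer operation must not touch
  //    the UEFI memory map.
  //
  Private->PciIo->Unmap (Private->PciIo, Private->CommonBufferMapping);
}

EFI_STATUS
HcRegisterExitBootServicesHandler (
  IN HC_PRIVATE_DATA  *Private
  )
{
  return gBS->CreateEventEx (
                EVT_NOTIFY_SIGNAL,
                TPL_NOTIFY,
                HcExitBootServices,
                Private,
                &gEfiEventExitBootServicesGuid,
                &Private->ExitBootEvent
                );
}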
>>>> NOTE: The device should still be in the halt state, because the device is also controlled by the OS.
>>>
>>> This is the key point -- the device is *not* controlled by the guest OS, in the worst case.
>>>
>>> The virtual device is ultimately implemented by the hypervisor. We don't expect the virtual device (= the hypervisor) to mis-behave on purpose. However, if the hypervisor is compromised by an external attacker (for example over the network, or via privilege escalation from another guest), then all guest RAM that has not been encrypted up to that point becomes visible to the attacker. In other words, the hypervisor ("the device") becomes retro-actively malicious.
>>>
>>> [Jiewen] If that is the threat model, I have a question on the exposure:
>>>
>>> 1) If the concern is for existing data, it means all DMA data already written. However, the DMA data is already consumed or produced by the virtual device or the hypervisor. If the hypervisor is attacked, it already has all the data content.
>>
>> Maybe, maybe not. The data may have been sent to a different host over the network, and wiped from memory.
>>
>> Or, the data may have been written to a disk image that is separately encrypted by the host OS, and has been detached (unplugged) from the guest, and also unmounted from the host, with the disk key securely wiped from host memory.
>>
>> We can also imagine the reverse direction. Assume that the user of the virtual machine goes into the UEFI shell in OVMF, starts the EDIT program, and types some secret information into a text file on the ESP. Further assume that the disk image is encrypted on the host OS. It is conceivable that fragments of the secret could remain stuck in the AHCI (disk) and USB (keyboard) DMA buffers.
>>
>> Maybe you think that these are "made up" examples, and I agree, I just made them up. The point is, it is not my place to discount possible attack vectors (I haven't designed SEV). Attackers will be limited by their *own* imaginations only, not by mine :)
>>
>>> 2) If the concern is for future data, it means all user data to be written. However, the C-bit already prevents that.
>>
>> As far as I understand SEV, future data is out of scope. The goal is to prevent *retroactive* guest data leaks (= whatever is currently in host OS memory) if an attacker compromises an otherwise non-malicious hypervisor.
>>
>> If an attacker compromises a virtualization host, then they can just sit silent and watch data flow into and out of guests from that point onward, because they can then monitor all DMA (which always happens in clear-text) in real time.
>>
>>> Maybe I do not fully understand the threat model defined. If you can explain more, that would be great.
>>
>> Well, I tried to explain *my* understanding of SEV :) I hope Brijesh will correct me if I'm wrong.
>>
>>> The point of SEV is to keep as much guest data encrypted at all times as possible, so that *whenever* the hypervisor is compromised by an attacker, the guest memory -- which the attacker can inevitably dump from the host side -- remains unintelligible to the attacker.
>>>
>>> [Jiewen] OK. If this is a security question, I suggest we define a clear threat model first. Or what we are doing might be unnecessary or insufficient.
>>
>> I agree.
>>
>> As I said above, my operating principle has been to re-encrypt all DMA buffers as soon as possible. For long-standing DMA buffers, re-encrypt them at ExitBootServices at the latest.
>>
>>> [Jiewen] For "require that Unmap() work for CommonBuffer operations without releasing memory", I would hold my opinion until it is documented clearly in the UEFI spec.
>>
>> You are right, of course.
>>
>> It's just that we cannot avoid exploring ways, for this feature, that currently fall outside of the spec. Whatever we do here will be outside of the spec, one way or another. I suggested what I thought could be a reasonable extension to the spec, one that could be implemented by PciIo implementors even before the spec is modified.
>>
>> edk2's PciIo.Unmap already behaves like this, without breaking the spec.
>>
>> If there's a better way -- for example modifying CoreExitBootServices() in edk2, to signal IOMMU drivers separately, *after* notifying other ExitBootServices() handlers --, that might work as well.
>>
>>> My current concern is: First, this sentence is NOT defined in the UEFI specification.
>>
>> Correct.
>>
>>> Second, it might break an existing implementation or existing feature, such as tracing.
>>
>> Correct.
>>
>>> Till now, I did not see any restriction in the UEFI spec saying: in this function, you are not allowed to call memory services. The only restrictions are:
>>>
>>> 1) TPL restriction, where memory services must be <= TPL_NOTIFY.
>>> 2) AP restriction, where no UEFI service/protocol is allowed on APs.
>>
>> I agree.
>>
>>> I'm just trying to operate with the APIs currently defined by the UEFI spec, and these assumptions were the best I could come up with.
>>>
>>> [Jiewen] The PCI device driver must follow, and *only* follow, the UEFI spec. Especially, an IHV option ROM should not consume any private API.
>>
>> I disagree about "only follow". If there are additional requirements that can be satisfied without breaking the spec, driver implementors can choose to conform to both sets of requirements.
>>
>> For example (if I understand correctly), Microsoft / Windows present a bunch of requirements *beyond* the UEFI spec, for both platform and add-on card firmware. Most vendors conform :)
>>
>>>> [Jiewen] I am not sure who will control "When control is transferred to the OS, all guest memory should be encrypted again, even those areas that were once used as CommonBuffers." For the SEV case, I think it should be controlled by the OS, because it is just CR3.
>>>
>>> If it was indeed just CR3, then I would fully agree with you.
>>>
>>> But -- to my understanding -- ensuring that the memory is actually encrypted requires that the guest *write* to the memory through a virtual address whose PTE has the C-bit set.
>>>
>>> And I don't think the guest OS can be expected to rewrite all of its memory at launch. :(
>>>
>>> [Jiewen] That is fine. I suggest we get clear on the threat model as the first step. The threat model for SEV usage in OVMF.
>>>
>>> I am sorry if that has already been discussed before. I might have missed the conversation.
>>
>> No problem; it's always good to re-investigate our assumptions and operating principles.
>>
>>> If you or any SEV owner can share the insight, that would be great. See my question above, "If that is the threat model, I have a question on the exposure: ..."
>>
>> I shared *my* impressions of the threat model (which are somewhat unclear, admittedly, and thus could make me overly cautious).
>>
>> I hope Brijesh can extend and/or correct my description.
>>
>>>> So apparently the default behaviors of the VT-d IOMMU and the SEV IOMMU are opposites -- VT-d permits all DMA unless configured otherwise, while SEV forbids all DMA unless configured otherwise.
>>>>
>>>> [Jiewen] I do not think so. The default behavior of the VT-d IOMMU is to disable all DMA access. I set up the translation tables, but mark all entries as NOT-PRESENT. I grant DMA access only if there is a specific request by a device.
>>>>
>>>> I open up DMA access at ExitBootServices just to make sure it is compatible with existing OSes, which might not have knowledge of the IOMMU. I will provide a patch to make it a policy very soon. As such, the VT-d IOMMU driver may turn off the IOMMU, or keep it enabled, at ExitBootServices. An IOMMU-aware OS may take over the IOMMU directly and reprogram it.
>>>
>>> Thanks for the clarification.
>>>
>>> But, again, will this not lead to the possibility that the VT-d IOMMU's ExitBootServices() handler disables all DMA *before* the PCI drivers get a chance to run their own ExitBootServices() handlers, disabling the individual devices?
>>>
>>> [Jiewen] Sorry for the confusion. Let me explain: No, VT-d never disables *all* DMA buffers at ExitBootServices.
>>>
>>> "disable VTd" means to "disable IOMMU protection" and "allow all DMA".
>>> "Keep VTd enabled" means to "keep IOMMU protection enabled" and "only allow the DMA from Map() requests".
>>
>> Okay, but this interpretation was exactly what I thought of first (see above): "VT-d permits all DMA unless configured otherwise". You answered that it wasn't the case.
>>
>> So basically, if I understand it correctly now, at ExitBootServices the VT-d IOMMU driver opens up all RAM for DMA access. Is that correct?
>>
>> That is equivalent to decrypting all memory under SEV, and is the exact opposite of what we want. (As I understand it.)
>>
>>> If that happens, then a series of IOMMU faults could be generated, which I described above. (I.e., such IOMMU faults could result, at least in the case of SEV, in garbage being written to disk, via queued commands.)
>>>
>>> [Jiewen] You are right. That would not happen. The IOMMU driver should not bring any compatibility problem for the PCI driver, iff the PCI driver follows the UEFI specification.
>>
>> Agreed.
>>
>>> Now, to finish up, here's an idea I just had.
>>>
>>> Are we allowed to call gBS->SignalEvent() in an ExitBootServices() notification function?
>>>
>>> If we are, then we could do the following:
>>>
>>> * PciIo.Unmap() and friends would work as always (no restrictions on dynamic memory allocation or freeing, for any kind of DMA operation).
>>>
>>> * PCI drivers would not be expected to call PciIo.Unmap() in their ExitBootServices() handlers.
>>>
>>> * The IOMMU driver would have an ExitBootServices() notification function, to be enqueued at the TPL_CALLBACK or TPL_NOTIFY level (doesn't matter which).
>>>
>>> * In this notification function, the IOMMU driver would signal *another* event (a private one). The notification function for this event would be enqueued strictly at the TPL_CALLBACK level.
>>>
>>> * The notification function for the second event (private to the IOMMU) would iterate over all existing mappings, and unmap them, without allocating or freeing memory.
>>>
>>> The key point is that by signaling the second event, the IOMMU driver could order its own cleanup code after all other ExitBootServices() callbacks. (Assuming, at least, that no other ExitBootServices() callback employs the same trick to defer itself!) Therefore, by the time the second callback runs, all PCI devices have been halted, and it is safe to tear down the translations.
>>>
>>> The ordering would be ensured by the following:
>>>
>>> - The UEFI spec says under CreateEventEx(), "All events are guaranteed to be signaled before the first notification action is taken." This means that, by the time the IOMMU's ExitBootServices() handler is entered, all other ExitBootServices() handlers have been *queued* at least, at TPL_CALLBACK or TPL_NOTIFY.
>>>
>>> - Therefore, when we signal our second callback from there, for TPL_CALLBACK, the second callback will be queued at the end of the TPL_CALLBACK queue.
>>>
>>> - CreateEventEx() also says that EFI_EVENT_GROUP_EXIT_BOOT_SERVICES "is functionally equivalent to the EVT_SIGNAL_EXIT_BOOT_SERVICES flag for the Type argument of CreateEvent." So it wouldn't matter if a driver used CreateEvent() or CreateEventEx() for setting up its own handler; the handler would be queued just the same.
>>>
>>> I think this is ugly and brittle, but perhaps the only way to clean up *all* translations safely, with regard to PciIo.Unmap() + ExitBootServices(), without updating the UEFI spec.
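(A rough sketch of the two-event trick just described, as it might look inside an IOMMU driver. This is illustrative only, not an actual edk2 driver; the walk over the driver's internal mapping list is elided, and whether signaling an event from an ExitBootServices() notification function is permitted at all is exactly the open question discussed below.)

#include <Uefi.h>
#include <Library/UefiBootServicesTableLib.h>
#include <Guid/EventGroup.h>

STATIC EFI_EVENT  mExitBootEvent;
STATIC EFI_EVENT  mDeferredCleanupEvent;

STATIC
VOID
EFIAPI
IoMmuDeferredCleanup (
  IN EFI_EVENT  Event,
  IN VOID       *Context
  )
{
  //
  // This runs only after the ExitBootServices() handlers that were
  // already queued when IoMmuExitBootServices() signaled this event --
  // by then the PCI drivers should have halted their controllers.
  // Walk every live mapping and unmap (re-encrypt) it here, without
  // allocating or freeing memory. (Mapping-list walk elided.)
  //
}

STATIC
VOID
EFIAPI
IoMmuExitBootServices (
  IN EFI_EVENT  Event,
  IN VOID       *Context
  )
{
  //
  // Per the UEFI spec, all events in the group are signaled before the
  // first notification function runs, so by the time we get here every
  // other ExitBootServices() handler is already queued. Signaling the
  // private TPL_CALLBACK event appends the real cleanup after them.
  //
  gBS->SignalEvent (mDeferredCleanupEvent);
}

EFI_STATUS
IoMmuRegisterCleanup (
  VOID
  )
{
  EFI_STATUS  Status;

  Status = gBS->CreateEvent (
                  EVT_NOTIFY_SIGNAL,
                  TPL_CALLBACK,
                  IoMmuDeferredCleanup,
                  NULL,
                  &mDeferredCleanupEvent
                  );
  if (EFI_ERROR (Status)) {
    return Status;
  }

  return gBS->CreateEventEx (
                EVT_NOTIFY_SIGNAL,
                TPL_CALLBACK,
                IoMmuExitBootServices,
                NULL,
                &gEfiEventExitBootServicesGuid,
                &mExitBootEvent
                );
}

(The only ordering guarantee this relies on is the CreateEventEx() sentence quoted above; no new UEFI interfaces are required.)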
>>> What do you think?
>>>
>>> [Jiewen] If the order is correct, and all PCI device drivers are halted at ExitBootServices, that works. We do not need to update the PCI drivers, and we do not need to update the UEFI spec. We only need to update the IOMMU driver that is concerned, based upon the threat model. That is probably the best solution. :-)
>>
>> I'm very glad to hear that you like the idea.
>>
>> However, it depends on whether we are permitted, by the UEFI spec, to signal another event in an ExitBootServices() notification function.
>>
>> Are we permitted to do that?
>>
>> Does the UEFI spec guarantee that the notification function for the *second* event will be queued just like it would be under "normal" circumstances?
>>
>> (I know we must not allocate or free memory in the notification function of the *second* event either; I just want to know if the second event's handler is *queued* like it would normally be.)
>>
>>> I assume you want to handle both the common buffer and the read/write buffer, right?
>>
>> Yes. Under this idea, all kinds of operations would be cleaned up.
>>
>>> And if possible, I still have interest in getting clear on the threat model for SEV in OVMF. If you or any SEV owner can share ...
>>
>> Absolutely. Please see above.
>>
>> Thank you!
>> Laszlo
>
> _______________________________________________
> edk2-devel mailing list
> edk2-devel@lists.01.org
> https://lists.01.org/mailman/listinfo/edk2-devel