From: Laszlo Ersek
To: "Yao, Jiewen" , "Zeng, Star" , edk2-devel-01
Cc: "Dong, Eric" , Brijesh Singh
Subject: Re: [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices()
Date: Tue, 5 Sep 2017 19:57:24 +0200
Message-ID: <12d71f32-9dcf-4f6e-b033-d5f82104caca@redhat.com>
In-Reply-To: <74D8A39837DF1E4DA445A8C0B3885C503A9A7F10@shsmsx102.ccr.corp.intel.com>
References: <20170903195449.30261-1-lersek@redhat.com>
 <0C09AFA07DD0434D9E2A0C6AEB0483103B93A125@shsmsx102.ccr.corp.intel.com>
 <4b24b1eb-362f-3b46-97e2-bdfda53f40c9@redhat.com>
 <74D8A39837DF1E4DA445A8C0B3885C503A9A79BD@shsmsx102.ccr.corp.intel.com>
 <5f1fdc84-5824-bee2-5a1a-fbd67adf5443@redhat.com>
 <74D8A39837DF1E4DA445A8C0B3885C503A9A7F10@shsmsx102.ccr.corp.intel.com>

On 09/05/17 15:44, Yao, Jiewen wrote:
> Good discussion. Comment in line.
>
> From: Laszlo Ersek [mailto:lersek@redhat.com]
> Sent: Tuesday, September 5, 2017 5:16 PM
> To: Yao, Jiewen ; Zeng, Star ; edk2-devel-01
> Cc: Dong, Eric
> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices()
>
> On 09/05/17 04:18, Yao, Jiewen wrote:
>> Hi Laszlo,
>> Thank you very much for raising this DMA topic.
>>
>> I had a chat with Star. We have a couple of questions, from high level
>> to low level. Hope you can clarify:
>>
>>
>> 1) About disabling DMA
>>
>> As you mentioned, in UEFI_Driver_Writer_Guide_V1.0.1_120308.pdf, 7.7
>> "Adding the Exit Boot Services feature":
>> =========================
>> Examples from the EDK II that use this feature are the PCI device
>> drivers for USB Host Controllers. Some USB Host Controllers are PCI
>> Bus Masters that continuously access a memory buffer to poll for
>> operation requests. Access to this memory buffer by a USB Host
>> Controller may be required to boot an operating system, but this
>> activity must be terminated when the OS calls ExitBootServices(). The
>> typical action in the Exit Boot Services Event for these types of
>> drivers is to disable the PCI bus master and place the USB Host
>> Controller into a halted state.
>> =========================
>>
>> I believe the fundamental requirement is to "disable DMA".
>> As such, the simplest action is to clear the PCI Bus Master Enable bit
>> in the command register.
>>
>> Do you think clearing BME is enough?
>> Is there any special reason that we must unmap the DMA buffer?
>
> Modifying the device state is enough to ask the device *nicely* to stop
> doing DMA.
>
> But, I think, if we are on an IOMMU-protected system, where part of
> PciIo.Map() is to add a translation (= system memory access permission)
> to the IOMMU, then we should also revoke this permission explicitly,
> *after* we asked the device nicely to stop doing DMA.
>
> Basically, ask the device nicely first, and then enforce the protection.
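For illustration, "asking the device nicely" plus clearing BME in a PCI
driver's ExitBootServices() notification function would look roughly like
the sketch below. MY_DEVICE and MyDeviceHaltController() are made-up
names; only the PciIo->Attributes() call pattern is the standard
EFI_PCI_IO_PROTOCOL way to clear the Bus Master Enable bit.

  VOID
  EFIAPI
  MyDriverExitBootServices (
    IN EFI_EVENT  Event,
    IN VOID       *Context
    )
  {
    MY_DEVICE  *Dev;

    Dev = Context;

    //
    // Ask the device nicely: program the controller to a halted state so
    // that it stops polling / fetching requests from system memory.
    //
    MyDeviceHaltController (Dev);

    //
    // Then clear the Bus Master Enable bit in the PCI command register.
    //
    Dev->PciIo->Attributes (
                  Dev->PciIo,
                  EfiPciIoAttributeOperationDisable,
                  EFI_PCI_IO_ATTRIBUTE_BUS_MASTER,
                  NULL
                  );
  }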
> ... BTW, I mentioned "IOMMU faults" earlier, and I have some further
> thoughts on them. This is about tearing down IOMMU translations *first*,
> in the IOMMU driver's ExitBootServices() handler, *before* the PCI
> drivers get a chance -- in *their* ExitBootServices() handlers -- to ask
> the PCI devices nicely to stop doing DMA.
>
> I think a quite problematic failure mode is possible here. Assume that a
> disk controller has some write commands queued (which are mapped with
> Bus Master Read or Bus Master CommonBuffer operations). If the
> ExitBootServices() handler for the IOMMU driver runs first, then the
> device could act upon these queued commands with missing IOMMU
> translations, before the device's own PCI driver asked the device to
> shut down DMA. In some cases, I guess this could result in those queued
> write commands simply failing, without ill effects. However, depending
> on the IOMMU implementation, the write commands might not fail
> explicitly, but read garbage from system memory, and write garbage to
> disk blocks. That would be bad.
>
> I think this is exactly what could happen with the SEV IOMMU we have.
> The memory would be first re-encrypted (this is equivalent to revoking a
> translation), and then the write commands (queued in the emulated AHCI
> controller, in the hypervisor) would read garbage from guest memory, and
> write garbage to the virtual disk.
>
> [Jiewen] I am not clear on how this happens with SEV here.
> Below is my thought; please correct me if I am wrong.
>
> First, the fundamental requirement is to disable BME at
> ExitBootServices(), or to make sure the controller is in a halted state.
> If a device driver fails to meet this requirement, that is a bug we need
> to fix first.
>
> Now, let's assume every device driver leaves its device in a halted
> state at ExitBootServices() -- that is my understanding for all of the
> following discussion.

OK.

> I checked OVMF and I do not see any code in the IOMMU driver that
> registers an exit boot services event, which means the DMA buffer is
> still decrypted.

More precisely, the IOMMU driver does not take it upon itself to
re-encrypt DMA buffers at ExitBootServices(). Instead it relies on the
PCI drivers to Unmap() CommonBuffers at ExitBootServices().
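That is, in each affected driver, the ExitBootServices() handler sketched
earlier would additionally do something like the following. This is again
only a sketch, not the literal patch code; "AsyncMapping" stands for
whatever hypothetical field holds the mapping token that the CommonBuffer
Map() call returned earlier.

  //
  // Revoke the device's system memory access after the device has been
  // halted. Nothing is allocated or freed here, and FreeBuffer() is
  // deliberately NOT called in ExitBootServices() context.
  //
  if (Dev->AsyncMapping != NULL) {
    Dev->PciIo->Unmap (Dev->PciIo, Dev->AsyncMapping);
    Dev->AsyncMapping = NULL;
  }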
> If that is the case, the DMA transfer before BME disabling is safe (the
> DMA buffer is decrypted), and there will be no DMA transfer any more at
> ExitBootServices() (BME disabled).
> Then how does the write command queued in the emulated AHCI controller
> start reading garbage?

It is not happening right now. When I wrote that, I was just speculating
about possible consequences; i.e., what *could* happen if the IOMMU
driver itself re-encrypted memory at ExitBootServices(). *Then* the
garbage case could happen.

In other words, the case I described does not happen right now; I only
described it as a counter-argument against *adding* an ExitBootServices()
handler to the IOMMU driver.

Under OvmfPkg, the IOMMU behavior for SEV, and the PCI drivers for the
virtio devices, are fully in sync. The PCI drivers *alone* are
responsible for unmapping CommonBuffers at ExitBootServices(), and so the
IOMMU driver doesn't do it.

The question is whether the same distribution of responsibilities applies
to PCI drivers that live outside of OvmfPkg, such as those in
MdeModulePkg. The current patch set suggests "yes".

> The SEV IOMMU driver only modifies the page table.

Not exactly. It modifies page table entries, and then re-writes the
memory covered by those page table entries. Simply flipping the C-bit in
page table entries does not re-encrypt memory contents; it only sets up
the next read and/or write that happens through that page mapping for a
"decrypted view" or an "encrypted view", from the virtualization host
side.

> Then after ExitBootServices(), the OS will take control of CR3 and set
> the correct encryption bit.

This is true, the guest OS will set up new page tables, and in those
PTEs, the C-bit ("encrypt") will likely be set by default. However, the
guest OS will not *rewrite* all of the memory, with the C-bit set. This
means that any memory that the firmware didn't actually re-encrypt (in
the IOMMU driver) will still expose stale data to the hypervisor.

> NOTE: The device should still be in a halted state, because the device
> is also controlled by the OS.

This is the key point -- the device is *not* controlled by the guest OS,
in the worst case. The virtual device is ultimately implemented by the
hypervisor. We don't expect the virtual device (= the hypervisor) to
misbehave on purpose. However, if the hypervisor is compromised by an
external attacker (for example over the network, or via privilege
escalation from another guest), then all guest RAM that has not been
encrypted up to that point becomes visible to the attacker. In other
words, the hypervisor ("the device") becomes retroactively malicious.

The point of SEV is to keep as much guest data encrypted at all times as
possible, so that *whenever* the hypervisor is compromised by an
attacker, the guest memory -- which the attacker can inevitably dump from
the host side -- remains unintelligible to the attacker.

>> 2) About the common buffer
>>
>> If there is a special requirement that we must unmap the DMA buffer,
>> why do we only unmap the common buffer?
>>
>> I understand your statement that unmapping a read buffer or write
>> buffer may cause a free memory action. But that is an implementation
>> choice.
>
> I believe I disagree with this (i.e., I think it is not an
> implementation choice).
>
> In the general case, we can assume that PciIo.Map() allocates a bounce
> buffer internally, when a Read or Write action is requested.
>
> [Jiewen] When I say implementation choice, I mean Map() *may* allocate
> a bounce buffer, or it may just use the same address, if the platform is
> capable of using DRAM as a DMA buffer directly.
I have two counter-arguments:

(a) As long as we don't want to reimplement PciIo.Map() from scratch, the
current layering of protocols in edk2 implies that even if the PCI device
driver sets DUAL_ADDRESS_CYCLE, this flag can be cleared on the way to
the IOMMU protocol, in PciHostBridgeDxe, if PciHostBridgeLib says "no DMA
above 4GB".

OvmfPkg's PciHostBridgeLib currently says "no DMA above 4GB", simply
because that's always been the case, and I'm not sure if we can lift the
limitation without regressions.

(b) The more important counter-argument is that Map() and Unmap() need
auxiliary memory (a temporary buffer) for actually encrypting and
decrypting memory, so that the virtual device (implemented by the
hypervisor) can read it or write it (temporarily).

As I wrote above, encrypting or decrypting the memory contents works by
writing the original data through a new page mapping that has the desired
C-bit setting.

In a guest operating system, in-place encryption is possible by having
two virtual mappings to the same physical address range, one with the
C-bit set, and another with the C-bit clear. For encrypting, the guest OS
reads a cache-line sized chunk of data through the "C-bit clear" mapping,
and writes it back through the "C-bit set" mapping. For decrypting, it
does the inverse.

In UEFI, we don't have two virtual page mappings for the same physical
address; we only have a 1:1 mapping, on which we can flip the C-bit. This
means that, for decryption for example, we have to copy the data from its
original location (currently encrypted) to a "stash" (which is always
encrypted), modify the PTEs for the original location (clear the C-bit,
flush the TLB), then finally copy the data back from the "stash" (still
encrypted) to the original location (which will effectively decrypt it,
for the hypervisor to read).

This "stash" requires room. Currently the "stash" has an equal number of
pages to the actual region being mapped. For CommonBuffer operations, the
stash is allocated in AllocateBuffer(), together with the normal buffer's
allocation. It is freed similarly in FreeBuffer(). For Read/Write BMDMA
operations, for which the PCI driver only calls Map()/Unmap(), we cannot
avoid allocating the stash in Map(), and freeing it in Unmap(). We are
practically forced to allocate a bounce buffer not due to address width
limitations (32 vs 64), but due to C-bit status.

Now, earlier I had an idea to add a 1-page sized global array variable to
the IOMMU driver, and do the encryption/decryption page by page. In other
words, "unroll" the PTE management and the copying into page-sized
chunks. This would eliminate the need to allocate a "stash" dynamically.
However, the drawback of this approach is that the PTEs must be toggled
one by one even for a larger region. It could cause undesirably high page
table splitting and bad performance.

So, the SEV IOMMU could only eliminate dynamic memory allocation and
deallocation for Read/Write DMA operations if both conditions were
changed:

- if we knew for sure that 64-bit DMA was safe to advertise in
  PciHostBridgeLib,

- if we replaced the dynamically allocated stash, and the "batch" PTE
  management + copying, with a single-page static buffer, and "unrolled"
  PTE management + copying.

More generally, 64-bit clean DMA might simply not be possible on a given
hardware platform, and on that platform, PciIo.Map() could not avoid
allocating bounce buffers for Read/Write DMA operations. Thus, we
shouldn't require a PCI driver to call Unmap() at ExitBootServices().
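To spell out the stash mechanics from (b) for the decryption direction,
here is pseudocode of the idea only -- all helper names are made up, and
the real IoMmuDxe code differs in the details:

  //
  // In-place decryption with a stash. "Original" is the region passed to
  // Map() for Bus Master Read/Write; NumPages covers it.
  //
  Stash = AllocateEncryptedStash (NumPages);   // forced allocation in Map()
  CopyMem (Stash, Original, EFI_PAGES_TO_SIZE (NumPages));

  //
  // Clear the C-bit in the 1:1 page tables covering "Original" and flush
  // the TLB; CPU accesses to "Original" now use the "decrypted view".
  //
  ClearCBitAndFlushTlb (Original, NumPages);

  //
  // Writing the preserved data back through the C-bit-clear mapping is
  // what actually stores it unencrypted, for the hypervisor to read.
  //
  CopyMem (Original, Stash, EFI_PAGES_TO_SIZE (NumPages));

  //
  // Unmap() later performs the inverse copy + C-bit flip, and must free
  // the stash -- exactly the allocation/release pattern that an
  // ExitBootServices() context cannot tolerate.
  //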
> In turn, PciIo.Unmap() has to release the same buffer, internally.
> PciIo.Unmap() does not know if it is called from an ExitBootServices()
> notification function, or from a normal context. It cannot decide if it
> is allowed to release memory, or not. There's no additional parameter to
> provide this hint either. If PciIo.Unmap() never releases buffers, then
> ExitBootServices() will be happy, but we're going to leak memory.
>
> Now, PciIo.Unmap() could move the bounce buffer to some internal "free
> list" instead, rather than freeing it. And, the next PciIo.Map() call
> could try to reuse a buffer from the "free list". But this is difficult
> to do, because the buffers may have any size, and recycling them would
> require an internal, general "page pool", in parallel to the Boot
> Services page pool (AllocatePages, FreePages).
>
> If PciIo.Unmap() could be informed that it is running in
> ExitBootServices() notification context -- called from PCI drivers --,
> then unmapping Read and Write operations would be simple.
>
>> If we have a strong requirement to unmap the DMA buffer, we should
>> achieve that.
>
> I totally agree that we *should* achieve that, but I don't see how the
> current PciIo.Map() and PciIo.Unmap() APIs make it possible, for Read
> and Write operations.
>
> For CommonBuffer operations, it is possible, because AllocateBuffer()
> and FreeBuffer() are separate actions, and an ExitBootServices()
> notification function can *avoid* calling FreeBuffer(). (But only for
> CommonBuffer operations.) PciIo.Unmap() cannot be told *not* to release
> a bounce buffer for Read/Write.
>
> [Jiewen] Again, it is an implementation choice. It is possible that a
> host bridge allocates a common buffer without calling any Allocate().
> But another implementation may also allocate an internal data structure
> to record every Map() action, including common buffer. (For event
> logging, or tracing purposes, etc.)

True. My point is that

- we could reasonably require that Unmap() work for CommonBuffer
  operations without releasing memory,

- but we couldn't reasonably require the same for Read and Write
  operations.

This is because, for CommonBuffer, a platform that right now allocates
extra stuff in Map() could move the same allocations into
AllocateBuffer(). Similarly, it could move the current de-allocations
from Unmap() into FreeBuffer(). The same is not possible for Read/Write,
so we shouldn't require it there.

> I do not think Free() should behave differently for a common buffer
> versus a read/write buffer.
>
> I am a little worried that we have too many assumptions here, e.g. that
> Map() for a common buffer never involves Allocate(), even for a
> driver-internal data structure, and that Map() for a read/write buffer
> always involves Allocate(), even when there is no need to allocate
> anything.
>
> With those assumptions, we purposely avoid Unmap() for read/write
> buffers, but force Unmap() for common buffers.

Exactly.

> I am not sure if that assumption is correct.

I'm just trying to operate with the APIs currently defined by the UEFI
spec, and these assumptions were the best I could come up with.

For example, if PciIo.Unmap() took one more parameter:

  IN BOOLEAN CalledAtExitBootServices

then all this would be solved at once. Namely, PCI drivers would be
required to call Unmap() for *all* DMA operations (read, write,
commonbuffer) in their ExitBootServices() notification functions, with
the parameter set to TRUE. All other contexts should pass FALSE.

In turn, CalledAtExitBootServices==TRUE would require PciIo.Unmap() to
work exactly as otherwise, except that all memory allocated earlier with
gBS->AllocatePool() and gBS->AllocatePages() would have to be leaked on
purpose. This is all.
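Spelled out as a protocol member -- purely hypothetical, this does *not*
exist in the UEFI spec; it is only the API change I wish existed:

  //
  // Hypothetical variant of EFI_PCI_IO_PROTOCOL.Unmap(). With the extra
  // BOOLEAN, an implementation could skip gBS->FreePool()/FreePages()
  // when invoked from an ExitBootServices() notification function, and
  // free normally otherwise.
  //
  typedef
  EFI_STATUS
  (EFIAPI *EFI_PCI_IO_PROTOCOL_UNMAP_EX)(
    IN EFI_PCI_IO_PROTOCOL  *This,
    IN VOID                 *Mapping,
    IN BOOLEAN              CalledAtExitBootServices  // hypothetical hint
    );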
Given that this parameter does not exist, I'm trying to find other ways
to:

- perform the unmapping (re-encryption) at ExitBootServices(),

- without releasing memory,

- and not in the IOMMU driver (because that could remove the translation
  before the PCI driver's own callback could disable the device, and that
  would be wrong).

>> 3) About interaction with the IOMMU
>>
>> I am not worried about IOMMU interaction.
>> I believe a proper IOMMU implementation should not block any action
>> taken by a PCI device driver.
>> The current IntelVTdDxe driver disables the VT-d engine at
>> ExitBootServices(), which means DMA access is no longer managed by the
>> VT-d engine. It is NOT related to disabling the DMA buffer.
>
> Ah, I understand!
>
> So, on one hand, that is totally safe, for PCI device drivers. It means
> all pending DMA can get through, until the PCI drivers ask the devices
> (nicely) to stop doing DMA.
>
> [Jiewen] Yes. I think any IOMMU implementation should take that
> into consideration.
>
> On the other hand, it is the opposite of the security goal of SEV. With
> SEV, the default state is "all DMA forbidden" (= all guest memory
> encrypted). This is also the status that we'd like to achieve when
> ExitBootServices() completes. When control is transferred to the OS, all
> guest memory should be encrypted again, even those areas that were once
> used as CommonBuffers.
>
> [Jiewen] I am not sure who will control "When control is transferred to
> the OS, all guest memory should be encrypted again, even those areas
> that were once used as CommonBuffers."
> For the SEV case, I think it should be controlled by the OS, because it
> is just CR3.

If it were indeed just CR3, then I would fully agree with you. But -- to
my understanding -- ensuring that the memory is actually encrypted
requires that the guest *write* to the memory through a virtual address
whose PTE has the C-bit set. And I don't think the guest OS can be
expected to rewrite all of its memory at launch. :(

> If so, we should not meet any problem, because we already disable BME
> in the device driver's ExitBootServices() event.

I agree, but installing new page tables is not enough.

> So apparently the default behaviors of the VT-d IOMMU and the SEV IOMMU
> are opposites -- VT-d permits all DMA unless configured otherwise, while
> SEV forbids all DMA unless configured otherwise.
>
> [Jiewen] I do not think so. The default behavior of the VT-d IOMMU is to
> disable all DMA access.
> I set up the translation table, but mark all entries NOT-PRESENT.
> I grant DMA access only if there is a specific request by a device.
>
> I open DMA access at ExitBootServices() just to make sure it is
> compatible with existing OSes, which might not have knowledge of the
> IOMMU.
> I will provide a patch to make this a policy very soon. As such, the
> VT-d IOMMU driver may turn off the IOMMU, or keep it enabled, at
> ExitBootServices().
> An IOMMU-aware OS may take over the IOMMU directly and reprogram it.

Thanks for the clarification. But, again, will this not lead to the
possibility that the VT-d IOMMU's ExitBootServices() handler disables all
DMA *before* the PCI drivers get a chance to run their own
ExitBootServices() handlers, disabling the individual devices?

If that happens, then a series of IOMMU faults could be generated, which
I described above. (I.e., such IOMMU faults could result, at least in the
case of SEV, in garbage being written to disk, via queued commands.)

>> 4) About the drivers you modified
>>
>> I believe we have some other PCI device drivers (NVMe/SDMMC/UFS),
>> besides the ones you modified in this patch (ATA/UHCI/EHCI/XHCI).
>> If we need to take this approach, I assume they also need updates,
>> right?
>
> Yes, they should be updated similarly.
>
> SDMMC and UFS are not included in OVMF, so I couldn't test their
> behavior.
>
> NVMe is included in OVMF, but it is used very rarely in practice (we can
> call it a corner case), so I thought I would first submit patches for
> the four main drivers (UHCI, EHCI, XHCI, AHCI) that are frequently used
> with OVMF and QEMU. Tracking down the CommonBuffer lifecycles was
> difficult :)
>
> [Jiewen] Understood. That is OK. We can handle that in a separate patch.

Thanks!

Now, to finish up, here's an idea I just had. Are we allowed to call
gBS->SignalEvent() in an ExitBootServices() notification function? If we
are, then we could do the following:

* PciIo.Unmap() and friends would work as always (no restrictions on
  dynamic memory allocation or freeing, for any kind of DMA operation).

* PCI drivers would not be expected to call PciIo.Unmap() in their
  ExitBootServices() handlers.

* The IOMMU driver would have an ExitBootServices() notification
  function, to be enqueued at the TPL_CALLBACK or TPL_NOTIFY level
  (doesn't matter which).

* In this notification function, the IOMMU driver would signal *another*
  event (a private one). The notification function for this event would
  be enqueued strictly at the TPL_CALLBACK level.

* The notification function for the second event (private to the IOMMU
  driver) would iterate over all existing mappings, and unmap them,
  without allocating or freeing memory.

The key point is that by signaling the second event, the IOMMU driver
could order its own cleanup code after all other ExitBootServices()
callbacks. (Assuming, at least, that no other ExitBootServices() callback
employs the same trick to defer itself!) Therefore, by the time the
second callback runs, all PCI devices have been halted, and it is safe to
tear down the translations.

The ordering would be ensured by the following:

- The UEFI spec says, under CreateEventEx(), "All events are guaranteed
  to be signaled before the first notification action is taken." This
  means that, by the time the IOMMU's ExitBootServices() handler is
  entered, all other ExitBootServices() handlers have been *queued* at
  least, at TPL_CALLBACK or TPL_NOTIFY.

- Therefore, when we signal our second callback from there, at
  TPL_CALLBACK, the second callback will be queued at the end of the
  TPL_CALLBACK queue.

- CreateEventEx() also says that EFI_EVENT_GROUP_EXIT_BOOT_SERVICES "is
  functionally equivalent to the EVT_SIGNAL_EXIT_BOOT_SERVICES flag for
  the Type argument of CreateEvent." So it wouldn't matter if a driver
  used CreateEvent() or CreateEventEx() for setting up its own handler;
  the handler would be queued just the same.
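In code, the deferral trick could look roughly like this. It is a sketch
only: it assumes that signaling an event from an ExitBootServices()
handler is allowed (which is exactly my open question), and
IoMmuUnmapAllWithoutFreeing() is a hypothetical helper.

  STATIC EFI_EVENT  mDeferredCleanupEvent;

  //
  // Second stage, queued at TPL_CALLBACK: by the time this runs, all
  // other ExitBootServices() handlers have halted their devices, so the
  // translations can be torn down. The helper must neither allocate nor
  // free memory.
  //
  STATIC
  VOID
  EFIAPI
  IoMmuDeferredCleanup (
    IN EFI_EVENT  Event,
    IN VOID       *Context
    )
  {
    IoMmuUnmapAllWithoutFreeing ();
  }

  //
  // First stage: an ordinary ExitBootServices() handler that only
  // re-queues the real cleanup behind everyone else's handlers.
  //
  STATIC
  VOID
  EFIAPI
  IoMmuExitBootServices (
    IN EFI_EVENT  Event,
    IN VOID       *Context
    )
  {
    gBS->SignalEvent (mDeferredCleanupEvent);
  }

  //
  // Driver initialization (error handling and the ExitBootEvent variable
  // declaration omitted for brevity): create the private event first,
  // then the ExitBootServices() event.
  //
  gBS->CreateEvent (EVT_NOTIFY_SIGNAL, TPL_CALLBACK,
         IoMmuDeferredCleanup, NULL, &mDeferredCleanupEvent);
  gBS->CreateEvent (EVT_SIGNAL_EXIT_BOOT_SERVICES, TPL_CALLBACK,
         IoMmuExitBootServices, NULL, &ExitBootEvent);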
I think this is ugly and brittle, but it is perhaps the only way to clean
up *all* translations safely, with regard to PciIo.Unmap() +
ExitBootServices(), without updating the UEFI spec.

What do you think?

Thanks,
Laszlo