public inbox for devel@edk2.groups.io
From: Laszlo Ersek <lersek@redhat.com>
To: "Yao, Jiewen" <jiewen.yao@intel.com>,
	"Zeng, Star" <star.zeng@intel.com>,
	 edk2-devel-01 <edk2-devel@lists.01.org>
Cc: "Dong, Eric" <eric.dong@intel.com>,
	Brijesh Singh <brijesh.singh@amd.com>
Subject: Re: [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices()
Date: Tue, 5 Sep 2017 19:57:24 +0200	[thread overview]
Message-ID: <12d71f32-9dcf-4f6e-b033-d5f82104caca@redhat.com> (raw)
In-Reply-To: <74D8A39837DF1E4DA445A8C0B3885C503A9A7F10@shsmsx102.ccr.corp.intel.com>

On 09/05/17 15:44, Yao, Jiewen wrote:
> Good discussion. Comment in line.
> 
> From: Laszlo Ersek [mailto:lersek@redhat.com]
> Sent: Tuesday, September 5, 2017 5:16 PM
> To: Yao, Jiewen <jiewen.yao@intel.com>; Zeng, Star <star.zeng@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>
> Cc: Dong, Eric <eric.dong@intel.com>
> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices()
> 
> On 09/05/17 04:18, Yao, Jiewen wrote:
>> HI Laszlo
>> Thank you very much to raise this DMA topic.
>>
>> I have a chat with Star. We have couples of question, from high level to low level.
>> Hope you can clarify:
>>
>>
>> 1)       About disable DMA
>>
>> As you mentioned, in UEFI_Driver_Writer_Guide_V1.0.1_120308.pdf, 7.7 Adding the Exit Boot Services feature
>> =========================
>> Examples from the EDK II that use this feature are the PCI device drivers for USB Host
>> Controllers. Some USB Host Controllers are PCI Bus Masters that continuously access a
>> memory buffer to poll for operation requests. Access to this memory buffer by a USB
>> Host Controller may be required to boot an operating system, but this activity must be
>> terminated when the OS calls ExitBootServices(). The typical action in the Exit Boot
>> Services Event for these types of drivers is to disable the PCI bus master and place the
>> USB Host Controller into a halted state
>> =========================
>>
>> I believe the fundamental requirement is to "disable DMA".
>> As such, the simplest action is to clear PCI Bus Master Enable bit in the command register.
>>
>> Do you think clearing BME is enough?
>> Is there any special reason that we must unmap the DMA buffer?
> 
> Modifying the device state is enough to ask the device *nicely* to stop
> doing DMA.
> 
> But, I think, if we are on an IOMMU-protected system, where part of
> PciIo.Map() is to add a translation (= system memory access permission)
> to the IOMMU, then we should also revoke this permission explicitly,
> *after* we asked the device nicely to stop doing DMA.
> 
> Basically, ask the device nicely first, and then enforce the protection.
> 
> ... BTW, I mentioned "IOMMU faults" earlier, and I have some further
> thoughts on them. This is about tearing down IOMMU translations *first*,
> in the IOMMU driver's ExitBootServices() handler, *before* the PCI
> drivers get a chance -- in *their* ExitBootServices() handlers -- to ask
> the PCI devices nicely to stop doing DMA.
> 
> I think a quite problematic failure mode is possible here. Assume that a
> disk controller has some write commands queued (which are mapped with
> Bus Master Read or Bus Master CommonBuffer operations). If the
> ExitBootServices() handler for the IOMMU driver runs first, then the
> device could act upon these queued commands with missing IOMMU
> translations, before the device's own PCI driver asked the device to
> shut down DMA. In some cases, I guess this could result in those queued
> write commands simply failing, without ill effects. However, dependent
> on the IOMMU implementation, the write commands might not fail
> explicitly, but read garbage from system memory, and write garbage to
> disk blocks. That would be bad.
> 
> I think this is exactly what could happen with the SEV IOMMU we have.
> The memory would be first re-encrypted (this is equivalent to revoking a
> translation), and then the write commands (queued in the emulated AHCI
> controller, in the hypervisor) would read garbage from guest memory, and
> write garbage to the virtual disk.
> 
> [Jiewen] I am not clear on how it happens in SEV here.
> Below is my thought, please correct me if I am wrong.
> 
> First, the fundamental requirement is to disable BME at exit boot service,
> or make sure the controller is in halt state.
> If a device driver fails to meet such requirement, it is a bug we need to fix at first.
> 
> Now, let's assume all device drivers are in halt state at ExitBootService -- that is my
> understanding for all rest discussion.

OK.

> I checked OVMF and I do not see any code in IOMMU driver to register exit boot services event,
> which means the DMA buffer is still decrypted.

More precisely, the IOMMU driver does not take it upon itself to
re-encrypt DMA buffers at ExitBootServices().

Instead it relies on the PCI drivers to Unmap() CommonBuffers at
ExitBootServices().

> If that is the case, the DMA transfer before BME disable is safety (DMA buffer is decrypted) and
> there will be no DMA transfer any more at ExitBootService (BME disabled).
> Then how does the write command queued in the emulated AHCI controller start read garbage?

It is not happening right now. When I wrote that, I was just speculating
about possible consequences; i.e., what *could* happen if the IOMMU
driver itself re-encrypted memory at ExitBootServices().

*Then* the garbage case could happen.

In other words, the case I described does not happen right now; I only
described it as a counter-argument against *adding* an
ExitBootServices() handler to the IOMMU driver.

Under OvmfPkg, the IOMMU behavior for SEV, and the PCI drivers for the
virtio devices, are fully in sync. The PCI drivers *alone* are
responsible for unmapping CommonBuffers at ExitBootServices(), and so
the IOMMU driver doesn't do it.

The question is if the same distribution of responsibilities applies to
PCI drivers that live outside of OvmfPkg, such as MdeModulePkg.

The current patch set suggests "yes".

> The SEV IOMMU driver only modifies the page table.

Not exactly. It modifies page table entries, and then re-writes the
memory covered by those page table entries.

Simply flipping the C-bit in page table entries does not re-encrypt the
memory contents; it only arranges for the next read and/or write that
happens through that page mapping to use the "decrypted view" or the
"encrypted view", from the virtualization host side.

> Then after ExitBootService, the OS will take control of CR3 and set correct
> encryption bit.

This is true, the guest OS will set up new page tables, and in those
PTEs, the C-bit ("encrypt") will likely be set by default.

However, the guest OS will not *rewrite* all of the memory, with the
C-bit set. This means that any memory that the firmware didn't actually
re-encrypt (in the IOMMU driver) will still expose stale data to the
hypervisor.

> NOTE: The device should still be halt state, because the device is
> also controlled by OS.

This is the key point -- the device is *not* controlled by the guest OS,
in the worst case.

The virtual device is ultimately implemented by the hypervisor. We don't
expect the virtual device (= the hypervisor) to mis-behave on purpose.
However, if the hypervisor is compromised by an external attacker (for
example over the network, or via privilege escalation from another
guest), then all guest RAM that has not been encrypted up to that point
becomes visible to the attacker. In other words, the hypervisor ("the
device") becomes retro-actively malicious.

The point of SEV is to keep as much guest data encrypted at all times as
possible, so that *whenever* the hypervisor is compromised by an
attacker, the guest memory -- which the attacker can inevitably dump
from the host side -- remains unintelligible to the attacker.


>> 2)       About common buffer
>> If there is a special requirement that we must unmap the DMA buffer, why we only unmap the common buffer?
>>
>> I understand your statement that unmapping a read buffer or write buffer may cause free memory action. But that is implementation choice.
> 
> I believe I disagree with this (i.e., I think it is not an
> implementation choice).
> 
> In the general case, we can assume that PciIo.Map() allocates a bounce
> buffer internally, when Read or Write action is requested.
> 
> [Jiewen] When I say implementation choice, I mean Map() *may* allocates
> a bounce buffer or it just uses the same address, if the platform is capable of
> using DRAM as DMA buffer directly.

I have two counter-arguments:

(a) As long as we don't want to reimplement PciIo.Map() from scratch,
the current layering of protocols in edk2 implies that even if
the PCI device driver sets DUAL_ADDRESS_CYCLE, this flag can be cleared
on the way to the IOMMU protocol, in PciHostBridgeDxe, if
PciHostBridgeLib says "no DMA above 4GB".

OvmfPkg's PciHostBridgeLib currently says "no DMA above 4GB", simply
because that's always been the case, and I'm not sure if we can lift the
limitation without regressions.


(b) The more important counter-argument is that Map() and Unmap() need
auxiliary memory (a temporary buffer) for actually encrypting and
decrypting memory, so that the virtual device (implemented by the
hypervisor) can read it or write it (temporarily).

As I wrote above, encrypting or decrypting the memory contents works by
writing the original data through a new page mapping that has the
desired C-bit setting.

In a guest operating system, in-place encryption is possible by having
two virtual mappings to the same physical address range, one with the
C-bit set, and another with the C-bit clear. For encrypting, the guest
OS reads a cache-line sized chunk of data through the "C-bit clear"
mapping, and writes it back through the "C-bit set" mapping. For
decrypting, it does the inverse.

In UEFI, we don't have two virtual page mappings for the same physical
address, we only have a 1:1 mapping, on which we can flip the C-bit.
This means that, for decryption for example, we have to copy the data
from its original location (currently encrypted) to a "stash" (which is
always encrypted), modify the PTEs for the original location (clear the
C-bit, flush the TLB), then finally copy the data back from the "stash"
(still encrypted) to the original location (which will effectively
decrypt it, for the hypervisor to read).

This "stash" requires room. Currently the "stash" has an equal number of
pages to the actual region being mapped. For CommonBuffer operations,
the stash is allocated in AllocateBuffer(), together with the normal
buffer's allocation. It is freed similarly in FreeBuffer().

For Read/Write BMDMA operations, for which the PCI driver only calls
Map()/Unmap(), we cannot avoid allocating the stash in Map(), and
freeing it in Unmap(). We are practically forced to allocate a bounce
buffer not due to address width limitations (32 vs 64), but due to C-bit
status.

Now, earlier I had an idea to add a 1-page sized global array variable
to the IOMMU driver, and do the encryption/decryption page by page. In
other words, "unroll" the PTE management and the copying into page-sized
chunks. This would eliminate the need to allocate a "stash" dynamically.

However, the drawback of this approach is that the PTEs must be toggled
one by one even for a larger region. It could cause undesirably high
page table splitting and bad performance.


So, the SEV IOMMU could only eliminate dynamic memory alloc/dealloc for
Read/Write DMA operations if both conditions were changed:

- if we knew for sure that 64-bit DMA was safe to advertize in
  PciHostBridgeLib,

- if we replaced the dynamically allocated stash, and the "batch" PTE
  management + copying, with a single-page static buffer, and "unrolled"
  PTE management + copying.

More generally speaking, 64-bit clean DMA might simply not be possible
on a given hardware platform, and on such a platform, PciIo.Map() could
not avoid allocating bounce buffers for Read/Write DMA operations. Thus,
we shouldn't require a PCI driver to call Unmap() at ExitBootServices().


> In turn, PciIo.Unmap() has to release the same buffer, internally.
> PciIo.Unmap() does not know if it is called from an ExitBootServices()
> notification function, or from a normal context. It cannot decide if it
> is allowed to release memory, or not. There's no additional parameter to
> provide this hint either. If PciIo.Unmap() never releases buffers, then
> ExitBootServices() will be happy, but we're going to leak memory.
> 
> Now, PciIo.Unmap() could move the bounce buffer to some internal "free
> list" instead, rather than freeing it. And, the next PciIo.Map() call
> could try to reuse a buffer from the "free list". But this is difficult
> to do, because the buffers may have any size, and recycling them would
> require an internal, general "page pool", in parallel to the Boot
> Services page pool (AllocatePages, FreePages).
> 
> If PciIo.Unmap() could be informed that it is running in
> ExitBootServices() notification context -- called from PCI drivers --,
> then unmapping Read and Write operations would be simple.
> 
>> If we have strong requirement to unmap the DMA buffer, we should achieve that.
> 
> I totally agree that we *should* achieve that, but I don't see how the
> current PciIo.Map() and PciIo.Unmap() APIs make it possible, for Read
> and Write operations.
> 
> For CommonBuffer operations, it is possible, because AllocateBuffer()
> and FreeBuffer() are separate actions, and an ExitBootServices()
> notification function can *avoid* calling FreeBuffer. (But only for
> CommonBuffer operations.) PciIo.Unmap() cannot be told *not* to release
> a bounce buffer for Read/Write.
> 
> [Jiewen] Again, it is an implementation choice. It is possible that a host
> bridge allocate common buffer without calling any Allocate().
> But another implementation may also allocate internal data structure to
> record every map() action, including common buffer. (For event log, or
> tracing purposes, etc.)

True. My point is that
- we could reasonably require that Unmap() work for CommonBuffer
  operations without releasing memory,
- but we couldn't reasonably require the same for Read and Write
  operations.

This is because for CommonBuffer, a platform that right now allocates
extra stuff in Map(), could move the same allocations into
AllocateBuffer(). Similarly, move the current de-allocations from
Unmap() into FreeBuffer(). The same is not possible for Read/Write, so
we shouldn't require that.

> I do not think Free() should bring difference between common buffer or
> read/write buffer.
> I am a little worried that we have too many assumption here:
> e.g. Map() common buffer never involves Allocate() even for driver internal
> data structure , and Map() read/write buffer always involves Allocate() even
> there is no need to allocate anything.
> With that assumption, we purposely avoid Unmap() read/write buffer,
> but force Unmap() common buffer.

Exactly.

> I am not sure if that assumption is correct.

I'm just trying to operate with the APIs currently defined by the UEFI
spec, and these assumptions were the best I could come up with.

For example, if PciIo.Unmap() took one more parameter:

  IN BOOLEAN  CalledAtExitBootServices

then all this would be solved at once. Namely, PciIo implementations
would be required to call Unmap() for *all* DMA operations (read, write,
commonbuffer) in their ExitBootServices() notification functions, with
the parameter set to TRUE. All other contexts should pass FALSE.

In turn, CalledAtExitBootServices==TRUE would require PciIo.Unmap() to
work exactly as otherwise, but all memory allocated earlier with
gBS->AllocatePool() and gBS->AllocatePages() would have to be leaked on
purpose.

This is all.

Given that this parameter does not exist, I'm trying to find other ways to:
- perform the unmapping (re-encryption) at ExitBootServices,
- without releasing memory,
- and not in the IOMMU driver (because that could remove the translation
  before the PCI driver's own callback could disable the device, and
  that would be wrong)


>> 3)       About interaction with IOMMU
>> I am not worried about IOMMU interaction.
>> I believe a proper IOMMU implementation should not block any action taken by a PCI device driver.
>> Current IntelVTdDxe driver disables VTd engine in ExitBootService, which means the DMA access is no longer managed by VTd engine. It is NOT related to disable DMA buffer.
> 
> Ah, I understand!
> 
> So, on one hand, that is totally safe, for PCI device drivers. It means
> all pending DMA can get through, until the PCI drivers ask the devices
> (nicely) to stop doing DMA.
> [Jiewen] Yes. I think any IOMMU implementation should take that
> into consideration.
> 
> 
> 
> On the other hand, it is the opposite of the security goal of SEV. With
> SEV, the default state is "all DMA forbidden" (= all guest memory
> encrypted). This is also the status that we'd like to achieve when
> ExitBootServices() completes. When control is transferred to the OS, all
> guest memory should be encrypted again, even those areas that were once
> used as CommonBuffers.
> [Jiewen] I am not sure who will control "When control is transferred to the OS, all
> guest memory should be encrypted again, even those areas that were once
> used as CommonBuffers."
> For SEV case, I think it should be controlled by OS, because it is just CR3.

If it was indeed just CR3, then I would fully agree with you.

But -- to my understanding --, ensuring that the memory is actually
encrypted requires that the guest *write* to the memory through a
virtual address whose PTE has the C-bit set.

And I don't think the guest OS can be expected to rewrite all of its
memory at launch. :(

> If so, we should not meet any problem, because we already disable BME
> in device driver's ExitBootService event.

I agree, but installing new page tables is not enough.

> So apparently the default behaviors of the VT-d IOMMU and the SEV IOMMU
> are opposites -- VT-d permits all DMA unless configured otherwise, while
> SEV forbids all DMA unless configured otherwise.
> [Jiewen] I do not think so. The default behavior of VT-d IOMMU is to disable all DMA access.
> I setup translation table, but mark all entry to be NOT-PRESENT.
> I grant DMA access only if there is a specific request by a device.
> 
> I open DMA access in ExitBootServices, just want to make sure it is compatible to
> the existing OS, which might not have knowledge on IOMMU.
> I will provide a patch to make it a policy very soon. As such VTd IOMMU may
> turn off IOMMU, or keep it enabled at ExitBootService.
> An IOMMU aware OS may take over IOMMU directly and reprogram it.

Thanks for the clarification.

But, again, will this not lead to the possibility that the VT-d IOMMU's
ExitBootServices() handler disables all DMA *before* the PCI drivers get
a chance to run their own ExitBootServices() handlers, disabling the
individual devices?

If that happens, then a series of IOMMU faults could be generated, which
I described above. (I.e., such IOMMU faults could result, at least in
the case of SEV, in garbage being written to disk, via queued commands.)

>> 4)       About the driver you modified
>> I believe we have some other PCI device driver (NVMe/SDMMC/UFS), besides the ones you modified in this patch (ATA/UCHI/EHCI/XHCI).
>> If we need take this approach, I assume they also need update, right?
> 
> Yes, they should be updated similarly.
> 
> SDMMC and UFS are not included in OVMF, so I couldn't test their behavior.
> 
> NVMe is included in OVMF, but it is used very rarely in practice (we can
> call it a corner case), so I thought I would first submit patches for
> the four main drivers (UHCI, EHCI, XHCI, AHCI) that are frequently used
> with OVMF and QEMU. Tracking down the CommonBuffer lifecycles was
> difficult :)
> [Jiewen] Understood. That is OK. We can handle that in separate patch.

Thanks!

Now, to finish up, here's an idea I just had.

Are we allowed to call gBS->SignalEvent() in an ExitBootServices()
notification function?

If we are, then we could do the following:

* PciIo.Unmap() and friends would work as always (no restrictions on
  dynamic memory allocation or freeing, for any kind of DMA operation).

* PCI drivers would not be expected to call PciIo.Unmap() in their
  ExitBootServices() handlers.

* The IOMMU driver would have an ExitBootServices() notification
  function, to be enqueued at the TPL_CALLBACK or TPL_NOTIFY level
  (doesn't matter which).

* In this notification function, the IOMMU driver would signal *another*
  event (a private one). The notification function for this event would
  be enqueued strictly at the TPL_CALLBACK level.

* The notification function for the second event (private to the IOMMU)
  would iterate over all existent mappings, and unmap them, without
  allocating or freeing memory.

The key point is that by signaling the second event, the IOMMU driver
could order its own cleanup code after all other ExitBootServices()
callbacks. (Assuming, at least, that no other ExitBootServices()
callback employs the same trick to defer itself!) Therefore by the time
the second callback runs, all PCI devices have been halted, and it is
safe to tear down the translations.

The ordering would be ensured by the following:

- The UEFI spec says under CreateEventEx(), "All events are guaranteed
  to be signaled before the first notification action is taken." This
  means that, by the time the IOMMU's ExitBootServices() handler is
  entered, all other ExitBootServices() handlers have been *queued* at
  least, at TPL_CALLBACK or TPL_NOTIFY.

- Therefore, when we signal our second callback from there, for
  TPL_CALLBACK, the second callback will be queued at the end of the
  TPL_CALLBACK queue.

- CreateEventEx() also says that EFI_EVENT_GROUP_EXIT_BOOT_SERVICES "is
  functionally equivalent to the EVT_SIGNAL_EXIT_BOOT_SERVICES flag
  for the Type argument of CreateEvent." So it wouldn't matter if a
  driver used CreateEvent() or CreateEventEx() for setting up its own
  handler, the handler would be queued just the same.

I think this is ugly and brittle, but perhaps the only way to clean up
*all* translations safely, with regard to PciIo.Unmap() +
ExitBootServices(), without updating the UEFI spec.

What do you think?

Thanks,
Laszlo


