From: Ashish Kalra <ashish.kalra@amd.com>
To: Tobin Feldman-Fitzthum <tobin@linux.ibm.com>
Cc: devel@edk2.groups.io, dovmurik@linux.vnet.ibm.com,
Dov.Murik1@il.ibm.com, brijesh.singh@amd.com, tobin@ibm.com,
david.kaplan@amd.com, jon.grimm@amd.com, thomas.lendacky@amd.com,
jejb@linux.ibm.com, frankeh@us.ibm.com
Subject: Re: RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept
Date: Fri, 30 Oct 2020 18:35:25 +0000
Message-ID: <20201030183525.GA17491@ashkalra_ubuntu_server>
In-Reply-To: <5dc185214309c0cf309e5244dea37c11@linux.vnet.ibm.com>

Hello Tobin,

On Thu, Oct 29, 2020 at 04:36:07PM -0400, Tobin Feldman-Fitzthum wrote:
> On 2020-10-29 13:06, Ashish Kalra wrote:
> > Hello Tobin,
> >
> > On Wed, Oct 28, 2020 at 03:31:44PM -0400, Tobin Feldman-Fitzthum wrote:
> > > Hello,
> > >
> > > Dov Murik, James Bottomley, Hubertus Franke, and I have been
> > > working on a plan for fast live migration of SEV and SEV-ES (and
> > > SEV-SNP when it's out, and hopefully even Intel TDX) VMs. We have
> > > developed an approach that we believe is feasible and a
> > > demonstration that shows our solution to the most difficult part
> > > of the problem. In short, we have implemented a UEFI Application
> > > that can resume from a VM snapshot. We think this is the crux of
> > > SEV-ES live migration. After describing the context of our demo
> > > and how it works, we explain how it can be extended to a full
> > > SEV-ES migration. Our goal is to show that fast SEV and SEV-ES
> > > live migration can be implemented in OVMF with minimal kernel
> > > changes. We provide a blueprint for doing so.
> > >
> > > Typically the hypervisor facilitates live migration. AMD SEV
> > > excludes the hypervisor from the trust domain of the guest. When a
> > > hypervisor (HV) examines the memory of an SEV guest, it will find
> > > only a ciphertext. If the HV moves the memory of an SEV guest, the
> > > ciphertext will be invalidated. Furthermore, with SEV-ES the
> > > hypervisor is largely unable to access guest CPU state. Thus, fast
> > > migration of SEV VMs requires support from inside the trust
> > > domain, i.e. the guest.
> > >
> > > One approach is to add support for SEV Migration to the Linux
> > > kernel. This would allow the guest to encrypt/decrypt its own
> > > memory with a transport key. This approach has met some
> > > resistance. We propose a similar approach implemented not in
> > > Linux, but in firmware, specifically OVMF. Since OVMF runs inside
> > > the guest, it has access to the guest memory and CPU state. OVMF
> > > should be able to perform the manipulations required for live
> > > migration of SEV and SEV-ES guests.
> > >
> > > The biggest challenge of this approach involves migrating the CPU
> > > state of an SEV-ES guest. In a normal (non-SEV) migration, the HV
> > > sets the CPU state of the target before the target begins
> > > executing. In our approach, the HV starts the target and OVMF must
> > > resume to whatever state the source was in. We believe this to be
> > > the crux (or at least the most difficult part) of live migration
> > > for SEV, and we hope that by demonstrating resume from EFI, we can
> > > show that our approach is generally feasible.
> > >
> > > Our demo can be found at https://github.com/secure-migration. The
> > > tooling repository is the best starting point. It contains
> > > documentation about the project and the scripts needed to run the
> > > demo. There are two more repos associated with the project. One is
> > > a modified edk2 tree that contains our modified OVMF. The other is
> > > a modified QEMU that has a couple of temporary changes needed for
> > > the demo. Our demonstration is aimed only at resuming from a VM
> > > snapshot in OVMF. We provide the source CPU state and source
> > > memory to the destination using temporary plumbing that violates
> > > the SEV trust model. We explain the setup in more depth in
> > > README.md. We are showing only that OVMF can resume from a VM
> > > snapshot. At the end we will describe our plan for transferring
> > > CPU state and memory from source to destination. To be clear, the
> > > temporary tooling used for this demo isn't built for encrypted
> > > VMs, but below we explain how this demo applies to, and can be
> > > extended to, encrypted VMs.
> > >
> > > We implemented our resume code in a fashion very similar to the
> > > recommended S3 resume code. When the HV sets the CPU state of a
> > > guest, it can do so while the guest is not executing. Setting the
> > > state from inside the guest is a delicate operation. There is no
> > > way to atomically set all of the CPU state from inside the guest.
> > > Instead, we must set most registers individually and account for
> > > changes in control flow that doing so might cause. We do this with
> > > a three-phase trampoline. OVMF calls phase 1, which runs on the
> > > OVMF map. Phase 1 sets up phase 2 and jumps to it. Phase 2
> > > switches to an intermediate map that reconciles the OVMF map and
> > > the source map. Phase 3 switches to the source map, restores the
> > > registers, and returns into execution of the source. We will go
> > > backwards through these phases in more depth.
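> > >
> > > To make the phases concrete, here is a rough sketch of the saved
> > > source state that the trampoline consumes (the field names and
> > > layout are our illustration, not the demo's actual definitions):
> > >
> > >   /* Register state captured from the source vCPU. */
> > >   struct saved_regs {
> > >     uint64_t rax, rbx, rcx, rdx, rsi, rdi, rbp;
> > >     uint64_t r8, r9, r10, r11, r12, r13, r14, r15;
> > >     uint64_t rip, rsp, rflags;  /* restored atomically by IRETQ  */
> > >     uint64_t cr3;               /* the source page table root    */
> > >     uint16_t cs, ss;            /* selectors for the IRETQ frame */
> > >   };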
> > >
> > > The last thing that resume to EFI does is return. Specifically, we
> > > use IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS
> > > from a temporary stack and restores them atomically, thus
> > > returning to source execution. Prior to returning, we must
> > > manually restore most other registers to the values they had on
> > > the source. One particularly significant register is CR3. When we
> > > return to Linux, CR3 must be set to the source CR3 or the first
> > > instruction executed in Linux will cause a page fault. The code
> > > that we use to restore the registers and return must be mapped in
> > > the source page table or we would get a page fault executing the
> > > instructions prior to returning into Linux. The value of CR3 is so
> > > significant that it defines the three phases of the trampoline.
> > > Phase 3 begins when CR3 is set to the source CR3. After setting
> > > CR3, we set all the other registers and return.
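> > >
> > > As a rough sketch (illustrative only; the real code is assembly
> > > and restores every general-purpose register), phase 3 boils down
> > > to:
> > >
> > >   asm volatile (
> > >     "movq  %0, %%cr3 \n\t"  /* enter the source page table       */
> > >     "movq  %1, %%rsp \n\t"  /* temporary stack, mapped in source */
> > >     "pushq %2        \n\t"  /* SS                                */
> > >     "pushq %3        \n\t"  /* source RSP                        */
> > >     "pushq %4        \n\t"  /* RFLAGS                            */
> > >     "pushq %5        \n\t"  /* CS                                */
> > >     "pushq %6        \n\t"  /* source RIP                        */
> > >     /* ... restore the remaining registers here ... */
> > >     "iretq"                 /* atomically load RIP/CS/RFLAGS/RSP/
> > >                                SS and resume the source          */
> > >     : : "r"(r->cr3), "r"(tmp_stack), "r"((uint64_t)r->ss),
> > >         "r"(r->rsp), "r"(r->rflags), "r"((uint64_t)r->cs),
> > >         "r"(r->rip));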
> > >
> > > Phase 2 mainly exists to set up phase 3. OVMF uses a 1-1 mapping,
> > > meaning that virtual addresses are the same as physical addresses.
> > > The kernel page table uses an offset mapping, meaning that virtual
> > > addresses differ from physical addresses by a constant (for the
> > > most part). Crucially, this means that the virtual address of the
> > > page that is executed by phase 3 differs between the OVMF map and
> > > the source map. If we are executing code mapped in OVMF and we
> > > change CR3 to point to the source map, although the page may be
> > > mapped in the source map, the virtual address will be different,
> > > and we will face undefined behavior. To fix this, we construct
> > > intermediate page tables that map the pages for phases 2 and 3 to
> > > the virtual address expected in OVMF and to the virtual address
> > > expected in the source map. Thus, we can switch CR3 from OVMF's
> > > map to the intermediate map and then from the intermediate map to
> > > the source map. Phase 2 is much shorter than phase 3. Phase 2 is
> > > mainly responsible for switching to the intermediate map, flushing
> > > the TLB, and jumping to phase 3.
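> > >
> > > In other words, the intermediate map doubly maps the trampoline
> > > pages. A minimal sketch, with hypothetical helpers (map_page and
> > > the flag names are not real edk2 APIs):
> > >
> > >   /* Map the phase 2/3 code page at both virtual addresses so the
> > >      jumps across the CR3 switches always land on mapped code. */
> > >   void build_intermediate_map (PageTable *im, uint64_t code_pa,
> > >                                uint64_t ovmf_va, uint64_t src_va)
> > >   {
> > >     map_page (im, ovmf_va, code_pa, PAGE_PRESENT | PAGE_EXEC);
> > >     map_page (im, src_va,  code_pa, PAGE_PRESENT | PAGE_EXEC);
> > >   }
> > >
> > > Phase 2 itself is then little more than a CR3 write (which also
> > > flushes non-global TLB entries) followed by a jump to phase 3 at
> > > its source-map virtual address.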
> > >
> > > Fortunately, phase 1 is even simpler than phase 2. Phase 1 has
> > > two duties. First, since phases 2 and 3 operate without a stack
> > > and can't access values defined in OVMF (such as the addresses of
> > > the pages containing phases 2 and 3), phase 1 must pass these
> > > values to phase 2 by putting them in registers. Second, phase 1
> > > must start phase 2 by jumping to it.
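> > >
> > > Sketched in C with inline assembly (the specific register
> > > assignments are an assumption; any registers the later phases
> > > agree on would do):
> > >
> > >   /* Hand over the addresses phases 2 and 3 need in registers,
> > >      then jump to phase 2. No stack is used past this point. */
> > >   asm volatile (
> > >     "movq %0, %%r12 \n\t"   /* intermediate map CR3             */
> > >     "movq %1, %%r13 \n\t"   /* phase 3 VA in the source map     */
> > >     "movq %2, %%r14 \n\t"   /* address of the saved source regs */
> > >     "jmp  *%3"
> > >     : : "r"(intermediate_cr3), "r"(phase3_src_va),
> > >         "r"(saved_regs_addr), "r"(phase2_va)
> > >     : "r12", "r13", "r14");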
> > >
> > > Given that we can resume to a snapshot in OVMF, we should be able
> > > to migrate an SEV guest as long as we can securely communicate the
> > > VM snapshot from source to destination. For our demo, we do this
> > > with a handful of QMP commands. More sophisticated methods are
> > > required for a production implementation.
> > >
> > > When we refer to a snapshot, what we really mean is the device
> > > state, memory, and CPU state of a guest. In live migration this is
> > > transmitted dynamically as opposed to being saved and restored.
> > > Device state is not protected by SEV and can be handled entirely
> > > by the HV. Memory, on the other hand, cannot be handled only by
> > > the HV. As mentioned previously, memory needs to be encrypted with
> > > a transport key. A Migration Handler on the source will coordinate
> > > with the HV to encrypt pages and transmit them to the destination.
> > > The destination HV will receive the pages over the network and
> > > pass them to the Migration Handler in the target VM so they can be
> > > decrypted. This transmission will occur continuously until the
> > > memory of the source and target converges.
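> > >
> > > A rough sketch of the source-side export path (the message
> > > format, the AEAD, and the helper names below are our assumptions,
> > > not a defined protocol):
> > >
> > >   struct page_msg {
> > >     uint64_t gpa;              /* guest physical address of page */
> > >     uint8_t  nonce[12];        /* unique per transmission        */
> > >     uint8_t  ct[4096];         /* page ciphertext                */
> > >     uint8_t  tag[16];          /* AEAD tag: integrity in flight  */
> > >   };
> > >
> > >   void export_page (uint64_t gpa, struct page_msg *out)
> > >   {
> > >     const void *page = gpa_to_va (gpa);   /* hypothetical helper */
> > >     out->gpa = gpa;
> > >     fill_random (out->nonce, sizeof out->nonce);
> > >     /* Encrypt under the transport key, binding the GPA as
> > >        associated data so the HV cannot swap pages undetected. */
> > >     aead_encrypt (transport_key, out->nonce,
> > >                   &out->gpa, sizeof out->gpa,   /* aad */
> > >                   page, 4096,                   /* in  */
> > >                   out->ct, out->tag);           /* out */
> > >   }
> > >
> > > The Migration Handler on the target runs the inverse: verify the
> > > tag, decrypt, and install the page at the indicated GPA.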
> > >
> > > Plain SEV does not protect the CPU state of the guest and
> > > therefore does not require any special mechanism for transmission
> > > of the CPU state. We plan to implement an end-to-end migration
> > > with plain SEV first. In SEV-ES, the PSP (platform security
> > > processor) encrypts CPU state on each VMExit. The encrypted state
> > > is stored in memory. Normally this memory (known as the VMSA) is
> > > not mapped into the guest, but we can add an entry to the nested
> > > page tables that will expose the VMSA to the guest.
> >
> > I have a question here: is there any kind of integrity protection
> > on the CPU state when the target VM is resumed after migration? For
> > example, if a malicious hypervisor maps a page with subverted CPU
> > state in the nested page tables, what prevents the target VM from
> > resuming execution on a subverted or compromised CPU state?
>
> Good question. Here is my thinking. The VMSA is mapped in the guest memory.
> It will be transmitted to the target like any other page, with encryption
> and integrity-checking. So we have integrity checking for CPU state while
> it is in flight.
>
> I think you are wondering something slightly different, though. Once the
> page with the VMSA arrives at the target and is decrypted and put in place,
> the hypervisor could potentially change the NPT to replace the data. Since
> the page with the VMSA will be encrypted (and the Migration Handler will
> expect this), the HV can't replace the page with arbitrary values.
>
> Since the VMSA is in memory, we have the protections that SEV provides
> for memory. Prior to SNP, this does not include integrity protection.
> The HV could attempt a replay attack by replacing the page with the
> VMSA with an older version of the same page. That said, the target will
> have just booted so there isn't much to replay.
>
> If we really need to, we could add functionality to the Migration Handler
> that would allow the HV to ask for an HMAC of the VMSA on the source.
> The Migration Handler on the target could use this to verify the VMSA
> just prior to starting the trampoline. Given the above, I am not sure
> this is necessary. Hopefully I've understood the attack you're suggesting
> correctly.
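>
> For concreteness, a rough sketch of that check (the helper name and
> the idea of a key shared between the two Migration Handlers are
> assumptions):
>
>   /* Target side: verify the source's MAC over the VMSA page just
>      before starting the trampoline. */
>   bool vmsa_ok (const uint8_t vmsa[4096],
>                 const uint8_t mac_from_source[32],
>                 const uint8_t mh_key[32])
>   {
>     uint8_t mac[32];
>     hmac_sha256 (mh_key, 32, vmsa, 4096, mac);  /* hypothetical */
>     /* A real check would use a constant-time comparison. */
>     return memcmp (mac, mac_from_source, 32) == 0;
>   }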
>
Yes, this is the attack I am suggesting: a compromised or malicious
hypervisor replacing the page containing the CPU state with
compromised data in the NPT when the target VM starts.

Thanks,
Ashish
> > > This means that when the guest VMExits, the CPU state will be
> > > saved to guest memory. With the CPU state in guest memory, it can
> > > be transmitted to the target using the method described above.
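> > >
> > > On the HV side, exposing the VMSA is conceptually a single extra
> > > NPT mapping. A sketch with hypothetical helpers (this is not the
> > > actual KVM interface):
> > >
> > >   /* Map the vCPU's (PSP-encrypted) VMSA page at a GPA the guest
> > >      chooses; the guest then reads it through an encrypted
> > >      mapping like any other private page. */
> > >   void expose_vmsa (struct vm *vm, int vcpu, uint64_t gpa)
> > >   {
> > >     uint64_t vmsa_hpa = vcpu_vmsa_hpa (vm, vcpu); /* hypothetical */
> > >     npt_map (vm, gpa, vmsa_hpa, NPT_READ);
> > >   }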
> > >
> > > In addition to the changes needed in OVMF to resume the VM, the
> > > transmission of the VM from source to target will require a new
> > > code path in the hypervisor. There will also need to be a few
> > > minor changes to Linux (adding a mapping for our phase 3 pages).
> > > Despite all the moving pieces, we believe that this is a feasible
> > > approach for supporting live migration for SEV and SEV-ES.
> > >
> > > For the sake of brevity, we have left out a few issues, including
> > > SMP support, generation of the intermediate mappings, and more. We
> > > have included some notes about these issues in the
> > > COMPLICATIONS.md file. We also have an outline of an end-to-end
> > > implementation of live migration for SEV-ES in END-TO-END.md. See
> > > README.md for info on how to run the demo. While this is not a
> > > full migration, we hope to show that fast live migration with SEV
> > > and SEV-ES is possible without major kernel changes.
> > >
> > > -Tobin
> > >