Subject: Re: [edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept
To: tobin@linux.ibm.com
Cc: devel@edk2.groups.io, dovmurik@linux.vnet.ibm.com, Dov.Murik1@il.ibm.com, ashish.kalra@amd.com, brijesh.singh@amd.com, tobin@ibm.com, david.kaplan@amd.com, jon.grimm@amd.com, thomas.lendacky@amd.com, jejb@linux.ibm.com, frankeh@us.ibm.com, "Dr. David Alan Gilbert" <...>
From: "Laszlo Ersek" <lersek@redhat.com>
Message-ID: <933a5d2b-a495-37b9-fe8b-243f9bae24d5@redhat.com>
Date: Tue, 3 Nov 2020 15:59:38 +0100

Hi Tobin,

(keeping full context -- I'm adding Dave)

On 10/28/20 20:31, Tobin Feldman-Fitzthum wrote:
> Hello,
>
> Dov Murik, James Bottomley, Hubertus Franke, and I have been working on a plan for fast live migration of SEV and SEV-ES (and SEV-SNP when it's out, and hopefully even Intel TDX) VMs. We have developed an approach that we believe is feasible and a demonstration that shows our solution to the most difficult part of the problem. In short, we have implemented a UEFI Application that can resume from a VM snapshot. We think this is the crux of SEV-ES live migration. After describing the context of our demo and how it works, we explain how it can be extended to a full SEV-ES migration. Our goal is to show that fast SEV and SEV-ES live migration can be implemented in OVMF with minimal kernel changes. We provide a blueprint for doing so.
>
> Typically the hypervisor facilitates live migration. AMD SEV excludes the hypervisor from the trust domain of the guest. When a hypervisor (HV) examines the memory of an SEV guest, it finds only ciphertext, and if the HV moves the memory of an SEV guest, the ciphertext is invalidated. Furthermore, with SEV-ES the hypervisor is largely unable to access guest CPU state. Thus, fast migration of SEV VMs requires support from inside the trust domain, i.e. the guest.
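(As an aside: from inside the trust domain, a guest can confirm for itself whether SEV or SEV-ES is active by reading the SEV_STATUS MSR, 0xC0010131. A minimal plain-C sketch -- it assumes CPUID leaf 0x8000001F has already confirmed SEV support, so the MSR is known to exist:)

    #include <stdint.h>

    #define MSR_AMD64_SEV  0xC0010131u   /* SEV_STATUS MSR, per the AMD APM */
    #define SEV_ENABLED    (1ull << 0)
    #define SEV_ES_ENABLED (1ull << 1)

    static uint64_t rdmsr(uint32_t msr)
    {
        uint32_t lo, hi;
        __asm__ volatile ("rdmsr" : "=a" (lo), "=d" (hi) : "c" (msr));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Returns 0 = unprotected, 1 = SEV, 2 = SEV-ES. */
    static int sev_protection_level(void)
    {
        uint64_t status = rdmsr(MSR_AMD64_SEV);

        if (status & SEV_ES_ENABLED) return 2;
        if (status & SEV_ENABLED)    return 1;
        return 0;
    }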
> One approach is to add support for SEV migration to the Linux kernel. This would allow the guest to encrypt/decrypt its own memory with a transport key. This approach has met some resistance. We propose a similar approach implemented not in Linux, but in firmware, specifically OVMF. Since OVMF runs inside the guest, it has access to the guest memory and CPU state. OVMF should be able to perform the manipulations required for live migration of SEV and SEV-ES guests.
>
> The biggest challenge of this approach involves migrating the CPU state of an SEV-ES guest. In a normal (non-SEV) migration, the HV sets the CPU state of the target before the target begins executing. In our approach, the HV starts the target and OVMF must resume to whatever state the source was in. We believe this to be the crux (or at least the most difficult part) of live migration for SEV, and we hope that by demonstrating resume from EFI, we can show that our approach is generally feasible.
>
> Our demo can be found at <...>. The tooling repository is the best starting point. It contains documentation about the project and the scripts needed to run the demo. There are two more repos associated with the project. One is a modified edk2 tree that contains our modified OVMF. The other is a modified qemu that has a couple of temporary changes needed for the demo. Our demonstration is aimed only at resuming from a VM snapshot in OVMF. We provide the source CPU state and source memory to the destination using temporary plumbing that violates the SEV trust model. We explain the setup in more depth in README.md. We are showing only that OVMF can resume from a VM snapshot. At the end we will describe our plan for transferring CPU state and memory from source to target. To be clear, the temporary tooling used for this demo isn't built for encrypted VMs, but below we explain how this demo applies to, and can be extended to, encrypted VMs.
>
> We implemented our resume code in a fashion very similar to the recommended S3 resume code. When the HV sets the CPU state of a guest, it can do so while the guest is not executing. Setting the state from inside the guest is a delicate operation: there is no way to atomically set all of the CPU state from inside the guest. Instead, we must set most registers individually and account for the changes in control flow that doing so might cause. We do this with a three-phase trampoline. OVMF calls phase 1, which runs on the OVMF map. Phase 1 sets up phase 2 and jumps to it. Phase 2 switches to an intermediate map that reconciles the OVMF map and the source map. Phase 3 switches to the source map, restores the registers, and returns into execution of the source. We will go backwards through these phases in more depth.
>
> The last thing that our resume code does is return. Specifically, we use IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS from a temporary stack and restores them atomically, thus returning to source execution. Prior to returning, we must manually restore most other registers to the values they had on the source. One particularly significant register is CR3. When we return to Linux, CR3 must be set to the source CR3, or the first instruction executed in Linux will cause a page fault. Likewise, the code that we use to restore the registers and return must be mapped in the source page table, or we would get a page fault executing the instructions prior to returning into Linux. The value of CR3 is so significant that it defines the three phases of the trampoline: phase 3 begins when CR3 is set to the source CR3. After setting CR3, we set all the other registers and return.
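To make the trampoline's final step concrete, here is a minimal, purely hypothetical C sketch of phase 3 -- the structure, field names, and elided register restores are illustrative, not the PoC's actual code, which lives in the linked repos:

    #include <stdint.h>

    /* Illustrative layout of the captured source CPU state. */
    struct source_state {
        uint64_t cr3;                        /* source page table          */
        uint64_t rip, cs, rflags, rsp, ss;   /* frame consumed by IRETQ    */
        /* ... general-purpose and other registers ... */
    };

    /* Phase 3: this code and its temporary stack must be mapped at the
       current virtual addresses in BOTH the intermediate map and the
       source map, or the fetch after the CR3 write would fault. */
    static void __attribute__((noreturn))
    phase3_resume(const struct source_state *s)
    {
        __asm__ volatile (
            "movq  %[cr3], %%cr3\n\t"  /* phase 3 begins: source CR3    */
            "pushq %[ss]\n\t"          /* build the IRETQ frame on the  */
            "pushq %[rsp]\n\t"         /* temporary stack, in the order */
            "pushq %[rflags]\n\t"      /* SS, RSP, RFLAGS, CS, RIP      */
            "pushq %[cs]\n\t"
            "pushq %[rip]\n\t"
            /* ...restore the remaining registers here, ending with
               whichever register still points at 's' (elided)... */
            "iretq"                    /* atomically load RIP/CS/RFLAGS/
                                          RSP/SS: source code executes
                                          from here on                  */
            :
            : [cr3] "r" (s->cr3), [ss] "r" (s->ss), [rsp] "r" (s->rsp),
              [rflags] "r" (s->rflags), [cs] "r" (s->cs), [rip] "r" (s->rip)
            : "memory");
        __builtin_unreachable();
    }

The point of the intermediate map, described next, is precisely that the instruction following the CR3 write is fetched from a virtual address that is valid both before and after the switch.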
> Phase 2 mainly exists to set up phase 3. OVMF uses a 1-1 mapping, meaning that virtual addresses are the same as physical addresses. The kernel page table uses an offset mapping, meaning that virtual addresses differ from physical addresses by a constant (for the most part). Crucially, this means that the virtual address of the page that is executed by phase 3 differs between the OVMF map and the source map. If we are executing code mapped in OVMF and we change CR3 to point to the source map, then even though the page may be mapped in the source map, its virtual address will be different, and we will face undefined behavior. To fix this, we construct intermediate page tables that map the pages for phases 2 and 3 both to the virtual address expected in OVMF and to the virtual address expected in the source map. Thus, we can switch CR3 from OVMF's map to the intermediate map, and then from the intermediate map to the source map. Phase 2 is much shorter than phase 3: it is mainly responsible for switching to the intermediate map, flushing the TLB, and jumping to phase 3.
>
> Fortunately, phase 1 is even simpler than phase 2. Phase 1 has two duties. First, since phases 2 and 3 operate without a stack and can't access values defined in OVMF (such as the addresses of the pages containing phases 2 and 3), phase 1 must pass these values to phase 2 by putting them in registers. Second, phase 1 must start phase 2 by jumping to it.
>
> Given that we can resume to a snapshot in OVMF, we should be able to migrate an SEV guest as long as we can securely communicate the VM snapshot from source to destination. For our demo, we do this with a handful of QMP commands. More sophisticated methods are required for a production implementation.
>
> When we refer to a snapshot, what we really mean is the device state, memory, and CPU state of a guest. In live migration this is transmitted dynamically, as opposed to being saved and restored. Device state is not protected by SEV and can be handled entirely by the HV. Memory, on the other hand, cannot be handled only by the HV. As mentioned previously, memory needs to be encrypted with a transport key. A Migration Handler on the source will coordinate with the HV to encrypt pages and transmit them to the destination. The destination HV will receive the pages over the network and pass them to the Migration Handler in the target VM so they can be decrypted. This transmission will occur continuously until the memory of the source and target converges.
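A sketch of what such a source-side export path might look like -- every name below (page_msg, aes_gcm_seal, send_to_hv, mh_export_page) is hypothetical, and aes_gcm_seal() stands in for whatever AEAD the source and destination Migration Handlers negotiate under the transport key:

    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096

    /* Hypothetical wire format: one guest page, sealed. */
    struct page_msg {
        uint64_t gfn;                 /* guest frame number             */
        uint8_t  ct[PAGE_SIZE];      /* page sealed with transport key */
        uint8_t  tag[16];             /* authentication tag             */
    };

    extern void aes_gcm_seal(const uint8_t key[32], const void *pt,
                             size_t len, void *ct, uint8_t tag[16]);
    extern void send_to_hv(const struct page_msg *msg); /* shared buffer */

    /* Source-side Migration Handler: the HV names a page by gfn; only
       code inside the trust domain can read its plaintext.  OVMF's 1-1
       mapping lets us address guest-physical memory directly. */
    void mh_export_page(const uint8_t key[32], uint64_t gfn)
    {
        struct page_msg msg = { .gfn = gfn };
        const void *pt = (const void *)(uintptr_t)(gfn * PAGE_SIZE);

        aes_gcm_seal(key, pt, PAGE_SIZE, msg.ct, msg.tag);
        send_to_hv(&msg);  /* HV forwards the ciphertext to the target */
    }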
> Plain SEV does not protect the CPU state of the guest and therefore does not require any special mechanism for transmission of the CPU state. We plan to implement an end-to-end migration with plain SEV first. In SEV-ES, the PSP (Platform Security Processor) encrypts CPU state on each VMExit. The encrypted state is stored in memory. Normally this memory (known as the VMSA) is not mapped into the guest, but we can add an entry to the nested page tables that will expose the VMSA to the guest. This means that when the guest VMExits, the CPU state will be saved to guest memory. With the CPU state in guest memory, it can be transmitted to the target using the method described above.
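Reusing the hypothetical definitions from the previous sketch: once the VMSA is visible in the guest's address space, exporting vCPU state would reduce to the ordinary page-export path. VMSA_GPA below is an illustrative placeholder, not an address the PoC actually uses:

    #define VMSA_GPA 0xFFFFD000ull   /* illustrative placeholder address */

    void mh_export_vcpu_state(const uint8_t key[32])
    {
        /* The VMSA is encrypted with the guest's own key, so reading it
           through the guest's encrypted mapping yields the saved state,
           which the page path above can seal and transmit. */
        mh_export_page(key, VMSA_GPA / PAGE_SIZE);
    }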
> In addition to the changes needed in OVMF to resume the VM, the transmission of the VM from source to target will require a new code path in the hypervisor. There will also need to be a few minor changes to Linux (adding a mapping for our phase 3 pages). Despite all the moving pieces, we believe that this is a feasible approach for supporting live migration for SEV and SEV-ES.
>
> For the sake of brevity, we have left out a few issues, including SMP support, generation of the intermediate mappings, and more. We have included some notes about these issues in the COMPLICATIONS.md file. We also have an outline of an end-to-end implementation of live migration for SEV-ES in END-TO-END.md. See README.md for info on how to run the demo. While this is not a full migration, we hope to show that fast live migration with SEV and SEV-ES is possible without major kernel changes.
>
> -Tobin

The one word that comes to my mind upon reading the above is: "overwhelming".

(I have not been addressed directly, but:

- the subject says "RFC",

- and the documentation at https://github.com/secure-migration/resume-from-edk2-tooling#what-changes-did-we-make states that AmdSevPkg was created for convenience, and that the feature could be integrated into OVMF. (Paraphrased.)

So I guess it's tolerable if I make a comment:)

I've checked out the "mh-state-dev" branch of <...>. It has 80 commits on top of edk2 master (base commit: d5339c04d7cd, "UefiCpuPkg/MpInitLib: Add missing explicit PcdLib dependency", 2020-04-23). These commits were authored over the 6-7 months since April. It's obviously huge work.

To me, most of these commits clearly aim at getting the demo / proof-of-concept functional, rather than guiding (more precisely: hand-holding) reviewers through the construction of the feature. In my opinion, the series is not upstreamable in its current format (which is presently not much more readable than a single-commit code drop). Upstreaming is probably not your intent at this time, either. I agree that getting feedback ("buy-in") at this level of maturity is justified from your POV, before you invest more work into cleaning up / restructuring the series.

My problem is that "hand-holding" is exactly what I'd need -- I cannot dedicate one or two weeks, as an indivisible block, to understanding your design. Nor can I approach the series patch-wise in its current format. Personally, I would need the patch series to lead me through the whole design in baby steps ("ELI5"), meaning small code changes and detailed commit messages. I'd *also* need the more comprehensive, guide-like documentation, as background material.

Furthermore, I don't have an environment where I can test this proof-of-concept (and provide you with further incentive to clean up the series, by reporting success). So I hope others can spend the time discussing the design with you, and testing / repeating the demo.

For me to review the patches, the series should condense and replay your thinking process from the last 7 months, in logical steps that are as small as possible. (On the list.) I really don't want to be the bottleneck here, which is why I would support introducing this feature as a separate top-level package (AmdSevPkg).

Thanks
Laszlo