Subject: Re: [edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept
To: tobin@linux.ibm.com
Cc: devel@edk2.groups.io, dovmurik@linux.vnet.ibm.com, Dov.Murik1@il.ibm.com, ashish.kalra@amd.com, brijesh.singh@amd.com, tobin@ibm.com, david.kaplan@amd.com, jon.grimm@amd.com, thomas.lendacky@amd.com, jejb@linux.ibm.com, frankeh@us.ibm.com, "Dr. David Alan Gilbert" <...>
From: "Laszlo Ersek" <lersek@redhat.com>
Message-ID: <933a5d2b-a495-37b9-fe8b-243f9bae24d5@redhat.com>
Date: Tue, 3 Nov 2020 15:59:38 +0100

Hi Tobin,

(keeping full context -- I'm adding Dave)

On 10/28/20 20:31, Tobin Feldman-Fitzthum wrote:
> Hello,
>
> Dov Murik, James Bottomley, Hubertus Franke, and I have been working on a plan for fast live migration of SEV and SEV-ES (and SEV-SNP when it's out, and hopefully even Intel TDX) VMs. We have developed an approach that we believe is feasible and a demonstration that shows our solution to the most difficult part of the problem. In short, we have implemented a UEFI Application that can resume from a VM snapshot. We think this is the crux of SEV-ES live migration. After describing the context of our demo and how it works, we explain how it can be extended to a full SEV-ES migration. Our goal is to show that fast SEV and SEV-ES live migration can be implemented in OVMF with minimal kernel changes. We provide a blueprint for doing so.
>
> Typically the hypervisor facilitates live migration. AMD SEV excludes the hypervisor from the trust domain of the guest. When a hypervisor (HV) examines the memory of an SEV guest, it finds only ciphertext, and if the HV moves the memory of an SEV guest, the ciphertext is invalidated. Furthermore, with SEV-ES the hypervisor is largely unable to access guest CPU state. Thus, fast migration of SEV VMs requires support from inside the trust domain, i.e. the guest.
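(As an aside: from inside the trust domain, a guest can confirm for itself whether SEV or SEV-ES is active by reading the SEV_STATUS MSR, 0xC0010131. A minimal plain-C sketch -- it assumes CPUID leaf 0x8000001F has already confirmed SEV support, so the MSR is known to exist:)

    #include <stdint.h>

    #define MSR_AMD64_SEV  0xC0010131u   /* SEV_STATUS MSR, per the AMD APM */
    #define SEV_ENABLED    (1ull << 0)
    #define SEV_ES_ENABLED (1ull << 1)

    static uint64_t rdmsr(uint32_t msr)
    {
        uint32_t lo, hi;
        __asm__ volatile ("rdmsr" : "=a" (lo), "=d" (hi) : "c" (msr));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Returns 0 = unprotected, 1 = SEV, 2 = SEV-ES. */
    static int sev_protection_level(void)
    {
        uint64_t status = rdmsr(MSR_AMD64_SEV);

        if (status & SEV_ES_ENABLED) return 2;
        if (status & SEV_ENABLED)    return 1;
        return 0;
    }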
> One approach is to add support for SEV migration to the Linux kernel. This would allow the guest to encrypt/decrypt its own memory with a transport key. This approach has met some resistance. We propose a similar approach implemented not in Linux, but in firmware, specifically OVMF. Since OVMF runs inside the guest, it has access to the guest memory and CPU state. OVMF should be able to perform the manipulations required for live migration of SEV and SEV-ES guests.
>
> The biggest challenge of this approach involves migrating the CPU state of an SEV-ES guest. In a normal (non-SEV) migration, the HV sets the CPU state of the target before the target begins executing. In our approach, the HV starts the target and OVMF must resume to whatever state the source was in. We believe this to be the crux (or at least the most difficult part) of live migration for SEV, and we hope that by demonstrating resume from EFI, we can show that our approach is generally feasible.
>
> Our demo can be found at <...>. The tooling repository is the best starting point. It contains documentation about the project and the scripts needed to run the demo. There are two more repos associated with the project. One is a modified edk2 tree that contains our modified OVMF. The other is a modified qemu that has a couple of temporary changes needed for the demo. Our demonstration is aimed only at resuming from a VM snapshot in OVMF. We provide the source CPU state and source memory to the destination using temporary plumbing that violates the SEV trust model. We explain the setup in more depth in README.md. We are showing only that OVMF can resume from a VM snapshot. At the end we will describe our plan for transferring CPU state and memory from source to target. To be clear, the temporary tooling used for this demo isn't built for encrypted VMs, but below we explain how this demo applies to, and can be extended to, encrypted VMs.
>
> We implemented our resume code in a fashion very similar to the recommended S3 resume code. When the HV sets the CPU state of a guest, it can do so while the guest is not executing. Setting the state from inside the guest is a delicate operation: there is no way to atomically set all of the CPU state from inside the guest. Instead, we must set most registers individually and account for the changes in control flow that doing so might cause. We do this with a three-phase trampoline. OVMF calls phase 1, which runs on the OVMF map. Phase 1 sets up phase 2 and jumps to it. Phase 2 switches to an intermediate map that reconciles the OVMF map and the source map. Phase 3 switches to the source map, restores the registers, and returns into execution of the source. We will go backwards through these phases in more depth.
>
> The last thing that our resume code does is return. Specifically, we use IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS from a temporary stack and restores them atomically, thus returning to source execution. Prior to returning, we must manually restore most other registers to the values they had on the source. One particularly significant register is CR3. When we return to Linux, CR3 must be set to the source CR3, or the first instruction executed in Linux will cause a page fault. Likewise, the code that we use to restore the registers and return must be mapped in the source page table, or we would get a page fault executing the instructions prior to returning into Linux. The value of CR3 is so significant that it defines the three phases of the trampoline: phase 3 begins when CR3 is set to the source CR3. After setting CR3, we set all the other registers and return.
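To make the trampoline's final step concrete, here is a minimal, purely hypothetical C sketch of phase 3 -- the structure, field names, and elided register restores are illustrative, not the PoC's actual code, which lives in the linked repos:

    #include <stdint.h>

    /* Illustrative layout of the captured source CPU state. */
    struct source_state {
        uint64_t cr3;                        /* source page table          */
        uint64_t rip, cs, rflags, rsp, ss;   /* frame consumed by IRETQ    */
        /* ... general-purpose and other registers ... */
    };

    /* Phase 3: this code and its temporary stack must be mapped at the
       current virtual addresses in BOTH the intermediate map and the
       source map, or the fetch after the CR3 write would fault. */
    static void __attribute__((noreturn))
    phase3_resume(const struct source_state *s)
    {
        __asm__ volatile (
            "movq  %[cr3], %%cr3\n\t"  /* phase 3 begins: source CR3    */
            "pushq %[ss]\n\t"          /* build the IRETQ frame on the  */
            "pushq %[rsp]\n\t"         /* temporary stack, in the order */
            "pushq %[rflags]\n\t"      /* SS, RSP, RFLAGS, CS, RIP      */
            "pushq %[cs]\n\t"
            "pushq %[rip]\n\t"
            /* ...restore the remaining registers here, ending with
               whichever register still points at 's' (elided)... */
            "iretq"                    /* atomically load RIP/CS/RFLAGS/
                                          RSP/SS: source code executes
                                          from here on                  */
            :
            : [cr3] "r" (s->cr3), [ss] "r" (s->ss), [rsp] "r" (s->rsp),
              [rflags] "r" (s->rflags), [cs] "r" (s->cs), [rip] "r" (s->rip)
            : "memory");
        __builtin_unreachable();
    }

The point of the intermediate map, described next, is precisely that the instruction following the CR3 write is fetched from a virtual address that is valid both before and after the switch.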
> Phase 2 mainly exists to set up phase 3. OVMF uses a 1-1 mapping, meaning that virtual addresses are the same as physical addresses. The kernel page table uses an offset mapping, meaning that virtual addresses differ from physical addresses by a constant (for the most part). Crucially, this means that the virtual address of the page that is executed by phase 3 differs between the OVMF map and the source map. If we are executing code mapped in OVMF and we change CR3 to point to the source map, then even though the page may be mapped in the source map, its virtual address will be different, and we will face undefined behavior. To fix this, we construct intermediate page tables that map the pages for phases 2 and 3 both to the virtual address expected in OVMF and to the virtual address expected in the source map. Thus, we can switch CR3 from OVMF's map to the intermediate map, and then from the intermediate map to the source map. Phase 2 is much shorter than phase 3: it is mainly responsible for switching to the intermediate map, flushing the TLB, and jumping to phase 3.
>
> Fortunately, phase 1 is even simpler than phase 2. Phase 1 has two duties. First, since phases 2 and 3 operate without a stack and can't access values defined in OVMF (such as the addresses of the pages containing phases 2 and 3), phase 1 must pass these values to phase 2 by putting them in registers. Second, phase 1 must start phase 2 by jumping to it.
>
> Given that we can resume to a snapshot in OVMF, we should be able to migrate an SEV guest as long as we can securely communicate the VM snapshot from source to destination. For our demo, we do this with a handful of QMP commands. More sophisticated methods are required for a production implementation.
>
> When we refer to a snapshot, what we really mean is the device state, memory, and CPU state of a guest. In live migration this is transmitted dynamically, as opposed to being saved and restored. Device state is not protected by SEV and can be handled entirely by the HV. Memory, on the other hand, cannot be handled only by the HV. As mentioned previously, memory needs to be encrypted with a transport key. A Migration Handler on the source will coordinate with the HV to encrypt pages and transmit them to the destination. The destination HV will receive the pages over the network and pass them to the Migration Handler in the target VM so they can be decrypted. This transmission will occur continuously until the memory of the source and target converges.
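A sketch of what such a source-side export path might look like -- every name below (page_msg, aes_gcm_seal, send_to_hv, mh_export_page) is hypothetical, and aes_gcm_seal() stands in for whatever AEAD the source and destination Migration Handlers negotiate under the transport key:

    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096

    /* Hypothetical wire format: one guest page, sealed. */
    struct page_msg {
        uint64_t gfn;                 /* guest frame number             */
        uint8_t  ct[PAGE_SIZE];      /* page sealed with transport key */
        uint8_t  tag[16];             /* authentication tag             */
    };

    extern void aes_gcm_seal(const uint8_t key[32], const void *pt,
                             size_t len, void *ct, uint8_t tag[16]);
    extern void send_to_hv(const struct page_msg *msg); /* shared buffer */

    /* Source-side Migration Handler: the HV names a page by gfn; only
       code inside the trust domain can read its plaintext.  OVMF's 1-1
       mapping lets us address guest-physical memory directly. */
    void mh_export_page(const uint8_t key[32], uint64_t gfn)
    {
        struct page_msg msg = { .gfn = gfn };
        const void *pt = (const void *)(uintptr_t)(gfn * PAGE_SIZE);

        aes_gcm_seal(key, pt, PAGE_SIZE, msg.ct, msg.tag);
        send_to_hv(&msg);  /* HV forwards the ciphertext to the target */
    }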
> Plain SEV does not protect the CPU state of the guest and therefore does not require any special mechanism for transmission of the CPU state. We plan to implement an end-to-end migration with plain SEV first. In SEV-ES, the PSP (Platform Security Processor) encrypts CPU state on each VMExit. The encrypted state is stored in memory. Normally this memory (known as the VMSA) is not mapped into the guest, but we can add an entry to the nested page tables that will expose the VMSA to the guest. This means that when the guest VMExits, the CPU state will be saved to guest memory. With the CPU state in guest memory, it can be transmitted to the target using the method described above.
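Reusing the hypothetical definitions from the previous sketch: once the VMSA is visible in the guest's address space, exporting vCPU state would reduce to the ordinary page-export path. VMSA_GPA below is an illustrative placeholder, not an address the PoC actually uses:

    #define VMSA_GPA 0xFFFFD000ull   /* illustrative placeholder address */

    void mh_export_vcpu_state(const uint8_t key[32])
    {
        /* The VMSA is encrypted with the guest's own key, so reading it
           through the guest's encrypted mapping yields the saved state,
           which the page path above can seal and transmit. */
        mh_export_page(key, VMSA_GPA / PAGE_SIZE);
    }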
> In addition to the changes needed in OVMF to resume the VM, the transmission of the VM from source to target will require a new code path in the hypervisor. There will also need to be a few minor changes to Linux (adding a mapping for our phase 3 pages). Despite all the moving pieces, we believe that this is a feasible approach for supporting live migration for SEV and SEV-ES.
>
> For the sake of brevity, we have left out a few issues, including SMP support, generation of the intermediate mappings, and more. We have included some notes about these issues in the COMPLICATIONS.md file. We also have an outline of an end-to-end implementation of live migration for SEV-ES in END-TO-END.md. See README.md for info on how to run the demo. While this is not a full migration, we hope to show that fast live migration with SEV and SEV-ES is possible without major kernel changes.
>
> -Tobin

The one word that comes to my mind upon reading the above is: "overwhelming".

(I have not been addressed directly, but:

- the subject says "RFC",

- and the documentation at https://github.com/secure-migration/resume-from-edk2-tooling#what-changes-did-we-make states that AmdSevPkg was created for convenience, and that the feature could be integrated into OVMF. (Paraphrased.)

So I guess it's tolerable if I make a comment:)

I've checked out the "mh-state-dev" branch of <...>. It has 80 commits on top of edk2 master (base commit: d5339c04d7cd, "UefiCpuPkg/MpInitLib: Add missing explicit PcdLib dependency", 2020-04-23). These commits were authored over the 6-7 months since April. It's obviously huge work.

To me, most of these commits clearly aim at getting the demo / proof-of-concept functional, rather than guiding (more precisely: hand-holding) reviewers through the construction of the feature. In my opinion, the series is not upstreamable in its current format (which is presently not much more readable than a single-commit code drop). Upstreaming is probably not your intent at this time, either. I agree that getting feedback ("buy-in") at this level of maturity is justified from your POV, before you invest more work into cleaning up / restructuring the series.

My problem is that "hand-holding" is exactly what I'd need -- I cannot dedicate one or two weeks, as an indivisible block, to understanding your design. Nor can I approach the series patch-wise in its current format. Personally, I would need the patch series to lead me through the whole design in baby steps ("ELI5"), meaning small code changes and detailed commit messages. I'd *also* need the more comprehensive, guide-like documentation, as background material.

Furthermore, I don't have an environment where I can test this proof-of-concept (and provide you with further incentive to clean up the series, by reporting success). So I hope others can spend the time discussing the design with you, and testing / repeating the demo.

For me to review the patches, the series should condense and replay your thinking process from the last 7 months, in logical steps that are as small as possible. (On the list.) I really don't want to be the bottleneck here, which is why I would support introducing this feature as a separate top-level package (AmdSevPkg).

Thanks
Laszlo