From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
Date: Wed, 04 Nov 2020 13:27:55 -0500
From: "Tobin Feldman-Fitzthum" <tobin@linux.ibm.com>
To: Laszlo Ersek
Cc: devel@edk2.groups.io, dovmurik@linux.vnet.ibm.com, Dov.Murik1@il.ibm.com, ashish.kalra@amd.com, brijesh.singh@amd.com, tobin@ibm.com, david.kaplan@amd.com, jon.grimm@amd.com, thomas.lendacky@amd.com, jejb@linux.ibm.com, frankeh@us.ibm.com, "Dr. David Alan Gilbert"
Subject: Re: [edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept
In-Reply-To: <933a5d2b-a495-37b9-fe8b-243f9bae24d5@redhat.com>
References: <933a5d2b-a495-37b9-fe8b-243f9bae24d5@redhat.com>
Message-ID: <61acbc7b318b2c099a106151116f25ea@linux.vnet.ibm.com>
User-Agent: Roundcube Webmail/1.0.1
Content-Type: text/plain; charset=US-ASCII; format=flowed
Content-Transfer-Encoding: 7bit
David Alan Gilbert" Subject: Re: [edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept In-Reply-To: <933a5d2b-a495-37b9-fe8b-243f9bae24d5@redhat.com> References: <933a5d2b-a495-37b9-fe8b-243f9bae24d5@redhat.com> Message-ID: <61acbc7b318b2c099a106151116f25ea@linux.vnet.ibm.com> X-Sender: tobin@linux.ibm.com User-Agent: Roundcube Webmail/1.0.1 X-TM-AS-GCONF: 00 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.312,18.0.737 definitions=2020-11-04_12:2020-11-04,2020-11-04 signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 bulkscore=0 mlxlogscore=999 suspectscore=0 phishscore=0 spamscore=0 clxscore=1015 mlxscore=0 lowpriorityscore=0 malwarescore=0 priorityscore=1501 adultscore=0 impostorscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2009150000 definitions=main-2011040128 Content-Type: text/plain; charset=US-ASCII; format=flowed Content-Transfer-Encoding: 7bit On 2020-11-03 09:59, Laszlo Ersek wrote: > Hi Tobin, > > (keeping full context -- I'm adding Dave) > > On 10/28/20 20:31, Tobin Feldman-Fitzthum wrote: >> Hello, >> >> Dov Murik. James Bottomley, Hubertus Franke, and I have been working >> on >> a plan for fast live migration of SEV and SEV-ES (and SEV-SNP when >> it's >> out and even hopefully Intel TDX) VMs. We have developed an approach >> that we believe is feasible and a demonstration that shows our >> solution >> to the most difficult part of the problem. In short, we have >> implemented >> a UEFI Application that can resume from a VM snapshot. We think this >> is >> the crux of SEV-ES live migration. After describing the context of our >> demo and how it works, we explain how it can be extended to a full >> SEV-ES migration. Our goal is to show that fast SEV and SEV-ES live >> migration can be implemented in OVMF with minimal kernel changes. We >> provide a blueprint for doing so. >> >> Typically the hypervisor facilitates live migration. AMD SEV excludes >> the hypervisor from the trust domain of the guest. When a hypervisor >> (HV) examines the memory of an SEV guest, it will find only a >> ciphertext. If the HV moves the memory of an SEV guest, the ciphertext >> will be invalidated. Furthermore, with SEV-ES the hypervisor is >> largely >> unable to access guest CPU state. Thus, fast migration of SEV VMs >> requires support from inside the trust domain, i.e. the guest. >> >> One approach is to add support for SEV Migration to the Linux kernel. >> This would allow the guest to encrypt/decrypt its own memory with a >> transport key. This approach has met some resistance. We propose a >> similar approach implemented not in Linux, but in firmware, >> specifically >> OVMF. Since OVMF runs inside the guest, it has access to the guest >> memory and CPU state. OVMF should be able to perform the manipulations >> required for live migration of SEV and SEV-ES guests. >> >> The biggest challenge of this approach involves migrating the CPU >> state >> of an SEV-ES guest. In a normal (non-SEV migration) the HV sets the >> CPU >> state of the target before the target begins executing. In our >> approach, >> the HV starts the target and OVMF must resume to whatever state the >> source was in. We believe this to be the crux (or at least the most >> difficult part) of live migration for SEV and we hope that by >> demonstrating resume from EFI, we can show that our approach is >> generally feasible. >> >> Our demo can be found at . The >> tooling repository is the best starting point. 
>> It contains documentation about the project and the scripts needed
>> to run the demo. There are two more repos associated with the
>> project: one is a modified edk2 tree that contains our modified
>> OVMF; the other is a modified QEMU with a couple of temporary
>> changes needed for the demo. Our demonstration is aimed only at
>> resuming from a VM snapshot in OVMF. We provide the source CPU state
>> and source memory to the destination using temporary plumbing that
>> violates the SEV trust model. We explain the setup in more depth in
>> README.md. We are showing only that OVMF can resume from a VM
>> snapshot. At the end we describe our plan for transferring CPU state
>> and memory from the source to the target. To be clear, the temporary
>> tooling used for this demo isn't built for encrypted VMs, but below
>> we explain how this demo applies to, and can be extended to,
>> encrypted VMs.
>>
>> We implemented our resume code in a fashion very similar to the
>> recommended S3 resume code. When the HV sets the CPU state of a
>> guest, it can do so while the guest is not executing. Setting the
>> state from inside the guest is a delicate operation: there is no way
>> to atomically set all of the CPU state from inside the guest.
>> Instead, we must set most registers individually and account for the
>> changes in control flow that doing so might cause. We do this with a
>> three-phase trampoline. OVMF calls phase 1, which runs on the OVMF
>> map. Phase 1 sets up phase 2 and jumps to it. Phase 2 switches to an
>> intermediate map that reconciles the OVMF map and the source map.
>> Phase 3 switches to the source map, restores the registers, and
>> returns into execution of the source. We will go backwards through
>> these phases in more depth.
>>
>> The last thing that resume to EFI does is return. Specifically, we
>> use IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS
>> from a temporary stack and restores them atomically, thus returning
>> to source execution. Prior to returning, we must manually restore
>> most other registers to the values they had on the source. One
>> particularly significant register is CR3. When we return to Linux,
>> CR3 must be set to the source CR3, or the first instruction executed
>> in Linux will cause a page fault. The code that we use to restore
>> the registers and return must be mapped in the source page table, or
>> we would get a page fault executing the instructions prior to
>> returning into Linux. The value of CR3 is so significant that it
>> defines the three phases of the trampoline: phase 3 begins when CR3
>> is set to the source CR3. After setting CR3, we set all the other
>> registers and return.
>>
>> Phase 2 mainly exists to set up phase 3. OVMF uses a 1:1 mapping,
>> meaning that virtual addresses are the same as physical addresses.
>> The kernel page table uses an offset mapping, meaning that virtual
>> addresses differ from physical addresses by a constant (for the most
>> part). Crucially, this means that the virtual address of the page
>> that is executed by phase 3 differs between the OVMF map and the
>> source map. If we are executing code mapped in OVMF and we change
>> CR3 to point to the source map, then although the page may be mapped
>> in the source map, the virtual address will be different, and we
>> will face undefined behavior.
>> To fix this, we construct intermediate page tables that map the
>> pages for phases 2 and 3 both to the virtual address expected in
>> OVMF and to the virtual address expected in the source map. Thus, we
>> can switch CR3 from OVMF's map to the intermediate map and then from
>> the intermediate map to the source map. Phase 2 is much shorter than
>> phase 3; it is mainly responsible for switching to the intermediate
>> map, flushing the TLB, and jumping to phase 3.
>>
>> Fortunately, phase 1 is even simpler than phase 2. Phase 1 has two
>> duties. First, since phases 2 and 3 operate without a stack and
>> can't access values defined in OVMF (such as the addresses of the
>> pages containing phases 2 and 3), phase 1 must pass these values to
>> phase 2 by putting them in registers. Second, phase 1 must start
>> phase 2 by jumping to it.
>>
>> Given that we can resume to a snapshot in OVMF, we should be able to
>> migrate an SEV guest as long as we can securely communicate the VM
>> snapshot from source to destination. For our demo, we do this with a
>> handful of QMP commands; more sophisticated methods are required for
>> a production implementation.
>>
>> When we refer to a snapshot, what we really mean is the device
>> state, memory, and CPU state of a guest. In live migration this is
>> transmitted dynamically, as opposed to being saved and restored.
>> Device state is not protected by SEV and can be handled entirely by
>> the HV. Memory, on the other hand, cannot be handled only by the HV.
>> As mentioned previously, memory needs to be encrypted with a
>> transport key. A Migration Handler on the source will coordinate
>> with the HV to encrypt pages and transmit them to the destination.
>> The destination HV will receive the pages over the network and pass
>> them to the Migration Handler in the target VM so they can be
>> decrypted. This transmission will occur continuously until the
>> memory of the source and target converges.
>>
>> Plain SEV does not protect the CPU state of the guest and therefore
>> does not require any special mechanism for transmission of the CPU
>> state. We plan to implement an end-to-end migration with plain SEV
>> first. In SEV-ES, the PSP (Platform Security Processor) encrypts CPU
>> state on each VMExit. The encrypted state is stored in memory.
>> Normally this memory (known as the VMSA) is not mapped into the
>> guest, but we can add an entry to the nested page tables that
>> exposes the VMSA to the guest. This means that when the guest
>> VMExits, the CPU state will be saved to guest memory. With the CPU
>> state in guest memory, it can be transmitted to the target using the
>> method described above.
>>
>> In addition to the changes needed in OVMF to resume the VM, the
>> transmission of the VM from source to target will require a new code
>> path in the hypervisor. There will also need to be a few minor
>> changes to Linux (adding a mapping for our phase 3 pages). Despite
>> all the moving pieces, we believe that this is a feasible approach
>> for supporting live migration for SEV and SEV-ES.
>>
>> For the sake of brevity, we have left out a few issues, including
>> SMP support, generation of the intermediate mappings, and more. We
>> have included some notes about these issues in the COMPLICATIONS.md
>> file. We also have an outline of an end-to-end implementation of
>> live migration for SEV-ES in END-TO-END.md. See README.md for info
>> on how to run the demo.
>> While this is not a full migration, we hope to show that fast live
>> migration with SEV and SEV-ES is possible without major kernel
>> changes.
>>
>> -Tobin
>
> the one word that comes to my mind upon reading the above is,
> "overwhelming".
>
> (I have not been addressed directly, but:
>
> - the subject says "RFC",
>
> - and the documentation at
>
> https://github.com/secure-migration/resume-from-edk2-tooling#what-changes-did-we-make
>
> states that AmdSevPkg was created for convenience, and that the
> feature could be integrated into OVMF. (Paraphrased.)
>
> So I guess it's tolerable if I make a comment: )

We've been looking forward to your perspective.

> I've checked out the "mh-state-dev" branch of . It has 80 commits on
> top of edk2 master (base commit: d5339c04d7cd, "UefiCpuPkg/MpInitLib:
> Add missing explicit PcdLib dependency", 2020-04-23).
>
> These commits were authored over the 6-7 months since April. It's
> obviously huge work. To me, most of these commits clearly aim at
> getting the demo / proof-of-concept functional, rather than guiding
> (more precisely: hand-holding) reviewers through the construction of
> the feature.
>
> In my opinion, the series is not upstreamable in its current format
> (which is presently not much more readable than a single-commit code
> drop). Upstreaming is probably not your intent, either, at this time.
>
> I agree that getting feedback ("buy-in") at this level of maturity is
> justified from your POV, before you invest more work into cleaning up
> / restructuring the series.
>
> My problem is that "hand-holding" is exactly what I'd need -- I
> cannot dedicate one or two weeks, as an indivisible block, to
> understanding your design. Nor can I approach the series patch-wise
> in its current format. Personally I would need the patch series to
> lead me through the whole design with baby steps ("ELI5"), meaning
> small code changes and detailed commit messages. I'd *also* need the
> more comprehensive guide-like documentation, as background material.
>
> Furthermore, I don't have an environment where I can test this
> proof-of-concept (and provide you with further incentive for cleaning
> up the series, by reporting success).
>
> So I hope others can spend the time discussing the design with you,
> and testing / repeating the demo. For me to review the patches, the
> patches should condense and replay your thinking process from the
> last 7 months, in as small as possible logical steps. (On the list.)

I completely understand your position. This PoC has a lot of new ideas
in it, and you're right that our main priority was not to hand-hold or
guide reviewers through the code. One thing worth emphasizing is that
the pieces we are showcasing here are not the immediate priority when
it comes to upstreaming. Specifically, we looked into the trampoline to
make sure it was possible to migrate CPU state via firmware. While we
need this for SEV-ES and our goal is to support SEV-ES, it is not the
first step. We are currently working on a PoC for a full end-to-end
migration with SEV (non-ES), which may be a better place for us to
begin a serious discussion about getting things upstream. We will focus
more on making these patches accessible to the upstream community. In
the meantime, perhaps there is something we can do to help make our
current work more clear. We could potentially explain things on a call
or create some additional documentation.
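
To give a flavor of the trampoline in the meantime, here is a heavily
simplified sketch of what phase 3 boils down to. This is illustrative C
with inline assembly rather than our actual code: the struct layout and
names are hypothetical, and the real phase 3 restores many more
registers and runs without a normal C environment.

  #include <stdint.h>

  /*
   * Hypothetical layout of the captured source CPU state; the real
   * structures in the PoC differ. All fields are 64-bit so each can
   * be pushed directly onto the IRETQ frame.
   */
  struct source_state {
      uint64_t cr3;     /* source page table root */
      uint64_t rip;     /* where the source was executing */
      uint64_t cs;
      uint64_t rflags;
      uint64_t rsp;
      uint64_t ss;
  };

  /*
   * Phase 3: switch to the source map, then return into the source
   * via IRETQ. This code, the temporary stack, and *s must all be
   * mapped at these same virtual addresses in the source map, or the
   * instruction fetches and data accesses after the CR3 write would
   * fault.
   */
  static void __attribute__((noreturn))
  phase3_resume(const struct source_state *s)
  {
      asm volatile (
          "movq  %0, %%cr3\n\t"  /* phase 3 begins: source CR3 live */
          /* ...all other registers are restored here... */
          "pushq %1\n\t"         /* SS     */
          "pushq %2\n\t"         /* RSP    */
          "pushq %3\n\t"         /* RFLAGS */
          "pushq %4\n\t"         /* CS     */
          "pushq %5\n\t"         /* RIP    */
          "iretq"                /* pop all five atomically */
          :
          : "r" (s->cr3), "m" (s->ss), "m" (s->rsp),
            "m" (s->rflags), "m" (s->cs), "m" (s->rip)
          : "memory");
      __builtin_unreachable();
  }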
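
The intermediate map that phase 2 loads is just a page table with a
double mapping for the trampoline pages. A sketch of the idea follows;
alloc_page() is a stand-in for whatever allocator provides zeroed,
identity-mapped page-table pages (it is not a real OVMF function), and
the flag handling is deliberately minimal.

  #include <stdint.h>

  #define PTE_PRESENT 0x1ULL
  #define PTE_RW      0x2ULL
  #define PTE_ADDR    (~0xFFFULL)

  /* Assumed helper: returns a zeroed 4 KiB page whose pointer equals
   * its physical address (true under OVMF's 1:1 map). */
  extern uint64_t *alloc_page(void);

  /* Walk one level down, allocating the next table if needed. */
  static uint64_t *next_level(uint64_t *table, unsigned idx)
  {
      if (!(table[idx] & PTE_PRESENT))
          table[idx] = (uint64_t)alloc_page() | PTE_PRESENT | PTE_RW;
      return (uint64_t *)(table[idx] & PTE_ADDR);
  }

  /* Map the 4 KiB page at physical address 'pa' at virtual 'va'. */
  static void map_page(uint64_t *pml4, uint64_t va, uint64_t pa)
  {
      uint64_t *pdpt = next_level(pml4, (va >> 39) & 0x1FF);
      uint64_t *pd   = next_level(pdpt, (va >> 30) & 0x1FF);
      uint64_t *pt   = next_level(pd,   (va >> 21) & 0x1FF);
      pt[(va >> 12) & 0x1FF] = pa | PTE_PRESENT | PTE_RW;
  }

  /*
   * The double mapping that lets the trampoline survive the CR3
   * switches: the same physical page appears at its 1:1 address
   * (where OVMF is executing it) and at the address it will have in
   * the source kernel's offset mapping.
   */
  void map_trampoline_page(uint64_t *pml4, uint64_t pa,
                           uint64_t source_va)
  {
      map_page(pml4, pa, pa);         /* OVMF's 1:1 view   */
      map_page(pml4, source_va, pa);  /* source map's view */
  }

Phase 2 then loads this PML4 into CR3 (which flushes the TLB) and jumps
to phase 3 through its source-map address.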
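
Similarly, for the SEV (non-ES) end-to-end PoC we are working on, the
Migration Handler's per-page work on the source is conceptually no more
than the following. transport_encrypt() and send_to_hypervisor() are
placeholders for the transport-key encryption and for the guest/HV
channel, both of which are still being designed.

  #include <stdint.h>

  #define PAGE_SIZE 4096

  /* One page in flight: its guest-physical address plus ciphertext. */
  struct page_packet {
      uint64_t gpa;
      uint8_t  ciphertext[PAGE_SIZE];
  };

  /* Placeholders: encryption with the negotiated transport key, and
   * the channel by which ciphertext is handed to the untrusted HV. */
  extern void transport_encrypt(const uint8_t *plain, uint8_t *cipher);
  extern void send_to_hypervisor(const struct page_packet *pkt);

  /*
   * Called for each page that needs to move. The plaintext never
   * leaves the guest; the HV only ever sees ciphertext, which it
   * forwards to the Migration Handler in the target VM for
   * decryption. Assumes the guest's 1:1 view of its own memory.
   */
  void migrate_page(uint64_t gpa)
  {
      struct page_packet pkt;

      pkt.gpa = gpa;
      transport_encrypt((const uint8_t *)(uintptr_t)gpa,
                        pkt.ciphertext);
      send_to_hypervisor(&pkt);
  }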
While our goal is not to shove this version of the trampoline upstream,
it is significant to our plan as a whole, and we want to help people
understand it.

-Tobin

> I really don't want to be the bottleneck here, which is why I would
> support introducing this feature as a separate top-level package
> (AmdSevPkg).
>
> Thanks
> Laszlo