Date: Mon, 09 Nov 2020 15:27:57 -0500
From: "Tobin Feldman-Fitzthum" <tobin@linux.ibm.com>
To: Ashish Kalra
Cc: "Dr. David Alan Gilbert", Laszlo Ersek, devel@edk2.groups.io, dovmurik@linux.vnet.ibm.com, Dov.Murik1@il.ibm.com, brijesh.singh@amd.com, tobin@ibm.com, david.kaplan@amd.com, jon.grimm@amd.com, thomas.lendacky@amd.com, jejb@linux.ibm.com, frankeh@us.ibm.com
Subject: Re: [edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept
In-Reply-To: <20201106221704.GA23995@ashkalra_ubuntu_server>
References: <933a5d2b-a495-37b9-fe8b-243f9bae24d5@redhat.com> <61acbc7b318b2c099a106151116f25ea@linux.vnet.ibm.com> <20201106163848.GM3576@work-vm> <6c4d7b90a59d3df6895d8c0e35f7a2cd@linux.vnet.ibm.com> <20201106221704.GA23995@ashkalra_ubuntu_server>
Message-ID: <830107e597cd63d69283094d4e36a10e@linux.vnet.ibm.com>

On 2020-11-06 17:17, Ashish Kalra wrote:
> Hello Tobin,
>
> On Fri, Nov 06, 2020 at 04:48:12PM -0500, Tobin Feldman-Fitzthum wrote:
>> On 2020-11-06 11:38, Dr. David Alan Gilbert wrote:
>> > * Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote:
>> > > On 2020-11-03 09:59, Laszlo Ersek wrote:
>> > > > Hi Tobin,
>> > > >
>> > > > (keeping full context -- I'm adding Dave)
>> > > >
>> > > > On 10/28/20 20:31, Tobin Feldman-Fitzthum wrote:
>> > > > > Hello,
>> > > > >
>> > > > > Dov Murik,
>> > > > > James Bottomley, Hubertus Franke, and I have been working on a plan for fast live migration of SEV and SEV-ES (and SEV-SNP when it's out, and even, hopefully, Intel TDX) VMs. We have developed an approach that we believe is feasible and a demonstration that shows our solution to the most difficult part of the problem. In short, we have implemented a UEFI application that can resume from a VM snapshot. We think this is the crux of SEV-ES live migration. After describing the context of our demo and how it works, we explain how it can be extended to a full SEV-ES migration. Our goal is to show that fast SEV and SEV-ES live migration can be implemented in OVMF with minimal kernel changes. We provide a blueprint for doing so.
>> > > > >
>> > > > > Typically the hypervisor facilitates live migration. AMD SEV excludes the hypervisor from the trust domain of the guest. When a hypervisor (HV) examines the memory of an SEV guest, it will find only ciphertext. If the HV moves the memory of an SEV guest, the ciphertext will be invalidated. Furthermore, with SEV-ES the hypervisor is largely unable to access guest CPU state. Thus, fast migration of SEV VMs requires support from inside the trust domain, i.e. the guest.
>> > > > >
>> > > > > One approach is to add support for SEV migration to the Linux kernel. This would allow the guest to encrypt/decrypt its own memory with a transport key. This approach has met some resistance. We propose a similar approach implemented not in Linux, but in firmware, specifically OVMF. Since OVMF runs inside the guest, it has access to the guest memory and CPU state.
>> > > > > OVMF should be able to perform the manipulations required for live migration of SEV and SEV-ES guests.
>> > > > >
>> > > > > The biggest challenge of this approach involves migrating the CPU state of an SEV-ES guest. In a normal (non-SEV) migration, the HV sets the CPU state of the target before the target begins executing. In our approach, the HV starts the target and OVMF must resume to whatever state the source was in. We believe this to be the crux (or at least the most difficult part) of live migration for SEV, and we hope that by demonstrating resume from EFI, we can show that our approach is generally feasible.
>> > > > >
>> > > > > Our demo can be found at . The tooling repository is the best starting point. It contains documentation about the project and the scripts needed to run the demo. There are two more repos associated with the project. One is a modified edk2 tree that contains our modified OVMF. The other is a modified qemu that has a couple of temporary changes needed for the demo. Our demonstration is aimed only at resuming from a VM snapshot in OVMF. We provide the source CPU state and source memory to the destination using temporary plumbing that violates the SEV trust model. We explain the setup in more depth in README.md. We are showing only that OVMF can resume from a VM snapshot. At the end we will describe our plan for transferring CPU state and memory from source to target. To be clear, the temporary tooling used for this demo isn't built for encrypted VMs, but below we explain how this demo applies to and can be extended to encrypted VMs.
>> > > > >
>> > > > > We implemented our resume code in a fashion very similar to the recommended S3 resume code. When the HV sets the CPU state of a guest, it can do so while the guest is not executing. Setting the state from inside the guest is a delicate operation. There is no way to atomically set all of the CPU state from inside the guest. Instead, we must set most registers individually and account for the changes in control flow that doing so might cause. We do this with a three-phase trampoline. OVMF calls phase 1, which runs on the OVMF map. Phase 1 sets up phase 2 and jumps to it. Phase 2 switches to an intermediate map that reconciles the OVMF map and the source map. Phase 3 switches to the source map, restores the registers, and returns into execution of the source. We will go backwards through these phases in more depth.
>> > > > >
>> > > > > The last thing that resume to EFI does is return. Specifically, we use IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS from a temporary stack and restores them atomically, thus returning to source execution. Prior to returning, we must manually restore most other registers to the values they had on the source. One particularly significant register is CR3. When we return to Linux, CR3 must be set to the source CR3, or the first instruction executed in Linux will cause a page fault. The code that we use to restore the registers and return must be mapped in the source page table, or we would get a page fault executing the instructions prior to returning into Linux. The value of CR3 is so significant that it defines the three phases of the trampoline. Phase 3 begins when CR3 is set to the source CR3.
>> > > > > After setting CR3, we set all the other registers and return.
>> > > > >
>> > > > > Phase 2 mainly exists to set up phase 3. OVMF uses a 1-1 mapping, meaning that virtual addresses are the same as physical addresses. The kernel page table uses an offset mapping, meaning that virtual addresses differ from physical addresses by a constant (for the most part). Crucially, this means that the virtual address of the page that is executed by phase 3 differs between the OVMF map and the source map. If we are executing code mapped in OVMF and we change CR3 to point to the source map, then although the page may be mapped in the source map, the virtual address will be different, and we will face undefined behavior. To fix this, we construct intermediate page tables that map the pages for phases 2 and 3 both to the virtual address expected in OVMF and to the virtual address expected in the source map. Thus, we can switch CR3 from OVMF's map to the intermediate map and then from the intermediate map to the source map. Phase 2 is much shorter than phase 3. Phase 2 is mainly responsible for switching to the intermediate map, flushing the TLB, and jumping to phase 3.
>> > > > >
>> > > > > Fortunately, phase 1 is even simpler than phase 2. Phase 1 has two duties. First, since phases 2 and 3 operate without a stack and can't access values defined in OVMF (such as the addresses of the pages containing phases 2 and 3), phase 1 must pass these values to phase 2 by putting them in registers. Second, phase 1 must start phase 2 by jumping to it.
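[Editor's note: the role of the intermediate map in the three-phase trampoline above can be sketched with a toy model. This is not code from the referenced repos; the addresses, the offset constant, and the dict-based "page tables" are all invented for illustration.]

```python
# Toy model of the three-phase CR3 switch. Page tables are modeled as dicts
# mapping virtual page -> physical page. The point: the intermediate map must
# contain the trampoline pages at *both* virtual addresses, so the
# instruction pointer stays valid across each CR3 write.

TRAMPOLINE_PA = 0x7000  # physical page holding phase 2/3 code (hypothetical)
KERNEL_OFFSET = 0xFFFF_8000_0000_0000  # offset-style kernel mapping (illustrative)

# OVMF uses an identity (1-1) map: VA == PA.
ovmf_map = {TRAMPOLINE_PA: TRAMPOLINE_PA}

# The source (kernel) map sees the same physical page at an offset VA.
source_map = {TRAMPOLINE_PA + KERNEL_OFFSET: TRAMPOLINE_PA}

# The intermediate map carries both translations at once.
intermediate_map = {**ovmf_map, **source_map}

def translate(page_table, va):
    """Walk the toy page table; None models a page fault."""
    return page_table.get(va)

# Phase 2 is fetched at the OVMF VA; after CR3 moves to the intermediate
# map, that VA still translates:
assert translate(intermediate_map, TRAMPOLINE_PA) == TRAMPOLINE_PA

# Phase 3 jumps to the source-side VA, which the intermediate map also knows:
assert translate(intermediate_map, TRAMPOLINE_PA + KERNEL_OFFSET) == TRAMPOLINE_PA

# Once CR3 points at the source map, only the source VA resolves -- which is
# why the jump to the source-side VA must happen before that final switch.
assert translate(source_map, TRAMPOLINE_PA + KERNEL_OFFSET) == TRAMPOLINE_PA
assert translate(source_map, TRAMPOLINE_PA) is None
print("trampoline VAs resolve at every phase")
```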
>> > > > >
>> > > > > Given that we can resume to a snapshot in OVMF, we should be able to migrate an SEV guest as long as we can securely communicate the VM snapshot from source to destination. For our demo, we do this with a handful of QMP commands. More sophisticated methods are required for a production implementation.
>> > > > >
>> > > > > When we refer to a snapshot, what we really mean is the device state, memory, and CPU state of a guest. In live migration, this is transmitted dynamically as opposed to being saved and restored. Device state is not protected by SEV and can be handled entirely by the HV. Memory, on the other hand, cannot be handled only by the HV. As mentioned previously, memory needs to be encrypted with a transport key. A Migration Handler on the source will coordinate with the HV to encrypt pages and transmit them to the destination. The destination HV will receive the pages over the network and pass them to the Migration Handler in the target VM so they can be decrypted. This transmission will occur continuously until the memory of the source and target converges.
>> > > > >
>> > > > > Plain SEV does not protect the CPU state of the guest and therefore does not require any special mechanism for transmission of the CPU state. We plan to implement an end-to-end migration with plain SEV first. In SEV-ES, the PSP (platform security processor) encrypts CPU state on each VMExit. The encrypted state is stored in memory. Normally this memory (known as the VMSA) is not mapped into the guest, but we can add an entry to the nested page tables that will expose the VMSA to the guest.
>> > > > > This means that when the guest VMExits, the CPU state will be saved to guest memory. With the CPU state in guest memory, it can be transmitted to the target using the method described above.
>> > > > >
>> > > > > In addition to the changes needed in OVMF to resume the VM, the transmission of the VM from source to target will require a new code path in the hypervisor. There will also need to be a few minor changes to Linux (adding a mapping for our Phase 3 pages). Despite all the moving pieces, we believe that this is a feasible approach for supporting live migration for SEV and SEV-ES.
>> > > > >
>> > > > > For the sake of brevity, we have left out a few issues, including SMP support, generation of the intermediate mappings, and more. We have included some notes about these issues in the COMPLICATIONS.md file. We also have an outline of an end-to-end implementation of live migration for SEV-ES in END-TO-END.md. See README.md for info on how to run the demo. While this is not a full migration, we hope to show that fast live migration with SEV and SEV-ES is possible without major kernel changes.
>> > > > >
>> > > > > -Tobin
>> > > >
>> > > > the one word that comes to my mind upon reading the above is, "overwhelming".
>> > > >
>> > > > (I have not been addressed directly, but:
>> > > >
>> > > > - the subject says "RFC",
>> > > >
>> > > > - and the documentation at https://github.com/secure-migration/resume-from-edk2-tooling#what-changes-did-we-make states that AmdSevPkg was created for convenience, and that the feature could be integrated into OVMF. (Paraphrased.)
>> > > >
>> > > > So I guess it's tolerable if I make a comment :)
>> > > >
>> > > We've been looking forward to your perspective.
>> > >
>> > > > I've checked out the "mh-state-dev" branch of . It has 80 commits on top of edk2 master (base commit: d5339c04d7cd, "UefiCpuPkg/MpInitLib: Add missing explicit PcdLib dependency", 2020-04-23).
>> > > >
>> > > > These commits were authored over the 6-7 months since April. It's obviously huge work. To me, most of these commits clearly aim at getting the demo / proof-of-concept functional, rather than guiding (more precisely: hand-holding) reviewers through the construction of the feature.
>> > > >
>> > > > In my opinion, the series is not upstreamable in its current format (which is presently not much more readable than a single-commit code drop). Upstreaming is probably not your intent, either, at this time.
>> > > >
>> > > > I agree that getting feedback ("buy-in") at this level of maturity is justified from your POV, before you invest more work into cleaning up / restructuring the series.
>> > > >
>> > > > My problem is that "hand-holding" is exactly what I'd need -- I cannot dedicate one or two weeks, as an indivisible block, to understanding your design. Nor can I approach the series patch-wise in its current format. Personally, I would need the patch series to lead me through the whole design with baby steps ("ELI5"), meaning small code changes and detailed commit messages. I'd *also* need the more comprehensive guide-like documentation, as background material.
>> > > >
>> > > > Furthermore, I don't have an environment where I can test this proof-of-concept (and provide you with further incentive for cleaning up the series, by reporting success).
>> > > >
>> > > > So I hope others can spend the time discussing the design with you, and testing / repeating the demo. For me to review the patches, the patches should condense and replay your thinking process from the last 7 months, in as small as possible logical steps. (On the list.)
>> > > >
>> > > I completely understand your position. This PoC has a lot of new ideas in it, and you're right that our main priority was not to hand-hold/guide reviewers through the code.
>> > >
>> > > One thing that is worth emphasizing is that the pieces we are showcasing here are not the immediate priority when it comes to upstreaming. Specifically, we looked into the trampoline to make sure it was possible to migrate CPU state via firmware. While we need this for SEV-ES and our goal is to support SEV-ES, it is not the first step. We are currently working on a PoC for a full end-to-end migration with SEV (non-ES), which may be a better place for us to begin a serious discussion about getting things upstream. We will focus more on making these patches accessible to the upstream community.
>> >
>> > With my migration maintainer hat on, I'd like to understand a bit more about these different approaches; they could be quite invasive, so I'd like to make sure we're not doing one and throwing it away - it would be great if you could explain your non-ES approach; you don't need to have POC code to explain it.
>> >
>> Our non-ES approach is a subset of our ES approach. For ES, the Migration Handler in the guest needs to help out with memory and CPU state. For plain SEV, the HV can set the CPU state, but we still need a way to transfer the memory. The current POC only deals with the CPU state.
>>
>> We're still working out some of the details in QEMU, but the basic idea of transferring memory is that each time the HV needs to send a page to the target, it will ask the Migration Handler in the guest for a version of the page that is encrypted with a transport key. Since the MH is inside the guest, it can read from any address in guest memory. The Migration Handlers on the source and the target will share a key. Once the source encrypts the requested page with the transport key, it can safely hand it off to the HV. Once the page reaches the target, the target HV will pass the page into the Migration Handler, which will decrypt it using the transport key and move the page to the appropriate address.
>>
>> A few things to note:
>>
>> - The Migration Handler on the source needs to be running in the guest alongside the VM. On the target, the MH needs to start up before we can receive any pages. In both cases we are thinking that an additional vCPU can be started for the MH to run on. This could be spawned dynamically or live for the duration of the guest.
>>
>> - We need to make sure that the Migration Handler on the target does not overwrite itself when it receives pages from the source.
>> Since we run the same firmware on the source and target, and since the MH is runtime code, the memory footprint of the MH should match on the source and the target. We will need to make sure there are no weird relocations.
>>
>> - There are some complexities arising from the fact that not every page in an SEV VM is encrypted. We are looking into the best way to handle encrypted vs. shared pages.
>>
> Raising this question here as part of this discussion ... are you thinking of adding the page encryption bitmap (as we do for the slow migration patches) here to figure out if the guest pages are encrypted or not?

We are using the bitmap for the first iteration of our end-to-end POC. The page encryption status will need notifications from the guest kernel and OVMF.

> Additionally, is the page encryption bitmap support going to be added as a hypercall interface to the guest, which also means that the guest kernel needs to be modified?

Although the bitmap is handy, we would like to avoid the patches you are alluding to. We are currently looking into how we can eliminate the bitmap.

-Tobin

> Thanks,
> Ashish
>
>> Hopefully those notes don't confound my earlier explanation too much. I think that's most of the picture for non-ES migration. Let me know if you have any questions. ES migration would use the same approach for transferring memory.
>>
>> -Tobin
>>
>> > Dave
>> >
>> > > In the meantime, perhaps there is something we can do to help make our current work more clear. We could potentially explain things on a call or create some additional documentation. While our goal is not to shove this version of the trampoline upstream, it is significant to our plan as a whole and we want to help people understand it.
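[Editor's note: the per-page Migration Handler exchange and the encryption bitmap discussed above can be sketched as a toy model. This is not the actual implementation: the XOR keystream stands in for whatever real cipher the transport key would drive, and the pages, guest-physical addresses, and bitmap contents are invented for illustration.]

```python
# Toy model of the MH flow: the HV asks the source MH for a
# transport-key-encrypted copy of each private page, ships the opaque blob,
# and the target MH decrypts it into place. Shared pages bypass the MH.
import hashlib

PAGE_SIZE = 4096

def keystream(key: bytes, gpa: int, length: int) -> bytes:
    # Derive a per-page keystream from the shared transport key. A real
    # design would use an authenticated cipher, not this SHA-256 stream.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + gpa.to_bytes(8, "little") +
                              counter.to_bytes(8, "little")).digest()
        counter += 1
    return out[:length]

def mh_encrypt_page(key: bytes, gpa: int, data: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, keystream(key, gpa, len(data))))

mh_decrypt_page = mh_encrypt_page  # XOR is its own inverse

transport_key = b"shared-between-source-and-target"  # established out of band

# Source guest memory: one encrypted (private) page, one shared page.
source_memory = {0x1000: b"\xAA" * PAGE_SIZE, 0x2000: b"\xBB" * PAGE_SIZE}
encrypted_bitmap = {0x1000: True, 0x2000: False}  # True -> must go via the MH

target_memory = {}
for gpa, page in source_memory.items():
    if encrypted_bitmap[gpa]:
        blob = mh_encrypt_page(transport_key, gpa, page)   # source MH
        assert blob != page                                # HV sees only ciphertext
        target_memory[gpa] = mh_decrypt_page(transport_key, gpa, blob)  # target MH
    else:
        target_memory[gpa] = page  # shared page: the HV can copy it directly

assert target_memory == source_memory
print("all pages migrated intact")
```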
>> > >
>> > > -Tobin
>> > >
>> > > > I really don't want to be the bottleneck here, which is why I would support introducing this feature as a separate top-level package (AmdSevPkg).
>> > > >
>> > > > Thanks
>> > > > Laszlo
>> > >