* RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept
@ 2020-10-28 19:31 Tobin Feldman-Fitzthum
  2020-10-29 17:06 ` Ashish Kalra
  2020-11-03 14:59 ` [edk2-devel] " Laszlo Ersek
  0 siblings, 2 replies; 16+ messages in thread
From: Tobin Feldman-Fitzthum @ 2020-10-28 19:31 UTC
To: devel
Cc: dovmurik, Dov.Murik1, ashish.kalra, brijesh.singh, tobin,
    david.kaplan, jon.grimm, thomas.lendacky, thomas.lendacky, jejb,
    frankeh

Hello,

Dov Murik, James Bottomley, Hubertus Franke, and I have been working on a
plan for fast live migration of SEV and SEV-ES (and SEV-SNP when it's out
and even hopefully Intel TDX) VMs. We have developed an approach that we
believe is feasible and a demonstration that shows our solution to the most
difficult part of the problem. In short, we have implemented a UEFI
Application that can resume from a VM snapshot. We think this is the crux of
SEV-ES live migration. After describing the context of our demo and how it
works, we explain how it can be extended to a full SEV-ES migration. Our
goal is to show that fast SEV and SEV-ES live migration can be implemented
in OVMF with minimal kernel changes. We provide a blueprint for doing so.

Typically the hypervisor facilitates live migration. AMD SEV excludes the
hypervisor from the trust domain of the guest. When a hypervisor (HV)
examines the memory of an SEV guest, it will find only ciphertext. If the
HV moves the memory of an SEV guest, the ciphertext will be invalidated.
Furthermore, with SEV-ES the hypervisor is largely unable to access guest
CPU state. Thus, fast migration of SEV VMs requires support from inside the
trust domain, i.e. the guest.

One approach is to add support for SEV Migration to the Linux kernel. This
would allow the guest to encrypt/decrypt its own memory with a transport
key. This approach has met some resistance. We propose a similar approach
implemented not in Linux, but in firmware, specifically OVMF.
Since OVMF runs inside the guest, it has access to the guest memory and CPU
state. OVMF should be able to perform the manipulations required for live
migration of SEV and SEV-ES guests.

The biggest challenge of this approach involves migrating the CPU state of
an SEV-ES guest. In a normal (non-SEV) migration, the HV sets the CPU state
of the target before the target begins executing. In our approach, the HV
starts the target and OVMF must resume to whatever state the source was in.
We believe this to be the crux (or at least the most difficult part) of live
migration for SEV and we hope that by demonstrating resume from EFI, we can
show that our approach is generally feasible.

Our demo can be found at <https://github.com/secure-migration>.
The tooling repository is the best starting point. It contains documentation
about the project and the scripts needed to run the demo. There are two more
repos associated with the project. One is a modified edk2 tree that contains
our modified OVMF. The other is a modified qemu that has a couple of
temporary changes needed for the demo. Our demonstration is aimed only at
resuming from a VM snapshot in OVMF. We provide the source CPU state and
source memory to the destination using temporary plumbing that violates the
SEV trust model. We explain the setup in more depth in README.md. We are
showing only that OVMF can resume from a VM snapshot. At the end we will
describe our plan for transferring CPU state and memory from source to
target. To be clear, the temporary tooling used for this demo isn't built
for encrypted VMs, but below we explain how this demo applies to and can be
extended to encrypted VMs.

We implemented our resume code in a very similar fashion to the recommended
S3 resume code. When the HV sets the CPU state of a guest, it can do so when
the guest is not executing. Setting the state from inside the guest is a
delicate operation. There is no way to atomically set all of the CPU state
from inside the guest.
Instead, we must set most registers individually and account for changes in
control flow that doing so might cause. We do this with a three-phase
trampoline. OVMF calls phase 1, which runs on the OVMF map. Phase 1 sets up
phase 2 and jumps to it. Phase 2 switches to an intermediate map that
reconciles the OVMF map and the source map. Phase 3 switches to the source
map, restores the registers, and returns into execution of the source. We
will go backwards through these phases in more depth.

The last thing that resume from EFI does is return. Specifically, we use
IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS from a
temporary stack and restores them atomically, thus returning to source
execution. Prior to returning, we must manually restore most other registers
to the values they had on the source. One particularly significant register
is CR3. When we return to Linux, CR3 must be set to the source CR3 or the
first instruction executed in Linux will cause a page fault. The code that
we use to restore the registers and return must be mapped in the source page
table or we would get a page fault executing the instructions prior to
returning into Linux. The value of CR3 is so significant that it defines
the three phases of the trampoline. Phase 3 begins when CR3 is set to the
source CR3. After setting CR3, we set all the other registers and return.

Phase 2 mainly exists to set up phase 3. OVMF uses a 1-1 mapping, meaning
that virtual addresses are the same as physical addresses. The kernel page
table uses an offset mapping, meaning that virtual addresses differ from
physical addresses by a constant (for the most part). Crucially, this means
that the virtual address of the page that is executed by phase 3 differs
between the OVMF map and the source map.
If we are executing code mapped in OVMF and we change CR3 to point to the
source map, although the page may be mapped in the source map, the virtual
address will be different, and we will face undefined behavior. To fix this,
we construct intermediate page tables that map the pages for phase 2 and 3
to the virtual address expected in OVMF and to the virtual address expected
in the source map. Thus, we can switch CR3 from OVMF's map to the
intermediate map and then from the intermediate map to the source map.
Phase 2 is much shorter than phase 3. Phase 2 is mainly responsible for
switching to the intermediate map, flushing the TLB, and jumping to phase 3.

Fortunately, phase 1 is even simpler than phase 2. Phase 1 has two duties.
First, since phase 2 and 3 operate without a stack and can't access values
defined in OVMF (such as the addresses of the pages containing phase 2 and
3), phase 1 must pass these values to phase 2 by putting them in registers.
Second, phase 1 must start phase 2 by jumping to it.

Given that we can resume to a snapshot in OVMF, we should be able to migrate
an SEV guest as long as we can securely communicate the VM snapshot from
source to destination. For our demo, we do this with a handful of QMP
commands. More sophisticated methods are required for a production
implementation.

When we refer to a snapshot, what we really mean is the device state,
memory, and CPU state of a guest. In live migration this is transmitted
dynamically as opposed to being saved and restored. Device state is not
protected by SEV and can be handled entirely by the HV. Memory, on the other
hand, cannot be handled only by the HV. As mentioned previously, memory
needs to be encrypted with a transport key. A Migration Handler on the
source will coordinate with the HV to encrypt pages and transmit them to the
destination. The destination HV will receive the pages over the network and
pass them to the Migration Handler in the target VM so they can be
decrypted.
This transmission will occur continuously until the memory of the source and
target converges.

Plain SEV does not protect the CPU state of the guest and therefore does not
require any special mechanism for transmission of the CPU state. We plan to
implement an end-to-end migration with plain SEV first. In SEV-ES, the PSP
(platform security processor) encrypts CPU state on each VMExit. The
encrypted state is stored in memory. Normally this memory (known as the
VMSA) is not mapped into the guest, but we can add an entry to the nested
page tables that will expose the VMSA to the guest. This means that when the
guest VMExits, the CPU state will be saved to guest memory. With the CPU
state in guest memory, it can be transmitted to the target using the method
described above.

In addition to the changes needed in OVMF to resume the VM, the transmission
of the VM from source to target will require a new code path in the
hypervisor. There will also need to be a few minor changes to Linux (adding
a mapping for our Phase 3 pages). Despite all the moving pieces, we believe
that this is a feasible approach for supporting live migration for SEV and
SEV-ES.

For the sake of brevity, we have left out a few issues, including SMP
support, generation of the intermediate mappings, and more. We have included
some notes about these issues in the COMPLICATIONS.md file. We also have an
outline of an end-to-end implementation of live migration for SEV-ES in
END-TO-END.md. See README.md for info on how to run the demo. While this is
not a full migration, we hope to show that fast live migration with SEV and
SEV-ES is possible without major kernel changes.

-Tobin
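
The CR3 hand-off behind the three-phase trampoline in the message above can
be modeled with a small Python toy. This is only a sketch under stated
assumptions, not the code from the demo repositories: page tables are plain
dictionaries from virtual page to physical page, and KERNEL_OFFSET,
identity_map, offset_map, intermediate_map, switch_cr3 and run_trampoline
are hypothetical names chosen for illustration. The model checks one
property: at every CR3 write, the address currently being executed must
resolve to the same physical page in the old map and in the new map.

```python
PAGE = 0x1000
# Illustrative constant only: stands in for the kernel's offset mapping.
KERNEL_OFFSET = 0xFFFF_8880_0000_0000


def identity_map(phys_pages):
    """OVMF-style 1-1 map: virtual address equals physical address."""
    return {p: p for p in phys_pages}


def offset_map(phys_pages, offset=KERNEL_OFFSET):
    """Kernel-style map: virtual address equals physical address plus a constant."""
    return {p + offset: p for p in phys_pages}


def intermediate_map(trampoline_pages, offset=KERNEL_OFFSET):
    """Map the phase 2 and 3 pages at both their OVMF and their source addresses."""
    table = identity_map(trampoline_pages)
    table.update(offset_map(trampoline_pages, offset))
    return table


def switch_cr3(rip, old_map, new_map):
    """Model a CR3 write: the next fetch at rip must hit the same physical page."""
    page = rip & ~(PAGE - 1)
    if new_map.get(page) != old_map[page]:
        raise RuntimeError(f"fetch at {rip:#x} would not hit the same code after the switch")
    return new_map


def run_trampoline(ovmf, inter, source, phase2_page, phase3_page, offset=KERNEL_OFFSET):
    """Walk the two CR3 switches and return the final rip and active map."""
    # Phase 1 runs on the OVMF map and jumps to phase 2 at its identity address.
    rip = phase2_page
    # Phase 2: switch from the OVMF map to the intermediate map.
    active = switch_cr3(rip, ovmf, inter)
    # Jump to phase 3 through the alias that the source map also provides.
    rip = phase3_page + offset
    if active.get(rip) != phase3_page:
        raise RuntimeError("phase 3 alias is missing from the intermediate map")
    # Phase 3: switch from the intermediate map to the source map.
    active = switch_cr3(rip, active, source)
    return rip, active


if __name__ == "__main__":
    ram = [i * PAGE for i in range(16)]
    ovmf, source = identity_map(ram), offset_map(ram)
    inter = intermediate_map([0x5000, 0x6000])
    rip, active = run_trampoline(ovmf, inter, source, 0x5000, 0x6000)
    print(f"resumed on source map at {rip:#x}")
    try:
        switch_cr3(0x5000, ovmf, source)  # skipping the intermediate map
    except RuntimeError as err:
        print("direct switch fails:", err)
```

In this model a direct switch from the OVMF map to the source map fails
because the identity address of the running page has no entry in the source
map, which is the problem the intermediate map is meant to solve. Register
restore, the TLB flush and the final IRETQ are deliberately left out.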
* Re: RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept
  2020-10-28 19:31 RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept Tobin Feldman-Fitzthum
@ 2020-10-29 17:06 ` Ashish Kalra
  2020-10-29 20:36   ` tobin
  2020-11-03 14:59 ` [edk2-devel] " Laszlo Ersek
  1 sibling, 1 reply; 16+ messages in thread
From: Ashish Kalra @ 2020-10-29 17:06 UTC
To: Tobin Feldman-Fitzthum
Cc: devel, dovmurik, Dov.Murik1, brijesh.singh, tobin, david.kaplan,
    jon.grimm, thomas.lendacky, jejb, frankeh

Hello Tobin,

On Wed, Oct 28, 2020 at 03:31:44PM -0400, Tobin Feldman-Fitzthum wrote:
> Plain SEV does not protect the CPU state of the guest and therefore does not
> require any special mechanism for transmission of the CPU state. We plan to
> implement an end-to-end migration with plain SEV first. In SEV-ES, the PSP
> (platform security processor) encrypts CPU state on each VMExit. The
> encrypted state is stored in memory. Normally this memory (known as the
> VMSA) is not mapped into the guest, but we can add an entry to the nested
> page tables that will expose the VMSA to the guest.

I have a question here: is there any kind of integrity protection on the
CPU state when the target VM is resumed after migration? For example, if a
malicious hypervisor maps a page with subverted CPU state in the nested
page tables, what prevents the target VM from resuming execution on a
subverted or compromised CPU state?

Thanks,
Ashish

> This means that when the
> guest VMExits, the CPU state will be saved to guest memory. With the CPU
> state in guest memory, it can be transmitted to the target using the method
> described above.
* Re: RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept
  2020-10-29 17:06 ` Ashish Kalra
@ 2020-10-29 20:36   ` tobin
  2020-10-30 18:35     ` Ashish Kalra
  0 siblings, 1 reply; 16+ messages in thread
From: tobin @ 2020-10-29 20:36 UTC
To: Ashish Kalra
Cc: devel, dovmurik, Dov.Murik1, brijesh.singh, tobin, david.kaplan,
    jon.grimm, thomas.lendacky, jejb, frankeh

On 2020-10-29 13:06, Ashish Kalra wrote:
> On Wed, Oct 28, 2020 at 03:31:44PM -0400, Tobin Feldman-Fitzthum wrote:
>> Plain SEV does not protect the CPU state of the guest and therefore
>> does not require any special mechanism for transmission of the CPU
>> state. We plan to implement an end-to-end migration with plain SEV
>> first. In SEV-ES, the PSP (platform security processor) encrypts CPU
>> state on each VMExit. The encrypted state is stored in memory.
>> Normally this memory (known as the VMSA) is not mapped into the guest,
>> but we can add an entry to the nested page tables that will expose the
>> VMSA to the guest.
>
> I have a question here, is there any kind of integrity protection on
> the CPU state when the target VM is resumed after migration, for
> example, if there is a malicious hypervisor which maps a page with
> subverted CPU state on the nested page tables, what prevents the
> target VM to resume execution on a subverted or compromised CPU state ?

Good question. Here is my thinking. The VMSA is mapped in the guest memory.
It will be transmitted to the target like any other page, with encryption
and integrity checking. So we have integrity checking for CPU state while
it is in flight.

I think you are wondering about something slightly different, though. Once
the page with the VMSA arrives at the target and is decrypted and put in
place, the hypervisor could potentially change the NPT to replace the data.
Since the page with the VMSA will be encrypted (and the Migration Handler
will expect this), the HV can't replace the page with arbitrary values.

Since the VMSA is in memory, we have the protections that SEV provides for
memory. Prior to SNP, this does not include integrity protection. The HV
could attempt a replay attack by replacing the page with the VMSA with an
older version of the same page. That said, the target will have just booted
so there isn't much to replay.

If we really need to, we could add functionality to the Migration Handler
that would allow the HV to ask for an HMAC of the VMSA on the source. The
Migration Handler on the target could use this to verify the VMSA just
prior to starting the trampoline. Given the above, I am not sure this is
necessary. Hopefully I've understood the attack you're suggesting
correctly.

-Tobin
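
The optional HMAC check on the VMSA mentioned in the reply above might look
roughly like the following Python sketch. It is not part of the proposal's
code. It assumes the two Migration Handlers already share a secret
(integrity_key is a hypothetical name, and how that key is established is
out of scope) and that the VMSA is handled as one 4 KiB page. The function
names vmsa_tag and verify_vmsa are likewise made up for illustration; only
the standard library hmac and hashlib calls are real.

```python
import hashlib
import hmac

VMSA_SIZE = 4096  # assumption: the VMSA is treated as a single 4 KiB page


def vmsa_tag(integrity_key: bytes, vmsa_page: bytes) -> bytes:
    """Source side: compute an HMAC-SHA256 tag over the VMSA page."""
    if len(vmsa_page) != VMSA_SIZE:
        raise ValueError("expected exactly one VMSA page")
    return hmac.new(integrity_key, vmsa_page, hashlib.sha256).digest()


def verify_vmsa(integrity_key: bytes, vmsa_page: bytes, expected_tag: bytes) -> bool:
    """Target side: check the page against the tag just before the trampoline."""
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(vmsa_tag(integrity_key, vmsa_page), expected_tag)


if __name__ == "__main__":
    key = bytes(32)                     # placeholder key for the demo only
    current = bytes(range(256)) * 16    # stand-in for the VMSA at hand-off
    stale = bytes(VMSA_SIZE)            # stand-in for an older copy of the page
    tag = vmsa_tag(key, current)
    print("current page accepted:", verify_vmsa(key, current, tag))
    print("stale page accepted:", verify_vmsa(key, stale, tag))
```

Because the tag is computed over the page contents at hand-off, an older
copy of the same page swapped in through the nested page tables would fail
verification, which is the replay case discussed in the reply.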
* Re: RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept 2020-10-29 20:36 ` tobin @ 2020-10-30 18:35 ` Ashish Kalra 0 siblings, 0 replies; 16+ messages in thread From: Ashish Kalra @ 2020-10-30 18:35 UTC (permalink / raw) To: Tobin Feldman-Fitzthum Cc: devel, dovmurik, Dov.Murik1, brijesh.singh, tobin, david.kaplan, jon.grimm, thomas.lendacky, jejb, frankeh Hello Tobin, On Thu, Oct 29, 2020 at 04:36:07PM -0400, Tobin Feldman-Fitzthum wrote: > On 2020-10-29 13:06, Ashish Kalra wrote: > > Hello Tobin, > > > > On Wed, Oct 28, 2020 at 03:31:44PM -0400, Tobin Feldman-Fitzthum wrote: > > > Hello, > > > > > > Dov Murik. James Bottomley, Hubertus Franke, and I have been working > > > on a > > > plan for fast live migration of SEV and SEV-ES (and SEV-SNP when > > > it's out > > > and even hopefully Intel TDX) VMs. We have developed an approach > > > that we > > > believe is feasible and a demonstration that shows our solution to > > > the most > > > difficult part of the problem. In short, we have implemented a UEFI > > > Application that can resume from a VM snapshot. We think this is the > > > crux of > > > SEV-ES live migration. After describing the context of our demo and > > > how it > > > works, we explain how it can be extended to a full SEV-ES migration. > > > Our > > > goal is to show that fast SEV and SEV-ES live migration can be > > > implemented > > > in OVMF with minimal kernel changes. We provide a blueprint for > > > doing so. > > > > > > Typically the hypervisor facilitates live migration. AMD SEV > > > excludes the > > > hypervisor from the trust domain of the guest. When a hypervisor (HV) > > > examines the memory of an SEV guest, it will find only a ciphertext. > > > If the > > > HV moves the memory of an SEV guest, the ciphertext will be > > > invalidated. > > > Furthermore, with SEV-ES the hypervisor is largely unable to access > > > guest > > > CPU state. 
Thus, fast migration of SEV VMs requires support from > > > inside the > > > trust domain, i.e. the guest. > > > > > > One approach is to add support for SEV Migration to the Linux > > > kernel. This > > > would allow the guest to encrypt/decrypt its own memory with a > > > transport > > > key. This approach has met some resistance. We propose a similar > > > approach > > > implemented not in Linux, but in firmware, specifically OVMF. Since > > > OVMF > > > runs inside the guest, it has access to the guest memory and CPU > > > state. OVMF > > > should be able to perform the manipulations required for live > > > migration of > > > SEV and SEV-ES guests. > > > > > > The biggest challenge of this approach involves migrating the CPU > > > state of > > > an SEV-ES guest. In a normal (non-SEV migration) the HV sets the CPU > > > state > > > of the target before the target begins executing. In our approach, > > > the HV > > > starts the target and OVMF must resume to whatever state the source > > > was in. > > > We believe this to be the crux (or at least the most difficult part) > > > of live > > > migration for SEV and we hope that by demonstrating resume from EFI, > > > we can > > > show that our approach is generally feasible. > > > > > > Our demo can be found at <https://github.com/secure-migration>. > > > The tooling repository is the best starting point. It contains > > > documentation > > > about the project and the scripts needed to run the demo. There are > > > two more > > > repos associated with the project. One is a modified edk2 tree that > > > contains > > > our modified OVMF.
The other is a modified qemu, that has a couple of > > > temporary changes needed for the demo. Our demonstration is aimed > > > only at > > > resuming from a VM snapshot in OVMF. We provide the source CPU state > > > and > > > source memory to the destination using temporary plumbing that > > > violates the > > > SEV trust model. We explain the setup in more depth in README.md. We > > > are > > > showing only that OVMF can resume from a VM snapshot. At the end we > > > will > > > describe our plan for transferring CPU state and memory from source to > > > guest. To be clear, the temporary tooling used for this demo isn't > > > built for > > > encrypted VMs, but below we explain how this demo applies to and can > > > be > > > extended to encrypted VMs. > > > > > > We Implemented our resume code in a very similar fashion to the > > > recommended > > > S3 resume code. When the HV sets the CPU state of a guest, it can do > > > so when > > > the guest is not executing. Setting the state from inside the guest > > > is a > > > delicate operation. There is no way to atomically set all of the CPU > > > state > > > from inside the guest. Instead, we must set most registers > > > individually and > > > account for changes in control flow that doing so might cause. We do > > > this > > > with a three-phase trampoline. OVMF calls phase 1, which runs on the > > > OVMF > > > map. Phase 1 sets up phase 2 and jumps to it. Phase 2 switches to an > > > intermediate map that reconciles the OVMF map and the source map. > > > Phase 3 > > > switches to the source map, restores the registers, and returns into > > > execution of the source. We will go backwards through these phases > > > in more > > > depth. > > > > > > The last thing that resume to EFI does is return. Specifically, we use > > > IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS from a > > > temporary stack and restores them atomically, thus returning to source > > > execution. 
Prior to returning, we must manually restore most other > > > registers > > > to the values they had on the source. One particularly significant > > > register > > > is CR3. When we return to Linux, CR3 must be set to the source CR3 > > > or the > > > first instruction executed in Linux will cause a page fault. The > > > code that > > > we use to restore the registers and return must be mapped in the > > > source page > > > table or we would get a page fault executing the instructions prior to > > > returning into Linux. The value of CR3 is so significant, that it > > > defines > > > the three phases of the trampoline. Phase 3 begins when CR3 is set > > > to the > > > source CR3. After setting CR3, we set all the other registers and > > > return. > > > > > > Phase 2 mainly exists to setup phase 3. OVMF uses a 1-1 mapping, > > > meaning > > > that virtual addresses are the same as physical addresses. The > > > kernel page > > > table uses an offset mapping, meaning that virtual addresses differ > > > from > > > physical addresses by a constant (for the most part). Crucially, > > > this means > > > that the virtual address of the page that is executed by phase 3 > > > differs > > > between the OVMF map and the source map. If we are executing code > > > mapped in > > > OVMF and we change CR3 to point to the source map, although the page > > > may be > > > mapped in the source map, the virtual address will be different, and > > > we will > > > face undefined behavior. To fix this, we construct intermediate page > > > tables > > > that map the pages for phase 2 and 3 to the virtual address expected > > > in OVMF > > > and to the virtual address expected in the source map. Thus, we can > > > switch > > > CR3 from OVMF's map to the intermediate map and then from the > > > intermediate > > > map to the source map. Phase 2 is much shorter than phase 3. 
Phase 2 > > > is > > > mainly responsible for switching to the intermediate map, flushing > > > the TLB, > > > and jumping to phase 3. > > > > > > Fortunately phase 1 is even simpler than phase 2. Phase 1 has two > > > duties. > > > First, since phase 2 and 3 operate without a stack and can't access > > > values > > > defined in OVMF (such as the addresses of the pages containing phase > > > 2 and > > > 3), phase 1 must pass these values to phase 2 by putting them in > > > registers. > > > Second, phase 1 must start phase 2 by jumping to it. > > > > > > Given that we can resume to a snapshot in OVMF, we should be able to > > > migrate > > > an SEV guest as long as we can securely communicate the VM snapshot > > > from > > > source to destination. For our demo, we do this with a handful of QMP > > > commands. More sophisticated methods are required for a production > > > implementation. > > > > > > When we refer to a snapshot, what we really mean is the device state, > > > memory, and CPU state of a guest. In live migration this is > > > transmitted > > > dynamically as opposed to being saved and restored. Device state is > > > not > > > protected by SEV and can be handled entirely by the HV. Memory, on > > > the other > > > hand, cannot be handled only by the HV. As mentioned previously, > > > memory > > > needs to be encrypted with a transport key. A Migration Handler on the > > > source will coordinate with the HV to encrypt pages and transmit > > > them to the > > > destination. The destination HV will receive the pages over the > > > network and > > > pass them to the Migration Handler in the target VM so they can be > > > decrypted. This transmission will occur continuously until the > > > memory of the > > > source and target converges. > > > > > > Plain SEV does not protect the CPU state of the guest and therefore > > > does not > > > require any special mechanism for transmission of the CPU state. 
We > > > plan to > > > implement an end-to-end migration with plain SEV first. In SEV-ES, > > > the PSP > > > (platform security processor) encrypts CPU state on each VMExit. The > > > encrypted state is stored in memory. Normally this memory (known as > > > the > > > VMSA) is not mapped into the guest, but we can add an entry to the > > > nested > > > page tables that will expose the VMSA to the guest. > > > > I have a question here, is there any kind of integrity protection on the > > CPU state when the target VM is resumed after migration, for example, if > > there is a malicious hypervisor which maps a page with subverted CPU > > state on the nested page tables, what prevents the target VM from resuming > > execution on a subverted or compromised CPU state? > > Good question. Here is my thinking. The VMSA is mapped in the guest memory. > It will be transmitted to the target like any other page, with encryption > and integrity-checking. So we have integrity checking for CPU state while > it is in flight. > > I think you are wondering something slightly different, though. Once the > page with the VMSA arrives at the target and is decrypted and put in place, > the hypervisor could potentially change the NPT to replace the data. Since > the page with the VMSA will be encrypted (and the Migration Handler will > expect this), the HV can't replace the page with arbitrary values. > > Since the VMSA is in memory, we have the protections that SEV provides > for memory. Prior to SNP, this does not include integrity protection. > The HV could attempt a replay attack by replacing the page with the > VMSA with an older version of the same page. That said, the target will > have just booted so there isn't much to replay. > > If we really need to, we could add functionality to the Migration Handler > that would allow the HV to ask for an HMAC of the VMSA on the source. > The Migration Handler on the target could use this to verify the VMSA > just prior to starting the trampoline.
Given the above, I am not sure > this is necessary. Hopefully I've understood the attack you're suggesting > correctly. > Yes, this is the attack I am suggesting: a compromised or malicious hypervisor replaces the page containing the CPU state with compromised data in the NPT when the target VM starts. Thanks, Ashish > > > This means that when the > > > guest VMExits, the CPU state will be saved to guest memory. With the > > > CPU > > > state in guest memory, it can be transmitted to the target using the > > > method > > > described above. > > > > > > In addition to the changes needed in OVMF to resume the VM, the > > > transmission > > > of the VM from source to target will require a new code path in the > > > hypervisor. There will also need to be a few minor changes to Linux > > > (adding > > > a mapping for our Phase 3 pages). Despite all the moving pieces, we > > > believe > > > that this is a feasible approach for supporting live migration for > > > SEV and > > > SEV-ES. > > > > > > For the sake of brevity, we have left out a few issues, including SMP > > > support, generation of the intermediate mappings, and more. We have > > > included > > > some notes about these issues in the COMPLICATIONS.md file. We also > > > have an > > > outline of an end-to-end implementation of live migration for SEV-ES > > > in > > > END-TO-END.md. See README.md for info on how to run the demo. While > > > this is > > > not a full migration, we hope to show that fast live migration with > > > SEV and > > > SEV-ES is possible without major kernel changes. > > > > > > -Tobin > > > ^ permalink raw reply [flat|nested] 16+ messages in thread
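The HMAC check floated in the exchange above (the source Migration Handler produces an HMAC of the VMSA, and the target Migration Handler verifies it just before starting the trampoline) might look roughly like the following. This is only a minimal sketch, not code from the demo repositories: the function names, the 4 KiB VMSA size, the choice of HMAC-SHA256, and the existence of a secret key shared between the two Migration Handlers are all assumptions made for illustration.

```python
import hashlib
import hmac
import os

VMSA_SIZE = 4096  # assumption: the VMSA fits in a single 4 KiB page


def vmsa_tag(key: bytes, vmsa_page: bytes) -> bytes:
    """Source side (hypothetical helper): HMAC-SHA256 over the VMSA page bytes."""
    return hmac.new(key, vmsa_page, hashlib.sha256).digest()


def vmsa_is_authentic(key: bytes, vmsa_page: bytes, expected_tag: bytes) -> bool:
    """Target side (hypothetical helper): constant-time comparison of the tags."""
    return hmac.compare_digest(vmsa_tag(key, vmsa_page), expected_tag)


# Assumption: both Migration Handlers already share this secret key.
key = os.urandom(32)

# Stand-in for the VMSA bytes as the source Migration Handler sees them.
source_vmsa = os.urandom(VMSA_SIZE)
tag = vmsa_tag(key, source_vmsa)

# A page swapped in by the hypervisor differs in at least one byte.
tampered_vmsa = bytes([source_vmsa[0] ^ 1]) + source_vmsa[1:]

print(vmsa_is_authentic(key, source_vmsa, tag))    # untouched page
print(vmsa_is_authentic(key, tampered_vmsa, tag))  # substituted page
```

Because the tag would travel through the hypervisor, a check like this only detects a page that does not match the tag it arrives with. It does not by itself establish that the page and tag pair is the most recent one; that would need some additional freshness mechanism, which is outside this sketch.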
* Re: [edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept 2020-10-28 19:31 RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept Tobin Feldman-Fitzthum 2020-10-29 17:06 ` Ashish Kalra @ 2020-11-03 14:59 ` Laszlo Ersek 2020-11-04 18:27 ` Tobin Feldman-Fitzthum 1 sibling, 1 reply; 16+ messages in thread From: Laszlo Ersek @ 2020-11-03 14:59 UTC (permalink / raw) To: tobin Cc: devel, dovmurik, Dov.Murik1, ashish.kalra, brijesh.singh, tobin, david.kaplan, jon.grimm, thomas.lendacky, jejb, frankeh, Dr. David Alan Gilbert Hi Tobin, (keeping full context -- I'm adding Dave) On 10/28/20 20:31, Tobin Feldman-Fitzthum wrote: > Hello, > > Dov Murik. James Bottomley, Hubertus Franke, and I have been working on > a plan for fast live migration of SEV and SEV-ES (and SEV-SNP when it's > out and even hopefully Intel TDX) VMs. We have developed an approach > that we believe is feasible and a demonstration that shows our solution > to the most difficult part of the problem. In short, we have implemented > a UEFI Application that can resume from a VM snapshot. We think this is > the crux of SEV-ES live migration. After describing the context of our > demo and how it works, we explain how it can be extended to a full > SEV-ES migration. Our goal is to show that fast SEV and SEV-ES live > migration can be implemented in OVMF with minimal kernel changes. We > provide a blueprint for doing so. > > Typically the hypervisor facilitates live migration. AMD SEV excludes > the hypervisor from the trust domain of the guest. When a hypervisor > (HV) examines the memory of an SEV guest, it will find only a > ciphertext. If the HV moves the memory of an SEV guest, the ciphertext > will be invalidated. Furthermore, with SEV-ES the hypervisor is largely > unable to access guest CPU state. Thus, fast migration of SEV VMs > requires support from inside the trust domain, i.e. the guest. 
> > One approach is to add support for SEV Migration to the Linux kernel. > This would allow the guest to encrypt/decrypt its own memory with a > transport key. This approach has met some resistance. We propose a > similar approach implemented not in Linux, but in firmware, specifically > OVMF. Since OVMF runs inside the guest, it has access to the guest > memory and CPU state. OVMF should be able to perform the manipulations > required for live migration of SEV and SEV-ES guests. > > The biggest challenge of this approach involves migrating the CPU state > of an SEV-ES guest. In a normal (non-SEV migration) the HV sets the CPU > state of the target before the target begins executing. In our approach, > the HV starts the target and OVMF must resume to whatever state the > source was in. We believe this to be the crux (or at least the most > difficult part) of live migration for SEV and we hope that by > demonstrating resume from EFI, we can show that our approach is > generally feasible. > > Our demo can be found at <https://github.com/secure-migration>. The > tooling repository is the best starting point. It contains documentation > about the project and the scripts needed to run the demo. There are two > more repos associated with the project. One is a modified edk2 tree that > contains our modified OVMF. The other is a modified qemu, that has a > couple of temporary changes needed for the demo. Our demonstration is > aimed only at resuming from a VM snapshot in OVMF. We provide the source > CPU state and source memory to the destination using temporary plumbing > that violates the SEV trust model. We explain the setup in more depth in > README.md. We are showing only that OVMF can resume from a VM snapshot. > At the end we will describe our plan for transferring CPU state and > memory from source to guest. 
To be clear, the temporary tooling used for > this demo isn't built for encrypted VMs, but below we explain how this > demo applies to and can be extended to encrypted VMs. > > We Implemented our resume code in a very similar fashion to the > recommended S3 resume code. When the HV sets the CPU state of a guest, > it can do so when the guest is not executing. Setting the state from > inside the guest is a delicate operation. There is no way to atomically > set all of the CPU state from inside the guest. Instead, we must set > most registers individually and account for changes in control flow that > doing so might cause. We do this with a three-phase trampoline. OVMF > calls phase 1, which runs on the OVMF map. Phase 1 sets up phase 2 and > jumps to it. Phase 2 switches to an intermediate map that reconciles the > OVMF map and the source map. Phase 3 switches to the source map, > restores the registers, and returns into execution of the source. We > will go backwards through these phases in more depth. > > The last thing that resume to EFI does is return. Specifically, we use > IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS from a > temporary stack and restores them atomically, thus returning to source > execution. Prior to returning, we must manually restore most other > registers to the values they had on the source. One particularly > significant register is CR3. When we return to Linux, CR3 must be set to > the source CR3 or the first instruction executed in Linux will cause a > page fault. The code that we use to restore the registers and return > must be mapped in the source page table or we would get a page fault > executing the instructions prior to returning into Linux. The value of > CR3 is so significant, that it defines the three phases of the > trampoline. Phase 3 begins when CR3 is set to the source CR3. After > setting CR3, we set all the other registers and return. > > Phase 2 mainly exists to setup phase 3. 
OVMF uses a 1-1 mapping, meaning > that virtual addresses are the same as physical addresses. The kernel > page table uses an offset mapping, meaning that virtual addresses differ > from physical addresses by a constant (for the most part). Crucially, > this means that the virtual address of the page that is executed by > phase 3 differs between the OVMF map and the source map. If we are > executing code mapped in OVMF and we change CR3 to point to the source > map, although the page may be mapped in the source map, the virtual > address will be different, and we will face undefined behavior. To fix > this, we construct intermediate page tables that map the pages for phase > 2 and 3 to the virtual address expected in OVMF and to the virtual > address expected in the source map. Thus, we can switch CR3 from OVMF's > map to the intermediate map and then from the intermediate map to the > source map. Phase 2 is much shorter than phase 3. Phase 2 is mainly > responsible for switching to the intermediate map, flushing the TLB, and > jumping to phase 3. > > Fortunately phase 1 is even simpler than phase 2. Phase 1 has two > duties. First, since phase 2 and 3 operate without a stack and can't > access values defined in OVMF (such as the addresses of the pages > containing phase 2 and 3), phase 1 must pass these values to phase 2 by > putting them in registers. Second, phase 1 must start phase 2 by jumping > to it. > > Given that we can resume to a snapshot in OVMF, we should be able to > migrate an SEV guest as long as we can securely communicate the VM > snapshot from source to destination. For our demo, we do this with a > handful of QMP commands. More sophisticated methods are required for a > production implementation. > > When we refer to a snapshot, what we really mean is the device state, > memory, and CPU state of a guest. In live migration this is transmitted > dynamically as opposed to being saved and restored. 
Device state is not > protected by SEV and can be handled entirely by the HV. Memory, on the > other hand, cannot be handled only by the HV. As mentioned previously, > memory needs to be encrypted with a transport key. A Migration Handler > on the source will coordinate with the HV to encrypt pages and transmit > them to the destination. The destination HV will receive the pages over > the network and pass them to the Migration Handler in the target VM so > they can be decrypted. This transmission will occur continuously until > the memory of the source and target converges. > > Plain SEV does not protect the CPU state of the guest and therefore does > not require any special mechanism for transmission of the CPU state. We > plan to implement an end-to-end migration with plain SEV first. In > SEV-ES, the PSP (platform security processor) encrypts CPU state on each > VMExit. The encrypted state is stored in memory. Normally this memory > (known as the VMSA) is not mapped into the guest, but we can add an > entry to the nested page tables that will expose the VMSA to the guest. > This means that when the guest VMExits, the CPU state will be saved to > guest memory. With the CPU state in guest memory, it can be transmitted > to the target using the method described above. > > In addition to the changes needed in OVMF to resume the VM, the > transmission of the VM from source to target will require a new code > path in the hypervisor. There will also need to be a few minor changes > to Linux (adding a mapping for our Phase 3 pages). Despite all the > moving pieces, we believe that this is a feasible approach for > supporting live migration for SEV and SEV-ES. > > For the sake of brevity, we have left out a few issues, including SMP > support, generation of the intermediate mappings, and more. We have > included some notes about these issues in the COMPLICATIONS.md file. 
We > also have an outline of an end-to-end implementation of live migration > for SEV-ES in END-TO-END.md. See README.md for info on how to run the > demo. While this is not a full migration, we hope to show that fast live > migration with SEV and SEV-ES is possible without major kernel changes. > > -Tobin the one word that comes to my mind upon reading the above is, "overwhelming". (I have not been addressed directly, but: - the subject says "RFC", - and the documentation at https://github.com/secure-migration/resume-from-edk2-tooling#what-changes-did-we-make states that AmdSevPkg was created for convenience, and that the feature could be integrated into OVMF. (Paraphrased.) So I guess it's tolerable if I make a comment: ) I've checked out the "mh-state-dev" branch of <https://github.com/secure-migration/resume-from-efi-edk2.git>. It has 80 commits on top of edk2 master (base commit: d5339c04d7cd, "UefiCpuPkg/MpInitLib: Add missing explicit PcdLib dependency", 2020-04-23). These commits were authored over the 6-7 months since April. It's obviously huge work. To me, most of these commits clearly aim at getting the demo / proof-of-concept functional, rather than guiding (more precisely: hand-holding) reviewers through the construction of the feature. In my opinion, the series is not upstreamable in its current format (which is presently not much more readable than a single-commit code drop). Upstreaming is probably not your intent, either, at this time. I agree that getting feedback ("buy-in") at this level of maturity is justified from your POV, before you invest more work into cleaning up / restructuring the series. My problem is that "hand-holding" is exactly what I'd need -- I cannot dedicate one or two weeks, as an indivisible block, to understanding your design. Nor can I approach the series patch-wise in its current format. 
Personally I would need the patch series to lead me through the whole design with baby steps ("ELI5"), meaning small code changes and detailed commit messages. I'd *also* need the more comprehensive guide-like documentation, as background material. Furthermore, I don't have an environment where I can test this proof-of-concept (and provide you with further incentive for cleaning up the series, by reporting success). So I hope others can spend the time discussing the design with you, and testing / repeating the demo. For me to review the patches, the patches should condense and replay your thinking process from the last 7 months, in as small as possible logical steps. (On the list.) I really don't want to be the bottleneck here, which is why I would support introducing this feature as a separate top-level package (AmdSevPkg). Thanks Laszlo ^ permalink raw reply [flat|nested] 16+ messages in thread
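For readers following the quoted RFC, the reason for the intermediate map in phase 2 can be illustrated with a toy model. This is a sketch under stated assumptions, not a description of the actual OVMF code: page tables are modeled as plain dictionaries from virtual page to physical page, the physical addresses and the `PAGE_OFFSET` constant are invented for illustration, and the source map is assumed to already contain a mapping for the phase 3 page. A CR3 switch is treated as safe only if the currently executing virtual page maps to the same physical page before and after the switch.

```python
PAGE = 0x1000
PAGE_OFFSET = 0xFFFF_8880_0000_0000  # assumption: illustrative offset-map base


def identity_map(phys_pages):
    """OVMF-style 1-1 map: virtual address equals physical address."""
    return {p: p for p in phys_pages}


def offset_map(phys_pages):
    """Kernel-style offset map: virtual = physical + constant."""
    return {p + PAGE_OFFSET: p for p in phys_pages}


def intermediate_map(trampoline_pages):
    """Map the trampoline pages at both their OVMF and source virtual addresses."""
    mapping = identity_map(trampoline_pages)
    mapping.update(offset_map(trampoline_pages))
    return mapping


def can_switch(old_map, new_map, rip):
    """True if the page being executed resolves identically in both maps."""
    page = rip & ~(PAGE - 1)
    return page in old_map and new_map.get(page) == old_map[page]


# Hypothetical physical pages holding phase 2 and phase 3 of the trampoline.
PHASE2, PHASE3 = 0x800000, 0x801000
memory = [0x1000 * i for i in range(0x7F0, 0x810)]

ovmf = identity_map(memory)
source = offset_map(memory)
middle = intermediate_map([PHASE2, PHASE3])

# Switching straight from the OVMF map to the source map loses the code page.
print(can_switch(ovmf, source, PHASE2 + 0x10))
# Going through the intermediate map keeps every step mapped.
print(can_switch(ovmf, middle, PHASE2 + 0x10))
print(can_switch(middle, source, PHASE3 + PAGE_OFFSET + 0x10))
```

In this model, phase 2 enters the intermediate map at its identity address, jumps to phase 3 at the offset address (which the intermediate map also provides), and only then switches to the source map.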
* Re: [edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept 2020-11-03 14:59 ` [edk2-devel] " Laszlo Ersek @ 2020-11-04 18:27 ` Tobin Feldman-Fitzthum 2020-11-06 15:45 ` Laszlo Ersek 2020-11-06 16:38 ` Dr. David Alan Gilbert 0 siblings, 2 replies; 16+ messages in thread From: Tobin Feldman-Fitzthum @ 2020-11-04 18:27 UTC (permalink / raw) To: Laszlo Ersek Cc: devel, dovmurik, Dov.Murik1, ashish.kalra, brijesh.singh, tobin, david.kaplan, jon.grimm, thomas.lendacky, jejb, frankeh, Dr. David Alan Gilbert On 2020-11-03 09:59, Laszlo Ersek wrote: > Hi Tobin, > > (keeping full context -- I'm adding Dave) > > On 10/28/20 20:31, Tobin Feldman-Fitzthum wrote: >> Hello, >> >> Dov Murik. James Bottomley, Hubertus Franke, and I have been working >> on >> a plan for fast live migration of SEV and SEV-ES (and SEV-SNP when >> it's >> out and even hopefully Intel TDX) VMs. We have developed an approach >> that we believe is feasible and a demonstration that shows our >> solution >> to the most difficult part of the problem. In short, we have >> implemented >> a UEFI Application that can resume from a VM snapshot. We think this >> is >> the crux of SEV-ES live migration. After describing the context of our >> demo and how it works, we explain how it can be extended to a full >> SEV-ES migration. Our goal is to show that fast SEV and SEV-ES live >> migration can be implemented in OVMF with minimal kernel changes. We >> provide a blueprint for doing so. >> >> Typically the hypervisor facilitates live migration. AMD SEV excludes >> the hypervisor from the trust domain of the guest. When a hypervisor >> (HV) examines the memory of an SEV guest, it will find only a >> ciphertext. If the HV moves the memory of an SEV guest, the ciphertext >> will be invalidated. Furthermore, with SEV-ES the hypervisor is >> largely >> unable to access guest CPU state. Thus, fast migration of SEV VMs >> requires support from inside the trust domain, i.e. the guest. 
>> >> One approach is to add support for SEV Migration to the Linux kernel. >> This would allow the guest to encrypt/decrypt its own memory with a >> transport key. This approach has met some resistance. We propose a >> similar approach implemented not in Linux, but in firmware, >> specifically >> OVMF. Since OVMF runs inside the guest, it has access to the guest >> memory and CPU state. OVMF should be able to perform the manipulations >> required for live migration of SEV and SEV-ES guests. >> >> The biggest challenge of this approach involves migrating the CPU >> state >> of an SEV-ES guest. In a normal (non-SEV migration) the HV sets the >> CPU >> state of the target before the target begins executing. In our >> approach, >> the HV starts the target and OVMF must resume to whatever state the >> source was in. We believe this to be the crux (or at least the most >> difficult part) of live migration for SEV and we hope that by >> demonstrating resume from EFI, we can show that our approach is >> generally feasible. >> >> Our demo can be found at <https://github.com/secure-migration>. The >> tooling repository is the best starting point. It contains >> documentation >> about the project and the scripts needed to run the demo. There are >> two >> more repos associated with the project. One is a modified edk2 tree >> that >> contains our modified OVMF. The other is a modified qemu, that has a >> couple of temporary changes needed for the demo. Our demonstration is >> aimed only at resuming from a VM snapshot in OVMF. We provide the >> source >> CPU state and source memory to the destination using temporary >> plumbing >> that violates the SEV trust model. We explain the setup in more depth >> in >> README.md. We are showing only that OVMF can resume from a VM >> snapshot. >> At the end we will describe our plan for transferring CPU state and >> memory from source to guest. 
To be clear, the temporary tooling used >> for >> this demo isn't built for encrypted VMs, but below we explain how this >> demo applies to and can be extended to encrypted VMs. >> >> We Implemented our resume code in a very similar fashion to the >> recommended S3 resume code. When the HV sets the CPU state of a guest, >> it can do so when the guest is not executing. Setting the state from >> inside the guest is a delicate operation. There is no way to >> atomically >> set all of the CPU state from inside the guest. Instead, we must set >> most registers individually and account for changes in control flow >> that >> doing so might cause. We do this with a three-phase trampoline. OVMF >> calls phase 1, which runs on the OVMF map. Phase 1 sets up phase 2 and >> jumps to it. Phase 2 switches to an intermediate map that reconciles >> the >> OVMF map and the source map. Phase 3 switches to the source map, >> restores the registers, and returns into execution of the source. We >> will go backwards through these phases in more depth. >> >> The last thing that resume to EFI does is return. Specifically, we use >> IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS from a >> temporary stack and restores them atomically, thus returning to source >> execution. Prior to returning, we must manually restore most other >> registers to the values they had on the source. One particularly >> significant register is CR3. When we return to Linux, CR3 must be set >> to >> the source CR3 or the first instruction executed in Linux will cause a >> page fault. The code that we use to restore the registers and return >> must be mapped in the source page table or we would get a page fault >> executing the instructions prior to returning into Linux. The value of >> CR3 is so significant, that it defines the three phases of the >> trampoline. Phase 3 begins when CR3 is set to the source CR3. After >> setting CR3, we set all the other registers and return. 
>> >> Phase 2 mainly exists to setup phase 3. OVMF uses a 1-1 mapping, >> meaning >> that virtual addresses are the same as physical addresses. The kernel >> page table uses an offset mapping, meaning that virtual addresses >> differ >> from physical addresses by a constant (for the most part). Crucially, >> this means that the virtual address of the page that is executed by >> phase 3 differs between the OVMF map and the source map. If we are >> executing code mapped in OVMF and we change CR3 to point to the source >> map, although the page may be mapped in the source map, the virtual >> address will be different, and we will face undefined behavior. To fix >> this, we construct intermediate page tables that map the pages for >> phase >> 2 and 3 to the virtual address expected in OVMF and to the virtual >> address expected in the source map. Thus, we can switch CR3 from >> OVMF's >> map to the intermediate map and then from the intermediate map to the >> source map. Phase 2 is much shorter than phase 3. Phase 2 is mainly >> responsible for switching to the intermediate map, flushing the TLB, >> and >> jumping to phase 3. >> >> Fortunately phase 1 is even simpler than phase 2. Phase 1 has two >> duties. First, since phase 2 and 3 operate without a stack and can't >> access values defined in OVMF (such as the addresses of the pages >> containing phase 2 and 3), phase 1 must pass these values to phase 2 >> by >> putting them in registers. Second, phase 1 must start phase 2 by >> jumping >> to it. >> >> Given that we can resume to a snapshot in OVMF, we should be able to >> migrate an SEV guest as long as we can securely communicate the VM >> snapshot from source to destination. For our demo, we do this with a >> handful of QMP commands. More sophisticated methods are required for a >> production implementation. >> >> When we refer to a snapshot, what we really mean is the device state, >> memory, and CPU state of a guest. 
>> In live migration this is transmitted dynamically as opposed to being
>> saved and restored. Device state is not protected by SEV and can be
>> handled entirely by the HV. Memory, on the other hand, cannot be
>> handled only by the HV. As mentioned previously, memory needs to be
>> encrypted with a transport key. A Migration Handler on the source will
>> coordinate with the HV to encrypt pages and transmit them to the
>> destination. The destination HV will receive the pages over the network
>> and pass them to the Migration Handler in the target VM so they can be
>> decrypted. This transmission will occur continuously until the memory
>> of the source and target converges.
>>
>> Plain SEV does not protect the CPU state of the guest and therefore
>> does not require any special mechanism for transmission of the CPU
>> state. We plan to implement an end-to-end migration with plain SEV
>> first. In SEV-ES, the PSP (platform security processor) encrypts CPU
>> state on each VMExit. The encrypted state is stored in memory. Normally
>> this memory (known as the VMSA) is not mapped into the guest, but we
>> can add an entry to the nested page tables that will expose the VMSA to
>> the guest. This means that when the guest VMExits, the CPU state will
>> be saved to guest memory. With the CPU state in guest memory, it can be
>> transmitted to the target using the method described above.
>>
>> In addition to the changes needed in OVMF to resume the VM, the
>> transmission of the VM from source to target will require a new code
>> path in the hypervisor. There will also need to be a few minor changes
>> to Linux (adding a mapping for our Phase 3 pages). Despite all the
>> moving pieces, we believe that this is a feasible approach for
>> supporting live migration for SEV and SEV-ES.
>>
>> For the sake of brevity, we have left out a few issues, including SMP
>> support, generation of the intermediate mappings, and more.
>> We have included some notes about these issues in the COMPLICATIONS.md
>> file. We also have an outline of an end-to-end implementation of live
>> migration for SEV-ES in END-TO-END.md. See README.md for info on how to
>> run the demo. While this is not a full migration, we hope to show that
>> fast live migration with SEV and SEV-ES is possible without major
>> kernel changes.
>>
>> -Tobin
>
> the one word that comes to my mind upon reading the above is,
> "overwhelming".
>
> (I have not been addressed directly, but:
>
> - the subject says "RFC",
>
> - and the documentation at
>
> https://github.com/secure-migration/resume-from-edk2-tooling#what-changes-did-we-make
>
> states that AmdSevPkg was created for convenience, and that the feature
> could be integrated into OVMF. (Paraphrased.)
>
> So I guess it's tolerable if I make a comment: )
>
We've been looking forward to your perspective.

> I've checked out the "mh-state-dev" branch of
> <https://github.com/secure-migration/resume-from-efi-edk2.git>. It has
> 80 commits on top of edk2 master (base commit: d5339c04d7cd,
> "UefiCpuPkg/MpInitLib: Add missing explicit PcdLib dependency",
> 2020-04-23).
>
> These commits were authored over the 6-7 months since April. It's
> obviously huge work. To me, most of these commits clearly aim at
> getting the demo / proof-of-concept functional, rather than guiding
> (more precisely: hand-holding) reviewers through the construction of
> the feature.
>
> In my opinion, the series is not upstreamable in its current format
> (which is presently not much more readable than a single-commit code
> drop). Upstreaming is probably not your intent, either, at this time.
>
> I agree that getting feedback ("buy-in") at this level of maturity is
> justified from your POV, before you invest more work into cleaning up /
> restructuring the series.
>
> My problem is that "hand-holding" is exactly what I'd need -- I cannot
> dedicate one or two weeks, as an indivisible block, to understanding
> your design. Nor can I approach the series patch-wise in its current
> format. Personally I would need the patch series to lead me through the
> whole design with baby steps ("ELI5"), meaning small code changes and
> detailed commit messages. I'd *also* need the more comprehensive
> guide-like documentation, as background material.
>
> Furthermore, I don't have an environment where I can test this
> proof-of-concept (and provide you with further incentive for cleaning
> up the series, by reporting success).
>
> So I hope others can spend the time discussing the design with you, and
> testing / repeating the demo. For me to review the patches, the patches
> should condense and replay your thinking process from the last 7
> months, in as small as possible logical steps. (On the list.)
>
I completely understand your position. This PoC has a lot of
new ideas in it and you're right that our main priority was not
to hand-hold/guide reviewers through the code.

One thing that is worth emphasizing is that the pieces we
are showcasing here are not the immediate priority when it
comes to upstreaming. Specifically, we looked into the trampoline
to make sure it was possible to migrate CPU state via firmware.
While we need this for SEV-ES and our goal is to support SEV-ES,
it is not the first step. We are currently working on a PoC for
a full end-to-end migration with SEV (non-ES), which may be a better
place for us to begin a serious discussion about getting things
upstream. We will focus more on making these patches accessible
to the upstream community.

In the meantime, perhaps there is something we can do to help
make our current work more clear. We could potentially explain
things on a call or create some additional documentation.
While our goal is not to shove this version of the trampoline upstream,
it is significant to our plan as a whole and we want to help
people understand it.

-Tobin

> I really don't want to be the bottleneck here, which is why I would
> support introducing this feature as a separate top-level package
> (AmdSevPkg).
>
> Thanks
> Laszlo
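
As an illustration of the phase 3 return described in the RFC text quoted above, here is a minimal Python sketch of the five-value frame that IRETQ consumes and of the restore order (CR3 first, the remaining registers next, IRETQ last). It is not taken from the secure-migration repositories: the function names and every register value are hypothetical, and a real trampoline would do this in assembly.

```python
import struct

# IRETQ in 64-bit mode pops these five 8-byte values, lowest address first.
IRETQ_FRAME_FIELDS = ("rip", "cs", "rflags", "rsp", "ss")


def build_iretq_frame(state):
    """Pack the values IRETQ will restore into a 40-byte little-endian frame."""
    return struct.pack("<5Q", *(state[name] for name in IRETQ_FRAME_FIELDS))


def phase3_restore_order(state):
    """Return the restore order sketched in the RFC text: CR3 first, then
    the remaining registers one by one, then IRETQ for the final five."""
    others = sorted(r for r in state if r != "cr3" and r not in IRETQ_FRAME_FIELDS)
    return ["cr3"] + others + ["iretq"]


# Hypothetical source CPU state; every value is made up for illustration.
source_state = {
    "cr3": 0x1000,
    "rax": 1,
    "rbx": 2,
    "rip": 0xFFFFFFFF81000000,
    "cs": 0x10,
    "rflags": 0x202,
    "rsp": 0xFFFFC90000003F00,
    "ss": 0x18,
}

frame = build_iretq_frame(source_state)
order = phase3_restore_order(source_state)
```

Only the five values in the frame are restored in a single step; everything listed before "iretq" in the returned order has to be written one register at a time, which is why the code doing so must already be mapped in the source page table.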
* Re: [edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept
  2020-11-04 18:27 ` Tobin Feldman-Fitzthum
@ 2020-11-06 15:45   ` Laszlo Ersek
  2020-11-06 20:03     ` Tobin Feldman-Fitzthum
  2020-11-06 16:38   ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 16+ messages in thread
From: Laszlo Ersek @ 2020-11-06 15:45 UTC
To: Tobin Feldman-Fitzthum
Cc: devel, dovmurik, Dov.Murik1, ashish.kalra, brijesh.singh, tobin, david.kaplan, jon.grimm, thomas.lendacky, jejb, frankeh, Dr. David Alan Gilbert

On 11/04/20 19:27, Tobin Feldman-Fitzthum wrote:

> In the meantime, perhaps there is something we can do to help
> make our current work more clear. We could potentially explain
> things on a call or create some additional documentation. While
> our goal is not to shove this version of the trampoline upstream,
> it is significant to our plan as a whole and we want to help
> people understand it.

From my personal (selfish) perspective, a call would be
counter-productive. Regarding documentation, I do have one thought that
might help, with the (very tricky) page table manipulations / phases:
diagrams (ascii or svg, perhaps). I don't know if that will help me look
at this in detail earlier, but *when* I will look at it, it will
definitely help me.

(If there are diagrams already, then I apologize for not noticing them.)

Thanks!
Laszlo
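
The page table manipulation referred to here can be modelled in a few lines. The following is a rough Python sketch, under stated assumptions, of why the RFC needs an intermediate map: each page table is reduced to a dictionary from virtual page address to physical page address, the offset constant and the trampoline address are made-up values, and none of the helper names come from the secure-migration code.

```python
PAGE_SIZE = 0x1000
# Hypothetical constant offset of the source kernel's mapping.
KERNEL_OFFSET = 0xFFFF888000000000


def identity_map(phys_pages):
    """OVMF-style 1-1 map: virtual address equals physical address."""
    return {phys: phys for phys in phys_pages}


def offset_map(phys_pages, offset=KERNEL_OFFSET):
    """Kernel-style map: virtual address is physical address plus a constant."""
    return {phys + offset: phys for phys in phys_pages}


def intermediate_map(trampoline_pages, offset=KERNEL_OFFSET):
    """Map only the trampoline pages, at both the OVMF and the source address."""
    table = identity_map(trampoline_pages)
    table.update(offset_map(trampoline_pages, offset))
    return table


def survives_cr3_switch(virt, old_map, new_map):
    """Code running at virt keeps running after a CR3 switch only if both
    maps translate virt to the same physical page."""
    return virt in old_map and old_map[virt] == new_map.get(virt)


trampoline = 0x800000  # made-up physical page holding phases 2 and 3
guest_pages = [trampoline, trampoline + PAGE_SIZE, 0x100000]

ovmf = identity_map(guest_pages)
source = offset_map(guest_pages)
inter = intermediate_map([trampoline])

# Switching straight from the OVMF map to the source map loses the code page.
direct = survives_cr3_switch(trampoline, ovmf, source)
# OVMF map -> intermediate map works at the identity address.
step1 = survives_cr3_switch(trampoline, ovmf, inter)
# Intermediate map -> source map works at the offset address.
step2 = survives_cr3_switch(trampoline + KERNEL_OFFSET, inter, source)
```

In this model the direct switch fails while both two-step switches succeed; between the two steps the trampoline only has to jump from the identity alias to the offset alias of the same physical page, both of which the intermediate map provides.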
* Re: [edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept
  2020-11-06 15:45 ` Laszlo Ersek
@ 2020-11-06 20:03   ` Tobin Feldman-Fitzthum
  0 siblings, 0 replies; 16+ messages in thread
From: Tobin Feldman-Fitzthum @ 2020-11-06 20:03 UTC
To: lersek
Cc: devel, dovmurik, Dov.Murik1, ashish.kalra, brijesh.singh, tobin, david.kaplan, jon.grimm, thomas.lendacky, jejb, frankeh, Dr. David Alan Gilbert

On 2020-11-06 10:45, Laszlo Ersek wrote:
> On 11/04/20 19:27, Tobin Feldman-Fitzthum wrote:
>
>> In the meantime, perhaps there is something we can do to help
>> make our current work more clear. We could potentially explain
>> things on a call or create some additional documentation. While
>> our goal is not to shove this version of the trampoline upstream,
>> it is significant to our plan as a whole and we want to help
>> people understand it.
>
> From my personal (selfish) perspective, a call would be
> counter-productive. Regarding documentation, I do have one thought that
> might help, with the (very tricky) page table manipulations / phases:
> diagrams (ascii or svg, perhaps). I don't know if that will help me
> look at this in detail earlier, but *when* I will look at it, it will
> definitely help me.
>
> (If there are diagrams already, then I apologize for not noticing
> them.)
>
We can work on some diagrams. I have a couple informal ones on paper.
Adding something visual to the docs seems like a good idea.

-Tobin

> Thanks!
> Laszlo
* Re: [edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept
  2020-11-04 18:27 ` Tobin Feldman-Fitzthum
  2020-11-06 15:45   ` Laszlo Ersek
@ 2020-11-06 16:38   ` Dr. David Alan Gilbert
  2020-11-06 21:48     ` Tobin Feldman-Fitzthum
  1 sibling, 1 reply; 16+ messages in thread
From: Dr. David Alan Gilbert @ 2020-11-06 16:38 UTC
To: Tobin Feldman-Fitzthum
Cc: Laszlo Ersek, devel, dovmurik, Dov.Murik1, ashish.kalra, brijesh.singh, tobin, david.kaplan, jon.grimm, thomas.lendacky, jejb, frankeh

* Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote:
> On 2020-11-03 09:59, Laszlo Ersek wrote:
> > Hi Tobin,
> >
> > (keeping full context -- I'm adding Dave)
> >
> > On 10/28/20 20:31, Tobin Feldman-Fitzthum wrote:
> > > [...]
> >
> > [...]
>
> I completely understand your position. This PoC has a lot of
> new ideas in it and you're right that our main priority was not
> to hand-hold/guide reviewers through the code.
>
> One thing that is worth emphasizing is that the pieces we
> are showcasing here are not the immediate priority when it
> comes to upstreaming. Specifically, we looked into the trampoline
> to make sure it was possible to migrate CPU state via firmware.
> While we need this for SEV-ES and our goal is to support SEV-ES,
> it is not the first step. We are currently working on a PoC for
> a full end-to-end migration with SEV (non-ES), which may be a better
> place for us to begin a serious discussion about getting things
> upstream. We will focus more on making these patches accessible
> to the upstream community.

With my migration maintainer hat on, I'd like to understand a bit more
about these different approaches; they could be quite invasive, so I'd
like to make sure we're not doing one and throwing it away - it would be
great if you could explain your non-ES approach; you don't need to have
POC code to explain it.

Dave

> In the meantime, perhaps there is something we can do to help
> make our current work more clear. We could potentially explain
> things on a call or create some additional documentation. While
> our goal is not to shove this version of the trampoline upstream,
> it is significant to our plan as a whole and we want to help
> people understand it.
>
> -Tobin
>
> > I really don't want to be the bottleneck here, which is why I would
> > support introducing this feature as a separate top-level package
> > (AmdSevPkg).
> >
> > Thanks
> > Laszlo
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
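
For readers trying to picture the memory side of the approach, the RFC's description (a Migration Handler on each side, pages encrypted with a transport key, transmission repeated until source and target converge) can be sketched as a toy pre-copy loop. This is only an assumption-laden Python model: the XOR routine is a stand-in and is not real encryption, the dirty-page schedule is supplied by hand, and none of the names reflect the authors' planned interfaces.

```python
def xor_with_key(page, key):
    """Stand-in for transport encryption. XOR is NOT secure; it only marks
    where a real Migration Handler would encrypt or decrypt a page."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(page))


def migrate(source_mem, target_mem, dirty_rounds, key):
    """Toy pre-copy loop: send every page once, then resend whatever the
    still-running source guest dirtied, until nothing is left to send.

    dirty_rounds is a list of {page_index: new_bytes} dictionaries, one per
    round, standing in for guest writes. Returns the number of rounds."""
    to_send = set(range(len(source_mem)))
    pending = list(dirty_rounds)
    rounds = 0
    while to_send:
        for idx in sorted(to_send):
            # Source Migration Handler encrypts; the HV only sees ciphertext.
            ciphertext = xor_with_key(source_mem[idx], key)
            # Target Migration Handler decrypts into target guest memory.
            target_mem[idx] = xor_with_key(ciphertext, key)
        rounds += 1
        # The source guest keeps running and dirties some pages.
        writes = pending.pop(0) if pending else {}
        for idx, data in writes.items():
            source_mem[idx] = data
        to_send = set(writes)
    return rounds


source = [b"aaaa", b"bbbb", b"cccc"]
target = [None, None, None]
rounds = migrate(source, target, [{1: b"zzzz"}], key=b"\x5a\xa5")
```

In this model the hypervisor's role is limited to carrying ciphertext between the two handlers. A real migration would also have to pause the source for a final round and transfer device and CPU state, which the sketch leaves out.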
* Re: [edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept
  2020-11-06 16:38 ` Dr. David Alan Gilbert
@ 2020-11-06 21:48   ` Tobin Feldman-Fitzthum
  2020-11-06 22:17     ` Ashish Kalra
  2020-11-09 19:56     ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 16+ messages in thread
From: Tobin Feldman-Fitzthum @ 2020-11-06 21:48 UTC
To: Dr. David Alan Gilbert
Cc: Laszlo Ersek, devel, dovmurik, Dov.Murik1, ashish.kalra, brijesh.singh, tobin, david.kaplan, jon.grimm, thomas.lendacky, jejb, frankeh

On 2020-11-06 11:38, Dr. David Alan Gilbert wrote:
> * Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote:
>> On 2020-11-03 09:59, Laszlo Ersek wrote:
>> > Hi Tobin,
>> >
>> > (keeping full context -- I'm adding Dave)
>> >
>> > On 10/28/20 20:31, Tobin Feldman-Fitzthum wrote:
>> > > [...]
>> >
>> > [...]
>> >
>> > So I hope others can spend the time discussing the design with you, and
>> > testing / repeating the demo. For me to review the patches, the patches
>> > should condense and replay your thinking process from the last 7 months,
>> > in as small as possible logical steps.
(On the list.)
>> >
>> I completely understand your position. This PoC has a lot of
>> new ideas in it and you're right that our main priority was not
>> to hand-hold/guide reviewers through the code.
>>
>> One thing that is worth emphasizing is that the pieces we
>> are showcasing here are not the immediate priority when it
>> comes to upstreaming. Specifically, we looked into the trampoline
>> to make sure it was possible to migrate CPU state via firmware.
>> While we need this for SEV-ES and our goal is to support SEV-ES,
>> it is not the first step. We are currently working on a PoC for
>> a full end-to-end migration with SEV (non-ES), which may be a better
>> place for us to begin a serious discussion about getting things
>> upstream. We will focus more on making these patches accessible
>> to the upstream community.
>
> With my migration maintainer hat on, I'd like to understand a bit more
> about these different approaches; they could be quite invasive, so I'd
> like to make sure we're not doing one and throwing it away - it would
> be great if you could explain your non-ES approach; you don't need to
> have POC code to explain it.
>
Our non-ES approach is a subset of our ES approach. For ES, the
Migration Handler in the guest needs to help out with memory and
CPU state. For plain SEV, the HV can set the CPU state, but we still
need a way to transfer the memory. The current POC only deals
with the CPU state.

We're still working out some of the details in QEMU, but the basic
idea of transferring memory is that each time the HV needs to send a
page to the target, it will ask the Migration Handler in the guest
for a version of the page that is encrypted with a transport key.
Since the MH is inside the guest, it can read from any address
in guest memory. The Migration Handlers on the source and the target
will share a key. Once the source encrypts the requested page with
the transport key, it can safely hand it off to the HV. Once the page
reaches the target, the target HV will pass the page into the
Migration Handler, which will decrypt using the transport key and
move the page to the appropriate address.

A few things to note:

- The Migration Handler on the source needs to be running in the
  guest alongside the VM. On the target, the MH needs to start up
  before we can receive any pages. In both cases we are thinking
  that an additional vCPU can be started for the MH to run on.
  This could be spawned dynamically or live for the duration of
  the guest.

- We need to make sure that the Migration Handler on the target
  does not overwrite itself when it receives pages from the
  source. Since we run the same firmware on the source and
  target, and since the MH is runtime code, the memory
  footprint of the MH should match on the source and the
  target. We will need to make sure there are no weird
  relocations.

- There are some complexities arising from the fact that not
  every page in an SEV VM is encrypted. We are looking into
  the best way to handle encrypted vs. shared pages.

Hopefully those notes don't confound my earlier explanation too
much. I think that's most of the picture for non-ES migration.
Let me know if you have any questions. ES migration would use
the same approach for transferring memory.

-Tobin

> Dave
>
>> In the meantime, perhaps there is something we can do to help
>> make our current work more clear. We could potentially explain
>> things on a call or create some additional documentation. While
>> our goal is not to shove this version of the trampoline upstream,
>> it is significant to our plan as a whole and we want to help
>> people understand it.
>>
>> -Tobin
>>
>> > I really don't want to be the bottleneck here, which is why I would
>> > support introducing this feature as a separate top-level package
>> > (AmdSevPkg).
>> >
>> > Thanks
>> > Laszlo
>>
^ permalink raw reply [flat|nested] 16+ messages in thread
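The page-transfer flow described in the message above (source MH encrypts a page with the shared transport key, the HV carries it, the target MH decrypts it and places it at the right address) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`mh_export_page`, `mh_import_page`), the packet layout, and the choice to reject pages that overlap the MH footprint are all hypothetical, and the SHA-256 keystream plus HMAC is a standard-library stand-in for whatever authenticated cipher a real Migration Handler would use.

```python
import hashlib
import hmac
import os

PAGE_SIZE = 4096


def _subkey(transport_key: bytes, label: bytes) -> bytes:
    # Toy key separation so encryption and authentication use different keys.
    return hashlib.sha256(label + transport_key).digest()


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Placeholder keystream (SHA-256 in counter mode); NOT a vetted cipher.
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "little")).digest()
        counter += 1
    return bytes(out[:length])


def _tag(transport_key: bytes, gpa: int, nonce: bytes, ciphertext: bytes) -> bytes:
    # Bind the ciphertext to its guest-physical address so the untrusted HV
    # cannot silently move a page to a different location.
    msg = gpa.to_bytes(8, "little") + nonce + ciphertext
    return hmac.new(_subkey(transport_key, b"mac"), msg, hashlib.sha256).digest()


def _check_page(guest_mem: bytearray, gpa: int) -> None:
    if gpa % PAGE_SIZE or gpa < 0 or gpa + PAGE_SIZE > len(guest_mem):
        raise ValueError("address is not a valid page in guest memory")


def mh_export_page(guest_mem: bytearray, gpa: int, transport_key: bytes) -> dict:
    """Source MH: read one page and wrap it so it can be handed to the HV."""
    _check_page(guest_mem, gpa)
    page = bytes(guest_mem[gpa:gpa + PAGE_SIZE])
    nonce = os.urandom(16)
    stream = _keystream(_subkey(transport_key, b"enc"), nonce, PAGE_SIZE)
    ciphertext = bytes(p ^ s for p, s in zip(page, stream))
    return {"gpa": gpa, "nonce": nonce, "ct": ciphertext,
            "tag": _tag(transport_key, gpa, nonce, ciphertext)}


def mh_import_page(guest_mem: bytearray, packet: dict, transport_key: bytes,
                   mh_start: int, mh_end: int) -> None:
    """Target MH: verify, decrypt, and place the page at its address."""
    gpa = packet["gpa"]
    _check_page(guest_mem, gpa)
    # Assumption: pages overlapping the MH's own footprint are refused,
    # so the handler never overwrites itself while it is running.
    if gpa < mh_end and gpa + PAGE_SIZE > mh_start:
        raise ValueError("page overlaps the Migration Handler")
    expected = _tag(transport_key, gpa, packet["nonce"], packet["ct"])
    if not hmac.compare_digest(expected, packet["tag"]):
        raise ValueError("authentication tag mismatch")
    stream = _keystream(_subkey(transport_key, b"enc"), packet["nonce"], PAGE_SIZE)
    guest_mem[gpa:gpa + PAGE_SIZE] = bytes(c ^ s for c, s in zip(packet["ct"], stream))
```

In this sketch the HV only ever sees the packet returned by `mh_export_page`, which matches the trust model described above; how the two handlers come to share `transport_key` is left out entirely.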
* Re: [edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept 2020-11-06 21:48 ` Tobin Feldman-Fitzthum @ 2020-11-06 22:17 ` Ashish Kalra 2020-11-09 20:27 ` Tobin Feldman-Fitzthum 2020-11-09 19:56 ` Dr. David Alan Gilbert 1 sibling, 1 reply; 16+ messages in thread From: Ashish Kalra @ 2020-11-06 22:17 UTC (permalink / raw) To: Tobin Feldman-Fitzthum Cc: Dr. David Alan Gilbert, Laszlo Ersek, devel, dovmurik, Dov.Murik1, brijesh.singh, tobin, david.kaplan, jon.grimm, thomas.lendacky, jejb, frankeh Hello Tobin, On Fri, Nov 06, 2020 at 04:48:12PM -0500, Tobin Feldman-Fitzthum wrote: > On 2020-11-06 11:38, Dr. David Alan Gilbert wrote: > > * Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote: > > > On 2020-11-03 09:59, Laszlo Ersek wrote: > > > > Hi Tobin, > > > > > > > > (keeping full context -- I'm adding Dave) > > > > > > > > On 10/28/20 20:31, Tobin Feldman-Fitzthum wrote: > > > > > Hello, > > > > > > > > > > Dov Murik. James Bottomley, Hubertus Franke, and I have been working > > > > > on > > > > > a plan for fast live migration of SEV and SEV-ES (and SEV-SNP when > > > > > it's > > > > > out and even hopefully Intel TDX) VMs. We have developed an approach > > > > > that we believe is feasible and a demonstration that shows our > > > > > solution > > > > > to the most difficult part of the problem. In short, we have > > > > > implemented > > > > > a UEFI Application that can resume from a VM snapshot. We think this > > > > > is > > > > > the crux of SEV-ES live migration. After describing the context of our > > > > > demo and how it works, we explain how it can be extended to a full > > > > > SEV-ES migration. Our goal is to show that fast SEV and SEV-ES live > > > > > migration can be implemented in OVMF with minimal kernel changes. We > > > > > provide a blueprint for doing so. > > > > > > > > > > Typically the hypervisor facilitates live migration. AMD SEV excludes > > > > > the hypervisor from the trust domain of the guest. 
When a hypervisor > > > > > (HV) examines the memory of an SEV guest, it will find only a > > > > > ciphertext. If the HV moves the memory of an SEV guest, the ciphertext > > > > > will be invalidated. Furthermore, with SEV-ES the hypervisor is > > > > > largely > > > > > unable to access guest CPU state. Thus, fast migration of SEV VMs > > > > > requires support from inside the trust domain, i.e. the guest. > > > > > > > > > > One approach is to add support for SEV Migration to the Linux kernel. > > > > > This would allow the guest to encrypt/decrypt its own memory with a > > > > > transport key. This approach has met some resistance. We propose a > > > > > similar approach implemented not in Linux, but in firmware, > > > > > specifically > > > > > OVMF. Since OVMF runs inside the guest, it has access to the guest > > > > > memory and CPU state. OVMF should be able to perform the manipulations > > > > > required for live migration of SEV and SEV-ES guests. > > > > > > > > > > The biggest challenge of this approach involves migrating the CPU > > > > > state > > > > > of an SEV-ES guest. In a normal (non-SEV migration) the HV sets the > > > > > CPU > > > > > state of the target before the target begins executing. In our > > > > > approach, > > > > > the HV starts the target and OVMF must resume to whatever state the > > > > > source was in. We believe this to be the crux (or at least the most > > > > > difficult part) of live migration for SEV and we hope that by > > > > > demonstrating resume from EFI, we can show that our approach is > > > > > generally feasible. 
> > > > > > > > > > Our demo can be found at <https://github.com/secure-migration>. The > > > > > tooling repository is the best starting point. It contains > > > > > documentation > > > > > about the project and the scripts needed to run the demo. There are > > > > > two > > > > > more repos associated with the project. One is a modified edk2 tree > > > > > that > > > > > contains our modified OVMF. The other is a modified qemu, that has a > > > > > couple of temporary changes needed for the demo. Our demonstration is > > > > > aimed only at resuming from a VM snapshot in OVMF. We provide the > > > > > source > > > > > CPU state and source memory to the destination using temporary > > > > > plumbing > > > > > that violates the SEV trust model. We explain the setup in more > > > > > depth in > > > > > README.md. We are showing only that OVMF can resume from a VM > > > > > snapshot. > > > > > At the end we will describe our plan for transferring CPU state and > > > > > memory from source to guest. To be clear, the temporary tooling used > > > > > for > > > > > this demo isn't built for encrypted VMs, but below we explain how this > > > > > demo applies to and can be extended to encrypted VMs. > > > > > > > > > > We Implemented our resume code in a very similar fashion to the > > > > > recommended S3 resume code. When the HV sets the CPU state of a guest, > > > > > it can do so when the guest is not executing. Setting the state from > > > > > inside the guest is a delicate operation. There is no way to > > > > > atomically > > > > > set all of the CPU state from inside the guest.
Instead, we must set > > > > > most registers individually and account for changes in control flow > > > > > that > > > > > doing so might cause. We do this with a three-phase trampoline. OVMF > > > > > calls phase 1, which runs on the OVMF map. Phase 1 sets up phase 2 and > > > > > jumps to it. Phase 2 switches to an intermediate map that reconciles > > > > > the > > > > > OVMF map and the source map. Phase 3 switches to the source map, > > > > > restores the registers, and returns into execution of the source. We > > > > > will go backwards through these phases in more depth. > > > > > > > > > > The last thing that resume to EFI does is return. Specifically, we use > > > > > IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS from a > > > > > temporary stack and restores them atomically, thus returning to source > > > > > execution. Prior to returning, we must manually restore most other > > > > > registers to the values they had on the source. One particularly > > > > > significant register is CR3. When we return to Linux, CR3 must be > > > > > set to > > > > > the source CR3 or the first instruction executed in Linux will cause a > > > > > page fault. The code that we use to restore the registers and return > > > > > must be mapped in the source page table or we would get a page fault > > > > > executing the instructions prior to returning into Linux. The value of > > > > > CR3 is so significant, that it defines the three phases of the > > > > > trampoline. Phase 3 begins when CR3 is set to the source CR3. After > > > > > setting CR3, we set all the other registers and return. > > > > > > > > > > Phase 2 mainly exists to setup phase 3. OVMF uses a 1-1 mapping, > > > > > meaning > > > > > that virtual addresses are the same as physical addresses. The kernel > > > > > page table uses an offset mapping, meaning that virtual addresses > > > > > differ > > > > > from physical addresses by a constant (for the most part). 
Crucially, > > > > > this means that the virtual address of the page that is executed by > > > > > phase 3 differs between the OVMF map and the source map. If we are > > > > > executing code mapped in OVMF and we change CR3 to point to the source > > > > > map, although the page may be mapped in the source map, the virtual > > > > > address will be different, and we will face undefined behavior. To fix > > > > > this, we construct intermediate page tables that map the pages for > > > > > phase > > > > > 2 and 3 to the virtual address expected in OVMF and to the virtual > > > > > address expected in the source map. Thus, we can switch CR3 from > > > > > OVMF's > > > > > map to the intermediate map and then from the intermediate map to the > > > > > source map. Phase 2 is much shorter than phase 3. Phase 2 is mainly > > > > > responsible for switching to the intermediate map, flushing the TLB, > > > > > and > > > > > jumping to phase 3. > > > > > > > > > > Fortunately phase 1 is even simpler than phase 2. Phase 1 has two > > > > > duties. First, since phase 2 and 3 operate without a stack and can't > > > > > access values defined in OVMF (such as the addresses of the pages > > > > > containing phase 2 and 3), phase 1 must pass these values to phase 2 > > > > > by > > > > > putting them in registers. Second, phase 1 must start phase 2 by > > > > > jumping > > > > > to it. > > > > > > > > > > Given that we can resume to a snapshot in OVMF, we should be able to > > > > > migrate an SEV guest as long as we can securely communicate the VM > > > > > snapshot from source to destination. For our demo, we do this with a > > > > > handful of QMP commands. More sophisticated methods are required for a > > > > > production implementation. > > > > > > > > > > When we refer to a snapshot, what we really mean is the device state, > > > > > memory, and CPU state of a guest. In live migration this is > > > > > transmitted > > > > > dynamically as opposed to being saved and restored. 
Device state is > > > > > not > > > > > protected by SEV and can be handled entirely by the HV. Memory, on the > > > > > other hand, cannot be handled only by the HV. As mentioned previously, > > > > > memory needs to be encrypted with a transport key. A Migration Handler > > > > > on the source will coordinate with the HV to encrypt pages and > > > > > transmit > > > > > them to the destination. The destination HV will receive the pages > > > > > over > > > > > the network and pass them to the Migration Handler in the target VM so > > > > > they can be decrypted. This transmission will occur continuously until > > > > > the memory of the source and target converges. > > > > > > > > > > Plain SEV does not protect the CPU state of the guest and therefore > > > > > does > > > > > not require any special mechanism for transmission of the CPU state. > > > > > We > > > > > plan to implement an end-to-end migration with plain SEV first. In > > > > > SEV-ES, the PSP (platform security processor) encrypts CPU state on > > > > > each > > > > > VMExit. The encrypted state is stored in memory. Normally this memory > > > > > (known as the VMSA) is not mapped into the guest, but we can add an > > > > > entry to the nested page tables that will expose the VMSA to the > > > > > guest. > > > > > This means that when the guest VMExits, the CPU state will be saved to > > > > > guest memory. With the CPU state in guest memory, it can be > > > > > transmitted > > > > > to the target using the method described above. > > > > > > > > > > In addition to the changes needed in OVMF to resume the VM, the > > > > > transmission of the VM from source to target will require a new code > > > > > path in the hypervisor. There will also need to be a few minor changes > > > > > to Linux (adding a mapping for our Phase 3 pages). Despite all the > > > > > moving pieces, we believe that this is a feasible approach for > > > > > supporting live migration for SEV and SEV-ES. 
> > > > > > > > > > For the sake of brevity, we have left out a few issues, including SMP > > > > > support, generation of the intermediate mappings, and more. We have > > > > > included some notes about these issues in the COMPLICATIONS.md file. > > > > > We > > > > > also have an outline of an end-to-end implementation of live migration > > > > > for SEV-ES in END-TO-END.md. See README.md for info on how to run the > > > > > demo. While this is not a full migration, we hope to show that fast > > > > > live > > > > > migration with SEV and SEV-ES is possible without major kernel > > > > > changes. > > > > > > > > > > -Tobin > > > > > > > > the one word that comes to my mind upon reading the above is, > > > > "overwhelming". > > > > > > > > (I have not been addressed directly, but: > > > > > > > > - the subject says "RFC", > > > > > > > > - and the documentation at > > > > > > > > https://github.com/secure-migration/resume-from-edk2-tooling#what-changes-did-we-make > > > > > > > > states that AmdSevPkg was created for convenience, and that the feature > > > > could be integrated into OVMF. (Paraphrased.) > > > > > > > > So I guess it's tolerable if I make a comment: ) > > > > > > > We've been looking forward to your perspective.
> > > > > > > I've checked out the "mh-state-dev" branch of > > > > <https://github.com/secure-migration/resume-from-efi-edk2.git>. It has > > > > 80 commits on top of edk2 master (base commit: d5339c04d7cd, > > > > "UefiCpuPkg/MpInitLib: Add missing explicit PcdLib dependency", > > > > 2020-04-23). > > > > > > > > These commits were authored over the 6-7 months since April. It's > > > > obviously huge work. To me, most of these commits clearly aim at getting > > > > the demo / proof-of-concept functional, rather than guiding (more > > > > precisely: hand-holding) reviewers through the construction of the > > > > feature. > > > > > > > > In my opinion, the series is not upstreamable in its current format > > > > (which is presently not much more readable than a single-commit code > > > > drop). Upstreaming is probably not your intent, either, at this time. > > > > > > > > I agree that getting feedback ("buy-in") at this level of maturity is > > > > justified from your POV, before you invest more work into cleaning up / > > > > restructuring the series. > > > > > > > > My problem is that "hand-holding" is exactly what I'd need -- I cannot > > > > dedicate one or two weeks, as an indivisible block, to understanding > > > > your design. Nor can I approach the series patch-wise in its current > > > > format. Personally I would need the patch series to lead me through the > > > > whole design with baby steps ("ELI5"), meaning small code changes and > > > > detailed commit messages. I'd *also* need the more comprehensive > > > > guide-like documentation, as background material.
> > > > > > > > Furthermore, I don't have an environment where I can test this > > > > proof-of-concept (and provide you with further incentive for cleaning up > > > > the series, by reporting success). > > > > > > > > So I hope others can spend the time discussing the design with you, and > > > > testing / repeating the demo. For me to review the patches, the patches > > > > should condense and replay your thinking process from the last 7 months, > > > > in as small as possible logical steps. (On the list.) > > > > > > > I completely understand your position. This PoC has a lot of > > > new ideas in it and you're right that our main priority was not > > > to hand-hold/guide reviewers through the code. > > > > > > One thing that is worth emphasizing is that the pieces we > > > are showcasing here are not the immediate priority when it > > > comes to upstreaming. Specifically, we looked into the trampoline > > > to make sure it was possible to migrate CPU state via firmware. > > > While we need this for SEV-ES and our goal is to support SEV-ES, > > > it is not the first step. We are currently working on a PoC for > > > a full end-to-end migration with SEV (non-ES), which may be a better > > > place for us to begin a serious discussion about getting things > > > upstream. We will focus more on making these patches accessible > > > to the upstream community. > > > > With my migration maintainer hat on, I'd like to understand a bit more > > about these different approaches; they could be quite invasive, so I'd > > like to make sure we're not doing one and throwing it away - it would > > be great if you could explain your non-ES approach; you don't need to > > have POC code to explain it. > > > Our non-ES approach is a subset of our ES approach. For ES, the > Migration Handler in the guest needs to help out with memory and > CPU state. For plain SEV, the HV can set the CPU state, but we still > need a way to transfer the memory. 
The current POC only deals
> with the CPU state.
>
> We're still working out some of the details in QEMU, but the basic
> idea of transferring memory is that each time the HV needs to send a
> page to the target, it will ask the Migration Handler in the guest
> for a version of the page that is encrypted with a transport key.
> Since the MH is inside the guest, it can read from any address
> in guest memory. The Migration Handlers on the source and the target
> will share a key. Once the source encrypts the requested page with
> the transport key, it can safely hand it off to the HV. Once the page
> reaches the target, the target HV will pass the page into the
> Migration Handler, which will decrypt using the transport key and
> move the page to the appropriate address.
>
> A few things to note:
>
> - The Migration Handler on the source needs to be running in the
>   guest alongside the VM. On the target, the MH needs to startup
>   before we can receive any pages. In both cases we are thinking
>   that an additional vCPU can be started for the MH to run on.
>   This could be spawned dynamically or live for the duration of
>   the guest.
>
> - We need to make sure that the Migration Handler on the target
>   does not overwrite itself when it receives pages from the
>   source. Since we run the same firmware on the source and
>   target, and since the MH is runtime code, the memory
>   footprint of the MH should match on the source and the
>   target. We will need to make sure there are no weird
>   relocations.
>
> - There are some complexities arising from the fact that not
>   every page in an SEV VM is encrypted. We are looking into
>   the best way to handle encrypted vs. shared pages.
>

Raising this question here as part of this discussion: are you
thinking of adding the page encryption bitmap (as we do for the slow
migration patches) here to figure out whether the guest pages are
encrypted or not? The page encryption status will need notifications
from the guest kernel and OVMF.

Additionally, is the page encryption bitmap support going to be added
as a hypercall interface to the guest, which also means that the guest
kernel needs to be modified?

Thanks,
Ashish

> Hopefully those notes don't confound my earlier explanation too
> much. I think that's most of the picture for non-ES migration.
> Let me know if you have any questions. ES migration would use
> the same approach for transferring memory.
>
> -Tobin
>
> > Dave
> >
> > > In the meantime, perhaps there is something we can do to help
> > > make our current work more clear. We could potentially explain
> > > things on a call or create some additional documentation. While
> > > our goal is not to shove this version of the trampoline upstream,
> > > it is significant to our plan as a whole and we want to help
> > > people understand it.
> > >
> > > -Tobin
> > >
> > > > I really don't want to be the bottleneck here, which is why I would
> > > > support introducing this feature as a separate top-level package
> > > > (AmdSevPkg).
> > > >
> > > > Thanks
> > > > Laszlo
> > >
^ permalink raw reply [flat|nested] 16+ messages in thread
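To make the question above concrete, a page encryption bitmap of the kind being asked about might look roughly like the sketch below. Everything here is an assumption for illustration: the class and function names are hypothetical, the `set_range` call merely stands in for whatever notification (hypercall or otherwise) the guest kernel and OVMF would issue, and treating every page as encrypted until the guest says otherwise is a simplification, not a statement about how KVM or the slow-migration patches behave.

```python
class PageEncryptionBitmap:
    """One bit per guest page frame: 1 = encrypted (private), 0 = shared."""

    def __init__(self, num_pages: int) -> None:
        self.num_pages = num_pages
        # Assumption: pages start out encrypted until the guest reports otherwise.
        self.bits = bytearray(b"\xff" * ((num_pages + 7) // 8))

    def set_range(self, gfn: int, npages: int, encrypted: bool) -> None:
        """Record a status change reported by the guest for a run of pages."""
        if gfn < 0 or npages < 0 or gfn + npages > self.num_pages:
            raise ValueError("page range outside guest memory")
        for n in range(gfn, gfn + npages):
            mask = 1 << (n & 7)
            if encrypted:
                self.bits[n >> 3] |= mask
            else:
                self.bits[n >> 3] &= ~mask & 0xFF

    def is_encrypted(self, gfn: int) -> bool:
        if not 0 <= gfn < self.num_pages:
            raise ValueError("page frame outside guest memory")
        return bool(self.bits[gfn >> 3] & (1 << (gfn & 7)))


def plan_transfer(bitmap: PageEncryptionBitmap, dirty_gfns: list) -> tuple:
    """Split dirty pages into those that need the Migration Handler
    (encrypted) and those the HV could copy directly (shared)."""
    via_mh, direct = [], []
    for gfn in dirty_gfns:
        (via_mh if bitmap.is_encrypted(gfn) else direct).append(gfn)
    return via_mh, direct
```

With such a structure on the hypervisor side, the HV would only need to involve the Migration Handler for pages whose bit is set; whether the bitmap is maintained through a hypercall or some other channel is exactly the open question raised above.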
* Re: [edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept 2020-11-06 22:17 ` Ashish Kalra @ 2020-11-09 20:27 ` Tobin Feldman-Fitzthum 2020-11-09 20:34 ` Kalra, Ashish 0 siblings, 1 reply; 16+ messages in thread From: Tobin Feldman-Fitzthum @ 2020-11-09 20:27 UTC (permalink / raw) To: Ashish Kalra Cc: Dr. David Alan Gilbert, Laszlo Ersek, devel, dovmurik, Dov.Murik1, brijesh.singh, tobin, david.kaplan, jon.grimm, thomas.lendacky, jejb, frankeh On 2020-11-06 17:17, Ashish Kalra wrote: > Hello Tobin, > > On Fri, Nov 06, 2020 at 04:48:12PM -0500, Tobin Feldman-Fitzthum wrote: >> On 2020-11-06 11:38, Dr. David Alan Gilbert wrote: >> > * Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote: >> > > On 2020-11-03 09:59, Laszlo Ersek wrote: >> > > > Hi Tobin, >> > > > >> > > > (keeping full context -- I'm adding Dave) >> > > > >> > > > On 10/28/20 20:31, Tobin Feldman-Fitzthum wrote: >> > > > > Hello, >> > > > > >> > > > > Dov Murik. James Bottomley, Hubertus Franke, and I have been working >> > > > > on >> > > > > a plan for fast live migration of SEV and SEV-ES (and SEV-SNP when >> > > > > it's >> > > > > out and even hopefully Intel TDX) VMs. We have developed an approach >> > > > > that we believe is feasible and a demonstration that shows our >> > > > > solution >> > > > > to the most difficult part of the problem. In short, we have >> > > > > implemented >> > > > > a UEFI Application that can resume from a VM snapshot. We think this >> > > > > is >> > > > > the crux of SEV-ES live migration. After describing the context of our >> > > > > demo and how it works, we explain how it can be extended to a full >> > > > > SEV-ES migration. Our goal is to show that fast SEV and SEV-ES live >> > > > > migration can be implemented in OVMF with minimal kernel changes. We >> > > > > provide a blueprint for doing so. >> > > > > >> > > > > Typically the hypervisor facilitates live migration. 
AMD SEV excludes >> > > > > the hypervisor from the trust domain of the guest. When a hypervisor >> > > > > (HV) examines the memory of an SEV guest, it will find only a >> > > > > ciphertext. If the HV moves the memory of an SEV guest, the ciphertext >> > > > > will be invalidated. Furthermore, with SEV-ES the hypervisor is >> > > > > largely >> > > > > unable to access guest CPU state. Thus, fast migration of SEV VMs >> > > > > requires support from inside the trust domain, i.e. the guest. >> > > > > >> > > > > One approach is to add support for SEV Migration to the Linux kernel. >> > > > > This would allow the guest to encrypt/decrypt its own memory with a >> > > > > transport key. This approach has met some resistance. We propose a >> > > > > similar approach implemented not in Linux, but in firmware, >> > > > > specifically >> > > > > OVMF. Since OVMF runs inside the guest, it has access to the guest >> > > > > memory and CPU state. OVMF should be able to perform the manipulations >> > > > > required for live migration of SEV and SEV-ES guests. >> > > > > >> > > > > The biggest challenge of this approach involves migrating the CPU >> > > > > state >> > > > > of an SEV-ES guest. In a normal (non-SEV migration) the HV sets the >> > > > > CPU >> > > > > state of the target before the target begins executing. In our >> > > > > approach, >> > > > > the HV starts the target and OVMF must resume to whatever state the >> > > > > source was in. We believe this to be the crux (or at least the most >> > > > > difficult part) of live migration for SEV and we hope that by >> > > > > demonstrating resume from EFI, we can show that our approach is >> > > > > generally feasible. 
>> > > > > >> > > > > Our demo can be found at <https://github.com/secure-migration>. The >> > > > > tooling repository is the best starting point. It contains >> > > > > documentation >> > > > > about the project and the scripts needed to run the demo. There are >> > > > > two >> > > > > more repos associated with the project. One is a modified edk2 tree >> > > > > that >> > > > > contains our modified OVMF. The other is a modified qemu, that has a >> > > > > couple of temporary changes needed for the demo. Our demonstration is >> > > > > aimed only at resuming from a VM snapshot in OVMF. We provide the >> > > > > source >> > > > > CPU state and source memory to the destination using temporary >> > > > > plumbing >> > > > > that violates the SEV trust model. We explain the setup in more >> > > > > depth in >> > > > > README.md. We are showing only that OVMF can resume from a VM >> > > > > snapshot. >> > > > > At the end we will describe our plan for transferring CPU state and >> > > > > memory from source to guest. To be clear, the temporary tooling used >> > > > > for >> > > > > this demo isn't built for encrypted VMs, but below we explain how this >> > > > > demo applies to and can be extended to encrypted VMs. >> > > > > >> > > > > We Implemented our resume code in a very similar fashion to the >> > > > > recommended S3 resume code. When the HV sets the CPU state of a guest, >> > > > > it can do so when the guest is not executing. Setting the state from >> > > > > inside the guest is a delicate operation. There is no way to >> > > > > atomically >> > > > > set all of the CPU state from inside the guest.
Instead, we must set >> > > > > most registers individually and account for changes in control flow >> > > > > that >> > > > > doing so might cause. We do this with a three-phase trampoline. OVMF >> > > > > calls phase 1, which runs on the OVMF map. Phase 1 sets up phase 2 and >> > > > > jumps to it. Phase 2 switches to an intermediate map that reconciles >> > > > > the >> > > > > OVMF map and the source map. Phase 3 switches to the source map, >> > > > > restores the registers, and returns into execution of the source. We >> > > > > will go backwards through these phases in more depth. >> > > > > >> > > > > The last thing that resume to EFI does is return. Specifically, we use >> > > > > IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS from a >> > > > > temporary stack and restores them atomically, thus returning to source >> > > > > execution. Prior to returning, we must manually restore most other >> > > > > registers to the values they had on the source. One particularly >> > > > > significant register is CR3. When we return to Linux, CR3 must be >> > > > > set to >> > > > > the source CR3 or the first instruction executed in Linux will cause a >> > > > > page fault. The code that we use to restore the registers and return >> > > > > must be mapped in the source page table or we would get a page fault >> > > > > executing the instructions prior to returning into Linux. The value of >> > > > > CR3 is so significant, that it defines the three phases of the >> > > > > trampoline. Phase 3 begins when CR3 is set to the source CR3. After >> > > > > setting CR3, we set all the other registers and return. >> > > > > >> > > > > Phase 2 mainly exists to setup phase 3. OVMF uses a 1-1 mapping, >> > > > > meaning >> > > > > that virtual addresses are the same as physical addresses. The kernel >> > > > > page table uses an offset mapping, meaning that virtual addresses >> > > > > differ >> > > > > from physical addresses by a constant (for the most part). 
Crucially, >> > > > > this means that the virtual address of the page that is executed by >> > > > > phase 3 differs between the OVMF map and the source map. If we are >> > > > > executing code mapped in OVMF and we change CR3 to point to the source >> > > > > map, although the page may be mapped in the source map, the virtual >> > > > > address will be different, and we will face undefined behavior. To fix >> > > > > this, we construct intermediate page tables that map the pages for >> > > > > phase >> > > > > 2 and 3 to the virtual address expected in OVMF and to the virtual >> > > > > address expected in the source map. Thus, we can switch CR3 from >> > > > > OVMF's >> > > > > map to the intermediate map and then from the intermediate map to the >> > > > > source map. Phase 2 is much shorter than phase 3. Phase 2 is mainly >> > > > > responsible for switching to the intermediate map, flushing the TLB, >> > > > > and >> > > > > jumping to phase 3. >> > > > > >> > > > > Fortunately phase 1 is even simpler than phase 2. Phase 1 has two >> > > > > duties. First, since phase 2 and 3 operate without a stack and can't >> > > > > access values defined in OVMF (such as the addresses of the pages >> > > > > containing phase 2 and 3), phase 1 must pass these values to phase 2 >> > > > > by >> > > > > putting them in registers. Second, phase 1 must start phase 2 by >> > > > > jumping >> > > > > to it. >> > > > > >> > > > > Given that we can resume to a snapshot in OVMF, we should be able to >> > > > > migrate an SEV guest as long as we can securely communicate the VM >> > > > > snapshot from source to destination. For our demo, we do this with a >> > > > > handful of QMP commands. More sophisticated methods are required for a >> > > > > production implementation. >> > > > > >> > > > > When we refer to a snapshot, what we really mean is the device state, >> > > > > memory, and CPU state of a guest. 
In live migration this is >> > > > > transmitted >> > > > > dynamically as opposed to being saved and restored. Device state is >> > > > > not >> > > > > protected by SEV and can be handled entirely by the HV. Memory, on the >> > > > > other hand, cannot be handled only by the HV. As mentioned previously, >> > > > > memory needs to be encrypted with a transport key. A Migration Handler >> > > > > on the source will coordinate with the HV to encrypt pages and >> > > > > transmit >> > > > > them to the destination. The destination HV will receive the pages >> > > > > over >> > > > > the network and pass them to the Migration Handler in the target VM so >> > > > > they can be decrypted. This transmission will occur continuously until >> > > > > the memory of the source and target converges. >> > > > > >> > > > > Plain SEV does not protect the CPU state of the guest and therefore >> > > > > does >> > > > > not require any special mechanism for transmission of the CPU state. >> > > > > We >> > > > > plan to implement an end-to-end migration with plain SEV first. In >> > > > > SEV-ES, the PSP (platform security processor) encrypts CPU state on >> > > > > each >> > > > > VMExit. The encrypted state is stored in memory. Normally this memory >> > > > > (known as the VMSA) is not mapped into the guest, but we can add an >> > > > > entry to the nested page tables that will expose the VMSA to the >> > > > > guest. >> > > > > This means that when the guest VMExits, the CPU state will be saved to >> > > > > guest memory. With the CPU state in guest memory, it can be >> > > > > transmitted >> > > > > to the target using the method described above. >> > > > > >> > > > > In addition to the changes needed in OVMF to resume the VM, the >> > > > > transmission of the VM from source to target will require a new code >> > > > > path in the hypervisor. There will also need to be a few minor changes >> > > > > to Linux (adding a mapping for our Phase 3 pages). 
>> > > > > Despite all the moving pieces, we believe that this is a feasible
>> > > > > approach for supporting live migration for SEV and SEV-ES.
>> > > > >
>> > > > > For the sake of brevity, we have left out a few issues, including SMP
>> > > > > support, generation of the intermediate mappings, and more. We have
>> > > > > included some notes about these issues in the COMPLICATIONS.md file.
>> > > > > We also have an outline of an end-to-end implementation of live
>> > > > > migration for SEV-ES in END-TO-END.md. See README.md for info on how
>> > > > > to run the demo. While this is not a full migration, we hope to show
>> > > > > that fast live migration with SEV and SEV-ES is possible without
>> > > > > major kernel changes.
>> > > > >
>> > > > > -Tobin
>> > > >
>> > > > the one word that comes to my mind upon reading the above is,
>> > > > "overwhelming".
>> > > >
>> > > > (I have not been addressed directly, but:
>> > > >
>> > > > - the subject says "RFC",
>> > > >
>> > > > - and the documentation at
>> > > >
>> > > > https://github.com/secure-migration/resume-from-edk2-tooling#what-changes-did-we-make
>> > > >
>> > > > states that AmdSevPkg was created for convenience, and that the feature
>> > > > could be integrated into OVMF. (Paraphrased.)
>> > > >
>> > > > So I guess it's tolerable if I make a comment: )
>> > > >
>> > > We've been looking forward to your perspective.
>> > >
>> > > > I've checked out the "mh-state-dev" branch of
>> > > > <https://github.com/secure-migration/resume-from-efi-edk2.git>. It has
>> > > > 80 commits on top of edk2 master (base commit: d5339c04d7cd,
>> > > > "UefiCpuPkg/MpInitLib: Add missing explicit PcdLib dependency",
>> > > > 2020-04-23).
>> > > >
>> > > > These commits were authored over the 6-7 months since April. It's
>> > > > obviously huge work. To me, most of these commits clearly aim at getting
>> > > > the demo / proof-of-concept functional, rather than guiding (more
>> > > > precisely: hand-holding) reviewers through the construction of the
>> > > > feature.
>> > > >
>> > > > In my opinion, the series is not upstreamable in its current format
>> > > > (which is presently not much more readable than a single-commit code
>> > > > drop). Upstreaming is probably not your intent, either, at this time.
>> > > >
>> > > > I agree that getting feedback ("buy-in") at this level of maturity is
>> > > > justified from your POV, before you invest more work into cleaning up /
>> > > > restructuring the series.
>> > > >
>> > > > My problem is that "hand-holding" is exactly what I'd need -- I cannot
>> > > > dedicate one or two weeks, as an indivisible block, to understanding
>> > > > your design. Nor can I approach the series patch-wise in its current
>> > > > format. Personally I would need the patch series to lead me through the
>> > > > whole design with baby steps ("ELI5"), meaning small code changes and
>> > > > detailed commit messages. I'd *also* need the more comprehensive
>> > > > guide-like documentation, as background material.
>> > > > >> > > > Furthermore, I don't have an environment where I can test this >> > > > proof-of-concept (and provide you with further incentive for cleaning up >> > > > the series, by reporting success). >> > > > >> > > > So I hope others can spend the time discussing the design with you, and >> > > > testing / repeating the demo. For me to review the patches, the patches >> > > > should condense and replay your thinking process from the last 7 months, >> > > > in as small as possible logical steps. (On the list.) >> > > > >> > > I completely understand your position. This PoC has a lot of >> > > new ideas in it and you're right that our main priority was not >> > > to hand-hold/guide reviewers through the code. >> > > >> > > One thing that is worth emphasizing is that the pieces we >> > > are showcasing here are not the immediate priority when it >> > > comes to upstreaming. Specifically, we looked into the trampoline >> > > to make sure it was possible to migrate CPU state via firmware. >> > > While we need this for SEV-ES and our goal is to support SEV-ES, >> > > it is not the first step. We are currently working on a PoC for >> > > a full end-to-end migration with SEV (non-ES), which may be a better >> > > place for us to begin a serious discussion about getting things >> > > upstream. We will focus more on making these patches accessible >> > > to the upstream community. >> > >> > With my migration maintainer hat on, I'd like to understand a bit more >> > about these different approaches; they could be quite invasive, so I'd >> > like to make sure we're not doing one and throwing it away - it would >> > be great if you could explain your non-ES approach; you don't need to >> > have POC code to explain it. >> > >> Our non-ES approach is a subset of our ES approach. For ES, the >> Migration Handler in the guest needs to help out with memory and >> CPU state. For plain SEV, the HV can set the CPU state, but we still >> need a way to transfer the memory. 
The current POC only deals >> with the CPU state. >> >> We're still working out some of the details in QEMU, but the basic >> idea of transferring memory is that each time the HV needs to send a >> page to the target, it will ask the Migration Handler in the guest >> for a version of the page that is encrypted with a transport key. >> Since the MH is inside the guest, it can read from any address >> in guest memory. The Migration Handlers on the source and the target >> will share a key. Once the source encrypts the requested page with >> the transport key, it can safely hand it off to the HV. Once the page >> reaches the target, the target HV will pass the page into the >> Migration Handler, which will decrypt using the transport key and >> move the page to the appropriate address. >> >> A few things to note: >> >> - The Migration Handler on the source needs to be running in the >> guest alongside the VM. On the target, the MH needs to startup >> before we can receive any pages. In both cases we are thinking >> that an additional vCPU can be started for the MH to run on. >> This could be spawned dynamically or live for the duration of >> the guest. >> >> - We need to make sure that the Migration Handler on the target >> does not overwrite itself when it receives pages from the >> source. Since we run the same firmware on the source and >> target, and since the MH is runtime code, the memory >> footprint of the MH should match on the source and the >> target. We will need to make sure there are no weird >> relocations. >> >> - There are some complexities arising from the fact that not >> every page in an SEV VM is encrypted. We are looking into >> the best way to handle encrypted vs. shared pages. >> > > Raising this question here as part of this discussion ... are you > thinking of adding the page encryption bitmap (as we do for the slow > migration patches) here to figure out if the guest pages are encrypted > or not ? 
>
We are using the bitmap for the first iteration of our end-to-end POC.

> The page encryption status will need notifications from the guest
> kernel and OVMF.
>
> Additionally, is the page encryption bitmap support going to be added
> as a hypercall interface to the guest, which also means that the
> guest kernel needs to be modified ?

Although the bitmap is handy, we would like to avoid the patches you are
alluding to. We are currently looking into how we can eliminate the
bitmap.

-Tobin

>
> Thanks,
> Ashish
>
>> Hopefully those notes don't confound my earlier explanation too
>> much. I think that's most of the picture for non-ES migration.
>> Let me know if you have any questions. ES migration would use
>> the same approach for transferring memory.
>>
>> -Tobin
>>
>> > Dave
>> >
>> > > In the meantime, perhaps there is something we can do to help
>> > > make our current work more clear. We could potentially explain
>> > > things on a call or create some additional documentation. While
>> > > our goal is not to shove this version of the trampoline upstream,
>> > > it is significant to our plan as a whole and we want to help
>> > > people understand it.
>> > >
>> > > -Tobin
>> > >
>> > > > I really don't want to be the bottleneck here, which is why I would
>> > > > support introducing this feature as a separate top-level package
>> > > > (AmdSevPkg).
>> > > >
>> > > > Thanks
>> > > > Laszlo
>> > >

^ permalink raw reply [flat|nested] 16+ messages in thread
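The intermediate-map argument in the quoted RFC text (a 1-1 map in OVMF, an offset map in the source kernel, and a third map that carries the trampoline pages at both addresses) can be modeled in a few lines of Python. This is only a toy sketch, not code from the demo repositories: page tables are plain dictionaries from virtual to physical page addresses, and the helper names, the trampoline address, and the offset value are all made up for illustration.

```python
# Toy model of the three page maps used by the trampoline.
# Every name and constant here is hypothetical (illustration only).

def identity_map(pages):
    """OVMF-style 1-1 map: virtual address == physical address."""
    return {pa: pa for pa in pages}

def offset_map(pages, offset):
    """Kernel-style offset map: virtual address == physical address + offset."""
    return {pa + offset: pa for pa in pages}

def intermediate_map(trampoline_pages, offset):
    """Map each trampoline page at BOTH its OVMF address and its source address."""
    table = identity_map(trampoline_pages)
    table.update(offset_map(trampoline_pages, offset))
    return table

def can_switch(current_map, next_map, rip):
    """Changing CR3 is safe only if RIP resolves to the same physical page
    before and after the switch."""
    return rip in current_map and next_map.get(rip) == current_map[rip]

OFFSET = 0xFFFF888000000000            # illustrative offset, an assumption
RAM = [0x100000, 0x800000, 0x900000]   # a few physical pages
TRAMPOLINE = [0x800000]                # hypothetical page holding phases 2 and 3

ovmf = identity_map(RAM)
source = offset_map(RAM, OFFSET)
inter = intermediate_map(TRAMPOLINE, OFFSET)

rip = TRAMPOLINE[0]
direct = can_switch(ovmf, source, rip)        # OVMF map -> source map directly
step1 = can_switch(ovmf, inter, rip)          # phase 2: OVMF map -> intermediate map
rip_high = rip + OFFSET                       # jump to the source-map alias (phase 3)
step2 = can_switch(inter, source, rip_high)   # phase 3: intermediate map -> source map
```

In this model `direct` is False while `step1` and `step2` are both True, which is the reason the trampoline needs the intermediate step: the page being executed has to be reachable at the current instruction pointer on both sides of every CR3 write.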
* Re: [edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept
  2020-11-09 20:27 ` Tobin Feldman-Fitzthum
@ 2020-11-09 20:34 ` Kalra, Ashish
  0 siblings, 0 replies; 16+ messages in thread
From: Kalra, Ashish @ 2020-11-09 20:34 UTC (permalink / raw)
  To: Tobin Feldman-Fitzthum
  Cc: Dr. David Alan Gilbert, Laszlo Ersek, devel@edk2.groups.io,
	dovmurik@linux.vnet.ibm.com, Dov.Murik1@il.ibm.com, Singh, Brijesh,
	tobin@ibm.com, Kaplan, David, Grimm, Jon, Lendacky, Thomas,
	jejb@linux.ibm.com, frankeh@us.ibm.com

[AMD Public Use]

Hello Tobin,

-----Original Message-----
From: Tobin Feldman-Fitzthum <tobin@linux.ibm.com>
Sent: Monday, November 9, 2020 2:28 PM
To: Kalra, Ashish <Ashish.Kalra@amd.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>; Laszlo Ersek <lersek@redhat.com>; devel@edk2.groups.io; dovmurik@linux.vnet.ibm.com; Dov.Murik1@il.ibm.com; Singh, Brijesh <brijesh.singh@amd.com>; tobin@ibm.com; Kaplan, David <David.Kaplan@amd.com>; Grimm, Jon <Jon.Grimm@amd.com>; Lendacky, Thomas <Thomas.Lendacky@amd.com>; jejb@linux.ibm.com; frankeh@us.ibm.com
Subject: Re: [edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept

On 2020-11-06 17:17, Ashish Kalra wrote:
> Hello Tobin,
>
> On Fri, Nov 06, 2020 at 04:48:12PM -0500, Tobin Feldman-Fitzthum wrote:
>> On 2020-11-06 11:38, Dr. David Alan Gilbert wrote:
>> > * Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote:
>> > > On 2020-11-03 09:59, Laszlo Ersek wrote:
>> > > > Hi Tobin,
>> > > >
>> > > > (keeping full context -- I'm adding Dave)
>> > > >
>> > > > On 10/28/20 20:31, Tobin Feldman-Fitzthum wrote:
>> > > > > Hello,
>> > > > >
>> > > > > Dov Murik, James Bottomley, Hubertus Franke, and I have been
>> > > > > working on a plan for fast live migration of SEV and SEV-ES
>> > > > > (and SEV-SNP when it's out and even hopefully Intel TDX) VMs.
>> > > > > We have developed an approach that we believe is feasible and >> > > > > a demonstration that shows our solution to the most difficult >> > > > > part of the problem. In short, we have implemented a UEFI >> > > > > Application that can resume from a VM snapshot. We think this >> > > > > is the crux of SEV-ES live migration. After describing the >> > > > > context of our demo and how it works, we explain how it can >> > > > > be extended to a full SEV-ES migration. Our goal is to show >> > > > > that fast SEV and SEV-ES live migration can be implemented in >> > > > > OVMF with minimal kernel changes. We provide a blueprint for >> > > > > doing so. >> > > > > >> > > > > Typically the hypervisor facilitates live migration. AMD SEV >> > > > > excludes the hypervisor from the trust domain of the guest. >> > > > > When a hypervisor >> > > > > (HV) examines the memory of an SEV guest, it will find only a >> > > > > ciphertext. If the HV moves the memory of an SEV guest, the >> > > > > ciphertext will be invalidated. Furthermore, with SEV-ES the >> > > > > hypervisor is largely unable to access guest CPU state. Thus, >> > > > > fast migration of SEV VMs requires support from inside the >> > > > > trust domain, i.e. the guest. >> > > > > >> > > > > One approach is to add support for SEV Migration to the Linux kernel. >> > > > > This would allow the guest to encrypt/decrypt its own memory >> > > > > with a transport key. This approach has met some resistance. >> > > > > We propose a similar approach implemented not in Linux, but >> > > > > in firmware, specifically OVMF. Since OVMF runs inside the >> > > > > guest, it has access to the guest memory and CPU state. OVMF >> > > > > should be able to perform the manipulations required for live >> > > > > migration of SEV and SEV-ES guests. >> > > > > >> > > > > The biggest challenge of this approach involves migrating the >> > > > > CPU state of an SEV-ES guest. 
>> > > > > In a normal (non-SEV) migration
>> > > > > the HV sets the CPU state of the target before the target
>> > > > > begins executing. In our approach, the HV starts the target
>> > > > > and OVMF must resume to whatever state the source was in. We
>> > > > > believe this to be the crux (or at least the most difficult
>> > > > > part) of live migration for SEV and we hope that by
>> > > > > demonstrating resume from EFI, we can show that our approach
>> > > > > is generally feasible.
>> > > > >
>> > > > > Our demo can be found at
>> > > > > <https://github.com/secure-migration>. The tooling
>> > > > > repository is the best starting point. It contains
>> > > > > documentation about the project and the scripts needed to run
>> > > > > the demo. There are two more repos associated with the
>> > > > > project. One is a modified edk2 tree that contains our
>> > > > > modified OVMF. The other is a modified qemu that has a
>> > > > > couple of temporary changes needed for the demo. Our
>> > > > > demonstration is aimed only at resuming from a VM snapshot in
>> > > > > OVMF. We provide the source CPU state and source memory to
>> > > > > the destination using temporary plumbing that violates the
>> > > > > SEV trust model. We explain the setup in more depth in
>> > > > > README.md. We are showing only that OVMF can resume from a VM
>> > > > > snapshot. At the end we will describe our plan for
>> > > > > transferring CPU state and memory from source to guest.
To be clear, the >> > > > > temporary tooling used for this demo isn't built for >> > > > > encrypted VMs, but below we explain how this demo applies to >> > > > > and can be extended to encrypted VMs. >> > > > > >> > > > > We Implemented our resume code in a very similar fashion to >> > > > > the recommended S3 resume code. When the HV sets the CPU >> > > > > state of a guest, it can do so when the guest is not >> > > > > executing. Setting the state from inside the guest is a >> > > > > delicate operation. There is no way to atomically set all of >> > > > > the CPU state from inside the guest. Instead, we must set >> > > > > most registers individually and account for changes in >> > > > > control flow that doing so might cause. We do this with a >> > > > > three-phase trampoline. OVMF calls phase 1, which runs on the >> > > > > OVMF map. Phase 1 sets up phase 2 and jumps to it. Phase 2 >> > > > > switches to an intermediate map that reconciles the OVMF map >> > > > > and the source map. Phase 3 switches to the source map, >> > > > > restores the registers, and returns into execution of the >> > > > > source. We will go backwards through these phases in more depth. >> > > > > >> > > > > The last thing that resume to EFI does is return. >> > > > > Specifically, we use IRETQ, which reads the values of RIP, >> > > > > CS, RFLAGS, RSP, and SS from a temporary stack and restores >> > > > > them atomically, thus returning to source execution. Prior to >> > > > > returning, we must manually restore most other registers to >> > > > > the values they had on the source. One particularly >> > > > > significant register is CR3. When we return to Linux, CR3 >> > > > > must be set to the source CR3 or the first instruction >> > > > > executed in Linux will cause a page fault. 
The code that we >> > > > > use to restore the registers and return must be mapped in the >> > > > > source page table or we would get a page fault executing the >> > > > > instructions prior to returning into Linux. The value of >> > > > > CR3 is so significant, that it defines the three phases of >> > > > > the trampoline. Phase 3 begins when CR3 is set to the source >> > > > > CR3. After setting CR3, we set all the other registers and return. >> > > > > >> > > > > Phase 2 mainly exists to setup phase 3. OVMF uses a 1-1 >> > > > > mapping, meaning that virtual addresses are the same as >> > > > > physical addresses. The kernel page table uses an offset >> > > > > mapping, meaning that virtual addresses differ from physical >> > > > > addresses by a constant (for the most part). Crucially, this >> > > > > means that the virtual address of the page that is executed >> > > > > by phase 3 differs between the OVMF map and the source map. >> > > > > If we are executing code mapped in OVMF and we change CR3 to >> > > > > point to the source map, although the page may be mapped in >> > > > > the source map, the virtual address will be different, and we >> > > > > will face undefined behavior. To fix this, we construct >> > > > > intermediate page tables that map the pages for phase >> > > > > 2 and 3 to the virtual address expected in OVMF and to the >> > > > > virtual address expected in the source map. Thus, we can >> > > > > switch CR3 from OVMF's map to the intermediate map and then >> > > > > from the intermediate map to the source map. Phase 2 is much >> > > > > shorter than phase 3. Phase 2 is mainly responsible for >> > > > > switching to the intermediate map, flushing the TLB, and >> > > > > jumping to phase 3. >> > > > > >> > > > > Fortunately phase 1 is even simpler than phase 2. Phase 1 has >> > > > > two duties. 
First, since phase 2 and 3 operate without a >> > > > > stack and can't access values defined in OVMF (such as the >> > > > > addresses of the pages containing phase 2 and 3), phase 1 >> > > > > must pass these values to phase 2 by putting them in >> > > > > registers. Second, phase 1 must start phase 2 by jumping to >> > > > > it. >> > > > > >> > > > > Given that we can resume to a snapshot in OVMF, we should be >> > > > > able to migrate an SEV guest as long as we can securely >> > > > > communicate the VM snapshot from source to destination. For >> > > > > our demo, we do this with a handful of QMP commands. More >> > > > > sophisticated methods are required for a production implementation. >> > > > > >> > > > > When we refer to a snapshot, what we really mean is the >> > > > > device state, memory, and CPU state of a guest. In live >> > > > > migration this is transmitted dynamically as opposed to being >> > > > > saved and restored. Device state is not protected by SEV and >> > > > > can be handled entirely by the HV. Memory, on the other hand, >> > > > > cannot be handled only by the HV. As mentioned previously, >> > > > > memory needs to be encrypted with a transport key. A >> > > > > Migration Handler on the source will coordinate with the HV >> > > > > to encrypt pages and transmit them to the destination. The >> > > > > destination HV will receive the pages over the network and >> > > > > pass them to the Migration Handler in the target VM so they >> > > > > can be decrypted. This transmission will occur continuously >> > > > > until the memory of the source and target converges. >> > > > > >> > > > > Plain SEV does not protect the CPU state of the guest and >> > > > > therefore does not require any special mechanism for >> > > > > transmission of the CPU state. >> > > > > We >> > > > > plan to implement an end-to-end migration with plain SEV >> > > > > first. In SEV-ES, the PSP (platform security processor) >> > > > > encrypts CPU state on each VMExit. 
The encrypted state is >> > > > > stored in memory. Normally this memory (known as the VMSA) is >> > > > > not mapped into the guest, but we can add an entry to the >> > > > > nested page tables that will expose the VMSA to the guest. >> > > > > This means that when the guest VMExits, the CPU state will be >> > > > > saved to guest memory. With the CPU state in guest memory, it >> > > > > can be transmitted to the target using the method described >> > > > > above. >> > > > > >> > > > > In addition to the changes needed in OVMF to resume the VM, >> > > > > the transmission of the VM from source to target will require >> > > > > a new code path in the hypervisor. There will also need to be >> > > > > a few minor changes to Linux (adding a mapping for our Phase >> > > > > 3 pages). Despite all the moving pieces, we believe that this >> > > > > is a feasible approach for supporting live migration for SEV and SEV-ES. >> > > > > >> > > > > For the sake of brevity, we have left out a few issues, >> > > > > including SMP support, generation of the intermediate >> > > > > mappings, and more. We have included some notes about these issues in the COMPLICATIONS.md file. >> > > > > We >> > > > > also have an outline of an end-to-end implementation of live >> > > > > migration for SEV-ES in END-TO-END.md. See README.md for info >> > > > > on how to run the demo. While this is not a full migration, >> > > > > we hope to show that fast live migration with SEV and SEV-ES >> > > > > is possible without major kernel changes. >> > > > > >> > > > > -Tobin >> > > > >> > > > the one word that comes to my mind upon reading the above is, >> > > > "overwhelming". 
>> > > >
>> > > > (I have not been addressed directly, but:
>> > > >
>> > > > - the subject says "RFC",
>> > > >
>> > > > - and the documentation at
>> > > >
>> > > > https://github.com/secure-migration/resume-from-edk2-tooling#what-changes-did-we-make
>> > > >
>> > > > states that AmdSevPkg was created for convenience, and that the
>> > > > feature could be integrated into OVMF. (Paraphrased.)
>> > > >
>> > > > So I guess it's tolerable if I make a comment: )
>> > > >
>> > > We've been looking forward to your perspective.
>> > >
>> > > > I've checked out the "mh-state-dev" branch of
>> > > > <https://github.com/secure-migration/resume-from-efi-edk2.git>.
>> > > > It has
>> > > > 80 commits on top of edk2 master (base commit: d5339c04d7cd,
>> > > > "UefiCpuPkg/MpInitLib: Add missing explicit PcdLib dependency",
>> > > > 2020-04-23).
>> > > >
>> > > > These commits were authored over the 6-7 months since April.
>> > > > It's obviously huge work.
To me, most of these commits clearly >> > > > aim at getting the demo / proof-of-concept functional, rather >> > > > than guiding (more >> > > > precisely: hand-holding) reviewers through the construction of >> > > > the feature. >> > > > >> > > > In my opinion, the series is not upstreamable in its current >> > > > format (which is presently not much more readable than a >> > > > single-commit code drop). Upstreaming is probably not your intent, either, at this time. >> > > > >> > > > I agree that getting feedback ("buy-in") at this level of >> > > > maturity is justified from your POV, before you invest more >> > > > work into cleaning up / restructuring the series. >> > > > >> > > > My problem is that "hand-holding" is exactly what I'd need -- I >> > > > cannot dedicate one or two weeks, as an indivisible block, to >> > > > understanding your design. Nor can I approach the series >> > > > patch-wise in its current format. Personally I would need the >> > > > patch series to lead me through the whole design with baby >> > > > steps ("ELI5"), meaning small code changes and detailed commit >> > > > messages. I'd *also* need the more comprehensive guide-like documentation, as background material. >> > > > >> > > > Furthermore, I don't have an environment where I can test this >> > > > proof-of-concept (and provide you with further incentive for >> > > > cleaning up the series, by reporting success). >> > > > >> > > > So I hope others can spend the time discussing the design with >> > > > you, and testing / repeating the demo. For me to review the >> > > > patches, the patches should condense and replay your thinking >> > > > process from the last 7 months, in as small as possible logical >> > > > steps. (On the list.) >> > > > >> > > I completely understand your position. This PoC has a lot of new >> > > ideas in it and you're right that our main priority was not to >> > > hand-hold/guide reviewers through the code. 
>> > > >> > > One thing that is worth emphasizing is that the pieces we are >> > > showcasing here are not the immediate priority when it comes to >> > > upstreaming. Specifically, we looked into the trampoline to make >> > > sure it was possible to migrate CPU state via firmware. >> > > While we need this for SEV-ES and our goal is to support SEV-ES, >> > > it is not the first step. We are currently working on a PoC for a >> > > full end-to-end migration with SEV (non-ES), which may be a >> > > better place for us to begin a serious discussion about getting >> > > things upstream. We will focus more on making these patches >> > > accessible to the upstream community. >> > >> > With my migration maintainer hat on, I'd like to understand a bit >> > more about these different approaches; they could be quite >> > invasive, so I'd like to make sure we're not doing one and throwing >> > it away - it would be great if you could explain your non-ES >> > approach; you don't need to have POC code to explain it. >> > >> Our non-ES approach is a subset of our ES approach. For ES, the >> Migration Handler in the guest needs to help out with memory and CPU >> state. For plain SEV, the HV can set the CPU state, but we still need >> a way to transfer the memory. The current POC only deals with the CPU >> state. >> >> We're still working out some of the details in QEMU, but the basic >> idea of transferring memory is that each time the HV needs to send a >> page to the target, it will ask the Migration Handler in the guest >> for a version of the page that is encrypted with a transport key. >> Since the MH is inside the guest, it can read from any address in >> guest memory. The Migration Handlers on the source and the target >> will share a key. Once the source encrypts the requested page with >> the transport key, it can safely hand it off to the HV. 
Once the page >> reaches the target, the target HV will pass the page into the >> Migration Handler, which will decrypt using the transport key and >> move the page to the appropriate address. >> >> A few things to note: >> >> - The Migration Handler on the source needs to be running in the >> guest alongside the VM. On the target, the MH needs to startup >> before we can receive any pages. In both cases we are thinking >> that an additional vCPU can be started for the MH to run on. >> This could be spawned dynamically or live for the duration of >> the guest. >> >> - We need to make sure that the Migration Handler on the target >> does not overwrite itself when it receives pages from the >> source. Since we run the same firmware on the source and >> target, and since the MH is runtime code, the memory >> footprint of the MH should match on the source and the >> target. We will need to make sure there are no weird >> relocations. >> >> - There are some complexities arising from the fact that not >> every page in an SEV VM is encrypted. We are looking into >> the best way to handle encrypted vs. shared pages. >> > > Raising this question here as part of this discussion ... are you > thinking of adding the page encryption bitmap (as we do for the slow > migration patches) here to figure out if the guest pages are encrypted > or not? > > We are using the bitmap for the first iteration of our end-to-end POC. Ok. > The page encryption status will need notifications from the guest > kernel and OVMF. > > Additionally, is the page encryption bitmap support going to be added > as a hypercall interface to the guest, which also means that the guest > kernel needs to be modified? > Although the bitmap is handy, we would like to avoid the patches you are alluding to. We are currently looking into how we can eliminate the bitmap.
Please note, the page encryption bitmap is also required for SEV guest page migration and SEV guest debug support, so it might be useful to have these patches available. If you want us to push Brijesh's and my patches for the page encryption bitmap separately for the kernel, then let us know. Thanks, Ashish > >> Hopefully those notes don't confound my earlier explanation too much. >> I think that's most of the picture for non-ES migration. >> Let me know if you have any questions. ES migration would use the >> same approach for transferring memory. >> >> -Tobin >> >> > Dave >> > >> > > In the meantime, perhaps there is something we can do to help >> > > make our current work more clear. We could potentially explain >> > > things on a call or create some additional documentation. While >> > > our goal is not to shove this version of the trampoline upstream, >> > > it is significant to our plan as a whole and we want to help >> > > people understand it. >> > > >> > > -Tobin >> > > >> > > > I really don't want to be the bottleneck here, which is why I >> > > > would support introducing this feature as a separate top-level >> > > > package (AmdSevPkg). >> > > > >> > > > Thanks >> > > > Laszlo >> > > ^ permalink raw reply [flat|nested] 16+ messages in thread
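[Editorial illustration] The page encryption bitmap discussed in the message above can be pictured with a minimal Python sketch. It is not taken from the kernel patches mentioned in the thread. It assumes one bit per guest page frame (1 = encrypted, 0 = shared), 4 KiB pages, and hypothetical names (`PageEncryptionBitmap`, `plan_transfer`). The real hypercall interface and data layout may differ.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB guest pages


class PageEncryptionBitmap:
    """Hypothetical one-bit-per-page map: 1 = encrypted, 0 = shared."""

    def __init__(self, num_pages, default_encrypted=True):
        self.num_pages = num_pages
        fill = 0xFF if default_encrypted else 0x00
        self.bits = bytearray([fill]) * ((num_pages + 7) // 8)

    def set_range(self, gpa, length, encrypted):
        """Record the status of every page touched by [gpa, gpa + length)."""
        first = gpa // PAGE_SIZE
        last = (gpa + length - 1) // PAGE_SIZE
        if last >= self.num_pages:
            raise ValueError("range extends past the end of guest memory")
        for gfn in range(first, last + 1):
            if encrypted:
                self.bits[gfn // 8] |= 1 << (gfn % 8)
            else:
                self.bits[gfn // 8] &= ~(1 << (gfn % 8)) & 0xFF

    def is_encrypted(self, gfn):
        return bool((self.bits[gfn // 8] >> (gfn % 8)) & 1)


def plan_transfer(bitmap, dirty_gfns):
    """Decide, per dirty page, how the hypervisor would send it.

    Encrypted pages must go through the in-guest Migration Handler;
    shared pages are readable by the hypervisor and can be sent as-is.
    """
    return [
        (gfn, "migration_handler" if bitmap.is_encrypted(gfn) else "plain")
        for gfn in dirty_gfns
    ]
```

In this sketch the guest (kernel or OVMF) would call something like `set_range` whenever it changes a page's encryption status, which is the "notification" the message refers to. The hypervisor would consult `plan_transfer` on each migration pass.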
* Re: [edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept 2020-11-06 21:48 ` Tobin Feldman-Fitzthum 2020-11-06 22:17 ` Ashish Kalra @ 2020-11-09 19:56 ` Dr. David Alan Gilbert 2020-11-09 22:37 ` Tobin Feldman-Fitzthum 1 sibling, 1 reply; 16+ messages in thread From: Dr. David Alan Gilbert @ 2020-11-09 19:56 UTC (permalink / raw) To: Tobin Feldman-Fitzthum Cc: Laszlo Ersek, devel, dovmurik, Dov.Murik1, ashish.kalra, brijesh.singh, tobin, david.kaplan, jon.grimm, thomas.lendacky, jejb, frankeh * Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote: > On 2020-11-06 11:38, Dr. David Alan Gilbert wrote: > > * Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote: > > > On 2020-11-03 09:59, Laszlo Ersek wrote: > > > > Hi Tobin, > > > > > > > > (keeping full context -- I'm adding Dave) > > > > > > > > On 10/28/20 20:31, Tobin Feldman-Fitzthum wrote: > > > > > Hello, > > > > > > > > > > Dov Murik. James Bottomley, Hubertus Franke, and I have been working > > > > > on > > > > > a plan for fast live migration of SEV and SEV-ES (and SEV-SNP when > > > > > it's > > > > > out and even hopefully Intel TDX) VMs. We have developed an approach > > > > > that we believe is feasible and a demonstration that shows our > > > > > solution > > > > > to the most difficult part of the problem. In short, we have > > > > > implemented > > > > > a UEFI Application that can resume from a VM snapshot. We think this > > > > > is > > > > > the crux of SEV-ES live migration. After describing the context of our > > > > > demo and how it works, we explain how it can be extended to a full > > > > > SEV-ES migration. Our goal is to show that fast SEV and SEV-ES live > > > > > migration can be implemented in OVMF with minimal kernel changes. We > > > > > provide a blueprint for doing so. > > > > > > > > > > Typically the hypervisor facilitates live migration. AMD SEV excludes > > > > > the hypervisor from the trust domain of the guest. 
When a hypervisor > > > > > (HV) examines the memory of an SEV guest, it will find only a > > > > > ciphertext. If the HV moves the memory of an SEV guest, the ciphertext > > > > > will be invalidated. Furthermore, with SEV-ES the hypervisor is > > > > > largely > > > > > unable to access guest CPU state. Thus, fast migration of SEV VMs > > > > > requires support from inside the trust domain, i.e. the guest. > > > > > > > > > > One approach is to add support for SEV Migration to the Linux kernel. > > > > > This would allow the guest to encrypt/decrypt its own memory with a > > > > > transport key. This approach has met some resistance. We propose a > > > > > similar approach implemented not in Linux, but in firmware, > > > > > specifically > > > > > OVMF. Since OVMF runs inside the guest, it has access to the guest > > > > > memory and CPU state. OVMF should be able to perform the manipulations > > > > > required for live migration of SEV and SEV-ES guests. > > > > > > > > > > The biggest challenge of this approach involves migrating the CPU > > > > > state > > > > > of an SEV-ES guest. In a normal (non-SEV migration) the HV sets the > > > > > CPU > > > > > state of the target before the target begins executing. In our > > > > > approach, > > > > > the HV starts the target and OVMF must resume to whatever state the > > > > > source was in. We believe this to be the crux (or at least the most > > > > > difficult part) of live migration for SEV and we hope that by > > > > > demonstrating resume from EFI, we can show that our approach is > > > > > generally feasible. > > > > > > > > > > Our demo can be found at <https://github.com/secure-migration>. The > > > > > tooling repository is the best starting point. It contains > > > > > documentation > > > > > about the project and the scripts needed to run the demo. There are > > > > > two > > > > > more repos associated with the project. One is a modified edk2 tree > > > > > that > > > > > contains our modified OVMF. 
The other is a modified qemu, that has a > > > > > couple of temporary changes needed for the demo. Our demonstration is > > > > > aimed only at resuming from a VM snapshot in OVMF. We provide the > > > > > source > > > > > CPU state and source memory to the destination using temporary > > > > > plumbing > > > > > that violates the SEV trust model. We explain the setup in more > > > > > depth in > > > > > README.md. We are showing only that OVMF can resume from a VM > > > > > snapshot. > > > > > At the end we will describe our plan for transferring CPU state and > > > > > memory from source to guest. To be clear, the temporary tooling used > > > > > for > > > > > this demo isn't built for encrypted VMs, but below we explain how this > > > > > demo applies to and can be extended to encrypted VMs. > > > > > > > > > > We Implemented our resume code in a very similar fashion to the > > > > > recommended S3 resume code. When the HV sets the CPU state of a guest, > > > > > it can do so when the guest is not executing. Setting the state from > > > > > inside the guest is a delicate operation. There is no way to > > > > > atomically > > > > > set all of the CPU state from inside the guest. Instead, we must set > > > > > most registers individually and account for changes in control flow > > > > > that > > > > > doing so might cause. We do this with a three-phase trampoline. OVMF > > > > > calls phase 1, which runs on the OVMF map. Phase 1 sets up phase 2 and > > > > > jumps to it. Phase 2 switches to an intermediate map that reconciles > > > > > the > > > > > OVMF map and the source map. Phase 3 switches to the source map, > > > > > restores the registers, and returns into execution of the source. We > > > > > will go backwards through these phases in more depth. > > > > > > > > > > The last thing that resume to EFI does is return. 
Specifically, we use > > > > > IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS from a > > > > > temporary stack and restores them atomically, thus returning to source > > > > > execution. Prior to returning, we must manually restore most other > > > > > registers to the values they had on the source. One particularly > > > > > significant register is CR3. When we return to Linux, CR3 must be > > > > > set to > > > > > the source CR3 or the first instruction executed in Linux will cause a > > > > > page fault. The code that we use to restore the registers and return > > > > > must be mapped in the source page table or we would get a page fault > > > > > executing the instructions prior to returning into Linux. The value of > > > > > CR3 is so significant, that it defines the three phases of the > > > > > trampoline. Phase 3 begins when CR3 is set to the source CR3. After > > > > > setting CR3, we set all the other registers and return. > > > > > > > > > > Phase 2 mainly exists to setup phase 3. OVMF uses a 1-1 mapping, > > > > > meaning > > > > > that virtual addresses are the same as physical addresses. The kernel > > > > > page table uses an offset mapping, meaning that virtual addresses > > > > > differ > > > > > from physical addresses by a constant (for the most part). Crucially, > > > > > this means that the virtual address of the page that is executed by > > > > > phase 3 differs between the OVMF map and the source map. If we are > > > > > executing code mapped in OVMF and we change CR3 to point to the source > > > > > map, although the page may be mapped in the source map, the virtual > > > > > address will be different, and we will face undefined behavior. To fix > > > > > this, we construct intermediate page tables that map the pages for > > > > > phase > > > > > 2 and 3 to the virtual address expected in OVMF and to the virtual > > > > > address expected in the source map. 
Thus, we can switch CR3 from > > > > > OVMF's > > > > > map to the intermediate map and then from the intermediate map to the > > > > > source map. Phase 2 is much shorter than phase 3. Phase 2 is mainly > > > > > responsible for switching to the intermediate map, flushing the TLB, > > > > > and > > > > > jumping to phase 3. > > > > > > > > > > Fortunately phase 1 is even simpler than phase 2. Phase 1 has two > > > > > duties. First, since phase 2 and 3 operate without a stack and can't > > > > > access values defined in OVMF (such as the addresses of the pages > > > > > containing phase 2 and 3), phase 1 must pass these values to phase 2 > > > > > by > > > > > putting them in registers. Second, phase 1 must start phase 2 by > > > > > jumping > > > > > to it. > > > > > > > > > > Given that we can resume to a snapshot in OVMF, we should be able to > > > > > migrate an SEV guest as long as we can securely communicate the VM > > > > > snapshot from source to destination. For our demo, we do this with a > > > > > handful of QMP commands. More sophisticated methods are required for a > > > > > production implementation. > > > > > > > > > > When we refer to a snapshot, what we really mean is the device state, > > > > > memory, and CPU state of a guest. In live migration this is > > > > > transmitted > > > > > dynamically as opposed to being saved and restored. Device state is > > > > > not > > > > > protected by SEV and can be handled entirely by the HV. Memory, on the > > > > > other hand, cannot be handled only by the HV. As mentioned previously, > > > > > memory needs to be encrypted with a transport key. A Migration Handler > > > > > on the source will coordinate with the HV to encrypt pages and > > > > > transmit > > > > > them to the destination. The destination HV will receive the pages > > > > > over > > > > > the network and pass them to the Migration Handler in the target VM so > > > > > they can be decrypted. 
This transmission will occur continuously until > > > > > the memory of the source and target converges. > > > > > > > > > > Plain SEV does not protect the CPU state of the guest and therefore > > > > > does > > > > > not require any special mechanism for transmission of the CPU state. > > > > > We > > > > > plan to implement an end-to-end migration with plain SEV first. In > > > > > SEV-ES, the PSP (platform security processor) encrypts CPU state on > > > > > each > > > > > VMExit. The encrypted state is stored in memory. Normally this memory > > > > > (known as the VMSA) is not mapped into the guest, but we can add an > > > > > entry to the nested page tables that will expose the VMSA to the > > > > > guest. > > > > > This means that when the guest VMExits, the CPU state will be saved to > > > > > guest memory. With the CPU state in guest memory, it can be > > > > > transmitted > > > > > to the target using the method described above. > > > > > > > > > > In addition to the changes needed in OVMF to resume the VM, the > > > > > transmission of the VM from source to target will require a new code > > > > > path in the hypervisor. There will also need to be a few minor changes > > > > > to Linux (adding a mapping for our Phase 3 pages). Despite all the > > > > > moving pieces, we believe that this is a feasible approach for > > > > > supporting live migration for SEV and SEV-ES. > > > > > > > > > > For the sake of brevity, we have left out a few issues, including SMP > > > > > support, generation of the intermediate mappings, and more. We have > > > > > included some notes about these issues in the COMPLICATIONS.md file. > > > > > We > > > > > also have an outline of an end-to-end implementation of live migration > > > > > for SEV-ES in END-TO-END.md. See README.md for info on how to run the > > > > > demo. 
While this is not a full migration, we hope to show that fast > > > > > live > > > > > migration with SEV and SEV-ES is possible without major kernel > > > > > changes. > > > > > > > > > > -Tobin > > > > > > > > the one word that comes to my mind upon reading the above is, > > > > "overwhelming". > > > > > > > > (I have not been addressed directly, but: > > > > > > > > - the subject says "RFC", > > > > > > > > - and the documentation at > > > > > > > > https://github.com/secure-migration/resume-from-edk2-tooling#what-changes-did-we-make > > > > > > > > states that AmdSevPkg was created for convenience, and that the feature > > > > could be integrated into OVMF. (Paraphrased.) > > > > > > > > So I guess it's tolerable if I make a comment: ) > > > > > > > We've been looking forward to your perspective. > > > > > > > I've checked out the "mh-state-dev" branch of > > > > <https://github.com/secure-migration/resume-from-efi-edk2.git>. It has > > > > 80 commits on top of edk2 master (base commit: d5339c04d7cd, > > > > "UefiCpuPkg/MpInitLib: Add missing explicit PcdLib dependency", > > > > 2020-04-23). > > > > > > > > These commits were authored over the 6-7 months since April. It's > > > > obviously huge work. To me, most of these commits clearly aim at getting > > > > the demo / proof-of-concept functional, rather than guiding (more > > > > precisely: hand-holding) reviewers through the construction of the > > > > feature. > > > > > > > > In my opinion, the series is not upstreamable in its current format > > > > (which is presently not much more readable than a single-commit code > > > > drop). Upstreaming is probably not your intent, either, at this time. > > > > > > > > I agree that getting feedback ("buy-in") at this level of maturity is > > > > justified from your POV, before you invest more work into cleaning up / > > > > restructuring the series. 
> > > > > > > > My problem is that "hand-holding" is exactly what I'd need -- I cannot > > > > dedicate one or two weeks, as an indivisible block, to understanding > > > > your design. Nor can I approach the series patch-wise in its current > > > > format. Personally I would need the patch series to lead me through the > > > > whole design with baby steps ("ELI5"), meaning small code changes and > > > > detailed commit messages. I'd *also* need the more comprehensive > > > > guide-like documentation, as background material. > > > > > > > > Furthermore, I don't have an environment where I can test this > > > > proof-of-concept (and provide you with further incentive for cleaning up > > > > the series, by reporting success). > > > > > > > > So I hope others can spend the time discussing the design with you, and > > > > testing / repeating the demo. For me to review the patches, the patches > > > > should condense and replay your thinking process from the last 7 months, > > > > in as small as possible logical steps. (On the list.) > > > > > > > I completely understand your position. This PoC has a lot of > > > new ideas in it and you're right that our main priority was not > > > to hand-hold/guide reviewers through the code. > > > > > > One thing that is worth emphasizing is that the pieces we > > > are showcasing here are not the immediate priority when it > > > comes to upstreaming. Specifically, we looked into the trampoline > > > to make sure it was possible to migrate CPU state via firmware. > > > While we need this for SEV-ES and our goal is to support SEV-ES, > > > it is not the first step. We are currently working on a PoC for > > > a full end-to-end migration with SEV (non-ES), which may be a better > > > place for us to begin a serious discussion about getting things > > > upstream. We will focus more on making these patches accessible > > > to the upstream community. 
> > > > With my migration maintainer hat on, I'd like to understand a bit more > > about these different approaches; they could be quite invasive, so I'd > > like to make sure we're not doing one and throwing it away - it would > > be great if you could explain your non-ES approach; you don't need to > > have POC code to explain it. > > > Our non-ES approach is a subset of our ES approach. For ES, the > Migration Handler in the guest needs to help out with memory and > CPU state. For plain SEV, the HV can set the CPU state, but we still > need a way to transfer the memory. The current POC only deals > with the CPU state. OK, so as long as that's a subset, and this POC glues on for SEV-ES registers that's fine. > We're still working out some of the details in QEMU, but the basic > idea of transferring memory is that each time the HV needs to send a > page to the target, it will ask the Migration Handler in the guest > for a version of the page that is encrypted with a transport key. > Since the MH is inside the guest, it can read from any address > in guest memory. The Migration Handlers on the source and the target > will share a key. Once the source encrypts the requested page with > the transport key, it can safely hand it off to the HV. Once the page > reaches the target, the target HV will pass the page into the > Migration Handler, which will decrypt using the transport key and > move the page to the appropriate address. So somehow we have to get that transport key negotiated and into the migration handlers. > A few things to note: > > - The Migration Handler on the source needs to be running in the > guest alongside the VM. On the target, the MH needs to startup > before we can receive any pages. In both cases we are thinking > that an additional vCPU can be started for the MH to run on. > This could be spawned dynamically or live for the duration of > the guest.
And on the source it needs to keep running even when the other vCPUs stop for the stop-copy phase at the end. I know various people had asked the question whether we could have some form of helper vCPU or whether the vCPU would be guest visible. > - We need to make sure that the Migration Handler on the target > does not overwrite itself when it receives pages from the > source. Since we run the same firmware on the source and > target, and since the MH is runtime code, the memory > footprint of the MH should match on the source and the > target. We will need to make sure there are no weird > relocations. So hmm; that depends whether you're going to transfer the MH using the AMD hardware, or somehow rely on it being the same on the two sides I think. > - There are some complexities arising from the fact that not > every page in an SEV VM is encrypted. We are looking into > the best way to handle encrypted vs. shared pages. Right. > Hopefully those notes don't confound my earlier explanation too > much. I think that's most of the picture for non-ES migration. > Let me know if you have any questions. ES migration would use > the same approach for transferring memory. OK, good. Dave > -Tobin > > > Dave > > > > > In the meantime, perhaps there is something we can do to help > > > make our current work more clear. We could potentially explain > > > things on a call or create some additional documentation. While > > > our goal is not to shove this version of the trampoline upstream, > > > it is significant to our plan as a whole and we want to help > > > people understand it. > > > > > > -Tobin > > > > > > > I really don't want to be the bottleneck here, which is why I would > > > > support introducing this feature as a separate top-level package > > > > (AmdSevPkg). > > > > > > > > Thanks > > > > Laszlo > > > > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK ^ permalink raw reply [flat|nested] 16+ messages in thread
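[Editorial illustration] The memory-transfer flow described in the message above (the source Migration Handler wraps a page with the transport key, the hypervisors relay it, the target Migration Handler unwraps it and refuses to overwrite itself) can be sketched as follows. This is a rough Python model, not the authors' implementation. The thread does not say which cipher or key-exchange scheme would be used, so the SHA-256 counter-mode keystream below is only a toy stand-in, and every name (`mh_export_page`, `mh_import_page`, `mh_footprint`) is hypothetical.

```python
import hashlib
import hmac

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def _subkey(key, label):
    # Derive separate keys for encryption and authentication (toy KDF).
    return hashlib.sha256(key + label).digest()


def _keystream(key, gpa, length):
    """Toy keystream (SHA-256 in counter mode). NOT a vetted cipher."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = key + gpa.to_bytes(8, "little") + counter.to_bytes(8, "little")
        out += hashlib.sha256(block).digest()
        counter += 1
    return bytes(out[:length])


def _tag(key, gpa, cipher):
    mac_key = _subkey(key, b"mac")
    return hmac.new(mac_key, gpa.to_bytes(8, "little") + cipher, hashlib.sha256).digest()


def mh_export_page(key, guest_memory, gpa):
    """Source MH: read a guest page and wrap it with the transport key."""
    plain = bytes(guest_memory[gpa:gpa + PAGE_SIZE])
    stream = _keystream(_subkey(key, b"enc"), gpa, PAGE_SIZE)
    cipher = bytes(a ^ b for a, b in zip(plain, stream))
    return gpa, cipher, _tag(key, gpa, cipher)


def mh_import_page(key, guest_memory, packet, mh_footprint):
    """Target MH: verify, decrypt, and place the page at its address."""
    gpa, cipher, tag = packet
    start, end = mh_footprint  # [start, end) occupied by the MH itself
    if gpa < end and gpa + PAGE_SIZE > start:
        raise ValueError("page overlaps the Migration Handler itself")
    if not hmac.compare_digest(tag, _tag(key, gpa, cipher)):
        raise ValueError("transport authentication failed")
    stream = _keystream(_subkey(key, b"enc"), gpa, PAGE_SIZE)
    guest_memory[gpa:gpa + PAGE_SIZE] = bytes(a ^ b for a, b in zip(cipher, stream))
```

The hypervisor only ever handles the `(gpa, cipher, tag)` tuple. How the shared key reaches both Migration Handlers is the open question raised in the reply and is deliberately left out of the sketch.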
* Re: [edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept 2020-11-09 19:56 ` Dr. David Alan Gilbert @ 2020-11-09 22:37 ` Tobin Feldman-Fitzthum 2020-11-09 23:44 ` James Bottomley 0 siblings, 1 reply; 16+ messages in thread From: Tobin Feldman-Fitzthum @ 2020-11-09 22:37 UTC (permalink / raw) To: devel, dgilbert Cc: Laszlo Ersek, dovmurik, Dov.Murik1, ashish.kalra, brijesh.singh, tobin, david.kaplan, jon.grimm, thomas.lendacky, jejb, frankeh On 2020-11-09 14:56, Dr. David Alan Gilbert wrote: > * Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote: >> On 2020-11-06 11:38, Dr. David Alan Gilbert wrote: >> > * Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote: >> > > On 2020-11-03 09:59, Laszlo Ersek wrote: >> > > > Hi Tobin, >> > > > >> > > > (keeping full context -- I'm adding Dave) >> > > > >> > > > On 10/28/20 20:31, Tobin Feldman-Fitzthum wrote: >> > > > > Hello, >> > > > > >> > > > > Dov Murik. James Bottomley, Hubertus Franke, and I have been working >> > > > > on >> > > > > a plan for fast live migration of SEV and SEV-ES (and SEV-SNP when >> > > > > it's >> > > > > out and even hopefully Intel TDX) VMs. We have developed an approach >> > > > > that we believe is feasible and a demonstration that shows our >> > > > > solution >> > > > > to the most difficult part of the problem. In short, we have >> > > > > implemented >> > > > > a UEFI Application that can resume from a VM snapshot. We think this >> > > > > is >> > > > > the crux of SEV-ES live migration. After describing the context of our >> > > > > demo and how it works, we explain how it can be extended to a full >> > > > > SEV-ES migration. Our goal is to show that fast SEV and SEV-ES live >> > > > > migration can be implemented in OVMF with minimal kernel changes. We >> > > > > provide a blueprint for doing so. >> > > > > >> > > > > Typically the hypervisor facilitates live migration. AMD SEV excludes >> > > > > the hypervisor from the trust domain of the guest. 
When a hypervisor >> > > > > (HV) examines the memory of an SEV guest, it will find only a >> > > > > ciphertext. If the HV moves the memory of an SEV guest, the ciphertext >> > > > > will be invalidated. Furthermore, with SEV-ES the hypervisor is >> > > > > largely >> > > > > unable to access guest CPU state. Thus, fast migration of SEV VMs >> > > > > requires support from inside the trust domain, i.e. the guest. >> > > > > >> > > > > One approach is to add support for SEV Migration to the Linux kernel. >> > > > > This would allow the guest to encrypt/decrypt its own memory with a >> > > > > transport key. This approach has met some resistance. We propose a >> > > > > similar approach implemented not in Linux, but in firmware, >> > > > > specifically >> > > > > OVMF. Since OVMF runs inside the guest, it has access to the guest >> > > > > memory and CPU state. OVMF should be able to perform the manipulations >> > > > > required for live migration of SEV and SEV-ES guests. >> > > > > >> > > > > The biggest challenge of this approach involves migrating the CPU >> > > > > state >> > > > > of an SEV-ES guest. In a normal (non-SEV migration) the HV sets the >> > > > > CPU >> > > > > state of the target before the target begins executing. In our >> > > > > approach, >> > > > > the HV starts the target and OVMF must resume to whatever state the >> > > > > source was in. We believe this to be the crux (or at least the most >> > > > > difficult part) of live migration for SEV and we hope that by >> > > > > demonstrating resume from EFI, we can show that our approach is >> > > > > generally feasible. >> > > > > >> > > > > Our demo can be found at <https://github.com/secure-migration>. The >> > > > > tooling repository is the best starting point. It contains >> > > > > documentation >> > > > > about the project and the scripts needed to run the demo. There are >> > > > > two >> > > > > more repos associated with the project. 
One is a modified edk2 tree >> > > > > that >> > > > > contains our modified OVMF. The other is a modified qemu, that has a >> > > > > couple of temporary changes needed for the demo. Our demonstration is >> > > > > aimed only at resuming from a VM snapshot in OVMF. We provide the >> > > > > source >> > > > > CPU state and source memory to the destination using temporary >> > > > > plumbing >> > > > > that violates the SEV trust model. We explain the setup in more >> > > > > depth in >> > > > > README.md. We are showing only that OVMF can resume from a VM >> > > > > snapshot. >> > > > > At the end we will describe our plan for transferring CPU state and >> > > > > memory from source to guest. To be clear, the temporary tooling used >> > > > > for >> > > > > this demo isn't built for encrypted VMs, but below we explain how this >> > > > > demo applies to and can be extended to encrypted VMs. >> > > > > >> > > > > We Implemented our resume code in a very similar fashion to the >> > > > > recommended S3 resume code. When the HV sets the CPU state of a guest, >> > > > > it can do so when the guest is not executing. Setting the state from >> > > > > inside the guest is a delicate operation. There is no way to >> > > > > atomically >> > > > > set all of the CPU state from inside the guest. Instead, we must set >> > > > > most registers individually and account for changes in control flow >> > > > > that >> > > > > doing so might cause. We do this with a three-phase trampoline. OVMF >> > > > > calls phase 1, which runs on the OVMF map. Phase 1 sets up phase 2 and >> > > > > jumps to it. Phase 2 switches to an intermediate map that reconciles >> > > > > the >> > > > > OVMF map and the source map. Phase 3 switches to the source map, >> > > > > restores the registers, and returns into execution of the source. We >> > > > > will go backwards through these phases in more depth. >> > > > > >> > > > > The last thing that resume to EFI does is return. 
Specifically, we use >> > > > > IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS from a >> > > > > temporary stack and restores them atomically, thus returning to source >> > > > > execution. Prior to returning, we must manually restore most other >> > > > > registers to the values they had on the source. One particularly >> > > > > significant register is CR3. When we return to Linux, CR3 must be >> > > > > set to >> > > > > the source CR3 or the first instruction executed in Linux will cause a >> > > > > page fault. The code that we use to restore the registers and return >> > > > > must be mapped in the source page table or we would get a page fault >> > > > > executing the instructions prior to returning into Linux. The value of >> > > > > CR3 is so significant, that it defines the three phases of the >> > > > > trampoline. Phase 3 begins when CR3 is set to the source CR3. After >> > > > > setting CR3, we set all the other registers and return. >> > > > > >> > > > > Phase 2 mainly exists to setup phase 3. OVMF uses a 1-1 mapping, >> > > > > meaning >> > > > > that virtual addresses are the same as physical addresses. The kernel >> > > > > page table uses an offset mapping, meaning that virtual addresses >> > > > > differ >> > > > > from physical addresses by a constant (for the most part). Crucially, >> > > > > this means that the virtual address of the page that is executed by >> > > > > phase 3 differs between the OVMF map and the source map. If we are >> > > > > executing code mapped in OVMF and we change CR3 to point to the source >> > > > > map, although the page may be mapped in the source map, the virtual >> > > > > address will be different, and we will face undefined behavior. To fix >> > > > > this, we construct intermediate page tables that map the pages for >> > > > > phase >> > > > > 2 and 3 to the virtual address expected in OVMF and to the virtual >> > > > > address expected in the source map. 
Thus, we can switch CR3 from >> > > > > OVMF's >> > > > > map to the intermediate map and then from the intermediate map to the >> > > > > source map. Phase 2 is much shorter than phase 3. Phase 2 is mainly >> > > > > responsible for switching to the intermediate map, flushing the TLB, >> > > > > and >> > > > > jumping to phase 3. >> > > > > >> > > > > Fortunately phase 1 is even simpler than phase 2. Phase 1 has two >> > > > > duties. First, since phase 2 and 3 operate without a stack and can't >> > > > > access values defined in OVMF (such as the addresses of the pages >> > > > > containing phase 2 and 3), phase 1 must pass these values to phase 2 >> > > > > by >> > > > > putting them in registers. Second, phase 1 must start phase 2 by >> > > > > jumping >> > > > > to it. >> > > > > >> > > > > Given that we can resume to a snapshot in OVMF, we should be able to >> > > > > migrate an SEV guest as long as we can securely communicate the VM >> > > > > snapshot from source to destination. For our demo, we do this with a >> > > > > handful of QMP commands. More sophisticated methods are required for a >> > > > > production implementation. >> > > > > >> > > > > When we refer to a snapshot, what we really mean is the device state, >> > > > > memory, and CPU state of a guest. In live migration this is >> > > > > transmitted >> > > > > dynamically as opposed to being saved and restored. Device state is >> > > > > not >> > > > > protected by SEV and can be handled entirely by the HV. Memory, on the >> > > > > other hand, cannot be handled only by the HV. As mentioned previously, >> > > > > memory needs to be encrypted with a transport key. A Migration Handler >> > > > > on the source will coordinate with the HV to encrypt pages and >> > > > > transmit >> > > > > them to the destination. The destination HV will receive the pages >> > > > > over >> > > > > the network and pass them to the Migration Handler in the target VM so >> > > > > they can be decrypted. 
>> > > > > This transmission will occur continuously until the memory of the source and target converge.
>> > > > >
>> > > > > Plain SEV does not protect the CPU state of the guest and therefore does not require any special mechanism for transmission of the CPU state. We plan to implement an end-to-end migration with plain SEV first. In SEV-ES, the PSP (platform security processor) encrypts CPU state on each VMExit. The encrypted state is stored in memory. Normally this memory (known as the VMSA) is not mapped into the guest, but we can add an entry to the nested page tables that will expose the VMSA to the guest. This means that when the guest VMExits, the CPU state will be saved to guest memory. With the CPU state in guest memory, it can be transmitted to the target using the method described above.
>> > > > >
>> > > > > In addition to the changes needed in OVMF to resume the VM, the transmission of the VM from source to target will require a new code path in the hypervisor. There will also need to be a few minor changes to Linux (adding a mapping for our Phase 3 pages). Despite all the moving pieces, we believe that this is a feasible approach for supporting live migration for SEV and SEV-ES.
>> > > > >
>> > > > > For the sake of brevity, we have left out a few issues, including SMP support, generation of the intermediate mappings, and more. We have included some notes about these issues in the COMPLICATIONS.md file. We also have an outline of an end-to-end implementation of live migration for SEV-ES in END-TO-END.md. See README.md for info on how to run the demo.
>> > > > > While this is not a full migration, we hope to show that fast live migration with SEV and SEV-ES is possible without major kernel changes.
>> > > > >
>> > > > > -Tobin
>> > > >
>> > > > the one word that comes to my mind upon reading the above is, "overwhelming".
>> > > >
>> > > > (I have not been addressed directly, but:
>> > > >
>> > > > - the subject says "RFC",
>> > > >
>> > > > - and the documentation at
>> > > >
>> > > > https://github.com/secure-migration/resume-from-edk2-tooling#what-changes-did-we-make
>> > > >
>> > > > states that AmdSevPkg was created for convenience, and that the feature could be integrated into OVMF. (Paraphrased.)
>> > > >
>> > > > So I guess it's tolerable if I make a comment: )
>> > > >
>> > > We've been looking forward to your perspective.
>> > >
>> > > > I've checked out the "mh-state-dev" branch of <https://github.com/secure-migration/resume-from-efi-edk2.git>. It has 80 commits on top of edk2 master (base commit: d5339c04d7cd, "UefiCpuPkg/MpInitLib: Add missing explicit PcdLib dependency", 2020-04-23).
>> > > >
>> > > > These commits were authored over the 6-7 months since April. It's obviously huge work. To me, most of these commits clearly aim at getting the demo / proof-of-concept functional, rather than guiding (more precisely: hand-holding) reviewers through the construction of the feature.
>> > > >
>> > > > In my opinion, the series is not upstreamable in its current format (which is presently not much more readable than a single-commit code drop). Upstreaming is probably not your intent, either, at this time.
>> > > >
>> > > > I agree that getting feedback ("buy-in") at this level of maturity is justified from your POV, before you invest more work into cleaning up / restructuring the series.
>> > > >
>> > > > My problem is that "hand-holding" is exactly what I'd need -- I cannot dedicate one or two weeks, as an indivisible block, to understanding your design. Nor can I approach the series patch-wise in its current format. Personally I would need the patch series to lead me through the whole design with baby steps ("ELI5"), meaning small code changes and detailed commit messages. I'd *also* need the more comprehensive guide-like documentation, as background material.
>> > > >
>> > > > Furthermore, I don't have an environment where I can test this proof-of-concept (and provide you with further incentive for cleaning up the series, by reporting success).
>> > > >
>> > > > So I hope others can spend the time discussing the design with you, and testing / repeating the demo. For me to review the patches, the patches should condense and replay your thinking process from the last 7 months, in as small as possible logical steps. (On the list.)
>> > > >
>> > > I completely understand your position. This PoC has a lot of new ideas in it and you're right that our main priority was not to hand-hold/guide reviewers through the code.
>> > >
>> > > One thing that is worth emphasizing is that the pieces we are showcasing here are not the immediate priority when it comes to upstreaming. Specifically, we looked into the trampoline to make sure it was possible to migrate CPU state via firmware. While we need this for SEV-ES and our goal is to support SEV-ES, it is not the first step. We are currently working on a PoC for a full end-to-end migration with SEV (non-ES), which may be a better place for us to begin a serious discussion about getting things upstream. We will focus more on making these patches accessible to the upstream community.
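The intermediate mapping used by phases 2 and 3 of the trampoline, as described in the RFC quoted above, can be modeled in a few lines. This is only a sketch under stated assumptions, not the code from the demo repositories: page tables are reduced to Python dictionaries, and `KERNEL_OFFSET`, the trampoline page addresses, and all function names are hypothetical values chosen for illustration.

```python
PAGE_SIZE = 4096

# Assumed constant offset of the source kernel's linear mapping.
# Illustrative value only; the real offset depends on the source kernel.
KERNEL_OFFSET = 0xFFFF_8880_0000_0000


def translate(page_table, va):
    """Translate a virtual address using a {virtual page: physical page} dict."""
    page = va & ~(PAGE_SIZE - 1)
    offset = va & (PAGE_SIZE - 1)
    if page not in page_table:
        raise LookupError(f"page fault at {va:#x}")
    return page_table[page] + offset


def build_intermediate_map(trampoline_pages, kernel_offset):
    """Map every trampoline page at both virtual addresses it will be run from."""
    table = {}
    for pa in trampoline_pages:
        table[pa] = pa                  # address used under OVMF's 1-1 map
        table[pa + kernel_offset] = pa  # address used under the source map
    return table


# Hypothetical physical pages holding phase 2 and phase 3 code.
trampoline_pages = [0x7F00_0000, 0x7F00_1000]

ovmf_map = {pa: pa for pa in trampoline_pages}
source_map = {pa + KERNEL_OFFSET: pa for pa in trampoline_pages}
intermediate_map = build_intermediate_map(trampoline_pages, KERNEL_OFFSET)

# Phase 2 runs at an identity-mapped address and switches to the
# intermediate map; the instruction pointer must stay valid.
rip = trampoline_pages[0] + 0x40
phys = translate(ovmf_map, rip)
phys_after_first_switch = translate(intermediate_map, rip)

# Phase 2 then jumps to the offset alias, so that the later switch to the
# source map (the start of phase 3) keeps the instruction pointer valid.
rip_alias = rip + KERNEL_OFFSET
phys_at_alias = translate(intermediate_map, rip_alias)
phys_after_second_switch = translate(source_map, rip_alias)

# Switching straight from the OVMF map to the source map would fault.
try:
    translate(source_map, rip)
    direct_switch_faults = False
except LookupError:
    direct_switch_faults = True
```

The model only captures why two CR3 switches are needed. It ignores TLB flushing, multi-level page-table layout, and the Linux-side mapping of the phase 3 pages that the RFC mentions.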
>> >
>> > With my migration maintainer hat on, I'd like to understand a bit more about these different approaches; they could be quite invasive, so I'd like to make sure we're not doing one and throwing it away - it would be great if you could explain your non-ES approach; you don't need to have POC code to explain it.
>> >
>> Our non-ES approach is a subset of our ES approach. For ES, the Migration Handler in the guest needs to help out with memory and CPU state. For plain SEV, the HV can set the CPU state, but we still need a way to transfer the memory. The current POC only deals with the CPU state.
>
> OK, so as long as that's a subset, and this POC glues on for SEV-ES registers that's fine.
>
>> We're still working out some of the details in QEMU, but the basic idea of transferring memory is that each time the HV needs to send a page to the target, it will ask the Migration Handler in the guest for a version of the page that is encrypted with a transport key. Since the MH is inside the guest, it can read from any address in guest memory. The Migration Handlers on the source and the target will share a key. Once the source encrypts the requested page with the transport key, it can safely hand it off to the HV. Once the page reaches the target, the target HV will pass the page into the Migration Handler, which will decrypt using the transport key and move the page to the appropriate address.
>
> So somehow we have to get that transport key negotiated and into the migration-handlers.

Inject-launch-secret is one of the main pieces here. James might have more info about this step.

>
>> A few things to note:
>>
>> - The Migration Handler on the source needs to be running in the guest alongside the VM. On the target, the MH needs to start up before we can receive any pages. In both cases we are thinking that an additional vCPU can be started for the MH to run on.
>>   This could be spawned dynamically or live for the duration of the guest.
>
> And on the source it needs to keep running even when the other vCPUs stop for the stop-copy phase at the end.

Yes. Good point.

>
> I know various people had asked the question whether we could have some form of helper vCPU or whether the vCPU would be guest visible.
>
>> - We need to make sure that the Migration Handler on the target does not overwrite itself when it receives pages from the source. Since we run the same firmware on the source and target, and since the MH is runtime code, the memory footprint of the MH should match on the source and the target. We will need to make sure there are no weird relocations.
>
> So hmm; that depends whether you're going to transfer the MH using the AMD hardware, or somehow rely on it being the same on the two sides I think.
>

We don't transfer the MH itself. Even if we did, we would still need to make sure that the MH on the target and the OS on the source do not overlap. Currently our approach for this is to designate the MH as a runtime driver, meaning that the code for the MH is on reserved pages that won't be mapped by Linux. We'll use the same firmware and thus the same driver on the source and destination. We think this will be enough, but it is a somewhat delicate step that we may need to revisit.

-Tobin

>> - There are some complexities arising from the fact that not every page in an SEV VM is encrypted. We are looking into the best way to handle encrypted vs. shared pages.
>
> Right.
>
>> Hopefully those notes don't confound my earlier explanation too much. I think that's most of the picture for non-ES migration. Let me know if you have any questions. ES migration would use the same approach for transferring memory.
>
> OK, good.
>
> Dave
>
>> -Tobin
>>
>> > Dave
>> >
>> > > In the meantime, perhaps there is something we can do to help make our current work more clear.
>> > > We could potentially explain things on a call or create some additional documentation. While our goal is not to shove this version of the trampoline upstream, it is significant to our plan as a whole and we want to help people understand it.
>> > >
>> > > -Tobin
>> > >
>> > > > I really don't want to be the bottleneck here, which is why I would support introducing this feature as a separate top-level package (AmdSevPkg).
>> > > >
>> > > > Thanks
>> > > > Laszlo
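The page-transfer flow outlined in this message (the source Migration Handler encrypts a page with the transport key, the hypervisors relay it, the target Migration Handler decrypts it and places it at the right address) can be sketched as follows. This is a toy model, not the QEMU or OVMF code, which the message says is still being worked out. The class and method names are hypothetical, and the cipher is a stand-in built from SHA-256 and HMAC only so that the example runs with the standard library; a real implementation would use a vetted authenticated cipher.

```python
import hashlib
import hmac
import os

PAGE_SIZE = 4096


def _keystream(key, nonce, length):
    """Toy keystream (SHA-256 in counter mode); placeholder for a real cipher."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "little")).digest()
        counter += 1
    return bytes(out[:length])


class MigrationHandler:
    """Hypothetical in-guest handler that holds the shared transport key."""

    def __init__(self, transport_key, memory):
        # Derive separate subkeys for encryption and authentication.
        self.enc_key = hashlib.sha256(b"enc" + transport_key).digest()
        self.mac_key = hashlib.sha256(b"mac" + transport_key).digest()
        self.memory = memory  # guest physical address -> plaintext page

    def _tag(self, gpa, nonce, ciphertext):
        # The tag binds the ciphertext to its guest physical address.
        msg = gpa.to_bytes(8, "little") + nonce + ciphertext
        return hmac.new(self.mac_key, msg, hashlib.sha256).digest()

    def export_page(self, gpa):
        """Source side: return a packet that is safe to hand to the HV."""
        nonce = os.urandom(16)
        page = self.memory[gpa]
        stream = _keystream(self.enc_key, nonce, len(page))
        ciphertext = bytes(a ^ b for a, b in zip(page, stream))
        return {"gpa": gpa, "nonce": nonce, "ct": ciphertext,
                "tag": self._tag(gpa, nonce, ciphertext)}

    def import_page(self, packet):
        """Target side: authenticate, decrypt, and place the page."""
        expected = self._tag(packet["gpa"], packet["nonce"], packet["ct"])
        if not hmac.compare_digest(expected, packet["tag"]):
            raise ValueError("page failed authentication")
        stream = _keystream(self.enc_key, packet["nonce"], len(packet["ct"]))
        self.memory[packet["gpa"]] = bytes(
            a ^ b for a, b in zip(packet["ct"], stream))


# Usage: the untrusted HV only ever sees `packet`, never the plaintext page.
transport_key = os.urandom(32)
source_mh = MigrationHandler(transport_key, {0x1000: b"\xaa" * PAGE_SIZE})
target_mh = MigrationHandler(transport_key, {})
packet = source_mh.export_page(0x1000)
target_mh.import_page(packet)
```

Binding the guest physical address into the tag is a design choice of this sketch: it keeps a relaying hypervisor from redirecting a valid page to a different address on the target. The thread itself does not say how placement is authenticated.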
* Re: [edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept
  2020-11-09 22:37 ` Tobin Feldman-Fitzthum
@ 2020-11-09 23:44 ` James Bottomley
  0 siblings, 0 replies; 16+ messages in thread
From: James Bottomley @ 2020-11-09 23:44 UTC
To: devel, tobin, dgilbert
Cc: Laszlo Ersek, dovmurik, Dov.Murik1, ashish.kalra, brijesh.singh, tobin, david.kaplan, jon.grimm, thomas.lendacky, frankeh

On Mon, 2020-11-09 at 17:37 -0500, Tobin Feldman-Fitzthum wrote:
> On 2020-11-09 14:56, Dr. David Alan Gilbert wrote:
> > * Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote:
[...]
> > > We're still working out some of the details in QEMU, but the basic idea of transferring memory is that each time the HV needs to send a page to the target, it will ask the Migration Handler in the guest for a version of the page that is encrypted with a transport key. Since the MH is inside the guest, it can read from any address in guest memory. The Migration Handlers on the source and the target will share a key. Once the source encrypts the requested page with the transport key, it can safely hand it off to the HV. Once the page reaches the target, the target HV will pass the page into the Migration Handler, which will decrypt using the transport key and move the page to the appropriate address.
> >
> > So somehow we have to get that transport key negotiated and into the migration-handlers.
>
> Inject-launch-secret is one of the main pieces here. James might have more info about this step.

So there are a couple of ways I was thinking this could work. In the current slow migration, the PSPs on each end validate each other by exchanging keys. We could do something similar by having the two MHs do an ECDHE exchange to agree a trusted transfer key between them and then having them both exchange trusted information about the SEV environment i.e. both validating each other.
However, the alternative and simpler way is simply to have the machine owner control everything. So encrypted boot would provision two secrets: one for the actual encrypted root which grub needs but the other would be what the MH needs. The MH secret would be the private part of an ECDH key (effectively the MH identity) and the public ECDH key of the MH source, so only the source MH would be able to make encrypted contact for migration. On boot from image, the public key part would be empty indicating boot should proceed normally.

On migration, we make sure we know the source public key and provision it to the target along with a random target key. To trigger the migration, we have to tell the source what the target's public key is and they can now make encrypted contact in a manner that should be cryptographically secure.

The MH ECDH key would exist for the lifetime of the VM on a SEV system and would be destroyed on either image shutdown or successful migration.

James
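The owner-controlled provisioning scheme James describes can be sketched as below. The message only says "ECDH"; the choice of X25519 (implemented here from RFC 7748 so the example needs no third-party library), the SHA-256 key derivation, and every name such as `MhSecret` and `boot` are assumptions made for illustration. The ladder is not constant-time and is not meant for production use.

```python
import hashlib
import os

P = 2**255 - 19
A24 = 121665
BASE_POINT = (9).to_bytes(32, "little")


def x25519(scalar, u_bytes):
    """X25519 scalar multiplication following RFC 7748 (not constant-time)."""
    k = bytearray(scalar)
    k[0] &= 248
    k[31] &= 127
    k[31] |= 64
    k_int = int.from_bytes(k, "little")
    x1 = int.from_bytes(u_bytes, "little") & ((1 << 255) - 1)
    x2, z2, x3, z3, swap = 1, 0, x1, 1, 0
    for t in reversed(range(255)):
        bit = (k_int >> t) & 1
        swap ^= bit
        if swap:
            x2, x3, z2, z3 = x3, x2, z3, z2
        swap = bit
        a, b = (x2 + z2) % P, (x2 - z2) % P
        aa, bb = a * a % P, b * b % P
        e = (aa - bb) % P
        c, d = (x3 + z3) % P, (x3 - z3) % P
        da, cb = d * a % P, c * b % P
        x3 = (da + cb) ** 2 % P
        z3 = x1 * (da - cb) ** 2 % P
        x2 = aa * bb % P
        z2 = e * (aa + A24 * e) % P
    if swap:
        x2, x3, z2, z3 = x3, x2, z3, z2
    return (x2 * pow(z2, P - 2, P) % P).to_bytes(32, "little")


class MhSecret:
    """Hypothetical secret provisioned to a Migration Handler by the owner."""

    def __init__(self, private_key, peer_public=b""):
        self.private_key = private_key  # the MH identity
        self.peer_public = peer_public  # empty on a normal boot from image

    def public_key(self):
        return x25519(self.private_key, BASE_POINT)


def boot(secret):
    """Return a transport key, or None when boot should proceed normally."""
    if not secret.peer_public:
        return None
    shared = x25519(secret.private_key, secret.peer_public)
    # Assumed key derivation; the thread does not specify one.
    return hashlib.sha256(b"mh-transport-key" + shared).digest()


# Boot from image: the source MH has an identity but no peer.
source = MhSecret(os.urandom(32))
normal_boot = boot(source)

# Migration: the owner provisions the target with a random key plus the
# source's public key, then tells the source the target's public key.
target = MhSecret(os.urandom(32), peer_public=source.public_key())
source.peer_public = target.public_key()

source_key = boot(source)
target_key = boot(target)
```

Because each side combines its own private key with the other's public key, both Migration Handlers arrive at the same transport key without that key ever passing through the hypervisor.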
end of thread, other threads:[~2020-11-09 23:44 UTC | newest]

Thread overview: 16+ messages
2020-10-28 19:31 RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept Tobin Feldman-Fitzthum
2020-10-29 17:06 ` Ashish Kalra
2020-10-29 20:36 ` tobin
2020-10-30 18:35 ` Ashish Kalra
2020-11-03 14:59 ` [edk2-devel] " Laszlo Ersek
2020-11-04 18:27 ` Tobin Feldman-Fitzthum
2020-11-06 15:45 ` Laszlo Ersek
2020-11-06 20:03 ` Tobin Feldman-Fitzthum
2020-11-06 16:38 ` Dr. David Alan Gilbert
2020-11-06 21:48 ` Tobin Feldman-Fitzthum
2020-11-06 22:17 ` Ashish Kalra
2020-11-09 20:27 ` Tobin Feldman-Fitzthum
2020-11-09 20:34 ` Kalra, Ashish
2020-11-09 19:56 ` Dr. David Alan Gilbert
2020-11-09 22:37 ` Tobin Feldman-Fitzthum
2020-11-09 23:44 ` James Bottomley