From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Tobin Feldman-Fitzthum <tobin@linux.ibm.com>
Cc: Laszlo Ersek <lersek@redhat.com>,
	devel@edk2.groups.io, dovmurik@linux.vnet.ibm.com,
	Dov.Murik1@il.ibm.com, ashish.kalra@amd.com,
	brijesh.singh@amd.com, tobin@ibm.com, david.kaplan@amd.com,
	jon.grimm@amd.com, thomas.lendacky@amd.com, jejb@linux.ibm.com,
	frankeh@us.ibm.com
Subject: Re: [edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept
Date: Fri, 6 Nov 2020 16:38:48 +0000
Message-ID: <20201106163848.GM3576@work-vm>
In-Reply-To: <61acbc7b318b2c099a106151116f25ea@linux.vnet.ibm.com>

* Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote:
> On 2020-11-03 09:59, Laszlo Ersek wrote:
> > Hi Tobin,
> >
> > (keeping full context -- I'm adding Dave)
> >
> > On 10/28/20 20:31, Tobin Feldman-Fitzthum wrote:
> > > Hello,
> > >
> > > Dov Murik, James Bottomley, Hubertus Franke, and I have been working on
> > > a plan for fast live migration of SEV and SEV-ES (and SEV-SNP when it's
> > > out and even hopefully Intel TDX) VMs. We have developed an approach
> > > that we believe is feasible and a demonstration that shows our solution
> > > to the most difficult part of the problem. In short, we have implemented
> > > a UEFI Application that can resume from a VM snapshot. We think this is
> > > the crux of SEV-ES live migration. After describing the context of our
> > > demo and how it works, we explain how it can be extended to a full
> > > SEV-ES migration. Our goal is to show that fast SEV and SEV-ES live
> > > migration can be implemented in OVMF with minimal kernel changes. We
> > > provide a blueprint for doing so.
> > >
> > > Typically the hypervisor facilitates live migration. AMD SEV excludes
> > > the hypervisor from the trust domain of the guest. When a hypervisor
> > > (HV) examines the memory of an SEV guest, it will find only a
> > > ciphertext. If the HV moves the memory of an SEV guest, the ciphertext
> > > will be invalidated. Furthermore, with SEV-ES the hypervisor is largely
> > > unable to access guest CPU state. Thus, fast migration of SEV VMs
> > > requires support from inside the trust domain, i.e. the guest.
> > >
> > > One approach is to add support for SEV Migration to the Linux kernel.
> > > This would allow the guest to encrypt/decrypt its own memory with a
> > > transport key. This approach has met some resistance. We propose a
> > > similar approach implemented not in Linux, but in firmware, specifically
> > > OVMF. Since OVMF runs inside the guest, it has access to the guest
> > > memory and CPU state. OVMF should be able to perform the manipulations
> > > required for live migration of SEV and SEV-ES guests.
> > >
> > > The biggest challenge of this approach involves migrating the CPU state
> > > of an SEV-ES guest. In a normal (non-SEV) migration, the HV sets the CPU
> > > state of the target before the target begins executing. In our approach,
> > > the HV starts the target and OVMF must resume to whatever state the
> > > source was in. We believe this to be the crux (or at least the most
> > > difficult part) of live migration for SEV and we hope that by
> > > demonstrating resume from EFI, we can show that our approach is
> > > generally feasible.
> > >
> > > Our demo can be found at <https://github.com/secure-migration>. The
> > > tooling repository is the best starting point. It contains documentation
> > > about the project and the scripts needed to run the demo. There are two
> > > more repos associated with the project. One is a modified edk2 tree that
> > > contains our modified OVMF. The other is a modified qemu that has a
> > > couple of temporary changes needed for the demo. Our demonstration is
> > > aimed only at resuming from a VM snapshot in OVMF. We provide the source
> > > CPU state and source memory to the destination using temporary plumbing
> > > that violates the SEV trust model. We explain the setup in more depth in
> > > README.md. We are showing only that OVMF can resume from a VM snapshot.
> > > At the end we will describe our plan for transferring CPU state and
> > > memory from source to guest. To be clear, the temporary tooling used for
> > > this demo isn't built for encrypted VMs, but below we explain how this
> > > demo applies to and can be extended to encrypted VMs.
> > >
> > > We implemented our resume code in a very similar fashion to the
> > > recommended S3 resume code. When the HV sets the CPU state of a guest,
> > > it can do so when the guest is not executing. Setting the state from
> > > inside the guest is a delicate operation. There is no way to atomically
> > > set all of the CPU state from inside the guest. Instead, we must set
> > > most registers individually and account for changes in control flow that
> > > doing so might cause. We do this with a three-phase trampoline. OVMF
> > > calls phase 1, which runs on the OVMF map. Phase 1 sets up phase 2 and
> > > jumps to it. Phase 2 switches to an intermediate map that reconciles the
> > > OVMF map and the source map. Phase 3 switches to the source map,
> > > restores the registers, and returns into execution of the source. We
> > > will go backwards through these phases in more depth.
> > >
> > > The last thing that resume from EFI does is return. Specifically, we use
> > > IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS from a
> > > temporary stack and restores them atomically, thus returning to source
> > > execution. Prior to returning, we must manually restore most other
> > > registers to the values they had on the source. One particularly
> > > significant register is CR3. When we return to Linux, CR3 must be set to
> > > the source CR3 or the first instruction executed in Linux will cause a
> > > page fault. The code that we use to restore the registers and return
> > > must be mapped in the source page table or we would get a page fault
> > > executing the instructions prior to returning into Linux. The value of
> > > CR3 is so significant that it defines the three phases of the
> > > trampoline. Phase 3 begins when CR3 is set to the source CR3. After
> > > setting CR3, we set all the other registers and return.
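
The frame that IRETQ consumes can be sketched as follows. This is an illustrative Python model only, not code from the demo repositories: `build_iretq_frame`, `parse_iretq_frame`, and every register value are hypothetical. It assumes the 64-bit-mode behaviour of IRETQ, which pops RIP, CS, RFLAGS, RSP, and SS, in that order, as five 8-byte slots starting at the lowest stack address.

```python
import struct

# Order in which IRETQ pops values in 64-bit mode, lowest stack address first.
IRETQ_FIELDS = ("rip", "cs", "rflags", "rsp", "ss")


def build_iretq_frame(state):
    """Pack the five little-endian 64-bit slots IRETQ reads (hypothetical helper)."""
    return struct.pack("<5Q", *(state[name] for name in IRETQ_FIELDS))


def parse_iretq_frame(frame):
    """Inverse of build_iretq_frame, useful for checking the layout."""
    return dict(zip(IRETQ_FIELDS, struct.unpack("<5Q", frame)))


# Illustrative values only; they are not taken from the demo.
source_state = {
    "rip": 0xFFFFFFFF81000000,
    "cs": 0x10,
    "rflags": 0x246,
    "rsp": 0xFFFFC90000003F00,
    "ss": 0x18,
}

frame = build_iretq_frame(source_state)
print(len(frame), "byte frame, RIP slot first:", frame[:8].hex())
```

Every other register, CR3 included, has to be restored before this frame is consumed, because nothing runs in the trampoline after IRETQ.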
> > >
> > > Phase 2 mainly exists to set up phase 3. OVMF uses a 1-1 mapping, meaning
> > > that virtual addresses are the same as physical addresses. The kernel
> > > page table uses an offset mapping, meaning that virtual addresses differ
> > > from physical addresses by a constant (for the most part). Crucially,
> > > this means that the virtual address of the page that is executed by
> > > phase 3 differs between the OVMF map and the source map. If we are
> > > executing code mapped in OVMF and we change CR3 to point to the source
> > > map, although the page may be mapped in the source map, the virtual
> > > address will be different, and we will face undefined behavior. To fix
> > > this, we construct intermediate page tables that map the pages for phase
> > > 2 and 3 to the virtual address expected in OVMF and to the virtual
> > > address expected in the source map. Thus, we can switch CR3 from OVMF's
> > > map to the intermediate map and then from the intermediate map to the
> > > source map. Phase 2 is much shorter than phase 3. Phase 2 is mainly
> > > responsible for switching to the intermediate map, flushing the TLB, and
> > > jumping to phase 3.
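
The role of the intermediate map can be sketched with a toy model. This is a minimal Python sketch, not the demo's page-table code: page tables are modelled as flat dictionaries from virtual page to physical page, and `OFFSET`, `TRAMPOLINE`, and all function names are hypothetical. It only shows why a direct switch from the OVMF map to the source map faults while the two-step switch does not.

```python
PAGE = 0x1000


def identity_map(phys_pages):
    """OVMF-style 1-1 map: virtual address equals physical address."""
    return {p: p for p in phys_pages}


def offset_map(phys_pages, offset):
    """Kernel-style map: virtual address is physical address plus a constant."""
    return {p + offset: p for p in phys_pages}


def intermediate_map(trampoline_pages, offset):
    """Map the phase 2/3 pages at both the OVMF and the source virtual address."""
    table = identity_map(trampoline_pages)
    table.update(offset_map(trampoline_pages, offset))
    return table


def translate(table, vaddr):
    """Look up a virtual address; a missing page models a page fault."""
    page = vaddr & ~(PAGE - 1)
    if page not in table:
        raise LookupError(f"page fault at {vaddr:#x}")
    return table[page] | (vaddr & (PAGE - 1))


OFFSET = 0xFFFF888000000000   # illustrative constant, an assumption
TRAMPOLINE = 0x800000         # illustrative physical page, an assumption

ovmf = identity_map([TRAMPOLINE])
source = offset_map([TRAMPOLINE], OFFSET)
mid = intermediate_map([TRAMPOLINE], OFFSET)

rip = TRAMPOLINE + 0x10
# OVMF map -> intermediate map: the current address stays valid.
print(hex(translate(mid, rip)))
# Jump to the source-side alias, then intermediate map -> source map.
print(hex(translate(source, rip + OFFSET)))
```

Calling `translate(source, rip)` instead raises `LookupError`, which corresponds to switching CR3 straight from the OVMF map to the source map.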
> > >
> > > Fortunately phase 1 is even simpler than phase 2. Phase 1 has two
> > > duties. First, since phase 2 and 3 operate without a stack and can't
> > > access values defined in OVMF (such as the addresses of the pages
> > > containing phase 2 and 3), phase 1 must pass these values to phase 2 by
> > > putting them in registers. Second, phase 1 must start phase 2 by jumping
> > > to it.
> > >
> > > Given that we can resume to a snapshot in OVMF, we should be able to
> > > migrate an SEV guest as long as we can securely communicate the VM
> > > snapshot from source to destination. For our demo, we do this with a
> > > handful of QMP commands. More sophisticated methods are required for a
> > > production implementation.
> > >
> > > When we refer to a snapshot, what we really mean is the device state,
> > > memory, and CPU state of a guest. In live migration this is transmitted
> > > dynamically as opposed to being saved and restored. Device state is not
> > > protected by SEV and can be handled entirely by the HV. Memory, on the
> > > other hand, cannot be handled only by the HV. As mentioned previously,
> > > memory needs to be encrypted with a transport key. A Migration Handler
> > > on the source will coordinate with the HV to encrypt pages and transmit
> > > them to the destination. The destination HV will receive the pages over
> > > the network and pass them to the Migration Handler in the target VM so
> > > they can be decrypted. This transmission will occur continuously until
> > > the memory of the source and target converges.
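
The iterate-until-converged transfer described above can be sketched as a loop. This is a minimal Python simulation under stated assumptions, not the proposed Migration Handler: `migrate_pages`, `next_dirty`, and `toy_cipher` are hypothetical names, the XOR "cipher" is only a stand-in for encryption with a real transport key, and the network hop through the two hypervisors is collapsed into a function call.

```python
def migrate_pages(source_mem, encrypt, decrypt, next_dirty, max_rounds=10):
    """Copy pages to the target, resending dirtied pages until none remain.

    source_mem: dict mapping page number -> bytes
    next_dirty: callable returning the set of pages written since the last call
    """
    target_mem = {}
    to_send = set(source_mem)          # first round sends every page
    rounds = 0
    while to_send and rounds < max_rounds:
        for page in sorted(to_send):
            wire = encrypt(source_mem[page])   # source-side handler
            target_mem[page] = decrypt(wire)   # target-side handler
        to_send = next_dirty()
        rounds += 1
    return target_mem, rounds


KEY = 0x5A


def toy_cipher(data):
    """XOR stand-in (its own inverse). NOT real cryptography."""
    return bytes(b ^ KEY for b in data)


source = {0: b"aaaa", 1: b"bbbb", 2: b"cccc"}
writes = [{1: b"BBBB"}, {}]   # scripted guest writes after rounds 1 and 2


def next_dirty():
    changed = writes.pop(0) if writes else {}
    source.update(changed)
    return set(changed)


target, rounds = migrate_pages(source, toy_cipher, toy_cipher, next_dirty)
print(rounds, target == source)
```

A real implementation would normally pause the source for a final round once the dirty set is small, rather than waiting for it to become empty while the guest keeps running.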
> > >
> > > Plain SEV does not protect the CPU state of the guest and therefore does
> > > not require any special mechanism for transmission of the CPU state. We
> > > plan to implement an end-to-end migration with plain SEV first. In
> > > SEV-ES, the PSP (platform security processor) encrypts CPU state on each
> > > VMExit. The encrypted state is stored in memory. Normally this memory
> > > (known as the VMSA) is not mapped into the guest, but we can add an
> > > entry to the nested page tables that will expose the VMSA to the guest.
> > > This means that when the guest VMExits, the CPU state will be saved to
> > > guest memory. With the CPU state in guest memory, it can be transmitted
> > > to the target using the method described above.
> > >
> > > In addition to the changes needed in OVMF to resume the VM, the
> > > transmission of the VM from source to target will require a new code
> > > path in the hypervisor. There will also need to be a few minor changes
> > > to Linux (adding a mapping for our Phase 3 pages). Despite all the
> > > moving pieces, we believe that this is a feasible approach for
> > > supporting live migration for SEV and SEV-ES.
> > >
> > > For the sake of brevity, we have left out a few issues, including SMP
> > > support, generation of the intermediate mappings, and more. We have
> > > included some notes about these issues in the COMPLICATIONS.md file. We
> > > also have an outline of an end-to-end implementation of live migration
> > > for SEV-ES in END-TO-END.md. See README.md for info on how to run the
> > > demo. While this is not a full migration, we hope to show that fast live
> > > migration with SEV and SEV-ES is possible without major kernel changes.
> > >
> > > -Tobin
> >
> > the one word that comes to my mind upon reading the above is,
> > "overwhelming".
> >
> > (I have not been addressed directly, but:
> >
> > - the subject says "RFC",
> >
> > - and the documentation at
> >
> > https://github.com/secure-migration/resume-from-edk2-tooling#what-changes-did-we-make
> >
> > states that AmdSevPkg was created for convenience, and that the feature
> > could be integrated into OVMF. (Paraphrased.)
> >
> > So I guess it's tolerable if I make a comment: )
> >
> We've been looking forward to your perspective.
>
> > I've checked out the "mh-state-dev" branch of
> > <https://github.com/secure-migration/resume-from-efi-edk2.git>. It has
> > 80 commits on top of edk2 master (base commit: d5339c04d7cd,
> > "UefiCpuPkg/MpInitLib: Add missing explicit PcdLib dependency",
> > 2020-04-23).
> >
> > These commits were authored over the 6-7 months since April. It's
> > obviously huge work. To me, most of these commits clearly aim at getting
> > the demo / proof-of-concept functional, rather than guiding (more
> > precisely: hand-holding) reviewers through the construction of the
> > feature.
> >
> > In my opinion, the series is not upstreamable in its current format
> > (which is presently not much more readable than a single-commit code
> > drop). Upstreaming is probably not your intent, either, at this time.
> >
> > I agree that getting feedback ("buy-in") at this level of maturity is
> > justified from your POV, before you invest more work into cleaning up /
> > restructuring the series.
> >
> > My problem is that "hand-holding" is exactly what I'd need -- I cannot
> > dedicate one or two weeks, as an indivisible block, to understanding
> > your design. Nor can I approach the series patch-wise in its current
> > format. Personally I would need the patch series to lead me through the
> > whole design with baby steps ("ELI5"), meaning small code changes and
> > detailed commit messages. I'd *also* need the more comprehensive
> > guide-like documentation, as background material.
> >
> > Furthermore, I don't have an environment where I can test this
> > proof-of-concept (and provide you with further incentive for cleaning up
> > the series, by reporting success).
> >
> > So I hope others can spend the time discussing the design with you, and
> > testing / repeating the demo. For me to review the patches, the patches
> > should condense and replay your thinking process from the last 7 months,
> > in as small as possible logical steps. (On the list.)
> >
> I completely understand your position. This PoC has a lot of
> new ideas in it and you're right that our main priority was not
> to hand-hold/guide reviewers through the code.
>
> One thing that is worth emphasizing is that the pieces we
> are showcasing here are not the immediate priority when it
> comes to upstreaming. Specifically, we looked into the trampoline
> to make sure it was possible to migrate CPU state via firmware.
> While we need this for SEV-ES and our goal is to support SEV-ES,
> it is not the first step. We are currently working on a PoC for
> a full end-to-end migration with SEV (non-ES), which may be a better
> place for us to begin a serious discussion about getting things
> upstream. We will focus more on making these patches accessible
> to the upstream community.

With my migration maintainer hat on, I'd like to understand a bit more
about these different approaches. They could be quite invasive, so I'd
like to make sure we're not doing one and then throwing it away. It would
be great if you could explain your non-ES approach; you don't need to
have PoC code to explain it.

Dave

> In the meantime, perhaps there is something we can do to help
> make our current work more clear. We could potentially explain
> things on a call or create some additional documentation. While
> our goal is not to shove this version of the trampoline upstream,
> it is significant to our plan as a whole and we want to help
> people understand it.
>
> -Tobin
>
> > I really don't want to be the bottleneck here, which is why I would
> > support introducing this feature as a separate top-level package
> > (AmdSevPkg).
> >
> > Thanks
> > Laszlo
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK