From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII; format=flowed
Content-Transfer-Encoding: 7bit
Date: Thu, 29 Oct 2020 16:36:07 -0400
From: tobin@linux.ibm.com
To: Ashish Kalra
Cc: devel@edk2.groups.io, dovmurik@linux.vnet.ibm.com, Dov.Murik1@il.ibm.com, brijesh.singh@amd.com, tobin@ibm.com, david.kaplan@amd.com, jon.grimm@amd.com, thomas.lendacky@amd.com, jejb@linux.ibm.com, frankeh@us.ibm.com
Subject: Re: RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept
In-Reply-To: <20201029170638.GA16080@ashkalra_ubuntu_server>
References: <20201029170638.GA16080@ashkalra_ubuntu_server>
Message-ID: <5dc185214309c0cf309e5244dea37c11@linux.vnet.ibm.com>

On 2020-10-29 13:06, Ashish Kalra wrote:
> Hello Tobin,
>
> On Wed, Oct 28, 2020 at 03:31:44PM -0400, Tobin Feldman-Fitzthum wrote:
>> Hello,
>>
>> Dov Murik, James Bottomley, Hubertus Franke, and I have been working on a plan for fast live migration of SEV and SEV-ES (and SEV-SNP when it's out, and hopefully even Intel TDX) VMs. We have developed an approach that we believe is feasible and a demonstration that shows our solution to the most difficult part of the problem. In short, we have implemented a UEFI application that can resume from a VM snapshot. We think this is the crux of SEV-ES live migration. After describing the context of our demo and how it works, we explain how it can be extended to a full SEV-ES migration. Our goal is to show that fast SEV and SEV-ES live migration can be implemented in OVMF with minimal kernel changes. We provide a blueprint for doing so.
>>
>> Typically the hypervisor facilitates live migration. AMD SEV excludes the hypervisor from the trust domain of the guest. When a hypervisor (HV) examines the memory of an SEV guest, it will find only ciphertext. If the HV moves the memory of an SEV guest, the ciphertext will be invalidated. Furthermore, with SEV-ES the hypervisor is largely unable to access guest CPU state. Thus, fast migration of SEV VMs requires support from inside the trust domain, i.e. the guest.
>>
>> One approach is to add support for SEV migration to the Linux kernel. This would allow the guest to encrypt/decrypt its own memory with a transport key. That approach has met some resistance, so we propose a similar approach implemented not in Linux but in firmware, specifically OVMF. Since OVMF runs inside the guest, it has access to the guest memory and CPU state and should be able to perform the manipulations required for live migration of SEV and SEV-ES guests.
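To make the transport-key idea concrete, here is a minimal sketch of how an in-guest component might encrypt one page before it leaves the trust domain. Everything in it (the packet layout, the name EncryptPageForTransport, and the use of OpenSSL's AES-256-GCM EVP interface) is an assumption made for illustration; it is not code from the demo, and real firmware would use its own crypto library rather than OpenSSL.

#include <stdint.h>
#include <openssl/evp.h>
#include <openssl/rand.h>

#define PAGE_SIZE 4096

/* Hypothetical wire format for one migrated page: the guest-physical
   address travels in the clear (as authenticated data); the page payload
   is encrypted with the shared transport key. */
struct MigrationPagePacket {
  uint64_t GuestPhysAddr;
  uint8_t  Iv[12];
  uint8_t  Tag[16];
  uint8_t  Ciphertext[PAGE_SIZE];
};

/* Encrypt one guest page for transport; returns 0 on success. */
int
EncryptPageForTransport (const uint8_t TransportKey[32],
                         uint64_t GuestPhysAddr,
                         const uint8_t Plaintext[PAGE_SIZE],
                         struct MigrationPagePacket *Packet)
{
  EVP_CIPHER_CTX *Ctx;
  int Outl = 0;
  int Status = -1;

  Packet->GuestPhysAddr = GuestPhysAddr;
  if (RAND_bytes (Packet->Iv, sizeof (Packet->Iv)) != 1) {
    return -1;
  }

  Ctx = EVP_CIPHER_CTX_new ();
  if (Ctx == NULL) {
    return -1;
  }

  /* AES-256-GCM with a 96-bit IV (the EVP default for GCM). */
  if (EVP_EncryptInit_ex (Ctx, EVP_aes_256_gcm (), NULL, TransportKey, Packet->Iv) != 1) {
    goto Done;
  }
  /* Bind the ciphertext to the guest-physical address (AAD), so a page
     cannot be silently relocated to a different address in transit. */
  if (EVP_EncryptUpdate (Ctx, NULL, &Outl,
                         (const uint8_t *)&Packet->GuestPhysAddr,
                         sizeof (Packet->GuestPhysAddr)) != 1) {
    goto Done;
  }
  if (EVP_EncryptUpdate (Ctx, Packet->Ciphertext, &Outl, Plaintext, PAGE_SIZE) != 1) {
    goto Done;
  }
  if (EVP_EncryptFinal_ex (Ctx, Packet->Ciphertext + Outl, &Outl) != 1) {
    goto Done;
  }
  if (EVP_CIPHER_CTX_ctrl (Ctx, EVP_CTRL_GCM_GET_TAG, sizeof (Packet->Tag), Packet->Tag) != 1) {
    goto Done;
  }
  Status = 0;

Done:
  EVP_CIPHER_CTX_free (Ctx);
  return Status;
}

On the destination, the in-guest component would run the matching EVP_Decrypt* sequence (setting the expected tag with EVP_CTRL_GCM_SET_TAG before finalizing) and discard any page whose tag does not verify.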
>>
>> The biggest challenge of this approach involves migrating the CPU state of an SEV-ES guest. In a normal (non-SEV) migration, the HV sets the CPU state of the target before the target begins executing. In our approach, the HV starts the target and OVMF must resume to whatever state the source was in. We believe this to be the crux (or at least the most difficult part) of live migration for SEV, and we hope that by demonstrating resume from EFI we can show that our approach is generally feasible.
>>
>> Our demo can be found at . The tooling repository is the best starting point. It contains documentation about the project and the scripts needed to run the demo. There are two more repos associated with the project: a modified edk2 tree that contains our modified OVMF, and a modified qemu with a couple of temporary changes needed for the demo. Our demonstration is aimed only at resuming from a VM snapshot in OVMF. We provide the source CPU state and source memory to the destination using temporary plumbing that violates the SEV trust model; we explain the setup in more depth in README.md. We are showing only that OVMF can resume from a VM snapshot. At the end we describe our plan for transferring CPU state and memory from source to target. To be clear, the temporary tooling used for this demo isn't built for encrypted VMs, but below we explain how this demo applies to, and can be extended to, encrypted VMs.
>>
>> We implemented our resume code in a fashion very similar to the recommended S3 resume code. When the HV sets the CPU state of a guest, it can do so while the guest is not executing. Setting the state from inside the guest is a delicate operation: there is no way to set all of the CPU state atomically from inside the guest. Instead, we must set most registers individually and account for the changes in control flow that doing so causes. We do this with a three-phase trampoline. OVMF calls phase 1, which runs on the OVMF map. Phase 1 sets up phase 2 and jumps to it. Phase 2 switches to an intermediate map that reconciles the OVMF map and the source map. Phase 3 switches to the source map, restores the registers, and returns into execution of the source. We will go backwards through these phases in more depth.
>>
>> The last thing that resume to EFI does is return. Specifically, we use IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS from a temporary stack and restores them atomically, thus returning to source execution. Prior to returning, we must manually restore most other registers to the values they had on the source. One particularly significant register is CR3. When we return to Linux, CR3 must be set to the source CR3, or the first instruction executed in Linux will cause a page fault. The code that we use to restore the registers and return must be mapped in the source page table, or we would get a page fault executing the instructions prior to returning into Linux. The value of CR3 is so significant that it defines the three phases of the trampoline: phase 3 begins when CR3 is set to the source CR3. After setting CR3, we set all the other registers and return.
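As a rough illustration of that final step, the tail of phase 3 might look something like the sketch below. The SOURCE_STATE layout, the name Phase3ResumeToSource, and the use of GCC inline assembly are assumptions made for this sketch rather than the demo's actual code, and a real implementation would also restore the general-purpose registers, segment state, and MSRs before the IRETQ.

#include <stdint.h>

/* Hypothetical container for the saved source state (illustrative only). */
typedef struct {
  uint64_t Cr3;      /* source page-table root                   */
  uint64_t Rip;      /* instruction at which the source resumes  */
  uint64_t Cs;
  uint64_t Rflags;
  uint64_t Rsp;
  uint64_t Ss;
  /* ... general-purpose registers, MSRs, FPU state, ...         */
} SOURCE_STATE;

void __attribute__ ((noreturn))
Phase3ResumeToSource (const SOURCE_STATE *State)
{
  /* Phase 3 begins: switch to the source page table. From here on we may
     only touch pages that are mapped at the same virtual address in both
     the intermediate map and the source map. */
  __asm__ volatile ("movq %0, %%cr3" :: "r" (State->Cr3) : "memory");

  /* Build the IRETQ frame (SS, RSP, RFLAGS, CS, RIP) and return. IRETQ
     pops all five values in a single instruction, so the stack pointer,
     flags, and instruction pointer change atomically. */
  __asm__ volatile (
    "pushq %0\n\t"   /* SS     */
    "pushq %1\n\t"   /* RSP    */
    "pushq %2\n\t"   /* RFLAGS */
    "pushq %3\n\t"   /* CS     */
    "pushq %4\n\t"   /* RIP    */
    "iretq"
    :: "r" (State->Ss), "r" (State->Rsp), "r" (State->Rflags),
       "r" (State->Cs), "r" (State->Rip));
  __builtin_unreachable ();
}

The ordering matters: everything after the CR3 write has to live on a page that the old and new maps both cover, which is exactly why phases 2 and 3 need the dual mapping described next.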
>>
>> Phase 2 mainly exists to set up phase 3. OVMF uses a 1-1 mapping, meaning that virtual addresses are the same as physical addresses. The kernel page table uses an offset mapping, meaning that virtual addresses differ from physical addresses by a constant (for the most part). Crucially, this means that the virtual address of the page executed by phase 3 differs between the OVMF map and the source map. If we are executing code mapped in OVMF and we change CR3 to point to the source map, then even though the page may be mapped in the source map, its virtual address will be different and we will face undefined behavior. To fix this, we construct intermediate page tables that map the pages for phases 2 and 3 both at the virtual address expected in OVMF and at the virtual address expected in the source map. Thus, we can switch CR3 from OVMF's map to the intermediate map, and then from the intermediate map to the source map. Phase 2 is much shorter than phase 3. It is mainly responsible for switching to the intermediate map, flushing the TLB, and jumping to phase 3.
>>
>> Fortunately, phase 1 is even simpler than phase 2. Phase 1 has two duties. First, since phases 2 and 3 operate without a stack and can't access values defined in OVMF (such as the addresses of the pages containing phases 2 and 3), phase 1 must pass these values to phase 2 by putting them in registers. Second, phase 1 must start phase 2 by jumping to it.
>>
>> Given that we can resume to a snapshot in OVMF, we should be able to migrate an SEV guest as long as we can securely communicate the VM snapshot from source to destination. For our demo, we do this with a handful of QMP commands. More sophisticated methods are required for a production implementation.
>>
>> When we refer to a snapshot, what we really mean is the device state, memory, and CPU state of a guest. In live migration this is transmitted dynamically, as opposed to being saved and restored. Device state is not protected by SEV and can be handled entirely by the HV. Memory, on the other hand, cannot be handled only by the HV. As mentioned previously, memory needs to be encrypted with a transport key. A Migration Handler on the source will coordinate with the HV to encrypt pages and transmit them to the destination. The destination HV will receive the pages over the network and pass them to the Migration Handler in the target VM so they can be decrypted. This transmission will occur continuously until the memory of the source and target converges.
>>
>> Plain SEV does not protect the CPU state of the guest and therefore does not require any special mechanism for transmission of the CPU state. We plan to implement an end-to-end migration with plain SEV first. In SEV-ES, the PSP (platform security processor) encrypts CPU state on each VMExit. The encrypted state is stored in memory. Normally this memory (known as the VMSA) is not mapped into the guest, but we can add an entry to the nested page tables that will expose the VMSA to the guest.
>
> I have a question here: is there any kind of integrity protection on the CPU state when the target VM is resumed after migration? For example, if a malicious hypervisor maps a page with subverted CPU state into the nested page tables, what prevents the target VM from resuming execution on subverted or compromised CPU state?

Good question. Here is my thinking. The VMSA is mapped in the guest memory. It will be transmitted to the target like any other page, with encryption and integrity-checking, so we have integrity checking for the CPU state while it is in flight. I think you are wondering something slightly different, though. Once the page with the VMSA arrives at the target and is decrypted and put in place, the hypervisor could potentially change the NPT to replace the data. Since the page with the VMSA will be encrypted (and the Migration Handler will expect this), the HV can't replace the page with arbitrary values. Since the VMSA is in memory, we have the protections that SEV provides for memory. Prior to SNP, this does not include integrity protection. The HV could attempt a replay attack by replacing the page with the VMSA with an older version of the same page. That said, the target will have just booted, so there isn't much to replay. If we really need to, we could add functionality to the Migration Handler that would allow the HV to ask for an HMAC of the VMSA on the source. The Migration Handler on the target could use this to verify the VMSA just prior to starting the trampoline.
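A minimal sketch of what that check might look like on the target, assuming HMAC-SHA-256 keyed with a secret shared by the two Migration Handlers; the function name, the key size, and the use of OpenSSL are illustrative assumptions rather than anything from the demo:

#include <stdint.h>
#include <openssl/crypto.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>

#define VMSA_SIZE 4096   /* the VMSA occupies one page */

/* Recompute HMAC-SHA-256 over the received VMSA page and compare it with
   the value reported by the source Migration Handler. Returns 0 if the
   VMSA is intact, -1 otherwise. */
int
VerifyVmsaBeforeTrampoline (const uint8_t IntegrityKey[32],
                            const uint8_t Vmsa[VMSA_SIZE],
                            const uint8_t ExpectedMac[32])
{
  uint8_t      Mac[32];
  unsigned int MacLen = sizeof (Mac);

  if (HMAC (EVP_sha256 (), IntegrityKey, 32, Vmsa, VMSA_SIZE, Mac, &MacLen) == NULL) {
    return -1;
  }
  /* Constant-time comparison; the trampoline only runs on success. */
  return CRYPTO_memcmp (Mac, ExpectedMac, sizeof (Mac)) == 0 ? 0 : -1;
}

Nothing here is SEV-specific; it only assumes the two Migration Handlers already share a secret that the HV never sees.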
Given the above, I am not sure this is necessary. Hopefully I've understood the attack you're suggesting correctly.

-Tobin

> Thanks,
> Ashish
>
>> This means that when the guest VMExits, the CPU state will be saved to guest memory. With the CPU state in guest memory, it can be transmitted to the target using the method described above.
>>
>> In addition to the changes needed in OVMF to resume the VM, the transmission of the VM from source to target will require a new code path in the hypervisor. There will also need to be a few minor changes to Linux (adding a mapping for our Phase 3 pages). Despite all the moving pieces, we believe that this is a feasible approach for supporting live migration for SEV and SEV-ES.
>>
>> For the sake of brevity, we have left out a few issues, including SMP support, generation of the intermediate mappings, and more. We have included some notes about these issues in the COMPLICATIONS.md file. We also have an outline of an end-to-end implementation of live migration for SEV-ES in END-TO-END.md. See README.md for info on how to run the demo. While this is not a full migration, we hope to show that fast live migration with SEV and SEV-ES is possible without major kernel changes.
>>
>> -Tobin