From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 30 Oct 2020 18:35:25 +0000
From: Ashish Kalra <ashish.kalra@amd.com>
To: Tobin Feldman-Fitzthum
Cc: devel@edk2.groups.io, dovmurik@linux.vnet.ibm.com, Dov.Murik1@il.ibm.com,
	brijesh.singh@amd.com, tobin@ibm.com, david.kaplan@amd.com,
	jon.grimm@amd.com, thomas.lendacky@amd.com, jejb@linux.ibm.com,
	frankeh@us.ibm.com
Subject: Re: RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept
Message-ID: <20201030183525.GA17491@ashkalra_ubuntu_server>
References: <20201029170638.GA16080@ashkalra_ubuntu_server>
	<5dc185214309c0cf309e5244dea37c11@linux.vnet.ibm.com>
In-Reply-To: <5dc185214309c0cf309e5244dea37c11@linux.vnet.ibm.com>
User-Agent: Mutt/1.9.4 (2018-02-28)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hello Tobin,

On Thu, Oct 29, 2020 at 04:36:07PM -0400, Tobin Feldman-Fitzthum wrote:
> On 2020-10-29 13:06, Ashish Kalra wrote:
> > Hello Tobin,
> >
> > On Wed, Oct 28, 2020 at 03:31:44PM -0400, Tobin Feldman-Fitzthum wrote:
> > > Hello,
> > >
> > > Dov Murik, James Bottomley, Hubertus Franke, and I have been working
> > > on a plan for fast live migration of SEV and SEV-ES (and SEV-SNP
> > > when it's out, and even hopefully Intel TDX) VMs.
> > > We have developed an approach that we believe is feasible and a
> > > demonstration that shows our solution to the most difficult part of
> > > the problem. In short, we have implemented a UEFI Application that
> > > can resume from a VM snapshot. We think this is the crux of SEV-ES
> > > live migration. After describing the context of our demo and how it
> > > works, we explain how it can be extended to a full SEV-ES migration.
> > > Our goal is to show that fast SEV and SEV-ES live migration can be
> > > implemented in OVMF with minimal kernel changes. We provide a
> > > blueprint for doing so.
> > >
> > > Typically the hypervisor facilitates live migration. AMD SEV
> > > excludes the hypervisor from the trust domain of the guest. When a
> > > hypervisor (HV) examines the memory of an SEV guest, it will find
> > > only ciphertext. If the HV moves the memory of an SEV guest, the
> > > ciphertext will be invalidated. Furthermore, with SEV-ES the
> > > hypervisor is largely unable to access guest CPU state. Thus, fast
> > > migration of SEV VMs requires support from inside the trust domain,
> > > i.e. the guest.
> > >
> > > One approach is to add support for SEV migration to the Linux
> > > kernel. This would allow the guest to encrypt/decrypt its own
> > > memory with a transport key. This approach has met some resistance.
> > > We propose a similar approach implemented not in Linux, but in
> > > firmware, specifically OVMF. Since OVMF runs inside the guest, it
> > > has access to the guest memory and CPU state. OVMF should be able
> > > to perform the manipulations required for live migration of SEV and
> > > SEV-ES guests.
> > >
> > > The biggest challenge of this approach involves migrating the CPU
> > > state of an SEV-ES guest. In a normal (non-SEV) migration, the HV
> > > sets the CPU state of the target before the target begins
> > > executing. In our approach, the HV starts the target and OVMF must
> > > resume to whatever state the source was in. We believe this to be
> > > the crux (or at least the most difficult part) of live migration
> > > for SEV, and we hope that by demonstrating resume from EFI, we can
> > > show that our approach is generally feasible.
> > >
> > > Our demo can be found at . The tooling repository is the best
> > > starting point. It contains documentation about the project and the
> > > scripts needed to run the demo. There are two more repos associated
> > > with the project. One is a modified edk2 tree that contains our
> > > modified OVMF. The other is a modified qemu that has a couple of
> > > temporary changes needed for the demo. Our demonstration is aimed
> > > only at resuming from a VM snapshot in OVMF. We provide the source
> > > CPU state and source memory to the destination using temporary
> > > plumbing that violates the SEV trust model. We explain the setup in
> > > more depth in README.md. We are showing only that OVMF can resume
> > > from a VM snapshot. At the end we will describe our plan for
> > > transferring CPU state and memory from source to target. To be
> > > clear, the temporary tooling used for this demo isn't built for
> > > encrypted VMs, but below we explain how this demo applies to and
> > > can be extended to encrypted VMs.
> > >
> > > We implemented our resume code in a fashion very similar to the
> > > recommended S3 resume code. When the HV sets the CPU state of a
> > > guest, it can do so while the guest is not executing. Setting the
> > > state from inside the guest is a delicate operation. There is no
> > > way to atomically set all of the CPU state from inside the guest.
> > > Instead, we must set most registers individually and account for
> > > changes in control flow that doing so might cause. We do this with
> > > a three-phase trampoline. OVMF calls phase 1, which runs on the
> > > OVMF map. Phase 1 sets up phase 2 and jumps to it. Phase 2 switches
> > > to an intermediate map that reconciles the OVMF map and the source
> > > map. Phase 3 switches to the source map, restores the registers,
> > > and returns into execution of the source. We will go backwards
> > > through these phases in more depth.
> > >
> > > The last thing that resume to EFI does is return. Specifically, we
> > > use IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS
> > > from a temporary stack and restores them atomically, thus returning
> > > to source execution. Prior to returning, we must manually restore
> > > most other registers to the values they had on the source. One
> > > particularly significant register is CR3. When we return to Linux,
> > > CR3 must be set to the source CR3 or the first instruction executed
> > > in Linux will cause a page fault. The code that we use to restore
> > > the registers and return must be mapped in the source page table,
> > > or we would get a page fault executing the instructions prior to
> > > returning into Linux. The value of CR3 is so significant that it
> > > defines the three phases of the trampoline. Phase 3 begins when CR3
> > > is set to the source CR3. After setting CR3, we set all the other
> > > registers and return.
> > >
> > > Phase 2 mainly exists to set up phase 3. OVMF uses a 1-1 mapping,
> > > meaning that virtual addresses are the same as physical addresses.
> > > The kernel page table uses an offset mapping, meaning that virtual
> > > addresses differ from physical addresses by a constant (for the
> > > most part). Crucially, this means that the virtual address of the
> > > page that is executed by phase 3 differs between the OVMF map and
> > > the source map. If we are executing code mapped in OVMF and we
> > > change CR3 to point to the source map, although the page may be
> > > mapped in the source map, the virtual address will be different,
> > > and we will face undefined behavior. To fix this, we construct
> > > intermediate page tables that map the pages for phases 2 and 3 both
> > > to the virtual address expected in OVMF and to the virtual address
> > > expected in the source map. Thus, we can switch CR3 from OVMF's map
> > > to the intermediate map and then from the intermediate map to the
> > > source map. Phase 2 is much shorter than phase 3. Phase 2 is mainly
> > > responsible for switching to the intermediate map, flushing the
> > > TLB, and jumping to phase 3.
> > >
> > > Fortunately, phase 1 is even simpler than phase 2. Phase 1 has two
> > > duties. First, since phases 2 and 3 operate without a stack and
> > > can't access values defined in OVMF (such as the addresses of the
> > > pages containing phases 2 and 3), phase 1 must pass these values to
> > > phase 2 by putting them in registers. Second, phase 1 must start
> > > phase 2 by jumping to it.
> > >
> > > Given that we can resume to a snapshot in OVMF, we should be able
> > > to migrate an SEV guest as long as we can securely communicate the
> > > VM snapshot from source to destination. For our demo, we do this
> > > with a handful of QMP commands. More sophisticated methods are
> > > required for a production implementation.
> > >
> > > When we refer to a snapshot, what we really mean is the device
> > > state, memory, and CPU state of a guest. In live migration this is
> > > transmitted dynamically, as opposed to being saved and restored.
> > > Device state is not protected by SEV and can be handled entirely by
> > > the HV. Memory, on the other hand, cannot be handled only by the HV.
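[Editor's note: the intermediate-map trick quoted above can be illustrated with a small, self-contained simulation. This is a conceptual sketch only; the dictionaries stand in for page tables, and all addresses and names are made up for illustration.]

```python
# Toy model of the phase 2/3 CR3 switch. A page table is modeled as a
# dict from virtual page -> physical page; "executing" at an address
# means looking that address up in whichever table CR3 points to.

TRAMPOLINE_PHYS = 0x3000           # physical page holding phase 2/3 code
OVMF_VA = 0x3000                   # OVMF identity map: VA == PA
KERNEL_OFFSET = 0xFFFF888000000000 # illustrative offset-map constant
SOURCE_VA = KERNEL_OFFSET + TRAMPOLINE_PHYS  # trampoline VA in the source map

ovmf_map = {OVMF_VA: TRAMPOLINE_PHYS}
source_map = {SOURCE_VA: TRAMPOLINE_PHYS}

# The intermediate map holds *both* translations, so the one physical
# trampoline page is reachable at the OVMF VA and at the source VA.
intermediate_map = {OVMF_VA: TRAMPOLINE_PHYS, SOURCE_VA: TRAMPOLINE_PHYS}

def fetch(cr3, rip):
    """Return the physical page backing rip, or None (a page fault)."""
    return cr3.get(rip)

# Switching straight from the OVMF map to the source map while still
# executing at the OVMF VA faults: the source map has no such mapping.
assert fetch(source_map, OVMF_VA) is None

# Via the intermediate map, execution survives both switches:
assert fetch(intermediate_map, OVMF_VA) == TRAMPOLINE_PHYS    # after phase 2's switch
assert fetch(intermediate_map, SOURCE_VA) == TRAMPOLINE_PHYS  # jump to the source VA
assert fetch(source_map, SOURCE_VA) == TRAMPOLINE_PHYS        # phase 3's final switch
```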
> > > As mentioned previously, memory needs to be encrypted with a
> > > transport key. A Migration Handler on the source will coordinate
> > > with the HV to encrypt pages and transmit them to the destination.
> > > The destination HV will receive the pages over the network and pass
> > > them to the Migration Handler in the target VM so they can be
> > > decrypted. This transmission will occur continuously until the
> > > memory of the source and target converges.
> > >
> > > Plain SEV does not protect the CPU state of the guest and therefore
> > > does not require any special mechanism for transmission of the CPU
> > > state. We plan to implement an end-to-end migration with plain SEV
> > > first. In SEV-ES, the PSP (platform security processor) encrypts
> > > CPU state on each VMExit. The encrypted state is stored in memory.
> > > Normally this memory (known as the VMSA) is not mapped into the
> > > guest, but we can add an entry to the nested page tables that will
> > > expose the VMSA to the guest.
> >
> > I have a question here: is there any kind of integrity protection on
> > the CPU state when the target VM is resumed after migration? For
> > example, if a malicious hypervisor maps a page with subverted CPU
> > state into the nested page tables, what prevents the target VM from
> > resuming execution on a subverted or compromised CPU state?
>
> Good question. Here is my thinking. The VMSA is mapped in the guest
> memory. It will be transmitted to the target like any other page, with
> encryption and integrity-checking. So we have integrity checking for
> CPU state while it is in flight.
>
> I think you are wondering something slightly different, though. Once
> the page with the VMSA arrives at the target and is decrypted and put
> in place, the hypervisor could potentially change the NPT to replace
> the data. Since the page with the VMSA will be encrypted (and the
> Migration Handler will expect this), the HV can't replace the page
> with arbitrary values.
>
> Since the VMSA is in memory, we have the protections that SEV provides
> for memory. Prior to SNP, this does not include integrity protection.
> The HV could attempt a replay attack by replacing the page with the
> VMSA with an older version of the same page. That said, the target
> will have just booted, so there isn't much to replay.
>
> If we really need to, we could add functionality to the Migration
> Handler that would allow the HV to ask for an HMAC of the VMSA on the
> source. The Migration Handler on the target could use this to verify
> the VMSA just prior to starting the trampoline. Given the above, I am
> not sure this is necessary. Hopefully I've understood the attack
> you're suggesting correctly.
>

Yes, this is the attack I am suggesting: a compromised or malicious
hypervisor replacing the page containing the CPU state with compromised
data in the NPT when the target VM starts.

Thanks,
Ashish

> > > This means that when the guest VMExits, the CPU state will be
> > > saved to guest memory. With the CPU state in guest memory, it can
> > > be transmitted to the target using the method described above.
> > >
> > > In addition to the changes needed in OVMF to resume the VM, the
> > > transmission of the VM from source to target will require a new
> > > code path in the hypervisor. There will also need to be a few minor
> > > changes to Linux (adding a mapping for our phase 3 pages). Despite
> > > all the moving pieces, we believe that this is a feasible approach
> > > for supporting live migration for SEV and SEV-ES.
> > >
> > > For the sake of brevity, we have left out a few issues, including
> > > SMP support, generation of the intermediate mappings, and more. We
> > > have included some notes about these issues in the COMPLICATIONS.md
> > > file.
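[Editor's note: the optional VMSA check discussed in this exchange (an HMAC computed by the source Migration Handler, verified on the target just before the trampoline runs) could look roughly like the sketch below. The shared transport key, its derivation, and the 4 KiB page layout are assumptions for illustration, not part of the proposal.]

```python
import hashlib
import hmac
import os

# Assumed: a shared secret the two Migration Handlers derive during
# their authenticated key exchange (not specified in the thread).
transport_key = os.urandom(32)

def vmsa_tag(key: bytes, vmsa_page: bytes) -> bytes:
    """HMAC-SHA256 over the VMSA page, computed by the source handler."""
    return hmac.new(key, vmsa_page, hashlib.sha256).digest()

# Source side: tag the VMSA page before transmission.
vmsa = os.urandom(4096)  # stand-in for the real (encrypted) VMSA page
tag = vmsa_tag(transport_key, vmsa)

# Target side: verify before starting the trampoline. compare_digest
# does a constant-time comparison to avoid timing side channels.
assert hmac.compare_digest(tag, vmsa_tag(transport_key, vmsa))

# If the HV swapped in a different (e.g. replayed) page, the check fails:
tampered = os.urandom(4096)
assert not hmac.compare_digest(tag, vmsa_tag(transport_key, tampered))
```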
> > > We also have an outline of an end-to-end implementation of live
> > > migration for SEV-ES in END-TO-END.md. See README.md for info on
> > > how to run the demo. While this is not a full migration, we hope
> > > to show that fast live migration with SEV and SEV-ES is possible
> > > without major kernel changes.
> > >
> > > -Tobin
> > >