From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 29 Oct 2020 17:06:38 +0000
From: Ashish Kalra <ashish.kalra@amd.com>
To: Tobin Feldman-Fitzthum
Cc: devel@edk2.groups.io, dovmurik@linux.vnet.ibm.com, Dov.Murik1@il.ibm.com,
	brijesh.singh@amd.com, tobin@ibm.com, david.kaplan@amd.com,
	jon.grimm@amd.com, thomas.lendacky@amd.com, jejb@linux.ibm.com,
	frankeh@us.ibm.com
Subject: Re: RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept
Message-ID: <20201029170638.GA16080@ashkalra_ubuntu_server>
User-Agent: Mutt/1.9.4 (2018-02-28)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hello Tobin,

On Wed, Oct 28, 2020 at 03:31:44PM -0400, Tobin Feldman-Fitzthum wrote:
> Hello,
> 
> Dov Murik, James Bottomley, Hubertus Franke, and I have been working on
> a plan for fast live migration of SEV and SEV-ES (and SEV-SNP when it's
> out, and hopefully even Intel TDX) VMs. We have developed an approach
> that we believe is feasible and a demonstration that shows our solution
> to the most difficult part of the problem. In short, we have implemented
> a UEFI Application that can resume from a VM snapshot. We think this is
> the crux of SEV-ES live migration. After describing the context of our
> demo and how it works, we explain how it can be extended to a full
> SEV-ES migration. Our goal is to show that fast SEV and SEV-ES live
> migration can be implemented in OVMF with minimal kernel changes. We
> provide a blueprint for doing so.
> 
> Typically the hypervisor facilitates live migration. AMD SEV excludes
> the hypervisor from the trust domain of the guest. When a hypervisor
> (HV) examines the memory of an SEV guest, it will find only ciphertext.
> If the HV moves the memory of an SEV guest, the ciphertext will be
> invalidated. Furthermore, with SEV-ES the hypervisor is largely unable
> to access guest CPU state.
> Thus, fast migration of SEV VMs requires support from inside the trust
> domain, i.e. the guest.
> 
> One approach is to add support for SEV migration to the Linux kernel.
> This would allow the guest to encrypt/decrypt its own memory with a
> transport key. This approach has met some resistance. We propose a
> similar approach implemented not in Linux, but in firmware, specifically
> OVMF. Since OVMF runs inside the guest, it has access to the guest
> memory and CPU state. OVMF should be able to perform the manipulations
> required for live migration of SEV and SEV-ES guests.
> 
> The biggest challenge of this approach involves migrating the CPU state
> of an SEV-ES guest. In a normal (non-SEV) migration, the HV sets the
> CPU state of the target before the target begins executing. In our
> approach, the HV starts the target and OVMF must resume to whatever
> state the source was in. We believe this to be the crux (or at least
> the most difficult part) of live migration for SEV, and we hope that by
> demonstrating resume from EFI, we can show that our approach is
> generally feasible.
> 
> Our demo can be found at . The tooling repository is the best starting
> point. It contains documentation about the project and the scripts
> needed to run the demo. There are two more repos associated with the
> project. One is a modified edk2 tree that contains our modified OVMF.
> The other is a modified qemu that has a couple of temporary changes
> needed for the demo. Our demonstration is aimed only at resuming from a
> VM snapshot in OVMF. We provide the source CPU state and source memory
> to the destination using temporary plumbing that violates the SEV trust
> model. We explain the setup in more depth in README.md. We are showing
> only that OVMF can resume from a VM snapshot. At the end we will
> describe our plan for transferring CPU state and memory from source to
> guest.
> To be clear, the temporary tooling used for this demo isn't built for
> encrypted VMs, but below we explain how this demo applies to, and can
> be extended to, encrypted VMs.
> 
> We implemented our resume code in a fashion very similar to the
> recommended S3 resume code. When the HV sets the CPU state of a guest,
> it can do so while the guest is not executing. Setting the state from
> inside the guest is a delicate operation. There is no way to atomically
> set all of the CPU state from inside the guest. Instead, we must set
> most registers individually and account for changes in control flow
> that doing so might cause. We do this with a three-phase trampoline.
> OVMF calls phase 1, which runs on the OVMF map. Phase 1 sets up phase 2
> and jumps to it. Phase 2 switches to an intermediate map that
> reconciles the OVMF map and the source map. Phase 3 switches to the
> source map, restores the registers, and returns into execution of the
> source. We will go backwards through these phases in more depth.
> 
> The last thing that resume to EFI does is return. Specifically, we use
> IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS from a
> temporary stack and restores them atomically, thus returning to source
> execution. Prior to returning, we must manually restore most other
> registers to the values they had on the source. One particularly
> significant register is CR3. When we return to Linux, CR3 must be set
> to the source CR3, or the first instruction executed in Linux will
> cause a page fault. The code that we use to restore the registers and
> return must be mapped in the source page table, or we would get a page
> fault executing the instructions prior to returning into Linux. The
> value of CR3 is so significant that it defines the three phases of the
> trampoline. Phase 3 begins when CR3 is set to the source CR3. After
> setting CR3, we set all the other registers and return.
> 
> Phase 2 mainly exists to set up phase 3.
> OVMF uses a 1-1 mapping, meaning that virtual addresses are the same as
> physical addresses. The kernel page table uses an offset mapping,
> meaning that virtual addresses differ from physical addresses by a
> constant (for the most part). Crucially, this means that the virtual
> address of the page that is executed by phase 3 differs between the
> OVMF map and the source map. If we are executing code mapped in OVMF
> and we change CR3 to point to the source map, then although the page
> may be mapped in the source map, the virtual address will be different,
> and we will face undefined behavior. To fix this, we construct
> intermediate page tables that map the pages for phases 2 and 3 both to
> the virtual address expected in OVMF and to the virtual address
> expected in the source map. Thus, we can switch CR3 from OVMF's map to
> the intermediate map, and then from the intermediate map to the source
> map. Phase 2 is much shorter than phase 3. Phase 2 is mainly
> responsible for switching to the intermediate map, flushing the TLB,
> and jumping to phase 3.
> 
> Fortunately, phase 1 is even simpler than phase 2. Phase 1 has two
> duties. First, since phases 2 and 3 operate without a stack and can't
> access values defined in OVMF (such as the addresses of the pages
> containing phases 2 and 3), phase 1 must pass these values to phase 2
> by placing them in registers. Second, phase 1 must start phase 2 by
> jumping to it.
> 
> Given that we can resume to a snapshot in OVMF, we should be able to
> migrate an SEV guest as long as we can securely communicate the VM
> snapshot from source to destination. For our demo, we do this with a
> handful of QMP commands. More sophisticated methods are required for a
> production implementation.
> 
> When we refer to a snapshot, what we really mean is the device state,
> memory, and CPU state of a guest. In live migration this is transmitted
> dynamically, as opposed to being saved and restored.
> Device state is not protected by SEV and can be handled entirely by the
> HV. Memory, on the other hand, cannot be handled only by the HV. As
> mentioned previously, memory needs to be encrypted with a transport
> key. A Migration Handler on the source will coordinate with the HV to
> encrypt pages and transmit them to the destination. The destination HV
> will receive the pages over the network and pass them to the Migration
> Handler in the target VM so they can be decrypted. This transmission
> will occur continuously until the memory of the source and target
> converges.
> 
> Plain SEV does not protect the CPU state of the guest and therefore
> does not require any special mechanism for transmission of the CPU
> state. We plan to implement an end-to-end migration with plain SEV
> first. In SEV-ES, the PSP (platform security processor) encrypts CPU
> state on each VMExit. The encrypted state is stored in memory. Normally
> this memory (known as the VMSA) is not mapped into the guest, but we
> can add an entry to the nested page tables that will expose the VMSA to
> the guest.

I have a question here: is there any kind of integrity protection on the
CPU state when the target VM is resumed after migration? For example, if
a malicious hypervisor maps a page with subverted CPU state into the
nested page tables, what prevents the target VM from resuming execution
on a subverted or compromised CPU state?

Thanks,
Ashish

> This means that when the guest VMExits, the CPU state will be saved to
> guest memory. With the CPU state in guest memory, it can be transmitted
> to the target using the method described above.
> 
> In addition to the changes needed in OVMF to resume the VM, the
> transmission of the VM from source to target will require a new code
> path in the hypervisor. There will also need to be a few minor changes
> to Linux (adding a mapping for our Phase 3 pages).
> Despite all the moving pieces, we believe that this is a feasible
> approach for supporting live migration for SEV and SEV-ES.
> 
> For the sake of brevity, we have left out a few issues, including SMP
> support, generation of the intermediate mappings, and more. We have
> included some notes about these issues in the COMPLICATIONS.md file. We
> also have an outline of an end-to-end implementation of live migration
> for SEV-ES in END-TO-END.md. See README.md for info on how to run the
> demo. While this is not a full migration, we hope to show that fast
> live migration with SEV and SEV-ES is possible without major kernel
> changes.
> 
> -Tobin
> 