Sender: afish@apple.com
From: "Andrew Fish"
Message-id: <284BFC25-8534-4147-8616-DE7C410DB681@apple.com>
Subject: Re: [edk2-devel] A problem with live migration of UEFI virtual machines
Date: Thu, 27 Feb 2020 20:04:00 -0800
In-reply-to: <6666a886-720d-1ead-8f7e-13e65dcaaeb4@redhat.com>
Cc: wuchenye1995, zhoujianjay, Alex Bennée, berrange@redhat.com,
 "Dr. David Alan Gilbert", qemu-devel@nongnu.org, discuss
To: devel@edk2.groups.io, lersek@redhat.com
References: <87sgjhxbtc.fsf@zen.linaroharston>
 <20200224152810.GX635661@redhat.com>
 <8b0ec286-9322-ee00-3729-6ec7ee8260a6@redhat.com>
 <3E8BB07B-8730-4AB8-BCB6-EA183FB589C5@apple.com>
 <465a5a84-cac4-de39-8956-e38771807450@redhat.com>
 <8F42F6F1-A65D-490D-9F2F-E12746870B29@apple.com>
 <6666a886-720d-1ead-8f7e-13e65dcaaeb4@redhat.com>

> On Feb 26, 2020, at 1:42 AM, Laszlo Ersek <lersek@redhat.com> wrote:
>
> Hi Andrew,
>
> On 02/25/20 22:35, Andrew Fish wrote:
>
>> Laszlo,
>>
>> The FLASH offsets changing breaking things makes sense.
>>
>> I now realize this is like updating the EFI ROM without rebooting the
>> system. Thus changes in how the new EFI code works is not the issue.
>>
>> Is this migration event visible to the firmware? Traditionally the
>> NVRAM is a region in the FD so if you update the FD you have to skip
>> NVRAM region or save and restore it. Is that activity happening in
>> this case? Even if the ROM layout does not change how do you not lose
>> the contents of the NVRAM store when the live migration happens? Sorry
>> if this is a remedial question but I'm trying to learn how this
>> migration works.
>
> With live migration, the running guest doesn't notice anything. This is
> a general requirement for live migration (regardless of UEFI or flash).
>
> You are very correct to ask about "skipping" the NVRAM region. With the
> approach that OvmfPkg originally supported, live migration would simply
> be unfeasible.
> The "build" utility would produce a single (unified)
> OVMF.fd file, which would contain both NVRAM and executable regions, and
> the guest's variable updates would modify the one file that would exist.
> This is inappropriate even without considering live migration, because
> OVMF binary upgrades (package updates) on the virtualization host would
> force guests to lose their private variable stores (NVRAMs).
>
> Therefore, the "build" utility produces "split" files too, in addition
> to the unified OVMF.fd file. Namely, OVMF_CODE.fd and OVMF_VARS.fd.
> OVMF.fd is simply the concatenation of the latter two.
>
> $ cat OVMF_VARS.fd OVMF_CODE.fd | cmp - OVMF.fd
> [prints nothing]

Laszlo,

Thanks for the detailed explanation.

Maybe I was overcomplicating this. Given your explanation, I think the part
I'm missing is that OVMF is implying the FLASH layout, in this split model,
based on the sizes of OVMF_CODE.fd and OVMF_VARS.fd. Given that, if
OVMF_CODE.fd gets bigger, the variable store address changes from a QEMU
point of view. So basically it is the QEMU API that is making assumptions
about the relative layout of the FD in the split model, and that is what
makes a migration to a larger ROM not work. The -pflash API does not support
changing the size of the ROM without moving NVRAM, given the way it is
currently defined.

Given the above, it seems like the 2 options are:
1) Pad OVMF_CODE.fd to be very large so there is room to grow.
2) Add some feature to QEMU that allows the variable store address to not
   be based on the OVMF_CODE.fd size.

I did see this [1] and combined with your email I either understand, or I'm
still confused? :)

I'm not saying we need to change anything, I'm just trying to make sure I
understand how OVMF and QEMU are tied together.

[1] https://www.redhat.com/archives/libvir-list/2019-January/msg01031.html

Thanks,

Andrew Fish

>
> When you define a new domain (VM) on a virtualization host, the domain
> definition saves a reference (pathname) to the OVMF_CODE.fd file.
> However, the OVMF_VARS.fd file (the variable store *template*) is not
> directly referenced; instead, it is *copied* into a separate (private)
> file for the domain.
>
> Furthermore, once booted, guest has two flash chips, one that maps the
> firmware executable OVMF_CODE.fd read-only, and another pflash chip that
> maps its private varstore file read-write.
>
> This makes it possible to upgrade OVMF_CODE.fd and OVMF_VARS.fd (via
> package upgrades on the virt host) without messing with varstores that
> were earlier instantiated from OVMF_VARS.fd. What's important here is
> that the various constants in the new (upgraded) OVMF_CODE.fd file
> remain compatible with the *old* OVMF_VARS.fd structure, across package
> upgrades.
>
> If that's not possible for introducing e.g. a new feature, then the
> package upgrade must not overwrite the OVMF_CODE.fd file in place, but
> must provide an additional firmware binary. This firmware binary can
> then only be used by freshly defined domains (old domains cannot be
> switched over). Old domains can be switched over manually -- and only if
> the sysadmin decides it is OK to lose the current variable store
> contents. Then the old varstore file for the domain is deleted
> (manually), the domain definition is updated, and then a new (logically
> empty, pristine) varstore can be created from the *new* OVMF_2_VARS.fd
> that matches the *new* OVMF_2_CODE.fd.
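
As an illustration of the domain-definition step quoted above (this is not taken from the thread), here is a minimal Python sketch. The helper names `instantiate_varstore` and `pflash_args` and all paths are hypothetical. The `-drive if=pflash,...` strings follow the QEMU syntax commonly used with OVMF, and the unit numbering (code as unit 0, varstore as unit 1) is an assumption on my part.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def instantiate_varstore(template: Path, private: Path) -> Path:
    """Copy the varstore template into a private per-domain file.

    An existing private varstore is left untouched, so a package upgrade
    that replaces the template never overwrites variables that a guest
    has already written.
    """
    if not private.exists():
        private.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(template, private)
    return private


def pflash_args(code: Path, varstore: Path) -> list[str]:
    """Build QEMU arguments for the two flash chips (assumed unit order)."""
    return [
        # Shared firmware executable, mapped read-only.
        "-drive", f"if=pflash,format=raw,unit=0,readonly=on,file={code}",
        # Private variable store, mapped read-write.
        "-drive", f"if=pflash,format=raw,unit=1,file={varstore}",
    ]
```

The point of the sketch is only that the code image is referenced by path while the variable store is a per-domain copy, which is why the two can be upgraded on different schedules.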
>
>
> During live migration, the "RAM-like" contents of both pflash chips are
> migrated (the guest-side view of both chips remains the same, including
> the case when the writeable chip happens to be in "programming mode",
> i.e., during a UEFI variable write through the Fault Tolerant Write and
> Firmware Volume Block(2) protocols).
>
> Once live migration completes, QEMU dumps the full contents of the
> writeable chip to the backing file (on the destination host). Going
> forward, flash writes from within the guest are reflected to said
> host-side file on-line, just like it happened on the source host before
> live migration. If the file backing the r/w pflash chip is on NFS
> (shared by both src and dst hosts), then this one-time dumping when the
> migration completes is superfluous, but it's also harmless.
>
> The interesting question is, what happens when you power down the VM on
> the destination host (= post migration), and launch it again there, from
> zero. In that case, the firmware executable file comes from the
> *destination host* (it was never persistently migrated from the source
> host, i.e. never written out on the dst). It simply comes from the OVMF
> package that had been installed on the destination host, by the
> sysadmin. However, the varstore pflash does reflect the permanent result
> of the previous migration. So this is where things can fall apart, if
> both firmware binaries (on the src host and on the dst host) don't agree
> about the internal structure of the varstore pflash.
>
> Thanks
> Laszlo
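
The layout dependency discussed in this message can be sketched with a little address arithmetic. This is only an illustration under stated assumptions: that the two pflash chips are mapped back to back with the code image on top, ending at the 4 GiB boundary (my understanding of the x86 mapping, not something stated in the thread), and that the sizes used below are illustrative rather than the sizes of any particular OVMF build. The function names are hypothetical.

```python
FOUR_GIB = 1 << 32  # assumed top of the flash mapping


def varstore_base(code_size: int, vars_size: int) -> int:
    """Guest-physical base address of the variable store.

    Assumes the varstore chip sits directly below the code chip, and the
    code chip ends at 4 GiB, so the base depends on both sizes.
    """
    return FOUR_GIB - code_size - vars_size


def same_layout(src_code_size: int, dst_code_size: int, vars_size: int) -> bool:
    """True if the source and destination code images put the varstore
    at the same guest-physical address."""
    return (varstore_base(src_code_size, vars_size)
            == varstore_base(dst_code_size, vars_size))


def padded_code_size(payload_size: int, reserved_size: int) -> int:
    """Option 1 from the message: always ship a code image of a fixed,
    generously reserved size, so the varstore address never moves."""
    if payload_size > reserved_size:
        raise ValueError("firmware payload no longer fits the reserved size")
    return reserved_size
```

With an illustrative 0x20000-byte varstore, `varstore_base(0x1E0000, 0x20000)` is 0xFFE00000 while `varstore_base(0x37C000, 0x20000)` is 0xFFC64000, so `same_layout` reports a mismatch for that pair. Padding both code images to the same reserved size with `padded_code_size` makes the two bases equal, which is the effect option 1 is after.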