From: "Zhoujian (jay)" <jianjay.zhou@huawei.com>
To: Laszlo Ersek <lersek@redhat.com>, Andrew Fish <afish@apple.com>,
"devel@edk2.groups.io" <devel@edk2.groups.io>
Cc: "berrange@redhat.com" <berrange@redhat.com>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
zhoujianjay <zhoujianjay@gmail.com>,
discuss <discuss@edk2.groups.io>,
"Alex Bennée" <alex.bennee@linaro.org>,
wuchenye1995 <wuchenye1995@gmail.com>,
"Huangweidong (C)" <weidong.huang@huawei.com>,
"wangxin (U)" <wangxinxin.wang@huawei.com>
Subject: Re: [edk2-devel] A problem with live migration of UEFI virtual machines
Date: Fri, 28 Feb 2020 03:20:58 +0000
Message-ID: <B2D15215269B544CADD246097EACE7474BB28B35@dggemm508-mbx.china.huawei.com>
In-Reply-To: <6666a886-720d-1ead-8f7e-13e65dcaaeb4@redhat.com>
Hi Laszlo,
> -----Original Message-----
> From: Qemu-devel
> [mailto:qemu-devel-bounces+jianjay.zhou=huawei.com@nongnu.org] On Behalf
> Of Laszlo Ersek
> Sent: Wednesday, February 26, 2020 5:42 PM
> To: Andrew Fish <afish@apple.com>; devel@edk2.groups.io
> Cc: berrange@redhat.com; qemu-devel@nongnu.org; Dr. David Alan Gilbert
> <dgilbert@redhat.com>; zhoujianjay <zhoujianjay@gmail.com>; discuss
> <discuss@edk2.groups.io>; Alex Bennée <alex.bennee@linaro.org>;
> wuchenye1995 <wuchenye1995@gmail.com>
> Subject: Re: [edk2-devel] A problem with live migration of UEFI virtual machines
>
> Hi Andrew,
>
> On 02/25/20 22:35, Andrew Fish wrote:
>
> > Laszlo,
> >
> > The FLASH offsets changing and breaking things makes sense.
> >
> > I now realize this is like updating the EFI ROM without rebooting the
> > system. Thus changes in how the new EFI code works are not the issue.
> >
> > Is this migration event visible to the firmware? Traditionally the
> > NVRAM is a region in the FD so if you update the FD you have to skip
> > NVRAM region or save and restore it. Is that activity happening in
> > this case? Even if the ROM layout does not change how do you not lose
> > the contents of the NVRAM store when the live migration happens? Sorry
> > if this is a remedial question but I'm trying to learn how this
> > migration works.
>
> With live migration, the running guest doesn't notice anything. This is a general
> requirement for live migration (regardless of UEFI or flash).
>
> You are very correct to ask about "skipping" the NVRAM region. With the
> approach that OvmfPkg originally supported, live migration would simply be
> unfeasible. The "build" utility would produce a single (unified) OVMF.fd file, which
> would contain both NVRAM and executable regions, and the guest's variable
> updates would modify the one file that would exist.
> This is inappropriate even without considering live migration, because OVMF
> binary upgrades (package updates) on the virtualization host would force guests
> to lose their private variable stores (NVRAMs).
>
> Therefore, the "build" utility produces "split" files too, in addition to the unified
> OVMF.fd file. Namely, OVMF_CODE.fd and OVMF_VARS.fd.
> OVMF.fd is simply the concatenation of the latter two.
>
> $ cat OVMF_VARS.fd OVMF_CODE.fd | cmp - OVMF.fd [prints nothing]
>
> When you define a new domain (VM) on a virtualization host, the domain
> definition saves a reference (pathname) to the OVMF_CODE.fd file.
> However, the OVMF_VARS.fd file (the variable store *template*) is not directly
> referenced; instead, it is *copied* into a separate (private) file for the domain.
>
> Furthermore, once booted, the guest has two flash chips, one that maps the
> firmware executable OVMF_CODE.fd read-only, and another pflash chip that
> maps its private varstore file read-write.
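>
> For illustration, such a two-chip setup corresponds roughly to the
> following QEMU command line (the paths are host-specific examples;
> libvirt generates the equivalent configuration from the domain XML):
>
>   qemu-system-x86_64 \
>     -drive if=pflash,format=raw,unit=0,readonly=on,file=/usr/share/OVMF/OVMF_CODE.fd \
>     -drive if=pflash,format=raw,unit=1,file=/var/lib/libvirt/qemu/nvram/guest_VARS.fd \
>     ...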
>
> This makes it possible to upgrade OVMF_CODE.fd and OVMF_VARS.fd (via
> package upgrades on the virt host) without messing with varstores that were
> earlier instantiated from OVMF_VARS.fd. What's important here is that the
> various constants in the new (upgraded) OVMF_CODE.fd file remain compatible
> with the *old* OVMF_VARS.fd structure, across package upgrades.
>
> If that's not possible for introducing e.g. a new feature, then the package
> upgrade must not overwrite the OVMF_CODE.fd file in place, but must provide an
> additional firmware binary. This firmware binary can then only be used by freshly
> defined domains (old domains cannot be switched over). Old domains can be
> switched over manually -- and only if the sysadmin decides it is OK to lose the
> current variable store contents. Then the old varstore file for the domain is
> deleted (manually), the domain definition is updated, and then a new (logically
> empty, pristine) varstore can be created from the *new* OVMF_2_VARS.fd that
> matches the *new* OVMF_2_CODE.fd.
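>
> (Roughly, with illustrative host-side paths, that manual switch-over
> would be:
>
>   rm /var/lib/libvirt/qemu/nvram/guest_VARS.fd
>   cp /usr/share/OVMF/OVMF_2_VARS.fd /var/lib/libvirt/qemu/nvram/guest_VARS.fd
>
> plus pointing the domain definition at the *new* OVMF_2_CODE.fd.)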
>
>
> During live migration, the "RAM-like" contents of both pflash chips are migrated
> (the guest-side view of both chips remains the same, including the case when the
> writeable chip happens to be in "programming mode", i.e., during a UEFI variable
> write through the Fault Tolerant Write and Firmware Volume Block(2) protocols).
>
> Once live migration completes, QEMU dumps the full contents of the writeable
> chip to the backing file (on the destination host). Going forward, flash writes from
> within the guest are reflected to said host-side file on-line, just like it happened
> on the source host before live migration. If the file backing the r/w pflash chip is
> on NFS (shared by both src and dst hosts), then this one-time dumping when the
> migration completes is superfluous, but it's also harmless.
>
> The interesting question is, what happens when you power down the VM on the
> destination host (= post migration), and launch it again there, from zero. In that
> case, the firmware executable file comes from the *destination host* (it was
> never persistently migrated from the source host, i.e. never written out on the
> dst). It simply comes from the OVMF package that had been installed on the
> destination host, by the sysadmin. However, the varstore pflash does reflect the
> permanent result of the previous migration. So this is where things can fall apart,
> if both firmware binaries (on the src host and on the dst host) don't agree about
> the internal structure of the varstore pflash.
>
I found an earlier thread in which you said there are 4 options for using OVMF:
https://lists.gnu.org/archive/html/qemu-discuss/2018-04/msg00045.html
Excerpt:
"(1) If you map the unified image with -bios, all of that becomes ROM --
read-only memory.
(2) If you map the unified image with -pflash, all of that becomes
read-write MMIO.
(3) If you use the split images (OVMF_CODE.fd and a copy of
OVMF_VARS.fd), and map them as flash chips, then the top part
(OVMF_CODE.fd, consisting of SECFV and FVMAIN_COMPACT) becomes
read-only flash (MMIO), and the bottom part (copy of OVMF_VARS.fd,
consisting of FTW Spare, FTW Work, Event log, and NV store) becomes
read-write flash (MMIO).
(4) If you use -bios with OVMF_CODE.fd only, then the top part will be
ROM, and the bottom part will be "black hole" MMIO."
I think you're talking about option (2) (acceptable) and option (3)
(the best solution) in this thread, and I agree.
I'm wondering whether the ancient option (1) behaves differently with
live migration. You tried adding the -DMEM_VARSTORE_EMU_ENABLE=FALSE
build flag to disable -bios support, but option (1) may still be used by
old VMs started several years ago that are running in the cloud...
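By option (1) I mean VMs started with an invocation roughly like the
following, where the unified image becomes ROM (the path is only an
example):

  qemu-system-x86_64 -bios /usr/share/OVMF/OVMF.fd ...
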
As new features are developed, the size of OVMF.fd grows larger and
larger; that seems to be the trend. It would be nice if the firmware
could be hot-updated to the new version. As Daniel said, would it be
feasible to add zero-padding to the firmware images? Things are a little
different here, i.e. the sizes on the src and dest sides are 2 MB and
4 MB respectively: copy the source 2 MB image to the dest side, and then
add zero-padding to the end of the image to round it up to 4 MB on the
dest side (with some modification of qemu_ram_resize in QEMU to avoid
the length mismatch error report)?
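For example, the padding step itself could be done on the dest side with
something like this (illustrative file names):

  cp OVMF.fd OVMF_padded.fd      # the 2 MB unified image from the src
  truncate -s 4M OVMF_padded.fd  # zero-pad the tail out to 4 MB

Only the qemu_ram_resize handling in QEMU would then still need the
modification mentioned above.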
The physical address range assigned to the OVMF region would change from
0xffe00000 - 0xffffffff to 0xffc00000 - 0xffffffff. After the OS has
started, I see (using the command "cat /proc/iomem") that this range is
recycled by the guest OS and assigned to other PCI devices. So this
range change seems like it would not affect the guest, I think.
But if OVMF code is running when the guest is paused on the src side,
will it continue to run correctly on the dest side? I'm not sure...
So, may I ask whether option (1) would be feasible or compatible when
live migrating between different OVMF sizes? Thanks.
Regards,
Jay Zhou