public inbox for devel@edk2.groups.io
 help / color / mirror / Atom feed
* OvmfPkg: VM crashed trying to write to RO memory from CommonInterruptEntry
@ 2016-11-22 12:57 Evgeny Yakovlev
  2016-11-22 13:41 ` Laszlo Ersek
  0 siblings, 1 reply; 9+ messages in thread
From: Evgeny Yakovlev @ 2016-11-22 12:57 UTC (permalink / raw)
  To: edk2-devel; +Cc: eyakovlev, lersek, den, rkagan

We are running windows UEFI-based VMs on QEMU/KVM with OvmfPkg.

Very rarely we are experiencing a crash when VM tries to write to RO memory
very early during UEFI boot process.

Crash happens when VM tries to execute this code in interrupt handler:
https://github.com/tianocore/edk2/blob/master/UefiCpuPkg/Library/CpuExceptionHandlerLib/X64/ExceptionHandlerAsm.asm#L244-L246


fxsave [rdi], where RDI = 0xffe60

Which is bad - it points to ISA BIOS F-segment area.

This memory was mapped by qemu for read only access, which is reflected in
KVM EPT:
00000000000e0000-00000000000fffff (prio 1, R-): isa-bios

This is a very early IRQ0 interrupt, presumably during early initialization
phase (Sec or Pei).

Looks like CommonInterruptHandler does not switch to a separate stack and
works on interrupted context's stack, which was fairly close to 1MB
boundary when IRQ0 fired (RSP around 1002c0). When CommonInterruptEntry
reached highlighted code it subtracted 512 bytes from current RSP which
dropped to 0xffe60, below 1MB and into QEMU RO region.

We were figuring out how to best fix this. Possible solutions are to switch
to a separate stack in CommonInterruptEntry, relocate early OvmfPkg stack
to somewhere farther away from 1MB, to run with interrupts disabled until
we reach a later phase or maybe something else.

Any comments would be very appreciated!


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: OvmfPkg: VM crashed trying to write to RO memory from CommonInterruptEntry
  2016-11-22 12:57 OvmfPkg: VM crashed trying to write to RO memory from CommonInterruptEntry Evgeny Yakovlev
@ 2016-11-22 13:41 ` Laszlo Ersek
  2016-11-22 13:58   ` Evgeny Yakovlev
  0 siblings, 1 reply; 9+ messages in thread
From: Laszlo Ersek @ 2016-11-22 13:41 UTC (permalink / raw)
  To: Evgeny Yakovlev, edk2-devel; +Cc: eyakovlev, den, Jeff Fan

Hello Evgeny,

On 11/22/16 13:57, Evgeny Yakovlev wrote:
> We are running windows UEFI-based VMs on QEMU/KVM with OvmfPkg.
> 
> Very rarely we are experiencing a crash when VM tries to write to RO memory
> very early during UEFI boot process.
> 
> Crash happens when VM tries to execute this code in interrupt handler:
> https://github.com/tianocore/edk2/blob/master/UefiCpuPkg/Library/CpuExceptionHandlerLib/X64/ExceptionHandlerAsm.asm#L244-L246
> 
> 
> fxsave [rdi], where RDI = 0xffe60
> 
> Which is bad - it points to ISA BIOS F-segment area.
> 
> This memory was mapped by qemu for read only access, which is reflected in
> KVM EPT:
> 00000000000e0000-00000000000fffff (prio 1, R-): isa-bios
> 
> This is a very early IRQ0 interrupt, presumably during early initialization
> phase (Sec or Pei).
> 
> Looks like CommonInterruptHandler does not switch to a separate stack and
> works on interrupted context's stack, which was fairly close to 1MB
> boundary when IRQ0 fired (RSP around 1002c0). When CommonInterruptEntry
> reached highlighted code it subtracted 512 bytes from current RSP which
> dropped to 0xffe60, below 1MB and into QEMU RO region.
> 
> We were figuring out how to best fix this. Possible solutions are to switch
> to a separate stack in CommonInterruptEntry, relocate early OvmfPkg stack
> to somewhere farther away from 1MB, to run with interrupts disabled until
> we reach a later phase or maybe something else.
> 
> Any comments would be very appreciated!

I wonder if you started to see this issue very recently.

I suspect (hope!) that the symptoms you are experiencing are a
consequence of a bug in UefiCpuPkg that I've debugged and fixed just
today. (I hope to post the patches today.)

While testing those patches on your end will of course tell us if your
issue has the same root cause, you could gather a few more symptoms even
before I get around posting the patches. The bug that I'm working on has
extremely varied crash symptoms (basically the APs wander off into the
weeds), and some of those symptoms have involved CpuExceptionHandlerLib.
The point is, by the time we get into CpuExceptionHandlerLib, all is
lost -- it is executing on an AP whose state is corrupt anyway. The
fxsave symptom is a red herring, most likely.

CpuExceptionHandlerLib works fine otherwise, especially when invoked
from the BSP -- we've used the output dumped by CpuExceptionHandlerLib
to the serial port several times to track down issues.

So, my request is that you please capture the OVMF debug log (please see
the "OvmfPkg/README" file for how). I'm curious if it crashes where and
how I suspect it crashes.

Also, it would help if you provided
- your host CPU model,
- the host kernel (KVM) version,
- the guest CPU model,
- the guest CPU topology.

Thanks!
Laszlo


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: OvmfPkg: VM crashed trying to write to RO memory from CommonInterruptEntry
  2016-11-22 13:41 ` Laszlo Ersek
@ 2016-11-22 13:58   ` Evgeny Yakovlev
  2016-11-22 16:58     ` Laszlo Ersek
  0 siblings, 1 reply; 9+ messages in thread
From: Evgeny Yakovlev @ 2016-11-22 13:58 UTC (permalink / raw)
  To: Laszlo Ersek; +Cc: edk2-devel, eyakovlev, den, Jeff Fan

Wow, that is more than i expected :)

> I wonder if you started to see this issue very recently.
Very recently, however we use a pretty old OVMF build, circa 2015

>  OVMF debug log
Sorry, we hadn't had it enabled when VM crashed and these crashes are very
rare. We will try to capture it when it happens again

> - your host CPU model,
cpu family      : 6
model           : 42
model name      : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
stepping        : 7

> - the host kernel (KVM) version,
Our kernel is roughly based on RHEL7.2 (kernel version 3.10.0-327.36.1). We
also have some upstream KVM patches backported.

> - the guest CPU model,
-cpu
SandyBridge,+vme,+ds,+acpi,+ss,+ht,+tm,+pbe,+dtes64,+monitor,+ds_cpl,+smx,+est,+tm2,+xtpr,+pdcm,+pcid,+osxsave,-arat,-xsaveopt,-xgetbv1,-vmx,-xsavec,hv_time,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff,hv_vpindex,hv_runtime,hv_synic,hv_stimer,hv_reset,hv_crash

> - the guest CPU topology.
8 sockets, 1 core per socket, 1 thread per core

Hope that helps!

2016-11-22 16:41 GMT+03:00 Laszlo Ersek <lersek@redhat.com>:

> Hello Evgeny,
>
> On 11/22/16 13:57, Evgeny Yakovlev wrote:
> > We are running windows UEFI-based VMs on QEMU/KVM with OvmfPkg.
> >
> > Very rarely we are experiencing a crash when VM tries to write to RO
> memory
> > very early during UEFI boot process.
> >
> > Crash happens when VM tries to execute this code in interrupt handler:
> > https://github.com/tianocore/edk2/blob/master/UefiCpuPkg/Library/
> CpuExceptionHandlerLib/X64/ExceptionHandlerAsm.asm#L244-L246
> >
> >
> > fxsave [rdi], where RDI = 0xffe60
> >
> > Which is bad - it points to ISA BIOS F-segment area.
> >
> > This memory was mapped by qemu for read only access, which is reflected
> in
> > KVM EPT:
> > 00000000000e0000-00000000000fffff (prio 1, R-): isa-bios
> >
> > This is a very early IRQ0 interrupt, presumably during early
> initialization
> > phase (Sec or Pei).
> >
> > Looks like CommonInterruptHandler does not switch to a separate stack and
> > works on interrupted context's stack, which was fairly close to 1MB
> > boundary when IRQ0 fired (RSP around 1002c0). When CommonInterruptEntry
> > reached highlighted code it subtracted 512 bytes from current RSP which
> > dropped to 0xffe60, below 1MB and into QEMU RO region.
> >
> > We were figuring out how to best fix this. Possible solutions are to
> switch
> > to a separate stack in CommonInterruptEntry, relocate early OvmfPkg stack
> > to somewhere farther away from 1MB, to run with interrupts disabled until
> > we reach a later phase or maybe something else.
> >
> > Any comments would be very appreciated!
>
> I wonder if you started to see this issue very recently.
>
> I suspect (hope!) that the symptoms you are experiencing are a
> consequence of a bug in UefiCpuPkg that I've debugged and fixed just
> today. (I hope to post the patches today.)
>
> While testing those patches on your end will of course tell us if your
> issue has the same root cause, you could gather a few more symptoms even
> before I get around posting the patches. The bug that I'm working on has
> extremely varied crash symptoms (basically the APs wander off into the
> weeds), and some of those symptoms have involved CpuExceptionHandlerLib.
> The point is, by the time we get into CpuExceptionHandlerLib, all is
> lost -- it is executing on an AP whose state is corrupt anyway. The
> fxsave symptom is a red herring, most likely.
>
> CpuExceptionHandlerLib works fine otherwise, especially when invoked
> from the BSP -- we've used the output dumped by CpuExceptionHandlerLib
> to the serial port several times to track down issues.
>
> So, my request is that you please capture the OVMF debug log (please see
> the "OvmfPkg/README" file for how). I'm curious if it crashes where and
> how I suspect it crashes.
>
> Also, it would help if you provided
> - your host CPU model,
> - the host kernel (KVM) version,
> - the guest CPU model,
> - the guest CPU topology.
>
> Thanks!
> Laszlo
>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: OvmfPkg: VM crashed trying to write to RO memory from CommonInterruptEntry
  2016-11-22 13:58   ` Evgeny Yakovlev
@ 2016-11-22 16:58     ` Laszlo Ersek
  2016-11-23  8:37       ` Evgeny Yakovlev
  2016-11-23 14:31       ` Evgeny Yakovlev
  0 siblings, 2 replies; 9+ messages in thread
From: Laszlo Ersek @ 2016-11-22 16:58 UTC (permalink / raw)
  To: Evgeny Yakovlev; +Cc: edk2-devel, eyakovlev, den, Jeff Fan

On 11/22/16 14:58, Evgeny Yakovlev wrote:
> Wow, that is more than i expected :)
> 
>> I wonder if you started to see this issue very recently.
> Very recently, however we use a pretty old OVMF build, circa 2015

Ugh. Please update OVMF first... A whole lot of things has changed in
edk2 in this year.

> 
>>  OVMF debug log
> Sorry, we hadn't had it enabled when VM crashed and these crashes are very
> rare. We will try to capture it when it happens again
> 
>> - your host CPU model,
> cpu family      : 6
> model           : 42
> model name      : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
> stepping        : 7
> 
>> - the host kernel (KVM) version,
> Our kernel is roughly based on RHEL7.2 (kernel version 3.10.0-327.36.1). We
> also have some upstream KVM patches backported.
> 
>> - the guest CPU model,
> -cpu
> SandyBridge,+vme,+ds,+acpi,+ss,+ht,+tm,+pbe,+dtes64,+monitor,+ds_cpl,+smx,+est,+tm2,+xtpr,+pdcm,+pcid,+osxsave,-arat,-xsaveopt,-xgetbv1,-vmx,-xsavec,hv_time,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff,hv_vpindex,hv_runtime,hv_synic,hv_stimer,hv_reset,hv_crash
> 
>> - the guest CPU topology.
> 8 sockets, 1 core per socket, 1 thread per core
> 
> Hope that helps!

The fact that you are using 8 VCPUs is definitely relevant. However, I
don't think it would make sense to try to analyze any errors with an
OVMF / edk2 tree this old. Please try to reproduce the issue with a
fresh build from master.

Thanks!
Laszlo

> 2016-11-22 16:41 GMT+03:00 Laszlo Ersek <lersek@redhat.com>:
> 
>> Hello Evgeny,
>>
>> On 11/22/16 13:57, Evgeny Yakovlev wrote:
>>> We are running windows UEFI-based VMs on QEMU/KVM with OvmfPkg.
>>>
>>> Very rarely we are experiencing a crash when VM tries to write to RO
>> memory
>>> very early during UEFI boot process.
>>>
>>> Crash happens when VM tries to execute this code in interrupt handler:
>>> https://github.com/tianocore/edk2/blob/master/UefiCpuPkg/Library/
>> CpuExceptionHandlerLib/X64/ExceptionHandlerAsm.asm#L244-L246
>>>
>>>
>>> fxsave [rdi], where RDI = 0xffe60
>>>
>>> Which is bad - it points to ISA BIOS F-segment area.
>>>
>>> This memory was mapped by qemu for read only access, which is reflected
>> in
>>> KVM EPT:
>>> 00000000000e0000-00000000000fffff (prio 1, R-): isa-bios
>>>
>>> This is a very early IRQ0 interrupt, presumably during early
>> initialization
>>> phase (Sec or Pei).
>>>
>>> Looks like CommonInterruptHandler does not switch to a separate stack and
>>> works on interrupted context's stack, which was fairly close to 1MB
>>> boundary when IRQ0 fired (RSP around 1002c0). When CommonInterruptEntry
>>> reached highlighted code it subtracted 512 bytes from current RSP which
>>> dropped to 0xffe60, below 1MB and into QEMU RO region.
>>>
>>> We were figuring out how to best fix this. Possible solutions are to
>> switch
>>> to a separate stack in CommonInterruptEntry, relocate early OvmfPkg stack
>>> to somewhere farther away from 1MB, to run with interrupts disabled until
>>> we reach a later phase or maybe something else.
>>>
>>> Any comments would be very appreciated!
>>
>> I wonder if you started to see this issue very recently.
>>
>> I suspect (hope!) that the symptoms you are experiencing are a
>> consequence of a bug in UefiCpuPkg that I've debugged and fixed just
>> today. (I hope to post the patches today.)
>>
>> While testing those patches on your end will of course tell us if your
>> issue has the same root cause, you could gather a few more symptoms even
>> before I get around posting the patches. The bug that I'm working on has
>> extremely varied crash symptoms (basically the APs wander off into the
>> weeds), and some of those symptoms have involved CpuExceptionHandlerLib.
>> The point is, by the time we get into CpuExceptionHandlerLib, all is
>> lost -- it is executing on an AP whose state is corrupt anyway. The
>> fxsave symptom is a red herring, most likely.
>>
>> CpuExceptionHandlerLib works fine otherwise, especially when invoked
>> from the BSP -- we've used the output dumped by CpuExceptionHandlerLib
>> to the serial port several times to track down issues.
>>
>> So, my request is that you please capture the OVMF debug log (please see
>> the "OvmfPkg/README" file for how). I'm curious if it crashes where and
>> how I suspect it crashes.
>>
>> Also, it would help if you provided
>> - your host CPU model,
>> - the host kernel (KVM) version,
>> - the guest CPU model,
>> - the guest CPU topology.
>>
>> Thanks!
>> Laszlo
>>
> _______________________________________________
> edk2-devel mailing list
> edk2-devel@lists.01.org
> https://lists.01.org/mailman/listinfo/edk2-devel
> 



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: OvmfPkg: VM crashed trying to write to RO memory from CommonInterruptEntry
  2016-11-22 16:58     ` Laszlo Ersek
@ 2016-11-23  8:37       ` Evgeny Yakovlev
  2016-11-23 16:54         ` Laszlo Ersek
  2016-11-23 14:31       ` Evgeny Yakovlev
  1 sibling, 1 reply; 9+ messages in thread
From: Evgeny Yakovlev @ 2016-11-23  8:37 UTC (permalink / raw)
  To: Laszlo Ersek; +Cc: edk2-devel, eyakovlev, den, Jeff Fan

You are right of course about the old tree, no objections here. I will try
to advocate for an update however i am pretty sure we're stuck with our
version for some time at least.

Still, my original question was about is it normal for OVMF Sec/Pei stage
to have its stack so close to 0x100000 and/or why interrupt handler in
UefiCpuPkg/Library/CpuExceptionHandlerLib/X64 does not switch to a separate
stack.
Code in UefiCpuPkg/Library/CpuExceptionHandlerLib/X64 hasn't been touched
for 2 years so our version is still relevant.

2016-11-22 19:58 GMT+03:00 Laszlo Ersek <lersek@redhat.com>:

> On 11/22/16 14:58, Evgeny Yakovlev wrote:
> > Wow, that is more than i expected :)
> >
> >> I wonder if you started to see this issue very recently.
> > Very recently, however we use a pretty old OVMF build, circa 2015
>
> Ugh. Please update OVMF first... A whole lot of things has changed in
> edk2 in this year.
>
> >
> >>  OVMF debug log
> > Sorry, we hadn't had it enabled when VM crashed and these crashes are
> very
> > rare. We will try to capture it when it happens again
> >
> >> - your host CPU model,
> > cpu family      : 6
> > model           : 42
> > model name      : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
> > stepping        : 7
> >
> >> - the host kernel (KVM) version,
> > Our kernel is roughly based on RHEL7.2 (kernel version 3.10.0-327.36.1).
> We
> > also have some upstream KVM patches backported.
> >
> >> - the guest CPU model,
> > -cpu
> > SandyBridge,+vme,+ds,+acpi,+ss,+ht,+tm,+pbe,+dtes64,+
> monitor,+ds_cpl,+smx,+est,+tm2,+xtpr,+pdcm,+pcid,+
> osxsave,-arat,-xsaveopt,-xgetbv1,-vmx,-xsavec,hv_time,
> hv_relaxed,hv_vapic,hv_spinlocks=0x1fff,hv_vpindex,
> hv_runtime,hv_synic,hv_stimer,hv_reset,hv_crash
> >
> >> - the guest CPU topology.
> > 8 sockets, 1 core per socket, 1 thread per core
> >
> > Hope that helps!
>
> The fact that you are using 8 VCPUs is definitely relevant. However, I
> don't think it would make sense to try to analyze any errors with an
> OVMF / edk2 tree this old. Please try to reproduce the issue with a
> fresh build from master.
>
> Thanks!
> Laszlo
>
> > 2016-11-22 16:41 GMT+03:00 Laszlo Ersek <lersek@redhat.com>:
> >
> >> Hello Evgeny,
> >>
> >> On 11/22/16 13:57, Evgeny Yakovlev wrote:
> >>> We are running windows UEFI-based VMs on QEMU/KVM with OvmfPkg.
> >>>
> >>> Very rarely we are experiencing a crash when VM tries to write to RO
> >> memory
> >>> very early during UEFI boot process.
> >>>
> >>> Crash happens when VM tries to execute this code in interrupt handler:
> >>> https://github.com/tianocore/edk2/blob/master/UefiCpuPkg/Library/
> >> CpuExceptionHandlerLib/X64/ExceptionHandlerAsm.asm#L244-L246
> >>>
> >>>
> >>> fxsave [rdi], where RDI = 0xffe60
> >>>
> >>> Which is bad - it points to ISA BIOS F-segment area.
> >>>
> >>> This memory was mapped by qemu for read only access, which is reflected
> >> in
> >>> KVM EPT:
> >>> 00000000000e0000-00000000000fffff (prio 1, R-): isa-bios
> >>>
> >>> This is a very early IRQ0 interrupt, presumably during early
> >> initialization
> >>> phase (Sec or Pei).
> >>>
> >>> Looks like CommonInterruptHandler does not switch to a separate stack
> and
> >>> works on interrupted context's stack, which was fairly close to 1MB
> >>> boundary when IRQ0 fired (RSP around 1002c0). When CommonInterruptEntry
> >>> reached highlighted code it subtracted 512 bytes from current RSP which
> >>> dropped to 0xffe60, below 1MB and into QEMU RO region.
> >>>
> >>> We were figuring out how to best fix this. Possible solutions are to
> >> switch
> >>> to a separate stack in CommonInterruptEntry, relocate early OvmfPkg
> stack
> >>> to somewhere farther away from 1MB, to run with interrupts disabled
> until
> >>> we reach a later phase or maybe something else.
> >>>
> >>> Any comments would be very appreciated!
> >>
> >> I wonder if you started to see this issue very recently.
> >>
> >> I suspect (hope!) that the symptoms you are experiencing are a
> >> consequence of a bug in UefiCpuPkg that I've debugged and fixed just
> >> today. (I hope to post the patches today.)
> >>
> >> While testing those patches on your end will of course tell us if your
> >> issue has the same root cause, you could gather a few more symptoms even
> >> before I get around posting the patches. The bug that I'm working on has
> >> extremely varied crash symptoms (basically the APs wander off into the
> >> weeds), and some of those symptoms have involved CpuExceptionHandlerLib.
> >> The point is, by the time we get into CpuExceptionHandlerLib, all is
> >> lost -- it is executing on an AP whose state is corrupt anyway. The
> >> fxsave symptom is a red herring, most likely.
> >>
> >> CpuExceptionHandlerLib works fine otherwise, especially when invoked
> >> from the BSP -- we've used the output dumped by CpuExceptionHandlerLib
> >> to the serial port several times to track down issues.
> >>
> >> So, my request is that you please capture the OVMF debug log (please see
> >> the "OvmfPkg/README" file for how). I'm curious if it crashes where and
> >> how I suspect it crashes.
> >>
> >> Also, it would help if you provided
> >> - your host CPU model,
> >> - the host kernel (KVM) version,
> >> - the guest CPU model,
> >> - the guest CPU topology.
> >>
> >> Thanks!
> >> Laszlo
> >>
> > _______________________________________________
> > edk2-devel mailing list
> > edk2-devel@lists.01.org
> > https://lists.01.org/mailman/listinfo/edk2-devel
> >
>
>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: OvmfPkg: VM crashed trying to write to RO memory from CommonInterruptEntry
  2016-11-22 16:58     ` Laszlo Ersek
  2016-11-23  8:37       ` Evgeny Yakovlev
@ 2016-11-23 14:31       ` Evgeny Yakovlev
  2016-11-23 16:38         ` Laszlo Ersek
  1 sibling, 1 reply; 9+ messages in thread
From: Evgeny Yakovlev @ 2016-11-23 14:31 UTC (permalink / raw)
  To: Laszlo Ersek; +Cc: edk2-devel, eyakovlev, den, Jeff Fan

Looks like we're actually based on OVMF tree for RHEL7.2:
ovmf/rhel-20150414-2.gitc9e5618.el7, ovmf/rhel-srpm-7.2

So maybe this affects those deployments as well

2016-11-22 19:58 GMT+03:00 Laszlo Ersek <lersek@redhat.com>:

> On 11/22/16 14:58, Evgeny Yakovlev wrote:
> > Wow, that is more than i expected :)
> >
> >> I wonder if you started to see this issue very recently.
> > Very recently, however we use a pretty old OVMF build, circa 2015
>
> Ugh. Please update OVMF first... A whole lot of things has changed in
> edk2 in this year.
>
> >
> >>  OVMF debug log
> > Sorry, we hadn't had it enabled when VM crashed and these crashes are
> very
> > rare. We will try to capture it when it happens again
> >
> >> - your host CPU model,
> > cpu family      : 6
> > model           : 42
> > model name      : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
> > stepping        : 7
> >
> >> - the host kernel (KVM) version,
> > Our kernel is roughly based on RHEL7.2 (kernel version 3.10.0-327.36.1).
> We
> > also have some upstream KVM patches backported.
> >
> >> - the guest CPU model,
> > -cpu
> > SandyBridge,+vme,+ds,+acpi,+ss,+ht,+tm,+pbe,+dtes64,+
> monitor,+ds_cpl,+smx,+est,+tm2,+xtpr,+pdcm,+pcid,+
> osxsave,-arat,-xsaveopt,-xgetbv1,-vmx,-xsavec,hv_time,
> hv_relaxed,hv_vapic,hv_spinlocks=0x1fff,hv_vpindex,
> hv_runtime,hv_synic,hv_stimer,hv_reset,hv_crash
> >
> >> - the guest CPU topology.
> > 8 sockets, 1 core per socket, 1 thread per core
> >
> > Hope that helps!
>
> The fact that you are using 8 VCPUs is definitely relevant. However, I
> don't think it would make sense to try to analyze any errors with an
> OVMF / edk2 tree this old. Please try to reproduce the issue with a
> fresh build from master.
>
> Thanks!
> Laszlo
>
> > 2016-11-22 16:41 GMT+03:00 Laszlo Ersek <lersek@redhat.com>:
> >
> >> Hello Evgeny,
> >>
> >> On 11/22/16 13:57, Evgeny Yakovlev wrote:
> >>> We are running windows UEFI-based VMs on QEMU/KVM with OvmfPkg.
> >>>
> >>> Very rarely we are experiencing a crash when VM tries to write to RO
> >> memory
> >>> very early during UEFI boot process.
> >>>
> >>> Crash happens when VM tries to execute this code in interrupt handler:
> >>> https://github.com/tianocore/edk2/blob/master/UefiCpuPkg/Library/
> >> CpuExceptionHandlerLib/X64/ExceptionHandlerAsm.asm#L244-L246
> >>>
> >>>
> >>> fxsave [rdi], where RDI = 0xffe60
> >>>
> >>> Which is bad - it points to ISA BIOS F-segment area.
> >>>
> >>> This memory was mapped by qemu for read only access, which is reflected
> >> in
> >>> KVM EPT:
> >>> 00000000000e0000-00000000000fffff (prio 1, R-): isa-bios
> >>>
> >>> This is a very early IRQ0 interrupt, presumably during early
> >> initialization
> >>> phase (Sec or Pei).
> >>>
> >>> Looks like CommonInterruptHandler does not switch to a separate stack
> and
> >>> works on interrupted context's stack, which was fairly close to 1MB
> >>> boundary when IRQ0 fired (RSP around 1002c0). When CommonInterruptEntry
> >>> reached highlighted code it subtracted 512 bytes from current RSP which
> >>> dropped to 0xffe60, below 1MB and into QEMU RO region.
> >>>
> >>> We were figuring out how to best fix this. Possible solutions are to
> >> switch
> >>> to a separate stack in CommonInterruptEntry, relocate early OvmfPkg
> stack
> >>> to somewhere farther away from 1MB, to run with interrupts disabled
> until
> >>> we reach a later phase or maybe something else.
> >>>
> >>> Any comments would be very appreciated!
> >>
> >> I wonder if you started to see this issue very recently.
> >>
> >> I suspect (hope!) that the symptoms you are experiencing are a
> >> consequence of a bug in UefiCpuPkg that I've debugged and fixed just
> >> today. (I hope to post the patches today.)
> >>
> >> While testing those patches on your end will of course tell us if your
> >> issue has the same root cause, you could gather a few more symptoms even
> >> before I get around posting the patches. The bug that I'm working on has
> >> extremely varied crash symptoms (basically the APs wander off into the
> >> weeds), and some of those symptoms have involved CpuExceptionHandlerLib.
> >> The point is, by the time we get into CpuExceptionHandlerLib, all is
> >> lost -- it is executing on an AP whose state is corrupt anyway. The
> >> fxsave symptom is a red herring, most likely.
> >>
> >> CpuExceptionHandlerLib works fine otherwise, especially when invoked
> >> from the BSP -- we've used the output dumped by CpuExceptionHandlerLib
> >> to the serial port several times to track down issues.
> >>
> >> So, my request is that you please capture the OVMF debug log (please see
> >> the "OvmfPkg/README" file for how). I'm curious if it crashes where and
> >> how I suspect it crashes.
> >>
> >> Also, it would help if you provided
> >> - your host CPU model,
> >> - the host kernel (KVM) version,
> >> - the guest CPU model,
> >> - the guest CPU topology.
> >>
> >> Thanks!
> >> Laszlo
> >>
> > _______________________________________________
> > edk2-devel mailing list
> > edk2-devel@lists.01.org
> > https://lists.01.org/mailman/listinfo/edk2-devel
> >
>
>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: OvmfPkg: VM crashed trying to write to RO memory from CommonInterruptEntry
  2016-11-23 14:31       ` Evgeny Yakovlev
@ 2016-11-23 16:38         ` Laszlo Ersek
  0 siblings, 0 replies; 9+ messages in thread
From: Laszlo Ersek @ 2016-11-23 16:38 UTC (permalink / raw)
  To: Evgeny Yakovlev; +Cc: edk2-devel, eyakovlev, den, Jeff Fan

On 11/23/16 15:31, Evgeny Yakovlev wrote:
> Looks like we're actually based on OVMF tree for RHEL7.2:
> ovmf/rhel-20150414-2.gitc9e5618.el7, ovmf/rhel-srpm-7.2
> 
> So maybe this affects those deployments as well

Maybe, but OVMF is Tech Preview (--> totally unsupported) even in
RHEL-7.3. :)

Laszlo

> 2016-11-22 19:58 GMT+03:00 Laszlo Ersek <lersek@redhat.com
> <mailto:lersek@redhat.com>>:
> 
>     On 11/22/16 14:58, Evgeny Yakovlev wrote:
>     > Wow, that is more than i expected :)
>     >
>     >> I wonder if you started to see this issue very recently.
>     > Very recently, however we use a pretty old OVMF build, circa 2015
> 
>     Ugh. Please update OVMF first... A whole lot of things has changed in
>     edk2 in this year.
> 
>     >
>     >>  OVMF debug log
>     > Sorry, we hadn't had it enabled when VM crashed and these crashes are very
>     > rare. We will try to capture it when it happens again
>     >
>     >> - your host CPU model,
>     > cpu family      : 6
>     > model           : 42
>     > model name      : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
>     > stepping        : 7
>     >
>     >> - the host kernel (KVM) version,
>     > Our kernel is roughly based on RHEL7.2 (kernel version 3.10.0-327.36.1). We
>     > also have some upstream KVM patches backported.
>     >
>     >> - the guest CPU model,
>     > -cpu
>     > SandyBridge,+vme,+ds,+acpi,+ss,+ht,+tm,+pbe,+dtes64,+monitor,+ds_cpl,+smx,+est,+tm2,+xtpr,+pdcm,+pcid,+osxsave,-arat,-xsaveopt,-xgetbv1,-vmx,-xsavec,hv_time,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff,hv_vpindex,hv_runtime,hv_synic,hv_stimer,hv_reset,hv_crash
>     >
>     >> - the guest CPU topology.
>     > 8 sockets, 1 core per socket, 1 thread per core
>     >
>     > Hope that helps!
> 
>     The fact that you are using 8 VCPUs is definitely relevant. However, I
>     don't think it would make sense to try to analyze any errors with an
>     OVMF / edk2 tree this old. Please try to reproduce the issue with a
>     fresh build from master.
> 
>     Thanks!
>     Laszlo
> 
>     > 2016-11-22 16:41 GMT+03:00 Laszlo Ersek <lersek@redhat.com
>     <mailto:lersek@redhat.com>>:
>     >
>     >> Hello Evgeny,
>     >>
>     >> On 11/22/16 13:57, Evgeny Yakovlev wrote:
>     >>> We are running windows UEFI-based VMs on QEMU/KVM with OvmfPkg.
>     >>>
>     >>> Very rarely we are experiencing a crash when VM tries to write to RO
>     >> memory
>     >>> very early during UEFI boot process.
>     >>>
>     >>> Crash happens when VM tries to execute this code in interrupt
>     handler:
>     >>>
>     https://github.com/tianocore/edk2/blob/master/UefiCpuPkg/Library/
>     <https://github.com/tianocore/edk2/blob/master/UefiCpuPkg/Library/>
>     >> CpuExceptionHandlerLib/X64/ExceptionHandlerAsm.asm#L244-L246
>     >>>
>     >>>
>     >>> fxsave [rdi], where RDI = 0xffe60
>     >>>
>     >>> Which is bad - it points to ISA BIOS F-segment area.
>     >>>
>     >>> This memory was mapped by qemu for read only access, which is
>     reflected
>     >> in
>     >>> KVM EPT:
>     >>> 00000000000e0000-00000000000fffff (prio 1, R-): isa-bios
>     >>>
>     >>> This is a very early IRQ0 interrupt, presumably during early
>     >> initialization
>     >>> phase (Sec or Pei).
>     >>>
>     >>> Looks like CommonInterruptHandler does not switch to a separate
>     stack and
>     >>> works on interrupted context's stack, which was fairly close to 1MB
>     >>> boundary when IRQ0 fired (RSP around 1002c0). When
>     CommonInterruptEntry
>     >>> reached highlighted code it subtracted 512 bytes from current
>     RSP which
>     >>> dropped to 0xffe60, below 1MB and into QEMU RO region.
>     >>>
>     >>> We were figuring out how to best fix this. Possible solutions are to
>     >> switch
>     >>> to a separate stack in CommonInterruptEntry, relocate early
>     OvmfPkg stack
>     >>> to somewhere farther away from 1MB, to run with interrupts
>     disabled until
>     >>> we reach a later phase or maybe something else.
>     >>>
>     >>> Any comments would be very appreciated!
>     >>
>     >> I wonder if you started to see this issue very recently.
>     >>
>     >> I suspect (hope!) that the symptoms you are experiencing are a
>     >> consequence of a bug in UefiCpuPkg that I've debugged and fixed just
>     >> today. (I hope to post the patches today.)
>     >>
>     >> While testing those patches on your end will of course tell us if
>     your
>     >> issue has the same root cause, you could gather a few more
>     symptoms even
>     >> before I get around posting the patches. The bug that I'm working
>     on has
>     >> extremely varied crash symptoms (basically the APs wander off
>     into the
>     >> weeds), and some of those symptoms have involved
>     CpuExceptionHandlerLib.
>     >> The point is, by the time we get into CpuExceptionHandlerLib, all is
>     >> lost -- it is executing on an AP whose state is corrupt anyway. The
>     >> fxsave symptom is a red herring, most likely.
>     >>
>     >> CpuExceptionHandlerLib works fine otherwise, especially when invoked
>     >> from the BSP -- we've used the output dumped by
>     CpuExceptionHandlerLib
>     >> to the serial port several times to track down issues.
>     >>
>     >> So, my request is that you please capture the OVMF debug log
>     (please see
>     >> the "OvmfPkg/README" file for how). I'm curious if it crashes
>     where and
>     >> how I suspect it crashes.
>     >>
>     >> Also, it would help if you provided
>     >> - your host CPU model,
>     >> - the host kernel (KVM) version,
>     >> - the guest CPU model,
>     >> - the guest CPU topology.
>     >>
>     >> Thanks!
>     >> Laszlo
>     >>
>     > _______________________________________________
>     > edk2-devel mailing list
>     > edk2-devel@lists.01.org <mailto:edk2-devel@lists.01.org>
>     > https://lists.01.org/mailman/listinfo/edk2-devel
>     <https://lists.01.org/mailman/listinfo/edk2-devel>
>     >
> 
> 



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: OvmfPkg: VM crashed trying to write to RO memory from CommonInterruptEntry
  2016-11-23  8:37       ` Evgeny Yakovlev
@ 2016-11-23 16:54         ` Laszlo Ersek
  2016-12-07  9:11           ` Evgeny Yakovlev
  0 siblings, 1 reply; 9+ messages in thread
From: Laszlo Ersek @ 2016-11-23 16:54 UTC (permalink / raw)
  To: Evgeny Yakovlev; +Cc: edk2-devel, eyakovlev, den, Jeff Fan

On 11/23/16 09:37, Evgeny Yakovlev wrote:
> You are right of course about the old tree, no objections here. I will
> try to advocate for an update however i am pretty sure we're stuck with
> our version for some time at least.
> 
> Still, my original question was about is it normal for OVMF Sec/Pei
> stage to have its stack so close to 0x100000

The SEC phase, and the part of the PEI phase that runs before installing
and migrating to the permanent PEI RAM, use the range 8256KB..8288KB as
heap and stack (half/half, 16 KB and 16 KB). This range is above 8MB.

The address you specify is 1MB. I don't think we ever set up such a
stack intentionally.

It is possible that the failure occurs in one of the AP startup routines
(the APs start in real mode), and the exception handler is executed by
the AP.

To know more, the serial port and debug port outputs would be
interesting; but, I should note, based on past experience, when the APs
run into such issues in their startup routines, the final symptoms are
usually garbage / chaotic. These issues are usually fixed by preventing
the APs from wandering off into the woods in the first place (for
example, nailing down race conditions between BSP and APs), and edk2 has
seen a lot of improvements in that area, with the arrival of MpInitLib.

> and/or why interrupt
> handler in UefiCpuPkg/Library/CpuExceptionHandlerLib/X64 does not switch
> to a separate stack.

I hope Jeff can provide some insight here.

Thanks
Laszlo

> Code in UefiCpuPkg/Library/CpuExceptionHandlerLib/X64 hasn't been
> touched for 2 years so our version is still relevant.
> 
> 2016-11-22 19:58 GMT+03:00 Laszlo Ersek <lersek@redhat.com
> <mailto:lersek@redhat.com>>:
> 
>     On 11/22/16 14:58, Evgeny Yakovlev wrote:
>     > Wow, that is more than i expected :)
>     >
>     >> I wonder if you started to see this issue very recently.
>     > Very recently, however we use a pretty old OVMF build, circa 2015
> 
>     Ugh. Please update OVMF first... A whole lot of things has changed in
>     edk2 in this year.
> 
>     >
>     >>  OVMF debug log
>     > Sorry, we hadn't had it enabled when VM crashed and these crashes are very
>     > rare. We will try to capture it when it happens again
>     >
>     >> - your host CPU model,
>     > cpu family      : 6
>     > model           : 42
>     > model name      : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
>     > stepping        : 7
>     >
>     >> - the host kernel (KVM) version,
>     > Our kernel is roughly based on RHEL7.2 (kernel version 3.10.0-327.36.1). We
>     > also have some upstream KVM patches backported.
>     >
>     >> - the guest CPU model,
>     > -cpu
>     > SandyBridge,+vme,+ds,+acpi,+ss,+ht,+tm,+pbe,+dtes64,+monitor,+ds_cpl,+smx,+est,+tm2,+xtpr,+pdcm,+pcid,+osxsave,-arat,-xsaveopt,-xgetbv1,-vmx,-xsavec,hv_time,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff,hv_vpindex,hv_runtime,hv_synic,hv_stimer,hv_reset,hv_crash
>     >
>     >> - the guest CPU topology.
>     > 8 sockets, 1 core per socket, 1 thread per core
>     >
>     > Hope that helps!
> 
>     The fact that you are using 8 VCPUs is definitely relevant. However, I
>     don't think it would make sense to try to analyze any errors with an
>     OVMF / edk2 tree this old. Please try to reproduce the issue with a
>     fresh build from master.
> 
>     Thanks!
>     Laszlo
> 
>     > 2016-11-22 16:41 GMT+03:00 Laszlo Ersek <lersek@redhat.com
>     <mailto:lersek@redhat.com>>:
>     >
>     >> Hello Evgeny,
>     >>
>     >> On 11/22/16 13:57, Evgeny Yakovlev wrote:
>     >>> We are running windows UEFI-based VMs on QEMU/KVM with OvmfPkg.
>     >>>
>     >>> Very rarely we are experiencing a crash when VM tries to write to RO
>     >> memory
>     >>> very early during UEFI boot process.
>     >>>
>     >>> Crash happens when VM tries to execute this code in interrupt
>     handler:
>     >>>
>     https://github.com/tianocore/edk2/blob/master/UefiCpuPkg/Library/
>     <https://github.com/tianocore/edk2/blob/master/UefiCpuPkg/Library/>
>     >> CpuExceptionHandlerLib/X64/ExceptionHandlerAsm.asm#L244-L246
>     >>>
>     >>>
>     >>> fxsave [rdi], where RDI = 0xffe60
>     >>>
>     >>> Which is bad - it points to ISA BIOS F-segment area.
>     >>>
>     >>> This memory was mapped by qemu for read only access, which is
>     reflected
>     >> in
>     >>> KVM EPT:
>     >>> 00000000000e0000-00000000000fffff (prio 1, R-): isa-bios
>     >>>
>     >>> This is a very early IRQ0 interrupt, presumably during early
>     >> initialization
>     >>> phase (Sec or Pei).
>     >>>
>     >>> Looks like CommonInterruptHandler does not switch to a separate
>     stack and
>     >>> works on interrupted context's stack, which was fairly close to 1MB
>     >>> boundary when IRQ0 fired (RSP around 1002c0). When
>     CommonInterruptEntry
>     >>> reached highlighted code it subtracted 512 bytes from current
>     RSP which
>     >>> dropped to 0xffe60, below 1MB and into QEMU RO region.
>     >>>
>     >>> We were figuring out how to best fix this. Possible solutions are to
>     >> switch
>     >>> to a separate stack in CommonInterruptEntry, relocate early
>     OvmfPkg stack
>     >>> to somewhere farther away from 1MB, to run with interrupts
>     disabled until
>     >>> we reach a later phase or maybe something else.
>     >>>
>     >>> Any comments would be very appreciated!
>     >>
>     >> I wonder if you started to see this issue very recently.
>     >>
>     >> I suspect (hope!) that the symptoms you are experiencing are a
>     >> consequence of a bug in UefiCpuPkg that I've debugged and fixed just
>     >> today. (I hope to post the patches today.)
>     >>
>     >> While testing those patches on your end will of course tell us if
>     your
>     >> issue has the same root cause, you could gather a few more
>     symptoms even
>     >> before I get around posting the patches. The bug that I'm working
>     on has
>     >> extremely varied crash symptoms (basically the APs wander off
>     into the
>     >> weeds), and some of those symptoms have involved
>     CpuExceptionHandlerLib.
>     >> The point is, by the time we get into CpuExceptionHandlerLib, all is
>     >> lost -- it is executing on an AP whose state is corrupt anyway. The
>     >> fxsave symptom is a red herring, most likely.
>     >>
>     >> CpuExceptionHandlerLib works fine otherwise, especially when invoked
>     >> from the BSP -- we've used the output dumped by
>     CpuExceptionHandlerLib
>     >> to the serial port several times to track down issues.
>     >>
>     >> So, my request is that you please capture the OVMF debug log
>     (please see
>     >> the "OvmfPkg/README" file for how). I'm curious if it crashes
>     where and
>     >> how I suspect it crashes.
>     >>
>     >> Also, it would help if you provided
>     >> - your host CPU model,
>     >> - the host kernel (KVM) version,
>     >> - the guest CPU model,
>     >> - the guest CPU topology.
>     >>
>     >> Thanks!
>     >> Laszlo
>     >>
>     > _______________________________________________
>     > edk2-devel mailing list
>     > edk2-devel@lists.01.org <mailto:edk2-devel@lists.01.org>
>     > https://lists.01.org/mailman/listinfo/edk2-devel
>     <https://lists.01.org/mailman/listinfo/edk2-devel>
>     >
> 
> 



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: OvmfPkg: VM crashed trying to write to RO memory from CommonInterruptEntry
  2016-11-23 16:54         ` Laszlo Ersek
@ 2016-12-07  9:11           ` Evgeny Yakovlev
  0 siblings, 0 replies; 9+ messages in thread
From: Evgeny Yakovlev @ 2016-12-07  9:11 UTC (permalink / raw)
  To: Laszlo Ersek; +Cc: edk2-devel, eyakovlev, den, Jeff Fan

Hi Laszlo

So i managed to reproduce this with debug logs this time and figured out
what happened.
tl;dr: KVM is spamming guest firmware with 8254 PIT timer IRQs after
accumulating a lot of missed timer expiration events while win 2k16 boot
loader was waiting for debugger connection on COM2.

As in previous bugs we see recursive entry into IRQ0 handler in OVMF. OVMF
IRQ0 handler workload is O(1) however it enables interrupts during normal
processing so reenters are possible. If we manually slow down IRQ0 handler
enough we can reproduce this easily.

Repro always happens strictly during UEFI BDS phase, when 2k16 boot loader
is already executing but still as a UEFI application (ExitBootServices
hasn't been called yet). During this time boot loader waits for debugger
connection on COM2 (installation ISO was configured this way).

While it does so a lot of IRQ3's are generated (COM2) and guest does not
react to IRQ0 interrupts which are generated by kvm-pit. KVM-PIT keeps
track of all non-acked interrupts. When any ACK comes KVM tries to
immediately reinject all accumulated pending interrupts without any delays:

kvm_pit_ack_irq:
248         /* in this case, we had multiple outstanding pit interrupts
249          * that we needed to inject.  Reinject
250          */
251         queue_kthread_work(&ps->pit->worker, &ps->pit->expired);


In one case KVM managed to accumulate 0x108b interrupts to reinject:
CPU 0/KVM-47696 [000] d... 63779.924496: kvm_pit_ack_irq_44:
(kvm_pit_ack_irq+0x2c/0x70 [kvm]) pending=108b

Normally OVMF configures PIT to generate IRQ0 each 10 milliseconds, however
guest now sees IRQ0 each 500-600 microseconds:
kvm-pit/47688-47704 [007] .... 63779.924512: kvm_pic_set_irq: chip 0 pin 0
(edge)
kvm-pit/47688-47704 [007] .... 63779.925132: kvm_pic_set_irq: chip 0 pin 0
(edge)

Which is not enough time for normal OVMF IRQ0 handler to exit so when it
enables interrupts IRQ0 handler is immediately reintered. After about 300
reinjects machine runs out of stack space and hits RO area below 1MB.

Just though you'd be interested to know!

2016-11-23 19:54 GMT+03:00 Laszlo Ersek <lersek@redhat.com>:

> On 11/23/16 09:37, Evgeny Yakovlev wrote:
> > You are right of course about the old tree, no objections here. I will
> > try to advocate for an update however i am pretty sure we're stuck with
> > our version for some time at least.
> >
> > Still, my original question was about is it normal for OVMF Sec/Pei
> > stage to have its stack so close to 0x100000
>
> The SEC phase, and the part of the PEI phase that runs before installing
> and migrating to the permanent PEI RAM, use the range 8256KB..8288KB as
> heap and stack (half/half, 16 KB and 16 KB). This range is above 8MB.
>
> The address you specify is 1MB. I don't think we ever set up such a
> stack intentionally.
>
> It is possible that the failure occurs in one of the AP startup routines
> (the APs start in real mode), and the exception handler is executed by
> the AP.
>
> To know more, the serial port and debug port outputs would be
> interesting; but, I should note, based on past experience, when the APs
> run into such issues in their startup routines, the final symptoms are
> usually garbage / chaotic. These issues are usually fixed by preventing
> the APs from wandering off into the woods in the first place (for
> example, nailing down race conditions between BSP and APs), and edk2 has
> seen a lot of improvements in that area, with the arrival of MpInitLib.
>
> > and/or why interrupt
> > handler in UefiCpuPkg/Library/CpuExceptionHandlerLib/X64 does not switch
> > to a separate stack.
>
> I hope Jeff can provide some insight here.
>
> Thanks
> Laszlo
>
> > Code in UefiCpuPkg/Library/CpuExceptionHandlerLib/X64 hasn't been
> > touched for 2 years so our version is still relevant.
> >
> > 2016-11-22 19:58 GMT+03:00 Laszlo Ersek <lersek@redhat.com
> > <mailto:lersek@redhat.com>>:
> >
> >     On 11/22/16 14:58, Evgeny Yakovlev wrote:
> >     > Wow, that is more than i expected :)
> >     >
> >     >> I wonder if you started to see this issue very recently.
> >     > Very recently, however we use a pretty old OVMF build, circa 2015
> >
> >     Ugh. Please update OVMF first... A whole lot of things has changed in
> >     edk2 in this year.
> >
> >     >
> >     >>  OVMF debug log
> >     > Sorry, we hadn't had it enabled when VM crashed and these crashes
> are very
> >     > rare. We will try to capture it when it happens again
> >     >
> >     >> - your host CPU model,
> >     > cpu family      : 6
> >     > model           : 42
> >     > model name      : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
> >     > stepping        : 7
> >     >
> >     >> - the host kernel (KVM) version,
> >     > Our kernel is roughly based on RHEL7.2 (kernel version
> 3.10.0-327.36.1). We
> >     > also have some upstream KVM patches backported.
> >     >
> >     >> - the guest CPU model,
> >     > -cpu
> >     > SandyBridge,+vme,+ds,+acpi,+ss,+ht,+tm,+pbe,+dtes64,+
> monitor,+ds_cpl,+smx,+est,+tm2,+xtpr,+pdcm,+pcid,+
> osxsave,-arat,-xsaveopt,-xgetbv1,-vmx,-xsavec,hv_time,
> hv_relaxed,hv_vapic,hv_spinlocks=0x1fff,hv_vpindex,
> hv_runtime,hv_synic,hv_stimer,hv_reset,hv_crash
> >     >
> >     >> - the guest CPU topology.
> >     > 8 sockets, 1 core per socket, 1 thread per core
> >     >
> >     > Hope that helps!
> >
> >     The fact that you are using 8 VCPUs is definitely relevant. However,
> I
> >     don't think it would make sense to try to analyze any errors with an
> >     OVMF / edk2 tree this old. Please try to reproduce the issue with a
> >     fresh build from master.
> >
> >     Thanks!
> >     Laszlo
> >
> >     > 2016-11-22 16:41 GMT+03:00 Laszlo Ersek <lersek@redhat.com
> >     <mailto:lersek@redhat.com>>:
> >     >
> >     >> Hello Evgeny,
> >     >>
> >     >> On 11/22/16 13:57, Evgeny Yakovlev wrote:
> >     >>> We are running windows UEFI-based VMs on QEMU/KVM with OvmfPkg.
> >     >>>
> >     >>> Very rarely we are experiencing a crash when VM tries to write
> to RO
> >     >> memory
> >     >>> very early during UEFI boot process.
> >     >>>
> >     >>> Crash happens when VM tries to execute this code in interrupt
> >     handler:
> >     >>>
> >     https://github.com/tianocore/edk2/blob/master/UefiCpuPkg/Library/
> >     <https://github.com/tianocore/edk2/blob/master/UefiCpuPkg/Library/>
> >     >> CpuExceptionHandlerLib/X64/ExceptionHandlerAsm.asm#L244-L246
> >     >>>
> >     >>>
> >     >>> fxsave [rdi], where RDI = 0xffe60
> >     >>>
> >     >>> Which is bad - it points to ISA BIOS F-segment area.
> >     >>>
> >     >>> This memory was mapped by qemu for read only access, which is
> >     reflected
> >     >> in
> >     >>> KVM EPT:
> >     >>> 00000000000e0000-00000000000fffff (prio 1, R-): isa-bios
> >     >>>
> >     >>> This is a very early IRQ0 interrupt, presumably during early
> >     >> initialization
> >     >>> phase (Sec or Pei).
> >     >>>
> >     >>> Looks like CommonInterruptHandler does not switch to a separate
> >     stack and
> >     >>> works on interrupted context's stack, which was fairly close to
> 1MB
> >     >>> boundary when IRQ0 fired (RSP around 1002c0). When
> >     CommonInterruptEntry
> >     >>> reached highlighted code it subtracted 512 bytes from current
> >     RSP which
> >     >>> dropped to 0xffe60, below 1MB and into QEMU RO region.
> >     >>>
> >     >>> We were figuring out how to best fix this. Possible solutions
> are to
> >     >> switch
> >     >>> to a separate stack in CommonInterruptEntry, relocate early
> >     OvmfPkg stack
> >     >>> to somewhere farther away from 1MB, to run with interrupts
> >     disabled until
> >     >>> we reach a later phase or maybe something else.
> >     >>>
> >     >>> Any comments would be very appreciated!
> >     >>
> >     >> I wonder if you started to see this issue very recently.
> >     >>
> >     >> I suspect (hope!) that the symptoms you are experiencing are a
> >     >> consequence of a bug in UefiCpuPkg that I've debugged and fixed
> just
> >     >> today. (I hope to post the patches today.)
> >     >>
> >     >> While testing those patches on your end will of course tell us if
> >     your
> >     >> issue has the same root cause, you could gather a few more
> >     symptoms even
> >     >> before I get around posting the patches. The bug that I'm working
> >     on has
> >     >> extremely varied crash symptoms (basically the APs wander off
> >     into the
> >     >> weeds), and some of those symptoms have involved
> >     CpuExceptionHandlerLib.
> >     >> The point is, by the time we get into CpuExceptionHandlerLib, all
> is
> >     >> lost -- it is executing on an AP whose state is corrupt anyway.
> The
> >     >> fxsave symptom is a red herring, most likely.
> >     >>
> >     >> CpuExceptionHandlerLib works fine otherwise, especially when
> invoked
> >     >> from the BSP -- we've used the output dumped by
> >     CpuExceptionHandlerLib
> >     >> to the serial port several times to track down issues.
> >     >>
> >     >> So, my request is that you please capture the OVMF debug log
> >     (please see
> >     >> the "OvmfPkg/README" file for how). I'm curious if it crashes
> >     where and
> >     >> how I suspect it crashes.
> >     >>
> >     >> Also, it would help if you provided
> >     >> - your host CPU model,
> >     >> - the host kernel (KVM) version,
> >     >> - the guest CPU model,
> >     >> - the guest CPU topology.
> >     >>
> >     >> Thanks!
> >     >> Laszlo
> >     >>
> >     > _______________________________________________
> >     > edk2-devel mailing list
> >     > edk2-devel@lists.01.org <mailto:edk2-devel@lists.01.org>
> >     > https://lists.01.org/mailman/listinfo/edk2-devel
> >     <https://lists.01.org/mailman/listinfo/edk2-devel>
> >     >
> >
> >
>
>


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2016-12-07  9:11 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-11-22 12:57 OvmfPkg: VM crashed trying to write to RO memory from CommonInterruptEntry Evgeny Yakovlev
2016-11-22 13:41 ` Laszlo Ersek
2016-11-22 13:58   ` Evgeny Yakovlev
2016-11-22 16:58     ` Laszlo Ersek
2016-11-23  8:37       ` Evgeny Yakovlev
2016-11-23 16:54         ` Laszlo Ersek
2016-12-07  9:11           ` Evgeny Yakovlev
2016-11-23 14:31       ` Evgeny Yakovlev
2016-11-23 16:38         ` Laszlo Ersek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox