From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <lersek@redhat.com>
Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by ml01.01.org (Postfix) with ESMTPS id EEF3181E50
 for <edk2-devel@ml01.01.org>; Tue, 22 Nov 2016 05:41:26 -0800 (PST)
Received: from int-mx14.intmail.prod.int.phx2.redhat.com
 (int-mx14.intmail.prod.int.phx2.redhat.com [10.5.11.27])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by mx1.redhat.com (Postfix) with ESMTPS id 54EBB8EB3F;
 Tue, 22 Nov 2016 13:41:26 +0000 (UTC)
Received: from lacos-laptop-7.usersys.redhat.com (ovpn-116-82.phx2.redhat.com
 [10.3.116.82])
 by int-mx14.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id
 uAMDfOmN013418; Tue, 22 Nov 2016 08:41:25 -0500
To: Evgeny Yakovlev <insoreiges@gmail.com>, edk2-devel@ml01.01.org
References: <CAM0BJjTMEH4pqtmUU2wSSnDeSz6SqFRopJygBZ28mmBdhHE0ow@mail.gmail.com>
Cc: eyakovlev@virtuozzo.com, den@virtuozzo.com, Jeff Fan <jeff.fan@intel.com>
From: Laszlo Ersek <lersek@redhat.com>
Message-ID: <2340021c-4bcb-2622-07a8-6e6173f94d81@redhat.com>
Date: Tue, 22 Nov 2016 14:41:23 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
 Thunderbird/45.4.0
MIME-Version: 1.0
In-Reply-To: <CAM0BJjTMEH4pqtmUU2wSSnDeSz6SqFRopJygBZ28mmBdhHE0ow@mail.gmail.com>
X-Scanned-By: MIMEDefang 2.68 on 10.5.11.27
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16
 (mx1.redhat.com [10.5.110.26]); Tue, 22 Nov 2016 13:41:26 +0000 (UTC)
Subject: Re: OvmfPkg: VM crashed trying to write to RO memory from CommonInterruptEntry
X-BeenThere: edk2-devel@lists.01.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: EDK II Development  <edk2-devel.lists.01.org>
List-Unsubscribe: <https://lists.01.org/mailman/options/edk2-devel>,
 <mailto:edk2-devel-request@lists.01.org?subject=unsubscribe>
List-Archive: <http://lists.01.org/pipermail/edk2-devel/>
List-Post: <mailto:edk2-devel@lists.01.org>
List-Help: <mailto:edk2-devel-request@lists.01.org?subject=help>
List-Subscribe: <https://lists.01.org/mailman/listinfo/edk2-devel>,
 <mailto:edk2-devel-request@lists.01.org?subject=subscribe>
X-List-Received-Date: Tue, 22 Nov 2016 13:41:27 -0000
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit

Hello Evgeny,

On 11/22/16 13:57, Evgeny Yakovlev wrote:
> We are running windows UEFI-based VMs on QEMU/KVM with OvmfPkg.
> 
> Very rarely we are experiencing a crash when VM tries to write to RO memory
> very early during UEFI boot process.
> 
> Crash happens when VM tries to execute this code in interrupt handler:
> https://github.com/tianocore/edk2/blob/master/UefiCpuPkg/Library/CpuExceptionHandlerLib/X64/ExceptionHandlerAsm.asm#L244-L246
> 
> 
> fxsave [rdi], where RDI = 0xffe60
> 
> Which is bad - it points to ISA BIOS F-segment area.
> 
> This memory was mapped by qemu for read only access, which is reflected in
> KVM EPT:
> 00000000000e0000-00000000000fffff (prio 1, R-): isa-bios
> 
> This is a very early IRQ0 interrupt, presumably during early initialization
> phase (Sec or Pei).
> 
> Looks like CommonInterruptHandler does not switch to a separate stack and
> works on interrupted context's stack, which was fairly close to 1MB
> boundary when IRQ0 fired (RSP around 1002c0). When CommonInterruptEntry
> reached highlighted code it subtracted 512 bytes from current RSP which
> dropped to 0xffe60, below 1MB and into QEMU RO region.
> 
> We were figuring out how to best fix this. Possible solutions are to switch
> to a separate stack in CommonInterruptEntry, relocate early OvmfPkg stack
> to somewhere farther away from 1MB, to run with interrupts disabled until
> we reach a later phase or maybe something else.
> 
> Any comments would be very appreciated!
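
For reference, the address arithmetic in the report can be checked with a
small sketch. It assumes only the figures quoted above (fxsave destination
0xffe60, the 512-byte FXSAVE area, and the read-only isa-bios range
0xe0000-0xfffff); the helper name overlap_bytes is hypothetical and not part
of edk2.

```c
#include <stdint.h>

/* Read-only isa-bios range as quoted in the report (inclusive bounds). */
#define RO_START         0xe0000u
#define RO_END           0xfffffu
/* Size of the FXSAVE save area in bytes. */
#define FXSAVE_AREA_SIZE 512u

/*
 * Hypothetical helper: number of bytes of the write [addr, addr + len)
 * that fall inside the inclusive range [start, end].
 */
uint64_t overlap_bytes(uint64_t addr, uint64_t len,
                       uint64_t start, uint64_t end)
{
    uint64_t lo      = addr > start ? addr : start;
    uint64_t wr_end  = addr + len;   /* exclusive end of the write */
    uint64_t rng_end = end + 1;      /* exclusive end of the range */
    uint64_t hi      = wr_end < rng_end ? wr_end : rng_end;

    return hi > lo ? hi - lo : 0;
}
```

With the quoted values, overlap_bytes(0xffe60, FXSAVE_AREA_SIZE, RO_START,
RO_END) yields 0x1a0 (416) bytes, i.e. most of the save area would land below
the 1MB boundary in the read-only region.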

I wonder if you started to see this issue very recently.

I suspect (hope!) that the symptoms you are experiencing are a
consequence of a bug in UefiCpuPkg that I've debugged and fixed just
today. (I hope to post the patches today.)

While testing those patches on your end will of course tell us if your
issue has the same root cause, you could gather a few more symptoms even
before I get around to posting the patches. The bug that I'm working on has
extremely varied crash symptoms (basically the APs wander off into the
weeds), and some of those symptoms have involved CpuExceptionHandlerLib.
The point is, by the time we get into CpuExceptionHandlerLib, all is
lost -- it is executing on an AP whose state is corrupt anyway. The
fxsave symptom is a red herring, most likely.

CpuExceptionHandlerLib works fine otherwise, especially when invoked
from the BSP -- we've used the output dumped by CpuExceptionHandlerLib
to the serial port several times to track down issues.

So, my request is that you please capture the OVMF debug log (please see
the "OvmfPkg/README" file for how). I'm curious if it crashes where and
how I suspect it crashes.

Also, it would help if you provided:
- your host CPU model,
- the host kernel (KVM) version,
- the guest CPU model,
- the guest CPU topology.

Thanks!
Laszlo