Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS

public inbox for devel@edk2.groups.io

* Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS
       [not found]                 ` <5b7352f4-4965-3ed5-3879-db871797be47@huawei.com>
@ 2017-03-29 10:36                   ` Achin Gupta
  2017-03-29 11:58                     ` Laszlo Ersek
       [not found]                     ` <CAMj-D2BT3ByY-iFrRVVK7y=G7zhRBtM031VgLn6JzwUE-WCdWQ@mail.gmail.com>
  0 siblings, 2 replies; 10+ messages in thread
From: Achin Gupta @ 2017-03-29 10:36 UTC
  To: gengdongjiu
  Cc: lersek, ard.biesheuvel, edk2-devel, qemu-devel, zhaoshenglong,
	James Morse, Christoffer Dall, xiexiuqi, Marc Zyngier,
	catalin.marinas, will.deacon, christoffer.dall, rkrcmar,
	suzuki.poulose, andre.przywara, mark.rutland, vladimir.murzin,
	linux-arm-kernel, kvmarm, kvm, linux-kernel, wangxiongfeng2,
	wuquanming, huangshaoyu, Leif.Lindholm, nd

Hi gengdongjiu,

On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>
> Hi Laszlo/Biesheuvel/Qemu developer,
>
>    Now I encounter a issue and want to consult with you in ARM64 platform， as described below:
>
>    when guest OS happen synchronous or asynchronous abort, kvm needs to send the error address to Qemu or UEFI through sigbus to dynamically generate APEI table. from my investigation, there are two ways:
>
>    (1) Qemu get the error address, and generate the APEI table, then notify UEFI to know this generation, then inject abort error to guest OS, guest OS read the APEI table.
>    (2) Qemu get the error address, and let UEFI to generate the APEI table, then inject abort error to guest OS, guest OS read the APEI table.

Just being pedantic! I don't think we are talking about creating the APEI table
dynamically here. The issue is: Once KVM has received an error that is destined
for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the error
into the guest OS, a CPER (Common Platform Error Record) has to be generated
corresponding to the error source (GHES corresponding to memory subsystem,
processor etc) to allow the guest OS to do anything meaningful with the
error. So who should create the CPER is the question.

At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error arrives
at EL3 and secure firmware (at EL3 or a lower secure exception level) is
responsible for creating the CPER. ARM is experimenting with using a Standalone
MM EDK2 image in the secure world to do the CPER creation. This will avoid
adding the same code in ARM TF in EL3 (better for security). The error will then
be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM Trusted
Firmware.

Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
interface (as discussed with Christoffer below). So it should generate the CPER
before injecting the error.

This corresponds to (1) above apart from notifying UEFI (I am assuming you
mean guest UEFI). At this time, the guest OS already knows where to pick up the
CPER from through the HEST. Qemu has to create the CPER and populate its address
at the address exported in the HEST. Guest UEFI should not be involved in this
flow. Its job was to create the HEST at boot and that has been done by this
stage.
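
To illustrate what "create the CPER and populate it at the address exported
in the HEST" could amount to, here is a rough sketch, not Qemu code. It assumes
the 20-byte Generic Error Status Block header as I read it in the ACPI spec
(five little-endian 32-bit fields), treats the error data entries as opaque
bytes, and models guest RAM as a bytearray. All names, the 72-byte placeholder
entry and the GPA 0x100 are made up for the example.

```python
import struct

# Five little-endian 32-bit fields: block status, raw data offset,
# raw data length, data length, error severity (my reading of the ACPI spec).
GESB_HEADER = struct.Struct("<5I")

SEV_RECOVERABLE = 0  # assumed severity encoding: 0 means "recoverable"


def build_status_block(severity, entries, uncorrectable=True):
    """Return a status block header followed by opaque error data entries."""
    payload = b"".join(entries)
    # Bit 0: uncorrectable error valid, bit 1: correctable error valid,
    # bits 4-13: number of error data entries.
    valid_bit = 0x1 if uncorrectable else 0x2
    block_status = valid_bit | ((len(entries) & 0x3FF) << 4)
    header = GESB_HEADER.pack(block_status, 0, 0, len(payload), severity)
    return header + payload


def write_status_block(guest_ram, gpa, block, capacity):
    """Copy the block into the pre-allocated area; refuse to overflow it."""
    if len(block) > capacity:
        raise ValueError("record does not fit the pre-allocated block")
    guest_ram[gpa:gpa + len(block)] = block


guest_ram = bytearray(4096)   # stand-in for guest memory
entry = bytes(72)             # placeholder for one error data entry
block = build_status_block(SEV_RECOVERABLE, [entry])
write_status_block(guest_ram, 0x100, block, capacity=1024)
```

The capacity check is the important part: the area is reserved at boot, so at
runtime the writer can only fill what was set aside.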

Qemu folk will be able to add more, but it looks like support for CPER
generation will need to be added to Qemu. We need to resolve this.

Do shout if I am missing anything above.

cheers,
Achin


>
>
>    Do you think which modules generates the APEI table is better? UEFI or Qemu?
>
>
>
>
> On 2017/3/28 21:40, James Morse wrote:
> > Hi gengdongjiu,
> >
> > On 28/03/17 13:16, gengdongjiu wrote:
> >> On 2017/3/28 19:54, Achin Gupta wrote:
> >>> On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
> >>>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
> >>>>> On the host, part of UEFI is involved to generate the CPER records.
> >>>>> In a guest?, I don't know.
> >>>>> Qemu could generate the records, or drive some other component to do it.
> >>>>
> >>>> I think I am beginning to understand this a bit.  Since the guet UEFI
> >>>> instance is specifically built for the machine it runs on, QEMU's virt
> >>>> machine in this case, they could simply agree (by some contract) to
> >>>> place the records at some specific location in memory, and if the guest
> >>>> kernel asks its guest UEFI for that location, things should just work by
> >>>> having logic in QEMU to process error reports and populate guest memory.
> >>>>
> >>>> Is this how others see the world too?
> >>>
> >>> I think so!
> >>>
> >>> AFAIU, the memory where CPERs will reside should be specified in a GHES entry in
> >>> the HEST. Is this not the case with a guest kernel i.e. the guest UEFI creates a
> >>> HEST for the guest Kernel?
> >>>
> >>> If so, then the question is how the guest UEFI finds out where QEMU (acting as
> >>> EL3 firmware) will populate the CPERs. This could either be a contract between
> >>> the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask QEMU
> >>> where the memory is.
> >>
> >> whether invoke the guest UEFI will be complex? not see the advantage. it seems x86 Qemu
> >> directly generate the ACPI table, but I am not sure, we are checking the qemu
> > logical.
> >> let Qemu generate CPER record may be clear.
> >
> > At boot UEFI in the guest will need to make sure the areas of memory that may be
> > used for CPER records are reserved. Whether UEFI or Qemu decides where these are
> > needs deciding, (but probably not here)...
> >
> > At runtime, when an error has occurred, I agree it would be simpler (fewer
> > components involved) if Qemu generates the CPER records. But if UEFI made the
> > memory choice above they need to interact and it gets complicated again. The
> > CPER records are defined in the UEFI spec, so I would expect UEFI to contain
> > code to generate/parse them.
> >
> >
> > Thanks,
> >
> > James
> >
> >
> > .
> >
>



* Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS
  2017-03-29 10:36                   ` [PATCH] kvm: pass the virtual SEI syndrome to guest OS Achin Gupta
@ 2017-03-29 11:58                     ` Laszlo Ersek
       [not found]                       ` <20170329154539-mutt-send-email-mst@kernel.org>
  2017-04-06 12:35                       ` gengdongjiu
       [not found]                     ` <CAMj-D2BT3ByY-iFrRVVK7y=G7zhRBtM031VgLn6JzwUE-WCdWQ@mail.gmail.com>
  1 sibling, 2 replies; 10+ messages in thread
From: Laszlo Ersek @ 2017-03-29 11:58 UTC
  To: Achin Gupta, gengdongjiu
  Cc: ard.biesheuvel, edk2-devel, qemu-devel, zhaoshenglong,
	James Morse, Christoffer Dall, xiexiuqi, Marc Zyngier,
	catalin.marinas, will.deacon, christoffer.dall, rkrcmar,
	suzuki.poulose, andre.przywara, mark.rutland, vladimir.murzin,
	linux-arm-kernel, kvmarm, kvm, linux-kernel, wangxiongfeng2,
	wuquanming, huangshaoyu, Leif.Lindholm, nd, Michael Tsirkin,
	Igor Mammedov

(This ought to be one of the longest address lists I've ever seen :)
Thanks for the CC. I'm glad Shannon is already on the CC list. For good
measure, I'm adding MST and Igor.)

On 03/29/17 12:36, Achin Gupta wrote:
> Hi gengdongjiu,
> 
> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>>
>> Hi Laszlo/Biesheuvel/Qemu developer,
>>
>>    Now I encounter a issue and want to consult with you in ARM64 platform， as described below:
>>
>> when guest OS happen synchronous or asynchronous abort, kvm needs
>> to send the error address to Qemu or UEFI through sigbus to
>> dynamically generate APEI table. from my investigation, there are
>> two ways:
>>
>> (1) Qemu get the error address, and generate the APEI table, then
>> notify UEFI to know this generation, then inject abort error to
>> guest OS, guest OS read the APEI table.
>> (2) Qemu get the error address, and let UEFI to generate the APEI
>> table, then inject abort error to guest OS, guest OS read the APEI
>> table.
> 
> Just being pedantic! I don't think we are talking about creating the APEI table
> dynamically here. The issue is: Once KVM has received an error that is destined
> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the error
> into the guest OS, a CPER (Common Platform Error Record) has to be generated
> corresponding to the error source (GHES corresponding to memory subsystem,
> processor etc) to allow the guest OS to do anything meaningful with the
> error. So who should create the CPER is the question.
> 
> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error arrives
> at EL3 and secure firmware (at EL3 or a lower secure exception level) is
> responsible for creating the CPER. ARM is experimenting with using a Standalone
> MM EDK2 image in the secure world to do the CPER creation. This will avoid
> adding the same code in ARM TF in EL3 (better for security). The error will then
> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM Trusted
> Firmware.
> 
> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
> interface (as discussed with Christoffer below). So it should generate the CPER
> before injecting the error.
> 
> This is corresponds to (1) above apart from notifying UEFI (I am assuming you
> mean guest UEFI). At this time, the guest OS already knows where to pick up the
> CPER from through the HEST. Qemu has to create the CPER and populate its address
> at the address exported in the HEST. Guest UEFI should not be involved in this
> flow. Its job was to create the HEST at boot and that has been done by this
> stage.
> 
> Qemu folk will be able to add but it looks like support for CPER generation will
> need to be added to Qemu. We need to resolve this.
> 
> Do shout if I am missing anything above.

After reading this email, the use case looks *very* similar to what
we've just done with VMGENID for QEMU 2.9.

We have a facility between QEMU and the guest firmware, called "ACPI
linker/loader", with which QEMU instructs the firmware to

- allocate and download blobs into guest RAM (AcpiNVS type memory) --
ALLOCATE command,

- relocate pointers in those blobs, to fields in other (or the same)
blobs -- ADD_POINTER command,

- set ACPI table checksums -- ADD_CHECKSUM command,

- and send GPAs of fields within such blobs back to QEMU --
WRITE_POINTER command.

This is how I imagine we can map the facility to the current use case
(note that this is the first time I read about HEST / GHES / CPER):

    etc/acpi/tables                 etc/hardware_errors
    ================     ==========================================
                         +-----------+
    +--------------+     | address   |         +-> +--------------+
    |    HEST      +     | registers |         |   | Error Status |
    + +------------+     | +---------+         |   | Data Block 1 |
    | | GHES       | --> | | address | --------+   | +------------+
    | | GHES       | --> | | address | ------+     | |  CPER      |
    | | GHES       | --> | | address | ----+ |     | |  CPER      |
    | | GHES       | --> | | address | -+  | |     | |  CPER      |
    +-+------------+     +-+---------+  |  | |     +-+------------+
                                        |  | |
                                        |  | +---> +--------------+
                                        |  |       | Error Status |
                                        |  |       | Data Block 2 |
                                        |  |       | +------------+
                                        |  |       | |  CPER      |
                                        |  |       | |  CPER      |
                                        |  |       +-+------------+
                                        |  |
                                        |  +-----> +--------------+
                                        |          | Error Status |
                                        |          | Data Block 3 |
                                        |          | +------------+
                                        |          | |  CPER      |
                                        |          +-+------------+
                                        |
                                        +--------> +--------------+
                                                   | Error Status |
                                                   | Data Block 4 |
                                                   | +------------+
                                                   | |  CPER      |
                                                   | |  CPER      |
                                                   | |  CPER      |
                                                   +-+------------+

(1) QEMU generates the HEST ACPI table. This table goes in the current
"etc/acpi/tables" fw_cfg blob. Given N error sources, there will be N
GHES objects in the HEST.

(2) We introduce a new fw_cfg blob called "etc/hardware_errors". QEMU
also populates this blob.

(2a) Given N error sources, the (unnamed) table of address registers
will contain N address registers.

(2b) Given N error sources, the "etc/hardware_errors" fw_cfg blob will
also contain N Error Status Data Blocks.

I don't know about the sizing (number of CPERs) each Error Status Data
Block has to contain, but I understand it is all pre-allocated as far as
the OS is concerned, which matches our capabilities well.

(3) QEMU generates the ACPI linker/loader script for the firmware, as
always.

(3a) The HEST table is part of "etc/acpi/tables", which the firmware
already allocates memory for, and downloads (because QEMU already
generates an ALLOCATE linker/loader command for it).

(3b) QEMU will have to create another ALLOCATE command for the
"etc/hardware_errors" blob. The firmware allocates memory for this blob,
and downloads it.

(4) QEMU generates, in the ACPI linker/loader script for the firmware, N
ADD_POINTER commands, which point the GHES."Error Status
Address" fields in the HEST table, to the corresponding address
registers in the downloaded "etc/hardware_errors" blob.

(5) QEMU generates an ADD_CHECKSUM command for the firmware, so that the
HEST table is correctly checksummed after executing the N ADD_POINTER
commands from (4).

(6) QEMU generates N ADD_POINTER commands for the firmware, pointing the
address registers (located in guest memory, in the downloaded
"etc/hardware_errors" blob) to the respective Error Status Data Blocks.

(7) (This is the trick.) For this step, we need a third, write-only
fw_cfg blob, called "etc/hardware_errors_addr". Through that blob, the
firmware can send back the guest-side allocation addresses to QEMU.

Namely, the "etc/hardware_errors_addr" blob contains N 8-byte entries.
QEMU generates N WRITE_POINTER commands for the firmware.

For error source K (0 <= K < N), QEMU instructs the firmware to
calculate the guest address of Error Status Data Block K, from the
QEMU-dictated offset within "etc/hardware_errors", and from the
guest-determined allocation base address for "etc/hardware_errors". The
firmware then writes the calculated address back to fw_cfg file
"etc/hardware_errors_addr", at offset K*8, according to the
WRITE_POINTER command.

This way QEMU will know the GPA of each Error Status Data Block.

(In fact this can be simplified to a single WRITE_POINTER command: the
address of the "address register table" can be sent back to QEMU as
well, which already contains all Error Status Data Block addresses.)

(8) When QEMU gets SIGBUS from the kernel -- I hope that's going to come
through a signalfd -- QEMU can format the CPER right into guest memory,
and then inject whatever interrupt (or assert whatever GPIO line) is
necessary for notifying the guest.

(9) This notification (in virtual hardware) can either be handled by the
guest kernel stand-alone, or else the guest kernel can invoke an ACPI
event handler method with it (which would be in the DSDT or one of the
SSDTs, also generated by QEMU). The ACPI event handler method could
invoke the specific guest kernel driver for error handling via a
Notify() operation.
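
The pointer arithmetic in steps (2), (3b), (6) and (7) can be sketched with a
toy model. This is only an illustration under simplifying assumptions, not
QEMU or firmware code: the command tuples, BLOCK_SIZE, the allocation base
and run_firmware() are invented for the example, and the HEST side (steps
(4) and (5), including ADD_CHECKSUM) is left out.

```python
import struct

N = 4                 # number of error sources (illustrative)
BLOCK_SIZE = 0x400    # assumed size of each Error Status Data Block
REGS_SIZE = 8 * N     # table of N 8-byte address registers

# QEMU side: build the blob. Each address register is pre-filled with the
# blob-relative offset of its Error Status Data Block.
hardware_errors = bytearray(REGS_SIZE + N * BLOCK_SIZE)
for k in range(N):
    struct.pack_into("<Q", hardware_errors, 8 * k, REGS_SIZE + k * BLOCK_SIZE)

blobs = {"etc/hardware_errors": hardware_errors}
hardware_errors_addr = bytearray(8 * N)   # write-only file, filled by firmware

# QEMU side: the linker/loader script (simplified command tuples).
script = [("ALLOCATE", "etc/hardware_errors")]
script += [("ADD_POINTER", "etc/hardware_errors", 8 * k, "etc/hardware_errors")
           for k in range(N)]
script += [("WRITE_POINTER", 8 * k, "etc/hardware_errors",
            REGS_SIZE + k * BLOCK_SIZE) for k in range(N)]


def run_firmware(script, blobs, writeback, first_free=0x4000_0000):
    """Firmware side: execute the script, return the allocation bases."""
    base = {}
    next_free = first_free
    for cmd in script:
        if cmd[0] == "ALLOCATE":
            base[cmd[1]] = next_free
            next_free += (len(blobs[cmd[1]]) + 0xFFF) & ~0xFFF
        elif cmd[0] == "ADD_POINTER":
            # Turn a blob-relative offset into a guest-physical address.
            _, dst, off, src = cmd
            (val,) = struct.unpack_from("<Q", blobs[dst], off)
            struct.pack_into("<Q", blobs[dst], off, val + base[src])
        elif cmd[0] == "WRITE_POINTER":
            # Report base + offset back to QEMU through the write-only file.
            _, off, src, src_off = cmd
            struct.pack_into("<Q", writeback, off, base[src] + src_off)
    return base


base = run_firmware(script, blobs, hardware_errors_addr)
gpas = [struct.unpack_from("<Q", hardware_errors_addr, 8 * k)[0]
        for k in range(N)]
```

After the script has run, gpas[K] is what QEMU would use in step (8) as the
target for the CPER of error source K, and the address registers inside the
blob hold the same values for the guest OS to follow.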

I'm attracted to the above design because:
- it would leave the firmware alone after OS boot, and
- it would leave the firmware blissfully ignorant about HEST, GHES,
CPER, and the like. (That's why QEMU's ACPI linker/loader was invented
in the first place.)

Thanks
Laszlo

>>    Do you think which modules generates the APEI table is better? UEFI or Qemu?
>>
>>
>>
>>
>> On 2017/3/28 21:40, James Morse wrote:
>>> Hi gengdongjiu,
>>>
>>> On 28/03/17 13:16, gengdongjiu wrote:
>>>> On 2017/3/28 19:54, Achin Gupta wrote:
>>>>> On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
>>>>>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
>>>>>>> On the host, part of UEFI is involved to generate the CPER records.
>>>>>>> In a guest?, I don't know.
>>>>>>> Qemu could generate the records, or drive some other component to do it.
>>>>>>
>>>>>> I think I am beginning to understand this a bit.  Since the guet UEFI
>>>>>> instance is specifically built for the machine it runs on, QEMU's virt
>>>>>> machine in this case, they could simply agree (by some contract) to
>>>>>> place the records at some specific location in memory, and if the guest
>>>>>> kernel asks its guest UEFI for that location, things should just work by
>>>>>> having logic in QEMU to process error reports and populate guest memory.
>>>>>>
>>>>>> Is this how others see the world too?
>>>>>
>>>>> I think so!
>>>>>
>>>>> AFAIU, the memory where CPERs will reside should be specified in a GHES entry in
>>>>> the HEST. Is this not the case with a guest kernel i.e. the guest UEFI creates a
>>>>> HEST for the guest Kernel?
>>>>>
>>>>> If so, then the question is how the guest UEFI finds out where QEMU (acting as
>>>>> EL3 firmware) will populate the CPERs. This could either be a contract between
>>>>> the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask QEMU
>>>>> where the memory is.
>>>>
>>>> whether invoke the guest UEFI will be complex? not see the advantage. it seems x86 Qemu
>>>> directly generate the ACPI table, but I am not sure, we are checking the qemu
>>> logical.
>>>> let Qemu generate CPER record may be clear.
>>>
>>> At boot UEFI in the guest will need to make sure the areas of memory that may be
>>> used for CPER records are reserved. Whether UEFI or Qemu decides where these are
>>> needs deciding, (but probably not here)...
>>>
>>> At runtime, when an error has occurred, I agree it would be simpler (fewer
>>> components involved) if Qemu generates the CPER records. But if UEFI made the
>>> memory choice above they need to interact and it gets complicated again. The
>>> CPER records are defined in the UEFI spec, so I would expect UEFI to contain
>>> code to generate/parse them.
>>>
>>>
>>> Thanks,
>>>
>>> James
>>>
>>>
>>> .
>>>
>>




* Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS
       [not found]                       ` <20170329154539-mutt-send-email-mst@kernel.org>
@ 2017-03-29 13:36                         ` Laszlo Ersek
  0 siblings, 0 replies; 10+ messages in thread
From: Laszlo Ersek @ 2017-03-29 13:36 UTC
  To: Michael S. Tsirkin
  Cc: Achin Gupta, gengdongjiu, ard.biesheuvel, edk2-devel, qemu-devel,
	zhaoshenglong, James Morse, Christoffer Dall, xiexiuqi,
	Marc Zyngier, catalin.marinas, will.deacon, christoffer.dall,
	rkrcmar, suzuki.poulose, andre.przywara, mark.rutland,
	vladimir.murzin, linux-arm-kernel, kvmarm, kvm, linux-kernel,
	wangxiongfeng2, wuquanming, huangshaoyu, Leif.Lindholm, nd,
	Igor Mammedov

On 03/29/17 14:51, Michael S. Tsirkin wrote:
> On Wed, Mar 29, 2017 at 01:58:29PM +0200, Laszlo Ersek wrote:
>> (8) When QEMU gets SIGBUS from the kernel -- I hope that's going to come
>> through a signalfd -- QEMU can format the CPER right into guest memory,
>> and then inject whatever interrupt (or assert whatever GPIO line) is
>> necessary for notifying the guest.
> 
> I think I see a race condition potential - what if guest accesses
> CPER in guest memory while it's being written?

I'm not entirely sure about the data flow here (these parts of the ACPI
spec are particularly hard to read...), but I thought the OS wouldn't
look until it got a notification.

Or, are you concerned about the next CPER write by QEMU, while the OS is
reading the last one (and maybe the CPER area could wrap around?)

> 
> We can probably use another level of indirection to fix this:
> 
> allocate twice the space, add a pointer to where the valid
> table is located and update that after writing CPER completely.
> The pointer can be written atomically but also needs to
> be read atomically, so I suspect it should be a single byte
> as we don't know how are OSPMs implementing this.
> 

A-B-A problem? (Is that usually solved with a cookie or a wider
generation counter? But that would again require wider atomics.)
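
To make the concern concrete, here is a toy model of the scheme quoted above
(two slots plus a selector that is flipped only after the record is fully
written). It is a sketch with invented names; plain Python says nothing
about barriers or atomicity, which a real implementation would need around
the selector update.

```python
class DoubleBufferedRecord:
    """Two record slots plus a selector naming the currently valid one."""

    def __init__(self, size):
        self.size = size
        self.slots = [bytearray(size), bytearray(size)]
        self.valid = 0   # would be a single byte in guest memory

    def publish(self, record):
        """Writer: fill the spare slot completely, then flip the selector."""
        if len(record) > self.size:
            raise ValueError("record too large for slot")
        spare = 1 - self.valid
        self.slots[spare][:] = record.ljust(self.size, b"\0")
        # A real writer needs a write barrier here, before the flip.
        self.valid = spare

    def read(self):
        """Reader: follow the selector to the valid slot."""
        return bytes(self.slots[self.valid])


buf = DoubleBufferedRecord(8)
buf.publish(b"first")
seen = buf.valid           # reader samples the selector, then gets delayed
buf.publish(b"second")
buf.publish(b"third")      # selector is back at the value the reader saw
same_selector = (buf.valid == seen)
```

With only two slots, two publishes in a row bring the selector back to the
sampled value while the slot behind it has been overwritten, so the reader
cannot tell that anything changed. A generation counter would detect that,
at the cost of needing a wider atomic access.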

I do wonder though how this is handled on physical hardware. Assuming
the hardware error traps to the firmware first (which, on phys hw, is
responsible for depositing the CPER), in that scenario the phys firmware
would face the same issue (i.e., asynchronously interrupting the OS,
which could be reading the previously stored CPER).

Thanks,
Laszlo



* Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS
       [not found]                       ` <20170329144822.GA1020@cbox>
@ 2017-03-29 15:37                         ` Laszlo Ersek
  0 siblings, 0 replies; 10+ messages in thread
From: Laszlo Ersek @ 2017-03-29 15:37 UTC
  To: Christoffer Dall, gengdongjiu
  Cc: Achin Gupta, gengdongjiu, ard.biesheuvel, edk2-devel, qemu-devel,
	zhaoshenglong, James Morse, xiexiuqi, Marc Zyngier,
	catalin.marinas, will.deacon, christoffer.dall, rkrcmar,
	suzuki.poulose, andre.przywara, mark.rutland, vladimir.murzin,
	linux-arm-kernel, kvmarm, kvm, linux-kernel, wangxiongfeng2,
	wuquanming, huangshaoyu, Leif.Lindholm, nd

On 03/29/17 16:48, Christoffer Dall wrote:
> On Wed, Mar 29, 2017 at 10:36:51PM +0800, gengdongjiu wrote:
>> 2017-03-29 18:36 GMT+08:00, Achin Gupta <achin.gupta@arm.com>:

>>> Qemu is essentially fulfilling the role of secure firmware at the
>>> EL2/EL1 interface (as discussed with Christoffer below). So it
>>> should generate the CPER before injecting the error.
>>>
>>> This is corresponds to (1) above apart from notifying UEFI (I am
>>> assuming you mean guest UEFI). At this time, the guest OS already
>>> knows where to pick up the CPER from through the HEST. Qemu has
>>> to create the CPER and populate its address at the address
>>> exported in the HEST. Guest UEFI should not be involved in this 
>>> flow. Its job was to create the HEST at boot and that has been
>>> done by this stage.
>>
>> Sorry,  As I understand it, after Qemu generate the CPER table, it
>> should pass the CPER table to the guest UEFI, then Guest UEFI  place
>> this CPER table to the guest OS memory. In this flow, the Guest UEFI
>> should be involved, else the Guest OS can not see the CPER table.
>>
> 
> I think you need to explain the "pass the CPER table to the guest UEFI"
> concept in terms of what really happens, step by step, and when you say
> "then Guest UEFI place the CPER table to the guest OS memory", I'm
> curious who is running what code on the hardware when doing that.

I strongly suggest keeping the guest firmware's runtime involvement to
zero. Two reasons:

(1) As you explained above (... which I conveniently snipped), when you
inject an interrupt to the guest, the handler registered for that
interrupt will come from the guest kernel.

The only exception to this is when the platform provides a type of
interrupt whose handler can be registered and then locked down by the
firmware. On x86, this is the SMI.

In practice though,
- in OVMF (x86), we only do synchronous (software-initiated) SMIs (for
privileged UEFI varstore access),
- and in ArmVirtQemu (ARM / aarch64), none of the management mode stuff
exists at all.

I understand that the Platform Init 1.5 (or 1.6?) spec abstracted away
the MM (management mode) protocols from Intel SMM, but at this point
there is zero code in ArmVirtQemu for that. (And I'm unsure how much of
any eligible underlying hw emulation exists in QEMU.)

So you can't get the guest firmware to react to the injected interrupt
without the guest OS coming between first.

(2) Achin's description matches really-really closely what is possible,
and what should be done with QEMU, ArmVirtQemu, and the guest kernel.

In any solution for this feature, the firmware has to reserve some
memory from the OS at boot. The current facilities we have enable this.
As I described previously, the ACPI linker/loader actions can be mapped
more or less 1:1 to Achin's design. From a practical perspective, you
really want to keep the guest firmware as dumb as possible (meaning: as
generic as possible), and keep the ACPI specifics to the QEMU and the
guest kernel sides.

The error serialization actions -- the co-operation between guest kernel
and QEMU on the special memory areas -- that were mentioned earlier by
Michael and Punit look like a complication. But, IMO, they don't differ
from any other device emulation -- DMA actions in particular -- that
QEMU already does. Device models are what QEMU *does*. Read the command
block that the guest driver placed in guest memory, parse it, sanity
check it, verify it, execute it, write back the status code, inject an
interrupt (and/or let any polling guest driver notice it "soon after" --
use barriers as necessary).
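
A minimal sketch of that pattern, to show how little is involved. The
command block layout (opcode, argument, status), the opcode and the status
values are all hypothetical and exist only for this example; guest RAM is
modelled as a bytearray and the interrupt as a callback.

```python
import struct

CMD = struct.Struct("<III")   # opcode, argument, status (hypothetical layout)
OP_ACK_ERROR = 1              # hypothetical opcode
STATUS_OK, STATUS_INVALID = 0, 1


def handle_command(guest_ram, gpa, raise_irq):
    """Read, check, execute and complete one command block in guest memory."""
    # Sanity check: the whole block must lie inside guest RAM.
    if gpa < 0 or gpa + CMD.size > len(guest_ram):
        return False
    opcode, arg, _ = CMD.unpack_from(guest_ram, gpa)
    # "Execute": only one opcode is known here, everything else is rejected.
    status = STATUS_OK if opcode == OP_ACK_ERROR else STATUS_INVALID
    # Write back the status code, then notify the guest.
    CMD.pack_into(guest_ram, gpa, opcode, arg, status)
    raise_irq()
    return True


ram = bytearray(64)
irqs = []
CMD.pack_into(ram, 16, OP_ACK_ERROR, 7, 0xFFFFFFFF)
handled = handle_command(ram, 16, lambda: irqs.append("irq"))
```

Nothing in this loop involves the firmware; it is the guest driver and the
device model exchanging data through memory that was laid out at boot.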

Thus, I suggest relying on the generic ACPI linker/loader interface
(between QEMU and guest firmware) *only* to make the firmware lay out
stuff (= reserve buffers, set up pointers, install QEMU's ACPI tables)
*at boot*. Then, at runtime, let the guest kernel and QEMU (the "device
model") talk to each other directly. Keep runtime firmware involvement
to zero.

You *really* don't want to debug three components at runtime, when you
can solve the thing with two. (Two components whose build systems won't
drive you mad, I should add.)

IMO, Achin's design nailed it. We can do that.

Laszlo


* Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS
  2017-03-29 11:58                     ` Laszlo Ersek
       [not found]                       ` <20170329154539-mutt-send-email-mst@kernel.org>
@ 2017-04-06 12:35                       ` gengdongjiu
  2017-04-06 18:55                         ` Laszlo Ersek
  1 sibling, 1 reply; 10+ messages in thread
From: gengdongjiu @ 2017-04-06 12:35 UTC
  To: Laszlo Ersek, Achin Gupta
  Cc: ard.biesheuvel, edk2-devel, qemu-devel, zhaoshenglong,
	James Morse, Christoffer Dall, xiexiuqi, Marc Zyngier,
	catalin.marinas, will.deacon, christoffer.dall, rkrcmar,
	suzuki.poulose, andre.przywara, mark.rutland, vladimir.murzin,
	linux-arm-kernel, kvmarm, kvm, linux-kernel, wangxiongfeng2,
	wuquanming, huangshaoyu, Leif.Lindholm, nd, Michael Tsirkin,
	Igor Mammedov

Dear Laszlo,
   Thanks for your detailed explanation.

On 2017/3/29 19:58, Laszlo Ersek wrote:
> (This ought to be one of the longest address lists I've ever seen :)
> Thanks for the CC. I'm glad Shannon is already on the CC list. For good
> measure, I'm adding MST and Igor.)
> 
> On 03/29/17 12:36, Achin Gupta wrote:
>> Hi gengdongjiu,
>>
>> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>>>
>>> Hi Laszlo/Biesheuvel/Qemu developer,
>>>
>>>    Now I encounter a issue and want to consult with you in ARM64 platform， as described below:
>>>
>>> when guest OS happen synchronous or asynchronous abort, kvm needs
>>> to send the error address to Qemu or UEFI through sigbus to
>>> dynamically generate APEI table. from my investigation, there are
>>> two ways:
>>>
>>> (1) Qemu get the error address, and generate the APEI table, then
>>> notify UEFI to know this generation, then inject abort error to
>>> guest OS, guest OS read the APEI table.
>>> (2) Qemu get the error address, and let UEFI to generate the APEI
>>> table, then inject abort error to guest OS, guest OS read the APEI
>>> table.
>>
>> Just being pedantic! I don't think we are talking about creating the APEI table
>> dynamically here. The issue is: Once KVM has received an error that is destined
>> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the error
>> into the guest OS, a CPER (Common Platform Error Record) has to be generated
>> corresponding to the error source (GHES corresponding to memory subsystem,
>> processor etc) to allow the guest OS to do anything meaningful with the
>> error. So who should create the CPER is the question.
>>
>> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error arrives
>> at EL3 and secure firmware (at EL3 or a lower secure exception level) is
>> responsible for creating the CPER. ARM is experimenting with using a Standalone
>> MM EDK2 image in the secure world to do the CPER creation. This will avoid
>> adding the same code in ARM TF in EL3 (better for security). The error will then
>> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM Trusted
>> Firmware.
>>
>> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
>> interface (as discussed with Christoffer below). So it should generate the CPER
>> before injecting the error.
>>
>> This is corresponds to (1) above apart from notifying UEFI (I am assuming you
>> mean guest UEFI). At this time, the guest OS already knows where to pick up the
>> CPER from through the HEST. Qemu has to create the CPER and populate its address
>> at the address exported in the HEST. Guest UEFI should not be involved in this
>> flow. Its job was to create the HEST at boot and that has been done by this
>> stage.
>>
>> Qemu folk will be able to add but it looks like support for CPER generation will
>> need to be added to Qemu. We need to resolve this.
>>
>> Do shout if I am missing anything above.
> 
> After reading this email, the use case looks *very* similar to what
> we've just done with VMGENID for QEMU 2.9.
> 
> We have a facility between QEMU and the guest firmware, called "ACPI
> linker/loader", with which QEMU instructs the firmware to
> 
> - allocate and download blobs into guest RAM (AcpiNVS type memory) --
> ALLOCATE command,
> 
> - relocate pointers in those blobs, to fields in other (or the same)
> blobs -- ADD_POINTER command,
> 
> - set ACPI table checksums -- ADD_CHECKSUM command,
> 
> - and send GPAs of fields within such blobs back to QEMU --
> WRITE_POINTER command.
> 
> This is how I imagine we can map the facility to the current use case
> (note that this is the first time I read about HEST / GHES / CPER):
> 
>     etc/acpi/tables                 etc/hardware_errors
>     ================     ==========================================
>                          +-----------+
>     +--------------+     | address   |         +-> +--------------+
>     |    HEST      +     | registers |         |   | Error Status |
>     + +------------+     | +---------+         |   | Data Block 1 |
>     | | GHES       | --> | | address | --------+   | +------------+
>     | | GHES       | --> | | address | ------+     | |  CPER      |
>     | | GHES       | --> | | address | ----+ |     | |  CPER      |
>     | | GHES       | --> | | address | -+  | |     | |  CPER      |
>     +-+------------+     +-+---------+  |  | |     +-+------------+
>                                         |  | |
>                                         |  | +---> +--------------+
>                                         |  |       | Error Status |
>                                         |  |       | Data Block 2 |
>                                         |  |       | +------------+
>                                         |  |       | |  CPER      |
>                                         |  |       | |  CPER      |
>                                         |  |       +-+------------+
>                                         |  |
>                                         |  +-----> +--------------+
>                                         |          | Error Status |
>                                         |          | Data Block 3 |
>                                         |          | +------------+
>                                         |          | |  CPER      |
>                                         |          +-+------------+
>                                         |
>                                         +--------> +--------------+
>                                                    | Error Status |
>                                                    | Data Block 4 |
>                                                    | +------------+
>                                                    | |  CPER      |
>                                                    | |  CPER      |
>                                                    | |  CPER      |
>                                                    +-+------------+
> 
> (1) QEMU generates the HEST ACPI table. This table goes in the current
> "etc/acpi/tables" fw_cfg blob. Given N error sources, there will be N
> GHES objects in the HEST.
> 
> (2) We introduce a new fw_cfg blob called "etc/hardware_errors". QEMU
> also populates this blob.
> 
> (2a) Given N error sources, the (unnamed) table of address registers
> will contain N address registers.
> 
> (2b) Given N error sources, the "etc/hardware_errors" fw_cfg blob will
> also contain N Error Status Data Blocks.
> 
> I don't know about the sizing (number of CPERs) each Error Status Data
> Block has to contain, but I understand it is all pre-allocated as far as
> the OS is concerned, which matches our capabilities well.
Here I have a question. As you commented: "the 'etc/hardware_errors' fw_cfg blob will also contain N Error Status Data Blocks".
Because the number of CPERs is not fixed, how do we assign the size of each Error Status Data Block when using a single "etc/hardware_errors" fw_cfg blob?
With a single etc/hardware_errors blob, will the N Error Status Data Blocks share one contiguous buffer, as shown below? If so, it may be inconvenient to extend the size of an individual data block.
I see that bios_linker_loader_alloc allocates one contiguous buffer per blob (such as VMGENID_GUID_FW_CFG_FILE):

    /* Allocate guest memory for the Data fw_cfg blob */
    bios_linker_loader_alloc(linker, VMGENID_GUID_FW_CFG_FILE, guid, 4096,
                             false /* page boundary, high memory */);



                          +-----------+
     +--------------+     | address   |         +-> +--------------+
     |    HEST      +     | registers |             | Error Status |
     + +------------+     | +---------+             | Data Block  |
     | | GHES       | --> | | address | --------+-->| +------------+
     | | GHES       | --> | | address | ------+     | |  CPER      |
     | | GHES       | --> | | address | ----+ |     | |  CPER      |
     | | GHES       | --> | | address | -+  | |     | |  CPER      |
     +-+------------+     +-+---------+  |  | +---> +--------------+
                                         |  |       | |  CPER      |
                                         |  |       | |  CPER      |
                                         |  +-----> +--------------+
                                         |          | |  CPER      |
                                         +--------> +--------------+
                                                    | |  CPER      |
                                                    | |  CPER      |
                                                    | |  CPER      |
                                                    +-+------------+



So how about using a separate etc/hardware_errorsN blob for each Error Status Data Block? Then we would have:

etc/hardware_errors0
etc/hardware_errors1
...................
etc/hardware_errors10
(the max N is 10)


N can be one of the values below, according to the ACPI spec, "Table 18-345 Hardware Error Notification Structure":
0 - Polled
1 - External Interrupt
2 - Local Interrupt
3 - SCI
4 - NMI
5 - CMCI
6 - MCE
7 - GPIO-Signal
8 - ARMv8 SEA
9 - ARMv8 SEI
10 - External Interrupt - GSIV
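
As an illustration only, the notification types listed above and the per-type blob naming proposed here could be written down as follows. This is a minimal sketch: the enum and function names are hypothetical (they are not taken from QEMU or from any ACPI header), and only the numeric values come from the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Notification types as listed above; identifier names are hypothetical. */
enum hest_notify_type {
    HEST_NOTIFY_POLLED       = 0,
    HEST_NOTIFY_EXTERNAL_INT = 1,
    HEST_NOTIFY_LOCAL_INT    = 2,
    HEST_NOTIFY_SCI          = 3,
    HEST_NOTIFY_NMI          = 4,
    HEST_NOTIFY_CMCI         = 5,
    HEST_NOTIFY_MCE          = 6,
    HEST_NOTIFY_GPIO         = 7,
    HEST_NOTIFY_SEA          = 8,
    HEST_NOTIFY_SEI          = 9,
    HEST_NOTIFY_GSIV         = 10,
    HEST_NOTIFY_COUNT        /* number of types, i.e. 11 */
};

/*
 * Format the per-type blob name proposed above into buf, for example
 * "etc/hardware_errors9" for ARMv8 SEI. Returns the length that
 * snprintf() reports (excluding the terminating NUL).
 */
int hw_errors_blob_name(enum hest_notify_type t, char *buf, size_t len)
{
    assert((int)t >= 0 && (int)t < HEST_NOTIFY_COUNT);
    return snprintf(buf, len, "etc/hardware_errors%d", (int)t);
}
```

With this naming, each notification type would get its own fw_cfg file and therefore its own ALLOCATE command.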




> 
> (3) QEMU generates the ACPI linker/loader script for the firmware, as
> always.
> 
> (3a) The HEST table is part of "etc/acpi/tables", which the firmware
> already allocates memory for, and downloads (because QEMU already
> generates an ALLOCATE linker/loader command for it).
> 
> (3b) QEMU will have to create another ALLOCATE command for the
> "etc/hardware_errors" blob. The firmware allocates memory for this blob,
> and downloads it.
> 
> (4) QEMU generates, in the ACPI linker/loader script for the firmware, N
> ADD_POINTER commands, which point the GHES."Error Status
> Address" fields in the HEST table, to the corresponding address
> registers in the downloaded "etc/hardware_errors" blob.
> 
> (5) QEMU generates an ADD_CHECKSUM command for the firmware, so that the
> HEST table is correctly checksummed after executing the N ADD_POINTER
> commands from (4).
> 
> (6) QEMU generates N ADD_POINTER commands for the firmware, pointing the
> address registers (located in guest memory, in the downloaded
> "etc/hardware_errors" blob) to the respective Error Status Data Blocks.
> 
> (7) (This is the trick.) For this step, we need a third, write-only
> fw_cfg blob, called "etc/hardware_errors_addr". Through that blob, the
> firmware can send back the guest-side allocation addresses to QEMU.
> 
> Namely, the "etc/hardware_errors_addr" blob contains N 8-byte entries.
> QEMU generates N WRITE_POINTER commands for the firmware.
> 
> For error source K (0 <= K < N), QEMU instructs the firmware to
> calculate the guest address of Error Status Data Block K, from the
> QEMU-dictated offset within "etc/hardware_errors", and from the
> guest-determined allocation base address for "etc/hardware_errors". The
> firmware then writes the calculated address back to fw_cfg file
> "etc/hardware_errors_addr", at offset K*8, according to the
> WRITE_POINTER command.
> 
> This way QEMU will know the GPA of each Error Status Data Block.
> 
> (In fact this can be simplified to a single WRITE_POINTER command: the
> address of the "address register table" can be sent back to QEMU as
> well, which already contains all Error Status Data Block addresses.)
> 
> (8) When QEMU gets SIGBUS from the kernel -- I hope that's going to come
> through a signalfd -- QEMU can format the CPER right into guest memory,
> and then inject whatever interrupt (or assert whatever GPIO line) is
> necessary for notifying the guest.
> 
> (9) This notification (in virtual hardware) can either be handled by the
> guest kernel stand-alone, or else the guest kernel can invoke an ACPI
> event handler method with it (which would be in the DSDT or one of the
> SSDTs, also generated by QEMU). The ACPI event handler method could
> invoke the specific guest kernel driver for error handling via a
> Notify() operation.
> 
> I'm attracted to the above design because:
> - it would leave the firmware alone after OS boot, and
> - it would leave the firmware blissfully ignorant about HEST, GHES,
> CPER, and the like. (That's why QEMU's ACPI linker/loader was invented
> in the first place.)
> 
> Thanks
> Laszlo
> 
>>>    Which module do you think should generate the APEI table: UEFI or Qemu?
>>>
>>>
>>>
>>>
>>> On 2017/3/28 21:40, James Morse wrote:
>>>> Hi gengdongjiu,
>>>>
>>>> On 28/03/17 13:16, gengdongjiu wrote:
>>>>> On 2017/3/28 19:54, Achin Gupta wrote:
>>>>>> On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
>>>>>>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
>>>>>>>> On the host, part of UEFI is involved to generate the CPER records.
>>>>>>>> In a guest?, I don't know.
>>>>>>>> Qemu could generate the records, or drive some other component to do it.
>>>>>>>
>>>>>>> I think I am beginning to understand this a bit.  Since the guest UEFI
>>>>>>> instance is specifically built for the machine it runs on, QEMU's virt
>>>>>>> machine in this case, they could simply agree (by some contract) to
>>>>>>> place the records at some specific location in memory, and if the guest
>>>>>>> kernel asks its guest UEFI for that location, things should just work by
>>>>>>> having logic in QEMU to process error reports and populate guest memory.
>>>>>>>
>>>>>>> Is this how others see the world too?
>>>>>>
>>>>>> I think so!
>>>>>>
>>>>>> AFAIU, the memory where CPERs will reside should be specified in a GHES entry in
>>>>>> the HEST. Is this not the case with a guest kernel i.e. the guest UEFI creates a
>>>>>> HEST for the guest Kernel?
>>>>>>
>>>>>> If so, then the question is how the guest UEFI finds out where QEMU (acting as
>>>>>> EL3 firmware) will populate the CPERs. This could either be a contract between
>>>>>> the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask QEMU
>>>>>> where the memory is.
>>>>>
>>>>> Won't involving the guest UEFI be complex? I don't see the advantage. It seems x86 Qemu
>>>>> generates the ACPI tables directly, but I am not sure; we are checking the Qemu logic.
>>>>> Letting Qemu generate the CPER record may be clearer.
>>>>
>>>> At boot UEFI in the guest will need to make sure the areas of memory that may be
>>>> used for CPER records are reserved. Whether UEFI or Qemu decides where these are
>>>> needs deciding, (but probably not here)...
>>>>
>>>> At runtime, when an error has occurred, I agree it would be simpler (fewer
>>>> components involved) if Qemu generates the CPER records. But if UEFI made the
>>>> memory choice above they need to interact and it gets complicated again. The
>>>> CPER records are defined in the UEFI spec, so I would expect UEFI to contain
>>>> code to generate/parse them.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> James
>>>>
>>>>
>>>> .
>>>>
>>>
> 
> 
> .
> 




* Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS
  2017-04-06 12:35                       ` gengdongjiu
@ 2017-04-06 18:55                         ` Laszlo Ersek
  2017-04-07  2:52                           ` gengdongjiu
  2017-04-21 13:27                           ` gengdongjiu
  0 siblings, 2 replies; 10+ messages in thread
From: Laszlo Ersek @ 2017-04-06 18:55 UTC (permalink / raw)
  To: gengdongjiu, Achin Gupta
  Cc: ard.biesheuvel, edk2-devel, qemu-devel, zhaoshenglong,
	James Morse, Christoffer Dall, xiexiuqi, Marc Zyngier,
	catalin.marinas, will.deacon, christoffer.dall, rkrcmar,
	suzuki.poulose, andre.przywara, mark.rutland, vladimir.murzin,
	linux-arm-kernel, kvmarm, kvm, linux-kernel, wangxiongfeng2,
	wuquanming, huangshaoyu, Leif.Lindholm, nd, Michael Tsirkin,
	Igor Mammedov

On 04/06/17 14:35, gengdongjiu wrote:
> Dear, Laszlo
>    Thanks for your detailed explanation.
> 
> On 2017/3/29 19:58, Laszlo Ersek wrote:
>> (This ought to be one of the longest address lists I've ever seen :)
>> Thanks for the CC. I'm glad Shannon is already on the CC list. For good
>> measure, I'm adding MST and Igor.)
>>
>> On 03/29/17 12:36, Achin Gupta wrote:
>>> Hi gengdongjiu,
>>>
>>> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>>>>
>>>> Hi Laszlo/Biesheuvel/Qemu developer,
>>>>
>>>>    I have run into an issue on the ARM64 platform and would like to consult you about it, as described below:
>>>>
>>>> When a synchronous or asynchronous abort happens in the guest OS, KVM
>>>> needs to send the error address to Qemu or UEFI through SIGBUS so that
>>>> the APEI table can be generated dynamically. From my investigation,
>>>> there are two ways:
>>>>
>>>> (1) Qemu gets the error address and generates the APEI table, then
>>>> notifies UEFI about it, then injects the abort error into the guest
>>>> OS; the guest OS reads the APEI table.
>>>> (2) Qemu gets the error address and lets UEFI generate the APEI
>>>> table, then injects the abort error into the guest OS; the guest OS
>>>> reads the APEI table.
>>>
>>> Just being pedantic! I don't think we are talking about creating the APEI table
>>> dynamically here. The issue is: Once KVM has received an error that is destined
>>> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the error
>>> into the guest OS, a CPER (Common Platform Error Record) has to be generated
>>> corresponding to the error source (GHES corresponding to memory subsystem,
>>> processor etc) to allow the guest OS to do anything meaningful with the
>>> error. So who should create the CPER is the question.
>>>
>>> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error arrives
>>> at EL3 and secure firmware (at EL3 or a lower secure exception level) is
>>> responsible for creating the CPER. ARM is experimenting with using a Standalone
>>> MM EDK2 image in the secure world to do the CPER creation. This will avoid
>>> adding the same code in ARM TF in EL3 (better for security). The error will then
>>> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM Trusted
>>> Firmware.
>>>
>>> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
>>> interface (as discussed with Christoffer below). So it should generate the CPER
>>> before injecting the error.
>>>
>>> This corresponds to (1) above apart from notifying UEFI (I am assuming you
>>> mean guest UEFI). At this time, the guest OS already knows where to pick up the
>>> CPER from through the HEST. Qemu has to create the CPER and populate its address
>>> at the address exported in the HEST. Guest UEFI should not be involved in this
>>> flow. Its job was to create the HEST at boot and that has been done by this
>>> stage.
>>>
>>> Qemu folk will be able to add but it looks like support for CPER generation will
>>> need to be added to Qemu. We need to resolve this.
>>>
>>> Do shout if I am missing anything above.
>>
>> After reading this email, the use case looks *very* similar to what
>> we've just done with VMGENID for QEMU 2.9.
>>
>> We have a facility between QEMU and the guest firmware, called "ACPI
>> linker/loader", with which QEMU instructs the firmware to
>>
>> - allocate and download blobs into guest RAM (AcpiNVS type memory) --
>> ALLOCATE command,
>>
>> - relocate pointers in those blobs, to fields in other (or the same)
>> blobs -- ADD_POINTER command,
>>
>> - set ACPI table checksums -- ADD_CHECKSUM command,
>>
>> - and send GPAs of fields within such blobs back to QEMU --
>> WRITE_POINTER command.
>>
>> This is how I imagine we can map the facility to the current use case
>> (note that this is the first time I read about HEST / GHES / CPER):
>>
>>     etc/acpi/tables                 etc/hardware_errors
>>     ================     ==========================================
>>                          +-----------+
>>     +--------------+     | address   |         +-> +--------------+
>>     |    HEST      +     | registers |         |   | Error Status |
>>     + +------------+     | +---------+         |   | Data Block 1 |
>>     | | GHES       | --> | | address | --------+   | +------------+
>>     | | GHES       | --> | | address | ------+     | |  CPER      |
>>     | | GHES       | --> | | address | ----+ |     | |  CPER      |
>>     | | GHES       | --> | | address | -+  | |     | |  CPER      |
>>     +-+------------+     +-+---------+  |  | |     +-+------------+
>>                                         |  | |
>>                                         |  | +---> +--------------+
>>                                         |  |       | Error Status |
>>                                         |  |       | Data Block 2 |
>>                                         |  |       | +------------+
>>                                         |  |       | |  CPER      |
>>                                         |  |       | |  CPER      |
>>                                         |  |       +-+------------+
>>                                         |  |
>>                                         |  +-----> +--------------+
>>                                         |          | Error Status |
>>                                         |          | Data Block 3 |
>>                                         |          | +------------+
>>                                         |          | |  CPER      |
>>                                         |          +-+------------+
>>                                         |
>>                                         +--------> +--------------+
>>                                                    | Error Status |
>>                                                    | Data Block 4 |
>>                                                    | +------------+
>>                                                    | |  CPER      |
>>                                                    | |  CPER      |
>>                                                    | |  CPER      |
>>                                                    +-+------------+
>>
>> (1) QEMU generates the HEST ACPI table. This table goes in the current
>> "etc/acpi/tables" fw_cfg blob. Given N error sources, there will be N
>> GHES objects in the HEST.
>>
>> (2) We introduce a new fw_cfg blob called "etc/hardware_errors". QEMU
>> also populates this blob.
>>
>> (2a) Given N error sources, the (unnamed) table of address registers
>> will contain N address registers.
>>
>> (2b) Given N error sources, the "etc/hardware_errors" fw_cfg blob will
>> also contain N Error Status Data Blocks.
>>
>> I don't know about the sizing (number of CPERs) each Error Status Data
>> Block has to contain, but I understand it is all pre-allocated as far as
>> the OS is concerned, which matches our capabilities well.
> Here I have a question. As you commented: "the 'etc/hardware_errors' fw_cfg blob will also contain N Error Status Data Blocks".
> Because the number of CPERs is not fixed, how do we assign the size of each Error Status Data Block when using a single "etc/hardware_errors" fw_cfg blob?
> With a single etc/hardware_errors blob, will the N Error Status Data Blocks share one contiguous buffer, as shown below? If so, it may be inconvenient to extend the size of an individual data block.
> I see that bios_linker_loader_alloc allocates one contiguous buffer per blob (such as VMGENID_GUID_FW_CFG_FILE):
> 
>     /* Allocate guest memory for the Data fw_cfg blob */
>     bios_linker_loader_alloc(linker, VMGENID_GUID_FW_CFG_FILE, guid, 4096,
>                              false /* page boundary, high memory */);
> 
> 
> 
>                           +-----------+
>      +--------------+     | address   |         +-> +--------------+
>      |    HEST      +     | registers |             | Error Status |
>      + +------------+     | +---------+             | Data Block  |
>      | | GHES       | --> | | address | --------+-->| +------------+
>      | | GHES       | --> | | address | ------+     | |  CPER      |
>      | | GHES       | --> | | address | ----+ |     | |  CPER      |
>      | | GHES       | --> | | address | -+  | |     | |  CPER      |
>      +-+------------+     +-+---------+  |  | +---> +--------------+
>                                          |  |       | |  CPER      |
>                                          |  |       | |  CPER      |
>                                          |  +-----> +--------------+
>                                          |          | |  CPER      |
>                                          +--------> +--------------+
>                                                     | |  CPER      |
>                                                     | |  CPER      |
>                                                     | |  CPER      |
>                                                     +-+------------+
> 
> 
> 
> So how about using a separate etc/hardware_errorsN blob for each Error Status Data Block? Then we would have:
> 
> etc/hardware_errors0
> etc/hardware_errors1
> ...................
> etc/hardware_errors10
> (the max N is 10)
> 
> 
> N can be one of the values below, according to the ACPI spec, "Table 18-345 Hardware Error Notification Structure":
> 0 - Polled
> 1 - External Interrupt
> 2 - Local Interrupt
> 3 - SCI
> 4 - NMI
> 5 - CMCI
> 6 - MCE
> 7 - GPIO-Signal
> 8 - ARMv8 SEA
> 9 - ARMv8 SEI
> 10 - External Interrupt - GSIV

I'm unsure if, by "not fixed", you are saying

  the number of CPER entries that fits in Error Status Data Block N is
  not *uniform* across 0 <= N <= 10 [1]

or

  the number of CPER entries that fits in Error Status Data Block N is
  not *known* in advance, for all of 0 <= N <= 10 [2]

Which one is your point?

If [1], that's no problem; you can simply sum the individual error
status data block sizes in advance, and allocate "etc/hardware_errors"
accordingly, using the total size.
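
To make case [1] concrete, here is a minimal sketch of how the offsets inside a single "etc/hardware_errors" blob could be computed when the per-block sizes are known in advance. The function name, the 8-byte register size constant, and the "address registers first, data blocks after" ordering are assumptions for illustration; this is not existing QEMU code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: one 8-byte address register per error source. */
#define ADDR_REG_SIZE 8u

/*
 * Lay out one blob as: n address registers, followed by n Error Status
 * Data Blocks of (possibly different) sizes block_size[k].
 * Stores the offset of block k in block_off[k] and returns the total
 * blob size, which is what a single ALLOCATE command would cover.
 */
size_t hw_errors_layout(const size_t *block_size, size_t n, size_t *block_off)
{
    assert(block_size != NULL && block_off != NULL);

    size_t off = n * ADDR_REG_SIZE;   /* data blocks start after the registers */
    for (size_t k = 0; k < n; k++) {
        block_off[k] = off;
        off += block_size[k];
    }
    return off;
}
```

The offsets computed this way are the values the ADD_POINTER commands would need, and the return value is the size to pass when allocating the blob.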

(Allocating one shared fw_cfg blob for all status data blocks is more
memory efficient, as each ALLOCATE command will allocate whole pages
(rounded up from the actual blob size).)
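
The page-rounding argument can be checked with a little arithmetic. The sketch below assumes a 4 KiB allocation granule and counts only the data blocks (no address registers); the names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: each ALLOCATE command is rounded up to whole 4 KiB pages. */
#define ALLOC_GRANULE 4096u

/* Number of pages needed to hold 'bytes' bytes. */
size_t pages_for(size_t bytes)
{
    return (bytes + ALLOC_GRANULE - 1) / ALLOC_GRANULE;
}

/* Pages used when all blocks share one blob (one ALLOCATE command). */
size_t pages_shared(const size_t *block_size, size_t n)
{
    assert(block_size != NULL);
    size_t total = 0;
    for (size_t k = 0; k < n; k++) {
        total += block_size[k];
    }
    return pages_for(total);
}

/* Pages used when each block is its own blob (n ALLOCATE commands). */
size_t pages_separate(const size_t *block_size, size_t n)
{
    assert(block_size != NULL);
    size_t pages = 0;
    for (size_t k = 0; k < n; k++) {
        pages += pages_for(block_size[k]);
    }
    return pages;
}
```

For example, four 1 KiB blocks fit in one page when they share a blob, but take four pages when each is allocated separately.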

If your point is [2], then splitting the error status data blocks to
separate fw_cfg blobs makes no difference: regardless of whether we try
to place all the error status data blocks in a single fw_cfg blob, or in
separate fw_cfg blobs, the individual data block cannot be resized at OS
runtime, so there's no way to make it work.

Thanks,
Laszlo

> 
> 
> 
> 
>>
>> (3) QEMU generates the ACPI linker/loader script for the firmware, as
>> always.
>>
>> (3a) The HEST table is part of "etc/acpi/tables", which the firmware
>> already allocates memory for, and downloads (because QEMU already
>> generates an ALLOCATE linker/loader command for it).
>>
>> (3b) QEMU will have to create another ALLOCATE command for the
>> "etc/hardware_errors" blob. The firmware allocates memory for this blob,
>> and downloads it.
>>
>> (4) QEMU generates, in the ACPI linker/loader script for the firmware, N
>> ADD_POINTER commands, which point the GHES."Error Status
>> Address" fields in the HEST table, to the corresponding address
>> registers in the downloaded "etc/hardware_errors" blob.
>>
>> (5) QEMU generates an ADD_CHECKSUM command for the firmware, so that the
>> HEST table is correctly checksummed after executing the N ADD_POINTER
>> commands from (4).
>>
>> (6) QEMU generates N ADD_POINTER commands for the firmware, pointing the
>> address registers (located in guest memory, in the downloaded
>> "etc/hardware_errors" blob) to the respective Error Status Data Blocks.
>>
>> (7) (This is the trick.) For this step, we need a third, write-only
>> fw_cfg blob, called "etc/hardware_errors_addr". Through that blob, the
>> firmware can send back the guest-side allocation addresses to QEMU.
>>
>> Namely, the "etc/hardware_errors_addr" blob contains N 8-byte entries.
>> QEMU generates N WRITE_POINTER commands for the firmware.
>>
>> For error source K (0 <= K < N), QEMU instructs the firmware to
>> calculate the guest address of Error Status Data Block K, from the
>> QEMU-dictated offset within "etc/hardware_errors", and from the
>> guest-determined allocation base address for "etc/hardware_errors". The
>> firmware then writes the calculated address back to fw_cfg file
>> "etc/hardware_errors_addr", at offset K*8, according to the
>> WRITE_POINTER command.
>>
>> This way QEMU will know the GPA of each Error Status Data Block.
>>
>> (In fact this can be simplified to a single WRITE_POINTER command: the
>> address of the "address register table" can be sent back to QEMU as
>> well, which already contains all Error Status Data Block addresses.)
>>
>> (8) When QEMU gets SIGBUS from the kernel -- I hope that's going to come
>> through a signalfd -- QEMU can format the CPER right into guest memory,
>> and then inject whatever interrupt (or assert whatever GPIO line) is
>> necessary for notifying the guest.
>>
>> (9) This notification (in virtual hardware) can either be handled by the
>> guest kernel stand-alone, or else the guest kernel can invoke an ACPI
>> event handler method with it (which would be in the DSDT or one of the
>> SSDTs, also generated by QEMU). The ACPI event handler method could
>> invoke the specific guest kernel driver for error handling via a
>> Notify() operation.
>>
>> I'm attracted to the above design because:
>> - it would leave the firmware alone after OS boot, and
>> - it would leave the firmware blissfully ignorant about HEST, GHES,
>> CPER, and the like. (That's why QEMU's ACPI linker/loader was invented
>> in the first place.)
>>
>> Thanks
>> Laszlo
>>
>>>>    Which module do you think should generate the APEI table: UEFI or Qemu?
>>>>
>>>>
>>>>
>>>>
>>>> On 2017/3/28 21:40, James Morse wrote:
>>>>> Hi gengdongjiu,
>>>>>
>>>>> On 28/03/17 13:16, gengdongjiu wrote:
>>>>>> On 2017/3/28 19:54, Achin Gupta wrote:
>>>>>>> On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
>>>>>>>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
>>>>>>>>> On the host, part of UEFI is involved to generate the CPER records.
>>>>>>>>> In a guest?, I don't know.
>>>>>>>>> Qemu could generate the records, or drive some other component to do it.
>>>>>>>>
>>>>>>>> I think I am beginning to understand this a bit.  Since the guest UEFI
>>>>>>>> instance is specifically built for the machine it runs on, QEMU's virt
>>>>>>>> machine in this case, they could simply agree (by some contract) to
>>>>>>>> place the records at some specific location in memory, and if the guest
>>>>>>>> kernel asks its guest UEFI for that location, things should just work by
>>>>>>>> having logic in QEMU to process error reports and populate guest memory.
>>>>>>>>
>>>>>>>> Is this how others see the world too?
>>>>>>>
>>>>>>> I think so!
>>>>>>>
>>>>>>> AFAIU, the memory where CPERs will reside should be specified in a GHES entry in
>>>>>>> the HEST. Is this not the case with a guest kernel i.e. the guest UEFI creates a
>>>>>>> HEST for the guest Kernel?
>>>>>>>
>>>>>>> If so, then the question is how the guest UEFI finds out where QEMU (acting as
>>>>>>> EL3 firmware) will populate the CPERs. This could either be a contract between
>>>>>>> the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask QEMU
>>>>>>> where the memory is.
>>>>>>
>>>>>> Won't involving the guest UEFI be complex? I don't see the advantage. It seems x86 Qemu
>>>>>> generates the ACPI tables directly, but I am not sure; we are checking the Qemu logic.
>>>>>> Letting Qemu generate the CPER record may be clearer.
>>>>>
>>>>> At boot UEFI in the guest will need to make sure the areas of memory that may be
>>>>> used for CPER records are reserved. Whether UEFI or Qemu decides where these are
>>>>> needs deciding, (but probably not here)...
>>>>>
>>>>> At runtime, when an error has occurred, I agree it would be simpler (fewer
>>>>> components involved) if Qemu generates the CPER records. But if UEFI made the
>>>>> memory choice above they need to interact and it gets complicated again. The
>>>>> CPER records are defined in the UEFI spec, so I would expect UEFI to contain
>>>>> code to generate/parse them.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> James
>>>>>
>>>>>
>>>>> .
>>>>>
>>>>
>>
>>
>> .
>>
> 




* Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS
  2017-04-06 18:55                         ` Laszlo Ersek
@ 2017-04-07  2:52                           ` gengdongjiu
  2017-04-07  9:21                             ` Laszlo Ersek
  2017-04-21 13:27                           ` gengdongjiu
  1 sibling, 1 reply; 10+ messages in thread
From: gengdongjiu @ 2017-04-07  2:52 UTC (permalink / raw)
  To: Laszlo Ersek, Achin Gupta
  Cc: ard.biesheuvel, edk2-devel, qemu-devel, zhaoshenglong,
	James Morse, Christoffer Dall, xiexiuqi, Marc Zyngier,
	catalin.marinas, will.deacon, christoffer.dall, rkrcmar,
	suzuki.poulose, andre.przywara, mark.rutland, vladimir.murzin,
	linux-arm-kernel, kvmarm, kvm, linux-kernel, wangxiongfeng2,
	wuquanming, huangshaoyu, Leif.Lindholm, nd, Michael Tsirkin,
	Igor Mammedov

Hi Laszlo,
  thanks.

On 2017/4/7 2:55, Laszlo Ersek wrote:
> On 04/06/17 14:35, gengdongjiu wrote:
>> Dear, Laszlo
>>    Thanks for your detailed explanation.
>>
>> On 2017/3/29 19:58, Laszlo Ersek wrote:
>>> (This ought to be one of the longest address lists I've ever seen :)
>>> Thanks for the CC. I'm glad Shannon is already on the CC list. For good
>>> measure, I'm adding MST and Igor.)
>>>
>>> On 03/29/17 12:36, Achin Gupta wrote:
>>>> Hi gengdongjiu,
>>>>
>>>> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>>>>>
>>>>> Hi Laszlo/Biesheuvel/Qemu developer,
>>>>>
>>>>>    I have run into an issue on the ARM64 platform and would like to consult you about it, as described below:
>>>>>
>>>>> When a synchronous or asynchronous abort happens in the guest OS, KVM
>>>>> needs to send the error address to Qemu or UEFI through SIGBUS so that
>>>>> the APEI table can be generated dynamically. From my investigation,
>>>>> there are two ways:
>>>>>
>>>>> (1) Qemu gets the error address and generates the APEI table, then
>>>>> notifies UEFI about it, then injects the abort error into the guest
>>>>> OS; the guest OS reads the APEI table.
>>>>> (2) Qemu gets the error address and lets UEFI generate the APEI
>>>>> table, then injects the abort error into the guest OS; the guest OS
>>>>> reads the APEI table.
>>>>
>>>> Just being pedantic! I don't think we are talking about creating the APEI table
>>>> dynamically here. The issue is: Once KVM has received an error that is destined
>>>> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the error
>>>> into the guest OS, a CPER (Common Platform Error Record) has to be generated
>>>> corresponding to the error source (GHES corresponding to memory subsystem,
>>>> processor etc) to allow the guest OS to do anything meaningful with the
>>>> error. So who should create the CPER is the question.
>>>>
>>>> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error arrives
>>>> at EL3 and secure firmware (at EL3 or a lower secure exception level) is
>>>> responsible for creating the CPER. ARM is experimenting with using a Standalone
>>>> MM EDK2 image in the secure world to do the CPER creation. This will avoid
>>>> adding the same code in ARM TF in EL3 (better for security). The error will then
>>>> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM Trusted
>>>> Firmware.
>>>>
>>>> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
>>>> interface (as discussed with Christoffer below). So it should generate the CPER
>>>> before injecting the error.
>>>>
>>>> This corresponds to (1) above apart from notifying UEFI (I am assuming you
>>>> mean guest UEFI). At this time, the guest OS already knows where to pick up the
>>>> CPER from through the HEST. Qemu has to create the CPER and populate its address
>>>> at the address exported in the HEST. Guest UEFI should not be involved in this
>>>> flow. Its job was to create the HEST at boot and that has been done by this
>>>> stage.
>>>>
>>>> Qemu folk will be able to add but it looks like support for CPER generation will
>>>> need to be added to Qemu. We need to resolve this.
>>>>
>>>> Do shout if I am missing anything above.
>>>
>>> After reading this email, the use case looks *very* similar to what
>>> we've just done with VMGENID for QEMU 2.9.
>>>
>>> We have a facility between QEMU and the guest firmware, called "ACPI
>>> linker/loader", with which QEMU instructs the firmware to
>>>
>>> - allocate and download blobs into guest RAM (AcpiNVS type memory) --
>>> ALLOCATE command,
>>>
>>> - relocate pointers in those blobs, to fields in other (or the same)
>>> blobs -- ADD_POINTER command,
>>>
>>> - set ACPI table checksums -- ADD_CHECKSUM command,
>>>
>>> - and send GPAs of fields within such blobs back to QEMU --
>>> WRITE_POINTER command.
>>>
>>> This is how I imagine we can map the facility to the current use case
>>> (note that this is the first time I read about HEST / GHES / CPER):
>>>
>>>     etc/acpi/tables                 etc/hardware_errors
>>>     ================     ==========================================
>>>                          +-----------+
>>>     +--------------+     | address   |         +-> +--------------+
>>>     |    HEST      +     | registers |         |   | Error Status |
>>>     + +------------+     | +---------+         |   | Data Block 1 |
>>>     | | GHES       | --> | | address | --------+   | +------------+
>>>     | | GHES       | --> | | address | ------+     | |  CPER      |
>>>     | | GHES       | --> | | address | ----+ |     | |  CPER      |
>>>     | | GHES       | --> | | address | -+  | |     | |  CPER      |
>>>     +-+------------+     +-+---------+  |  | |     +-+------------+
>>>                                         |  | |
>>>                                         |  | +---> +--------------+
>>>                                         |  |       | Error Status |
>>>                                         |  |       | Data Block 2 |
>>>                                         |  |       | +------------+
>>>                                         |  |       | |  CPER      |
>>>                                         |  |       | |  CPER      |
>>>                                         |  |       +-+------------+
>>>                                         |  |
>>>                                         |  +-----> +--------------+
>>>                                         |          | Error Status |
>>>                                         |          | Data Block 3 |
>>>                                         |          | +------------+
>>>                                         |          | |  CPER      |
>>>                                         |          +-+------------+
>>>                                         |
>>>                                         +--------> +--------------+
>>>                                                    | Error Status |
>>>                                                    | Data Block 4 |
>>>                                                    | +------------+
>>>                                                    | |  CPER      |
>>>                                                    | |  CPER      |
>>>                                                    | |  CPER      |
>>>                                                    +-+------------+
>>>
>>> (1) QEMU generates the HEST ACPI table. This table goes in the current
>>> "etc/acpi/tables" fw_cfg blob. Given N error sources, there will be N
>>> GHES objects in the HEST.
>>>
>>> (2) We introduce a new fw_cfg blob called "etc/hardware_errors". QEMU
>>> also populates this blob.
>>>
>>> (2a) Given N error sources, the (unnamed) table of address registers
>>> will contain N address registers.
>>>
>>> (2b) Given N error sources, the "etc/hardware_errors" fw_cfg blob will
>>> also contain N Error Status Data Blocks.
>>>
>>> I don't know about the sizing (number of CPERs) each Error Status Data
>>> Block has to contain, but I understand it is all pre-allocated as far as
>>> the OS is concerned, which matches our capabilities well.
>> Here I have a question. As you comment: "the 'etc/hardware_errors' fw_cfg blob will also contain N Error Status Data Blocks".
>> Because the number of CPERs is not fixed, how do we assign the size of each "Error Status Data Block" when using one "etc/hardware_errors" fw_cfg blob?
>> When using one etc/hardware_errors blob, will the N Error Status Data Blocks share one contiguous buffer, as shown below? If so, it may not be convenient to extend the size of each data block.
>> I see that bios_linker_loader_alloc will allocate one contiguous buffer for a blob (such as VMGENID_GUID_FW_CFG_FILE):
>>
>>     /* Allocate guest memory for the Data fw_cfg blob */
>>     bios_linker_loader_alloc(linker, VMGENID_GUID_FW_CFG_FILE, guid, 4096,
>>                              false /* page boundary, high memory */);
>>
>>
>>
>>                           +-----------+
>>      +--------------+     | address   |             +--------------+
>>      |    HEST      +     | registers |             | Error Status |
>>      + +------------+     | +---------+             | Data Block  |
>>      | | GHES       | --> | | address | --------+-->| +------------+
>>      | | GHES       | --> | | address | ------+     | |  CPER      |
>>      | | GHES       | --> | | address | ----+ |     | |  CPER      |
>>      | | GHES       | --> | | address | -+  | |     | |  CPER      |
>>      +-+------------+     +-+---------+  |  | +---> +--------------+
>>                                          |  |       | |  CPER      |
>>                                          |  |       | |  CPER      |
>>                                          |  +-----> +--------------+
>>                                          |          | |  CPER      |
>>                                          +--------> +--------------+
>>                                                     | |  CPER      |
>>                                                     | |  CPER      |
>>                                                     | |  CPER      |
>>                                                     +-+------------+
>>
>>
>>
>> So how about using a separate etc/hardware_errorsN blob for each Error Status Data Block? Then we would have:
>>
>> etc/hardware_errors0
>> etc/hardware_errors1
>> ...................
>> etc/hardware_errors10
>> (the max N is 10)
>>
>>
>> N can be one of the values below, according to the ACPI spec, "Table 18-345 Hardware Error Notification Structure":
>> 0 - Polled
>> 1 - External Interrupt
>> 2 - Local Interrupt
>> 3 - SCI
>> 4 - NMI
>> 5 - CMCI
>> 6 - MCE
>> 7 - GPIO-Signal
>> 8 - ARMv8 SEA
>> 9 - ARMv8 SEI
>> 10 - External Interrupt - GSIV
> 
> I'm unsure if, by "not fixed", you are saying
> 
>   the number of CPER entries that fits in Error Status Data Block N is
>   not *uniform* across 0 <= N <= 10 [1]
> 
> or
> 
>   the number of CPER entries that fits in Error Status Data Block N is
>   not *known* in advance, for all of 0 <= N <= 10 [2]
> 
> Which one is your point?
> 
> If [1], that's no problem; you can simply sum the individual error
> status data block sizes in advance, and allocate "etc/hardware_errors"
> accordingly, using the total size.
> 
> (Allocating one shared fw_cfg blob for all status data blocks is more
> memory efficient, as each ALLOCATE command will allocate whole pages
> (rounded up from the actual blob size).)
> 
> If your point is [2], then splitting the error status data blocks to
> separate fw_cfg blobs makes no difference: regardless of whether we try
> to place all the error status data blocks in a single fw_cfg blob, or in
> separate fw_cfg blobs, the individual data block cannot be resized at OS
> runtime, so there's no way to make it work.
>
My point is [2]. The HEST (Hardware Error Source Table) format is described here:
https://wiki.linaro.org/LEG/Engineering/Kernel/RAS/APEITables#Hardware_Error_Source_Table_.28HEST.29

Now I understand your reasoning.

> Thanks,
> Laszlo
> 
>>
>>
>>
>>
>>>
>>> (3) QEMU generates the ACPI linker/loader script for the firmware, as
>>> always.
>>>
>>> (3a) The HEST table is part of "etc/acpi/tables", which the firmware
>>> already allocates memory for, and downloads (because QEMU already
>>> generates an ALLOCATE linker/loader command for it).
>>>
>>> (3b) QEMU will have to create another ALLOCATE command for the
>>> "etc/hardware_errors" blob. The firmware allocates memory for this blob,
>>> and downloads it.
>>>
>>> (4) QEMU generates, in the ACPI linker/loader script for the firmware, N
>>> ADD_POINTER commands, which point the GHES."Error Status
>>> Address" fields in the HEST table, to the corresponding address
>>> registers in the downloaded "etc/hardware_errors" blob.
>>>
>>> (5) QEMU generates an ADD_CHECKSUM command for the firmware, so that the
>>> HEST table is correctly checksummed after executing the N ADD_POINTER
>>> commands from (4).
>>>
>>> (6) QEMU generates N ADD_POINTER commands for the firmware, pointing the
>>> address registers (located in guest memory, in the downloaded
>>> "etc/hardware_errors" blob) to the respective Error Status Data Blocks.
>>>
>>> (7) (This is the trick.) For this step, we need a third, write-only
>>> fw_cfg blob, called "etc/hardware_errors_addr". Through that blob, the
>>> firmware can send back the guest-side allocation addresses to QEMU.
>>>
>>> Namely, the "etc/hardware_errors_addr" blob contains N 8-byte entries.
>>> QEMU generates N WRITE_POINTER commands for the firmware.
>>>
>>> For error source K (0 <= K < N), QEMU instructs the firmware to
>>> calculate the guest address of Error Status Data Block K, from the
>>> QEMU-dictated offset within "etc/hardware_errors", and from the
>>> guest-determined allocation base address for "etc/hardware_errors". The
>>> firmware then writes the calculated address back to fw_cfg file
>>> "etc/hardware_errors_addr", at offset K*8, according to the
>>> WRITE_POINTER command.
>>>
>>> This way QEMU will know the GPA of each Error Status Data Block.
>>>
>>> (In fact this can be simplified to a single WRITE_POINTER command: the
>>> address of the "address register table" can be sent back to QEMU as
>>> well, which already contains all Error Status Data Block addresses.)
>>>
>>> (8) When QEMU gets SIGBUS from the kernel -- I hope that's going to come
>>> through a signalfd -- QEMU can format the CPER right into guest memory,
>>> and then inject whatever interrupt (or assert whatever GPIO line) is
>>> necessary for notifying the guest.
>>>
>>> (9) This notification (in virtual hardware) can either be handled by the
>>> guest kernel stand-alone, or else the guest kernel can invoke an ACPI
>>> event handler method with it (which would be in the DSDT or one of the
>>> SSDTs, also generated by QEMU). The ACPI event handler method could
>>> invoke the specific guest kernel driver for error handling via a
>>> Notify() operation.
>>>
>>> I'm attracted to the above design because:
>>> - it would leave the firmware alone after OS boot, and
>>> - it would leave the firmware blissfully ignorant about HEST, GHES,
>>> CPER, and the like. (That's why QEMU's ACPI linker/loader was invented
>>> in the first place.)
>>>
>>> Thanks
>>> Laszlo
>>>
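The pointer relocation in steps (4) and (6) above can be sketched as follows. This is only a simplified model of what an ADD_POINTER command asks the firmware to do, assuming the pointer field initially holds an offset into the pointee blob and is stored little-endian; the function names are made up for illustration and this is not the OVMF or QEMU code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian integer of `size` bytes (1..8). */
uint64_t load_le(const uint8_t *p, unsigned size)
{
    assert(size >= 1 && size <= 8);
    uint64_t v = 0;
    for (unsigned i = 0; i < size; i++) {
        v |= (uint64_t)p[i] << (8 * i);
    }
    return v;
}

/* Write the low `size` bytes of `v` in little-endian order. */
void store_le(uint8_t *p, unsigned size, uint64_t v)
{
    assert(size >= 1 && size <= 8);
    for (unsigned i = 0; i < size; i++) {
        p[i] = (uint8_t)(v >> (8 * i));
    }
}

/*
 * Hypothetical model of ADD_POINTER: the field at `field_offset` in `blob`
 * holds an offset into the pointee blob; after the pointee blob has been
 * allocated at guest address `pointee_base`, the field is incremented by
 * that base so that it becomes an absolute guest address.
 * Returns 0 on success, -1 if the command is malformed.
 */
int apply_add_pointer(uint8_t *blob, size_t blob_len,
                      size_t field_offset, unsigned field_size,
                      uint64_t pointee_base)
{
    if (field_size != 1 && field_size != 2 &&
        field_size != 4 && field_size != 8) {
        return -1;
    }
    if (field_offset > blob_len || blob_len - field_offset < field_size) {
        return -1;
    }
    uint64_t v = load_le(blob + field_offset, field_size);
    store_le(blob + field_offset, field_size, v + pointee_base);
    return 0;
}
```

In this model, step (4) would apply the patch to each GHES "Error Status Address" field inside the HEST, and step (6) would apply it to each address register inside "etc/hardware_errors", where pointer and pointee live in the same blob.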
>>>>>    Which module do you think is better suited to generating the APEI table: UEFI or Qemu?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 2017/3/28 21:40, James Morse wrote:
>>>>>> Hi gengdongjiu,
>>>>>>
>>>>>> On 28/03/17 13:16, gengdongjiu wrote:
>>>>>>> On 2017/3/28 19:54, Achin Gupta wrote:
>>>>>>>> On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
>>>>>>>>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
>>>>>>>>>> On the host, part of UEFI is involved to generate the CPER records.
>>>>>>>>>> In a guest?, I don't know.
>>>>>>>>>> Qemu could generate the records, or drive some other component to do it.
>>>>>>>>>
>>>>>>>>> I think I am beginning to understand this a bit.  Since the guest UEFI
>>>>>>>>> instance is specifically built for the machine it runs on, QEMU's virt
>>>>>>>>> machine in this case, they could simply agree (by some contract) to
>>>>>>>>> place the records at some specific location in memory, and if the guest
>>>>>>>>> kernel asks its guest UEFI for that location, things should just work by
>>>>>>>>> having logic in QEMU to process error reports and populate guest memory.
>>>>>>>>>
>>>>>>>>> Is this how others see the world too?
>>>>>>>>
>>>>>>>> I think so!
>>>>>>>>
>>>>>>>> AFAIU, the memory where CPERs will reside should be specified in a GHES entry in
>>>>>>>> the HEST. Is this not the case with a guest kernel i.e. the guest UEFI creates a
>>>>>>>> HEST for the guest Kernel?
>>>>>>>>
>>>>>>>> If so, then the question is how the guest UEFI finds out where QEMU (acting as
>>>>>>>> EL3 firmware) will populate the CPERs. This could either be a contract between
>>>>>>>> the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask QEMU
>>>>>>>> where the memory is.
>>>>>>>
>>>>>>> Wouldn't invoking the guest UEFI be complex? I do not see the advantage. It seems that x86 Qemu
>>>>>>> generates the ACPI table directly, but I am not sure; we are checking the Qemu logic.
>>>>>>> Letting Qemu generate the CPER record may be clearer.
>>>>>>
>>>>>> At boot UEFI in the guest will need to make sure the areas of memory that may be
>>>>>> used for CPER records are reserved. Whether UEFI or Qemu decides where these are
>>>>>> needs deciding, (but probably not here)...
>>>>>>
>>>>>> At runtime, when an error has occurred, I agree it would be simpler (fewer
>>>>>> components involved) if Qemu generates the CPER records. But if UEFI made the
>>>>>> memory choice above they need to interact and it gets complicated again. The
>>>>>> CPER records are defined in the UEFI spec, so I would expect UEFI to contain
>>>>>> code to generate/parse them.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> James
>>>>>>
>>>>>>
>>>>>> .
>>>>>>
>>>>>
>>>
>>>
>>> .
>>>
>>
> 
> 
> .
> 




* Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS
  2017-04-07  2:52                           ` gengdongjiu
@ 2017-04-07  9:21                             ` Laszlo Ersek
  0 siblings, 0 replies; 10+ messages in thread
From: Laszlo Ersek @ 2017-04-07  9:21 UTC (permalink / raw)
  To: gengdongjiu, Achin Gupta
  Cc: ard.biesheuvel, edk2-devel, qemu-devel, zhaoshenglong,
	James Morse, Christoffer Dall, xiexiuqi, Marc Zyngier,
	catalin.marinas, will.deacon, christoffer.dall, rkrcmar,
	suzuki.poulose, andre.przywara, mark.rutland, vladimir.murzin,
	linux-arm-kernel, kvmarm, kvm, linux-kernel, wangxiongfeng2,
	wuquanming, huangshaoyu, Leif.Lindholm, nd, Michael Tsirkin,
	Igor Mammedov

On 04/07/17 04:52, gengdongjiu wrote:
> 
> On 2017/4/7 2:55, Laszlo Ersek wrote:

>> I'm unsure if, by "not fixed", you are saying
>>
>>   the number of CPER entries that fits in Error Status Data Block N is
>>   not *uniform* across 0 <= N <= 10 [1]
>>
>> or
>>
>>   the number of CPER entries that fits in Error Status Data Block N is
>>   not *known* in advance, for all of 0 <= N <= 10 [2]
>>
>> Which one is your point?
>>
>> If [1], that's no problem; you can simply sum the individual error
>> status data block sizes in advance, and allocate "etc/hardware_errors"
>> accordingly, using the total size.
>>
>> (Allocating one shared fw_cfg blob for all status data blocks is more
>> memory efficient, as each ALLOCATE command will allocate whole pages
>> (rounded up from the actual blob size).)
>>
>> If your point is [2], then splitting the error status data blocks to
>> separate fw_cfg blobs makes no difference: regardless of whether we try
>> to place all the error status data blocks in a single fw_cfg blob, or in
>> separate fw_cfg blobs, the individual data block cannot be resized at OS
>> runtime, so there's no way to make it work.
>>
> My point is [2]. The HEST (Hardware Error Source Table) format is described here:
> https://wiki.linaro.org/LEG/Engineering/Kernel/RAS/APEITables#Hardware_Error_Source_Table_.28HEST.29
> 
> Now I understand your reasoning.

But if you mean [2], then I am confused, with regard to firmware on
physical hardware. Namely, even on physical machines, the firmware has
to estimate, in advance, the area size that will be needed for CPERs,
doesn't it? And once the firmware allocates that memory area, it cannot
be resized at OS runtime. If there are more CPERs at runtime (due to
hardware errors) than the firmware allocated room for, they must surely
wrap around in the preallocated buffer (like in a ring buffer). Isn't
that correct?

On the diagrams that you linked above (great looking diagrams BTW!), I
see CPER in two places (it is helpfully shaded red):

- to the right of BERT; the CPER is part of a box that is captioned
"firmware reserved memory"

- to the right of HEST; again the CPER is part of a box that is
captioned "firmware reserved memory"

So, IMO, when QEMU has to guesstimate the room for CPERs in advance,
that doesn't differ from the physical firmware case. In QEMU maybe you
can let the user specify the area size on the command line, with a
machine type property or similar.
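
To make the sizing concrete, here is a minimal sketch of how a single "etc/hardware_errors" blob could be laid out once the per-source block sizes have been chosen up front. It assumes the blob starts with N 8-byte address registers followed by the N Error Status Data Blocks, and a 4 KiB page size; the names are made up for illustration and none of this is existing QEMU code.

```c
#include <assert.h>
#include <stddef.h>

#define HW_ERR_PAGE_SIZE     4096u  /* assumed guest page size */
#define HW_ERR_ADDR_REG_SIZE 8u     /* one 64-bit address register per source */

/* Offset of address register k from the start of the blob. */
size_t addr_reg_offset(size_t k)
{
    return k * HW_ERR_ADDR_REG_SIZE;
}

/*
 * Offset of Error Status Data Block k: it follows all n address registers
 * and the data blocks 0..k-1. Passing k == n yields the total blob size.
 */
size_t data_block_offset(const size_t *block_size, size_t n, size_t k)
{
    assert(k <= n);
    size_t off = n * HW_ERR_ADDR_REG_SIZE;
    for (size_t i = 0; i < k; i++) {
        off += block_size[i];
    }
    return off;
}

/* Total size of the blob: registers plus all data blocks. */
size_t blob_size(const size_t *block_size, size_t n)
{
    return data_block_offset(block_size, n, n);
}

/* Whole pages consumed by one ALLOCATE command for `bytes` bytes. */
size_t pages_needed(size_t bytes)
{
    return (bytes + HW_ERR_PAGE_SIZE - 1) / HW_ERR_PAGE_SIZE;
}
```

With four blocks of 4096, 1024, 512 and 2048 bytes, the shared blob needs 7712 bytes, that is two pages, whereas four separate blobs would take at least one page each. The block sizes themselves remain a guess made before the guest boots, exactly as on physical firmware.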

Thanks
Laszlo



* Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS
  2017-04-06 18:55                         ` Laszlo Ersek
  2017-04-07  2:52                           ` gengdongjiu
@ 2017-04-21 13:27                           ` gengdongjiu
  2017-04-24 11:27                             ` Laszlo Ersek
  1 sibling, 1 reply; 10+ messages in thread
From: gengdongjiu @ 2017-04-21 13:27 UTC (permalink / raw)
  To: Laszlo Ersek, Achin Gupta
  Cc: ard.biesheuvel, edk2-devel, qemu-devel, zhaoshenglong,
	James Morse, Christoffer Dall, xiexiuqi, Marc Zyngier,
	catalin.marinas, will.deacon, christoffer.dall, rkrcmar,
	suzuki.poulose, andre.przywara, mark.rutland, vladimir.murzin,
	linux-arm-kernel, kvmarm, kvm, linux-kernel, wangxiongfeng2,
	wuquanming, huangshaoyu, Leif.Lindholm, nd, Michael Tsirkin,
	Igor Mammedov

Hi all/Laszlo,

  Sorry, I have a question for you.


On 2017/4/7 2:55, Laszlo Ersek wrote:
> On 04/06/17 14:35, gengdongjiu wrote:
>> Dear Laszlo,
>>    Thanks for your detailed explanation.
>>
>> On 2017/3/29 19:58, Laszlo Ersek wrote:
>>> (This ought to be one of the longest address lists I've ever seen :)
>>> Thanks for the CC. I'm glad Shannon is already on the CC list. For good
>>> measure, I'm adding MST and Igor.)
>>>
>>> On 03/29/17 12:36, Achin Gupta wrote:
>>>> Hi gengdongjiu,
>>>>
>>>> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>>>>>
>>>>> Hi Laszlo/Biesheuvel/Qemu developer,
>>>>>
>>>>>    I have encountered an issue on the ARM64 platform and would like to consult you about it, as described below:
>>>>>
>>>>> When a synchronous or asynchronous abort happens in the guest OS, KVM
>>>>> needs to send the error address to Qemu or UEFI through SIGBUS so
>>>>> that the APEI table can be generated dynamically. From my
>>>>> investigation, there are two ways:
>>>>>
>>>>> (1) Qemu gets the error address and generates the APEI table, then
>>>>> notifies UEFI of this, then injects the abort error into the guest
>>>>> OS; the guest OS reads the APEI table.
>>>>> (2) Qemu gets the error address and lets UEFI generate the APEI
>>>>> table, then injects the abort error into the guest OS; the guest OS
>>>>> reads the APEI table.
>>>>
>>>> Just being pedantic! I don't think we are talking about creating the APEI table
>>>> dynamically here. The issue is: Once KVM has received an error that is destined
>>>> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the error
>>>> into the guest OS, a CPER (Common Platform Error Record) has to be generated
>>>> corresponding to the error source (GHES corresponding to memory subsystem,
>>>> processor etc) to allow the guest OS to do anything meaningful with the
>>>> error. So who should create the CPER is the question.
>>>>
>>>> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error arrives
>>>> at EL3 and secure firmware (at EL3 or a lower secure exception level) is
>>>> responsible for creating the CPER. ARM is experimenting with using a Standalone
>>>> MM EDK2 image in the secure world to do the CPER creation. This will avoid
>>>> adding the same code in ARM TF in EL3 (better for security). The error will then
>>>> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM Trusted
>>>> Firmware.
>>>>
>>>> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
>>>> interface (as discussed with Christoffer below). So it should generate the CPER
>>>> before injecting the error.
>>>>
>>>> This corresponds to (1) above apart from notifying UEFI (I am assuming you
>>>> mean guest UEFI). At this time, the guest OS already knows where to pick up the
>>>> CPER from through the HEST. Qemu has to create the CPER and populate its address
>>>> at the address exported in the HEST. Guest UEFI should not be involved in this
>>>> flow. Its job was to create the HEST at boot and that has been done by this
>>>> stage.
>>>>
>>>> Qemu folk will be able to add but it looks like support for CPER generation will
>>>> need to be added to Qemu. We need to resolve this.
>>>>
>>>> Do shout if I am missing anything above.
>>>
>>> After reading this email, the use case looks *very* similar to what
>>> we've just done with VMGENID for QEMU 2.9.
>>>
>>> We have a facility between QEMU and the guest firmware, called "ACPI
>>> linker/loader", with which QEMU instructs the firmware to
>>>
>>> - allocate and download blobs into guest RAM (AcpiNVS type memory) --
>>> ALLOCATE command,
>>>
>>> - relocate pointers in those blobs, to fields in other (or the same)
>>> blobs -- ADD_POINTER command,
>>>
>>> - set ACPI table checksums -- ADD_CHECKSUM command,
>>>
>>> - and send GPAs of fields within such blobs back to QEMU --
>>> WRITE_POINTER command.
>>>
>>> This is how I imagine we can map the facility to the current use case
>>> (note that this is the first time I read about HEST / GHES / CPER):

Laszlo describes a Qemu solution for generating the GHES table that mainly uses four commands, ALLOCATE, ADD_POINTER, ADD_CHECKSUM and WRITE_POINTER, to communicate with the BIOS.
So do all four commands need to be supported by the guest firmware/UEFI? I found that "WRITE_POINTER" always fails, so I suspect that the guest UEFI/firmware does not support the "WRITE_POINTER" command. Please help me confirm this, thanks so much.





* Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS
  2017-04-21 13:27                           ` gengdongjiu
@ 2017-04-24 11:27                             ` Laszlo Ersek
  0 siblings, 0 replies; 10+ messages in thread
From: Laszlo Ersek @ 2017-04-24 11:27 UTC (permalink / raw)
  To: gengdongjiu, Achin Gupta
  Cc: ard.biesheuvel, edk2-devel, qemu-devel, zhaoshenglong,
	James Morse, Christoffer Dall, xiexiuqi, Marc Zyngier,
	catalin.marinas, will.deacon, christoffer.dall, rkrcmar,
	suzuki.poulose, andre.przywara, mark.rutland, vladimir.murzin,
	linux-arm-kernel, kvmarm, kvm, linux-kernel, wangxiongfeng2,
	wuquanming, huangshaoyu, Leif.Lindholm, nd, Michael Tsirkin,
	Igor Mammedov

On 04/21/17 15:27, gengdongjiu wrote:
> Hi all/Laszlo,
> 
>   Sorry, I have a question for you.
> 
> 
> On 2017/4/7 2:55, Laszlo Ersek wrote:
>> On 04/06/17 14:35, gengdongjiu wrote:
>>> Dear Laszlo,
>>>    Thanks for your detailed explanation.
>>>
>>> On 2017/3/29 19:58, Laszlo Ersek wrote:
>>>> (This ought to be one of the longest address lists I've ever seen :)
>>>> Thanks for the CC. I'm glad Shannon is already on the CC list. For good
>>>> measure, I'm adding MST and Igor.)
>>>>
>>>> On 03/29/17 12:36, Achin Gupta wrote:
>>>>> Hi gengdongjiu,
>>>>>
>>>>> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>>>>>>
>>>>>> Hi Laszlo/Biesheuvel/Qemu developer,
>>>>>>
>>>>>>    I have encountered an issue on the ARM64 platform and would like to consult you about it, as described below:
>>>>>>
>>>>>> When a synchronous or asynchronous abort happens in the guest OS, KVM
>>>>>> needs to send the error address to Qemu or UEFI through SIGBUS so
>>>>>> that the APEI table can be generated dynamically. From my
>>>>>> investigation, there are two ways:
>>>>>>
>>>>>> (1) Qemu gets the error address and generates the APEI table, then
>>>>>> notifies UEFI of this, then injects the abort error into the guest
>>>>>> OS; the guest OS reads the APEI table.
>>>>>> (2) Qemu gets the error address and lets UEFI generate the APEI
>>>>>> table, then injects the abort error into the guest OS; the guest OS
>>>>>> reads the APEI table.
>>>>>
>>>>> Just being pedantic! I don't think we are talking about creating the APEI table
>>>>> dynamically here. The issue is: Once KVM has received an error that is destined
>>>>> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the error
>>>>> into the guest OS, a CPER (Common Platform Error Record) has to be generated
>>>>> corresponding to the error source (GHES corresponding to memory subsystem,
>>>>> processor etc) to allow the guest OS to do anything meaningful with the
>>>>> error. So who should create the CPER is the question.
>>>>>
>>>>> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error arrives
>>>>> at EL3 and secure firmware (at EL3 or a lower secure exception level) is
>>>>> responsible for creating the CPER. ARM is experimenting with using a Standalone
>>>>> MM EDK2 image in the secure world to do the CPER creation. This will avoid
>>>>> adding the same code in ARM TF in EL3 (better for security). The error will then
>>>>> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM Trusted
>>>>> Firmware.
>>>>>
>>>>> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
>>>>> interface (as discussed with Christoffer below). So it should generate the CPER
>>>>> before injecting the error.
>>>>>
>>>>> This corresponds to (1) above apart from notifying UEFI (I am assuming you
>>>>> mean guest UEFI). At this time, the guest OS already knows where to pick up the
>>>>> CPER from through the HEST. Qemu has to create the CPER and populate its address
>>>>> at the address exported in the HEST. Guest UEFI should not be involved in this
>>>>> flow. Its job was to create the HEST at boot and that has been done by this
>>>>> stage.
>>>>>
>>>>> Qemu folk will be able to add but it looks like support for CPER generation will
>>>>> need to be added to Qemu. We need to resolve this.
>>>>>
>>>>> Do shout if I am missing anything above.
>>>>
>>>> After reading this email, the use case looks *very* similar to what
>>>> we've just done with VMGENID for QEMU 2.9.
>>>>
>>>> We have a facility between QEMU and the guest firmware, called "ACPI
>>>> linker/loader", with which QEMU instructs the firmware to
>>>>
>>>> - allocate and download blobs into guest RAM (AcpiNVS type memory) --
>>>> ALLOCATE command,
>>>>
>>>> - relocate pointers in those blobs, to fields in other (or the same)
>>>> blobs -- ADD_POINTER command,
>>>>
>>>> - set ACPI table checksums -- ADD_CHECKSUM command,
>>>>
>>>> - and send GPAs of fields within such blobs back to QEMU --
>>>> WRITE_POINTER command.
>>>>
>>>> This is how I imagine we can map the facility to the current use case
>>>> (note that this is the first time I read about HEST / GHES / CPER):
> 
> Laszlo describes a Qemu solution for generating the GHES table that
> mainly uses four commands, ALLOCATE, ADD_POINTER, ADD_CHECKSUM and
> WRITE_POINTER, to communicate with the BIOS. So do all four commands
> need to be supported by the guest firmware/UEFI? I found that
> "WRITE_POINTER" always fails, so I suspect that the guest UEFI/firmware
> does not support the "WRITE_POINTER" command. Please help me confirm
> this, thanks so much.

That's incorrect; both OVMF and ArmVirtQemu support the WRITE_POINTER
command (see <https://bugzilla.tianocore.org/show_bug.cgi?id=359>). A
number of OvmfPkg/ modules are included in ArmVirtPkg binaries as well.

In QEMU, the WRITE_POINTER command is currently generated for the
VMGENID device only. If you try to test VMGENID with qemu-system-aarch64
(for the purposes of WRITE_POINTER testing), that won't work, because
the VMGENID device is not available for aarch64. (The Microsoft spec
that describes the device lists Windows OS versions that are x86 only.)

In other words, no QEMU code exists at the moment that would allow you
to readily test WRITE_POINTER in aarch64 guests. However, the
firmware-side code is not architecture specific, and WRITE_POINTER
support is already being built into ArmVirtQemu.
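
For reference, here is a rough sketch of what the firmware is expected to do for one WRITE_POINTER command, as described earlier in the thread: take the guest-determined allocation base of a blob, add the QEMU-dictated offset, and store the result into the write-only fw_cfg blob. The struct below is a simplified, hypothetical model (the real command also names the two fw_cfg files involved), and the little-endian byte order is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical model of one WRITE_POINTER command. */
struct write_pointer_cmd {
    uint32_t dst_offset; /* where in the write-only blob to store the address */
    uint32_t src_offset; /* QEMU-dictated offset inside the allocated blob */
    uint8_t  size;       /* width of the stored address: 1, 2, 4 or 8 bytes */
};

/*
 * Compute alloc_base + src_offset and store it little-endian at dst_offset
 * in dst_blob, which stands in for the write-only fw_cfg file.
 * Returns 0 on success, -1 if the command is malformed.
 */
int process_write_pointer(const struct write_pointer_cmd *cmd,
                          uint64_t alloc_base,
                          uint8_t *dst_blob, size_t dst_len)
{
    assert(cmd != NULL && dst_blob != NULL);
    if (cmd->size != 1 && cmd->size != 2 &&
        cmd->size != 4 && cmd->size != 8) {
        return -1;
    }
    if (cmd->dst_offset > dst_len || dst_len - cmd->dst_offset < cmd->size) {
        return -1;
    }
    uint64_t gpa = alloc_base + cmd->src_offset;
    for (unsigned i = 0; i < cmd->size; i++) {
        dst_blob[cmd->dst_offset + i] = (uint8_t)(gpa >> (8 * i));
    }
    return 0;
}

/* Helper for the QEMU side of the model: read back a 64-bit LE address. */
uint64_t read_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (unsigned i = 0; i < 8; i++) {
        v |= (uint64_t)p[i] << (8 * i);
    }
    return v;
}
```

For error source K, dst_offset would be K*8 in "etc/hardware_errors_addr", and QEMU would read the entry back to learn the GPA of Error Status Data Block K.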

Thanks,
Laszlo


