I think we need a Panic API...

public inbox for devel@edk2.groups.io
 help / color / mirror / Atom feed

* I think we need a Panic API...
@ 2022-02-03  4:37 Andrew Fish
  2022-02-03 17:05 ` [edk2-devel] " Michael D Kinney
  2022-02-10  2:54 ` Ni, Ray
  0 siblings, 2 replies; 6+ messages in thread
From: Andrew Fish @ 2022-02-03  4:37 UTC (permalink / raw)
  To: edk2-devel-groups-io; +Cc: Ni, Ray

Mike Kinney mentioned at the TianoCore Stewards Meeting that it might be a good idea to add a panic API to the edk2. I agree. 

Some background…. While ASSERT is a powerful tool I think we have a tendency to misuse it at times in our TianoCore code:
1) Sometime we do ASSERT when we should be error checking. This pattern probably goes back to the days when we had to fit EFI + Legacy BIOS in 512 KiB ROMs. I actually noticed with clang some of the ASSERTs we have that check for null actually generate Undefined Behavior and clang emits a trap. That is for a future mail to discuss. 
2) Sometimes we use the ASSERTs to indicate catastrophic failure. Which is all nice on a DEBUG build to get a nice message but not very helpful on a RELEASE build when ASSERT is turned off. 

Thus the idea for a Panic API would be to capture fatal errors. For example if DRAM fails to train you are kind of stuck. There are also software states that make you dead in the water so you may as well panic. Since my VP runs CoreOS I’ve got a few opinions on what this infrastructure should look like. I also happen to have implemented this in a proprietary way. 

So 1st rule of panic is don’t talk about panic. No sorry that is fight club. The 1st rule of panic is don’t assume the reset of the software stack functions. So we need the option of pluming up the panic as BASE code. Given that we probably want an independent library that produces the panic function. A platform can chose to implement the panic flow at a higher level and use some existing infrastructure like report status code, but the architecture needs to empower a BASE implementation that does not require any other infrastructure. The 2nd rule of panic is people are interested in what happened when the system panicked so if at first you get a panic string over time you will want to collect more and more info. So we need a scalable way to get data out. 3rd rule of panic is there may actually be some hardware to talk to so don’t skimp on sending data. I think QEMU supports panic devices, and you might be able to wax poetic to a security coprocessor to tell the tail of your EFI panic. I almost forgot the 4th rule of panic, sometimes you want to panic for a friend. Sounds funny but think about how the exception handler works today you can kind of end up taking nested exceptions. So the unexpected exception handler is a good place to panic from, but you don’t want to panic in the context of the exception handler you want to point to the code that faulted. 

Ok given all that here is my 1st pass at an API.

VOID
EFIAPI
Panic (
  IN EFI_SYSTEM_CONTEXT  SystemContext OPTIONAL,
  IN CHAR16                             *PanicString
  );

VOID
EFIAPI
PanicEx (
  IN EFI_SYSTEM_CONTEXT  SystemContext OPTIONAL,
  IN BOOLEAN                          ProcessSystemContext,
  IN CHAR16                             *PanicString
  );

VOID
EFIAPI
PanicExData (
  IN VOID *Data,
  IN UINT DataSize
  );

VOID
EFIAPI
PanicExDone (
  VOID
  );

The simple form is:
Panic (NULL, L”MRC: Memory training failed due to XYZ thing-a-ma-bob failure”);

The extended form example from the exception handler:
PanicEx (SystemContext, FALSE, L”Exception 13: GP fault address: 0xAFAFAFAFAFAFAFAF”);
PanicExData (StackTrace, SizeOfStackTrace);
PanicExData (FirmwareVersion, SizeFirmwareVersion);
PanicExDone ();

The reason for having multiple calls to PanicExData() is you might be limited to a smallish buffer on the stack that you want to iterate over to do more complex logging. The 4th rule of panic implies the agent getting the data may have a lot more resources than the panicking code, so you might be able to send a lot of data.  

The idea behind ProcessSystemContext is if you are called from arbitrary code you might what to process SystemContext in a way similar to the unexpected exception handler. If you are in the unexpected exception handler maybe you want to let that code do the processing for you? I guess this implies we may want to make some worker libs for processing exception data and building stack frames etc. But that is more of an optimization after we add the Panic API.

The other things we could think about adding to the PanicEx is a UUID that could start the stream of data (or follow the panic string) and it could define the encoding of the panic data. This would make it easy to post process the panic data. Even if the panic data was just text it would point at which automated parser you could use to extract useful info out of the text stream. If you want to log locally, think DEBUG print then maybe you want more type info about Data. So maybe you want UINT8 vs CHAR8 vs CHAR16 API options for PanicExData(). 

I thought I’d get the conversation started, looking forward to seeing what others think.

Thanks,

Andrew Fish

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [edk2-devel] I think we need a Panic API...
  2022-02-03  4:37 I think we need a Panic API Andrew Fish
@ 2022-02-03 17:05 ` Michael D Kinney
  2022-02-10  2:54 ` Ni, Ray
  1 sibling, 0 replies; 6+ messages in thread
From: Michael D Kinney @ 2022-02-03 17:05 UTC (permalink / raw)
  To: devel@edk2.groups.io, afish@apple.com, Kinney, Michael D; +Cc: Ni, Ray

Hi Andrew,

I think it would be useful to define a few macros for PANIC() conditions that
would make it easy to convert an existing ASSERT() condition that is really
a PANIC() condition by simply using a different macro name.  One attribute of
this macro (unlike DEBUG() and ASSERT() macros) is that there would be no build
time configuration to remove PANIC() macros.  The PanicString in this specific
case can be the ASSERT() condition.  A PANIC_EX() macro could provide more
context.

The idea of a PanicLib library class has many benefits.  It would allow different
implementations.  Some of type BASE for panic conditions in exception context,
but also implementations that can route through ReportStatusCode so it can go to
a common platform handler.  Another advantage of ReportStatusCode is it already
supports the UUID concept to identify the source of the panic condition.  Of course
the version that goes through ReportStatusCode only makes sense if the state of the
system is stable enough to route the information through the report status code
router and to the correct report status code listener to perform the platform
specific processing of the event.

If we do not take advantage of ReportStatusCode routing of these conditions, then
every module that contains a PANIC() check would statically link the PanicLib
which may have significant FW size impacts. So there is a tradeoff between 
FW size and minimizing the infrastructure you have to trust to process a panic
condition.

Mike

> -----Original Message-----
> From: devel@edk2.groups.io <devel@edk2.groups.io> On Behalf Of Andrew Fish via groups.io
> Sent: Wednesday, February 2, 2022 8:37 PM
> To: edk2-devel-groups-io <devel@edk2.groups.io>
> Cc: Ni, Ray <ray.ni@intel.com>
> Subject: [edk2-devel] I think we need a Panic API...
> 
> Mike Kinney mentioned at the TianoCore Stewards Meeting that it might be a good idea to add a panic API to the edk2. I agree.
> 
> Some background…. While ASSERT is a powerful tool I think we have a tendency to misuse it at times in our TianoCore code:
> 1) Sometime we do ASSERT when we should be error checking. This pattern probably goes back to the days when we had to fit EFI +
> Legacy BIOS in 512 KiB ROMs. I actually noticed with clang some of the ASSERTs we have that check for null actually generate
> Undefined Behavior and clang emits a trap. That is for a future mail to discuss.
> 2) Sometimes we use the ASSERTs to indicate catastrophic failure. Which is all nice on a DEBUG build to get a nice message but
> not very helpful on a RELEASE build when ASSERT is turned off.
> 
> Thus the idea for a Panic API would be to capture fatal errors. For example if DRAM fails to train you are kind of stuck. There
> are also software states that make you dead in the water so you may as well panic. Since my VP runs CoreOS I’ve got a few
> opinions on what this infrastructure should look like. I also happen to have implemented this in a proprietary way.
> 
> So 1st rule of panic is don’t talk about panic. No sorry that is fight club. The 1st rule of panic is don’t assume the reset of
> the software stack functions. So we need the option of pluming up the panic as BASE code. Given that we probably want an
> independent library that produces the panic function. A platform can chose to implement the panic flow at a higher level and
> use some existing infrastructure like report status code, but the architecture needs to empower a BASE implementation that does
> not require any other infrastructure. The 2nd rule of panic is people are interested in what happened when the system panicked
> so if at first you get a panic string over time you will want to collect more and more info. So we need a scalable way to get
> data out. 3rd rule of panic is there may actually be some hardware to talk to so don’t skimp on sending data. I think QEMU
> supports panic devices, and you might be able to wax poetic to a security coprocessor to tell the tail of your EFI panic. I
> almost forgot the 4th rule of panic, sometimes you want to panic for a friend. Sounds funny but think about how the exception
> handler works today you can kind of end up taking nested exceptions. So the unexpected exception handler is a good place to
> panic from, but you don’t want to panic in the context of the exception handler you want to point to the code that faulted.
> 
> Ok given all that here is my 1st pass at an API.
> 
> VOID
> EFIAPI
> Panic (
>   IN EFI_SYSTEM_CONTEXT  SystemContext OPTIONAL,
>   IN CHAR16                             *PanicString
>   );
> 
> 
> VOID
> EFIAPI
> PanicEx (
>   IN EFI_SYSTEM_CONTEXT  SystemContext OPTIONAL,
>   IN BOOLEAN                          ProcessSystemContext,
>   IN CHAR16                             *PanicString
>   );
> 
> VOID
> EFIAPI
> PanicExData (
>   IN VOID *Data,
>   IN UINT DataSize
>   );
> 
> VOID
> EFIAPI
> PanicExDone (
>   VOID
>   );
> 
> The simple form is:
> Panic (NULL, L”MRC: Memory training failed due to XYZ thing-a-ma-bob failure”);
> 
> The extended form example from the exception handler:
> PanicEx (SystemContext, FALSE, L”Exception 13: GP fault address: 0xAFAFAFAFAFAFAFAF”);
> PanicExData (StackTrace, SizeOfStackTrace);
> PanicExData (FirmwareVersion, SizeFirmwareVersion);
> PanicExDone ();
> 
> The reason for having multiple calls to PanicExData() is you might be limited to a smallish buffer on the stack that you want
> to iterate over to do more complex logging. The 4th rule of panic implies the agent getting the data may have a lot more
> resources than the panicking code, so you might be able to send a lot of data.
> 
> The idea behind ProcessSystemContext is if you are called from arbitrary code you might what to process SystemContext in a way
> similar to the unexpected exception handler. If you are in the unexpected exception handler maybe you want to let that code do
> the processing for you? I guess this implies we may want to make some worker libs for processing exception data and building
> stack frames etc. But that is more of an optimization after we add the Panic API.
> 
> The other things we could think about adding to the PanicEx is a UUID that could start the stream of data (or follow the panic
> string) and it could define the encoding of the panic data. This would make it easy to post process the panic data. Even if the
> panic data was just text it would point at which automated parser you could use to extract useful info out of the text stream.
> If you want to log locally, think DEBUG print then maybe you want more type info about Data. So maybe you want UINT8 vs CHAR8
> vs CHAR16 API options for PanicExData().
> 
> I thought I’d get the conversation started, looking forward to seeing what others think.
> 
> Thanks,
> 
> Andrew Fish
> 
> 
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: I think we need a Panic API...
  2022-02-03  4:37 I think we need a Panic API Andrew Fish
  2022-02-03 17:05 ` [edk2-devel] " Michael D Kinney
@ 2022-02-10  2:54 ` Ni, Ray
  2022-02-10  3:37   ` [edk2-devel] " Jeff Fan
  1 sibling, 1 reply; 6+ messages in thread
From: Ni, Ray @ 2022-02-10  2:54 UTC (permalink / raw)
  To: Andrew Fish, edk2-devel-groups-io

Andrew,
I agree Panic is useful because ASSERT is a NOP in the  release build.

Can you explain a bit more on ProcessSystemContext?

-----Original Message-----
From: Andrew Fish <afish@apple.com> 
Sent: Thursday, February 3, 2022 12:37 PM
To: edk2-devel-groups-io <devel@edk2.groups.io>
Cc: Ni, Ray <ray.ni@intel.com>
Subject: I think we need a Panic API...

Mike Kinney mentioned at the TianoCore Stewards Meeting that it might be a good idea to add a panic API to the edk2. I agree. 

Some background…. While ASSERT is a powerful tool I think we have a tendency to misuse it at times in our TianoCore code:
1) Sometime we do ASSERT when we should be error checking. This pattern probably goes back to the days when we had to fit EFI + Legacy BIOS in 512 KiB ROMs. I actually noticed with clang some of the ASSERTs we have that check for null actually generate Undefined Behavior and clang emits a trap. That is for a future mail to discuss. 
2) Sometimes we use the ASSERTs to indicate catastrophic failure. Which is all nice on a DEBUG build to get a nice message but not very helpful on a RELEASE build when ASSERT is turned off. 

Thus the idea for a Panic API would be to capture fatal errors. For example if DRAM fails to train you are kind of stuck. There are also software states that make you dead in the water so you may as well panic. Since my VP runs CoreOS I’ve got a few opinions on what this infrastructure should look like. I also happen to have implemented this in a proprietary way. 

So 1st rule of panic is don’t talk about panic. No sorry that is fight club. The 1st rule of panic is don’t assume the reset of the software stack functions. So we need the option of pluming up the panic as BASE code. Given that we probably want an independent library that produces the panic function. A platform can chose to implement the panic flow at a higher level and use some existing infrastructure like report status code, but the architecture needs to empower a BASE implementation that does not require any other infrastructure. The 2nd rule of panic is people are interested in what happened when the system panicked so if at first you get a panic string over time you will want to collect more and more info. So we need a scalable way to get data out. 3rd rule of panic is there may actually be some hardware to talk to so don’t skimp on sending data. I think QEMU supports panic devices, and you might be able to wax poetic to a security coprocessor to tell the tail of your EFI panic. I almost forgot the 4th rule of panic, sometimes you want to panic for a friend. Sounds funny but think about how the exception handler works today you can kind of end up taking nested exceptions. So the unexpected exception handler is a good place to panic from, but you don’t want to panic in the context of the exception handler you want to point to the code that faulted. 

Ok given all that here is my 1st pass at an API.

VOID
EFIAPI
Panic (
  IN EFI_SYSTEM_CONTEXT  SystemContext OPTIONAL,
  IN CHAR16                             *PanicString
  );

VOID
EFIAPI
PanicEx (
  IN EFI_SYSTEM_CONTEXT  SystemContext OPTIONAL,
  IN BOOLEAN                          ProcessSystemContext,
  IN CHAR16                             *PanicString
  );

VOID
EFIAPI
PanicExData (
  IN VOID *Data,
  IN UINT DataSize
  );

VOID
EFIAPI
PanicExDone (
  VOID
  );

The simple form is:
Panic (NULL, L”MRC: Memory training failed due to XYZ thing-a-ma-bob failure”);

The extended form example from the exception handler:
PanicEx (SystemContext, FALSE, L”Exception 13: GP fault address: 0xAFAFAFAFAFAFAFAF”);
PanicExData (StackTrace, SizeOfStackTrace);
PanicExData (FirmwareVersion, SizeFirmwareVersion);
PanicExDone ();

The reason for having multiple calls to PanicExData() is you might be limited to a smallish buffer on the stack that you want to iterate over to do more complex logging. The 4th rule of panic implies the agent getting the data may have a lot more resources than the panicking code, so you might be able to send a lot of data.  

The idea behind ProcessSystemContext is if you are called from arbitrary code you might what to process SystemContext in a way similar to the unexpected exception handler. If you are in the unexpected exception handler maybe you want to let that code do the processing for you? I guess this implies we may want to make some worker libs for processing exception data and building stack frames etc. But that is more of an optimization after we add the Panic API.

The other things we could think about adding to the PanicEx is a UUID that could start the stream of data (or follow the panic string) and it could define the encoding of the panic data. This would make it easy to post process the panic data. Even if the panic data was just text it would point at which automated parser you could use to extract useful info out of the text stream. If you want to log locally, think DEBUG print then maybe you want more type info about Data. So maybe you want UINT8 vs CHAR8 vs CHAR16 API options for PanicExData(). 

I thought I’d get the conversation started, looking forward to seeing what others think.

Thanks,

Andrew Fish

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [edk2-devel] I think we need a Panic API...
  2022-02-10  2:54 ` Ni, Ray
@ 2022-02-10  3:37   ` Jeff Fan
  2022-02-10  5:33     ` Ni, Ray
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff Fan @ 2022-02-10  3:37 UTC (permalink / raw)
  To: devel@edk2.groups.io, Ni, Ray, 'Andrew Fish'

[-- Attachment #1: Type: text/plain, Size: 5558 bytes --]

Ray,

EFI_SYSTEM_CONTEXT was defined in MdePkg/Include/Protocol/DebugSupport.h

Jeff

fanjianfeng@byosoft.com.cn

From: Ni, Ray
Date: 2022-02-10 10:54
To: Andrew Fish; edk2-devel-groups-io
Subject: Re: [edk2-devel] I think we need a Panic API...
Andrew,
I agree Panic is useful because ASSERT is a NOP in the  release build.

Can you explain a bit more on ProcessSystemContext?

-----Original Message-----
From: Andrew Fish <afish@apple.com> 
Sent: Thursday, February 3, 2022 12:37 PM
To: edk2-devel-groups-io <devel@edk2.groups.io>
Cc: Ni, Ray <ray.ni@intel.com>
Subject: I think we need a Panic API...

Mike Kinney mentioned at the TianoCore Stewards Meeting that it might be a good idea to add a panic API to the edk2. I agree. 

Some background…. While ASSERT is a powerful tool I think we have a tendency to misuse it at times in our TianoCore code:
1) Sometime we do ASSERT when we should be error checking. This pattern probably goes back to the days when we had to fit EFI + Legacy BIOS in 512 KiB ROMs. I actually noticed with clang some of the ASSERTs we have that check for null actually generate Undefined Behavior and clang emits a trap. That is for a future mail to discuss. 
2) Sometimes we use the ASSERTs to indicate catastrophic failure. Which is all nice on a DEBUG build to get a nice message but not very helpful on a RELEASE build when ASSERT is turned off. 

Thus the idea for a Panic API would be to capture fatal errors. For example if DRAM fails to train you are kind of stuck. There are also software states that make you dead in the water so you may as well panic. Since my VP runs CoreOS I’ve got a few opinions on what this infrastructure should look like. I also happen to have implemented this in a proprietary way. 

So 1st rule of panic is don’t talk about panic. No sorry that is fight club. The 1st rule of panic is don’t assume the reset of the software stack functions. So we need the option of pluming up the panic as BASE code. Given that we probably want an independent library that produces the panic function. A platform can chose to implement the panic flow at a higher level and use some existing infrastructure like report status code, but the architecture needs to empower a BASE implementation that does not require any other infrastructure. The 2nd rule of panic is people are interested in what happened when the system panicked so if at first you get a panic string over time you will want to collect more and more info. So we need a scalable way to get data out. 3rd rule of panic is there may actually be some hardware to talk to so don’t skimp on sending data. I think QEMU supports panic devices, and you might be able to wax poetic to a security coprocessor to tell the tail of your EFI panic. I almost forgot the 4th rule of panic, sometimes you want to panic for a friend. Sounds funny but think about how the exception handler works today you can kind of end up taking nested exceptions. So the unexpected exception handler is a good place to panic from, but you don’t want to panic in the context of the exception handler you want to point to the code that faulted. 

Ok given all that here is my 1st pass at an API.

VOID
EFIAPI
Panic (
  IN EFI_SYSTEM_CONTEXT  SystemContext OPTIONAL,
  IN CHAR16                             *PanicString
  );

VOID
EFIAPI
PanicEx (
  IN EFI_SYSTEM_CONTEXT  SystemContext OPTIONAL,
  IN BOOLEAN                          ProcessSystemContext,
  IN CHAR16                             *PanicString
  );

VOID
EFIAPI
PanicExData (
  IN VOID *Data,
  IN UINT DataSize
  );

VOID
EFIAPI
PanicExDone (
  VOID
  );

The simple form is:
Panic (NULL, L”MRC: Memory training failed due to XYZ thing-a-ma-bob failure”);

The extended form example from the exception handler:
PanicEx (SystemContext, FALSE, L”Exception 13: GP fault address: 0xAFAFAFAFAFAFAFAF”);
PanicExData (StackTrace, SizeOfStackTrace);
PanicExData (FirmwareVersion, SizeFirmwareVersion);
PanicExDone ();

The reason for having multiple calls to PanicExData() is you might be limited to a smallish buffer on the stack that you want to iterate over to do more complex logging. The 4th rule of panic implies the agent getting the data may have a lot more resources than the panicking code, so you might be able to send a lot of data.  

The idea behind ProcessSystemContext is if you are called from arbitrary code you might what to process SystemContext in a way similar to the unexpected exception handler. If you are in the unexpected exception handler maybe you want to let that code do the processing for you? I guess this implies we may want to make some worker libs for processing exception data and building stack frames etc. But that is more of an optimization after we add the Panic API.

The other things we could think about adding to the PanicEx is a UUID that could start the stream of data (or follow the panic string) and it could define the encoding of the panic data. This would make it easy to post process the panic data. Even if the panic data was just text it would point at which automated parser you could use to extract useful info out of the text stream. If you want to log locally, think DEBUG print then maybe you want more type info about Data. So maybe you want UINT8 vs CHAR8 vs CHAR16 API options for PanicExData(). 

I thought I’d get the conversation started, looking forward to seeing what others think.

Thanks,

Andrew Fish

[-- Attachment #2: Type: text/html, Size: 8222 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [edk2-devel] I think we need a Panic API...
  2022-02-10  3:37   ` [edk2-devel] " Jeff Fan
@ 2022-02-10  5:33     ` Ni, Ray
  2022-02-11  1:02       ` Andrew Fish
  0 siblings, 1 reply; 6+ messages in thread
From: Ni, Ray @ 2022-02-10  5:33 UTC (permalink / raw)
  To: devel@edk2.groups.io, fanjianfeng@byosoft.com.cn,
	'Andrew Fish'

[-- Attachment #1: Type: text/plain, Size: 6130 bytes --]

Jeff,
I understand what “EFI_SYSTEM_CONTEXT” is. I am curious of the need of “BOOLEAN ProcessSystemContext”.

From: devel@edk2.groups.io <devel@edk2.groups.io> On Behalf Of Jeff Fan
Sent: Thursday, February 10, 2022 11:37 AM
To: devel@edk2.groups.io; Ni, Ray <ray.ni@intel.com>; 'Andrew Fish' <afish@apple.com>
Subject: Re: [edk2-devel] I think we need a Panic API...

Ray,

EFI_SYSTEM_CONTEXT was defined in MdePkg/Include/Protocol/DebugSupport.h

Jeff

________________________________
fanjianfeng@byosoft.com.cn<mailto:fanjianfeng@byosoft.com.cn>

From: Ni, Ray<mailto:ray.ni@intel.com>
Date: 2022-02-10 10:54
To: Andrew Fish<mailto:afish@apple.com>; edk2-devel-groups-io<mailto:devel@edk2.groups.io>
Subject: Re: [edk2-devel] I think we need a Panic API...
Andrew,
I agree Panic is useful because ASSERT is a NOP in the  release build.

Can you explain a bit more on ProcessSystemContext?

-----Original Message-----
From: Andrew Fish <afish@apple.com<mailto:afish@apple.com>>
Sent: Thursday, February 3, 2022 12:37 PM
To: edk2-devel-groups-io <devel@edk2.groups.io<mailto:devel@edk2.groups.io>>
Cc: Ni, Ray <ray.ni@intel.com<mailto:ray.ni@intel.com>>
Subject: I think we need a Panic API...

Mike Kinney mentioned at the TianoCore Stewards Meeting that it might be a good idea to add a panic API to the edk2. I agree.

Some background…. While ASSERT is a powerful tool I think we have a tendency to misuse it at times in our TianoCore code:
1) Sometime we do ASSERT when we should be error checking. This pattern probably goes back to the days when we had to fit EFI + Legacy BIOS in 512 KiB ROMs. I actually noticed with clang some of the ASSERTs we have that check for null actually generate Undefined Behavior and clang emits a trap. That is for a future mail to discuss.
2) Sometimes we use the ASSERTs to indicate catastrophic failure. Which is all nice on a DEBUG build to get a nice message but not very helpful on a RELEASE build when ASSERT is turned off.

Thus the idea for a Panic API would be to capture fatal errors. For example if DRAM fails to train you are kind of stuck. There are also software states that make you dead in the water so you may as well panic. Since my VP runs CoreOS I’ve got a few opinions on what this infrastructure should look like. I also happen to have implemented this in a proprietary way.

So 1st rule of panic is don’t talk about panic. No sorry that is fight club. The 1st rule of panic is don’t assume the reset of the software stack functions. So we need the option of pluming up the panic as BASE code. Given that we probably want an independent library that produces the panic function. A platform can chose to implement the panic flow at a higher level and use some existing infrastructure like report status code, but the architecture needs to empower a BASE implementation that does not require any other infrastructure. The 2nd rule of panic is people are interested in what happened when the system panicked so if at first you get a panic string over time you will want to collect more and more info. So we need a scalable way to get data out. 3rd rule of panic is there may actually be some hardware to talk to so don’t skimp on sending data. I think QEMU supports panic devices, and you might be able to wax poetic to a security coprocessor to tell the tail of your EFI panic. I almost forgot the 4th rule of panic, sometimes you want to panic for a friend. Sounds funny but think about how the exception handler works today you can kind of end up taking nested exceptions. So the unexpected exception handler is a good place to panic from, but you don’t want to panic in the context of the exception handler you want to point to the code that faulted.

Ok given all that here is my 1st pass at an API.

VOID
EFIAPI
Panic (
  IN EFI_SYSTEM_CONTEXT  SystemContext OPTIONAL,
  IN CHAR16                             *PanicString
  );

VOID
EFIAPI
PanicEx (
  IN EFI_SYSTEM_CONTEXT  SystemContext OPTIONAL,
  IN BOOLEAN                          ProcessSystemContext,
  IN CHAR16                             *PanicString
  );

VOID
EFIAPI
PanicExData (
  IN VOID *Data,
  IN UINT DataSize
  );

VOID
EFIAPI
PanicExDone (
  VOID
  );

The simple form is:
Panic (NULL, L”MRC: Memory training failed due to XYZ thing-a-ma-bob failure”);

The extended form example from the exception handler:
PanicEx (SystemContext, FALSE, L”Exception 13: GP fault address: 0xAFAFAFAFAFAFAFAF”);
PanicExData (StackTrace, SizeOfStackTrace);
PanicExData (FirmwareVersion, SizeFirmwareVersion);
PanicExDone ();

The reason for having multiple calls to PanicExData() is you might be limited to a smallish buffer on the stack that you want to iterate over to do more complex logging. The 4th rule of panic implies the agent getting the data may have a lot more resources than the panicking code, so you might be able to send a lot of data.

The idea behind ProcessSystemContext is if you are called from arbitrary code you might what to process SystemContext in a way similar to the unexpected exception handler. If you are in the unexpected exception handler maybe you want to let that code do the processing for you? I guess this implies we may want to make some worker libs for processing exception data and building stack frames etc. But that is more of an optimization after we add the Panic API.

The other things we could think about adding to the PanicEx is a UUID that could start the stream of data (or follow the panic string) and it could define the encoding of the panic data. This would make it easy to post process the panic data. Even if the panic data was just text it would point at which automated parser you could use to extract useful info out of the text stream. If you want to log locally, think DEBUG print then maybe you want more type info about Data. So maybe you want UINT8 vs CHAR8 vs CHAR16 API options for PanicExData().

I thought I’d get the conversation started, looking forward to seeing what others think.

Thanks,

Andrew Fish

[-- Attachment #2: Type: text/html, Size: 24490 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [edk2-devel] I think we need a Panic API...
  2022-02-10  5:33     ` Ni, Ray
@ 2022-02-11  1:02       ` Andrew Fish
  0 siblings, 0 replies; 6+ messages in thread
From: Andrew Fish @ 2022-02-11  1:02 UTC (permalink / raw)
  To: edk2-devel-groups-io, Ni, Ray; +Cc: fanjianfeng@byosoft.com.cn

[-- Attachment #1: Type: text/plain, Size: 7396 bytes --]



> On Feb 9, 2022, at 9:33 PM, Ni, Ray <ray.ni@intel.com> wrote:
> 
> Jeff,
> I understand what “EFI_SYSTEM_CONTEXT” is. I am curious of the need of “BOOLEAN ProcessSystemContext”.
>  

Ray,

ProcessSystemContext may not be needed… 

My thinking was if you pass ProcessSystemContext == TRUE there is some implied processing of the SystemContext by the PanicLib. If you pass FALSE then there is no processing. For example if you called panic from DumpImageAndCpuContent() [1] in the CpuExceptionHandleLib then the logic exists  in DumpImageAndCpuContent() to decode the SystemContext so you don’t want the PanicLib to do it for you. 

It might be possible to pass NULL for SystemContext and get the same behavior. The only time you might need SystemContext if there was some kind of binary panic header that encoded system state that only the Panic code knew about. 

[1] https://github.com/tianocore/edk2/blob/master/UefiCpuPkg/Library/CpuExceptionHandlerLib/X64/ArchExceptionHandler.c#L416

Thanks,

Andrew Fish

> From: devel@edk2.groups.io <mailto:devel@edk2.groups.io> <devel@edk2.groups.io <mailto:devel@edk2.groups.io>> On Behalf Of Jeff Fan
> Sent: Thursday, February 10, 2022 11:37 AM
> To: devel@edk2.groups.io <mailto:devel@edk2.groups.io>; Ni, Ray <ray.ni@intel.com <mailto:ray.ni@intel.com>>; 'Andrew Fish' <afish@apple.com <mailto:afish@apple.com>>
> Subject: Re: [edk2-devel] I think we need a Panic API...
>  
> Ray,
>  
> EFI_SYSTEM_CONTEXT was defined in MdePkg/Include/Protocol/DebugSupport.h
>  
> Jeff
>  
> fanjianfeng@byosoft.com.cn <mailto:fanjianfeng@byosoft.com.cn>
>  
> From: Ni, Ray <mailto:ray.ni@intel.com>
> Date: 2022-02-10 10:54
> To: Andrew Fish <mailto:afish@apple.com>; edk2-devel-groups-io <mailto:devel@edk2.groups.io>
> Subject: Re: [edk2-devel] I think we need a Panic API...
> Andrew,
> I agree Panic is useful because ASSERT is a NOP in the  release build.
>  
> Can you explain a bit more on ProcessSystemContext?
>  
> -----Original Message-----
> From: Andrew Fish <afish@apple.com <mailto:afish@apple.com>>
> Sent: Thursday, February 3, 2022 12:37 PM
> To: edk2-devel-groups-io <devel@edk2.groups.io <mailto:devel@edk2.groups.io>>
> Cc: Ni, Ray <ray.ni@intel.com <mailto:ray.ni@intel.com>>
> Subject: I think we need a Panic API...
>  
> Mike Kinney mentioned at the TianoCore Stewards Meeting that it might be a good idea to add a panic API to the edk2. I agree.
>  
> Some background…. While ASSERT is a powerful tool I think we have a tendency to misuse it at times in our TianoCore code:
> 1) Sometime we do ASSERT when we should be error checking. This pattern probably goes back to the days when we had to fit EFI + Legacy BIOS in 512 KiB ROMs. I actually noticed with clang some of the ASSERTs we have that check for null actually generate Undefined Behavior and clang emits a trap. That is for a future mail to discuss.
> 2) Sometimes we use the ASSERTs to indicate catastrophic failure. Which is all nice on a DEBUG build to get a nice message but not very helpful on a RELEASE build when ASSERT is turned off. 
>  
> Thus the idea for a Panic API would be to capture fatal errors. For example if DRAM fails to train you are kind of stuck. There are also software states that make you dead in the water so you may as well panic. Since my VP runs CoreOS I’ve got a few opinions on what this infrastructure should look like. I also happen to have implemented this in a proprietary way.
>  
> So 1st rule of panic is don’t talk about panic. No sorry that is fight club. The 1st rule of panic is don’t assume the reset of the software stack functions. So we need the option of pluming up the panic as BASE code. Given that we probably want an independent library that produces the panic function. A platform can chose to implement the panic flow at a higher level and use some existing infrastructure like report status code, but the architecture needs to empower a BASE implementation that does not require any other infrastructure. The 2nd rule of panic is people are interested in what happened when the system panicked so if at first you get a panic string over time you will want to collect more and more info. So we need a scalable way to get data out. 3rd rule of panic is there may actually be some hardware to talk to so don’t skimp on sending data. I think QEMU supports panic devices, and you might be able to wax poetic to a security coprocessor to tell the tail of your EFI panic. I almost forgot the 4th rule of panic, sometimes you want to panic for a friend. Sounds funny but think about how the exception handler works today you can kind of end up taking nested exceptions. So the unexpected exception handler is a good place to panic from, but you don’t want to panic in the context of the exception handler you want to point to the code that faulted. 
>  
> Ok given all that here is my 1st pass at an API.
>  
> VOID
> EFIAPI
> Panic (
>   IN EFI_SYSTEM_CONTEXT  SystemContext OPTIONAL,
>   IN CHAR16                             *PanicString
>   );
>  
>  
> VOID
> EFIAPI
> PanicEx (
>   IN EFI_SYSTEM_CONTEXT  SystemContext OPTIONAL,
>   IN BOOLEAN                          ProcessSystemContext,
>   IN CHAR16                             *PanicString
>   );
>  
> VOID
> EFIAPI
> PanicExData (
>   IN VOID *Data,
>   IN UINT DataSize
>   );
>  
> VOID
> EFIAPI
> PanicExDone (
>   VOID
>   );
>  
> The simple form is:
> Panic (NULL, L”MRC: Memory training failed due to XYZ thing-a-ma-bob failure”);
>  
> The extended form example from the exception handler:
> PanicEx (SystemContext, FALSE, L”Exception 13: GP fault address: 0xAFAFAFAFAFAFAFAF”);
> PanicExData (StackTrace, SizeOfStackTrace);
> PanicExData (FirmwareVersion, SizeFirmwareVersion);
> PanicExDone ();
>  
> The reason for having multiple calls to PanicExData() is you might be limited to a smallish buffer on the stack that you want to iterate over to do more complex logging. The 4th rule of panic implies the agent getting the data may have a lot more resources than the panicking code, so you might be able to send a lot of data. 
>  
> The idea behind ProcessSystemContext is if you are called from arbitrary code you might what to process SystemContext in a way similar to the unexpected exception handler. If you are in the unexpected exception handler maybe you want to let that code do the processing for you? I guess this implies we may want to make some worker libs for processing exception data and building stack frames etc. But that is more of an optimization after we add the Panic API.
>  
> The other things we could think about adding to the PanicEx is a UUID that could start the stream of data (or follow the panic string) and it could define the encoding of the panic data. This would make it easy to post process the panic data. Even if the panic data was just text it would point at which automated parser you could use to extract useful info out of the text stream. If you want to log locally, think DEBUG print then maybe you want more type info about Data. So maybe you want UINT8 vs CHAR8 vs CHAR16 API options for PanicExData().
>  
> I thought I’d get the conversation started, looking forward to seeing what others think.
>  
> Thanks,
>  
> Andrew Fish
>  
>  
>  
>  
> 


[-- Attachment #2: Type: text/html, Size: 34245 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-02-11  1:02 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-02-03  4:37 I think we need a Panic API Andrew Fish
2022-02-03 17:05 ` [edk2-devel] " Michael D Kinney
2022-02-10  2:54 ` Ni, Ray
2022-02-10  3:37   ` [edk2-devel] " Jeff Fan
2022-02-10  5:33     ` Ni, Ray
2022-02-11  1:02       ` Andrew Fish

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox