public inbox for devel@edk2.groups.io
 help / color / mirror / Atom feed
* I think we need a Panic API...
@ 2022-02-03  4:37 Andrew Fish
  2022-02-03 17:05 ` [edk2-devel] " Michael D Kinney
  2022-02-10  2:54 ` Ni, Ray
  0 siblings, 2 replies; 6+ messages in thread
From: Andrew Fish @ 2022-02-03  4:37 UTC (permalink / raw)
  To: edk2-devel-groups-io; +Cc: Ni, Ray

Mike Kinney mentioned at the TianoCore Stewards Meeting that it might be a good idea to add a panic API to the edk2. I agree. 

Some background…. While ASSERT is a powerful tool I think we have a tendency to misuse it at times in our TianoCore code:
1) Sometime we do ASSERT when we should be error checking. This pattern probably goes back to the days when we had to fit EFI + Legacy BIOS in 512 KiB ROMs. I actually noticed with clang some of the ASSERTs we have that check for null actually generate Undefined Behavior and clang emits a trap. That is for a future mail to discuss. 
2) Sometimes we use the ASSERTs to indicate catastrophic failure. Which is all nice on a DEBUG build to get a nice message but not very helpful on a RELEASE build when ASSERT is turned off. 

Thus the idea for a Panic API would be to capture fatal errors. For example if DRAM fails to train you are kind of stuck. There are also software states that make you dead in the water so you may as well panic. Since my VP runs CoreOS I’ve got a few opinions on what this infrastructure should look like. I also happen to have implemented this in a proprietary way. 

So 1st rule of panic is don’t talk about panic. No sorry that is fight club. The 1st rule of panic is don’t assume the reset of the software stack functions. So we need the option of pluming up the panic as BASE code. Given that we probably want an independent library that produces the panic function. A platform can chose to implement the panic flow at a higher level and use some existing infrastructure like report status code, but the architecture needs to empower a BASE implementation that does not require any other infrastructure. The 2nd rule of panic is people are interested in what happened when the system panicked so if at first you get a panic string over time you will want to collect more and more info. So we need a scalable way to get data out. 3rd rule of panic is there may actually be some hardware to talk to so don’t skimp on sending data. I think QEMU supports panic devices, and you might be able to wax poetic to a security coprocessor to tell the tail of your EFI panic. I almost forgot the 4th rule of panic, sometimes you want to panic for a friend. Sounds funny but think about how the exception handler works today you can kind of end up taking nested exceptions. So the unexpected exception handler is a good place to panic from, but you don’t want to panic in the context of the exception handler you want to point to the code that faulted. 

Ok given all that here is my 1st pass at an API.

VOID
EFIAPI
Panic (
  IN EFI_SYSTEM_CONTEXT  SystemContext OPTIONAL,
  IN CHAR16                             *PanicString
  );


VOID
EFIAPI
PanicEx (
  IN EFI_SYSTEM_CONTEXT  SystemContext OPTIONAL,
  IN BOOLEAN                          ProcessSystemContext,
  IN CHAR16                             *PanicString
  );

VOID
EFIAPI
PanicExData (
  IN VOID *Data,
  IN UINT DataSize
  );

VOID
EFIAPI
PanicExDone (
  VOID
  );

The simple form is:
Panic (NULL, L”MRC: Memory training failed due to XYZ thing-a-ma-bob failure”);

The extended form example from the exception handler:
PanicEx (SystemContext, FALSE, L”Exception 13: GP fault address: 0xAFAFAFAFAFAFAFAF”);
PanicExData (StackTrace, SizeOfStackTrace);
PanicExData (FirmwareVersion, SizeFirmwareVersion);
PanicExDone ();

The reason for having multiple calls to PanicExData() is you might be limited to a smallish buffer on the stack that you want to iterate over to do more complex logging. The 4th rule of panic implies the agent getting the data may have a lot more resources than the panicking code, so you might be able to send a lot of data.  

The idea behind ProcessSystemContext is if you are called from arbitrary code you might what to process SystemContext in a way similar to the unexpected exception handler. If you are in the unexpected exception handler maybe you want to let that code do the processing for you? I guess this implies we may want to make some worker libs for processing exception data and building stack frames etc. But that is more of an optimization after we add the Panic API.

The other things we could think about adding to the PanicEx is a UUID that could start the stream of data (or follow the panic string) and it could define the encoding of the panic data. This would make it easy to post process the panic data. Even if the panic data was just text it would point at which automated parser you could use to extract useful info out of the text stream. If you want to log locally, think DEBUG print then maybe you want more type info about Data. So maybe you want UINT8 vs CHAR8 vs CHAR16 API options for PanicExData(). 

I thought I’d get the conversation started, looking forward to seeing what others think.

Thanks,

Andrew Fish

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-02-11  1:02 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-02-03  4:37 I think we need a Panic API Andrew Fish
2022-02-03 17:05 ` [edk2-devel] " Michael D Kinney
2022-02-10  2:54 ` Ni, Ray
2022-02-10  3:37   ` [edk2-devel] " Jeff Fan
2022-02-10  5:33     ` Ni, Ray
2022-02-11  1:02       ` Andrew Fish

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox