Question

As Jeff Atwood asked: "What’s your logging philosophy? Should all code be littered with .logthis() and .logthat() calls? Or do you inject logging after the fact somehow?"

Solution

My logging philosophy is pretty easily summarized in four parts:

Auditing, or business logic logging

Log those things that are required to be logged. This comes from the application requirements, and may include logging every change made to any database (as in many financial applications) or logging accesses to data (as may be required in the health industry to meet industry regulations).

Because this is part of the program requirements, many people do not include it in their general discussions of logging. However, these areas overlap, and for some applications it is useful to consider all logging activities together.

Program logging

Messages which will help developers test and debug the application, and more easily follow the data flow and program logic to understand where implementation, integration, and other errors may exist.

In general this logging is turned on and off as needed for debugging sessions.

Performance logging

Add logging later, as needed, to find and resolve performance bottlenecks and other issues which aren't causing the program to fail but whose resolution will lead to better operation. This overlaps with program logging in the case of memory leaks and some non-critical errors.

Security logging

Logging user actions and interactions with external systems where security is a concern. Useful for determining how an attacker broke a system after an attack, but may also tie into an intrusion detection system to detect new or ongoing attacks.

OTHER TIPS

I work with safety-critical real-time systems and logging is often the only way to catch rare bugs that only turn up every 53rd Tuesday when it's a full moon, if you catch my drift. This kind of makes you obsessive about the subject, so I'll apologise now if I start to froth at the mouth.

I design systems which are capable of logging pretty much everything, but I don't turn everything on by default. The debug information is sent to a hidden debug dialog which timestamps it and outputs it to a listbox (limited to around 500 lines before deletion), and the dialog allows me to stop it, save it to a log file automatically, or divert it to an attached debugger such as DBWin32. That diversion allows me to see the debug output from multiple applications all neatly serialized, which can be a life saver sometimes. The log files are automatically purged every N days. I used to use numeric logging levels (the higher you set the level, the more you capture):
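To make that concrete, here is a minimal sketch of such a sink, assuming a Win32 environment; the DebugSink class and all of its names are my own illustration, not the poster's code. OutputDebugString is the call that viewers like DBWin32 actually capture.

#include <windows.h>   // OutputDebugStringA
#include <cstddef>
#include <ctime>
#include <deque>
#include <string>

// Hypothetical sink mirroring the dialog described above: timestamps each
// message, keeps roughly the last 500 lines for display, and can divert
// output to an attached viewer such as DBWin32.
class DebugSink {
public:
    void Log(const std::string& msg) {
        char stamp[32];
        std::time_t now = std::time(nullptr);
        std::strftime(stamp, sizeof stamp, "%H:%M:%S ", std::localtime(&now));
        std::string line = stamp + msg;

        if (divertToDebugger_) {
            // The viewer serializes this with every other app's debug output.
            OutputDebugStringA((line + "\n").c_str());
        } else {
            lines_.push_back(line);          // backs the dialog's listbox
            if (lines_.size() > kMaxLines)   // cap the backlog before deletion
                lines_.pop_front();
        }
    }

    void DivertToDebugger(bool on) { divertToDebugger_ = on; }

private:
    static constexpr std::size_t kMaxLines = 500;
    std::deque<std::string> lines_;
    bool divertToDebugger_ = false;
};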

  • off
  • errors only
  • basic
  • detailed
  • everything

but this is too inflexible - as you work your way towards a bug it's much more efficient to be able to focus logging in on exactly what you need without having to wade through tons of detritus, and it may be one particular kind of transaction or operation that causes the error. If that requires you to turn everything on, you're just making your own job harder. You need something finer-grained.

So now I'm in the process of switching to logging based on a flag system. Everything that gets logged has a flag detailing what kind of operation it is, and there's a set of checkboxes allowing me to define what gets logged. Typically that list looks like this:

#define DEBUG_ERROR          1
#define DEBUG_BASIC          2
#define DEBUG_DETAIL         4
#define DEBUG_MSG_BASIC      8
#define DEBUG_MSG_POLL       16
#define DEBUG_MSG_STATUS     32
#define DEBUG_METRICS        64
#define DEBUG_EXCEPTION      128
#define DEBUG_STATE_CHANGE   256
#define DEBUG_DB_READ        512
#define DEBUG_DB_WRITE       1024
#define DEBUG_SQL_TEXT       2048
#define DEBUG_MSG_CONTENTS   4096
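A sketch of how such flags might be wired up - the g_debugFlags mask, the LOG macro, and WriteToLogFile are hypothetical names of mine, not part of the original system:

#include <cstdio>

// Hypothetical global mask, set from the debug dialog's checkboxes (or the
// registry/ini/cfg setting mentioned below); initialized to the shipping
// defaults described below.
static unsigned g_debugFlags = DEBUG_ERROR | DEBUG_BASIC
                             | DEBUG_STATE_CHANGE | DEBUG_EXCEPTION;

// Stand-in for the real file writer.
static void WriteToLogFile(const char* msg) { std::puts(msg); }

// Each call site states what kind of operation it is; the mask decides
// whether anything is emitted.
#define LOG(flag, msg)                  \
    do {                                \
        if (g_debugFlags & (flag))      \
            WriteToLogFile(msg);        \
    } while (0)

// Usage: emitted only when the DB-write checkbox is ticked.
// LOG(DEBUG_DB_WRITE, "committed order record");

Ticking a checkbox just ORs another bit into the mask, so logging can be focused on one kind of operation without a recompile.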

This logging system ships with the release build, turned on and saving to file by default. It's too late to find out you should have been logging AFTER the bug has occurred, if that bug only occurs once every six months on average and you have no way of reproducing it.

The software typically ships with ERROR, BASIC, STATE_CHANGE and EXCEPTION turned on, but this can be changed in the field via the debug dialog (or a registry/ini/cfg setting, where these things get saved).

Oh, and one thing - my debug system generates one file per day. Your requirements may be different, but make sure your debug code starts every file with the date, the version of the code you're running, and if possible some marker for the customer ID, the location of the system, or whatever. You can get a mish-mash of log files coming in from the field, and you need a record, in the data itself, of what came from where and which version of the system produced it. You can't trust the customer or field engineer to tell you what version they've got - they may just tell you what version they THINK they've got. Worse, they may report the exe version that's on disk while the old version is still running, because they forgot to reboot after replacing it. Have your code tell you itself.
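As a sketch of that self-identifying header (the function and its parameters are illustrative; the version and customer strings would come from wherever your build embeds them):

#include <cstdio>
#include <ctime>

// First line of every daily log file: date, running build, and site,
// recorded by the code itself rather than reported by a field engineer.
void WriteLogHeader(std::FILE* log, const char* version, const char* customerId) {
    char date[16];
    std::time_t now = std::time(nullptr);
    std::strftime(date, sizeof date, "%Y-%m-%d", std::localtime(&now));
    std::fprintf(log, "=== %s | version %s | site %s ===\n",
                 date, version, customerId);
}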

That's my brain dumped...

I think you should always, always, always add logging when there is an exception, including the message and the full stack trace. Beyond that, I think it's pretty subjective and depends on whether or not you actually use the logs often...

I often try to add logging only in critical places where what I am logging should very rarely be hit; otherwise you get the problem, mentioned earlier, of logs that grow way too big... this is why error cases are the ideal thing to always log (and it's great to be able to see when these error cases are actually being hit so you can inspect the problem further).

Another good thing to log is assertion failures: if you have assertions and an assertion fails, log it. For example, if a query should return under 10 results and it returns more, there may be a problem, so log it. Of course, if a log statement ends up filling the logs, that's probably a hint to either move it to some sort of "debug" level, or to adjust or remove the statement. If the logs grow too big, you will often end up ignoring them.
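A minimal sketch of that kind of expectation check, assuming a hypothetical Row type and a plain stderr warning:

#include <cstdio>
#include <vector>

struct Row { int id; };   // stand-in for a real query result row

void CheckQuerySize(const std::vector<Row>& results) {
    // Expectation from the text: this query should be under 10 results.
    if (results.size() >= 10) {
        std::fprintf(stderr, "WARN: query returned %zu rows, expected < 10\n",
                     results.size());
    }
}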

I take what I consider a traditional approach; some logging, surrounded by conditional defines. For production builds, I turn off the defines.
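Something along these lines, where ENABLE_LOGGING is an illustrative build flag and WriteToLogFile a stand-in for the real writer:

#ifdef ENABLE_LOGGING
    #define TRACE(msg) WriteToLogFile(msg)   // debug builds: real call
#else
    #define TRACE(msg) ((void)0)             // production: compiles away
#endif

With the define off, the calls vanish from the production binary entirely, so there is no runtime cost.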

I choose to log deliberately as I go, as this means the log data is meaningful:

  • Depending on the logging framework, you can add level/severity/category information so that the log data can be filtered
  • You can make sure that the right level of information is present: not too much, not too little
  • You know, when writing the code, which things are most important, and can therefore ensure they are logged

Using some form of code injection, profiling or tracing tool to generate logs would most likely generate verbose, less useful logs that would be harder to dive into. They may be useful as a debugging aid, however.

I start by asserting a lot of conditions in my code (in C#, using System.Diagnostics.Debug.Assert), but I add logging only where I find, while debugging or putting the system under stress, that I really need a way to follow what's happening inside my code without having a debugger permanently attached.

Otherwise, I prefer using Visual Studio's capability to put traces in the code as special breakpoints (i.e. you insert a breakpoint and right-click it, then select "When hit..." and tell it what to display in that case). There is no need to recompile and it is easy to enable/disable the traces on the fly.

If you're writing a program that will be used by many people, it's best to have some kind of mechanism to choose what will be logged and what won't. One argument in favor of .logthis() functions is that they can be an excellent replacement for inline comments in some instances (if done properly).

Plus, it helps you narrow down EXACTLY where an error is occurring.

Log 'em all and let Grep sort 'em out.

I agree with Adam, but I would also consider logging things of interest, or things that you can point to as achievements, as a kind of proof that they happened.

I define a variety of levels and pass in a setting with the config / invocation.

If you really need logging in your system then your tests are crap, or at the very least incomplete and not very thorough. Everything in your system should be a black box as much as possible. Notice how core classes like String don't need logging - the primary reason being that they are very well tested and perform as specified. No surprises.

I use logging as a way to narrow down issues that do not reproduce in our unit tests, or even when repeating the same steps the user provided: those rare glitches that only show up on some very remote hardware (and that are sometimes, albeit very rarely, caused by a driver or third-party library glitch outside of our control).

I agree with the comment that this should all be caught by our testing procedure, but it's difficult for a million+ LOC codebase that demands very low-level, performance-critical code to ever meet that standard. I don't work in mission-critical software, but I do work in the graphics industry, where we often have to do everything from implementing memory allocators to writing GPU and SIMD code.

Even with very modular, loosely-coupled or even completely decoupled code, the system interactions can lead to very complex inputs and outputs, with behavior varying between platforms, where occasionally we have that rogue edge case which eludes our tests. Modular black boxes can be very simple, but the interactions between them can get very complex and lead to the occasional unanticipated edge case.

As an example of a case where logging saved my butt: one time I had an odd user with a prototype Intel machine that was crashing. We listed minimum requirements for machines that should have guaranteed SSE 4 support, but this particular machine met those requirements and still did not support Streaming SIMD Extensions past SSE 3, in spite of being a 16-core machine. Discovering that quickly was made possible by looking at his log, which showed precisely the line number where the SSE 4 instructions were used. None of us on the team could reproduce the issue, nor could a single other user who participated in verifying the report. Ideally we should have written code paths for older SIMD versions, or at least done some branching and checking to make sure the hardware met the minimum requirements, but we wanted to make a firm assumption, communicated through the minimum hardware requirements, for simplicity and economy. Here, perhaps, it's arguable that it was our minimum system requirements that had the "glitch".

Given the way I use logging here, we tend to get fairly large logs. However, the goal is not readability -- what's typically important is the last line of a log sent in with a report when the user experiences a crash of some sort that none of us on the team (let alone few other users in the world) can reproduce.

Nevertheless, one trick I employ regularly to avoid excessive log spam rests on the assumption that a piece of code which executes once successfully will also do so subsequently (not a hard guarantee, but often a reasonable one). So I often employ a log_once kind of function for granular functions, to avoid paying the cost of logging on every call.
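A sketch of the log_once idea, using a function-local static so each call site pays for the write only once (not thread-safe as written; a real version might use std::once_flag):

#include <cstdio>

// Emits the message the first time this call site executes; afterwards
// the same site costs a single branch and no I/O.
#define LOG_ONCE(msg)                               \
    do {                                            \
        static bool alreadyLogged = false;          \
        if (!alreadyLogged) {                       \
            alreadyLogged = true;                   \
            std::fprintf(stderr, "%s\n", (msg));    \
        }                                           \
    } while (0)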

I don't sprinkle log outputs all over the place (I might if I had the time). Typically I reserve them for the areas that seem most dangerous: code invoking GLSL shaders, e.g. (GPU vendors vary wildly here in terms of capability and even how they compile the code), code using SIMD intrinsics, very low-level code, code that inevitably has to rely on OS-specific behavior, low-level code making assumptions about the representation of PODs (e.g. code that assumes 8 bits to a byte) -- the kind of cases where we would likewise sprinkle a lot of assertions and sanity checks as well as write the greatest number of unit tests. Typically this is enough, and logging has saved my butt many times where I would otherwise have faced an unreproducible issue and had to take blind stabs at the problem, requiring many iterations of bouncing attempted fixes off the one user in the world who could reproduce it.
