Bad Log May Drive Crazy. Good Log May Become a Magic Wand

VulfL

4.86/5 (11 votes)

Jan 15, 2020

CPOL

6 min read

7894

Rules of thumb to write useful and readable logs in software system

Introduction

In this article, I would like to share my personal experience of writing logs. Bad log may drive crazy. Good log may become a magic wand. I have written several logs during my career and came to my rules of thumbs. I cannot say that my logs became an ultimate magic wand for me, but definitely when I obey these rules of thumb, I have significantly less nightmares. This article is not about which logging library is better - it's just about how to write the log.

Assumptions

System contains at least 2 modules (i.e., server and client).
There are at least 100 users, which consume the system simultaneously.
There are at least 2 different flows in which user can consume the system (application).

Rules

Log is deliverable and should be treated and tested just like we treat UI, API and other components of software system.

This rule means - log is not a rubbish bin, where we throw anything we want and when we want. Log is a witness, to whom we are going to ask the following questions:
- What happened?
- When did it happen?
- Why did it happen?
- Is the client claim correct?
So, log should be designed just like we design all the components of the software.
Log changes as system is being developed.

This rule says: when we start to develop a software system, usually, we do not know how it would be changed in the future and which requirements we will get from product department or from users a year or 2 years since the first version of the software was released. Software can be changed so a log should be changed accordingly.
Log must have the unified timeline.

As it was stated in the assumptions section of the article, our software system has at least 2 components (for example, client and server). Client module runs at end user’s machine and it has its own time zone, which can be different from the server's one. I would suggest to use UTC for all log components. Here, you can ask: what if we want to know the time of the event at local client’s time. Then I would propose to use one more field where I would store client’s local time stamp of the event.
The log file should be of such size, that your preferable text editor can find it easy to load it and search.

The log file of each component should not be too large. It should be easily loaded by your preferable editor. If the component is verbose and writes a lot of logs, I would propose to use file rolling. Most of the open source logging libraries allow to define file rolling parameter which depends on file size. If log exceeds the size, the library creates a new file and adds an index.
Logs of all software system components should be sinked in one repository, which it would be convenient to query and investigate.

If software system consists of several components, each of them writes its log file specific location. The log file is rolled - it will be very complicated to combine all files and investigate some cross component issue. The log repository will solve the issue. By the way, if we use repository, we can ignore rule #4.
Log should be monotonous.

This rule says: if in line X, the log states that the system was in state A at time T0. At some moment, T1 software system moved to state B, this means there should be an explicit line in the log which states that the system state was changed to B at time T1. In other words: every system state change should be written explicitly in the log file. We should not guess in what state the system was - we must know it explicitly from the log.
If the system made the decision, it should be written in the log.

This rule can be demonstrated in an example. For example, we are logging user authentication process. We must write either user was authenticated or not. Both decisions should be written in the log. Reading the log afterwards, we explicitly understand exactly what decision was made by the authentication process. Writing only negative decisions (such approach exists as well) we may partially affect the rule #6: the may log becomes not monotonous.
Log messages should be associated with source code.

Reading the log, we should be able to easily find the source code section, which is responsible for a particular log message. Some frameworks allow to add file name and line number.
There are more important messages, than others.

There are lines which are more important to us than others. For example, we would like to know in the fastest way exception occurred in the system. Usually, we use severity for each log message. The most common severities are FATAL - for fatal exception, ERROR, for error, INFO and DEBUG. Sometimes, there are more severities like WARNING, TRACE, etc.
Log should be continuous.

Let’s imagine the user complains he/she saw an error message. Our Dev or DevOps team accesses the log repository queries for the error message and finds the appropriate log message. This is fine, but almost useless. Why? Because it is good that we know the user is right, but it would be much better if we could know what was the root cause of the error in this specific case. In order to be able to investigate, obviously we will need to see some log messages, which were written for this user a few moments before the error had occurred. To solve this issue, I propose the continuous approach. Continuous approach says: when user starts some flow, we will assign to it unique id. We will sign with this unique id each log message of this flow. When the flow will end, the id will be dismissed. So coming back to the example, once we found the log message, which relates to the error message user has seen, by the unique flow id, we will be able to find all the log messages written before the error was shown. If the log was written according to these rules, we will be able to easily understand what led the system to the error message for this specific user.
We are writing logs in order to read them sometime.

The log is written to read it - so it should be clear ! Preferably, no typos. According to rule #1: we should treat the log as a deliverable (for example UI). We will not release UI either not clear or with typo, so we should not release log neither with typo nor not clear.

Conclusion

Above are my rules of thumb, when I write logs. What are yours? What do you think about it? I will be happy to discuss and enrich mine.

History

2020-01-12 - First version