Anyone who has been on production support can testify that the time taken to resolve issues is inversely proportional to the effectiveness of application logs. To understand what to log and, more importantly, what not to log, read on!
Logging is to engineers as X-ray is to doctors. With respect to an application, logging brings visibility under the hood.
Understanding what to log and what not to log is paramount. An application at my company, handling a throughput of 250k requests per minute (rpm), was logging every request. Converging on the root cause of an issue from those logs was like finding a needle in a haystack.
To Log or Not To Log is an objective question when looked at from a production support standpoint. Log levels help us maintain sanity. Let’s understand the role of levels in logging.
When an application logs, it’s sending a message to its owners. Messages have different meanings and warrant different actions at different levels.
The severity of log levels in increasing order is: debug < info < warn < error < fatal < panic. A level can be configured for an application to keep its logs manageable. The application will only log messages whose severity is at or above the configured level.
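To make the filtering rule concrete, here is a minimal sketch using Python's standard `logging` module. It is only an illustration: Python has no panic level and treats FATAL as an alias of CRITICAL, so the sketch stops at CRITICAL. The helper `emitted_levels` and the logger names are made up for this example.

```python
import io
import logging


def emitted_levels(configured_level):
    """Return the names of the levels that get through at configured_level."""
    stream = io.StringIO()
    # Hypothetical logger name, one per configured level to keep runs isolated.
    logger = logging.getLogger(f"oms.demo.{configured_level}")
    logger.setLevel(configured_level)
    logger.propagate = False  # keep the demo output away from the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s"))
    logger.addHandler(handler)

    # One message per level, lowest severity first.
    logger.debug("eligible drivers computed")
    logger.info("order completed")
    logger.warning("payment service SLA breached, retrying")
    logger.error("error paying driver")
    logger.critical("out of disk space")

    logger.removeHandler(handler)
    return stream.getvalue().split()


print(emitted_levels(logging.WARNING))  # ['WARNING', 'ERROR', 'CRITICAL']
```

With the level set to WARNING, the debug and info messages are dropped and everything from warning upwards is written.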
We’ll take the example of an Order Management Service, aka OMS, to understand what should be logged. It handles orders for a ride-hailing service. The flow of OMS is: customer requests a ride -> OMS assigns a driver to the ride -> ride reaches completion -> driver gets payment for the ride.
Info
These messages are lifecycle events. They highlight the progress of an application.
Example: application start/stop; order completion in OMS
Debug
These messages help in figuring out weird stuff. Logging at debug level means tracking state changes at every step of an application.
Example: Let’s say a driver accepts order ABC, but OMS assigns order DEF. To find the root cause of such issues, log all events: creation of the order, the list of eligible drivers for the order, acceptance of the booking by a driver, and saving of state in OMS.
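As a rough sketch of what that looks like in code, the function below logs each state change at debug level. `assign_driver`, its parameters and the field names are hypothetical; the real OMS will look different.

```python
import logging

logger = logging.getLogger("oms")


def assign_driver(order_id, eligible_drivers, accepted_by):
    """Hypothetical assignment step that logs every state change at debug level."""
    logger.debug("order created: order_id=%s", order_id)
    logger.debug("eligible drivers: order_id=%s drivers=%s", order_id, eligible_drivers)
    logger.debug("booking accepted: order_id=%s driver_id=%s", order_id, accepted_by)
    assignment = {"order_id": order_id, "driver_id": accepted_by}
    logger.debug("state saved: %s", assignment)
    return assignment


if __name__ == "__main__":
    # Debug output only appears when the configured level allows it.
    logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(name)s %(message)s")
    assign_driver("ABC", ["D1", "D2"], "D1")
```

Putting the order id in every message is what lets you line up "driver accepted ABC" against "OMS saved DEF" when something goes wrong.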
Warn
These messages flag potentially harmful situations. These issues need to be fixed but may not require immediate intervention.
Example: Let’s say OMS talks to a payment service for paying drivers. On breach of SLA with the payment service, OMS will time out and retry the request. Log at warn level: "payment service SLA breached. Retrying"
Error
These messages are of considerable importance. Log at error level when the normal flow of execution is blocked and fixing the issue requires human intervention. Keep an eye out for error logs so that you can fix issues manually and keep the system consistent.
Example: If OMS has exhausted its driver payment retries to the payment service, log "Error paying driver XYZ for order ABC". Later, pay the driver manually.
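The payment retry flow from the warn and error examples could be sketched like this. Everything here is an assumption made for illustration: `pay` stands in for a payment service client that raises `TimeoutError` on an SLA breach, and the default of three attempts is arbitrary.

```python
import logging

logger = logging.getLogger("oms.payments")


def pay_driver(pay, driver_id, order_id, max_attempts=3):
    """Call pay() up to max_attempts times; return True on success.

    `pay` is a stand-in for the payment service client and is assumed
    to raise TimeoutError when the SLA is breached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            pay(driver_id, order_id)
            return True
        except TimeoutError:
            if attempt < max_attempts:
                # Still recoverable: the request will be retried.
                logger.warning(
                    "payment service SLA breached, retrying: order_id=%s attempt=%d/%d",
                    order_id, attempt, max_attempts,
                )
    # Retries exhausted: a human has to pay the driver manually.
    logger.error(
        "error paying driver %s for order %s after %d attempts",
        driver_id, order_id, max_attempts,
    )
    return False


def always_times_out(driver_id, order_id):
    """Fake payment call that always breaches the SLA."""
    raise TimeoutError("payment service did not respond in time")


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    pay_driver(always_times_out, "XYZ", "ABC")
```

The same failure is a warning while a retry is still pending and becomes an error only once nothing automatic is left to try.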
Fatal
These messages are severe events that might cause the application to terminate. There should be immediate action when the application logs at fatal level.
Example: Running out of disk space, total loss of network connectivity
Panic
These messages are exceptional scenarios which should be fixed immediately. Log at this level when the application panics and then either terminates or recovers.
Example: database wasn’t configured, number divided by 0
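Python has no panic level, so the sketch below only approximates the idea: it catches an unexpected exception, logs it at CRITICAL with the traceback, and recovers. `run_guarded` and `divide_by_zero` are hypothetical names.

```python
import logging

logger = logging.getLogger("oms")


def run_guarded(task):
    """Run task() and report whether it finished without an unexpected exception."""
    try:
        task()
        return True
    except Exception:
        # CRITICAL is the closest stand-in for panic in Python's logging;
        # exc_info=True attaches the traceback to the log record.
        logger.critical("unexpected failure, recovered", exc_info=True)
        return False


def divide_by_zero():
    """Deliberately broken task, mirroring the divide-by-0 example."""
    return 1 / 0


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    run_guarded(divide_by_zero)
```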
Plan of Action
In a production environment, ideally, the application should only log at warn level and above. The application should log at debug and info level only in staging, integration and UAT environments, for the benefit of engineers and QAs.
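One way to wire this up is to pick the level from the deployment environment. This is a sketch under assumptions: the `APP_ENV` variable and the environment names are invented for the example, and unknown environments fall back to the quieter production setting.

```python
import logging
import os

# Hypothetical environment names mapped to the level each should log at.
LEVEL_BY_ENV = {
    "production": logging.WARNING,
    "staging": logging.DEBUG,
    "integration": logging.DEBUG,
    "uat": logging.DEBUG,
}


def level_for(env):
    """Return the log level for an environment name (case-insensitive)."""
    # Anything unrecognised is treated like production.
    return LEVEL_BY_ENV.get(env.lower(), logging.WARNING)


def configure_logging():
    """Configure the root logger from the assumed APP_ENV variable."""
    env = os.environ.get("APP_ENV", "production")
    logging.basicConfig(
        level=level_for(env),
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )


if __name__ == "__main__":
    configure_logging()
    logging.getLogger("oms").info("application started")  # hidden in production
```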
Warn-level issues should be tracked and fixed. Error-level issues should be fixed ASAP. Fatal- and panic-level issues warrant immediate attention.
Retaining logs beyond 3 days defeats the purpose of logs. Ideally logs shouldn’t be retained beyond 24 hours, but considering no work on weekends, adding 48 hours should be more than enough.
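If logs are written to local files, a three-day retention can be approximated with the standard library's `TimedRotatingFileHandler`. This is a sketch, not a recommendation for any particular setup: the file name and the `build_file_handler` helper are assumptions, and old files are only deleted when a rollover happens in the running process.

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def build_file_handler(path, retention_days=3):
    """Create a handler that rotates daily and keeps retention_days old files."""
    handler = TimedRotatingFileHandler(
        path,
        when="midnight",             # start a new file every day
        backupCount=retention_days,  # delete rotated files beyond this count
        delay=True,                  # do not create the file until the first write
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    return handler


if __name__ == "__main__":
    logger = logging.getLogger("oms")
    logger.setLevel(logging.WARNING)
    logger.addHandler(build_file_handler("oms.log"))  # hypothetical file name
    logger.warning("payment service SLA breached, retrying")
```

Note that this keeps the current day's file plus three rotated ones, so the window is roughly, not exactly, three days.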
I hope this blog was helpful 😀. Please leave your thoughts in the comments section below. You can reach out to me on email: akshat.iitj⚽gmail.com