October 30, 2024

While Scout provides powerful monitoring tools (try it now!), mastering logging is an awesome complement to them. In this post, we’ll see how to create readable, actionable logs for both humans and machines. You’ll improve your logging strategy, drastically reduce troubleshooting time, and put yourself in the best possible position for maximum observability.

 As a starting example, let’s take this error log:


Traceback (most recent call last):
  File "/Users/code/app.py", line 19, in 
    engine = create_engine(DATABASE_URL)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "", line 2, in create_engine
  File "/Users/code/env/lib/python3.12/site-packages/sqlalchemy/util/deprecations.py", line 281, in warned
    return fn(*args, **kwargs)  # type: ignore[no-any-return]
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/code/env/lib/python3.12/site-packages/sqlalchemy/engine/create.py", line 550, in create_engine
    entrypoint = u._get_entrypoint()
                 ^^^^^^^^^^^^^^^^^^^
  File "/Users/code/env/lib/python3.12/site-packages/sqlalchemy/engine/url.py", line 758, in _get_entrypoint
    cls = registry.load(name)
          ^^^^^^^^^^^^^^^^^^^
  File "/Users/Users/env/lib/python3.12/site-packages/sqlalchemy/util/langhelpers.py", line 375, in load
    raise exc.NoSuchModuleError(
sqlalchemy.exc.NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:postgres

No such module as sqlalchemy?! (You could swear you’ve installed sqlalchemy. Yes, you checked, and you did.) Can’t load a plugin? And what is a “dialect” here?

So, you do what any seasoned developer would do and ask ChatGPT. It turns out the DATABASE_URL was slightly wrong for this use case; ChatGPT suggests a one-line code change, and that works!
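(For the curious: the likely culprit here is the URL scheme. Recent SQLAlchemy versions only recognize postgresql://, not the legacy postgres:// prefix, so the fix is usually a one-line change along these lines; the credentials below are placeholders.)

from sqlalchemy import create_engine

# Fails on recent SQLAlchemy: the "postgres" dialect name is no longer registered
# DATABASE_URL = "postgres://user:secret@localhost:5432/appdb"

# Works: use "postgresql" (or an explicit driver, e.g. "postgresql+psycopg2")
DATABASE_URL = "postgresql://user:secret@localhost:5432/appdb"

engine = create_engine(DATABASE_URL)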

…but should we have had to ask an AI to parse a log message? Shouldn’t the message be clear already? If ChatGPT knows, surely the software reporting the error should also know?

Yes, it should – and it should tell you, too!

The psychology of logs

Another example: it’s 11 p.m. on a Friday, and you’re trying to figure out why production has stalled. You look through CloudWatch and see something like this: 


[2024-08-02 23:17:45] APP-SERVER-1 WARNING: High memory usage detected (85%)
[2024-08-02 23:17:46] DB-SERVER-1 INFO: Slow query detected, execution time: 1500ms
[2024-08-02 23:17:47] CACHE-1 NOTICE: Cache miss rate increased to 25%
[2024-08-02 23:17:48] APP-SERVER-1 ERROR: Request timed out for /api/v1/users
[2024-08-02 23:17:49] LOAD-BALANCER-1 INFO: Server APP-SERVER-2 removed from pool
[2024-08-02 23:17:50] APP-SERVER-1 WARNING: Connection pool approaching limit (current: 95, max: 100)
[2024-08-02 23:17:51] DB-SERVER-1 ERROR: Lock wait timeout exceeded; try restarting transaction
[2024-08-02 23:17:52] APP-SERVER-1 ERROR: Internal Server Error (500) returned to client
[2024-08-02 23:17:53] CACHE-1 WARNING: Eviction rate spiked to 15%
[2024-08-02 23:17:54] APP-SERVER-1 INFO: Garbage collection completed in 500ms
[2024-08-02 23:17:55] DB-SERVER-1 WARNING: Slow query detected, execution time: 2500ms
[2024-08-02 23:17:56] APP-SERVER-1 ERROR: Failed to process payment for order #12345
[2024-08-02 23:17:57] LOAD-BALANCER-1 ERROR: Health check failed for APP-SERVER-1
[2024-08-02 23:17:58] CACHE-1 ERROR: Out of memory error, some keys evicted
[2024-08-02 23:17:59] APP-SERVER-1 CRITICAL: Service unresponsive, initiating restart

Here are some options if faced with such a log dump in a high-stress situation:

  1. Manual analysis: You could try to piece together the sequence of events. However, this is time-consuming and error-prone, especially under pressure.
  2. Grep and prayer: You might frantically search for specific error messages, hoping to stumble upon the root cause.
  3. Shotgun debugging: In desperation, you might start restarting services randomly, hoping it fixes the issue without understanding why.

What is not so obvious is, well, how to properly troubleshoot this issue. Reading it calmly, detached from the actual scenario, perhaps you can piece together what is happening. But in a real-life crisis, when the engineer is tired (and maybe unfamiliar with this particular issue), things aren’t so clear.

Whenever information needs to be shared in a stressful situation, that information has to be clear, concise, and actionable. When an engineer is already under pressure, possibly sleep-deprived, and racing against time, asking them to parse through a wall of text and piece together a narrative sets them up for failure.

Several negative outcomes are possible:

  1. Cognitive overload can lead to increased error rates, decision paralysis, and situations where engineers might latch onto familiar patterns or issues, potentially missing the root cause.
  2. The impact of poorly designed logs extends beyond the immediate incident. It can affect the entire organizational culture around incident response.
  3. Ambiguous logs also make it harder for new team members to learn from past incidents, slowing down their onboarding and reducing the team’s overall resilience.

Let’s talk about how to solve this.

#1: Treat logs as first-class citizens

Logs are often poorly thought out because no one treats them as a key part of the code. They’re often an afterthought, hastily added just before deployment or, worse, in response to a production issue. This approach fundamentally misunderstands the role and importance of logs in an application’s lifecycle.

Logs are not just a debugging tool but the core of your observability. They tell the story of your application over time, capturing its behavior, performance, and health. In many ways, logs are the closest thing we have to a time machine in software development, allowing us to travel back to specific moments and understand what happened.

Yet, despite their crucial role, logs are frequently neglected in the development process. They’re rarely reviewed in code reviews with the same rigor as functional code, seldom tested or validated, and often the first thing to be cut when deadlines loom. This neglect stems from a misconception that logs are separate from an application’s core functionality.

Treating logs as first-class citizens in your codebase means:

  • Designing them alongside your application architecture
  • Evolving them as your application evolves
  • Reviewing and refining them regularly
  • Testing them to ensure they provide value
  • Considering their performance impact and storage requirements

By elevating logs to first-class status, we acknowledge their importance in the development process and ensure they receive the attention they deserve.

#2: Consider logs as deliverables

Viewing logs as deliverables emphasizes their value beyond the development team. They’re used by operations teams, customer support, and even business analysts. They inform decisions about system health, user behavior, and business metrics. In many organizations, logs feed into monitoring systems, alerting mechanisms, and data analytics pipelines. They’re a critical part of the entire software ecosystem.

As deliverables, logs should be:

  • Planned and designed with intention for various stakeholders
  • Developed to meet specific requirements and use cases
  • Tested for accuracy and usefulness in real-world scenarios
  • Maintained and updated over time to reflect changing needs
  • Documented for ease of use and understanding by different teams

Logs are for life–the life of your application, the life of your system, and often, the professional lives of the people working with your software. They’re not just for debugging today’s issues but for understanding tomorrow’s challenges and opportunities.

#3: Make logs readable

There are multiple needles to thread when it comes to making logs readable: 

  1. We want to make logs readable to humans (but that doesn’t mean long sentences of explanations). 
  2. We need the information to be clear and concise, so our 11pm engineer can understand the problem and fix it. 
  3. We also need to think about how machines will read these errors. Monitoring and alerting tools need to parse, aggregate, and alert on them, and that is much harder if the logs are written for human eyes only.

So, for humans and machines, the first consideration must be structure. If we look back at our poor logs from earlier, we can see they lack this:


[2024-08-02 23:17:45] APP-SERVER-1 WARNING: High memory usage detected (85%)
[2024-08-02 23:17:46] DB-SERVER-1 INFO: Slow query detected, execution time: 1500ms
[2024-08-02 23:17:47] CACHE-1 NOTICE: Cache miss rate increased to 25%
[2024-08-02 23:17:48] APP-SERVER-1 ERROR: Request timed out for /api/v1/users
…

Yes, these logs have a lot of detail, but they are tough to parse (for both humans and machines). A better option would be to structure each message within a format such as JSON or XML:


{
  "timestamp": "2024-08-02T23:17:45.123Z",
  "level": "WARNING",
  "service": "app-server-1",
  "event": "HIGH_MEMORY_USAGE",
  "message": "High memory usage detected",
  "data": {
    "memory_usage_percent": 85,
    "total_memory_mb": 16384,
    "used_memory_mb": 13926,
    "trace_id": "abc123"
  }
}

{
  "timestamp": "2024-08-02T23:17:46.234Z",
  "level": "INFO",
  "service": "db-server-1",
  "event": "SLOW_QUERY_DETECTED",
  "message": "Slow query detected",
  "data": {
    "query_id": "q-987",
    "execution_time_ms": 1500,
    "query_snippet": "SELECT * FROM large_table WHERE...",
    "trace_id": "abc123"
  }
}

This is immediately more machine-parseable and human-readable! It also contains more information while remaining (somewhat) concise. It has a clear hierarchy and consistent fields and key-value pairs, which makes scanning easier. It also has flexibility (the data object allows for additional context specific to each event type). Plus, we now have a trace_id for connecting related events across services.
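If you’re wondering how to produce logs like this, here’s a minimal sketch using Python’s standard logging module (the service name and values are placeholders; dedicated JSON-logging libraries offer the same idea with less boilerplate):

import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "app-server-1",  # placeholder; read from config in practice
            "event": getattr(record, "event", "UNSPECIFIED"),
            "message": record.getMessage(),
            "data": getattr(record, "data", {}),
        }
        return json.dumps(payload)

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Structured fields ride along via the "extra" dict and land on the record.
logger.warning(
    "High memory usage detected",
    extra={
        "event": "HIGH_MEMORY_USAGE",
        "data": {"memory_usage_percent": 85, "trace_id": "abc123"},
    },
)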

The second consideration is context. The above “good” version is better because it contains more information from a business context. In particular, the “query_snippet”: “SELECT * FROM large_table WHERE…” is helpful for engineers trying to isolate the problem.

However, as we said, logs should be viewed as a deliverable beyond engineers. Thus, a data object could contain more information depending on the end user for this particular log. 

The IETF has addressed this problem for HTTP errors with RFC 9457 (Problem Details for HTTP APIs). It standardizes attaching “problem details” to HTTP error responses to give developers more context about the errors they get from servers and applications.
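As a rough sketch (the field names come from the RFC; the values are made up), a problem-details payload looks something like this:

# Illustrative RFC 9457 problem-details payload; values are hypothetical.
problem = {
    "type": "https://example.com/problems/db-connection",
    "title": "Database connection failed",
    "status": 503,
    "detail": "Could not reach db-server-1; the connection pool is exhausted.",
    "instance": "/api/v1/users",
}
# Typically served with the media type "application/problem+json".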

The same “problem details” concept can be applied to all logs. Your logs must include timestamps, levels, and some form of ID to help tools segment, analyze, and visualize them. You might also want to include metadata and tags for easier filtering and aggregation. For instance, perhaps you’re instrumenting a data pipeline for analysts or data engineers. In that case, a log might look like this:


{
  "timestamp": "2024-08-02T09:45:32.123Z",
  "level": "ERROR",
  "event": "DATA_PIPELINE_FAILURE",
  "data": {
    "pipeline_id": "ETL-Daily-Users",
    "step": "data_transformation",
    "rows_processed": 15000,
    "rows_failed": 500,
    "error_type": "schema_mismatch",
    "affected_columns": ["user_age", "last_login_date"],
    "impact": {
      "delayed_reports": ["daily_active_users", "user_retention"],
    }
  }
}

This structure offers essential information for understanding the error, its scope, and its impact on data analysis tasks without including excessive technical details that might not be relevant to a data analyst’s role:

  1. An explicit error identifier “DATA_PIPELINE_FAILURE”, which describes the issue succinctly.
  2. A timestamp that allows for pinpointing when the error occurred.
  3. Relevant error details – pipeline_id identifies which specific data pipeline failed, and fields like step, error_type, and affected_columns go into more detail.
  4. The impact field shows which analytics are affected.

Solving the problem with logs

All of the above sounds excellent, obvious… and difficult.

As much as logs are first-class citizens and deliverables, the truth is that they don’t feel like it, so they get relegated to the back burner during development. How can we make sure developers see the benefits of good logs?

#1: Start small. Implement structured logging in a single, critical service. This allows teams to experience the benefits firsthand without overwhelming them with a massive overhaul.

#2: Create templates. Develop log templates for common scenarios, making it easier for developers to implement good logging practices consistently. Here’s a template for that:


{
  "timestamp": "{{ISO8601_TIMESTAMP}}",
  "level": "{{LOG_LEVEL}}",
  "service": "{{SERVICE_NAME}}",
  "event": "{{EVENT_TYPE}}",
  "message": "{{BRIEF_DESCRIPTION}}",
  "data": {
    "trace_id": "{{UNIQUE_TRACE_ID}}",
    "user_id": "{{USER_ID_IF_APPLICABLE}}",
    "operation": "{{SPECIFIC_OPERATION_NAME}}",
    "duration_ms": {{OPERATION_DURATION}},
    "status": "{{SUCCESS_OR_FAILURE}}",
    "details": {
      // Additional context-specific key-value pairs
    },
    "error": {
      "type": "{{ERROR_TYPE_IF_APPLICABLE}}",
      "message": "{{ERROR_MESSAGE_IF_APPLICABLE}}",
      "stack_trace": "{{STACK_TRACE_IF_APPLICABLE}}"
    }
  },
  "context": {
    "environment": "{{ENVIRONMENT_NAME}}",
    "version": "{{APPLICATION_VERSION}}",
    "server": "{{SERVER_IDENTIFIER}}"
  },
  "metadata": {
    // Additional tags or metadata for filtering and aggregation
  }
}
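A small helper can make a template like this the path of least resistance, so developers only supply what varies per event. Here’s a hypothetical sketch that fills in a simplified subset of the fields above:

import json
import os
from datetime import datetime, timezone

# Hypothetical helper mirroring a simplified subset of the template above.
def build_log_record(level, event, message, *, trace_id, data=None, error=None):
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": os.environ.get("SERVICE_NAME", "unknown"),
        "event": event,
        "message": message,
        "data": {"trace_id": trace_id, "error": error, **(data or {})},
        "context": {
            "environment": os.environ.get("APP_ENV", "development"),
            "version": os.environ.get("APP_VERSION", "0.0.0"),
        },
    }

print(json.dumps(build_log_record(
    "ERROR",
    "PAYMENT_FAILED",
    "Failed to process payment",
    trace_id="abc123",
    data={"order_id": 12345, "status": "failure"},
)))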

#3: Automate where possible. Use linters and static analysis tools to enforce logging standards in your codebase.
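For instance, a tiny custom check (a rough sketch, not a production-ready linter plugin) could flag stray print() calls and log calls that skip structured fields:

import ast
import sys

# Rough sketch of a custom check: flag bare print() calls and log calls that
# don't pass structured fields via "extra".
# Simplification: any .info()/.warning()/etc. call is treated as a log call.
class LoggingCheck(ast.NodeVisitor):
    def __init__(self, filename):
        self.filename = filename

    def visit_Call(self, node):
        if isinstance(node.func, ast.Name) and node.func.id == "print":
            print(f"{self.filename}:{node.lineno}: use the logger, not print()")
        if (
            isinstance(node.func, ast.Attribute)
            and node.func.attr in {"debug", "info", "warning", "error", "critical"}
            and not any(kw.arg == "extra" for kw in node.keywords)
        ):
            print(f"{self.filename}:{node.lineno}: log call without structured 'extra' fields")
        self.generic_visit(node)

for path in sys.argv[1:]:
    with open(path) as source:
        LoggingCheck(path).visit(ast.parse(source.read(), filename=path))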

#4: Make it part of the review process. Log quality should be included as a criterion in code reviews, treating it with the same importance as functional code.

#5: Measure the impact. Track metrics like mean time to resolution (MTTR) before and after implementing better logging practices. Concrete improvements can help justify the effort.

By treating logs with the respect they deserve, we make our systems more observable and our incidents more manageable, and we create a more empathetic environment for the future maintainers of our code.

🔍 Thanks for reading! And another thing: At Scout, we’ve got the perfect monitoring setup to complement your logging skills – give us a try with our 14-day free trial!
