Tuning Rails log levels per class, without a redeploy

Terminal screenshot. Top command: rails runner 'MailerService.logger_level = :debug', output => :debug (expires in 9h). Then a tail of application.log showing INFO and WARN lines from OrdersService, AuthService, WebhookService, InventoryService, and OrdersController in muted gray, interleaved with bright DEBUG lines from MailerService showing the SMTP handshake, authentication, send, and delivery. Punchline beneath: one class loud, the rest at the app default. exactly the dial we wanted.

Friday afternoon. A subset of users in one country aren't getting their transactional emails. The trace says "delivered" through the service that hands the message off, then nothing useful. The mailer is logging at whatever your app's production default is, same as the rest of the codebase. The next step is obvious. Turn on debug for the mailer, see what it's doing, find the bug.

Most engineers reach for Rails.logger.level = :debug first. That's the whole-app logger, though. Crank it on a fleet of forty pods serving live traffic and you're paying log volume for every controller, every job, every render, for one investigation that needs maybe twenty lines from one class. Every aggregator-indexed line is a few million lines of noise around the signal. The bill goes vertical. Half your team's queries time out because the logs explorer is choking. This is the sledgehammer-for-surgery problem1 and it's been the state of Rails logging for as long as I've been writing it.

I wanted a dial that was per-component, runtime-tunable, and didn't need a redeploy. Crank the mailer to debug for thirty minutes, leave the rest at the production default, have it expire on its own when I forget about it.

This post is how I built it.

A quick primer on log levels, since defaults vary

Ruby's standard log levels go debug, info, warn, error, fatal, in increasing order of severity. The logger holds a level threshold. Lines emitted below the threshold are short-circuited and never reach the appender. So a logger at info will emit info, warn, error, fatal lines and silently drop debug ones.

Rails defaults to info in production2. Plenty of teams move it to warn for high-volume apps because debug is the firehose: ActiveRecord query traces, cache hits, every internal callback a library author thought to log, every cookie session lookup. On a busy fleet, dropping the global threshold to debug can multiply log line volume severalfold to an order of magnitude depending on the app's shape, and your aggregator is going to bill you for every byte. Most teams pick info or warn to keep that bill sane and rely on tracing tools for the high-frequency stuff.

That tradeoff defines the entire shape of this problem. The moment you actually need debug for an investigation, you're asking the fleet to pay debug-level volume for everyone else's traffic too. Whatever your default is, the dial in this post breaks that constraint: you crank one class to debug, the rest stay at the default, and the volume bill barely moves.

What I wanted

A few requirements ruled options in or out.

Per-class control was the first need. Setting the level on the controller layer alone won't cut it. Most of the interesting work happens downstream of the controller, in jobs, mailers, service objects, view components, and whatever else your app has accumulated into base classes. Anywhere I have a class with a meaningful boundary, I want to be able to dial its logger independently.

It also had to change at runtime, with no redeploy involved. An investigation dial only earns its name if it's available the moment you need it, not after a CI cycle.

And it had to expire on its own. If I crank a service to debug at midnight chasing a bug, I shouldn't wake up to a nine-hour debug spew because I forgot to turn it off. A finite TTL handles the cleanup so I don't have to remember.

Approaches I considered

Cranking Rails.logger.level globally. This is the default reflex, and it works fine for development. In a production fleet of forty pods serving live traffic, it's the sledgehammer-for-surgery problem from above. Even with five-minute investigations, the aggregator cost is real, and you're polluting every other engineer's logs while you're at it. I ruled this one out fast.

Lograge. Lograge reformats Rails request logs into one-line entries and is excellent for what it does. But it's a request-line cleanup, not a level controller. The level still lives on Rails.logger. Doesn't solve this problem.

Tagged logging. ActiveSupport::TaggedLogging3 lets you wrap a block in tagged("user-123") do ... end and every log entry inside picks up the tag. But again, it's tags, not levels. Wrapping MailerService.deliver in a tagged block doesn't change which lines emit, it changes how they're labeled. It solves the filtering problem after the fact, not the volume problem.

Prefab's filter approach. Prefab.cloud ships a Semantic Logger filter you wire as SemanticLogger.add_appender(filter: Prefab.log_filter)4. Per-line, the filter Proc evaluates whether to emit, and can read dynamic config from Prefab's SaaS. This is a clean shape and probably the closest thing to industry-standard for this problem. The downside for me was the SaaS dependency. The dial config sits in their service. An outage there is an outage of my debugging primitive at exactly the worst time. I wanted the dial in my own infrastructure.

Plain SemanticLogger[ClassName]. Semantic Logger5 gives every class its own logger via SemanticLogger[MailerService]. That part is excellent. Each class's lines come out with the class name in the named-class slot, indexable in the aggregator with no extra work. The level on each per-class logger is a class instance variable, though, which means it's static unless something mutates it. There's no "tune at runtime" baked in. So this gives me the per-class shape but I still need to build the dial.

What ruled all of these in or out: the dial had to live in my own config store, expire on its own, and not require a redeploy or an external service.

A dark mixing-board panel with seven vertical sliders, each labeled with a service class. OrdersService, AuthService, InventoryService, AccountService all sit at warn. WebhookService and NotifierService at info. The MailerService slider in the middle has its knob cranked all the way up to debug, glowing orange against the muted gray of the others. Caption beneath: one slider up, the rest at the app default. exactly the dial.

What I built

A small concern, prepend'd into every base class that owns a meaningful boundary in the app. ApplicationController, ApplicationJob, ApplicationMailer, ApplicationRecord, ApplicationService, ApplicationCable::Channel, ApplicationComponent (if you use ViewComponent), and any other base class your codebase has grown for things like message delivery or event broadcast. Each subclass inherits the behavior automatically.

The concern overrides one thing: the class-level logger method.

module ComponentLogger
  extend ActiveSupport::Concern

  included do
    helper_method :logger if respond_to?(:helper_method)
    include SemanticLogger::Loggable
  end

  delegate :logger, to: :class

  class_methods do
    def logger
      @semantic_logger ||= SemanticLogger[name]
      current_level = logger_level

      if current_level == SemanticLogger.default_level
        @semantic_logger.level = nil
      elsif @semantic_logger.level != current_level
        @semantic_logger.level = current_level
      end

      @semantic_logger
    end

    def logger_level
      if @_cached_level && @_cached_level[:expires_at] > Time.current
        return @_cached_level[:level]
      end

      level = config_store.get(logger_key)&.to_sym || SemanticLogger.default_level
      @_cached_level = { level:, expires_at: Time.current + log_level_cache_ttl }
      level
    end

    def logger_level=(level)
      level = level.to_sym
      config_store.setex(logger_key, log_level_expiry, level)
      @_cached_level = { level:, expires_at: Time.current + log_level_cache_ttl }
      @semantic_logger.level = level if @semantic_logger && @semantic_logger.level != level
    end

    def logger_key = "log_levels:#{name}"
    def log_level_cache_ttl = 2.minutes
    def log_level_expiry = 9.hours.to_i
  end
end

Walking through it: every call to MailerService.logger checks the in-process cache. If it's been more than two minutes since the last refresh, the cache reads the config store. If the store has an explicit level for this class, the logger pins that level. If not, it clears the level. (More on why "clear" not "leave alone" in a moment.)

Setting the dial, via MailerService.logger_level = :debug, writes to the config store with a nine-hour TTL. After nine hours the entry is gone, the next read falls back to the default, and no one needs to remember to turn it off.

Use any key-value store with TTL support: setex, get, and process-local caching are the only operations the pattern needs. The cross-pod invalidation question, how does pod B find out that pod A changed the level, gets answered in a separate post about the broadcast layer underneath. For the purposes of this pattern, two-minute eventual consistency is fine, and the broadcast layer brings it under a second when it matters.

The include that did nothing

The prepend in the snippet above is doing real work. I tried include LoggerConcern first, the way you reach for any concern by default. It silently did nothing on the Rails base classes I cared about. The dial was set, the cache was warm, the class-level method ran when I called it directly in a console, and yet every controller.logger inside the app kept resolving to the inherited Rails.logger, ignoring the per-class dial entirely.

Method-lookup order is what trips this up. Rails base classes wire logger via class_attribute :logger, which defines both a class-level reader and an instance-level reader directly on the base class itself (AbstractController::Base, ActiveJob::Base, ActionMailer::Base, all of them). With include, the concern's instance method (the delegate :logger, to: :class) ends up shadowed in real call paths by the inherited base-class logger. With prepend, the concern's instance methods sit above the including class itself in the ancestor chain, which means the delegate always fires first and lands on the class-level method I actually wrote.

Just one keyword changed. It's worth flagging because the failure was silent: the dial appeared to work, the tests could pass if they were calling the class-level method directly, but the instance-level call paths that emit actual log lines were never going through the dial. The bug didn't fail loud.

The cooperation, drawn

Before the trap section dives into the mechanism, a picture of the two states the dial can be in. The branch that pins @level versus the branch that clears it back to nil is the load-bearing piece, and the consequence for whatever thread-local override is layered on top is the whole story.

flowchart TD
    A[ComponentLogger.logger called] --> B[read level from config store]
    B --> C{store override differs<br/>from global default?}
    C -->|"no, or unset"| D["@semantic_logger.level = nil"]
    C -->|"yes, non-default pin"| E["@semantic_logger.level = override"]
    D --> F[logger falls back to<br/>SemanticLogger.default_level_index]
    F --> G[thread-local silence respected<br/>per-user override can win]
    E --> H[per-class level pinned<br/>thread-local silence ignored<br/>per-user override loses for this class]

The trap that took the longest to see

Look at the level-pinning block again.

if current_level == SemanticLogger.default_level
  @semantic_logger.level = nil
elsif @semantic_logger.level != current_level
  @semantic_logger.level = current_level
end

Pay attention to the @semantic_logger.level = nil branch. That's where the weight lives. Without it, here's what breaks.

@semantic_logger is a class-level instance, one per class, shared across all the process's threads. When the config store has a class-level dial set, @semantic_logger.level = current_level is correct. When no dial is set, the obvious thing is to do nothing and leave the level wherever it was. That is wrong.

Why: SemanticLogger.silence(level)6 is the gem's thread-local, block-scoped helper for lowering the minimum log level inside a block. It's the standard way to say "drop this thread's level for the duration of this block, then restore it." Per the docs, silence only takes effect when the per-class logger's @level is unset, meaning the logger falls back to SemanticLogger.default_level_index, which respects the thread-local. If the per-class logger has any explicit @level pinned, silence is ignored for that logger.

That property is load-bearing for a sibling feature I'm covering in the next post: per-user per-request log overrides built on silence. If the per-class dial here pins @level whenever it runs, the sibling feature is silently dead, because every class would be ignoring the thread-local override.

So if the method body sets @semantic_logger.level once, to anything, and never clears it, the thread-local override is permanently dead for that class. The first call to MailerService.logger after boot pins @level to whatever the app's global default is (info, warn, whatever). Every subsequent thread-local override fails silently for that class.

Fixing it takes the explicit nil branch. Whenever the config store says "default level for this class," clear @level back to nil and let the logger fall back. The class-level dial only pins @level when a non-default override is actually set. When it isn't, the thread-local is free to take over.

This is a one-line change with a five-line comment, and the comment is the longer of the two on purpose. Anyone removing the nil branch later would assume it was a no-op.

What this gets you

A working dial. Call MailerService.logger_level = :debug from a Rails console, an internal admin endpoint, a feature-flag platform, or however you choose to expose the setter to operators. The next request that touches that class emits debug logs, indexed in the aggregator under MailerService. The other thirty-something component classes stay at the app's default. Aggregator volume goes up by maybe a tenth of a percent.

After thirty minutes, or nine hours if you forget, the entry expires and the logger falls back. There's no cleanup to do, no follow-up to remember.

The pattern scales to anything with a base class. I have the concern prepend'd in eight or so places, ApplicationController and ApplicationJob and the rest, and every subclass under each base inherits the dynamic logger automatically. Adding a new component is one line.

It also keeps the aggregator clean. Every emitted log carries the class name as a structured field (Semantic Logger's named-class slot), so filtering by class Just Works in whatever query syntax your stack uses. The field is already there. You don't have to add any tagging code on the app side.

What it doesn't

The two-minute cache means a level change has up to two minutes of skew across the fleet by default. The broadcast layer brings it under a second, but that's a separate piece.

It doesn't change how individual lines emit, only whether they emit. If you want to filter individual lines by content, suppress every health_check, log every line for one user, that is the appender-filter approach Prefab uses. Both shapes are valid. This one is cheaper at the high-volume end because below-level lines short-circuit before reaching the appender. The filter approach is more flexible because the Proc has full context of every log line.

It doesn't help with libraries that own their own loggers. Sidekiq.logger, gem-internal logging, anything that resolves a logger by some other path goes through whatever level it was configured with. The pattern is "every component I own."

The pattern most surprised me by working everywhere. The interface is the same anywhere a base class exists: prepend ComponentLogger and you're done. The dial is now available for that class.

What I'd revisit today

I shipped this with a TTL cache. At read time, every pod checks its own in-memory map. Every pod independently decides when to refresh. Writes wait out the cache window before showing up everywhere. That worked, and the production fleet has been running on it for over a year. But the more system design I've done since then, the more I'd push back on the TTL-cache shape if I were proposing it in a design review today.

The structural issue is read-side polling. Every pod has its own clock for invalidation. Most of the time the staleness doesn't matter, but during an active investigation, when you're flipping the dial back and forth checking a hypothesis, the two-minute lag is visible and frustrating. A senior reviewer asking "what happens when pod A and pod B see different values for the same level" is asking the right question, and "they reconcile within two minutes" is the wrong answer for an investigation primitive.

Three cleaner store shapes for the same problem:

  1. Pub/sub invalidation. Writes go to the store and publish an invalidation message at the same time. Every pod subscribes, drops its cache on each invalidation, refreshes on the next read. Convergence goes from minutes to seconds. Redis pub/sub is the simplest version, NATS or Kafka if you want durability.
  2. Watch-based config. Treat the level as declarative state in something like etcd, Consul, or a Kubernetes ConfigMap, and use the store's native WATCH primitive to push changes. Each pod sees the new value as soon as it lands, no polling. I wrote about exactly this pattern with Kubernetes ConfigMaps as the store, and the shape can be simplified further if you swap the K8s watch for a broadcast layer like Redis Streams or a plain pub/sub channel.
  3. Versioned snapshots. Bump a single version key on write. Pods cache by version, so a stale read pays one extra round-trip to verify the version, but it never returns out-of-date data.

The next post in this series goes with option (1) while extending this dial into per-user, per-request log overrides. Same per-class plumbing as here, but the store sits behind a pub/sub broadcast layer that pushes invalidations the moment a write lands. Convergence drops to under a second across the fleet. If you're building this from scratch and you have the option, skip the TTL-cache phase entirely and start with (1). The TTL phase mostly exists in this post because that's the order I built things in.

Closer

If you've built something similar, or hit a failure mode I didn't, especially around the cooperation with thread-local overrides, I'd love to compare notes. Email's on the contact page. The closer to "I tried this and here's what bit us," the more useful.

Up next: the per-user flavor of this pattern. How to dial debug logging for one specific user through their next handful of requests, without redeploying, without affecting any other users, and without stomping on the per-component dial described here.

Footnotes

  1. Prefab Cloud, Changing Log Levels At Runtime in Rails. The "sledgehammer for surgery" framing for Rails.logger.level = :debug lives in this post.

  2. Rails Guides, Debugging Rails Applications · Log Levels. Production defaults to :info. Levels go :debug, :info, :warn, :error, :fatal, :unknown, in increasing severity, and debug-level emissions in production "can have a massive impact on performance" because of how often Rails internals log at debug.

  3. Rails API, ActiveSupport::TaggedLogging. Wraps a block to attach a tag to every log entry inside. The level itself is not changed.

  4. Prefab Cloud, Dynamic Ruby Log Levels With Semantic Logger & Prefab. Covers the appender-filter shape as of prefab-cloud-ruby 1.6.0.

  5. Reid Morrison, Semantic Logger docs. Per-class logger via SemanticLogger[YourClass], the building block this pattern wraps.

  6. Reid Morrison, Semantic Logger. SemanticLogger.silence reduces the minimum level for the duration of a block on the calling thread. Per the docs, suppression only applies to loggers using the global default level. If a logger has an explicitly configured @level, silence is ignored for that logger. This is the property the nil branch above leaves intact on purpose.

Comments