# Maria Khan — runbookpages.com

> Platform engineer writing about the unglamorous shape of platform work: Rails internals, Kubernetes patterns, dynamic configuration, and failure modes that take three days to debug and one line to fix.

## Author

Maria Khan is a platform engineer working on Ruby on Rails, Kubernetes, PostgreSQL, Redis, and distributed systems. She writes long-form technical posts drawn from real production experience, the kind of operational detail that doesn't make it onto conference slides.

## Focus areas

- Ruby on Rails internals and ActiveRecord patterns (connection pooling, transactions, callbacks)
- Kubernetes configuration management and dynamic config propagation
- Distributed coordination, etcd internals, and the Raft consensus protocol
- Incident response and operational tooling for platform teams
- Platform security, threat modelling, and VAPT
- Observability and SRE practices
- Monolith ergonomics and migration patterns
- PgBouncer, connection pooling, and zero-downtime database-tier rollouts on Kubernetes
- Dynamic logging architecture, per-component and per-user log levels, semantic_logger patterns, and runtime-tunable dials for debugging in production
- Debugging Redis client behaviour in Rails, redis-rb / redis-client gem internals, and the connection-config asymmetries that show up in pub/sub blocking-lock paths

## Pages

- [Home](https://runbookpages.com): About Maria, her current focus areas, and what she is reading and building now
- [All posts](https://runbookpages.com/blogs): Full index of published writing
- [Contact](https://runbookpages.com/contact): How to reach Maria and the kinds of conversations she responds to

## Posts

- [When config edits start feeling like deploys](https://runbookpages.com/posts/configmap-dynamic-event-driven/): On propagating config edits across a Kubernetes fleet without a redeploy. Walks through the alternatives considered (periodic polling and the freshness floor; Redis pub/sub and at-most-once silent message loss; vendor feature-flag platforms and the dual-source-of-truth anti-pattern; dedicated coordination services like etcd, Consul, and ZooKeeper; volume-mounted ConfigMap with the listen gem and how kubelet's AtomicWriter symlink swap broke it). Lands on Kubernetes ConfigMap as the propagation channel, watched via the kubeclient API watch interface. Key insight: store markers (id, name, updated_at), not values. The ConfigMap is the bell, not the store; the database stays the only source of truth. Includes the BasePublisher and BaseWatcher class scaffolding, the per-key diff that turns the watch-event firehose into per-record cache invalidation, the RBAC asymmetry between named-verb scoping and list/watch (resourceNames is silently ignored on list/watch), and a failure-modes section. That section covers the dual-writes window the design adds on the publisher side (the basic shipped design surfaces failures loudly via Rails' after_commit exception propagation, which returns a 5xx to the operator; optional variants add retry-with-backoff and a reconciliation loop) and the subscriber-side risks of long-lived idle threads (supervisor, read timeout, time-since-last-event metric). Production numbers: median 250ms convergence, p99 under 2 seconds, 5-10ms per-pod processing per propagation, nanoseconds for the diff step itself, six months running cleanly.

- [Rolling PgBouncer without dropped queries](https://runbookpages.com/posts/pgbouncer-zero-downtime-rollouts/): On taking the connection-bad spike during PgBouncer rollouts to zero. Explains PgBouncer's three pooling modes (session, transaction, statement) and why transaction pooling is the usual answer, with the caveat that session-scoped state (advisory locks, session GUCs, prepared statements) breaks. Introduces a three-layer mental model: the ActiveRecord pool, the TCP socket, and PgBouncer's backend Postgres connection. Walks through the SIGTERM-vs-SIGINT shutdown dilemma (WAIT_FOR_CLIENTS waits indefinitely; WAIT_FOR_SERVERS force-disconnects mid-request) and quotes PgBouncer maintainers on issue #1468 confirming there is no proxy-side mechanism for graceful client disconnection. The fix lives in the client: a Rails patch that recycles ActiveRecord pool connections older than MAX_LIFETIME at checkout time, sized so all client connections age out before terminationGracePeriodSeconds triggers SIGKILL on the old PgBouncer pod. Includes the timing-math inequality (terminationGracePeriodSeconds > MAX_LIFETIME + idle_timeout + buffer), notes Rails 8.1 ships max_age in core (with pool_jitter to avoid thundering-herd recycling), and credits prior art (Samuel Cochran's gist for autoscaling rebalance).
Also covers PgBouncer deployment patterns on Kubernetes (DaemonSet, Sidecar, Deployment + ClusterIP service), other things PgBouncer does (PAUSE/RESUME admin commands, TLS termination, multi-database routing, authentication proxy, per-user/per-database configuration), and alternatives (AWS RDS Proxy with extended-protocol multiplexing, PgCat in Rust, app-side pooling only, Pgpool-II). Companion read recommended: JP Camara on PgBouncer's SQL-side perils.

- [Tuning Rails log levels per class, without a redeploy](https://runbookpages.com/posts/per-component-log-levels-rails/): On dialing debug logging for a single Rails class without cranking the whole-app logger. Walks through alternatives (Rails.logger.level globally and the volume tax, Lograge for request-line cleanup but not levels, ActiveSupport::TaggedLogging for tags-but-not-levels, Prefab.cloud's appender-filter approach with its SaaS dependency, plain SemanticLogger[ClassName] giving per-class loggers with only static levels). Lands on a small concern, ComponentLogger, prepended into ApplicationController, ApplicationJob, ApplicationMailer, ApplicationRecord, ApplicationService, ApplicationCable::Channel, and ApplicationComponent. Each class gets a runtime-tunable level backed by a config store with a process-local TTL cache. Dials are set via the .logger_level= setter from a Rails console or admin endpoint, with a finite expiry so debug doesn't run forever if forgotten. Covers two traps: the include-vs-prepend method-lookup-order failure (Rails base classes wire logger via class_attribute, which interacts with module inclusion such that the concern's delegate is shadowed by the inherited base-class method when included; prepend places the concern above the class itself in the ancestor chain and the delegate fires first), and the @semantic_logger.level = nil cooperation pattern (the load-bearing branch is clearing @level when the config store has no override that differs from the global default, so thread-local SemanticLogger.silence overrides from sibling features can take over for that class).
Closing "What I'd revisit today" section calls out the read-side polling tradeoff in the TTL cache and proposes three cleaner store shapes: pub/sub invalidation (Redis pub/sub or NATS/Kafka for durability), watch-based config (etcd/Consul/Kubernetes ConfigMap with native WATCH primitives, linking to the configmap-dynamic-event-driven post as prior art that can be further simplified with a broadcast layer), and versioned snapshots. The next post extends this dial into per-user request-scoped overrides built on a pub/sub broadcast.

- [Tuning Rails log levels per user, request-scoped at runtime](https://runbookpages.com/posts/per-user-debug-logging-rails/): On dialing debug logging for one specific user's next handful of requests, without affecting other users or redeploying. Builds on the per-class dial from the previous post and adds the orthogonal axis (per-user). Walks through alternatives (Rails.logger.level globally, the per-component dial alone giving every user's calls into that class at debug, ActiveSupport::TaggedLogging for tags-but-not-levels, Prefab's appender filter, a custom logger subclass with thread-variable level). Lands on a small RequestScope.apply(user_id, &block) module wired as a controller around_action, using SemanticLogger.silence(level) for the thread-local minimum-level override and SemanticLogger.tagged(user_debug: user_id) for aggregator filtering. Both are ensure-scoped, so nothing leaks to the next request on the same worker thread. The store is a single Redis hash, keyed user_debug_logging with fields keyed by user_id, chosen for enumeration (HGETALL over SCAN) and locality (one key for all overrides, trivial to dump, count, or wipe). The catch is Redis's lack of per-field TTL on hashes (hash-field expiration commands such as HEXPIRE only arrived in Redis 7.4), so each value encodes level:expires_at and the read path checks the timestamp; expired entries are treated as missing and lazily deleted, with the broadcast-driven refresh sweeping any expired fields it iterates over. The cooperation with the per-component dial is the punchline: silence only takes effect when the per-class logger's @level is unset, which is exactly what the previous post's @level = nil branch leaves intact.
Honest about limits: async-job propagation needs explicit wiring with three standard fixes (concern on ApplicationJob passing user_id as an arg, ActiveSupport::CurrentAttributes for ambient propagation auto-handled by ActiveJob's serializer, or custom Sidekiq/GoodJob/SolidQueue middleware), non-controller paths need their own RequestScope.apply call at the entry point, and class-level non-default pins still win over per-user overrides for that class.

- [Debugging Redis::CannotConnectError in Ruby](https://runbookpages.com/posts/redis-cannot-connect-error-ruby/): On a month of thousands of connect-timeout-flavoured Redis::CannotConnectError exceptions a day from a Rails app on redis-rb, the dead ends, and the four-line fix that turned out to be a footgun in the application's own code. Walks through the redis-rb error taxonomy (CannotConnectError, ConnectionError, TimeoutError as siblings under BaseConnectionError), and shows that CannotConnectError covers several quite different causes (connection refused, host unreachable, DNS failure, TLS handshake failed, and connect timeout) since redis-rb's hierarchy is flatter than redis-client's. Names the load-bearing insight: every visible error in the dashboard is a post-retry error (redis-rb 5.x defaults reconnect_attempts to 1, so the client retries once before propagating), so the underlying failure rate is at least double what the count suggests. Covers the redis-rb 5.0 timeout default change (5s to 1s) and the layered-default gotcha where the wrapper sets reconnect_attempts: 1 even though the underlying redis-client gem's own default is false. Traces the connect-timeout chain in source: Socket.tcp(host, port, connect_timeout: X, resolv_timeout: X) inside RedisClient::RubyConnection#connect raises Errno::ETIMEDOUT, the inner rescue mutates the message to append ': Xs', the outer rescue translates SystemCallError to RedisClient::CannotConnectError, and redis-rb wraps it as Redis::CannotConnectError with message 'Connection timed out: 1.0s'. The trap section is the post's actual differentiator: any time you construct a fresh Redis client from $pool.connection.slice(:host, :port, :db, :id), you silently drop the timeout config because the connection accessor only exposes the addressing fields. The pattern shows up most often in pub/sub blocking-lock subscribers, where the dedicated socket is built on demand from the pool's connection metadata.
The fix extracts a shared REDIS_TIMEOUTS constant and merges it into both the main pool config and the subscriber construction, four lines total. Honest about dead ends: increasing connection pool size (wrong layer, would surface as ConnectionPool::TimeoutError), looking for KEDA cold-start correlation (real but not causal), looking for cross-call patterns (broad distribution should have been the clue, missed it), suspecting DNS (worthwhile check, dead end). Things you can't tune on managed Redis (worked example: ElastiCache): tcp-backlog, somaxconn, tcp_max_syn_backlog, TCP keepalive intervals, OS-level connection timeout, ephemeral port range, conntrack table size. All behind the managed-service wall. AWS-side ceilings worth knowing: NetworkBandwidthInAllowanceExceeded for baseline-vs-burst throttling, the 65,000 concurrent client connection ceiling per node, the soft recommendation to keep currConnections in the low hundreds. The verification trick: introspect _client.config.connect_timeout at runtime to confirm what your client actually believes about its own configuration, regardless of what your config file says (and the equivalent introspection paths in redis-py, ioredis, Lettuce). Why connect_timeout=3s landed there: 1.0s leaves zero margin for natural network variance, 5.0s wastes worker threads on real failures, 3.0s absorbs noise without faking a problem.
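
For the ConfigMap post's "store markers, not values" idea, the per-key diff could look roughly like the sketch below. It assumes each ConfigMap data entry maps a record key to a JSON marker; the method name `changed_keys` and the `flag:N` key format are hypothetical, not taken from the post.

```ruby
require "json"

# Hypothetical sketch: compare two snapshots of a ConfigMap's `data`
# (key => JSON marker string) and return every key whose marker was
# added, removed, or changed. Only those records need cache invalidation.
def changed_keys(previous, current)
  (previous.keys | current.keys).select do |key|
    previous[key] != current[key]
  end
end

# Illustrative markers (id, name, updated_at); values stay in the database.
old_data = {
  "flag:1" => { id: 1, name: "checkout", updated_at: "2024-01-01T00:00:00Z" }.to_json,
  "flag:2" => { id: 2, name: "search",   updated_at: "2024-01-01T00:00:00Z" }.to_json
}
new_data = old_data.merge(
  "flag:2" => { id: 2, name: "search",  updated_at: "2024-01-02T00:00:00Z" }.to_json,
  "flag:3" => { id: 3, name: "billing", updated_at: "2024-01-02T00:00:00Z" }.to_json
)

stale = changed_keys(old_data, new_data)
```

Here `stale` holds only the two keys that moved, so a watcher would reload two records rather than the whole set.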
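
The timing inequality from the PgBouncer post (terminationGracePeriodSeconds > MAX_LIFETIME + idle_timeout + buffer) can be checked with a few lines of arithmetic. The helper name and the numbers below are illustrative assumptions, not values from the post.

```ruby
# Hypothetical helper: true when the old PgBouncer pod is given enough
# time for every client connection to age out before SIGKILL arrives.
# All arguments are in seconds.
def grace_period_sufficient?(grace_period:, max_lifetime:, idle_timeout:, buffer:)
  grace_period > max_lifetime + idle_timeout + buffer
end

# Illustrative numbers only: 300 + 60 + 30 = 390 seconds must pass first.
roomy = grace_period_sufficient?(grace_period: 400, max_lifetime: 300, idle_timeout: 60, buffer: 30)
tight = grace_period_sufficient?(grace_period: 360, max_lifetime: 300, idle_timeout: 60, buffer: 30)
```

With these numbers `roomy` is true and `tight` is false; a check like this could sit in a deploy-time assertion so the two settings cannot drift apart unnoticed.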
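
The include-vs-prepend trap from the per-class log-level post comes down to Ruby's method lookup order, which can be shown without Rails. In this simplified sketch the class defines `logger` itself, standing in for whatever the class_attribute wiring produces in a real Rails base class; the module and class names are hypothetical.

```ruby
# Hypothetical stand-in for the concern's logger delegate.
module ComponentLoggerSketch
  def logger
    "component logger"
  end
end

# include: the module sits *after* the class in the ancestor chain,
# so the class's own `logger` is found first and shadows the module.
class IncludedBase
  def logger
    "base logger"
  end
  include ComponentLoggerSketch
end

# prepend: the module sits *before* the class in the ancestor chain,
# so the module's `logger` is found first.
class PrependedBase
  def logger
    "base logger"
  end
  prepend ComponentLoggerSketch
end

included_result  = IncludedBase.new.logger
prepended_result = PrependedBase.new.logger
```

Inspecting `IncludedBase.ancestors` and `PrependedBase.ancestors` in a console shows the two orderings directly.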
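
The level:expires_at encoding from the per-user post can be sketched with a plain Hash standing in for the Redis hash. The method names are hypothetical, and storing expires_at as epoch seconds is an assumption; the post summary only says the value carries a level and an expiry.

```ruby
# Hypothetical sketch: `store` is a plain Hash standing in for the Redis
# hash `user_debug_logging` (field = user_id, value = "level:expires_at").
def encode_override(level, expires_at)
  "#{level}:#{expires_at.to_i}" # assumption: expiry stored as epoch seconds
end

def read_override(store, user_id, now: Time.now)
  field = user_id.to_s
  raw = store[field]
  return nil if raw.nil?

  level, expires_at = raw.split(":", 2)
  if Integer(expires_at) <= now.to_i
    store.delete(field) # lazy delete; HDEL would play this role on real Redis
    return nil
  end
  level.to_sym
end

now = Time.at(1_700_000_000)
store = {
  "42" => encode_override(:debug, now + 600), # still valid for ten minutes
  "7"  => encode_override(:trace, now - 1)    # already expired
}

live    = read_override(store, 42, now: now)
expired = read_override(store, 7, now: now)
```

Reading user 7 returns nil and removes the field as a side effect, which is the lazy-expiry behaviour the post describes.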
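
The Redis post's trap and fix can be illustrated with plain hashes, no Redis connection needed. The host name, the `id`, and the read/write timeout values are assumptions for the example; only the 3.0s connect timeout and the slice(:host, :port, :db, :id) pattern come from the post.

```ruby
# Shared constant, merged into every client construction.
# connect_timeout: 3.0 is the value the post lands on; the others are illustrative.
REDIS_TIMEOUTS = { connect_timeout: 3.0, read_timeout: 1.0, write_timeout: 1.0 }.freeze

# Hypothetical addressing fields for the example.
REDIS_ADDRESS = { host: "redis.internal", port: 6379, db: 0, id: "app-1" }.freeze

pool_config = REDIS_ADDRESS.merge(REDIS_TIMEOUTS)

# What survives when a subscriber is built from addressing fields only.
connection_info = pool_config.slice(:host, :port, :db, :id)

# The trap: timeouts are silently gone, so library defaults apply.
leaky_subscriber_config = connection_info

# The fix: merge the shared constant back in at the subscriber too.
fixed_subscriber_config = connection_info.merge(REDIS_TIMEOUTS)
```

In a real app these hashes would be passed to the Redis client constructor, and the post's verification trick (inspecting the client's connect_timeout at runtime) confirms which value actually took effect.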

## Technical stack

Daily: Ruby · Rails · Kubernetes · PostgreSQL · Redis · RabbitMQ · Sidekiq · Bash
Watching: Rust · Zig · etcd · Tigris · Temporal · ScyllaDB

## Contact

Email: maria@runbookpages.com
LinkedIn: https://www.linkedin.com/in/khan-maria-
GitHub: https://github.com/missusk