Writing as Lossy Encoding
A Theory of Writing
Writing is a lossy encoding problem. Consider a writer. The writer’s objective can be represented as encoding the latent space of their thoughts—or the thoughts they want to convey—into a discrete representation: text.
Now consider a reader. The reader’s mind converts the discrete text back into embeddings in their own mind using their own “embedding layer,” if you will. Per Reader-Response Theory, every reader has a different “embedding layer” that will interpret the text differently and will create different emotions, thoughts, and ideas in the reader’s mind.
In this text, I seek to formalize optimal writing. We will start by considering this idea in its simplest form. The optimal piece of writing can be defined as follows:
First, define the vector 𝜏 that contains the thoughts, ideas, and emotions the writer desires to evoke in the reader.
Next, define a function β(T) that results in a matrix of shape x by y, where x is the number of readers in a representative pool, and y is the size of each meaning embedding vector. The i-th row, β_i(T), represents the i-th reader’s interpretation of text T.
Now, take the mean of β(T) across the pool of representative readers to get the average interpretation vector α.
Then, compute the mean squared error between α and 𝜏 to obtain the objective function (loss) for ideal writing:

L(T) = (1/y) Σⱼ (αⱼ(T) − 𝜏ⱼ)²

The ideal text is then the T that minimizes L, where T represents the discrete text.
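The definition above can be sketched numerically. Below is a minimal illustration in Python; the reader pool, embedding dimension, and all vector values are invented for the example, and real interpretation vectors would of course come from actual readers.

```python
import numpy as np

def writing_loss(beta, tau):
    """Loss for a 'closed work': MSE between the pool's mean
    interpretation and the writer's target vector tau.

    beta: (x, y) matrix, one interpretation vector per reader.
    tau:  (y,) target vector of intended thoughts and emotions.
    """
    alpha = beta.mean(axis=0)  # average interpretation across the pool
    return float(np.mean((alpha - tau) ** 2))

# Toy example: 3 readers, 4-dimensional meaning embeddings.
tau = np.array([1.0, 0.0, -1.0, 0.5])
beta = np.array([
    [0.9, 0.1, -0.8, 0.4],   # reader 1 under-reads slightly
    [1.1, -0.1, -1.2, 0.6],  # reader 2 over-reads slightly
    [1.0, 0.0, -1.0, 0.5],   # reader 3 gets it exactly
])
print(writing_loss(beta, tau))  # near zero: the readers' errors cancel
```

Note that the individual readers disagree, yet the loss is nearly zero because only the pool average α is compared against 𝜏; this is exactly the aggregation choice the later sections question.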
What I have described above is, I think, a reasonable formulation of optimal writing for what Umberto Eco called “closed works”: works designed to convey a specific meaning, such as instruction manuals, some forms of journalism, and propaganda. But Eco also identified “open works”; these works seek to have many meanings, like some forms of poetry, fiction, and even non-fiction writing.
For “open works,” the writer would likely select a different aggregation method or objective function, such as maximizing variance while minimizing mean squared error. Thus, what I have described is less of a unified theory of optimal writing and more of a recipe for defining the optimal writing in a given situation.
Furthermore, optimizing the mean squared error across a pool of representative target readers could lead to subjectively poor or bland writing, depending on the pool. The writer may desire to evoke extremely strong emotions in a certain subset of people, and this desire may be incompatible with broad appeal; the representative reader pool must therefore be selected carefully. Alternatively, a different metric or aggregation method could be used.
There are practically endless recombinations of this idea. A legal writer might prioritize adversarial robustness by minimizing the maximum error of a bad-faith reader rather than the average of a representative pool. Or an author seeking a cult following could use dynamic subset selection to optimize for the highest resonance within an unknown percentile of the population while ignoring the alienation of the majority. But I will eschew discussing these in depth here.
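Two of these variants can be made concrete. The functions below are hypothetical sketches, not canonical objectives: `adversarial_loss` minimizes the worst per-reader error (the legal-writing case), and `open_work_objective` rewards interpretive spread while penalizing drift of the average from 𝜏 (one possible “open work” score, where higher is better). The `spread_weight` trade-off parameter is an assumption of the sketch.

```python
import numpy as np

def adversarial_loss(beta, tau):
    """Legal-writing variant: score the draft by its WORST reader,
    so no single (possibly bad-faith) reading strays far from tau."""
    per_reader = np.mean((beta - tau) ** 2, axis=1)  # MSE per reader
    return float(per_reader.max())

def open_work_objective(beta, tau, spread_weight=0.5):
    """'Open work' variant: reward disagreement among readers while
    keeping the average interpretation near tau. Higher is better."""
    alpha = beta.mean(axis=0)
    mse = np.mean((alpha - tau) ** 2)        # drift of the average
    spread = beta.var(axis=0).mean()         # how much readers disagree
    return float(spread_weight * spread - mse)
```

The point is not these particular formulas but that swapping the aggregation (mean vs. max) or the sign on variance changes what “optimal” means.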
Revision as Gradient Descent
The formulation above describes optimal writing as a minimization problem, but text is discrete and thus not directly amenable to gradient-based optimization. However, I propose that the human revision process can be understood as an approximation of gradient descent.
Let T_0 denote an initial draft. The writer, upon reflection or after soliciting feedback, forms an estimate of β(T_0)—that is, an approximation of how readers will interpret the text. This estimate may be noisy, derived from the writer’s own mental simulation of readers, from workshop feedback, or from editorial review.
The writer then identifies passages where the estimated interpretation diverges from intent. Formally, for each passage p in the text, the writer approximates a local gradient signal:

ĝₚ ≈ ∂L/∂p

that is, an estimate of the direction in which changing passage p would most reduce the loss.
This gradient is not computed analytically but is instead intuited: the writer senses that a particular sentence is “not landing” or that a paragraph “buries the lede.” The revision process then applies an update:

Tₜ₊₁ = Tₜ − η·ĝ(Tₜ)
Here, η represents the writer’s revision intensity—analogous to a learning rate. A conservative writer makes small, surgical edits (low η), while a more aggressive reviser may rewrite entire sections (high η). As in numerical optimization, both extremes carry risks: too low and the writer converges slowly or gets stuck; too high and the drafts oscillate without settling.
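This revision loop can be simulated in a toy setting. The sketch below stands in for discrete text with a continuous “meaning vector” (an assumption purely for illustration) and applies the update rule with noisy gradient estimates, where the noise models the unreliability of the writer’s intuition.

```python
import numpy as np

rng = np.random.default_rng(0)
tau = np.array([1.0, -0.5, 0.25])  # the intended meaning
draft = np.zeros(3)                # T_0: a first draft far from the target
eta = 0.3                          # revision intensity (learning rate)

for step in range(50):
    grad = 2 * (draft - tau)               # exact gradient of ||draft - tau||^2
    noisy = grad + rng.normal(0, 0.05, 3)  # the writer's intuition is noisy
    draft = draft - eta * noisy            # one round of revision

loss = float(np.mean((draft - tau) ** 2))
print(loss)  # after 50 revisions the draft sits close to tau
```

With η = 0.3 the loop contracts toward 𝜏 despite the noise; raising η past 1/2 here would make successive drafts oscillate, mirroring the over-aggressive reviser.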
This framing suggests several natural extensions:
Feedback as variance reduction: A single reader provides a high-variance estimate of β(T). Multiple readers reduce this variance, yielding more reliable gradient signals. The workshop model, in which a cohort of readers responds to a draft, can thus be understood as a form of minibatch gradient estimation.
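The variance-reduction claim is easy to verify in simulation. Below, each “reader” reports the true gradient signal plus independent noise; averaging k readers shrinks the variance of the estimate by roughly a factor of k, exactly as in minibatch gradient estimation. The noise model is an assumption of the sketch, not a claim about real readers.

```python
import numpy as np

rng = np.random.default_rng(1)
true_grad = 1.0   # the "real" direction of needed revision
noise_std = 1.0   # per-reader noise in the feedback

def grad_estimate(k):
    """Average of k noisy per-reader gradient signals."""
    return (true_grad + rng.normal(0, noise_std, k)).mean()

var_1 = np.var([grad_estimate(1) for _ in range(20000)])  # one reader
var_8 = np.var([grad_estimate(8) for _ in range(20000)])  # workshop of 8
print(var_1, var_8)  # the second is roughly 8x smaller
```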
Editorial expertise as curvature awareness: A naive reader provides first-order information: “I was confused here,” “this part dragged.” This is the gradient, a direction to move, but no sense of the terrain. A skilled editor provides second-order information: a sense of how the loss surface curves. This manifests as sensitivity analysis (“this passage is load-bearing; small changes will ripple outward”), interaction effects (“if you cut this paragraph, you must also rework the ending”), and overshoot warnings (“your instinct will be to add exposition, but that will kill the pacing”). Where first-order feedback tells the writer what is wrong, second-order feedback anticipates how the writer will err in fixing it. This enables more efficient updates, analogous to Newton’s method or adaptive optimizers like Adam.
Convergence criteria: When does revision terminate? In practice, writers often stop when returns diminish, when successive drafts yield marginal reductions in perceived loss. This mirrors early stopping in machine learning, where further optimization risks overfitting to a particular pool of readers.
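One plausible way to formalize this stopping rule: quit once several consecutive revisions each improve the perceived loss by less than a tolerance. The thresholds and the loss history below are invented for illustration.

```python
def should_stop(loss_history, tol=0.01, patience=2):
    """Stop once `patience` consecutive revisions each reduced
    the perceived loss by less than `tol`."""
    if len(loss_history) <= patience:
        return False
    recent = loss_history[-(patience + 1):]
    gains = [a - b for a, b in zip(recent, recent[1:])]  # per-draft improvement
    return all(g < tol for g in gains)

# Perceived loss after each successive draft (made-up numbers):
drafts = [1.00, 0.40, 0.20, 0.15, 0.148, 0.147]
print(should_stop(drafts))  # True: the last two revisions barely helped
```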
Whether this process converges to T_optimal depends on the fidelity of the writer’s gradient estimates and on the shape of the loss surface, which, given the complexity of human interpretation, is almost certainly non-convex and riddled with local minima. The existence of “good enough” writing, rather than provably optimal writing, may be an inevitable consequence of this rugged landscape.

