On video-text

Timed-text is formed when the discrete words in a transcript are aligned to the continuous audio stream that gives rise to it. All monospace text in this blogpost is timed-text, allowing you to click each word to hear the source recording.

Unlike pure audio, timed-text is skimmable, searchable, and manipulable. Unlike pure text, timed-text retains the music behind the words, the intonation and cadence which encode semantics in their own right, as well as a strong source of provenance. When a video stream is coupled to it, we get video-text, which adds a layer of facial expression and body posture, or of images and diagrams.

Tools such as Reduct.video build on top of video-text as their core metaphor.

Video-text is not merely a video with subtitles or an audiobook accompanying the written version. Through interactivity, the medium affords control, connection, recollection, and circulation of the media stream, subverting the hot video medium into a cool linkage of ideas. To explore what video-text is, let's stretch these metaphors into possible systems with their own suggestive interactions.

Control: video-text as direct manipulation

When considered on its own, the interactivity of video-text becomes the dominant affordance. With just the simple interface built into this page, the examples can already be used as a soundboard. To expand the metaphor, you need only imagine a text editor like a Google Doc, in which each word is tied to its audio source. Re-arranging the sequence of your soundboard or Doc allows you to edit the source material by operating on the level of the word, rather than the level of timestamps as in a non-linear editor. You operate on text, but the audio comes along.

With timed-text, a documentary filmmaker can take hours of an interview, drop them into a text document, and craft a "paper edit". The process mirrors its analog counterpart: transcribing the interview, printing it on paper, cutting out the parts, and re-ordering them into a coherent narrative. Unlike with paper, however, you don't have to reverse-engineer this structure back onto the source footage. To polish the output, one can command-F "umm", delete all, and remove the filler words.
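To make word-level editing concrete, here is a minimal sketch in Python. It assumes a hypothetical representation in which each transcript word carries the start and end time, in seconds, of its audio. The names `Word`, `remove_fillers`, and `to_cuts`, the filler list, and the sample timings are all illustrative and not taken from any particular tool. Deleting words from the text yields the time ranges to keep, so the editor never touches a timestamp directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds into the source recording
    end: float


# Hypothetical filler list; a real editor would let the user choose.
FILLERS = {"um", "umm", "uh"}


def remove_fillers(words: list[Word]) -> list[Word]:
    """Drop filler words: the equivalent of find "umm", delete all."""
    return [w for w in words if w.text.lower().strip(".,!?") not in FILLERS]


def to_cuts(words: list[Word], max_gap: float = 0.3) -> list[tuple[float, float]]:
    """Turn an edited word sequence into (start, end) ranges of source audio.

    A word that follows the previous one in the source, with a gap of at most
    max_gap seconds, extends the current range. Anything else (a deletion or
    a re-ordering) starts a new range.
    """
    cuts: list[tuple[float, float]] = []
    for w in words:
        if cuts and 0 <= w.start - cuts[-1][1] <= max_gap:
            cuts[-1] = (cuts[-1][0], w.end)
        else:
            cuts.append((w.start, w.end))
    return cuts


# Made-up timings for illustration.
transcript = [
    Word("I", 0.0, 0.2),
    Word("umm", 0.3, 0.9),
    Word("love", 1.0, 1.4),
    Word("editing", 1.5, 2.1),
]

edited = remove_fillers(transcript)
print(" ".join(w.text for w in edited))  # I love editing
print(to_cuts(edited))  # [(0.0, 0.2), (1.0, 2.1)]
```

Re-ordering works the same way: passing the words in a new order produces a new list of ranges, which a player or renderer could then stitch together.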

Editing with video-text, one deals directly with the ideas in the media, using the text as an interface hook into the video stream.

Taken to its extreme, this approaches the cut-up method of Burroughs. Here, the potential for malicious manipulation becomes evident: "I hate oreos. I love you." can quickly become "I hate ~~oreos. I love~~ you." More perniciously, with enough perfectly aligned phonemes one can compose words that were never said: audio deep-fakes. This may save a video professional from re-recording an interview due to a slip-up only noticed in post-production, but the convenience comes with a proportional undermining of claims to veracity.

A further dimension of visual perception and skimmability is afforded by forcing the waveform onto the page. When an experienced musician looks at sheet music, the visual shape of the notation transmits information about melody and rhythm. This action occurs prior to the reading of the notes, operating instead at the pre-conscious level that suggests, in an elementary case, a frame being uneven or a label off-centered. The written word, laid out on the page, permits purely visual comprehension operating on the shape of paragraphs and the size of words, some of which pop out saliently.

From abstracted notation to direct manipulation of signals to digital synthesis.

Consuming video-text thus assumes a dual nature: fast skimming of text, relying on our visual intuition, accompanied by slow focused listening of salient regions. Imagine the possibilities of new notations for the spoken word, with inscriptions above the words for tone, word-spacing according to velocity, or using color for different speakers. Thinking of text as an interface makes these kinds of ideas seem a little less absurd.

There are compelling parallels in the digitization of music production. Brian Eno writes lucidly in The Studio as Compositional Tool about how recorded material can be investigated in the manner of an archaeologist. Upon each listen, new details are discovered and treasured. Music "becomes a substance which is malleable and mutable and cuttable and reversible in ways that discs aren't." Eno writes, "[Tape] really put music in a spatial dimension, making it possible to squeeze the music, or expand it."

Connection: video-text as idea montage

Video-text gains traction when we begin to think, beyond a single piece of source material, towards assemblages. Excerpts from different material can be extracted as salient "highlights." Unlike a quote, destructive in nature, video-text's hyper-textual nature guarantees provenance. The highlights metonymically function as hooks into the source material; they become units with which you can think. One then operates by collage, throwing the highlights onto a canvas for connections to arise and new orders to emerge.

Here Brian Eno's writing again is relevant:

One becomes empirical in a way that the classical composer never was. You're working directly with sound, and there's no transmission loss between you and the sound - you handle it. It puts the composer in the identical position of the painter - he's working directly with a material, working directly onto a substance, and he always retains the options to chop and change, to paint a bit out, add a piece, etc.

The promise of video-text is to operate associatively with concepts extracted from real conversations, situated in material oral discourse rather than abstracted into the written-word.

Both Albers and Vertov operate by juxtaposition. On the left, the orange squares, despite being identical, appear different depending on their neighboring colors. On the right, it is not each image by itself, but their composition, that carries meaning. What if we treated ideas the same way?

We can trace the associative potential of simultaneously exposed symbols to Soviet montage theory, which emphasized the dialectical collision of images over the narrativization of a story. It is the film analogue of Albers's Interaction of Color: a color is not itself stable, but rather determined in association with its neighbors. Similarly, the potential of film came from mobilizing sentiment through carefully crafted juxtapositions and sequences. Dziga Vertov's Man with a Movie Camera is an effective example that prioritizes the metaphorical effect of juxtaposition. This view sees cinema not as a direct continuation of theater, but rather as a new medium with a grammar of its own [1].

Video-text builds on this history by allowing the consumer to invert the act of narrativization, breaking a linearized, guided path through ideas back into its constituent parts. This process is what Gordon Brander refers to, via Ted Nelson's Xanadu project, as thought legos. While Soviet montage was foremost about the design of montages with particular effects on a consumer, the malleability of video-text emphasizes remixing by the consumer themselves.

Infinite canvases provide a playground for ideas to connect, outside of the linearized form of writing or video.

It's worth emphasizing how video-text operates at the ethnographic level of the quotidian utterance. This grounds research by providing a clear provenance for an insight. We are accustomed to imbuing research with diagrams and images; the spoken word, however, remains difficult to transclude. With video-text, every word brings along when and where it was uttered. Quoting out of context becomes difficult, as the original cadence and tone are immediately accessible. You can even jump directly into the source material and explore beyond the bounds specified by the editor.
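One way to picture this kind of provenance is a highlight that stores a reference into its source rather than a copy of the words. The sketch below is an assumption about how such a structure might look, not a description of any existing system: `Highlight`, `resolve`, the `padding` parameter, the recording name, and the timings are all hypothetical. The text and time range are looked up on demand, and widening the lookup lets a reader step beyond the bounds the editor chose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds into the source recording
    end: float


@dataclass(frozen=True)
class Highlight:
    source: str  # identifier of the recording the words come from
    first: int   # index of the first highlighted word in that transcript
    last: int    # index of the last highlighted word (inclusive)

    def resolve(self, transcripts: dict[str, list[Word]], padding: int = 0):
        """Return (text, start, end), optionally widened by `padding` words."""
        words = transcripts[self.source]
        lo = max(0, self.first - padding)
        hi = min(len(words) - 1, self.last + padding)
        span = words[lo:hi + 1]
        return " ".join(w.text for w in span), span[0].start, span[-1].end


# Made-up recording name and timings for illustration.
transcripts = {
    "interview-07": [
        Word("I", 0.0, 0.2), Word("hate", 0.3, 0.7), Word("oreos.", 0.8, 1.4),
        Word("I", 2.0, 2.2), Word("love", 2.3, 2.7), Word("you.", 2.8, 3.2),
    ]
}

quote = Highlight("interview-07", first=3, last=5)
print(quote.resolve(transcripts))             # ('I love you.', 2.0, 3.2)
print(quote.resolve(transcripts, padding=3))  # ('I hate oreos. I love you.', 0.0, 3.2)
```

Because the highlight only points at the source, the surrounding context is always one lookup away.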

Recollection: video-text as instantaneous archive

The process of connection doesn't need to be constrained to the present, i.e., the couple of ideas hooked up in random-access memory. Thanks to its textual half, searching across large archives of video-text becomes as fluid as a Google search. A series of lecture videos becomes a curated database, removing the need to scrub through hours of video to find a particular comment. Notably, video-text goes beyond hand-tagging author, date of creation, or keywords.

Imagine exposing hundreds of hours of local city council meetings in a shared public archive. Searching for a particular project name may filter down to ten matches, the cross-section that matters for a particular voter. Or imagine a film historian investigating the varied notions of Arabness in the Jean-Luc Godard archive. This is a form of digital humanities that goes beyond aggregate statistics and begins to dive into the material.
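The city council scenario can be sketched in a few lines of Python. This is a naive phrase scan over an in-memory archive, under the assumption that each recording is stored as a list of words with start times. A real archive would likely use a proper index, and the names `Hit` and `search`, the recording identifiers, and the project name are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds into the recording


@dataclass(frozen=True)
class Hit:
    recording: str
    start: float  # where playback should begin
    context: str  # a few surrounding words, for skimming


def normalize(token: str) -> str:
    return token.lower().strip(".,!?;:\"'")


def search(archive: dict[str, list[Word]], phrase: str, window: int = 3) -> list[Hit]:
    """Find every occurrence of a phrase and return it with its timestamp."""
    target = [normalize(t) for t in phrase.split()]
    hits: list[Hit] = []
    if not target:
        return hits
    for recording, words in archive.items():
        tokens = [normalize(w.text) for w in words]
        for i in range(len(tokens) - len(target) + 1):
            if tokens[i:i + len(target)] == target:
                lo = max(0, i - window)
                hi = min(len(words), i + len(target) + window)
                context = " ".join(w.text for w in words[lo:hi])
                hits.append(Hit(recording, words[i].start, context))
    return hits


# Made-up meetings and project name for illustration.
archive = {
    "council-2022-03-01": [
        Word("The", 12.0), Word("Riverside", 12.3), Word("Park", 12.8),
        Word("proposal", 13.1), Word("passed.", 13.7),
    ],
    "council-2022-04-05": [
        Word("Funding", 40.0), Word("for", 40.4), Word("riverside", 40.6),
        Word("park", 41.0), Word("is", 41.3), Word("delayed.", 41.5),
    ],
}

for hit in search(archive, "Riverside Park"):
    print(f"{hit.recording} @ {hit.start:.1f}s: {hit.context}")
```

Each hit is both a line of text to skim and a position to play from, which is the point of the textual half.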

Vannevar Bush's MEMEX would allow perusing a personal archive of information, jumping from source to source, following the latent associations the brain conjured.

In As We May Think, Vannevar Bush emphasized a system that would allow us to think associatively, one thought snapping into focus immediately after the other. "Conceptual search," powered by recent large language models and paired with video-text, creates an engine for this kind of navigation of media. Beyond intentional exploration, there is also the possibility of serendipitous connection. In his book In the Blink of an Eye, the film editor Walter Murch writes fondly about watching footage in an almost arbitrary order, letting the spontaneous montage dictate the themes that would then solidify into a final edit. We can rescue his desire for spontaneity by stochastically serving a random assortment of thought legos, letting the mind go to work in finding the associations, a process reminiscent of the Oulipo methods of cut-up and statistical constraints [4].

Ben Grosser's Order of Magnitude and Sam Lavigne's Video Grep are two notable works in the super-cut genre.

One transliteration of this ethos into video-text is the super-cut, an artifact native to the medium. These super-cuts make discourse visceral, be it the megalomania of Zuckerberg's speeches or the uniformity of TED talks' speech style. Even simpler techniques, such as playing all the words in a speech in alphabetical order, immediately bring out a new dimension of understanding.

Paul Fry's Theory of Literature, filtered to -ing, alphabetically sorted.
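A super-cut of this kind reduces to a filter and a sort over timed words. The following Python sketch assumes the same hypothetical word-with-timestamps representation. The function name `supercut` and the sample words are invented for illustration and are not drawn from the lecture itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds into the source recording
    end: float


def bare(word: Word) -> str:
    """Lower-case the word and strip surrounding punctuation for comparison."""
    return word.text.lower().strip(".,!?;:\"'")


def supercut(words: list[Word], suffix: str) -> list[Word]:
    """Keep the words ending in `suffix`, ordered alphabetically.

    Ties are broken by start time, so repeated words play in source order.
    """
    selected = [w for w in words if bare(w).endswith(suffix)]
    return sorted(selected, key=lambda w: (bare(w), w.start))


# Made-up words and timings for illustration.
lecture = [
    Word("Reading", 3.0, 3.5),
    Word("is", 3.5, 3.6),
    Word("interpreting,", 3.7, 4.5),
    Word("not", 4.6, 4.8),
    Word("decoding.", 4.9, 5.6),
]

# Playback order: decoding, interpreting, reading.
for w in supercut(lecture, "ing"):
    print(f"{w.text:<14} play {w.start:.1f}s to {w.end:.1f}s")
```

The resulting list is a playlist of clips: each entry names the stretch of source audio to play next.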

Recollection can also serve in a group or institutional setting. A video-text archive serves as a collective video memory, aggregating discussions into a discursive bedrock that can be interrogated and referenced. User researchers can construct ethnographic artifacts to be shared with co-workers and stakeholders. Lawyers can scrutinize video evidence and build a compelling case. Documentarians can return to the source artifacts months or years later without losing context.

Circulation: video-text in between orality and literacy

Communication through video-text similarly benefits from the marriage of the situatedness of the spoken word, with its cadence and tone, and the analyticity of the written word. However, one must heed McLuhan's warning against obscuring a medium by naively using a prior one as content. Film is not recorded theater, nor is photography a perfect painting. Video-text is not simply transcribed speech. Looking attentively at how the parts interplay provides some clarity.

The act of transcription is destructive: orality is reduced to marks on a page, a symphony into mere sheet music. Consider the following renderings of an interview with Hunter S. Thompson:

He slurs his words in a characteristic drugged-up state. Note how, by sacrificing legibility, you can see Hunter's cadence and tone [2]. He jumps between lines of reasoning, eschewing grammatical structure. He begins a sentence with "I wouldn't," but follows it up with "sure." You can see him begin to utter "I wouldn't say" and then, mid-sentence, transition to "I wasn't sure." But why would he use proper grammar? He's not writing... He's talking! The coercion of living oral dialect into a transcript with proper grammar and diction can be useful, but it isn't necessarily accurate [3].

Drift3, an interface by Robert Ochshorn, author of Gentle, which graphs intonation alongside words.

The spoken word is helplessly redundant: it has to communicate ideas in a single go. The listener can't flip back to a previous page to remember what was said before. We stutter, grasping for the words at the tip of our tongues. Those gaps in time aren't superfluous though. They serve as lacunas in which the listener can continue to reshape the ideas in their head in a hermeneutic dialectic. The speaker strives for memorability, in lieu of a permanent record. In contrast, the written word strives for precision, consistency, and coherence. There's no negotiation between parties, striving for mutual understanding; only a reader confronting a static text. It can refer back to itself through demonstrative pronouns, sign-posting, and meta-textual devices such as indices.

A native video-text artifact goes beyond video and text, allowing the consumer to passively watch or actively explore. Here's an example artifact.

The solution is to not reduce either mode into the other: video-text bridges orality and literacy, allowing each to live in parallel, zippering together the sounds and written-words through interactivity. Video-text affords a new relationship, more intimate, less visceral, with the source material. It mirrors Walter Benjamin's observation on film: "the camera introduces us to unconscious optics as does psychoanalysis to unconscious impulses."

This form of interaction ricochets back into the process of creating a video-text artifact. The creator begins to consider both a reader, aiming for legibility and concision, and a listener who will spend time with the material. The small notes (e.g. "[mumbles]" or "[classical music plays]") that feel superfluous when the audio is playing can function as anchors into the text for the eye. The written word need not explain it all, as the music behind the words can always be played.

Reduct: a platform for video-text

Despite the age of mechanical reproduction affording a new materiality in media, creation has not been extended past a select few specialized audio-visual technicians. Certain subtle designs can democratize the wielding of media by providing a system that encapsulates the technical complexity behind an accessible interface. The key decision in video-text is to forgo completely the notion of timestamps, letting the written word be an anchor into the continuous domain of time. From that technical process emerges a metaphor, and from that metaphor a new vision of media.

Some of the ideas I've explored in this article are already manifest in Reduct, the platform for video-text which I help develop. Many other ideas emerging from the metaphor of video-text have yet to be made into reality.

Cristóbal Sciutto, October 2022.

[1] It is of note that there are no present technological barriers to montage: Vertov's film is from 1929. When the desired effect is juxtaposition, the quality of the image and the seamlessness of the editing are at most secondary, in contrast to visual effects' obsession with realism and modern filmmakers' endless debates over gear. ↩

[2] As you've ventured into the footnotes, you'll indulge me this Céline quote: "When you stop to examine the way in which our words are formed and uttered, our sentences are hard-put to it to survive the disaster of their slobbery origins. The mechanical effort of conversation is nastier and more complicated than defecation. That corolla of bloated flesh, the mouth, which screws itself up to whistle, which sucks in breath, contorts itself, discharges all manner of viscous sounds across a fetid barrier of decaying teeth—how revolting! Yet that is what we are adjured to sublimate into an ideal." ↩

[3] A similar tension arises with translated film subtitles. These are designed to be legible, grasped in a single look, so as not to steal the viewer's attention from the frame, where the actor's face and the surrounding environment provide key cues as to the meaning of the utterance. Imagine a subtitled comedy special, translated from a foreign language you don't speak. At the climax of a joke, the comedian pauses, building tension for the live audience. On your screen the punchline flashes, ahead of the comedian's delivery. Do you laugh immediately? Or do you wait for the figure on your screen to speak even if you don't understand them? ↩

[4] It's noteworthy that both desire computer systems to be assistants of human intuition (by recording threads of insight, or spurring the imagination), rather than systems with intelligence of their own. ↩