Style guide for transcripts and captions

This page provides information about formatting and styling text in captions and transcripts.

All video recordings must have captions, and all audio recordings (such as podcasts) must have transcripts. Learn more about the requirements for accessible video and accessible audio.

Elements of quality captioning

  • Accurate: Errorless captions are the goal. They should include proper punctuation and capitalization.
  • Synchronized: Captions should display precisely when words are spoken.
  • Consistent: Uniformity in style and presentation of all captioning features is crucial for viewer understanding.
  • Clear: A complete textual representation of the audio, including speaker identification and non-speech information.
  • Readable: Captions are displayed with enough time to be read completely and are not obscured by (nor do they obscure) the visual content.
  • Equal: Equal access requires that the meaning and intention of the material are completely preserved.


Definitions

  • Transcription: The process of converting audio content to text.
  • Transcript: The product of transcription. It may or may not be time-stamped, but it is always displayed in a separate section, window, or file. Transcripts enable a user to access all of the spoken content of a video or meeting. (This differs from captions, which display only small pieces of the audio content at a time.)
  • Captions: Audio content that is presented on-screen in short “chunks” of text, synchronized with the audio. They’re in the same language as the recording and should include all meaningful sounds, including speech, music/lyrics, and noises. Sometimes the term “subtitle” is used, but subtitles are a translation and do not include background sounds. The term “caption” may also be used to describe the output of live transcription.
  • Closed captions (preferred): Captions exist in a separate file. This allows different users on various media players to adjust the appearance to meet their needs.
  • Open captions: Caption text is part of the video file, so users cannot change the appearance. (Don’t do this.)
  • Caption frame: The audio content that is converted into captions is presented visually in chunks of text. Each of these “chunks” is a caption frame.
  • Line breaks: Where text “breaks” into a new line in a caption frame.
  • Caption break: Where text “breaks” between two caption frames.
  • Full verbatim: When captions and transcripts capture all speech, including um, ah, and you know. This approach is used for scripted speech (plays, TV) and court reporting.
  • Clean verbatim: Natural speech tends to be messier than scripted speech and may be difficult to follow if transcribed in full verbatim. This approach removes most of the filler words to improve comprehension.
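
To make the idea of closed captions as "a separate file" concrete, here is a minimal sketch that writes caption frames to text in the WebVTT layout. WebVTT is an assumption chosen for illustration (this guide does not name a file format), and the function names `vtt_timestamp` and `to_webvtt` are hypothetical.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format a time in seconds as hh:mm:ss.ttt, the WebVTT timestamp layout."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_webvtt(frames) -> str:
    """Build the text of a caption file from (start, end, text) tuples.

    Each tuple is one caption frame; times are in seconds.
    """
    blocks = ["WEBVTT"]  # required first line of a WebVTT file
    for start, end, text in frames:
        cue_timing = f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}"
        blocks.append(f"{cue_timing}\n{text}")
    # Cues are separated from each other by a blank line.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(to_webvtt([(0.0, 3.0, "[audience applauds]")]))
```

Because the caption text lives outside the video, a media player can restyle it to suit each viewer, which is why closed captions are preferred over open captions.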

Font and formatting

  • Use mixed case (use all caps only for YELLING or speaker identification).
  • Use white text, in a medium weight, sans serif font, with a drop or rim shadow. Ideally, captions should have a dark, translucent box as background.
  • Center the caption block on screen, with the lines left-aligned within it.
  • Ideally, use no more than two lines per caption frame. More lines may be used occasionally if using two lines will interfere with visuals.
  • Place captions at the bottom unless it interferes with visuals/graphics, in which case they can be placed elsewhere.
  • If transcribing math content, use only numerals. For all other topics, write out the numbers one through ten (one, two) and use numerals for larger numbers (11, 53, 978), or use a combination for large, rounded numbers (3 million).
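
The number rule above can be sketched as a small helper. This is an illustration only: the name `caption_number` is hypothetical, "large, rounded" is assumed to mean a whole multiple of a million or more, and numbers below one (which the guide does not address) are left as numerals.

```python
# Spelled-out forms for one through ten.
WORDS = ["one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "ten"]

# Scale words for large, rounded numbers, largest first (assumed cutoff: million).
SCALES = [
    (1_000_000_000_000, "trillion"),
    (1_000_000_000, "billion"),
    (1_000_000, "million"),
]


def caption_number(n: int, math_content: bool = False) -> str:
    """Render an integer the way the number rule describes."""
    if math_content:
        return str(n)  # math content: numerals only
    if 1 <= n <= 10:
        return WORDS[n - 1]  # write out one through ten
    for value, name in SCALES:
        if n >= value and n % value == 0:
            return f"{n // value} {name}"  # numeral plus scale word
    return str(n)  # everything else: numerals


if __name__ == "__main__":
    for n in (2, 11, 978, 3_000_000):
        print(caption_number(n))
```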

Caption duration and line length

Generally, caption frames should:

  • Be onscreen for 2-6 seconds (depending on the amount of text)
  • Have 1-2 lines of text
  • Aim for a maximum of 32 characters per line. (Do not exceed 42 characters.)
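
As a rough illustration, the limits above can be checked automatically. The sketch below assumes a caption frame is represented by a start time, an end time (both in seconds), and a list of text lines; the `CaptionFrame` class and `check_frame` function are hypothetical names, not part of any captioning tool.

```python
from dataclasses import dataclass


@dataclass
class CaptionFrame:
    start: float       # seconds
    end: float         # seconds
    lines: list[str]   # text lines shown together on screen


def check_frame(frame: CaptionFrame) -> list[str]:
    """Return a list of guideline warnings for one caption frame."""
    problems = []
    duration = frame.end - frame.start
    if not 2.0 <= duration <= 6.0:
        problems.append(f"duration {duration:.1f}s is outside 2-6 seconds")
    if not 1 <= len(frame.lines) <= 2:
        problems.append(f"{len(frame.lines)} lines (expected 1-2)")
    for number, line in enumerate(frame.lines, start=1):
        if len(line) > 42:
            problems.append(f"line {number} exceeds 42 characters")
        elif len(line) > 32:
            problems.append(f"line {number} is over the 32-character target")
    return problems


if __name__ == "__main__":
    frame = CaptionFrame(0.0, 1.0, ["x" * 43])
    print(check_frame(frame))
```

The results are warnings rather than hard failures, since the guide allows occasional exceptions (for example, extra lines when two lines would interfere with visuals).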

Line breaks and caption frame breaks

When considering where to place line breaks and how to break up text across caption frames:

  • Make use of natural pauses in speech.
  • When feasible, observe DCMP guidelines about line breaks, which suggest breaks based on grammatical considerations. In general, avoid splitting prepositional phrases or separating modifiers from the words they modify.
  • Two shorter lines are more readable than one very long line.
  • Avoid putting the last word of a sentence on the next caption frame.
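
The heuristics above can be approximated in code. The sketch below is a simplification, not the DCMP algorithm: it splits one caption into two lines, prefers balanced lines and breaks after punctuation, and penalizes breaks right after an article or preposition. The word list `KEEP_WITH_NEXT`, the scoring weights, and the function name `split_caption` are all assumptions made for illustration.

```python
import string

# Words that should stay on the same line as the word that follows them.
# This short list is an assumption; a real tool would need a fuller one.
KEEP_WITH_NEXT = {"a", "an", "the", "of", "in", "on", "at",
                  "to", "for", "with", "by", "from"}


def split_caption(text: str, max_len: int = 32) -> list[str]:
    """Split a caption into at most two lines at the best-scoring word gap."""
    words = text.split()
    if len(text) <= max_len or len(words) < 2:
        return [text]
    best_score, best_lines = None, [text]
    for i in range(1, len(words)):
        first = " ".join(words[:i])
        second = " ".join(words[i:])
        last_word = words[i - 1]
        score = abs(len(first) - len(second))      # prefer balanced lines
        if last_word[-1] in ",;:.?!":
            score -= 10                            # prefer natural pauses
        if last_word.lower().strip(string.punctuation) in KEEP_WITH_NEXT:
            score += 20                            # avoid stranding "the", "of", ...
        if max(len(first), len(second)) > max_len:
            score += 100                           # avoid overlong lines
        if best_score is None or score < best_score:
            best_score, best_lines = score, [first, second]
    return best_lines


if __name__ == "__main__":
    for line in split_caption("These aren't the droids you're looking for."):
        print(line)
```

A human captioner should still review the output; grammatical phrasing and the timing of pauses in speech cannot be judged from the text alone.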

Speaker identification

If there are two or more speakers, use speaker identification, both for captions and transcripts.
Speakers should be identified every time a new person speaks.

Display the speaker’s name in all caps with a colon. Example:

OBI-WAN: These aren’t
the droids you’re looking for.

If the speaker’s name is unknown, some alternatives are: STUDENT, FEMALE SPEAKER, AUDIENCE MEMBER, PROFESSOR. If there are multiple unknown speakers, use numbers: STUDENT #1, STUDENT #2.
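
A minimal sketch of these labeling rules, assuming the transcript is available as a list of (speaker, text) pairs. The function names `speaker_label` and `label_turns` are hypothetical.

```python
def speaker_label(name: str) -> str:
    """Format a speaker name in all caps followed by a colon."""
    return name.strip().upper() + ":"


def label_turns(turns) -> list[str]:
    """Prefix text with a speaker label each time the speaker changes.

    `turns` is a sequence of (speaker, text) pairs in spoken order.
    """
    labeled = []
    previous = None
    for speaker, text in turns:
        if speaker != previous:
            labeled.append(f"{speaker_label(speaker)} {text}")
        else:
            labeled.append(text)  # same speaker continues: no new label
        previous = speaker
    return labeled


if __name__ == "__main__":
    turns = [
        ("Professor", "Any questions?"),
        ("Student #1", "Is this on the exam?"),
        ("Student #1", "And is it open book?"),
        ("Student #2", "When is the exam?"),
    ]
    for line in label_turns(turns):
        print(line)
```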

Best practices for captioning various situations

  • Include meaningful background sounds in brackets. Example: [audience applauds]
  • Only include background sounds if they’re important to the plot, or add meaning / context.
  • Include the source of background sounds.
  • Sounds can be described [dog growling], spelled out [grrrrrrr], or both:
    [dog barking] 
    Woof, woof!
  • Describe musical style if pertinent. Example: [melodic classical music]. Do not include background music in captioning if it will interfere with text captions. If the background music plays for a long time, stop the caption frame after 4-5 seconds.
  • If you include captions for song lyrics, add a musical note (♪) to the beginning and end of each line.
  • If a phrase is spoken as a statement / command, but is grammatically a question, use punctuation to indicate the usage is a statement. Example:
    • Why don’t you come in and shut the door. (CORRECT, if spoken as a command.)
    • Why don’t you come in and shut the door? (INCORRECT, unless spoken as an actual question.)
  • Use punctuation to indicate the speed or pace of a sound effect: ellipsis for extended pauses, commas for brief breaks, and dashes for quick repetition. (Example: Oh... my... g-g-god. Oh, Em, Gee.)
  • For lists, use the serial (or Oxford) comma, per DCMP.
  • When you aren’t sure, use the version that’s easiest to read and understand:
    • Don’t do this: Berkeley dot E-D-U
    • Do this:
  • It may sometimes be appropriate to add a description of the speaker’s tone in brackets. Example: [whisper]
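
Two of the conventions above, bracketed sound descriptions and musical notes around song lyrics, are mechanical enough to sketch in code. The helper names `sound` and `lyric_lines` are hypothetical.

```python
def sound(description: str) -> str:
    """Wrap a sound or tone description in brackets."""
    return f"[{description.strip()}]"


def lyric_lines(lyrics: str) -> list[str]:
    """Add a musical note to the beginning and end of each lyric line."""
    return [f"♪ {line.strip()} ♪"
            for line in lyrics.splitlines()
            if line.strip()]  # skip blank lines


if __name__ == "__main__":
    print(sound("dog barking"))
    for line in lyric_lines("La la la\nDo re mi"):
        print(line)
```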

