Transcript Document Model
- Document Number
- Current Version
- 1.0 ( http://www.legalxml.org/workgroups/substantive/transcripts/WD_10002.shtml)
- Previous Version(s)
- 0.8, 0.9.
- Workgroup Information
- Workgroup Name: Transcripts
- Workgroup Chair(s): Davin
Eddie O'Brien (firstname.lastname@example.org)
- Workgroup Mailing List:
- Workgroup Mailing
List Archive: http://www.legalxml.org/Archive/Transcripts.html
- Workgroup Website: http://www.legalxml.org/workgroups/substantive/transcripts/
- Document Author(s)
- Davin Fifield (email@example.com)
- Document Editor(s)
- Short Statement of Status
- Final Preliminary Model
This document defines the document model that will
be the subject of the LegalXML Transcript working group's
focus when developing the proposed LegalXML-Transcript
Status of Document
This is the final working draft of the Legal XML Working
Group for review and discussion by interested Members.
It is expected that this document will stay in its current
working draft form as work on the proposed LegalXML
Transcript Standard takes place. The document model
described herein is intended to serve as a reference
point for the group's efforts when developing that standard.
Table of Contents
2.1 Overview of a Transcript
2.2 Front Matter
2.4 End Matter
3. Other Notes
3.1 Paragraph Numbers
3.2 Use of Element IDs
3.3 It's the Content Stupid
4. Appendix: XML Expressions
Before beginning to define the LegalXML transcript
standard, the working group needs a model of the transcript
data being represented. This document will never reach
Recommendation stage; instead, it will become an Unofficial
Note to be used as the basis for further documents.
This document builds on the Unofficial Note XML
Standards for Legal Transcripts written by Eddie
O'Brien and Chris Priestley. It is dependent on the
charter for the Transcript working group, and builds
on an e-mail from Allison Stanfield detailing some proposed
transcript tags. It also incorporates additions from
a paper written by the Australian NSW Attorney-General's
Department on the use of XML for transcripts. The e-mail
to the Working Group entitled "Legal XML transcript
container structure" from Eddie O'Brien is also
important as the basis for the Proceedings portion of
It is imperative to note that the pseudo-XML markup
used in this document is not intended to be the basis
for the actual XML tags and structure of the final XML-Transcript
standard. On the contrary, this document should be read
as the requirements for the XML-Transcript standard
developed by the working group.
The data model represented here meets the following
- Describes both dialogue and non-dialogue information
in the transcript.
- Is independent of transcription method.
- Describes current timestamping practices.
- Incorporates the page/line and paragraph-based content
models for the spoken content.
- In most transcripts there is an Appearances section
in which the attorneys that were present are identified.
- An individual that will ask questions of witnesses,
make objections to questions asked by opposing counsel,
or make spoken arguments to the judge in favor or
against certain motions or judgements.
- A certified transcript in the paper world is a non-draft
transcript that has a certificate page with the reporter's
signature on it. In the electronic world, it is a
fully scoped transcript containing no untranslated
steno with an electronic signature attached. See also
draft and uncertified.
- The portion of a transcript that is not recorded
as a Q/A exchange, but as between two specifically
named persons. E.g.
MR OBJECTOR: I object.
MR INTERROGATOR: On what grounds?
MR OBJECTOR: On the coffee grounds.
- This will often be dialogue spoken between attorneys,
without the participation of a witness, and is often
indented further than the rest of the transcript when
- e.g. United States District Court or County
Court or Supreme Court or Philadelphia
Court of Common Pleas
- Also referred to as unscoped or unedited. A version
of a transcript in which untranslated steno may
still appear. A draft transcript is never certified
by a reporter.
- A referenced item entered into evidence. Exhibits
may be offered, refused, accepted or withdrawn.
- item Marked for Identification
- An item that is not (yet) an exhibit that has been
marked for identification during the proceedings.
- e.g. District of Colorado or El Paso County,
or Common Law Division.
- A section in a deposition in which the assumptions
and agreed rules under which the deposition is being
taken are stated.
- For the purposes of this document, an uncertified
transcript is a non-draft transcript that has not
[yet] been certified, but is capable of being certified.
It is fully scoped, containing no untranslated steno,
but having no attached signature.
- voir dire
- Jury selection questioning, or examination of a
witness by a judge.
- An individual that will be asked questions as part
of the proceedings. Questions may be asked by attorneys
2.1 Overview of a Transcript
From a structural stand-point, a conventional transcript
in today's world looks like this at the top level:
In plain English, this corresponds to:
- Optional non dialogue content at the beginning of
- The verbatim record of the proceedings.
- Optional non dialogue content at the end of the
2.2 Front Matter
In today's transcripts, there is generally
a single title page. However, other content following
the title page also exists that is not part of the verbatim
record of the proceedings that the transcript represents.
Here is a sample title page followed
by the appearances, with the original layout preserved:
1 IN THE UNITED STATES DISTRICT COURT
FOR THE DISTRICT OF COLORADO
Criminal Action No. 96-CR-68
UNITED STATES OF AMERICA,
TERRY LYNN NICHOLS,
10 (Trial to Jury: Volume 151)
12 Proceedings before the HONORABLE RICHARD P. MATSCH,
13 Judge, United States District Court for the District of
14 Colorado, commencing at 8:30 a.m., on the 2d day of January,
15 1998, in Courtroom C-204, United States Courthouse, Denver,
24 Proceeding Recorded by Mechanical Stenography, Transcription
Produced via Computer by Paul Zuckerman, 1929 Stout Street,
25 P.O. Box 3563, Denver, Colorado, 80294, (303) 629-9285
2 PATRICK RYAN, United States Attorney for the Western
3 District of Oklahoma, and RANDAL SENGEL, Assistant U.S.
4 Attorney for the Western District of Oklahoma, 210 West Park
5 Avenue, Suite 400, Oklahoma City, Oklahoma, 73102, appearing
6 for the plaintiff.
7 LARRY MACKEY, SEAN CONNELLY, BETH WILKINSON, GEOFFREY
8 MEARNS, JAMIE ORENSTEIN, and AITAN GOELMAN, Special Attorneys
9 to the U.S. Attorney General, 1961 Stout Street, Suite 1200,
10 Denver, Colorado, 80294, appearing for the plaintiff.
11 MICHAEL TIGAR, RONALD WOODS, ADAM THURSCHWELL, REID
12 NEUREITER, and JANE TIGAR, Attorneys at Law, 1120 Lincoln
13 Street, Suite 1308, Denver, Colorado, 80203, appearing for
14 Defendant Nichols.
For interest's sake, note that content may appear on
the lines in between numbered lines, and that
there is a page break that is not explicitly represented
here between the first and second physical pages of this
Rather than trying to mark up this content with XML,
a model of the data contained in transcript front matter
is shown below:
The order of the elements subsequent to the TitlePage
is not important.
2.2.1 Title Page
The following shows the information that may appear
on the Title Page:
| "State"| "*")
| "Hearing" |
"Motion" | "Bond"
| "NonJury" |
| "*") Name=""/>
The order in which these elements appear is largely
unimportant. Again, this is a document model, intended to
represent that those attributes exist at a conceptual
level, not how they are represented. (For example, Name
and Address will be tags as defined by the Horizontal
WG, not attributes of a ReporterInfo tag.)
It may not necessarily be the case in the actual XML-Transcript
standard that the TitlePage information as shown above
is required to appear on a single page. In fact, the
meta-information shown here may be better represented
in a section separate from the main content of the transcript
itself. These will be matters for the working group
DayOfHearing would be a sequential indicator of how
many days of hearing preceded this one (e.g. Day 12).
The appearances may either appear on the title page,
or later in the transcript:
Each PartyRef tag for the Legal Representative indicates
an additional client they are representing.
Stipulations state agreed upon rules for a deposition,
particularly where the parties involved are from different
jurisdictions. Stipulations may need further breakdown
at some future time, but for now they will be treated
as a single chunk of text.
There are two main types of indexes, exhibit indexes
and tables of contents.
Party=("Plaintiff" | "Defendant")>
| "Refused" |
"Accepted" | "Withdrawn")
Table of Contents (sometimes only contains witnesses):
| "Witness" |
There are two types of top-level entries:
Headings and Witnesses. Examinations of any type, occur
within Witnesses. The ContinuedFlag indicates a continuation
of an interrupted examination. A typical (albeit very
lengthy) exchange might be:
<break for lunch>
In this example, the Redirect (continued)
would be marked with ContinuedFlag="Yes",
whereas the Further Redirect would not.
One final property of a transcript that
does not directly correspond to any one part of an original
ASCII transcript is the CertifiedState which indicates
whether the transcript is a Draft, an Uncertified copy
or a Certified copy. See the glossary for definitions
of these terms.
The proceedings are the most important part of the
transcript, as they are its raison d'être.
2.3.1 A Logical Model
The bulk of a transcript consists of exchanges between
participants. Each individual's turn at speaking (transcripts
assume that only one person is speaking at once!) is
typically marked with either "THE COURT:",
"Q.", "A.", "THE DEFENDANT:"
or the witness or attorney's name, "MR. SMITH:".
Therefore, one way of modeling the contents of the
proceedings is as follows:
SpeakerID?=(PartyID | JudgeID | LegalRepresentativeID)
<!-- Text of paragraphs
of the speech go here -->
I.e. one or more speeches, each containing one or more paragraphs. The
SpeakerName would be permitted when the speaker is not
identified as a party, judge or legal representative.
Either the SpeakerID or the SpeakerName MUST be present.
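The constraint just stated can be sketched in code. The following is a minimal illustration only, assuming that "either" means at least one of the two must be present; the Python class and field names are hypothetical stand-ins for the conceptual Speech, SpeakerID and SpeakerName, not proposed markup.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Speech:
    """One speaker's turn: one or more paragraphs plus a speaker reference."""

    paragraphs: List[str]
    # Hypothetical reference to a party, judge or legal representative.
    speaker_id: Optional[str] = None
    # Free-text name, used when the speaker is none of the above.
    speaker_name: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumption: at least one of SpeakerID / SpeakerName must be given.
        if not self.speaker_id and not self.speaker_name:
            raise ValueError("a Speech needs a SpeakerID or a SpeakerName")
        # The model calls for one or more paragraphs per speech.
        if not self.paragraphs:
            raise ValueError("a Speech needs at least one paragraph")


def is_valid_speech(**kwargs) -> bool:
    """Return True if the given fields form a Speech the model permits."""
    try:
        Speech(**kwargs)
        return True
    except ValueError:
        return False
```

For instance, a colloquy line such as "MR OBJECTOR: I object." would pass with only a speaker_name, whereas a speech with neither identifier would be rejected.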
In fact, we must also permit non-spoken information,
intermingling asides by the reporter, as well as headings,
into the transcript.
The top-level non-spoken items present in a transcript
are the Witness and Section headings, and these in fact
contain the speeches:
(Witness | Section)*
I.e. the proceedings are a sequence of one or more
witnesses or general sections.
In the case that there are no explicit sections, there
is an implicit section at the start of the Proceedings
containing the Speeches, ExhibitEvents and Asides.
Speeches are as defined above, and Asides are simply
textual data. An ExhibitEvent is probably something
Type=("Offered" | "Refused" |
"Accepted" | "Withdrawn" | "MFI")
Party=("Plaintiff" | "Defendant")/>
This logically models the content of the proceedings
- to a certain extent. It does not begin to deal with
timestamp data throughout the transcript, or the existing
situation of transcripts organized into pages, with
line numbers on each page. Nor does it reflect the reality
that exhibit events and asides may well interrupt speeches.
This would be easy to do at the paragraph level with
small modifications to this model (simply by rippling
ExhibitEvents down to the same level as Paragraphs within
a Speech), and it is conceptually not harder to interrupt
paragraphs themselves (you would simply need to model
textual chunks that may or may not start paragraphs,
and allow ExhibitEvents and Asides to be interspersed
among them). However, this is not considered further
here, as the real intricacies are apparent when we combine
this logical model with the physical model.
2.3.2 A Physical Model
Transcripts are presently saved and
presented to end-users - whether they are on paper or
in electronic form - in a page-based format. Each line
of the Proceedings appears on a distinct numbered line
on a page, although those in the FrontMatter and EndMatter
may sometimes have more than one physical line for a
given numbered line, (see the earlier example of the
Title Page), or lines may be completely unnumbered.
The model itself is a simple one. But
some important points to note are as follows:
1. Page numbers don't always start at one.
2. Page numbers aren't always present on every
3. Page numbers aren't always sequential.
4. Line numbers don't always start at one.
5. Line numbers aren't always present on every physical
6. Line numbers aren't always sequential.
These requirements can be due to redaction
(omitted portions of the transcript); because some states
don't number lines starting from 1 (e.g. New York);
due to transcripts being continued from previous sessions;
or because FrontMatter and EndMatter pages may simply
not require page numbers.
Here then, is a physical model for transcripts:
or footer text goes here"/>
<!-- Content of the
line goes here -->
Although this is shown as being the
internal definition of the Proceedings, in fact the
entire transcript would be paginated this way, including
the FrontMatter and the EndMatter.
Note the Timestamp attribute on the
Line tag. The use of this attribute permits synchronization
with other media types for later playback, and for simply
recording the time-flow of the proceedings.
Although this document does not claim to even be a
straw man for XML-Transcript markup, there is one important
issue that needs resolving. How do we reconcile the
Logical and the Physical models for transcripts?
The two models are fundamentally incompatible if a
well-formed XML structure is to result. The two alternatives
(i) The tags required for the physical model are modified
to point-tags. (See below.)
(ii) The tags required for the logical model are modified
By point tags, I mean an XML tag that has no content
aside from attributes or other tags that are themselves
If point-tags were not to be used, physical and logical
elements would often overlap, as in:
which is not well-formed XML.
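The overlap problem can be checked mechanically. The sketch below is illustrative only: the element names (Proceedings, Line, Paragraph, NewLine) are hypothetical stand-ins rather than proposed markup, and Python's standard xml.etree.ElementTree parser is used simply to show that overlapping containers are rejected while a point-tag version of the same text parses.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Physical (Line) and logical (Paragraph) containers overlap: the Paragraph
# opens inside the first Line but closes inside the second one.
OVERLAPPING = (
    "<Proceedings>"
    "<Line><Paragraph>the key that Tim had left</Line>"
    "<Line>at my house.</Paragraph></Line>"
    "</Proceedings>"
)

# The same text with the physical model reduced to an empty point tag.
POINT_TAGGED = (
    "<Proceedings><Paragraph>"
    "<NewLine/>the key that Tim had left"
    "<NewLine/>at my house."
    "</Paragraph></Proceedings>"
)

if __name__ == "__main__":
    print(is_well_formed(OVERLAPPING))
    print(is_well_formed(POINT_TAGGED))
```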
Before resolving this issue, it is worth noting that
there are some very good reasons to retain the physical
page/line information of the transcript:
- Any XML-Transcript format should be able to handle
transcripts that are in existence today, as well as
those that are generated going forward. Presently
all transcripts in the United States, and in some
other countries as well, are marked with page and
line numbers on each page.
- Existing transcript citations are made to a (case,
volume, page, line) tuple, or similar. It is unlikely
that the vast body of transcripts existing today will
be reformatted to have citable paragraph numbers.
- It is impossible to overstate the importance of
the content at a given citable location being constant
throughout the lifetime of a transcript. Therefore,
for transcripts without page/line information it would
be important to either (a) ensure that NO citations
were ever made to that transcript via page and line
or (b) that the page and line information were stored
with the transcript itself. Since (b) is in concert
with the status quo, it seems it is the path of least
resistance, and the most likely to ensure a usable
When originally thinking about this issue, I was keen
to try to preserve the Page and Line elements as the
top content level of the XML-Transcript document model.
My reasons were partly selfish: I felt they would have
mapped more cleanly to existing (non-XML) based technologies
I have been working with. However, I have now come around
to the opinion that the cleaner alternative is for the
page/line tags to be point-based. My reasons for forming
this opinion are:
- The logical model is an important reflection of
the document's organization, and requiring it to be
compromised in the XML hierarchy would cause confusion
and difficulty when trying to use and extend the standard
(and when being used as part of other legal standards).
- The physical (page/line) model, although it gives
a guaranteed maximum granularity for the content (e.g.
lines are no longer than 80 characters, pages are
no longer than 66 lines) is not the representation
of the content that yields the most value from the
content. In fact, many of today's challenges in transcript
processing deal with trying to retroactively fit some
logical structure onto paginated transcripts. If this
structure were the top-level content, the value and
navigability of transcripts would be greatly increased.
- Any page is still random-accessible when an entire
transcript has been parsed. In practice, in order
to find the content of a particular page under the
physical model, one would've needed to parse up until
the corresponding </page> for a <page>
tag. This is equivalent to parsing to a given point-based
<StartPage> tag and then continuing until a
matching <EndPage> (or subsequent <StartPage>)
is found. Hence the speed of parsing cannot be used
as a reason for or against either of these models.
- The logical model has a greater number of tags that
would need to become point-based. The physical model
only has Page and Line tags that need to be spliced
into the logical model.
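To illustrate the random-access argument in the list above, here is a rough sketch of locating one page's lines when pages and lines are point tags. Everything named in it is an assumption made for illustration: the StartPage and NewLine element names, the ID and Header attributes, and the ID values.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, List, Tuple

# Hypothetical point-tagged fragment; a single Paragraph spans a page break.
SAMPLE = """<Proceedings><Speech><Paragraph>
<StartPage ID="P000000"/><NewLine/>the key that Tim had left at my house. I got into the storage
<StartPage ID="P000001" Header="Michael Fortier - Direct"/><NewLine/>unit, and I set the tank down directly to my right, and then I
<NewLine/>locked up the unit and left.
</Paragraph></Speech></Proceedings>"""


def walk(elem: ET.Element) -> Iterator[Tuple[str, object]]:
    """Yield ('tag', element) and ('text', str) events in document order."""
    yield ("tag", elem)
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield from walk(child)
        if child.tail:
            yield ("text", child.tail)


def page_lines(xml_text: str, page_id: str) -> List[str]:
    """Collect the physical lines between one StartPage and the next."""
    lines: List[str] = []
    collecting = False
    for kind, value in walk(ET.fromstring(xml_text)):
        if kind == "tag":
            if value.tag == "StartPage":
                # Each StartPage switches collection on or off.
                collecting = value.get("ID") == page_id
            elif value.tag == "NewLine" and collecting:
                lines.append("")
        elif collecting and lines:
            lines[-1] += value
    return [line.strip() for line in lines]
```

Calling page_lines(SAMPLE, "P000001") walks the logical hierarchy once and returns only the two lines of the second page, which is the same amount of parsing a nested page container would have required.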
Here is a combined data model:
(Witness | Section)*
No?="" Name="" Party=("Plaintiff"
Party?=(etc.) Type=("Direct" | etc.)
ContinuedFlag=("Yes" | "No")
FurtherFlag=("Yes" | "No")
| "Answer" | "Other")
| "Judge" | "Attorney")
Note that any bottom-level tag that permits actual
content to appear may cause a page or line wrap to occur,
and so must allow the physical tags (Page and Line)
to appear. Hence both Paragraph and Aside must allow
Pages and Lines within them. This would need to
be extended to the content containing tags of the FrontMatter
The Page construct is replaced by a point tag that
indicates the start of a new page, and contains only
header/footer information (although in practice, at
least one NewLine must follow immediately after a Start
or footer text goes here"/>
Rather than encapsulating the content of a line, each
Line tag now indicates the start of a new line, contains
no textual data, and may be collapsed to a point-tag:
ExhibitEvents were point-based tags in the logical
model shown in the previous section, and as they contain
no content, they do not require modification. However,
since an exhibit-event (unlike an Aside, which the Reporter
has traditionally only placed between paragraphs) may
occur at any time, it is also permitted inside the Proceedings
wherever Page and Line tags are permitted.
The end result successfully models even the most complicated
content. For example, the following extract:
22 Q. Tell us what happened on the night that you went back to
23 the storage unit.
24 A. I drove to the storage unit, and I used the combination and
25 the key that Tim had left at my house. I got into the storage
Michael Fortier - Direct
1 unit, and I set the tank down directly to my right, and then I
2 locked up the unit and left.
Would be modeled as:
Q. Tell us what happened
on the night that you went back to
the storage unit.
A. I drove to the storage
unit, and I used the combination and
the key that Tim had left
at my house. I got into the storage
Fortier - Direct"/>
unit, and I set the tank down
directly to my right, and then I
locked up the unit and left.
In this particular example, each speech is only one
paragraph. However, the proposed model gives the flexibility
for speeches to be multiple paragraphs in length.
The possibility within this model of putting timestamps
on each line reflects the current capability of most
CAT software to do so.
2.4 End Matter
The EndMatter contains optional indexes and tables
of contents, and also the Reporter's Certificate.
The indexes are as defined in the FrontMatter, and
the Certificate would probably be modeled as textual
data in the first instance, although it may also contain
links to an electronic signature, if such were to be
present on the transcript. In the absence of such a
signature, it might still contain information about
the reporter, as is optionally present in the FrontMatter
3. Other Notes
3.1 Paragraph Numbers
Paragraph numbers have not been shown in this model,
as in the current situation (at least in the US) paragraphs
are not numbered. It would be a simple matter to extend
the logical model to permit numbers on each paragraph.
3.2 Use of Element IDs
It is important that the LegalXML Transcript standard
that is developed for transcripts uniquely determines
an identity for each paragraph/page and line in the
system. In today's world, it is actually possible for
volumes to be combined into the same physical transcript,
and for page numbers in those volumes to both begin
at "1". For this and other similar reasons,
it is proposed that each Page and Line tag be assigned
an ID as follows:
That is, a page receives an ID that is always zero
padded to 6 digits (this will support trials up to ten
times longer than the longest trial to date), and each
line is referred to with an ID that begins with the
page ID and ends in the 0-based index of the line.
My inclination (since this is machine-readable data)
is to number page IDs beginning at 0, not at one. They
would always be sequentially assigned in a given official
transcript, and would provide a useful random access
mechanism to the content. They would NOT correspond
to the actual page numbers assigned to that transcript.
By explicitly numbering pages with these 0-based indexes
(they would also have page numbers as a separate and
independent attribute) you free yourself of concerns
about relying on an underlying XML parser to provide
a readily usable random access mechanism for objects
in the hierarchy.
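As a rough sketch of the ID scheme described above: the "P" prefix and the "L" separator are assumptions made here for illustration (an XML ID value must begin with a letter or underscore, so a bare number would not do), and the line index is left unpadded because the text does not specify a width.

```python
def page_id(page_index: int) -> str:
    """ID for the page at a 0-based sequential index, zero padded to 6 digits."""
    if not 0 <= page_index <= 999_999:
        raise ValueError("page index must fit in 6 digits")
    # Assumption: a leading "P" keeps the value usable as an XML ID.
    return f"P{page_index:06d}"


def line_id(page_index: int, line_index: int) -> str:
    """ID for a line: the page ID followed by the 0-based line index."""
    if line_index < 0:
        raise ValueError("line index is 0-based and cannot be negative")
    # Assumption: an "L" separator between the page ID and the line index.
    return f"{page_id(page_index)}L{line_index}"


if __name__ == "__main__":
    # Two volumes whose printed page numbers both start at "1" still receive
    # distinct IDs, because IDs follow physical order, not printed numbers.
    printed_numbers = ["1", "2", "1", "2"]
    for index, printed in enumerate(printed_numbers):
        print(page_id(index), "printed page", printed)
```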
If (when) paragraphs are numbered, it would be a simple
matter to do a similar ID for them. Again, this will
deal with situations, such as redacted transcripts, that
cause gaps and non-zero (or non-one) based numbering
to rear their ugly heads.
3.3 It's the Content Stupid
I have been working under an important assumption when
thinking about the Legal-XML Transcript format: That
there would be a single element in the document that
would be the home for exactly the text that today constitutes
the transcript. In other words, that there would be
some pair of tags, e.g. <Body> </Body>,
which would embrace other textual data and tags where
if the tags were to be removed you would be left with
no more and no less than the content of the transcript.
There are several reasons for assuming this: Organizations
that make use of the format should be able to use the
logic in their existing parsing engines where possible.
Similarly, writing search-engine filters for full-text
indexing should be simplified to the greatest extent
possible. If this means writing a minimalist XML-Transcript
parser that takes the text between <Page> and
<Line> tags within the <Body> section, and
ignores all the logical tags, then that should be supported.
In particular, knowing that certain content is NOT part
of the actual transcript (including headers and
footers) should not be required. Any character data
that is not transcript content (including headers and
footers as shown in this document model) should appear
as attributes, or as meta-data in the non Body part
of the transcript.
The working group will need to decide where to place
the meta-data, and if it makes sense to optionally permit
some of it to appear within the <Body> (provided
it meets the above conditions).
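The assumption in this section can be illustrated with a minimalist extractor that ignores every tag and keeps only the character data inside the body. This is a sketch under stated assumptions: the Transcript, Meta, Body, StartPage and NewLine names and the Header attribute are hypothetical, and whitespace is normalized so that the result is easy to compare.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: meta-data sits outside Body, and the page header is
# carried in an attribute rather than as character data.
DOC = (
    "<Transcript>"
    "<Meta><CertifiedState>Draft</CertifiedState></Meta>"
    "<Body><Speech><Paragraph>"
    "<NewLine/>I got into the storage "
    '<StartPage Header="Michael Fortier - Direct"/>'
    "<NewLine/>unit, and I set the tank down directly to my right, and then I "
    "<NewLine/>locked up the unit and left."
    "</Paragraph></Speech></Body>"
    "</Transcript>"
)


def body_text(xml_text: str) -> str:
    """Return the character data of Body with all tags removed."""
    body = ET.fromstring(xml_text).find("Body")
    if body is None:
        raise ValueError("document has no Body element")
    # itertext() yields every text node in document order, skipping tags
    # and attribute values.
    return " ".join("".join(body.itertext()).split())


if __name__ == "__main__":
    print(body_text(DOC))
```

Because the header lives in an attribute and the CertifiedState lives outside Body, neither leaks into the extracted text, so a full-text indexer would not need to know anything about the logical tags.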
This document has not addressed indenting of lines,
which is important whenever the transcript is displayed
for it to have the correct official look and feel. Typically
all questions and answers appear with one indent, and
all colloquy (see Definitions) appear with another.
Headings are often centered, and Asides may appear centered,
indented, or left-justified according to the preference
of the reporter.
If indenting is not to be supported as part of the XML-Transcript
standard (if it is felt to be a formatting issue, and
therefore inappropriate) then it must be handled in
some other way. Perhaps by permitting space content
to be treated literally somehow. One option might simply
be to allow an optional "Indent" tag on each
Line (since that is physical formatting information
anyway) with possible global defaults that are defined
in some stylesheet being permitted. This wouldn't fully
meet the requirement from the previous section for content
to be completely preserved, however.
This issue will need to be explored more as the final
transcript standard evolves.
4. Appendix: XML Expressions
This appendix was based in part on the corresponding
section from Marty Halvorson's Unofficial Note XML
Standards Development Project Strawman for Electronic
The XML expressions shown in this document are intended
to convey the following semantics:
- this element must occur one time.
- unless otherwise noted, these elements must occur
in the order shown.
- either of these tags must occur one time.
- either of these elements (defined elsewhere) must
occur one time.
- this element is optional.
- this element can occur zero or more times.
- this element must occur one or more times.
- this attribute must appear with this value
- this attribute must appear with either value
- this attribute is optional.