Transcript Document Model

Document Number
Current Version
1.0
Previous Version(s)
0.8, 0.9.
Workgroup Information
Workgroup Name: Transcripts
Workgroup Chair(s): Davin Fifield, Eddie O'Brien
Workgroup Mailing List:
Workgroup Mailing List Archive:
Workgroup Website:
Document Author(s)
Davin Fifield
Document Editor(s)
Short Statement of Status
Final Preliminary Model


This document defines the document model that will be the subject of the LegalXML Transcript working group's focus when developing the proposed LegalXML-Transcript standard.

Status of Document

This is the final working draft of the Legal XML Working Group for review and discussion by interested Members. It is expected that this document will stay in its current working draft form as work on the proposed LegalXML Transcript Standard takes place. The document model described herein is intended to serve as a reference point for the group's efforts when developing that standard.

Table of Contents

1. Introduction

	1.1 Problem
	1.2 Dependencies
	1.3 Requirements
	1.4 Definitions

2. Specification
	2.1 Overview of a Transcript
	2.2 Front Matter
	2.3 Proceedings
	2.4 End Matter

3. Other Notes
	3.1 Paragraph Numbers
	3.2 Use of Element IDs
	3.3 It's the Content Stupid
	3.4 Indenting

4. Appendix: XML Expressions

1. Introduction

1.1 Problem

Before beginning to define the LegalXML transcript standard, the working group needs a model of the transcript data being represented. This document will never reach Recommendation stage; instead, it will become an Unofficial Note to be used as the basis for further documents.

1.2 Dependencies

This document builds on the Unofficial Note XML Standards for Legal Transcripts written by Eddie O'Brien and Chris Priestley. It is dependent on the charter for the Transcript working group, and builds on an e-mail from Allison Stanfield detailing some proposed transcript tags. It also incorporates additions from a paper written by the Australian NSW Attorney-General's Department on the use of XML for transcripts. The e-mail to the Working Group entitled "Legal XML transcript container structure" from Eddie O'Brien is also important as the basis for the Proceedings portion of this model.

1.3 Requirements

It is imperative to note that the pseudo-XML markup used in this document is not intended to be the basis for the actual XML tags and structure of the final XML-Transcript standard. On the contrary, this document should be read as the requirements for the XML-Transcript standard developed by the working group.

The data model represented here meets the following requirements:

  1. Describes both dialogue and non-dialogue information in the transcript.
  2. Is independent of transcription method.
  3. Describes current timestamping practices.
  4. Incorporates the page/line and paragraph-based content models for the spoken content.

1.4 Definitions

appearances
In most transcripts there is an Appearances section in which the attorneys who were present are identified.
attorney
An individual that will ask questions of witnesses, make objections to questions asked by opposing counsel, or make spoken arguments to the judge in favor of or against certain motions or judgements.
certified
A certified transcript in the paper world is a non-draft transcript that has a certificate page with the reporter's signature on it. In the electronic world, it is a fully scoped transcript containing no untranslated steno with an electronic signature attached. See also draft and uncertified.
colloquy
The portion of a transcript that is not recorded as a Q/A exchange, but as an exchange between two specifically named persons. E.g.

MR OBJECTOR: I object.
MR INTERROGATOR: On what grounds?
MR OBJECTOR: On the coffee grounds.

This will often be dialogue spoken between attorneys, without the participation of a witness, and is often indented further than the rest of the transcript when printed.
court
e.g. United States District Court, County Court, Supreme Court, or Philadelphia Court of Common Pleas.
draft
Also referred to as unscoped or unedited. A version of a transcript in which untranslated steno may still appear. A draft transcript is never certified by a reporter.
exhibit
A referenced item entered into evidence. Exhibits may be offered, refused, accepted or withdrawn.
item Marked for Identification (MFI)
An item that is not (yet) an exhibit that has been marked for identification during the proceedings.
jurisdiction
e.g. District of Colorado, El Paso County, or Common Law Division.
stipulations
A section in a deposition in which the assumptions and agreed rules under which the deposition is being taken are stated.
uncertified
For the purposes of this document, an uncertified transcript is a non-draft transcript that has not [yet] been certified, but is capable of being certified. It is fully scoped, containing no untranslated steno, but having no attached signature.
voir dire
Jury selection questioning, or examination of a witness by a judge.
witness
An individual that will be asked questions as part of the proceedings. Questions may be asked by attorneys or judges.

2. Specification

2.1 Overview of a Transcript

From a structural standpoint, a conventional transcript in today's world looks like this at the top level:

<FrontMatter?>
<Proceedings>
<EndMatter?>

In plain English, this corresponds to:

  1. Optional non-dialogue content at the beginning of the transcript.
  2. The verbatim record of the proceedings.
  3. Optional non-dialogue content at the end of the transcript.

2.2 Front Matter

In today's transcripts, there is generally a single title page. However, the title page is often followed by other content that is likewise not part of the verbatim record of the proceedings that the transcript represents.

Here is a sample title page followed by the appearances, with the original layout preserved:

           1                 IN THE UNITED STATES DISTRICT COURT
                                FOR THE DISTRICT OF COLORADO
               Criminal Action No. 96-CR-68
               TERRY LYNN NICHOLS,

                                    REPORTER'S TRANSCRIPT
          10                    (Trial to Jury:  Volume 151)


          12            Proceedings before the HONORABLE RICHARD P. MATSCH,

          13   Judge, United States District Court for the District of

          14   Colorado, commencing at 8:30 a.m., on the 2d day of January,

          15   1998, in Courtroom C-204, United States Courthouse, Denver,

          16   Colorado.








          24    Proceeding Recorded by Mechanical Stenography, Transcription
                 Produced via Computer by Paul Zuckerman, 1929 Stout Street,

          25       P.O. Box 3563, Denver, Colorado, 80294, (303) 629-9285


           1                             APPEARANCES

           2            PATRICK RYAN, United States Attorney for the Western

           3   District of Oklahoma, and RANDAL SENGEL, Assistant U.S.

           4   Attorney for the Western District of Oklahoma, 210 West Park

           5   Avenue, Suite 400, Oklahoma City, Oklahoma, 73102, appearing

           6   for the plaintiff.


           8   MEARNS, JAMIE ORENSTEIN, and AITAN GOELMAN, Special Attorneys

           9   to the U.S. Attorney General, 1961 Stout Street, Suite 1200,

          10   Denver, Colorado, 80294, appearing for the plaintiff.


          12   NEUREITER, and JANE TIGAR, Attorneys at Law, 1120 Lincoln

          13   Street, Suite 1308, Denver, Colorado, 80203, appearing for

          14   Defendant Nichols.

For interest's sake, note that content may appear on the lines in between numbered lines, and that there is a page break, not explicitly represented here, between the first and second physical pages of this transcript.

Rather than trying to mark up this content with XML, a model of the data contained in transcript front matter is shown below: 


The order of the elements subsequent to the TitlePage is not important.

2.2.1 Title Page

The following shows the information that may appear on the Title Page:

<Court? Country?="" Type?=("Federal" | "State" | "*")
        ="" Jurisdiction?=""/>
<Location? City="" State?="" CourtRoom?=""/>
<TranscriptType? Type=("Deposition" | "Hearing" | "Trial")/>
<HearingType? Type=("Sentencing" | "Motion" | "Bond" | "*")/>
<TrialType? Type=("Jury" | "NonJury" | "*")/>
<JudicialOfficer* Type=("Judge" | "Magistrate" | "*") Name=""/>
<Party* PartyID="" Name="" Type=("Plaintiff" | "Defendant" |
                                  "Complainant" | "Respondent" |
                                  "CounterPlaintiff" | "CounterDefendant")/>
<ReporterInfo? Name="" Address="" Phone=""/>

The order in which these elements appear is largely unimportant. Again, this is a document model, intended to show that those attributes exist at a conceptual level, not how they are represented. (For example, Name and Address will be tags as defined by the Horizontal WG, not attributes of a ReporterInfo tag.)

It may not necessarily be the case in the actual XML-Transcript standard that the TitlePage information as shown above is required to appear on a single page. In fact, the meta-information shown here may be better represented in a section separate from the main content of the transcript itself. These will be matters for the working group to discuss.

DayOfHearing would be a sequential indicator of how many days of hearing preceded this one (e.g. Day 12).

2.2.2 Appearances

The appearances may either appear on the title page, or later in the transcript:

<LegalRepresentative* Name="" Address="">
  <PartyRef* PartyID=""/>
</LegalRepresentative>

Each PartyRef tag for the Legal Representative indicates an additional client they are representing.
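
To make the PartyRef linkage concrete, here is a minimal Python sketch of how an application might resolve a representative's PartyRef entries back to the parties declared on the title page. The class names, the clients_of helper and the sample data (including the IDs "p1" and "p2") are assumptions made for this illustration only; they are not part of the model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Party:
    party_id: str
    name: str
    party_type: str  # e.g. "Plaintiff" or "Defendant"


@dataclass
class LegalRepresentative:
    name: str
    address: str
    # One entry per PartyRef: the PartyID of each client represented.
    party_refs: List[str] = field(default_factory=list)


def clients_of(rep: LegalRepresentative, parties: List[Party]) -> List[Party]:
    """Resolve each PartyRef of a representative to its Party record."""
    by_id = {party.party_id: party for party in parties}
    return [by_id[ref] for ref in rep.party_refs if ref in by_id]


# Hypothetical sample data, invented for this sketch.
parties = [
    Party("p1", "Example Plaintiff", "Plaintiff"),
    Party("p2", "Example Defendant", "Defendant"),
]
rep = LegalRepresentative("Example Counsel", "1 Example Street", ["p1"])
```

A representative acting for several clients would simply carry several entries in party_refs, mirroring repeated PartyRef tags.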

2.2.3 Stipulations

Stipulations state agreed-upon rules for a deposition, particularly where the parties involved are from different jurisdictions. Stipulations may need further breakdown at some future time, but for now they will be treated as a single chunk of text.

2.2.4 Indexes

There are two main types of indexes: exhibit indexes and tables of contents.


Exhibit Index:

<ExhibitIndex Party=("Plaintiff" | "Defendant")>
  <Entry+ Location="pg0017"
          =("Offered" | "Refused" |
            "Accepted" | "Withdrawn")
          Party=("Plaintiff" | "Defendant")/>
</ExhibitIndex>

Table of Contents (sometimes only contains witnesses):

<Entry+ Location="pg0017"
        =("Heading" | "Witness" | "DirectExamination" |
          "CrossExamination" | "RedirectExamination" |
          "RecrossExamination" | "VoirDireExamination" |
          "FurtherRedirectExamination" | "FurtherRecrossExamination")
        ContinuedFlag=("Yes" | "No")/>

There are two types of top-level entries: Headings and Witnesses. Examinations of any type occur within Witnesses. The ContinuedFlag indicates a continuation of an interrupted examination. A typical (albeit very lengthy) exchange might be:

<break for lunch>
Redirect (Continued)
Further Redirect
Further Recross
Further Redirect

In this example, the Redirect (Continued) would be marked with ContinuedFlag="Yes", whereas the Further Redirect would not.

2.2.5 Certification State

One final property of a transcript that does not directly correspond to any one part of an original ASCII transcript is the CertifiedState, which indicates whether the transcript is a Draft, an Uncertified copy or a Certified copy. See the Definitions (section 1.4) for these terms.

<CertifiedState Type=("Draft" | "Uncertified" | "Certified")/>

2.3 Proceedings

The proceedings are the most important part of the transcript, as they are its raison d'être.

2.3.1 A Logical Model

The bulk of a transcript consists of exchanges between participants. Each individual's turn at speaking (transcripts assume that only one person is speaking at once!) is typically marked with either "THE COURT:", "Q.", "A.", "THE DEFENDANT:" or the witness or attorney's name, "MR. SMITH:".

Therefore, one way of modeling the contents of the proceedings is as follows:

<Speech+ Type=("Question" | "Answer" | "Other")
         SpeakerID?=(PartyID | JudgeID | LegalRepresentativeID)
         SpeakerName?="">
  <Paragraph+>
    <!-- Text of paragraphs of the speech go here -->
  </Paragraph>
</Speech>

I.e. One or more speeches containing one or more paragraphs. The SpeakerName would be permitted when the speaker is not identified as a party, judge or legal representative. Either the SpeakerID or the SpeakerName MUST be present.
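
The constraint on speaker identification is the kind of rule a schema alone may not express, so a small check can help. The sketch below is only an illustration: the Speech class and its problems method are hypothetical names, and I have read "either ... MUST be present" as "at least one of the two", which is an assumption on my part.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SPEECH_TYPES = {"Question", "Answer", "Other"}


@dataclass
class Speech:
    speech_type: str
    paragraphs: List[str] = field(default_factory=list)
    speaker_id: Optional[str] = None
    speaker_name: Optional[str] = None

    def problems(self) -> List[str]:
        """Return the rule violations found (empty when the speech is valid)."""
        found = []
        if self.speech_type not in SPEECH_TYPES:
            found.append("unknown Type: " + self.speech_type)
        # Assumption: at least one of SpeakerID / SpeakerName is required.
        if not self.speaker_id and not self.speaker_name:
            found.append("either SpeakerID or SpeakerName must be present")
        if not self.paragraphs:
            found.append("a speech needs at least one paragraph")
        return found
```

For example, Speech("Other", ["I object."]).problems() reports the missing speaker, while the same speech with speaker_name="MR OBJECTOR" reports nothing.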

In fact, we must also permit non-spoken information, intermingling asides by the reporter, as well as headings, into the transcript. 

The top-level non-spoken information present in a transcript consists of the Witness and Section headings, and these in fact contain the speeches:

(Witness | Section)*

I.e. the proceedings are a sequence of zero or more witnesses or general sections.


<Witness No?="" Name="" Party=("Plaintiff" | etc.)
         Status=("Sworn" | "Affirmed")>
  <Examiner Name="" Party?=(etc.) Type=("Direct" | etc.)
            ContinuedFlag=("Yes" | "No")>
    (Speech | ExhibitEvent | Aside)*
  </Examiner>
</Witness>

And Sections:

<Section Name="">
  (Speech | ExhibitEvent | Aside)*
</Section>

In the case that there are no explicit sections, there is an implicit section at the start of the Proceedings containing the Speeches, ExhibitEvents and Asides. 

Speeches are as defined above, and Asides are simply textual data. An ExhibitEvent is probably something like:

<ExhibitEvent Type=("Offered" | "Refused" |
                    "Accepted" | "Withdrawn" | "MFI")
              Party=("Plaintiff" | "Defendant")/>

This logically models the content of the proceedings - to a certain extent. It does not begin to deal with timestamp data throughout the transcript, or the existing situation of transcripts organized into pages, with line numbers on each page. Nor does it reflect the reality that exhibit events and asides may well interrupt speeches. This would be easy to do at the paragraph level with small modifications to this model (simply by rippling ExhibitEvents down to the same level as Paragraphs within a Speech), and it is conceptually not harder to interrupt paragraphs themselves (you would simply need to model textual chunks that may or may not start paragraphs, and allow ExhibitEvents and Asides to be interspersed among them). However, this is not considered further here, as the real intricacies are apparent when we combine this logical model with the physical model. 

2.3.2 A Physical Model

Transcripts are presently saved and presented to end-users, whether on paper or in electronic form, in a page-based format. Each line of the Proceedings appears on a distinct numbered line on a page, although those in the FrontMatter and EndMatter may sometimes have more than one physical line for a given numbered line (see the earlier example of the Title Page), or may be completely unnumbered.

The model itself is a simple one. But some important points to note are as follows:
1. Page numbers don't always start at one.
2. Page numbers aren't always present on every page.
3. Page numbers aren't always sequential.
4. Line numbers don't always start at one.
5. Line numbers aren't always present on every physical line.
6. Line numbers aren't always sequential.

These irregularities can arise from redaction (omitted portions of the transcript); from the practice in some states (e.g. New York) of not numbering lines starting from 1; from transcripts being continued from previous sessions; or because FrontMatter and EndMatter pages may simply not require page numbers.

Here then, is a physical model for transcripts:

<Page+ Number="1">
  <(Header | Footer)* Position=("Left" | "Center" | "Right")
                      Value="Header or footer text goes here"/>
  <Line+ Number="1" Timestamp="">
    <!-- Content of the line goes here -->
  </Line>
</Page>

Although this is shown as being the internal definition of the Proceedings, in fact the entire transcript would be paginated this way, including the FrontMatter and the EndMatter.
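
One practical consequence of the six points listed earlier is that software should store page and line numbers as recorded data rather than derive them from position. The following Python sketch illustrates that idea; the Page and Line classes, the find_line helper and the sample content are all hypothetical, and the lookup simply returns the first match if a number happens to repeat.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Line:
    text: str
    number: Optional[int] = None  # None when the physical line is unnumbered


@dataclass
class Page:
    number: Optional[int] = None  # None when the page carries no number
    lines: List[Line] = field(default_factory=list)


def find_line(pages: List[Page], page_no: int, line_no: int) -> Optional[str]:
    """Look up text by the printed page and line numbers, never by position."""
    for page in pages:
        if page.number != page_no:
            continue
        for line in page.lines:
            if line.number == line_no:
                return line.text
    return None


# Hypothetical content: an unnumbered title page, a page whose first numbered
# line is 5, and a jump from page 12 to page 15 (e.g. after redaction).
pages = [
    Page(None, [Line("title page text")]),
    Page(12, [Line("first numbered line", 5),
              Line("unnumbered continuation"),
              Line("next numbered line", 6)]),
    Page(15, [Line("after a redacted gap", 1)]),
]
```

Because nothing is inferred from list positions, gaps and unnumbered pages or lines do not disturb the lookup.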

Note the Timestamp attribute on the Line tag. The use of this attribute permits synchronization with other media types for later playback, and for simply recording the time-flow of the proceedings.
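
As an illustration of the playback use case, the sketch below finds the line that was being spoken at a given moment of a recording. It assumes that the line timestamps have already been converted to seconds from the start of the recording and are in ascending order; this document does not define a timestamp format, so that representation, the function name and the sample values are my own assumptions.

```python
import bisect
from typing import List, Optional


def line_index_at(line_times: List[float], playback_time: float) -> Optional[int]:
    """Return the index of the last line stamped at or before playback_time."""
    position = bisect.bisect_right(line_times, playback_time) - 1
    return position if position >= 0 else None


# Hypothetical timestamps, in seconds from the start of the recording.
line_times = [0.0, 4.2, 9.8, 15.1]
```

A player would call line_index_at with the current media position and highlight the corresponding line; the result is None before the first stamped line.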

2.3.3 A Combined Model

Although this document does not claim to even be a straw man for XML-Transcript markup, there is one important issue that needs resolving. How do we reconcile the Logical and the Physical models for transcripts?

The two models are fundamentally incompatible if a well-formed XML structure is to result. The two alternative resolutions are:
(i) The tags required for the physical model are modified to point-tags. (See below.)
(ii) The tags required for the logical model are modified to point-tags.

By point-tag, I mean an XML tag that has no content aside from attributes or other tags that are themselves point-tags.

If point-tags were not to be used, physical and logical elements would often overlap, as in:


which is not well-formed XML.
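
The overlap problem is easy to demonstrate with any XML parser. The Python sketch below uses the standard xml.etree.ElementTree module; the two markup strings are my own illustration (built from a question that spans two lines) and are not markup proposed by this document. In the first string the Line and Speech elements overlap, in the second the physical tags have been reduced to empty point-tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the Speech starts inside line 22 and ends inside
# line 23, so the Line and Speech elements overlap instead of nesting.
OVERLAPPING = (
    '<Page Number="1">'
    '<Line Number="22"><Speech Type="Question">Q.  Tell us what happened</Line>'
    '<Line Number="23">that night.</Speech></Line>'
    '</Page>'
)

# The same content with the physical tags reduced to empty point-tags.
POINT_TAGS = (
    '<Speech Type="Question">'
    '<NewLine Number="22"/>Q.  Tell us what happened'
    '<NewLine Number="23"/>that night.'
    '</Speech>'
)


def is_well_formed(markup: str) -> bool:
    """Return True when the markup parses as XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True
```

The parser rejects the overlapping form at the first mismatched end tag, whereas the point-tag form parses without complaint.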

Before resolving this issue, it is worth noting that there are some very good reasons to retain the physical page/line information of the transcript:

  1. Any XML-Transcript format should be able to handle transcripts that are in existence today, as well as those that are generated going forward. Presently all transcripts in the United States, and in some other countries as well, are marked with page and line numbers on each page.
  2. Existing transcript citations are made to a (case, volume, page, line) tuple, or similar. It is unlikely that the vast body of transcripts existing today will be reformatted to have citable paragraph numbers.
  3. It is impossible to overstate the importance of the content at a given citable location being constant throughout the lifetime of a transcript. Therefore, for transcripts without page/line information it would be important to either (a) ensure that NO citations were ever made to that transcript via page and line or (b) that the page and line information were stored with the transcript itself. Since (b) is in concert with the status quo, it seems it is the path of least resistance, and the most likely to ensure a usable standard. 

When originally thinking about this issue, I was keen to try to preserve the Page and Line elements as the top content level of the XML-Transcript document model. My reasons were partly selfish: I felt they would have mapped more cleanly to existing (non-XML) based technologies I have been working with. However, I have now come around to the opinion that the cleaner alternative is for the page/line tags to be point-based. My reasons for forming this opinion are:

  1. The logical model is an important reflection of the document's organization, and requiring it to be compromised in the XML hierarchy would cause confusion and difficulty when trying to use and extend the standard (and when being used as part of other legal standards).
  2. The physical (page/line) model, although it gives a guaranteed maximum granularity for the content (e.g. lines are no longer than 80 characters, pages are no longer than 66 lines) is not the representation of the content that yields the most value from the content. In fact, many of today's challenges in transcript processing deal with trying to retroactively fit some logical structure onto paginated transcripts. If this structure were the top-level content, the value and navigability of transcripts would be greatly increased.
  3. Any page is still random-accessible when an entire transcript has been parsed. In practice, in order to find the content of a particular page under the physical model, one would need to parse up until the corresponding </page> for a <page> tag. This is equivalent to parsing to a given point-based <StartPage> tag and then continuing until a matching <EndPage> (or subsequent <StartPage>) is found. Hence the speed of parsing cannot be used as a reason for or against either of these models.
  4. The logical model has a greater number of tags that would need to become point-based. The physical model only has Page and Line tags that need to be spliced into the logical model. 

Here is a combined data model:

(Witness | Section)*

<Witness No?="" Name="" Party=("Plaintiff" | etc.)
         Status=("Sworn" | "Affirmed")>
  <Examiner Name="" Party?=(etc.) Type=("Direct" | etc.)
            ContinuedFlag=("Yes" | "No")
            FurtherFlag=("Yes" | "No")>
    (Speech | Aside)*
  </Examiner>
</Witness>

<Section Name="">
  (Speech | Aside)*
</Section>

<Speech Type=("Question" | "Answer" | "Other")
        =("Witness" | "Judge" | "Attorney")>
  <Paragraph+>
    (#CDATA | Page | Line | ExhibitEvent)+
  </Paragraph>
</Speech>

<Aside>
  (#CDATA | Page | Line | ExhibitEvent)+
</Aside>

Note that any bottom-level tag that permits actual content to appear may cause a page or line wrap to occur, and so must allow the physical tags (Page and Line) to appear. Hence both Paragraph and Aside must allow Pages and Lines within them. This would need to be extended to the content containing tags of the FrontMatter and EndMatter.

The Page construct is replaced by a point-tag that indicates the start of a new page, and contains only header/footer information (although in practice, at least one NewLine must follow immediately after a StartPage).

<StartPage Number="1">
  <(Header | Footer)* Position=("Left" | "Center" | "Right")
                      Value="Header or footer text goes here"/>
</StartPage>

Rather than encapsulating the content of a line, each Line tag now indicates the start of a new line, contains no textual data, and may be collapsed to a point-tag:

<NewLine Number="1" Timestamp=""/>

ExhibitEvents were point-based tags in the logical model shown in the previous section, and as they contain no content, they do not require modification. However, since an exhibit-event (unlike an Aside, which the Reporter has traditionally only placed between paragraphs) may occur at any time, it is also permitted inside the Proceedings wherever Page and Line tags are permitted. 

The end result successfully models even the most complicated content. For example, the following extract:

          22   Q.  Tell us what happened on the night that you went back to
          23   the storage unit.
          24   A.  I drove to the storage unit, and I used the combination and
          25   the key that Tim had left at my house.  I got into the storage
                                   Michael Fortier - Direct
           1   unit, and I set the tank down directly to my right, and then I
           2   locked up the unit and left.
would be modeled as:

<Speech Type="Question">
  <Paragraph>
    <NewLine Number="22"/>
    Q.  Tell us what happened on the night that you went back to
    <NewLine Number="23"/>
    the storage unit.
  </Paragraph>
</Speech>
<Speech Type="Answer">
  <Paragraph>
    <NewLine Number="24"/>
    A.  I drove to the storage unit, and I used the combination and
    <NewLine Number="25"/>
    the key that Tim had left at my house.  I got into the storage
    <StartPage Number="8302">
      <Header Position="Center" Value="Michael Fortier - Direct"/>
    </StartPage>
    <NewLine Number="1"/>
    unit, and I set the tank down directly to my right, and then I
    <NewLine Number="2"/>
    locked up the unit and left.
  </Paragraph>
</Speech>

In this particular example, each speech is only one paragraph. However, the proposed model gives the flexibility for speeches to be multiple paragraphs in length. 
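
To check that the page/line information really does survive as point-tags, the following Python sketch rebuilds a (page, line) index from markup of this shape using the standard xml.etree.ElementTree module. It is an illustration under stated assumptions: the Proceedings wrapper, the Paragraph tags, the initial StartPage numbered 8301 and the function names are all mine (the extract above does not show the number of its first page), and the sketch collapses runs of whitespace, which a real implementation would have to preserve.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<Proceedings>
  <StartPage Number="8301"/>
  <Speech Type="Question"><Paragraph>
    <NewLine Number="22"/>Q.  Tell us what happened on the night that you went back to
    <NewLine Number="23"/>the storage unit.
  </Paragraph></Speech>
  <Speech Type="Answer"><Paragraph>
    <NewLine Number="24"/>A.  I drove to the storage unit, and I used the combination and
    <NewLine Number="25"/>the key that Tim had left at my house.  I got into the storage
    <StartPage Number="8302">
      <Header Position="Center" Value="Michael Fortier - Direct"/>
    </StartPage>
    <NewLine Number="1"/>unit, and I set the tank down directly to my right, and then I
    <NewLine Number="2"/>locked up the unit and left.
  </Paragraph></Speech>
</Proceedings>
"""


def walk(elem):
    """Yield ('tag', element) and ('text', string) items in document order."""
    yield ("tag", elem)
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield from walk(child)
        if child.tail:
            yield ("text", child.tail)


def index_by_citation(root):
    """Map (page number, line number) to the text found on that line."""
    index = {}
    page = line = None
    for kind, item in walk(root):
        if kind == "tag" and item.tag == "StartPage":
            page, line = item.get("Number"), None
        elif kind == "tag" and item.tag == "NewLine":
            line = item.get("Number")
            index[(page, line)] = ""
        elif kind == "text" and line is not None:
            index[(page, line)] += item
    # Collapse the indentation and line breaks of the XML source.
    return {key: " ".join(text.split()) for key, text in index.items()}


lines = index_by_citation(ET.fromstring(SAMPLE))
```

Here lines[("8302", "1")] holds the first line of the second page, even though the answer it belongs to began on the previous page. The header text never leaks into the index because it is carried in an attribute.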

The possibility within this model of putting timestamps on each line reflects the current capability of most CAT software to do so.

2.4 End Matter

The EndMatter contains optional indexes and tables of contents, and also the Reporter's Certificate.


The indexes are as defined in the FrontMatter, and the Certificate would probably be modeled as textual data in the first instance, although it may also contain links to an electronic signature, if one is present on the transcript. In the absence of such a signature, it might still contain information about the reporter, as is optionally present in the FrontMatter defined above.

3. Other Notes

3.1 Paragraph Numbers

Paragraph numbers have not been shown in this model, as in the current situation (at least in the US) paragraphs are not numbered. It would be a simple matter to extend the logical model to permit numbers on each paragraph. 

3.2 Use of Element IDs

It is important that the LegalXML Transcript standard that is developed for transcripts uniquely determines an identity for each paragraph/page and line in the system. In today's world, it is actually possible for volumes to be combined into the same physical transcript, and for page numbers in those volumes to both begin at "1". For this and other similar reasons, it is proposed that each Page and Line tag be assigned an ID as follows:

<StartPage ID="pg000000"/>
<NewLine ID="ln00000000"/>
<NewLine ID="ln00000001"/>
<NewLine ID="ln00000002"/>

That is, a page receives an ID that is always zero padded to 6 digits (this will support trials up to ten times longer than the longest trial to date), and each line is referred to with an ID that begins with the page ID and ends in the 0-based index of the line.

My inclination (since this is machine-readable data) is to number page IDs beginning at 0, not at one. They would always be sequentially assigned in a given official transcript, and would provide a useful random access mechanism to the content. They would NOT correspond to the actual page numbers assigned to that transcript. By explicitly numbering pages with these 0-based indexes (they would also have page numbers as a separate and independent attribute) you free yourself of concerns about relying on an underlying XML parser to provide a readily usable random access mechanism for objects in the hierarchy. 
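
The ID scheme can be captured in two small functions. This Python sketch is only an illustration: the function names are hypothetical, and the two-digit width of the line index is inferred from the sample IDs above rather than stated in the text.

```python
def page_id(page_index: int) -> str:
    """Build a page ID from a 0-based page index, zero padded to 6 digits."""
    if not 0 <= page_index <= 999_999:
        raise ValueError("page index must fit in 6 digits")
    return f"pg{page_index:06d}"


def line_id(page_index: int, line_index: int) -> str:
    """Build a line ID: the page digits followed by the 0-based line index."""
    if not 0 <= page_index <= 999_999:
        raise ValueError("page index must fit in 6 digits")
    # Assumption: two digits for the line index, as in "ln00000002".
    if not 0 <= line_index <= 99:
        raise ValueError("line index must fit in 2 digits")
    return f"ln{page_index:06d}{line_index:02d}"
```

For example, the third line of the first page is line_id(0, 2), and these IDs remain independent of whatever page and line numbers are printed on the transcript.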

If (when) paragraphs are numbered, it would be a simple matter to assign similar IDs to them. Again, this would cope with situations such as redacted transcripts, which cause gaps and non-zero-based (or non-one-based) numbering to rear their ugly heads.

3.3 It's the Content Stupid

I have been working under an important assumption when thinking about the Legal-XML Transcript format: That there would be a single element in the document that would be the home for exactly the text that today constitutes the transcript. In other words, that there would be some pair of tags, e.g.  <Body> </Body>, which would embrace other textual data and tags where if the tags were to be removed you would be left with no more and no less than the content of the transcript.

There are several reasons for this assumption. Organizations that make use of the format should be able to use the logic in their existing parsing engines where possible. Similarly, writing search-engine filters for full-text indexing should be simplified to the greatest extent possible. If this means writing a minimalist XML-Transcript parser that takes the text between <Page> and <Line> tags within the <Body> section, and ignores all the logical tags, then that should be supported. In particular, knowing that certain content is NOT part of the actual transcript (including headers and footers) should not be required. Any character data that is not transcript content (including headers and footers as shown in this document model) should appear as attributes, or as meta-data in the non-Body part of the transcript.
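
A minimalist consumer of the kind described here can be sketched in a few lines of Python with the standard xml.etree.ElementTree module. The Transcript, Meta and Reporter element names and the sample document are hypothetical; only Body comes from the assumption stated above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: meta-data lives outside <Body>, and the page header
# would be carried in an attribute, so neither appears as character data.
DOC = """<Transcript>
<Meta><Reporter Name="Example Reporter"/></Meta>
<Body><Speech Type="Question"><NewLine Number="22"/>Q.  Tell us what happened on the night that you went back to
<NewLine Number="23"/>the storage unit.
</Speech></Body>
</Transcript>"""


def transcript_text(xml_source: str) -> str:
    """Return the character data inside <Body>, ignoring every tag."""
    body = ET.fromstring(xml_source).find("Body")
    return "".join(body.itertext())
```

Calling transcript_text(DOC) yields the two lines of the question and nothing else, which is exactly the property a full-text indexing filter would rely on.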

The working group will need to decide where to place the meta-data, and if it makes sense to optionally permit some of it to appear within the <Body> (provided it meets the above conditions).

3.4 Indenting

This document has not addressed indenting of lines, which is important if the transcript is to have the correct official look and feel whenever it is displayed. Typically all questions and answers appear with one indent, and all colloquy (see Definitions) appears with another. Headings are often centered, and Asides may appear centered, indented, or left-justified according to the preference of the reporter.

If indenting is not to be supported as part of the XML-Transcript standard (if it is felt to be a formatting issue, and therefore inappropriate) then it must be handled in some other way, perhaps by permitting whitespace content to be treated literally somehow. One option might simply be to allow an optional "Indent" tag on each Line (since that is physical formatting information anyway), with possible global defaults that are defined in some stylesheet being permitted. This wouldn't fully meet the requirement from the previous section for content to be completely preserved, however.

This issue will need to be explored more as the final transcript standard evolves. 

4. Appendix: XML Expressions

This appendix was based in part on the corresponding section from Marty Halvorson's Unofficial Note XML Standards Development Project Strawman for Electronic Court Filing.

The XML expressions shown in this document are intended to convey the following semantics:

<tag> - this element must occur one time.

<tag1> <tag2> - unless otherwise noted, these elements must occur in the order shown.

<tag1> | <tag2> - either of these tags must occur one time.

(tag1 | tag2) - either of these elements (defined elsewhere) must occur one time.

(tag?) - this element is optional.

(tag*) - this element can occur zero or more times.

(tag+) - this element must occur one or more times.

<tag attribute="value"> - this attribute must appear with this value

<tag attribute=("value1" | "value2")> - this attribute must appear with either value

<tag attribute?="value"> - this attribute is optional.
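
For readers who prefer code to notation, the occurrence suffixes can be read mechanically. The Python sketch below is an illustration only; the occurrence function is a hypothetical helper that turns a name as written in this document into its minimum and maximum number of occurrences (None meaning unbounded).

```python
def occurrence(name: str):
    """Split a model name such as 'Entry+' into (name, minimum, maximum)."""
    suffixes = {"?": (0, 1), "*": (0, None), "+": (1, None)}
    if name and name[-1] in suffixes:
        low, high = suffixes[name[-1]]
        return name[:-1], low, high
    # No suffix: the element or attribute must occur exactly once.
    return name, 1, 1
```

So occurrence("Court?") describes an optional element, occurrence("Party*") one that may repeat or be absent, and occurrence("Witness") one that must appear exactly once.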



