Canonical Text Infrastructure (CTI)

Pillar 1: Canonical Text Service

Text inventories are served as tab-separated plain text rather than XML. Many browser configurations render XML poorly, which leads users to suspect errors that are not there, and XML parsing has performance and compatibility disadvantages. For XML text inventories that follow the CTS specification, use "/cts/?request=GetCapabilities" instead of "/plain/editions.php" in your request URL, but please do not use XML output for automated requests. The plain text inventory format is:
URN [TAB] Title [TAB] Year [TAB] Author [TAB] Copyright restricted [NL]
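As an illustration of the five-column layout above, the following is a minimal Python sketch of a parser for inventory lines. The names `InventoryEntry` and `parse_inventory` are hypothetical, the sample line is invented, and the encoding of the "Copyright restricted" column is not specified here, so it is kept as a plain string.

```python
from typing import NamedTuple


class InventoryEntry(NamedTuple):
    """One inventory line: URN, Title, Year, Author, Copyright restricted."""
    urn: str
    title: str
    year: str        # kept as text; some entries may have no numeric year
    author: str
    restricted: str  # value encoding is an assumption, so it stays a string


def parse_inventory(text: str) -> list[InventoryEntry]:
    """Split tab-separated inventory text into entries.

    Lines that do not have exactly five fields (including blank lines)
    are skipped rather than raising an error.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 5:
            continue
        entries.append(InventoryEntry(*(field.strip() for field in fields)))
    return entries


# Invented sample data, not taken from a real inventory.
sample = (
    "urn:cts:demo:group.work.edition\tExample Title\t1900\tExample Author\tfalse\n"
    "\n"
    "a malformed line without tabs\n"
)

for entry in parse_inventory(sample):
    print(entry.urn, entry.title, entry.year)
```

Keeping every field as a string avoids guessing at conventions (for example, how missing years are written) that the format line does not state.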

Public Text Inventories

Namespace	State	Text Inventories	Lang	Copyright Restricted (*)	Content	Source / Link to citation
ancJewLit	Stable	📚	HEB,ARC	Free	Classical Jewish sources	AncJewLit GitHub
dsb	Work in Progress	📚	DSB(,DEU,HSB)	Partly	Lower Sorbian text corpus	Serbski Institute
folgershakespeare	Stable	📚	ENG	Free	All Shakespeare's works	Folger Shakespeare Library
gps4	Stable	📚	DEU	Free	German Political Speeches corpus compiled by Adrien Barbaresi (6'685 documents)	German Political Speeches Corpus and Visualization
gwtc	Work in Progress	📚	ENG,DEU	Fully	12'407 Game Walkthrough Documents	Game Walkthrough Corpus
openarabicpe	Stable	📚	ARA	Free	Open Arabic Periodical Editions (Muqtabas, Manar, Ustadh, Haqaiq, Lughat, Zuhur)	OpenArabicPE GitHub
pbc	Stable	📚	Multi	Free	20 copyright-free parallel bible translations	Parallel Bible Corpus
pcp	Stable	📚	FRA	Free	Chrétien de Troyes's Le Chevalier de la Charrette (Lancelot, ca. 1180)	The Princeton Charrette Project
tg	Stable	📚	DEU	Free	Textgrid	The Digital Library in Textgrid
tgap	Stable	📚	Multi	Free	Thomas Gray Archive Poems	Thomas Gray Archive
voth	Stable	📚	Multi	Free	David Boder: Voices of the Holocaust	David Boder: Voices of the Holocaust

(*) The current list of requests that are available for copyright restricted texts is documented here.

Online Tools

Namespace Resolver provides endpoint URLs based on URN namespaces
CTS Explorer provides convenient access and example requests for available CTS instances
E-Book Style Reader
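To illustrate what resolving by namespace involves, here is a minimal Python sketch that extracts the namespace from a CTS URN (the third colon-separated component) and looks it up in a table. The `ENDPOINTS` mapping, its URLs, and the sample URN are hypothetical placeholders; this is not the Namespace Resolver's actual interface.

```python
def cts_namespace(urn: str) -> str:
    """Return the namespace component of a CTS URN (urn:cts:<namespace>:...)."""
    parts = urn.split(":")
    if len(parts) < 3 or parts[0] != "urn" or parts[1] != "cts" or not parts[2]:
        raise ValueError(f"not a CTS URN: {urn!r}")
    return parts[2]


# Hypothetical endpoint table; real endpoint URLs are not given in this document.
ENDPOINTS = {
    "pbc": "https://example.org/pbc/cts/",
    "tg": "https://example.org/tg/cts/",
}


def resolve_endpoint(urn: str) -> str:
    """Look up the (hypothetical) endpoint URL for the URN's namespace."""
    namespace = cts_namespace(urn)
    try:
        return ENDPOINTS[namespace]
    except KeyError:
        raise LookupError(f"no endpoint known for namespace {namespace!r}") from None


# Illustrative URN; the work and passage components are invented.
print(resolve_endpoint("urn:cts:pbc:group.work.edition:1.1"))
```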

Resources and Source Code

Source Code Repositories (Git hosted via Bitbucket.org)
Python Interface (Git hosted via Bitbucket.org)

Suggested Citation

Jochen Tiepmar. Canonical Text Infrastructure. URL https://urncts.eu, requested on

Selected Collections in E-Book Style

Author	Hans Christian Andersen | Wilhelm Busch | Johann Wolfgang von Goethe | Grimm's Fairy Tales | Franz Kafka | Karl May | Friedrich Schiller | Shakespeare

Pillar 2: Canonical Text Miner

Text Mining Instances

Text Mining Corpora	Source Document List	Time Series	Type*
ancJewLit	Full	No	default
dsb	Full	Yes	special
folgershakespeare	Full	No	default
gps4	Full	Yes	default
openarabicpe	Full	No	default
pcp	Full	No	default
tg	Full	No	default
tgap	Full	No	default

(*) Type "special" means that the text miner uses corpus-specific features (e.g. specific text annotation). Type "default" means that the default CTM source code is used without further programming: cloning the repository and running the setup script results in an identical application on your server. Diachronic features require publication dates on the CTS side, which may not be available for every document or data set.

Source Code Resources

Source Code and Installation (Git hosted via Bitbucket.org)

Impressum and Data Protection Policy

This is a non-commercial academic research and data webservice.

Impressum
Dr. Jochen Tiepmar, c/o IP-Management #48412, Ludwig-Erhard-Str. 18, 20459 Hamburg, Germany.
Preferred contact: email to tiepilab at gmx.de or the usual academic communication channels.

Data Protection Policy
No user data is collected apart from the IP access logs stored by the Apache server software. These access logs are deleted automatically. Data sets are provided according to their public license or prior individual agreements. Tools may include software distributed under publicly available licenses (namely plotly.js and cytoscape.js).

FAQ

What is a Canonical Text Service?

The Canonical Text Services protocol defines interaction between a client and server providing identification of texts and retrieval of canonically cited passages of texts. The official specifications by David Neel Smith and Christopher Blackwell can be found here. Put simply, CTS serves text passages that are specified by URN-like references. The specification makes it possible to create a CTS URN for any text passage in a document. Data is requested with GET requests encoded in a URL. Each request must contain one parameter, request, which specifies the CTS function to use. Function-specific parameters, such as the URN, are added as additional GET parameters.
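The request structure described above can be sketched in Python as follows. The `request` parameter comes from the description above; the endpoint URL and the sample URN are hypothetical placeholders, and the `urn` parameter name and the `GetPassage` function are taken from the CTS specification as an assumption about what a given instance supports. The helper `build_cts_url` is invented for this example.

```python
from urllib.parse import urlencode


def build_cts_url(endpoint: str, request: str, **params: str) -> str:
    """Build a CTS GET URL: the mandatory `request` parameter comes first,
    followed by any function-specific parameters such as `urn`."""
    query = urlencode({"request": request, **params}, safe=":")  # keep URN colons readable
    return f"{endpoint}?{query}"


# Hypothetical endpoint; substitute the URL of an actual CTS instance.
ENDPOINT = "https://example.org/cts/"

# Inventory request: no function-specific parameters needed.
capabilities_url = build_cts_url(ENDPOINT, "GetCapabilities")

# Passage request: the URN (invented here) is passed as an additional GET parameter.
passage_url = build_cts_url(
    ENDPOINT, "GetPassage", urn="urn:cts:demo:group.work.edition:1.1"
)

print(capabilities_url)
print(passage_url)
```

The resulting URLs can be fetched with any HTTP client, for example `urllib.request.urlopen`.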

Is the implementation feature complete?

Not yet. GetPassagePlus and error messages are missing but will soon be implemented, along with many additional features that extend the CTS protocol (e.g. license management at the passage request level). See this dissertation for more information about what is planned.

How about data persistence and versioning? Can I reliably cite text passages via URNs or can the text content change?

CTS URNs are meant to be persistent references. However, mistakes get corrected, improvements are made, and structural markup can change while documents are still being edited. There is no clear solution to this problem, but some kind of versioning will be implemented (e.g. numbered updates). Text corpora that are still being worked on are marked "Work in Progress" in the State column of the table above. In general, CTS URNs can be considered safe for citation purposes.

How reliable is this service? Will you monetize it once people depend on it?

The server is financed privately, and I use these web services for my own programming work and research. The software is open source and can be recreated by anyone. It is planned to implement CTS Cloning, which will allow decentralized, distributed backups of texts once they have been "CTSified". This will eliminate any dependency on individual servers, because anyone will be able to mix and host their own data instances. Monetizing this service will not be necessary and would be counterproductive for me personally, because it would undermine the reliability of my research output.