How is the corpus structured?
Corpus contouring, corpus typology
The corpus is a collection of texts that has been prepared for searches in the areas of lexical, semantic, metric, morphological, and syntactic questions. The primary research goal of the corpus is the onomasiological vocabulary description of Middle High German.
The text and data are acquired through a controlled monitor corpus that is continuously expanded. The current size is 666 texts and 329,376,195 triples. Homogeneous parameters are used, in contrast to the earlier opportunistic data acquisition. This means that data collection is carried out according to specific guidelines and criteria.
The corpus focuses mainly on Middle High German, but also includes texts from the early phase of Early Modern German. Occasionally, passages in other languages such as Latin also occur.
The corpus is a reference corpus and contains a balanced mix of different genres and varieties. There is a defined outline of the text collection, whereby all genres of texts are possible. In addition, there is a wide distribution of language areas to ensure a diverse database.
The time frame of the corpus is not limited to the period from 1050 to 1350, known as the “Middle High German Basic Data Stock,” but extends from 1050 to 1600. The MHDBDB could thus be described as a “text archive before 1600” (complementing the DTA), which provides an extended temporal context for research.
The corpus is linguistically processed by tokenizing and lemmatizing the texts. Disambiguation is also carried out in some cases to resolve ambiguities. In addition, the data is annotated.
Copyright and licenses
“The birth of the reader must be paid for with the death of the author.” (Barthes, Roland: Der Tod des Autors. In: Texte zur Theorie der Autorschaft. Ed. by Fotis Jannidis et al. Stuttgart: Reclam 2000. p. 193.)
Copyright applies to the authors of texts. Their works are protected for the legally prescribed period of 70 years after the death of the author and are not in the public domain during that time (cf. §§ 60-61 UrhG-AT and §§ 64-69 UrhG-DE). Copyright protection for a medieval work can therefore be considered to have ‘expired’. However, anyone who lawfully publishes for the first time a work for which the term of protection has expired (70 years after the author's death) is entitled to the exploitation rights to the work in the same way as an author. This property right expires 25 years after publication.
Scholarly editions of medieval texts must therefore be examined in light of the legal concept of editio princeps, which is derived from copyright law. This is a provision of the UrhG-AT and UrhG-DE, which derives special property rights from the first publication of works and refers to the (printed) first edition or the first publication of a literary work.
For the MHDBDB, this means that editions whose first publication was more than 25 years ago are unproblematic. Modern editions that re-edit texts that have already been edited once are also in the public domain. However, editions that were first published in the past 25 years require more detailed legal examination.
The MHDBDB does not publish apparatuses/commentaries in the editorial sense, but rather the pure text with its own annotations. The level of creativity of the works therefore tends to be assessed differently. All MHDBDB annotations are licensed under CC BY-NC-SA 3.0 AT. The e-texts themselves are individually marked. For works that have been made available to the MHDBDB in consultation with the editors, a separate solution is provided.
Example: Vlastimil Brom (ed.): Di tutsch kronik von Behem lant (2012).
Which format is the data available in?
The e-texts were originally encoded in basic, structured TEI-XML (stanza, verse, heading). All files are tokenized, i.e., every token is marked with a “seg” element that carries a unique ID. From there, stand-off markup is used to refer to all annotations, research data, and metadata, which are stored in RDF. Controlled vocabularies (concept system, name system, text series typology) were implemented as SKOS vocabularies. The lemmas are encoded according to the specifications of the OntoLex-Lemon Lexicography module. RDF and TEI data are linked using the Web Annotation Vocabulary, following the W3C recommendation. All annotations can be downloaded for further use at: github.com/Middle-High-German-Conceptual-Database/TEI-Texte.
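The following Python sketch illustrates the principle of tokenized TEI with stand-off annotation. It is only an approximation under stated assumptions: the TEI fragment, the token IDs, the `example.org` IRIs, and the helper names `extract_tokens` and `make_annotation` are invented for illustration and do not reproduce the actual MHDBDB files or RDF graph.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical TEI fragment; verse text and IDs are made up for illustration.
SAMPLE_TEI = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <lg type="stanza">
      <l n="1">
        <seg xml:id="T1_0001">diu</seg>
        <seg xml:id="T1_0002">katze</seg>
        <seg xml:id="T1_0003">slief</seg>
      </l>
    </lg>
  </body></text>
</TEI>
"""


def extract_tokens(tei_xml: str) -> dict[str, str]:
    """Map the xml:id of every <seg> element to its surface form."""
    root = ET.fromstring(tei_xml)
    tokens = {}
    for seg in root.iter(f"{{{TEI_NS}}}seg"):
        token_id = seg.get(f"{{{XML_NS}}}id")
        tokens[token_id] = (seg.text or "").strip()
    return tokens


def make_annotation(token_id: str, body_iri: str, text_iri: str) -> dict:
    """Build a minimal Web Annotation (JSON-LD shape) that targets one token.

    The annotation lives outside the TEI file (stand-off) and only
    references the token by its ID.
    """
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "body": body_iri,
        "target": f"{text_iri}#{token_id}",
    }


tokens = extract_tokens(SAMPLE_TEI)
annotation = make_annotation(
    "T1_0002",
    "https://example.org/concept/pets",  # hypothetical concept IRI
    "https://example.org/text/T1",       # hypothetical text IRI
)
print(tokens)
print(annotation["target"])
```

Because the annotation only stores a reference to the token ID, the TEI text itself stays unchanged no matter how many annotations point at it.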
Can you download the data?
Yes! All annotations can be downloaded from GitHub as TEI for further use. Where copyright permits, the e-texts are also available as PDF files.
Semantic Modeling in MHDBDB
POS-Tagging
Most of the texts are part-of-speech (POS) tagged, i.e., the words and punctuation marks of a text are assigned to parts of speech (see de.wikipedia.org/wiki/Part-of-speech-Tagging). Among other things, this makes it possible to restrict a search to specific parts of speech. For example, if you do not want to see all results for “minne” in a reference search, you can specify “Minne + proper name” and will then be shown only the instances where “Frau Minne” appears. The POS annotations follow the specifications of the W3C's OntoLex Lemon Lexicography Module (https://www.w3.org/2019/09/lexicog/).
Conceptual system
The MHDBDB's eponymous conceptual system is based on the system developed by Rudolf Hallig and Walther von Wartburg, which was published in 1952. Klaus M. Schmidt, founder of the MHDBDB, applied this system in his dissertation to the vocabulary of Ulrich von Lichtenstein's Frauendienst (Schmidt 1972). This project, which was still implemented with punch cards and Fortran, is the cornerstone of today's MHDBDB.
Over the years, the MHDBDB team gradually modified Hallig and Wartburg's terminology system and adapted it to the corpus and to changes over time. Categories that are considered inappropriate today, such as “races” (which in Hallig and Wartburg's system meant dwarves and giants), were replaced and numerous new categories were added.
In the final step of the relaunch, the old terminology system was implemented in the form of a controlled vocabulary in SKOS (Simple Knowledge Organization System). SKOS not only allows polyhierarchies (hierarchical structures in which a class can have more than one superordinate class), but also the assignment of different labels (synonyms), languages, and additional notes.
All lemmas and tokens in today's MHDBDB are assigned to this new terminology system. Each token refers to at least one (usually several) categories in the tree in the stand-off markup (https://www.digitale-edition.at/o:konde.171). The Middle High German word “katze,” for example, is not only assigned the meaning “mammals + pets,” but also “weapons + siege + battle,” because a “katze” is also a siege weapon. If you only want to search for war machinery and are not interested in animals, you can combine the reference search with the term “siege,” for example.
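The combination of a lemma search with a concept filter can be pictured roughly as in the following Python sketch. It is a simplified illustration, not the MHDBDB's actual search implementation: the token IDs, the flat concept labels, the in-memory dictionary, and the function name `find_tokens` are all assumptions made for the example.

```python
# Hypothetical stand-off annotations: token ID -> (lemma, assigned concept labels).
# In this toy data set each "katze" token has already been disambiguated.
ANNOTATIONS = {
    "tok_a": ("katze", {"mammals", "pets"}),
    "tok_b": ("katze", {"weapons", "siege", "battle"}),
    "tok_c": ("hunt", {"mammals", "pets"}),
}


def find_tokens(annotations, lemma, concept=None):
    """Return the IDs of tokens with the given lemma.

    If a concept is given, keep only tokens annotated with that concept.
    """
    hits = []
    for token_id, (token_lemma, concepts) in annotations.items():
        if token_lemma != lemma:
            continue
        if concept is not None and concept not in concepts:
            continue
        hits.append(token_id)
    return sorted(hits)


all_katze = find_tokens(ANNOTATIONS, "katze")
siege_katze = find_tokens(ANNOTATIONS, "katze", concept="siege")
print(all_katze)
print(siege_katze)
```

Without a concept filter both readings of “katze” are returned; adding “siege” keeps only the token annotated as a siege weapon.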
Onomasticon
Starting in 1992, numerous texts that Kiel-based linguist Horst Pütz had digitized for a dictionary of names were transferred to the MHDBDB. The name categories were integrated into the conceptual system (see above) and the database was given its final name, MHDBDB.
This approach was necessary for technical reasons at the time, but can no longer be justified today, as meanings (semantics) and names (onomastics) are different levels. For this reason, all 83 name categories, such as “weapons/names” (Balmunc, Excalibur, Mimminc), were removed from the concept system and transferred to a separate controlled vocabulary implemented in SKOS.
SKOS not only allows polyhierarchies, but also the assignment of different labels (synonyms), languages, and additional notes.
Textseries/Genres
The texts continue to be assigned to one or more text series (genres) of the text series typology. (For a detailed explanation, see Katharina Zeppezauer-Wachauer and Marco Heiles, A Digital Text Series Typology for German-Language Texts of the Middle Ages and Early Modern Period. Showcase of a Controlled Vocabulary in SKOS, in: Mittelalter. Interdisciplinary Research and Reception History 6 (2023), pp. 6–39, DOI: [https://doi.org/10.26012/mittelalter-30680](https://doi.org/10.26012/mittelalter-30680).)
This makes it possible, for example, to search only for evidence in the genre or text series “Arthurian romance.” The faceted search includes subcategories, i.e., every main category also returns all results of the categories below it. Example: “Grand Epic” returns results from “Heroic Epic,” “Courtly Romance,” “Prose Novel,” “Comic Novel,” “Text Cycle with Frame Narrative,” and all children of these categories.
In addition, the system is polyhierarchical, meaning that a text series often has several parent categories. Example: The text series “Lord's Prayer” has the parent categories “First reading texts,” “Prayer,” “Meisterlied,” and “Liturgy.”
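How a search on a main category can include all lower categories in a polyhierarchy is sketched below in Python. The category labels are taken from the examples above, but the data structure, the placement of “Arthurian romance” under “Courtly Romance,” and the function names `build_narrower` and `expand` are assumptions for illustration; the real typology is a SKOS vocabulary and considerably larger.

```python
from collections import defaultdict

# child -> list of parent categories (comparable to skos:broader).
# Only a small, partly assumed excerpt of the typology.
BROADER = {
    "Heroic Epic": ["Grand Epic"],
    "Courtly Romance": ["Grand Epic"],
    "Arthurian romance": ["Courtly Romance"],  # assumed placement
    "Lord's Prayer": ["First reading texts", "Prayer", "Meisterlied", "Liturgy"],
}


def build_narrower(broader):
    """Invert the child -> parents mapping into parent -> children."""
    narrower = defaultdict(set)
    for child, parents in broader.items():
        for parent in parents:
            narrower[parent].add(child)
    return narrower


def expand(category, narrower):
    """Return the category together with all of its descendants."""
    seen = set()
    stack = [category]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # a node with several parents is visited only once
        seen.add(current)
        stack.extend(narrower.get(current, ()))
    return seen


NARROWER = build_narrower(BROADER)
grand_epic = expand("Grand Epic", NARROWER)
prayer = expand("Prayer", NARROWER)
print(sorted(grand_epic))
print(sorted(prayer))
```

A search restricted to “Grand Epic” would then match texts assigned to any category in the expanded set, and “Lord's Prayer” is reachable from each of its four parents.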
Metadata
The MHDBDB uses [BIBFRAME 2.0](https://www.loc.gov/bibframe/) (RDF) for bibliographic metadata.
The following instances exist at the work/text level:
- Works
- Editions or digital editions of these works
- Electronic texts of these editions (mostly from the 1990s)
- MHDBDB e-texts based on editions, digital editions, or electronic texts of editions
[CIDOC-CRM](https://www.cidoc-crm.org/) (RDF) is used for other metadata objects (persons, places, events), with events currently being used primarily as author biographies and work creation data.
All metadata in the MHDBDB is linked to the Semantic Web or standard data and made available as Linked Open Data (LOD).
Fig.: MHDBDB Main Ontology (MHDBDB metadata in relation to each other)
Ontologies
The relationships between the annotations are mapped in the MHDBDB Main Ontology, which is also available on GitHub: https://github.com/Middle-High-German-Conceptual-Database/mhdbdb-main-ontology-2023
How to quote us
Middle High German Conceptual Database (MHDBDB). University of Salzburg. Coordination: Katharina Zeppezauer-Wachauer. Since 1992. URL: [http://www.mhdbdb.plus.ac.at/](http://www.mhdbdb.plus.ac.at/) (date of access). DOI: 10.60646/MHDBDB.
Infrastructure, Architecture and Design
Lucene: Lucene is used for full-text search and is connected to the database via a GraphDB connector. It enables fast searching across the entire corpus and provides the relevant references to the source documents.
Angular:
- Component-based architecture: Angular uses a component-based architecture that facilitates code reusability and makes development modular and clear.
- Two-way data binding: This feature facilitates synchronization between the model and the view, making development more efficient and reducing the need to write additional code to update the user interface.
- TypeScript-based: Angular is developed using TypeScript, an extension of JavaScript. TypeScript offers strict typing, which leads to better code quality, easier debugging, and higher developer productivity.
- Extensive tool support: Angular offers a variety of built-in tools for testing, building, and optimizing applications, which reduces development time.
- Strong ecosystem: Thanks to its widespread use and support from Google, Angular has a strong ecosystem that offers rich resources, libraries, and an active community.
- Integration with modern web technologies: Angular is compatible with the latest web development standards and integrates well with other technologies and frameworks.
TypeScript: Angular and TypeScript are closely linked, as Angular is written in TypeScript. TypeScript offers stricter typing than JavaScript, which makes debugging and maintenance easier, especially in large projects. This leads to higher code quality and scalability.
GraphDB: Version 10.3.3 is used. License for this software component: GraphDB Enterprise Edition, evaluation license for universities.
Link to architecture: [https://drive.google.com/file/d/1ooWBbMBAf0YDKjiKDV40yy4voyzEpe2o/view?usp=sharing](https://drive.google.com/file/d/1ooWBbMBAf0YDKjiKDV40yy4voyzEpe2o/view?usp=sharing)