Guidelines for a standardized data format for use in cross-linguistic studies

Date: October 16, 2018
Source: Max Planck Institute for the Science of Human History
Summary: An international team of researchers has set out a proposal for new guidelines on cross-linguistic data formats, in order to facilitate sharing and data comparisons between the growing number of large linguistic databases worldwide. This format provides a software package, a basic ontology and usage examples.

FULL STORY

An international team of researchers, members of the Cross-Linguistic Data Formats Initiative (CLDF) led by the Max Planck Institute for the Science of Human History, has set out a proposal for new guidelines on cross-linguistic data formats, in order to facilitate sharing and data comparisons between the growing number of large linguistic databases worldwide. This format provides a software package, a basic ontology and usage examples.

The number of linguistic databases worldwide is growing, raising the possibility of a vast network for comparative studies. However, these databases are generally created independently of each other, and often have a unique and narrow focus. As a result, the data are often encoded in different formats, which makes it difficult to compare data effectively across databases.

In an effort to resolve these issues, the Cross-Linguistic Data Formats Initiative (CLDF) was created. In a paper published in Scientific Data, the CLDF sets out proposed guidelines for a standardized format for linguistic databases, and also supplies a software package, a basic ontology and usage examples of best practices. The goal of this effort is to facilitate sharing and re-use of data in comparative linguistics.

Standardizing data formats to facilitate sharing and reuse

The CLDF recommendations rest on a data model that aims to be simple yet expressive, and that is based on the data model previously developed for the Cross-Linguistic Linked Data (CLLD) project. This model has four main entities: (a) Languages; (b) Parameters; (c) Values; and (d) Sources. In the model, each Value is related to a Parameter and a Language, and can be based on multiple Sources. In addition, there are References to Sources, and a Reference can have a Context (for a printed source, for example, the page numbers).
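To make the relationships between these entities concrete, here is a minimal sketch in Python. It is an illustration only, not the CLDF software package: the class layout, field names and example data are assumptions made for this sketch, and only the entity names and their relationships come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Language:
    id: str
    name: str


@dataclass(frozen=True)
class Parameter:
    id: str
    name: str


@dataclass(frozen=True)
class Source:
    id: str
    citation: str


@dataclass(frozen=True)
class Reference:
    # A Reference points at one Source and may carry a Context,
    # for example page numbers in a printed source.
    source: Source
    context: Optional[str] = None


@dataclass
class Value:
    # Each Value is related to a Language and a Parameter
    # and can be backed by several References.
    language: Language
    parameter: Parameter
    value: str
    references: List[Reference] = field(default_factory=list)


# Invented example data, for illustration only.
lang = Language(id="lang1", name="Example Language")
param = Parameter(id="param1", name="Basic word order")
src = Source(id="src1", citation="Hypothetical Grammar (2000)")
val = Value(lang, param, "SOV", [Reference(src, context="pp. 12-15")])
```

Keeping the Context on the Reference rather than on the Source means the same source can be cited at different pages by different values.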

The CLDF data model is a package format: a dataset is made up of a set of data files containing tables, plus a descriptive file that defines the relationships between the tables. Each linguistic data type has a CLDF module, along with additional components that capture aspects of the data recurring across multiple data types. The CLDF modules also use terms from the CLDF ontology, a vocabulary of objects and properties with well-known semantics in comparative linguistics. This makes it possible for users to reference these terms in a uniform way.
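The package idea, tables in data files plus one descriptive file that ties them together, can be sketched with the Python standard library alone. The file names, column names and the layout of the descriptive file below are simplified assumptions made for this illustration; they are not the actual CLDF metadata schema.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_example_dataset(folder: Path) -> Path:
    """Write two small tables plus a descriptive file; return the latter's path."""
    (folder / "languages.csv").write_text(
        "ID,Name\nlang1,Example Language\n", encoding="utf-8"
    )
    (folder / "values.csv").write_text(
        "ID,Language_ID,Value\nval1,lang1,SOV\n", encoding="utf-8"
    )
    # Hypothetical descriptive file: lists the tables and how they link.
    description = {
        "tables": [
            {"file": "languages.csv", "key": "ID"},
            {
                "file": "values.csv",
                "key": "ID",
                # Language_ID in values.csv points at ID in languages.csv.
                "links": {"Language_ID": "languages.csv"},
            },
        ]
    }
    path = folder / "description.json"
    path.write_text(json.dumps(description, indent=2), encoding="utf-8")
    return path


def load_dataset(description_path: Path) -> dict:
    """Read every table named in the descriptive file into a list of row dicts."""
    description = json.loads(description_path.read_text(encoding="utf-8"))
    tables = {}
    for table in description["tables"]:
        table_path = description_path.parent / table["file"]
        with open(table_path, newline="", encoding="utf-8") as fh:
            tables[table["file"]] = list(csv.DictReader(fh))
    return tables


with tempfile.TemporaryDirectory() as tmp:
    dataset = load_dataset(write_example_dataset(Path(tmp)))
```

Because the descriptive file names every table, a reader of the dataset never has to guess which files belong together.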

A software package to enable validation and manipulation

The CLDF specifications use common file formats -- such as CSV, JSON and BibTeX -- that are widely supported, with the goal that these files can easily be read and written on many platforms. Even more importantly, the standardized format will allow researchers without programming skills to access and manipulate the data with preexisting tools, rather than this ability being limited to researchers with sufficient programming skills to create their own tools. To facilitate this, the CLDF has created a "cookbook" repository for scripts for use with the CLDF specifications.
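Because the tables are plain CSV, a basic consistency check needs nothing beyond a standard library. The sketch below is not the CLDF software package or one of its cookbook scripts; the column names and the example rows are assumptions made for this illustration. It reports value rows that point at a language missing from the languages table.

```python
import csv
import io

# Invented example tables, for illustration only.
LANGUAGES_CSV = "ID,Name\nlang1,Example Language\nlang2,Another Language\n"
VALUES_CSV = (
    "ID,Language_ID,Value\n"
    "val1,lang1,SOV\n"
    "val2,lang3,SVO\n"  # lang3 is deliberately absent from the languages table
)


def read_rows(text: str) -> list:
    """Parse CSV text into a list of dicts keyed by column header."""
    return list(csv.DictReader(io.StringIO(text)))


def dangling_language_ids(values: list, languages: list) -> list:
    """Return IDs of value rows whose Language_ID matches no language row."""
    known = {row["ID"] for row in languages}
    return [row["ID"] for row in values if row["Language_ID"] not in known]


problems = dangling_language_ids(read_rows(VALUES_CSV), read_rows(LANGUAGES_CSV))
print(problems)  # prints ['val2']
```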

"We want to bring access to these data and the ability to compare them to as many researchers as possible," states Johann-Mattis List of the Max Planck Institute for the Science of Human History. Robert Forkel, one of the driving forces behind the CLDF initiative, also notes that the CLDF format is not limited to linguistic data alone, but can also incorporate databases of cultural and geographic data, for example. "CLDF may drastically facilitate the testing of questions regarding the interaction between linguistic, cultural, and environmental factors in linguistic and cultural evolution."

Story Source:

Materials provided by Max Planck Institute for the Science of Human History. Note: Content may be edited for style and length.

Journal Reference:

Robert Forkel, Johann-Mattis List, Simon J. Greenhill, Christoph Rzymski, Sebastian Bank, Michael Cysouw, Harald Hammarström, Martin Haspelmath, Gereon A. Kaiping, Russell D. Gray. Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data, 2018; 5: 180205 DOI: 10.1038/sdata.2018.205

Cite This Page:

Max Planck Institute for the Science of Human History. "Guidelines for a standardized data format for use in cross-linguistic studies." ScienceDaily. ScienceDaily, 16 October 2018. <www.sciencedaily.com/releases/2018/10/181016094425.htm>.