Single Portal
A single-portal prototype project is planned to build Linked Data and to provide, through cooperation, a single portal for the online accessible records of 1.5 million Harvard Library, 20k MIT, 400k-1m UIUC, and DPLA holdings. The prototype will address how metadata will be pulled regularly from Well-Designed Digital Libraries (WDDLs), how to build Linked Data with Common Terminology (CT) from their original records, and how to build a union catalog to maximize search performance.
Abstract
The Common Terminology expanded prototype project seeks a grant to build Linked Data with a new tool, Common Terminology (CT), and to implement a single search interface for the online accessible records of four cooperating libraries: 1.5 million Harvard records, 20k Qualified Dublin Core (QDC) records from the Massachusetts Institute of Technology (MIT), 400k-1m MARCXML records from the University of Illinois at Urbana-Champaign (UIUC), and Metadata Application Profile (MAP) records from the Digital Public Library of America (DPLA). The prototype tests and evaluates whether CT, developed during 2012-2014, can address three questions: how metadata can be pulled regularly from cooperating institutions and organizations so that they share their infrastructure, how to build Linked Data with CT from their original records, and how to build a union catalog with CT to maximize search performance. It explores the potential for libraries to share their collections through CT regardless of which standards their communities use. It tests how effectively CT can implement Linked Data while saving cost, especially for MARC records. The prototype is expected to show that CT can be widely used and adapted in the library field to improve visibility and discoverability on the Web. It will have a broad impact on libraries and archives by addressing challenges in improving interoperability and in building Linked Data regardless of the standards communities use. Lastly, it will be a cornerstone for developing a single portal of many well-designed digital libraries for the International Open Public Digital Library (IOPDL). The CT expanded project will report directly to the President of IOPDL and will be advised by Dean Smith and Professor Dubin of GSLIS at UIUC. It will run for two years: April 2015 – March 2017.
-
Project Justification
1.0. Interoperability Problems and Common Terminology (CT)
Interoperability problems pose a barrier for many libraries and/or archives to sharing and exchanging information among digital libraries and repositories. This is due to the use of diverse metadata standards, and their different degrees of generality or specificity. This causes loss of information at all metadata model levels (e.g., schema, schema definition language, record, and repository).
As a possible solution, a Common Terminology (CT) based on the DCMI Abstract Model is proposed. CT is intended to improve interoperability by allowing communities to use their own standards while providing uniformity in searching. Taking commonly used standards (MARC, MODS, DC, and QDC) as its bases, CT was developed by the CT project during 2012-2014 as a bridge across different levels of generality and specificity. The CT project paper is available at https://www.ideals.illinois.edu/handle/2142/50100.
The developed CT is a set of 12 Common Terms (properties) and 58 qualifiers (sub-properties) with CTScheme at the schema metadata model level. CTScheme is defined as an enumerated set of resources used as a controlled set of values including authorities, Syntax Encoding Scheme and Vocabulary Encoding Scheme of DCMI. CT is represented as XML schema, RDF schema, and SKOS concepts at the schema definition language metadata model level.
The performance of CT in achieving and improving metadata interoperability was assessed through empirical evaluations at the record level. The experiments were conducted with MIT (QDC) and UIUC (MARCXML) metadata records, obtained through cooperation with universities in the USA. The results show that CT considerably minimizes loss of information by reducing the gaps among different standards (0.00463% in the MIT (QDC) to CT mapping; 4.729% in the UIUC (MARCXML) to CT mapping). CT also significantly increases mapping accuracy, showing high lexical match rates (98.7% in the QDC to CT mapping; 95.2709% in the MARCXML to CT mapping) and 100% semantic match rates.
These results are notable considering that CT has only 12 common terms (fewer than the Dublin Core elements) and 58 qualifiers (far fewer than the MARC tags and subfield codes). They indicate that CT is an innovative and fundamental way to overcome a major interoperability barrier across multiple schemas in libraries and archives.
Based on the successful performance of CT, a rapid prototype is planned to provide a portal for the Harvard, MIT, and UIUC libraries and DPLA, which provide their metadata through cooperation. Its purpose is to connect their several million online accessible records on the Web and to make the cooperating libraries' records discoverable, visible, and usable there. The portal will be built on Linked Open Data (LOD) created with CT and on a CT union catalog. Building LOD with the developed CT will indicate how LOD can be built effectively from records in different standards, and how much CT improves interoperability among institutions and organizations in the library field.
1.1. Challenges in Building Linked Data
Several organizations and institutions have experimented with building Linked Data to make library materials discoverable and visible on the Web. Europeana implemented Linked Open Data (LOD) as a way of publishing structured data represented in the Europeana Data Model (EDM). Europeana indicates that LOD connects and enriches the metadata of related resources and makes it easy to access on the Web. The Library of Congress initiated BIBFRAME (the Bibliographic Framework Initiative) to make MARC bibliographic descriptions visible and sharable on the Web by enriching MARC records with URIs. The Amsterdam Museum in the Netherlands made its world-class collection, which has international historical and art-historical value, available as LOD, launching the whole collection of around 70,000 objects online in March 2011. The British Library released 3 million British National Bibliography records to the public, assigning each record a URI that makes it dereferenceable. In Japan, the LODAC project built LODAC Museum in 2012, an LOD dataset of museum collection information that advances data sharing and consists of 40 million triples from 114 museums and institutes.
Several practical projects have built Linked Data in America. In 2013, the Smithsonian American Art Museum linked its dataset of 41,000 objects and 8,000 artists to the hub datasets of DBpedia and the Getty Vocabularies. The project team points out that the database-to-RDF mapping process was very complex. OCLC makes 290 million LOD-friendly bibliographic records in WorldCat accessible on the Web, retrievable as embedded RDFa or in other formats: RDF/XML, JSON, text/turtle, and plain text (Cole, Han, Weathers, & Joyner, 2013). UIUC explored adding links to MODS metadata transformed from MARCXML, transforming it into non-library-specific LOD-friendly semantics with links such as VIAF and LCSH URIs, and deploying it as RDF to maximize the utility of these records (Cole, Han, Weathers, & Joyner, 2013). However, the MODS records transformed from MARCXML do not appear to have been fully mapped into RDF and LOD. The authors report as one of the challenges that “there are too many semantics options available for creating RDF representations of bibliographic records. Since the traditional library bibliographic records carrier, MARC, is not suitable for LOD and the Semantic Web environment, early experimenters of library LOD often have developed their own namespaces and semantics when publishing their catalog records as LOD data sets… As a result, there are too many semantic sets used for library LOD data sets. No single semantic set seems sufficient for describing library bibliographic catalog records” (Cole, Han, Weathers, & Joyner, 2013).
In 2012, Harvard tried to expose MARC bibliographic information for the Countway Library's digital collections as Linked Open Data and to enhance searches by using the new data points. A chosen set of 67 records was converted into a tab-separated file after a subset of all available MARC fields and subfields was selected to map to RDF. The fields were grouped into date, international serial number, linking, name, subject, and title. The team identified external data sources and linked to the corresponding term URIs for MeSH and LC sources. They also developed a simple interface, using SPARQL queries, to demonstrate the utility of searching records through Linked Open Data. However, they report that developing code to parse the name field was challenging because of the many variations (presumably due to subfield codes) and the different locations of the data (Cheng, 2012). A joint project of the Harvard, Cornell, and Stanford libraries has been building Linked Data for the three libraries' records since December 2013.
Despite these efforts, there is to date no generally accepted standard for building Linked Data. The methodology differs considerably depending on the schemas that communities use. In particular, challenges are reported for MARC bibliographic records, which have a high degree of specificity. A standard way to build Linked Data is therefore needed to improve the discovery and visibility of, and access to, the materials of libraries and archives. Because Common Terminology (CT) is a fundamental way to achieve interoperability across multiple schemas, implementing Linked Data with CT offers an innovative way to build Linked Data across the multiple schemas that communities in libraries and/or archives use according to their needs.
1.2. Implementing Linked Data with Common Terminology (CT)
The CT expanded project for which the grant is sought will implement and test Linked Data built with CT, which achieves and improves interoperability across multiple schemas while minimizing loss of information and preserving the accuracy of the original records. CT is simple yet informative enough to represent bibliographic records across multiple schemas.
As a rapid prototype, the CT project will build Linked Data for the cooperating libraries that provide their online accessible records: the Harvard, Massachusetts Institute of Technology (MIT), and University of Illinois at Urbana-Champaign (UIUC) university libraries, and the Digital Public Library of America (DPLA). The Harvard Library Innovation Lab provided 1.5 million online accessible Library Cloud records in June 2014. MIT provided 20,000 Qualified Dublin Core records in June 2014. UIUC provided 400,000 MARCXML records and will provide more. DPLA will provide its online accessible records in the late spring of 2015, after finishing ongoing work on its metadata ingestion system.
The rapid prototype will test how records in the original, differing schemas can be transformed into CT; which elements (common terms) in the converted CT records can be connected to make them discoverable; and how Linked Data can be built for the three university libraries and DPLA. No risks are anticipated for the rapid prototype, because the Linked Data will be built on the secure https://ct.iopdl.org website to protect the records of the cooperating organizations, and the original records will be preserved separately. The experiment is expected to benefit the future of the library field.
The rapid prototype will be a significant advance toward building Linked Data efficiently and simply with CT, no matter what schemas libraries use. It tests how effectively CT can implement Linked Data while saving cost, especially for MARC records. It explores the potential for libraries and archives to build their Linked Data easily, to connect their resources on the Web, and to share their collections through CT regardless of which standards their communities use. The prototype is expected to show that CT can be widely used and adapted, and to have a broad impact on the library field by addressing challenges in improving interoperability and building Linked Data.
1.3. Building a Single Search Interface with the CT Union Catalog and Linked Data
After the Linked Data is built with CT, a union catalog using CT will be built for the four cooperating libraries and stored at https://ct.iopdl.org as a relational database. A newly developed tool will place related and connected records close together regardless of the originating organization or other factors. CT itself still records where each record originally comes from, through the ‘source’ qualifier (sub-property) of the ‘identifier’ common term (property), and the URIs through which users can access the material.
With the CT union catalog, a single search interface will be designed for users of the four libraries, such as professors, staff, students, and professionals. It will make the online accessible materials of the four libraries discoverable and accessible to those users, which benefits them directly. The single search interface based on CT is intended to improve search performance for the four libraries by shortening response time and maximizing the relevance of search results.
Moreover, it will be a promising model for sharing infrastructure among libraries and archives. It will be a cornerstone for developing a single portal through which many national digital libraries share their infrastructure and connect their collections for the International Open Public Digital Library (IOPDL). IOPDL is a non-profit organization that aims to make collections from all over the world freely accessible to the public in cooperation with existing digital libraries, and thereby to enlarge educational opportunities at home, at school, or elsewhere at any time.
The IOPDL project seeks grants to establish a single international digital library platform connecting rich collections from all over the world, and to maximize efficiency in (re)using, sharing, and accessing the world's digital infrastructure while saving cost significantly. The rapid prototype of the CT project will serve as an essential, cost-saving prototype for IOPDL task 3, Establishing Cutting-edge Technology with Common Terminology (CT) to Improve Interoperability, and task 4, Developing a Single Portal and Interface Connecting Well-Designed Digital Libraries all over the world.
-
Project Work Plan
The CT expanded project will report directly to the President of IOPDL and may be advised by Dean Smith and Professor Dubin of GSLIS at UIUC during its two years: 2015-2017.
2.0. Designing CT SKOS Crosswalk for MARC, MODS, DC, QDC and MAP to CT Mappings – 2 Months
The task is to design a semantic crosswalk between MARC, MODS, DC, QDC, and CT using CT SKOS concepts. The existing MARC, MODS, DC, and QDC to CT version 1.1 crosswalk shows how these schemas are semantically and lexically mapped into CT while minimizing loss of information and preserving as much of the specificity and precision of the source metadata records as possible. For the empirical experiments with MIT (QDC) and UIUC (MARCXML), an MIT (QDC) to CT crosswalk and a UIUC (MARCXML) to CT crosswalk were developed. However, the CT project team sees the need to design a CT SKOS crosswalk for the four schemas based on these crosswalks, so that communities can design their own crosswalks for their needs.
The task will also investigate which vocabularies, terms, or elements each library uses to describe its records, and how often, in order to decide which record information must be preserved to minimize loss of data. It builds on the experiments carried out while developing CT, which analyzed MARC tag usage at Harvard and UIUC and QDC element usage at MIT.
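The usage investigation could start from a simple element count. The following is a minimal Python sketch, assuming the records arrive as XML; the function name `element_usage` and the sample records are hypothetical and serve only to illustrate the idea.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_usage(xml_text):
    """Count how often each non-empty element name appears in a set of records."""
    counts = Counter()
    for elem in ET.fromstring(xml_text).iter():
        # ElementTree stores tags as "{namespace}name"; keep only the local name.
        name = elem.tag.rsplit("}", 1)[-1]
        if elem.text and elem.text.strip():
            counts[name] += 1
    return counts


# Hypothetical sample: two Dublin Core style records.
SAMPLE = """<records xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record><dc:title>A</dc:title><dc:subject>x</dc:subject><dc:subject>y</dc:subject></record>
  <record><dc:title>B</dc:title></record>
</records>"""

usage = element_usage(SAMPLE)
for name, count in usage.most_common():
    print(name, count)
```

Container elements that hold only whitespace are skipped, so the counts reflect elements that actually carry values.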
In addition, the task will work with the Harvard online accessible records retrieved from Library Cloud, which were provided in June 2014. These are not MARC records but Harvard's own descriptions, derived from MARC records to build Library Cloud. The CT SKOS crosswalk will be expanded to accommodate these descriptions. The Metadata Application Profile (MAP) records of DPLA should also be included in the SKOS crosswalk, because MAP is DPLA's new way of achieving interoperability, based on DC and the Europeana Data Model (EDM). Designing the CT SKOS crosswalk will offer a common way for communities to use CT in practice across multiple standards. The task is scheduled for April – May 2015.
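To illustrate the idea of a SKOS-based crosswalk, the sketch below stores mappings as tuples tagged with SKOS mapping relations (skos:exactMatch, skos:closeMatch, skos:broadMatch). The specific pairings, the choice of relation for each pair, and the helper names are assumptions made for demonstration; they are not taken from the CT 1.1 crosswalk.

```python
# Illustrative entries: (source schema, source element, SKOS relation, CT term).
# These pairings are assumptions, not the published CT crosswalk.
CROSSWALK = [
    ("MARC", "245", "skos:closeMatch", "title"),
    ("MARC", "650", "skos:closeMatch", "subject"),
    ("DC", "title", "skos:exactMatch", "title"),
    ("DC", "subject", "skos:exactMatch", "subject"),
    # broadMatch: the CT term is broader than the source element.
    ("QDC", "alternative", "skos:broadMatch", "title"),
]


def ct_term(schema, element):
    """Return (CT term, SKOS relation) for a source element, or None if unmapped."""
    for src_schema, src_element, relation, term in CROSSWALK:
        if (src_schema, src_element) == (schema, element):
            return term, relation
    return None


def sources_for(term):
    """List every (schema, element) pair that maps to the given CT term."""
    return [(s, e) for s, e, _, t in CROSSWALK if t == term]


print(ct_term("MARC", "245"))
print(sources_for("subject"))
```

Keeping the relation alongside each mapping lets a community see at a glance where a mapping is exact and where specificity is reduced.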
2.1. Developing Conversions – 4 Months
The task explores a common way for communities to develop crosswalks and conversions for their schemas using the CT SKOS crosswalk. Based on that crosswalk, conversions will be developed in the Python programming language to convert Harvard Library Cloud, MIT QDC, UIUC MARCXML, and DPLA MAP records into CT. The CT project team has already developed conversions for the empirical evaluation of CT performance, namely the conversion of MIT (QDC) records into CT 1.1 and the conversion of 400,000 UIUC (MARCXML) records into CT 1.1. These should be upgraded so that they are generalized through the CT SKOS crosswalk and adapted to the new record descriptions that were provided in June 2014 and were not used in the evaluations. A conversion for DPLA MAP records, based on the CT SKOS crosswalk, also needs to be developed. The task will produce either four conversions or one combined conversion that converts the four different types of records into CT. Developing a combined conversion will be a challenge, but it will be worthwhile in order to accommodate as many schemas as possible in the future. The task also yields the CT records converted from the records of the four libraries. The task is scheduled for June – September 2015.
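A conversion of this kind can be sketched in a few lines of Python. The example below is a minimal illustration, assuming a flat XML record and a dictionary-style crosswalk; the mapping `QDC_TO_CT`, the function `convert_record`, and the sample record are hypothetical and do not reproduce the project's actual conversion programs. Unmapped elements are collected so that loss of information can be reported.

```python
import xml.etree.ElementTree as ET

# Hypothetical element-to-CT mapping; the real CT crosswalk is not reproduced here.
QDC_TO_CT = {
    "title": "title",
    "creator": "contributor",
    "subject": "subject",
    "issued": "date",
    "publisher": "publisher",
}


def convert_record(record_xml, mapping):
    """Convert one XML record into a dict of CT terms plus a list of unmapped elements."""
    ct_record, unmapped = {}, []
    for elem in ET.fromstring(record_xml):
        name = elem.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
        value = (elem.text or "").strip()
        if not value:
            continue
        if name in mapping:
            ct_record.setdefault(mapping[name], []).append(value)
        else:
            unmapped.append(name)
    return ct_record, unmapped


# Hypothetical Qualified Dublin Core record.
RECORD = """<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:title>Linked Data Basics</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dcterms:issued>2014</dcterms:issued>
  <dcterms:extent>12 p.</dcterms:extent>
</record>"""

ct, lost = convert_record(RECORD, QDC_TO_CT)
print(ct)
print("unmapped:", lost)
```

Because the mapping is passed in as data, the same function could serve several source schemas, which is one possible route toward a combined conversion.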
2.2. Building Linked Data for the Four Cooperating Libraries – 8 Months
Task 2.1 will yield CT records converted from records in the different schemas. With these outputs, task 2.2 will develop a Python program that builds Linked Data for the four cooperating libraries. To date, Harvard Library has built Library Cloud for its own library and is involved in a project to build Linked Data with the Cornell and Stanford libraries. MIT implemented a mapping from QDC to RDF as part of an effort to harvest from DSpace@MIT to populate the SIMILE Longwell tool. UIUC is trying to connect its library records to LCSH through a complex process: transforming MARCXML to MODS, deploying it as RDF, and linking the result to LCSH. DPLA indicates that the MAP will be developed toward more linked-data-friendly methods. That is, these institutions have not yet completed Linked Data for their libraries. This task can therefore have a significant cost-saving effect in linking the records of the four libraries and, further, of many other libraries and archives.
The task will determine which common terms (properties), such as title, contributor, subject, date, or publisher, can be selected as primary terms for linking the converted CT records of the four libraries. The program developed to build the Linked Data will connect records that have the same title or subjects in a linked data structure. It will first evaluate, by keyword density, which records are more closely related. The priority for connecting records will be decided according to this evaluation. Because CT is simple yet informative, the processing is expected to be simpler than with BIBFRAME and other methods. Through this task, Linked Data will be built for the four libraries. It will offer a dependable, simple, and easy way to build Linked Data that many libraries or organizations can adopt for their own units. The task is scheduled for October 2015 – May 2016.
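The linking step can be illustrated with a small Python sketch. The proposal does not define how keyword density is computed, so the example substitutes a simple Jaccard overlap of title and subject words as a stand-in; the function names, record identifiers, and sample records are likewise hypothetical.

```python
from itertools import combinations


def keywords(record):
    """Lowercased word set drawn from a record's title and subject values."""
    words = set()
    for term in ("title", "subject"):
        for value in record.get(term, []):
            words.update(value.lower().split())
    return words


def link_records(records):
    """Return (id_a, id_b, score) links, sorted from most to least related."""
    links = []
    for (id_a, rec_a), (id_b, rec_b) in combinations(sorted(records.items()), 2):
        a, b = keywords(rec_a), keywords(rec_b)
        if a & b:
            # Jaccard overlap: shared words divided by all distinct words.
            links.append((id_a, id_b, len(a & b) / len(a | b)))
    return sorted(links, key=lambda link: -link[2])


# Hypothetical converted CT records from the four sources.
RECORDS = {
    "harvard:1": {"title": ["Linked data"], "subject": ["semantic web"]},
    "mit:7": {"title": ["Linked data basics"], "subject": ["semantic web"]},
    "uiuc:3": {"title": ["Medieval maps"], "subject": ["cartography"]},
    "dpla:9": {"title": ["Web archives"], "subject": ["preservation"]},
}

links = link_records(RECORDS)
for id_a, id_b, score in links:
    print(id_a, id_b, round(score, 3))
```

Comparing every pair of records is only practical for small samples; at the scale of millions of records an index from keyword to record would be needed to find candidate pairs.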
2.3. Designing a CT Union Catalog – 4 Months
Using the Linked Data developed in task 2.2, a CT union catalog will be designed as a relational database. Records will be placed close together in the relational database according to the order of the connected records, which represents how closely they are related, so as to maximize the performance of a search engine. The task develops a Python program that designs the CT union catalog and results in the designed catalog. The task is scheduled for June – September 2016.
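As a rough illustration of such a catalog, the sketch below uses Python's built-in `sqlite3` module. The table layout, the `cluster` column used to group related records, and the example.org URIs are assumptions for demonstration; the proposal does not specify the database schema or engine.

```python
import sqlite3


def build_catalog(records, cluster_of):
    """Create an in-memory union catalog; cluster_of maps record id -> cluster number."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE catalog (
               record_id TEXT PRIMARY KEY, source TEXT, title TEXT,
               uri TEXT, cluster INTEGER)"""
    )
    db.execute("CREATE INDEX idx_cluster ON catalog (cluster)")
    rows = [
        (rid, rec["source"], rec["title"], rec["uri"], cluster_of[rid])
        for rid, rec in records.items()
    ]
    # Insert in cluster order so related records receive adjacent rowids.
    rows.sort(key=lambda row: (row[4], row[0]))
    db.executemany("INSERT INTO catalog VALUES (?, ?, ?, ?, ?)", rows)
    db.commit()
    return db


def related(db, record_id):
    """Fetch the other records that share a cluster with record_id."""
    return db.execute(
        """SELECT record_id, source FROM catalog
           WHERE cluster = (SELECT cluster FROM catalog WHERE record_id = ?)
             AND record_id != ?
           ORDER BY record_id""",
        (record_id, record_id),
    ).fetchall()


# Hypothetical records and cluster assignments derived from the linking step.
RECORDS = {
    "harvard:1": {"source": "Harvard", "title": "Linked data", "uri": "https://example.org/h/1"},
    "mit:7": {"source": "MIT", "title": "Linked data basics", "uri": "https://example.org/m/7"},
    "uiuc:3": {"source": "UIUC", "title": "Medieval maps", "uri": "https://example.org/u/3"},
}
CLUSTERS = {"harvard:1": 0, "mit:7": 0, "uiuc:3": 1}

db = build_catalog(RECORDS, CLUSTERS)
print(related(db, "harvard:1"))
```

The `source` column keeps the originating library with each row, so grouping related records does not hide where a record came from.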
2.4. Designing a Single Portal for Harvard, MIT, UIUC, and DPLA and Evaluations – 4 Months
Using the CT union catalog and the Linked Data developed in tasks 2.2 and 2.3, task 2.4 will design a single portal and search interface for Harvard, MIT, UIUC, and DPLA. The task tests how effective and efficient the search engine built on the Linked Data and the CT union catalog is. It will explore a common way for libraries to share their collections through Linked Data and a union catalog. The task also evaluates how much search performance can be improved, in terms of response time (by locating closely related records together) and relevance (through Linked Data that links records according to their degree of relatedness). The task is scheduled for October 2016 – January 2017.
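The two measures named here, response time and relevance, can be sketched as follows. The example is a minimal Python illustration: precision at k is used as one common relevance measure (the proposal does not name a specific metric), and `toy_search`, the catalog entries, and the relevance judgments are hypothetical.

```python
import time


def precision_at_k(results, relevant, k):
    """Fraction of the top-k results that appear in the relevant set."""
    if k <= 0:
        return 0.0
    return sum(1 for r in results[:k] if r in relevant) / k


def timed_search(search, query):
    """Run search(query) and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = search(query)
    return results, time.perf_counter() - start


# Hypothetical catalog entries: (record id, title).
CATALOG = [
    ("harvard:1", "Linked data"),
    ("mit:7", "Linked data basics"),
    ("uiuc:3", "Medieval maps"),
    ("dpla:9", "Web archives"),
]


def toy_search(query):
    """Stand-in search: ids of records whose title contains the query."""
    q = query.lower()
    return [rid for rid, title in CATALOG if q in title.lower()]


results, elapsed = timed_search(toy_search, "linked")
score = precision_at_k(results, {"harvard:1"}, 2)
print(results, score, f"{elapsed:.6f}s")
```

In an actual evaluation the same queries would be run against the portal and against a baseline, with relevance judged by users of the four libraries.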
2.5. Preparing the White Paper – 2 Months
Task 2.5 will document all processes, products, and results of the rapid prototype project in a white paper. The paper will describe how the prototype can lead to a promising way to build Linked Data and search, and how it can be adapted broadly by libraries and archives according to their needs. It will also explain how the prototype differs from existing efforts to link data on the Web, and what advantages and benefits this method offers libraries and archives. It will be shared with the library community. The task is scheduled for February – March 2017.
-
Project Results
3.1. Project Results
The rapid prototype to build Linked Data and a search interface for Harvard, MIT, UIUC, and DPLA will result in:
- A CT SKOS crosswalk that represents semantic relationships across multiple schemas such as MARC, MODS, DC, QDC, and MAP. It offers a common way for communities to design a crosswalk with CT according to their needs, and it can be widely applied to design crosswalks for different schemas through semantic concepts.
- Four conversions, or a combined conversion, that convert the four different types of records into CT. A combined conversion would be worthwhile in order to accommodate as many schemas as possible in the future.
- The CT records converted from Harvard Library Cloud, MIT QDC, UIUC MARCXML, and DPLA MAP records.
- Linked Data for the four libraries, which suggests an effective, simple, easy, and cost-saving way to build Linked Data that many libraries or organizations can adopt for their own units.
- A CT union catalog that shows how far search engine performance can be maximized by locating closely related records together according to their degree of relatedness.
- The designed single portal and search interface for Harvard, MIT, UIUC, and DPLA.
- Evaluations that test how effective and efficient the search engine built on the Linked Data and the CT union catalog is. They also assess how much search performance can be improved, in terms of response time (by locating closely related records together) and relevance (through Linked Data that links records according to their degree of relatedness).
- A white paper that documents all processes, products, and results of the rapid prototype project, to be shared with the library community. It describes how the prototype can lead to a promising way to build Linked Data and search, and how it can be adapted broadly by libraries and archives according to their needs. It also explains what advantages and benefits this method offers libraries and archives.
- Programs:
  - Usage program that investigates which vocabularies, terms, or elements are used to describe the records of each library, and how many times each is used across all records.
  - Four conversion programs (Harvard Library Cloud to CT, MIT QDC to CT, UIUC MARCXML to CT, and DPLA MAP to CT), or a combined conversion program that converts the four different descriptions to CT.
  - Linked Data program that builds the Linked Data and investigates the degree of relatedness by keyword density.
  - CT union catalog program that builds a relational database and designs the CT union catalog.
  - Searching program that carries out the activities involved in searching and retrieving records.
  - Evaluation program that measures response time and relevance to evaluate how much search performance can be improved.
3.2. Project Results Impact
The rapid CT expanded prototype is intended to provide a dependable solution for improving interoperability, first for university libraries and then for many other libraries, while minimizing loss of information. The prototype will address how to build Linked Data with the developed Common Terminology (CT) from the libraries' original records, how to build a CT union catalog to maximize search performance, and what advantages and benefits this method offers the library community. It suggests a promising way for libraries to adopt the method and use it in practice to share their collections regardless of which standards their communities use. The prototype is expected to demonstrate that CT can build Linked Data effectively while saving cost and minimizing loss of data, especially for MARC records. As a result, CT could be widely used and adapted, and could have a broad impact on the library field by addressing challenges in improving interoperability. Lastly, it will offer an economical and effective model for developing a single portal of many digital libraries for the International Open Public Digital Library (IOPDL) at significantly lower cost.
-
Estimated Budget for Two Years
Salaries: $360,000 (4 developers)
Total Requested Grant Funds: $360,000