CT Performance
To achieve metadata interoperability at multiple levels, we propose the Common Terminology (CT). Because existing metadata standards differ widely in generality and specificity, it is difficult to achieve interoperability among them without losing information. CT is a bridge terminology among different standards: communities keep using their own standards, while searching across them becomes uniform.
As a prototype and case study, CT 1.1 was developed to achieve interoperability among four significantly different standards (MARC, MODS, DC and QDC), using metadata from three universities (Harvard MARC, MIT QDC and UIUC MARCXML).
CT 1.1 improves interoperability at multiple levels by using existing techniques for each metadata level, such as crosswalks and conversions.
At the schema level, terms that maximize lexical and semantic interoperability were chosen as Common Terms (properties) and qualifiers (subproperties). At the schema definition language level, CT is represented in ct.xsd (XML Schema), ct.rdf (RDF Schema), and ctskos.rdf (SKOS concepts).
At the record level, the performance of CT in achieving and improving metadata interoperability was demonstrated through empirical evaluations using conversion programs written in Python. The experiments used Harvard (MARC), MIT (QDC), and UIUC (MARCXML) metadata records, provided through the cooperation of three universities in the USA.
In the MIT (QDC) to CT conversion evaluation with about 20,000 QDC records, CT shows a 99.99537% transfer rate, a 98.7% lexical match rate, and a 100% semantic match rate. The non-transfer rate of 0.00463% means that the loss of information is extremely low and that nearly all information is preserved.
In the UIUC (MARCXML) to CT conversion evaluation with 400,000 MARCXML records, CT shows a 95.2709% transfer rate and a 100% semantic match rate by SKOS concept (exactMatch: 55.347%; broadMatch: 44.6527%). Tag match rates are measured out of the 95.2709% total transfer rate. The non-transfer rate, that is, the loss of information rate from MARC records to CT, is 4.729%. This is a very low rate, considering that CT has only 12 common terms (fewer than Dublin Core) and 58 qualifiers (far fewer than MARC tags).
The results show that CT considerably minimizes loss of information, reducing the gaps between MARC, QDC, and CT. CT significantly increases mapping accuracy, as shown by the high lexical and semantic match rates, and it significantly reduces the gap between different degrees of generality and specificity.
Provided Metadata Records
- Harvard University: David Weinberger, Co-Director of the Harvard Law Library Innovation Lab, provided the link to the Harvard MARC records, http://openmetadata.lib.harvard.edu/bibdata, in May 2013. According to him, they are “the complete MARC records for over 99% of all the works in Harvard Library.” A total of 12 million MARC records were provided in 14 files.
- University of Illinois at Urbana-Champaign (UIUC): 10 million MARCXML records were provided by Myung-Ja Han, Metadata Librarian and Associate Professor, through the link https://uofi.box.com/s/77xpmaavo16xopqswqvj. The data set was generated in 2010 and consists of 89 XML files.
- Massachusetts Institute of Technology (MIT): Carl Jones, MIT digital library systems manager, provided the Qualified Dublin Core (QDC) records. According to him, OAI harvesting of DSpace metadata gives only simple, unqualified Dublin Core; to get Qualified Dublin Core via OAI, a DSpace configuration setting has to be changed. After changing it, he provided the URL for harvesting QDC from dome-dev.mit.edu, http://dome-dev.mit.edu/oai/request?verb=ListRecords&metadataPrefix=qdc&set=hdl_1721.3_82443, which is one of their production repositories, dome.mit.edu. A total of about 20 thousand QDC records were harvested via OAI as 28 XML files and 3 CSV files.
- Harvard and UIUC records were used to investigate MARC tag usage and to build the MARC to CT crosswalk. They were also used to conduct the mapping experiment for the UIUC (MARCXML) to CT conversion. MIT (QDC) records were used to investigate the usage of QDC element names and to build the (Q)DC to CT crosswalk.
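The OAI harvesting step mentioned for the MIT records can be sketched as follows. This is a minimal illustration of the OAI-PMH ListRecords request pattern, not the harvester that was actually used: the function names are hypothetical, and the sketch only builds request URLs and reads the resumption token from a response, leaving the HTTP calls and file writing out.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix=None, set_spec=None,
                     resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it is sent together with the verb and nothing else.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def next_token(response_xml):
    """Return the resumptionToken of a response, or None on the last page."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


# Illustrative response fragment, not real repository output.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url("http://dome-dev.mit.edu/oai/request",
                             metadata_prefix="qdc",
                             set_spec="hdl_1721.3_82443")
follow_up_url = list_records_url("http://dome-dev.mit.edu/oai/request",
                                 resumption_token=next_token(SAMPLE_RESPONSE))
```

A harvester would repeat the follow-up request until `next_token` returns None, saving each response page as one XML file.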
Designed Crosswalks
Designed Conversions
The conversion program, written in Python, not only converts MIT QDC records or UIUC MARCXML records into CT but also measures the transfer rate, lexical match rate, and semantic match rate. The reported rates are percentages of elements, counted over every metadata statement in the input records, that are mapped or not mapped to CT.
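How such rates can be counted is sketched below. This is a minimal illustration, not the actual conversion program: the crosswalk entries and the function name are hypothetical, and it assumes that the lexical match rate is taken over transferred statements only, which the text above does not specify.

```python
from collections import Counter

# Hypothetical crosswalk fragment: source element name -> CT term.
# The real (Q)DC to CT crosswalk is not reproduced here.
CROSSWALK = {
    "title": "title",
    "creator": "creator",
    "description.abstract": "description",
}


def conversion_rates(statements, crosswalk):
    """Count mapped and unmapped statements and derive percentage rates.

    `statements` is a flat list of source element names, one entry per
    metadata statement across all input records.
    """
    counts = Counter()
    for name in statements:
        target = crosswalk.get(name)
        if target is None:
            counts["not_transferred"] += 1
            continue
        counts["transferred"] += 1
        if target == name:
            # The source element name and the CT term are spelled the same.
            counts["lexical_match"] += 1
    total = len(statements)
    transferred = counts["transferred"]
    return {
        "transfer_rate": 100.0 * transferred / total if total else 0.0,
        "non_transfer_rate":
            100.0 * counts["not_transferred"] / total if total else 0.0,
        # Assumption: lexical matches are a share of transferred statements.
        "lexical_match_rate":
            100.0 * counts["lexical_match"] / transferred if transferred else 0.0,
    }


rates = conversion_rates(
    ["title", "creator", "description.abstract", "unknown.element"],
    CROSSWALK,
)
```

With these four sample statements, three are transferred and two of those keep the same name, so the transfer rate is 75% and the non-transfer rate is 25%.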
Methodology
First, the CT namespaces are defined as follows, so that CT records can be validated as XML:
<CT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ct.iopdl.org/1.1/ http://www.ct.iopdl.org/1.1/ct1-1.xsd" xmlns="http://www.ct.iopdl.org/1.1/">
MIT QDC to CT Mapping Experiment
With a total of 20,278 records in 31 files (28 XML files and 3 CSV files), the MIT QDC to CT mapping experiment was conducted in March and June 2014. It measures the lexical and semantic match rates, that is, the degree to which an element name is matched exactly, both lexically and semantically. More detail about the methodology and results is given in MIT QDC to CT Mapping Experiment.
UIUC MARCXML to CT Mapping Experiment
Several methods are used to measure the transfer and non-transfer rates, the tag match rates, and the semantic match rate. To measure the transfer and non-transfer rates, all transferred tags, with their different subcodes and indexes, are counted. Because a match on all of the MARC tag, indexes, and subcode happens very rarely, three methods are used to measure the tag match rate:
- Perfect match rate: matched by all of MARC tag, indexes and subcode
- Subcode match rate: matched by tag and any subcodes, or subcode ‘a’ as a default
- General match rate: matched by MARC tag number only
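The three match levels can be sketched as a lookup that tries the most specific key first. This is an illustrative sketch only: the three crosswalk tables, their entries, and the function names are hypothetical, and the handling of "subcode ‘a’ as a default" (falling back to the tag's ‘a’ entry) is one possible reading of the rule above.

```python
from collections import Counter

# Hypothetical crosswalk tables, from most to least specific key.
PERFECT = {("245", "1", "0", "a"): "title"}   # tag, indexes, subcode
BY_SUBCODE = {("100", "a"): "creator"}        # tag and subcode
BY_TAG = {"650": "subject"}                   # tag number only


def match_level(field):
    """Classify one (tag, ind1, ind2, subcode) tuple by match level."""
    tag, ind1, ind2, code = field
    if (tag, ind1, ind2, code) in PERFECT:
        return "perfect"
    # Assumed reading of the default rule: use the tag's 'a' entry
    # when the field's own subcode has no entry.
    if (tag, code) in BY_SUBCODE or (tag, "a") in BY_SUBCODE:
        return "subcode"
    if tag in BY_TAG:
        return "general"
    return "not_transferred"


def tag_match_rates(fields):
    """Return each match level as a percentage of the transferred fields."""
    levels = Counter(match_level(f) for f in fields)
    transferred = sum(n for level, n in levels.items()
                      if level != "not_transferred")
    return {
        level: 100.0 * levels[level] / transferred if transferred else 0.0
        for level in ("perfect", "subcode", "general")
    }


sample_fields = [
    ("245", "1", "0", "a"),  # matches on tag, indexes and subcode
    ("100", "1", " ", "d"),  # matches via the tag's default 'a' entry
    ("650", " ", "0", "x"),  # matches on tag number only
    ("999", " ", " ", "a"),  # not in any table
]
tag_rates = tag_match_rates(sample_fields)
```

Because the rates are computed over transferred fields only, the three levels together account for every transferred field.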
To measure the semantic match rate, the semantic relation of each MARCXML to CT mapping is investigated based on the SKOS mapping properties (skos:exactMatch, skos:closeMatch, skos:broadMatch, skos:narrowMatch, and skos:relatedMatch). More detail about the methodology and results is given in UIUC MARCXML to CT Mapping Experiment, at http://www.ct.iopdl.org/1.1/ReportMARCXMLtoCTconversionexperiment.pdf.
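A breakdown by SKOS mapping property can be tallied as in the sketch below. The property names follow the SKOS vocabulary, in which the associative mapping property is named relatedMatch. The function name and the sample data are hypothetical, and the sketch assumes that the overall semantic match rate is the share of transferred mappings that carry any SKOS mapping property.

```python
from collections import Counter

# The five SKOS mapping properties (local names, without the skos: prefix).
SKOS_MAPPING_PROPERTIES = (
    "exactMatch", "closeMatch", "broadMatch", "narrowMatch", "relatedMatch",
)


def semantic_match_rates(relations):
    """Return the percentage share of each SKOS mapping property.

    `relations` holds one entry per transferred mapping: a SKOS mapping
    property name, or None when no semantic relation was assigned.
    """
    counts = Counter(relations)
    total = len(relations)
    rates = {
        prop: 100.0 * counts[prop] / total if total else 0.0
        for prop in SKOS_MAPPING_PROPERTIES
    }
    # Assumption: a mapping counts as a semantic match if it carries
    # any of the SKOS mapping properties.
    rates["semantic_match_rate"] = sum(
        rates[prop] for prop in SKOS_MAPPING_PROPERTIES
    )
    return rates


semantic_rates = semantic_match_rates(
    ["exactMatch", "exactMatch", "broadMatch", None]
)
```

In this sample, two of four mappings are exact matches, one is a broader match, and one has no relation, so the overall rate is 75%.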
Conversion Programs
- Conversion of MIT (QDC) records into Common Terminology 1.1
- Conversion of 400,000 UIUC (MARCXML) records into Common Terminology 1.1
Results
MIT QDC to CT Mapping Experiment Results
In the mapping experiment, the total transfer rate from MIT (Q)DC to CT is 99.99537%. The lexical and semantic match rates are 98.7% and 100%, respectively. The loss of information rate is extremely low, at 0.00463%.
UIUC MARCXML to CT Mapping Experiment Results
Based on the CT records converted from the UIUC MARCXML records, the performance of CT is as follows:
- The total transfer rate from UIUC MARCXML to CT is 95.2709%.
- The non-transfer rate, that is, the loss of information rate from MARC records to CT, is 4.729%. This is a very low rate, considering that CT has only 12 common terms (fewer than Dublin Core) and 58 qualifiers (far fewer than MARC tags).
- The semantic match rate by SKOS concept is 100%:
  - exactMatch: 55.347%
  - broadMatch: 44.6527%
- Tag match rates:
  - The perfect match rate (matched by all of MARC tag, indexes and subcode) is 16.1834%.
  - The subcode match rate (matched by tag and subcodes) is 43.068%.
  - The general match rate (matched by MARC tag number only) is 40.748%.
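As a quick consistency check on the reported UIUC figures, each group of rates should add up to about 100%. The short sketch below sums the values exactly as reported above; the grouping labels are ours. Each sum comes within 0.001 percentage points of 100, and the small remainders are consistent with rounding of the individual figures.

```python
# Reported UIUC MARCXML to CT rates, grouped by what they partition.
REPORTED = {
    "transfer + non-transfer": (95.2709, 4.729),
    "exactMatch + broadMatch": (55.347, 44.6527),
    "perfect + subcode + general": (16.1834, 43.068, 40.748),
}

# Sum each group of percentages.
totals = {label: sum(parts) for label, parts in REPORTED.items()}

for label, total in totals.items():
    print(f"{label}: {total:.4f}%")
```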
Conclusion
CT presents the information in MIT (QDC) and UIUC MARCXML records very clearly, so that it can be easily understood and described by anyone. This is the strongest point of the developed CT. It also shows clearly where a resource comes from, with source="MIT" or source="UIUC" in CT:identifier. With authorities and qualifiers (sub-properties), such as authority="DCMItype", the value of a CT term can be defined by authorities and qualifiers. The qualifiers allow description at a level of detail comparable to MARC and MODS, and so they play a clear bridging role between MARC (detailed) and (Q)DC.
CT maximizes lexical and semantic interoperability, significantly reducing the gaps between different degrees of generality and specificity, and it considerably minimizes loss of information at multiple levels. Remarkably, CT also shows very high performance in the MARCXML to CT mapping experiment, although CT has only 12 common terms (fewer than Dublin Core) and 58 qualifiers (far fewer than MARC tags). From the results of the experiments, we conclude that CT shows high performance in achieving and improving metadata interoperability, minimizing loss of information and preserving the specificity and precision of the source metadata records.
This very high performance and very low loss of information rate rest on the fact that CT was developed from MARC tag usage in Harvard, UIUC, and WorldCat records and in search interfaces. Tags with usage over 50% were considered as common terms or qualifiers. Also, the Common Terminology concept, a set of common terms drawn from commonly used standards and serving as a bridge terminology, is very effective for achieving interoperability among different standards, even ones with very different degrees of specificity and generality such as MARC, MODS, DC and QDC. The experiments indicate that finding and reusing commonly used terms among existing standards, instead of creating a new schema, is a crucial way to build interoperability.
Papers
- MIT QDC to CT Mapping Experiment is in Chapter 5 of the paper ‘A Model and Roles of a Common Terminology to Improve Metadata Interoperability’, available at http://hdl.handle.net/2142/50100.
- UIUC MARCXML to CT Mapping Experiment is available at http://www.ct.iopdl.org/1.1/ReportMARCXMLtoCTconversionexperiment.pdf.