MIT QDC to CT Mapping Experiment

The MIT QDC to CT mapping experiment was conducted to measure how much the Common Terminology (CT) improves metadata interoperability at the record level. For the experiment, a conversion program was written in Python to convert MIT Qualified Dublin Core (QDC) records into Common Terminology 1.1. It is one of a set of mapping experiments with conversions involving Harvard (MARC), MIT (QDC), and UIUC (MARCXML) metadata records; the conversions from Harvard (MARC) and UIUC (MARCXML) to CT 1.1 are under development. The aim is to achieve and improve record-level metadata interoperability among the three universities' libraries and among MARC, QDC, and CT.

In an evaluation of the MIT (QDC) to CT conversion with about 20,000 QDC records, CT shows a 99.99537% transfer rate, a 98.7% total lexical match rate, and a 100% total semantic match rate. The no-transfer rate of 0.00463% means that the loss of information is extremely low and that nearly all of the source information is preserved.

Provided Metadata Records

About 20,000 QDC records were harvested via OAI as 28 XML files and 3 CSV files. They were used to investigate how QDC element names are used and to build the (Q)DC to CT crosswalk. They are the foundation for the mapping experiment, in which the MIT (QDC) to CT conversion measures transfer, lexical match, and semantic match rates.

Designed Crosswalks

  • MIT (QDC) to CT Crosswalk – July 28, 2014

Download (MITQDCtoCTcrosswalkVweb.pdf, PDF, 29KB)

Methodology

The Python conversion program not only converts MIT QDC records into CT but also measures the transfer rate, lexical match rate, and semantic match rate at the same time. The reported rates are percentages over every metadata statement in the input records, counting the elements that are mapped and not mapped to CT. First, the CT namespaces are defined as follows so that the output can be validated as XML:

<CT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ct.iopdl.org/1.1/ http://www.ct.iopdl.org/1.1/ct1-1.xsd" xmlns="http://www.ct.iopdl.org/1.1/">

The MIT QDC records in an XML file are split on '<record>' to count how many records the file contains. There are 20,278 records in total in 31 files: 28 XML files and 3 CSV files. Because there are two file formats, two functions were designed: MITQDCtoCTconversion and MITQDCcsvtoCTconversion. The values transferred from MIT QDC also include special characters that would make the XML invalid, so these characters are replaced before transfer: '&' with '&amp;', '<' with '&lt;', and '>' with '&gt;'. The CT records converted from MIT QDC are saved into new XML files.
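The counting and escaping steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the helper names `count_records` and `escape_value` and the sample string are hypothetical, and the standard-library `xml.sax.saxutils.escape` is used here as one convenient way to perform the three replacements.

```python
from xml.sax.saxutils import escape


def count_records(xml_text: str) -> int:
    """Count records by splitting the file text on the opening <record> tag."""
    # Splitting yields one leading chunk plus one chunk per record.
    # Assumption: the tag appears exactly as '<record>' with no attributes.
    return len(xml_text.split("<record>")) - 1


def escape_value(value: str) -> str:
    """Replace '&', '<', and '>' with XML entities before writing the value."""
    # escape() replaces '&' first, so the entities it emits are not re-escaped.
    return escape(value)


# Hypothetical sample input, for illustration only.
sample = (
    "<records>"
    "<record><title>A</title></record>"
    "<record><title>B</title></record>"
    "</records>"
)

print(count_records(sample))        # number of <record> chunks
print(escape_value("R&D <draft>"))  # value made safe for XML output
```

Escaping '&' before '<' and '>' matters: doing it last would turn an already produced '&lt;' into '&amp;lt;'.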

To measure the lexical and semantic match rates, each element name is assessed for how closely it matches its CT counterpart lexically and semantically.

  • Perfect lexical and semantic match example:
  • dc.contributor.author (MIT) is mapped to CT:contributor role="author" authority="LCMARCrelators". The author role is defined in the LCMARCrelators authority. In this case, both the contributor and author terms of QDC match CT exactly, lexically and semantically, so it is counted as a perfect lexical and semantic match.
  • Partial lexical but perfect semantic match example:
  • dc.date.created is mapped to CT:date type="issued". CT:date type="issued" is defined as 'the date that the resource was published, released, or issued' (MODS), which includes the created date, 'the date of creation of the resource' (MODS). Thus, it is counted as a partial lexical match (date is the same, but issued differs from created) but a perfect semantic match.
  • No lexical match but perfect semantic match example:
  • dc.source is mapped to CT:relation type="original". CT:relation type="original" is defined as 'Information concerning an original form of the resource' (MODS), which includes the Source concept, 'a related resource from which the described resource is derived' (DCMI). Thus, although the terms do not match lexically, they match perfectly semantically.
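One way to picture how such judgments feed the match rates is a crosswalk table that stores the lexical and semantic match level alongside each mapping. The sketch below is an assumption about layout, not the published program: the `CROSSWALK` dictionary, the `match_rates` function, and the choice of mapped statements as the denominator are all hypothetical, and only the three mappings from the examples above are included.

```python
from collections import Counter

# Hypothetical crosswalk fragment:
# QDC element -> (CT target, lexical match level, semantic match level).
CROSSWALK = {
    "dc.contributor.author": (
        'contributor role="author" authority="LCMARCrelators"', "perfect", "perfect"),
    "dc.date.created": ('date type="issued"', "partial", "perfect"),
    "dc.source": ('relation type="original"', "none", "perfect"),
}


def to_percent(counts: Counter, total: int) -> dict:
    """Convert raw counts into percentages of `total`."""
    return {level: 100.0 * n / total for level, n in counts.items()} if total else {}


def match_rates(elements):
    """Return (lexical, semantic) percentage dicts over the mapped statements."""
    lexical, semantic = Counter(), Counter()
    mapped = [name for name in elements if name in CROSSWALK]
    for name in mapped:
        _, lex_level, sem_level = CROSSWALK[name]
        lexical[lex_level] += 1
        semantic[sem_level] += 1
    # Assumption: rates are taken over mapped statements only.
    return to_percent(lexical, len(mapped)), to_percent(semantic, len(mapped))


# Illustrative input: four metadata statements, not data from the experiment.
statements = ["dc.contributor.author", "dc.contributor.author",
              "dc.date.created", "dc.source"]
lexical_rates, semantic_rates = match_rates(statements)
print(lexical_rates)
print(semantic_rates)
```

Keeping the match levels in the crosswalk itself means the rates follow directly from counting statements, so a change to one mapping updates both the conversion and the statistics.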

More detailed explanations are in Chapter 5 of the paper 'A Model and Roles of a Common Terminology to Improve Metadata Interoperability', available at http://hdl.handle.net/2142/50100.

Conversion Program

Conversion of MIT (QDC) records into Common Terminology 1.1

Results

A Converted Record from Original MIT QDC record

The conversion program, which is based on the MIT (QDC) to CT crosswalk, converts the sample MIT (QDC) record into the following CT record.

Download (CTMIT07ex.pdf, PDF, 46KB)

Original MIT QDC record example, before conversion into CT

 

MIT QDC records to CT (program result) Transfer and Match Rates Result

The program reports statistics for the transfer rate and the lexical and semantic match rates. Because the qualifiers of CT 1.1, like those of QDC, were developed to meet communities' needs, the transfer rate of the MIT (QDC) to CT 1.1 mapping is very high, at 99.99537%. The no-transfer rate of 0.00463% is the rate of information loss in mapping MIT (QDC) to CT. Only the eprint.grantNumber elements in the MIT QDC records were not transferred, because they have no mapping in CT. It is a significantly low rate of information loss, one not previously seen in the metadata field.
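The transfer and no-transfer rates are complementary percentages over all metadata statements. A minimal sketch of that calculation follows; the function `transfer_rates`, the set `MAPPED`, and the four-statement input are hypothetical and chosen only to show the arithmetic, not to reproduce the experiment's counts.

```python
def transfer_rates(statements, mapped_names):
    """Return (transfer %, no-transfer %) over all metadata statements."""
    total = len(statements)
    if total == 0:
        return 0.0, 0.0
    transferred = sum(1 for name in statements if name in mapped_names)
    return 100.0 * transferred / total, 100.0 * (total - transferred) / total


# Hypothetical set of element names that have a mapping in the crosswalk.
MAPPED = {"dc.contributor.author", "dc.date.created", "dc.source"}

# Illustrative statements; eprint.grantNumber has no mapping in CT.
statements = ["dc.contributor.author", "dc.date.created",
              "dc.source", "eprint.grantNumber"]

rate, loss = transfer_rates(statements, MAPPED)
print(rate, loss)
```

In the reported results the same relationship holds: 99.99537% transferred plus 0.00463% not transferred accounts for every statement.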

The total lexical match rate, including perfect and partial matches, is very high at 98.7% in the MIT (QDC) to CT 1.1 mapping experiment, which significantly improves lexical interoperability.

  • In detail, the perfect lexical match rate is 54.02%, where all terms of CT match both the QDC elements and qualifiers (e.g., 'description.abstract': 'description type="abstract"', 'contributor.author': 'contributor role="author" authority="LCMARCRelators"', etc.).
  • The partial lexical match rate is 44.707%, where either the elements or the qualifiers match (e.g., 'contributor.department': 'contributor name="corporate"', 'coverage.spatial': 'subject type="spatial"', etc.).
  • The no-lexical-match rate is very low, 1.265%, for mappings such as 'dc.source': 'relation type="original"' that do not match lexically at all. However, the QDC terms with partial or no lexical match were re-examined to determine whether they are semantically related.

The total semantic match rate is 100%, including perfect and partial semantic matches.

  • In detail, the perfect semantic match rate is high, 85.836%, including mappings such as 'dc.date.created': 'date type="issued"' and 'dc.description.statementofresponsibility': 'rights type="holder"'. Although these terms are very different lexically, they match semantically.
  • The partial semantic match rate is low, 14.164%, including 'dc.contributor.advisor': 'contributor role="other" authority="LCMARCRelators"'.
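The reported component rates can be cross-checked against the totals with simple arithmetic. The sketch below uses only the percentages stated in this section; `Decimal` is used so that the sums are exact rather than subject to floating-point rounding, and the variable names are illustrative.

```python
from decimal import Decimal

# Percentages as reported in this section.
perfect_lexical = Decimal("54.02")
partial_lexical = Decimal("44.707")
perfect_semantic = Decimal("85.836")
partial_semantic = Decimal("14.164")

# Total lexical match = perfect + partial, rounded to one decimal place.
total_lexical = perfect_lexical + partial_lexical
total_lexical_rounded = total_lexical.quantize(Decimal("0.1"))

# Total semantic match = perfect + partial.
total_semantic = perfect_semantic + partial_semantic

print(total_lexical, total_lexical_rounded, total_semantic)
```

The perfect and partial lexical rates sum to 98.727%, which rounds to the reported 98.7%, and the two semantic rates sum to exactly 100%.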

The improved lexical and semantic interoperability means that CT 1.1 minimizes loss of information at the schema and record levels. It also means that CT 1.1 significantly reduces the gap in degrees of generality and specificity among the four selected standards (MARC, MODS, DC, and QDC).

Conclusion

CT presents the information in MIT (QDC) and UIUC MARCXML records clearly, so that it can be easily understood and described by anyone. This is the strongest point of the developed CT. It also shows clearly where a resource comes from, with source="MIT" or source="UIUC" in CT:identifier. The values of CT can be defined using authorities and qualifiers (sub-properties), such as authority="DCMItype". The qualifiers allow description at a level of detail similar to MARC and MODS, and they act as a bridge between MARC (detailed) and (Q)DC.

In the mapping experiment with the conversion, CT shows a 99.99537% transfer rate, a 98.7% lexical match rate, and a 100% semantic match rate. As a result, CT keeps loss of information very low (0.00463%) over every metadata statement in the input records. CT significantly increases accuracy in mappings, as shown by the high lexical and semantic match rates, and significantly reduces the gap between different degrees of generality and specificity.

From the results of the experiments, we conclude that CT shows high performance in achieving and improving metadata interoperability, minimizing loss of information while preserving the specificity and precision of the source metadata records.
