UIUC MARCXML to CT Mapping Experiment
The UIUC MARCXML to CT mapping experiment examines how well the developed Common Terminology (CT) achieves and improves metadata interoperability. Empirical evaluations are planned and in progress with Harvard (MARC), MIT (QDC), and UIUC (MARCXML) records through the cooperation of the three universities. As a prototype and a case study, a Common Terminology of MARC, MODS, DC, and QDC was developed to achieve and improve metadata interoperability among standards with very different degrees of specificity and generality. As a bridge terminology, CT aims to embrace the many standards that communities use according to their needs while providing uniformity for search.
To convert the (Q)DC records of MIT to CT, a conversion program was first designed and developed in Python. It measures transfer and non-transfer rates, and lexical and semantic match rates. In that mapping experiment, the total transfer rate from (Q)DC of MIT to CT is 99.9%. Lexical and semantic match rates are 98.7% and 100%. The loss of information rate is extremely low, at 0.00463%. CT maximizes lexical and semantic interoperability, significantly reducing the gaps between different degrees of generality or specificity. CT also considerably minimizes loss of information at multiple levels.
Following that successful result, another mapping experiment was conducted with 400,000 MARCXML records from the UIUC library. The MARCXML to CT conversion program was developed in Python during September 2014. For the conversion, a MARC to CT crosswalk in CSV form was developed based on the MARC to CT 1-1 crosswalk.
A special data structure for MARC tag, indexes, and subfield was designed using encoding and decoding methods. Several methods are suggested to measure the transfer rate, non-transfer rate, degrees of tag match, and semantic match rate. Remarkably, CT shows very high performance in the MARCXML to CT mapping experiment, although CT has only 12 common terms (fewer than Dublin Core) and 58 qualifiers (many fewer than MARC tags). From the result of the experiment, we conclude that CT shows high performance in achieving and improving metadata interoperability, minimizing loss of information and preserving the specificity and precision of the source metadata records.
In the evaluation of the UIUC (MARCXML) to CT conversion with 400,000 MARCXML records, CT shows a 95.2709% transfer rate and a 100% semantic match rate by SKOS concept (exactMatch: 55.347%; broadMatch: 44.6527%). Tag match rates are computed out of the total transfer rate of 95.2709%. The non-transfer rate, that is, the loss of information rate from MARC records to CT, is 4.729%. This is a very low rate, considering that CT has only 12 common terms (fewer than Dublin Core) and 58 qualifiers (many fewer than MARC tags).
Provided Metadata Records
- University of Illinois at Urbana-Champaign (UIUC): 10 million MARCXML records were provided by Myung-Ja Han, Metadata Librarian and Associate Professor, through the link https://uofi.box.com/s/77xpmaavo16xopqswqvj. The set was generated in 2010 as 89 XML files.
- The UIUC records were used to investigate MARC tag usage and to build the MARC to CT crosswalk. They were also used to conduct the UIUC (MARCXML) to CT mapping experiment with the conversion program.
Designed Crosswalks
The MARC to CT crosswalk is fundamental to the MARC to CT mapping experiment. It is based on the MARC, MODS, DC and QDC to CT 1.1 crosswalk. That crosswalk does not include detailed mappings at the level of indexes and subcodes. For actual mapping with the conversion program, however, the detailed mappings with indexes and subcodes must be specified clearly. The MARC to CT crosswalk in CSV form specifies each mapping by MARC tag, indexes, and subcodes.
- UIUC (MARCXML) to CT Crosswalk
Download (MARCtoCTcrosswalk.pdf, PDF, 39KB)
Methodology
The conversion program, written in Python, not only converts UIUC MARCXML to CT but also measures the transfer rate, lexical match rate, and semantic match rate. The reported rates give the percentage of fields, counted over every metadata statement in the input records, that are mapped and not mapped to CT.
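The counting behind the transfer and non-transfer rates can be sketched as follows. This is a minimal illustration, not the project's conversion program: the sample record, the `MAPPED_TAGS` set, and the function names are hypothetical, and the sketch assumes that the leader, each control field, and each subfield of a data field count as one metadata statement.

```python
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"

# Hypothetical sample record, not taken from the UIUC data.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>01026cam a2200301Ka 4500</leader>
  <controlfield tag="001">12345</controlfield>
  <datafield tag="035" ind1=" " ind2=" ">
    <subfield code="a">(OCoLC)1</subfield>
  </datafield>
  <datafield tag="049" ind1=" " ind2=" ">
    <subfield code="a">UIUU</subfield>
  </datafield>
</record>"""

# Hypothetical stand-in for the crosswalk: tags that have some CT mapping.
MAPPED_TAGS = {"leader", "001", "035"}


def statements(record):
    """Yield one tag label per metadata statement in a MARCXML record."""
    for child in record:
        name = child.tag.replace(MARC_NS, "")
        if name == "leader":
            yield "leader"
        elif name == "controlfield":
            yield child.get("tag")
        elif name == "datafield":
            # Assumption: each subfield is one statement of its parent tag.
            for _ in child.findall(MARC_NS + "subfield"):
                yield child.get("tag")


def transfer_rates(record, mapped_tags):
    """Return (transfer rate, non-transfer rate) in percent for one record."""
    tags = list(statements(record))
    transferred = sum(1 for tag in tags if tag in mapped_tags)
    total = len(tags)
    return 100.0 * transferred / total, 100.0 * (total - transferred) / total


transfer, non_transfer = transfer_rates(ET.fromstring(SAMPLE), MAPPED_TAGS)
print(f"transfer {transfer:.1f}%, non-transfer {non_transfer:.1f}%")
```

In this toy record, three of the four statements have a mapping, so the two rates sum to 100% in the same way as the reported 95.2709% and 4.729%.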
First, the CT namespaces are defined as below so that the output can be validated as XML:
<CT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ct.iopdl.org/1.1/ http://www.ct.iopdl.org/1.1/ct1-1.xsd" xmlns="http://www.ct.iopdl.org/1.1/">
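One way to produce a root element with these namespace declarations from Python is sketched below using the standard library's `xml.etree.ElementTree`. Whether the original program builds its output this way is not stated in the source; the child `title` element and its text are placeholders for illustration.

```python
import xml.etree.ElementTree as ET

CT_NS = "http://www.ct.iopdl.org/1.1/"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Register prefixes so the serializer writes xmlns="..." and xmlns:xsi="...".
ET.register_namespace("", CT_NS)
ET.register_namespace("xsi", XSI_NS)

root = ET.Element(f"{{{CT_NS}}}CT")
root.set(f"{{{XSI_NS}}}schemaLocation", f"{CT_NS} {CT_NS}ct1-1.xsd")

# Placeholder child element, only to show that children share the namespace.
title = ET.SubElement(root, f"{{{CT_NS}}}title")
title.text = "Example title"

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```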
Since settling on a data structure that describes all MARC tags, including their various indexes and subcodes, was not simple, a special data structure for MARC tag, indexes, and subfield was designed using encoding and decoding methods.
A MARC tag is encoded by appending to the MARC field number ‘I’ for index 1, ‘J’ for index 2, and ‘S’ for the subcode, each followed by its value. For example, 264I1J0Sc means that the MARC tag number is 264, index 1 is 1, index 2 is 0, and the subcode is c. To decode it, a queue data structure is used. The Python class that defines a MARC tag with indexes and subcode is given below.
Download (marcTag.pdf, PDF, 8KB)
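The encoding scheme described above can be sketched as follows. This is not the class from the linked PDF; `MarcTag`, `encode`, and `decode` are hypothetical names, and the sketch assumes a three-character tag number and single-character index and subcode values. The queue is a `collections.deque` consumed from the left.

```python
from collections import deque
from typing import NamedTuple, Optional


class MarcTag(NamedTuple):
    tag: str
    ind1: Optional[str] = None
    ind2: Optional[str] = None
    subcode: Optional[str] = None


def encode(t: MarcTag) -> str:
    """Build a string such as 264I1J0Sc from the parts that are present."""
    out = t.tag
    if t.ind1 is not None:
        out += "I" + t.ind1
    if t.ind2 is not None:
        out += "J" + t.ind2
    if t.subcode is not None:
        out += "S" + t.subcode
    return out


def decode(encoded: str) -> MarcTag:
    """Read an encoded tag character by character from a queue."""
    if len(encoded) < 3:
        raise ValueError(f"too short for a MARC tag: {encoded!r}")
    queue = deque(encoded)
    # The first three characters are the tag number.
    tag = "".join(queue.popleft() for _ in range(3))
    parts = {"I": None, "J": None, "S": None}
    while queue:
        marker = queue.popleft()
        if marker not in parts or not queue:
            raise ValueError(f"malformed encoded tag: {encoded!r}")
        # The character after a marker is that marker's value.
        parts[marker] = queue.popleft()
    return MarcTag(tag, parts["I"], parts["J"], parts["S"])


print(decode("264I1J0Sc"))
print(encode(MarcTag("043", subcode="a")))
```

Decoding and re-encoding round-trip for strings of this shape, which makes the encoded string usable as a lookup key for the crosswalk.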
The conversion program has three main parts for converting MARCXML to CT, because UIUC records consist of these three parts: the leader; control fields (e.g., 001, 005, 008); and data fields for all other tags. The leader and control fields are converted into CT, and some of their information is also interpreted into several CT elements to preserve as much information as possible. For example, the content of the leader in the UIUC record below is ‘01026cam a2200301Ka 4500’. The leader is transferred into description type="recordinfo" of CT. Its content at the 6th and 7th columns is interpreted as typeGenre with the authority for Library of Congress MARC type, and as description type="issuance".
<description type="recordinfo">01026cam a2200301Ka 4500</description>
<typeGenre authority="LCMARCtype">text</typeGenre>
<typeGenre authority="LCMARCtype">Monograph/Item</typeGenre>
<description type="issuance">multipart monograph</description>
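The leader step can be sketched as below. The lookup tables are hypothetical and contain only the two codes that occur in the example leader (‘a’ at position 6 and ‘m’ at position 7, counting from zero); the rule that yields the issuance description is not spelled out in the source, so the sketch leaves it out.

```python
import xml.etree.ElementTree as ET

# Hypothetical partial tables: only the codes seen in the example leader.
TYPE_OF_RECORD = {"a": "text"}                  # leader position 6
BIBLIOGRAPHIC_LEVEL = {"m": "Monograph/Item"}   # leader position 7


def leader_to_ct(leader):
    """Return CT elements for a MARC leader string."""
    elements = []
    # The whole leader is kept verbatim as record information.
    info = ET.Element("description", {"type": "recordinfo"})
    info.text = leader
    elements.append(info)
    # Positions 6 and 7 are additionally interpreted as typeGenre values.
    for code, table in ((leader[6], TYPE_OF_RECORD),
                        (leader[7], BIBLIOGRAPHIC_LEVEL)):
        label = table.get(code)
        if label is not None:
            genre = ET.Element("typeGenre", {"authority": "LCMARCtype"})
            genre.text = label
            elements.append(genre)
    return elements


lines = [ET.tostring(el, encoding="unicode")
         for el in leader_to_ct("01026cam a2200301Ka 4500")]
print("\n".join(lines))
```

Keeping the raw leader alongside the interpreted values means no leader information is dropped even when a code is missing from the tables.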
To measure the transfer rate, non-transfer rate, degrees of tag match, and semantic match rate, several methods are used. For the transfer and non-transfer rates, all transferred tags with their different subcodes and indexes are counted. For example, in the MARCXML record below, the leader, the 001, 005, and 008 tags, 035 subcode a, and 035 subcode 9 are each checked as to whether they are transferred into CT (counted as transferred) or not (counted as non-transferred). Since a match on all of the MARC tag, indexes, and subcode happens only rarely, three methods are used to measure the tag match rate:
- Perfect match rate matched by all of MARC tag, indexes and subcode:
- A perfect match means that the encoded tag with all of its information, that is, MARC tag, indexes, and subcode (e.g., 086I0Sa), is mapped into CT according to the MARCtoCTcrosswalk. For example, the 043 field, geographic area code, with subcode ‘a’ (043Sa) is perfectly matched with ct:subject type="spatial" based on the MARC to QDC crosswalk. The encoded tags 260Sa, 260Sb, 260Sc, 533Sb, 533Sd, and 533Se show perfect matches.
- Subcode match rate matched by tag, and any subcodes or subcode ‘a’ as a default:
- If the encoded tag with all of its information is not matched in CT, only the MARC tag and subcode are retrieved from the encoded tag, dropping the indexes, and a subcode match is tried. There are two kinds of subcode match, both ignoring indexes: a match on MARC tag and subcode, and a match on MARC tag with subcode ‘a’ as a default. As an example of the first kind, the 245 field with subcode ‘b’ (245Sb) is matched with ct:title type="subtitle". 245Sb, 245Sc, and 650Sz are tags of the first kind. Examples of the second kind, with the default subcode ‘a’, are 020, 035, 040, 100, 245, 300, 440, 490, 504, 650, 830, 852, etc.
- General match rate matched by MARC tag number only:
- If the encoded tag has neither a perfect match nor a subcode match, a general match is considered. A general match means that only the MARC tag number is matched. For example, the 300 field is matched with ct:format type="extent" regardless of index and subcode information. 035, 040, 082, 084, 111, 300, 490, 700, 776, 8, 99430, 852, etc. are general matched tags.
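The three matching methods can be read as a lookup cascade that tries progressively less specific keys. The sketch below is one such reading, not the original program: the `CROSSWALK` excerpt and the function names are hypothetical, the "245Sa" row is an assumption added for illustration, and the encoded keys follow the I/J/S scheme described earlier with single-character index values.

```python
import re

# Hypothetical crosswalk excerpt. The 043Sa, 245Sb and 300 targets come from
# the examples in the text; the "245Sa" row is an assumption for illustration.
CROSSWALK = {
    "043Sa": 'subject type="spatial"',
    "245Sb": 'title type="subtitle"',
    "245Sa": "title",
    "300": 'format type="extent"',
}


def strip_indexes(encoded):
    """Drop the I<index 1> and J<index 2> parts of an encoded tag."""
    return re.sub(r"[IJ].", "", encoded)


def match(encoded, crosswalk=CROSSWALK):
    """Return (match kind, CT target), or (None, None) if nothing maps."""
    tag = encoded[:3]
    candidates = (
        ("perfect", encoded),                  # tag, indexes and subcode
        ("subcode", strip_indexes(encoded)),   # tag and subcode only
        ("subcode default a", tag + "Sa"),     # tag with default subcode a
        ("general", tag),                      # tag number only
    )
    for kind, key in candidates:
        if key in crosswalk:
            return kind, crosswalk[key]
    return None, None


for key in ("043Sa", "245I1J0Sb", "245I1J0Sh", "300Sc", "049Sa"):
    print(key, match(key))
```

A key that falls through every step, such as the 049 example here, would be counted as non-transferred.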
To measure the semantic match rate, the semantic relation of each MARCXML to CT mapping is investigated based on the SKOS mapping properties (e.g., skos:exactMatch, skos:closeMatch, skos:broadMatch, skos:narrowMatch, and skos:relatedMatch). The investigated semantic relation is added to the MARC to CT crosswalk.
- exactMatch example: the MARC 041 field, language code, is mapped into ct:language with different authorities according to subcodes (e.g., iso 639-2, rfc1766, rfc3066, rfc 4646, and iso 639-3). This mapping is a semantically exact match.
- Another exactMatch example: the MARC 044 field, country of publishing/producing entity code, is mapped into ct:publisher: 044 with subcode ‘a’ into publisher type="place", and 044 with subcode ‘c’ into publisher type="place" authority="iso3166". Since the mapping with a subcode corresponds exactly to the sub-property type="place", which specifies the place of publication, the mapping is considered an exactMatch.
- broadMatch example: on the other hand, 044 without a subcode is mapped generally into ct:publisher. Thus, this mapping is considered a broadMatch. The table below is a part of the MARC to CT crosswalk with SKOS semantic relations.
- More details on the methodology and results of the MARCXML to CT mapping experiment are available at http://www.ct.iopdl.org/1.1/ReportMARCXMLtoCTconversionexperiment.pdf.
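Once each crosswalk row carries a SKOS relation, the semantic match rates reduce to counting relations over the transferred statements. A minimal sketch, assuming a hypothetical `SEMANTIC_RELATION` table keyed by encoded tag (the relations for 044 follow the examples above; the "041Sa" key is an illustrative assumption):

```python
from collections import Counter

# Hypothetical excerpt: encoded crosswalk key -> SKOS mapping relation.
SEMANTIC_RELATION = {
    "041Sa": "exactMatch",
    "044Sa": "exactMatch",
    "044Sc": "exactMatch",
    "044": "broadMatch",
}


def semantic_rates(transferred_keys):
    """Return the percentage of transferred statements per SKOS relation."""
    counts = Counter(SEMANTIC_RELATION[key] for key in transferred_keys)
    total = sum(counts.values())
    return {relation: 100.0 * n / total for relation, n in counts.items()}


rates = semantic_rates(["041Sa", "044Sa", "044", "044Sc"])
print(rates)
```

Under this reading, a 100% semantic match rate means that every transferred statement has some SKOS relation recorded, with exactMatch and broadMatch splitting that total.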
Conversion Programs
Results
An example of the converted CT from MARCXML of UIUC records
Original UIUC MARCXML record example
UIUC MARCXML records to CT (program result)
Transfer and Match Rates Result
Based on the CT records converted from the UIUC MARCXML records, the performance of CT is as follows:
- The total transfer rate from MARCXML of UIUC to CT is 95.2709%.
- The non-transfer rate, that is, the loss of information rate from MARC records to CT, is 4.729%. This is a very low rate, considering that CT has only 12 common terms (fewer than Dublin Core) and 58 qualifiers (many fewer than MARC tags).
- The semantic match rate by SKOS concept is 100%:
- exactMatch rate: 55.347%; and
- broadMatch rate: 44.6527%.
- Tag match rates, out of the total transfer rate of 95.2709%:
- the perfect match rate, matched by all of MARC tag, indexes, and subcode, is 16.1834%;
- the subcode match rate, matched by tag and subcodes, is 43.068%; and
- the general match rate, matched by MARC tag number only, is 40.748%.
- Non-transferred MARC tags
- The dictionary of non-transferred tags shows which tags have no mapping in the MARCtoCTcrosswalk. The most frequently used non-transferred tag is 049, which holds local holdings for OCLC. The 092 tag holds the local Dewey call number of OCLC. These non-transferred tags do not critically affect the loss of information rate.
- The total loss of information rate, that is, the non-transfer rate, is only 4.729% for the 400,000 MARCXML records of UIUC.
- The noMatchtag dictionary shows the sorted non-transferred tags with their frequency of use. Sorted noMatchtagDic: OrderedDict([('696', 1), ('315', 1), ('793', 1), ('526', 1), ('361', 1), ('512', 1), ('514', 1), ('565', 1), ('956', 1), ('946', 1), ('098', 2), ('695', 2), ('939', 2), ('935', 2), ('996', 2), ('790', 2), ('917', 2), ('556', 2), ('952', 2), ('697', 3), ('544', 3), ('592', 3), ('263', 3), ('791', 4), ('302', 4), ('656', 4), ('866', 4), ('380', 5), ('261', 5), ('913', 7), ('960', 7), ('598', 8), ('096', 9), ('524', 9), ('346', 9), ('945', 10), ('599', 11), ('853', 13), ('547', 14), ('902', 16), ('257', 17), ('581', 18), ('069', 20), ('585', 21), ('596', 22), ('863', 27), ('545', 29), ('017', 31), ('400', 32), ('027', 33), ('765', 36), ('886', 41), ('351', 48), ('692', 49), ('055', 51), ('767', 52), ('344', 54), ('753', 60), ('270', 71), ('563', 72), ('911', 85), ('256', 89), ('777', 116), ('212', 132), ('350', 132), ('030', 140), ('586', 150), ('799', 206), ('070', 214), ('410', 260), ('570', 265), ('088', 354), ('525', 370), ('006', 392), ('301', 404), ('693', 523), ('074', 538), ('963', 552), ('025', 583), ('699', 683), ('012', 850), ('254', 853), ('949', 886), ('503', 963), ('691', 1346), ('938', 1574), ('840', 1832), ('099', 2162), ('305', 2754), ('539', 2795), ('936', 2824), ('247', 3041), ('011', 3098), ('508', 3192), ('588', 3265), ('891', 3552), ('262', 4678), ('590', 4895), ('850', 5692), ('910', 7005), ('306', 8154), ('066', 8421), ('037', 10974), ('265', 11254), ('048', 16695), ('987', 23711), ('090', 39528), ('690', 85477), ('092', 128942), ('880', 142277), ('049', 204944)])
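A dictionary of this shape, sorted by ascending frequency, can be built as in the sketch below. The input list is a hypothetical stream of non-transferred tags, not the UIUC data, and the variable names are illustrative; only the use of an `OrderedDict` sorted by count is taken from the output shown above.

```python
from collections import Counter, OrderedDict

# Hypothetical stream of tags that found no mapping in the crosswalk.
non_transferred = ["049", "092", "049", "880", "049", "092"]

# Count how often each non-transferred tag occurs.
no_match = Counter(non_transferred)

# Sort by frequency, least used first, as in the reported dictionary.
sorted_no_match = OrderedDict(sorted(no_match.items(), key=lambda kv: kv[1]))
print(sorted_no_match)
```

Sorting in ascending order puts the tags that dominate the loss of information rate at the end of the dictionary, which is where 049, 880, and 092 appear in the reported result.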
Conclusion
To examine the developed Common Terminology, mapping experiments were conducted with MIT (QDC) records and UIUC (MARCXML) records. This paper reports the mapping experiment from 400,000 refined MARCXML records of UIUC to CT. To handle the complex information of MARC records with indexes and subcodes, encoding and decoding methods are used, together with several mapping methods. The result of the MARCXML to CT conversion shows very high performance of CT, although CT has only 12 common terms, fewer than the 15 core elements of DC, and 58 qualifiers, many fewer than MARC tags and subcodes. The total transfer rate is 95.27%; the non-transfer rate, that is, the loss of information rate, is only 4.729%. The semantic match rate by SKOS concept is 100%.
In light of these results and the result of the MIT (QDC) to CT mapping experiment (transfer rate 99.9%, lexical match rate 98.7%, semantic match rate 100%, and loss of information rate 0.00463%), we conclude that CT shows high performance in achieving and improving metadata interoperability, minimizing loss of information and preserving the specificity and precision of the source metadata records.
The very high performance and very low loss of information rate rest on the fact that CT was developed based on MARC tag usage in Harvard, UIUC, and WorldCat records and in search interfaces. Tags with a usage rate over 50% are treated as common terms or qualifiers. Also, the Common Terminology concept, a set of common terms of commonly used standards serving as a bridge terminology, is very effective for achieving interoperability among different standards, even ones with very different degrees of specificity and generality such as MARC, MODS, and DC & QDC. The experiments show that finding and reusing commonly used terms among existing standards, instead of creating a new schema, is a crucial way to build interoperability.