Because different library departments tend to choose specialized software and platforms, information becomes siloed. Instead of finding everything in one place, users are forced to visit multiple websites to locate library resources. Electronic theses and dissertations (ETDs), an important part of online scholarship in higher education, usually reside in institutional repositories (IRs) and cannot be accessed through a library's discovery portal unless special settings are configured.
To achieve a unified platform for ETDs and other library collections, metadata librarians face several challenges. While the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) provides a low-barrier harvesting method, some repositories do not support selective harvesting based on collection name. For academic libraries with large digital collections, it is challenging to customize harvesting settings so that only ETD metadata is collected. In addition, because the discovery system and the IR may use different metadata schemas, librarians may face crosswalk challenges for the ETD collection, including loss of metadata granularity and inconsistency between records for print theses and dissertations and records for ETDs.
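For context, selective harvesting in OAI-PMH is scoped with the set parameter, which only works when the repository exposes the target collection as an OAI set. The sketch below illustrates the general shape of such a request; the endpoint URL and set name are placeholders, not SDSU's actual repository configuration.

```python
import requests

# Hypothetical OAI-PMH endpoint and set name -- placeholders for illustration only.
OAI_ENDPOINT = "https://repository.example.edu/oai2"

params = {
    "verb": "ListRecords",
    "metadataPrefix": "oai_dc",
    "set": "etd_collection",  # selective harvest works only if the repository exposes the ETD collection as a set
}

response = requests.get(OAI_ENDPOINT, params=params, timeout=30)
print(response.status_code)
print(response.text[:500])  # beginning of the OAI-PMH XML response
```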
SDSU has identified a method to harvest only the ETD collection and has configured a Primo external resources import profile to import the ETD XML records. We chose XML rather than DC because XML provides more granular fields.
This guide includes the following sub-sections:
From 9/30/2020 to 9/23/2021, the Library received 68 patron requests about the SDSU Thesis and Dissertation Collection via LibChat.
| Request | Count |
|---|---|
| Find (a) specific thesis/theses/dissertation(s) | 35 |
| Ask about the publication timeline of a thesis or dissertation | 10 |
| Ask where or how to find a thesis/dissertation | 9 |
| Acquire a digital/print copy of a thesis/dissertation | 10 |
| Remove a thesis/dissertation from the Library website | 2 |
| Others | 2 |

Table 1. Reasons for Requesting ETD Collection Materials
| Request | Count |
|---|---|
| Request made by the author | 49 |
| Request not made by the author | 19 |

Table 2. Patron’s Relation with the Author
| Metadata elements that patrons used to search | Count | Note |
|---|---|---|
| Title | 8 | |
| Author name | 17 | |
| Year | 11 | |
| Department name | 12 | Geology; ART; music; Physics; History; Philosophy; Political Science; Chemistry; Biology |
| Program name | 5 | MPH (2); MBA (1); MPA (1); ECE (1) |
| Course name | 1 | BA765 |
| Degree level | 8 | Doctoral; master; graduate |
| Topic | 5 | |

Table 3. Counts and Details of Metadata Elements that Patrons Used to Search
Based on the above data, SDSU decided to add department name, advisor, and program information to the ETD metadata.
Piping ETD records from Islandora to Primo needs to be performed monthly and does not require much manual work. All Python scripts developed for this project can be found in the GitHub Repository.
Before running any scripts, one needs to make sure there are four folders in the same directory as the Python scripts: idfiles, single_xml, merged_pre_upload, and final_output.
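If the folders do not exist yet, they can be created with a few lines of Python. This is just a convenience sketch, assuming the scripts live in (and are run from) the same directory as the four folders.

```python
import os

# The four working folders expected by the harvesting scripts.
FOLDERS = ["idfiles", "single_xml", "merged_pre_upload", "final_output"]

base_dir = os.path.dirname(os.path.abspath(__file__))  # directory that holds the Python scripts
for name in FOLDERS:
    os.makedirs(os.path.join(base_dir, name), exist_ok=True)
    print(f"Ready: {os.path.join(base_dir, name)}")
```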
Two scripts are included in the repository:
HarvestFromIslandora.py: This script checks for new ETD records in Islandora, downloads the new records, and creates one merged XML file in the merged_pre_upload folder. To run it, enter the following in the command line: "python (path of this script) last_page_number_of_ETD_collection_in_Islandora full_path_of_the_folder_that_holds_the_four_folders_above date"
Example: python "F:/123/456/789/HarvestFromIslandora.py" "535" "F:/123/456/789/" "01032023"
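Conceptually, the merge step combines the individually downloaded records into one file. The sketch below is illustrative only, not the actual HarvestFromIslandora.py code; the wrapper element name and file locations are assumptions.

```python
import os
import xml.etree.ElementTree as ET

def merge_single_records(single_xml_dir, output_path):
    """Combine individually downloaded ETD records into one XML file (illustrative sketch only)."""
    root = ET.Element("etd_records")  # assumed wrapper element name
    for filename in sorted(os.listdir(single_xml_dir)):
        if filename.endswith(".xml"):
            record = ET.parse(os.path.join(single_xml_dir, filename)).getroot()
            root.append(record)
    ET.ElementTree(root).write(output_path, encoding="utf-8", xml_declaration=True)

# e.g. merge_single_records("F:/123/456/789/single_xml",
#                           "F:/123/456/789/merged_pre_upload/output01032023.xml")
```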
ChangeURI.py: After manually validating the merged XML file, this script updates the Identifier[@type=url] element for each record, which will be the access method for users. To run it, enter the following in the command line: "python (path of this script) full_path_of_the_merged_xml_in_the_merged_pre_upload_folder full_path_of_the_new_XML_(should be in the final_output folder)"
Example: python "F:/123/456/789/ChangeURI.py" "F:/123/456/789/merged_pre_upload/output01032023.xml" "F:/123/456/789/final_output/final01042023.xml"
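The URI update can be pictured roughly as below. This is a sketch, not the actual ChangeURI.py logic; the element name, attribute handling, and URL pattern are assumptions to be adjusted to the schema actually used in the merged file.

```python
import xml.etree.ElementTree as ET

def update_record_urls(merged_path, output_path):
    """Rewrite each record's identifier[@type='url'] to the user-facing access URL (sketch only)."""
    tree = ET.parse(merged_path)
    for element in tree.getroot().iter():
        # Assumed element name and attribute; namespaced tags end with "identifier".
        if element.tag.endswith("identifier") and element.get("type") == "url":
            record_id = element.text.rsplit("/", 1)[-1] if element.text else ""
            # Assumed URL pattern -- placeholder domain, not the production repository URL.
            element.text = f"https://repository.example.edu/islandora/object/{record_id}"
    tree.write(output_path, encoding="utf-8", xml_declaration=True)

# e.g. update_record_urls("F:/123/456/789/merged_pre_upload/output01032023.xml",
#                         "F:/123/456/789/final_output/final01042023.xml")
```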
Manual validation needed:
After running HarvestFromIslandora.py, one needs to manually check the script-generated file and remove or replace any incorrect elements. Usually, one may encounter the following errors:
An example of a final output file can be viewed via this link.
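Before or alongside the manual review, a quick well-formedness check can catch structural problems early. Below is a minimal sketch using Python's standard library; the file path is a placeholder.

```python
import xml.etree.ElementTree as ET

merged_file = "F:/123/456/789/merged_pre_upload/output01032023.xml"  # placeholder path

try:
    ET.parse(merged_file)
    print("XML is well-formed; proceed with the manual element review.")
except ET.ParseError as err:
    print(f"XML parse error, fix before continuing: {err}")
```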
If uploaded records have new changes, one may need to re-harvest those records using their IDs. To harvest a single record:
To harvest a list of records, create a TXT file with one ID per row, like this file. Run the HarvestSelectedRecords.py script in this GitHub Repository by entering the following in the command line: "python (path of this script) full_path_of_the_TXT_file_for_IDs full_path_of_the_new_XML_(should be in the final_output folder) date"
For example: python [path to HarvestSelectedRecords.py] "F:/xxx/xxx/xxx/id.txt" "F:/xxx/xxx/xxx" "20230105"
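The ID file is plain text with one record ID per row. The snippet below shows how such a file might be read before harvesting; the path and IDs shown in the comment are placeholders, not real record identifiers.

```python
# id.txt (one ID per row; placeholder values):
#   islandora:0001
#   islandora:0002

with open("F:/xxx/xxx/xxx/id.txt", encoding="utf-8") as f:  # placeholder path
    record_ids = [line.strip() for line in f if line.strip()]

print(f"{len(record_ids)} record IDs will be harvested individually.")
```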
Two local fields are created: department name and program information.
To create a new local field:
If you want to add an indexed field:
After creating new local fields, don't forget to add the fields to the test view or the discovery view in production. To do that:
Normalization rules change XML fields into DC fields and local fields. An example of a discovery normalization rule can be viewed in this Google Doc. After creating new normalization rules for discovery, one needs to create a new process task for the rules.
Before creating a new import profile for Primo VE, please follow the instructions in the link to create new search profiles.
After a new search profile is created, create a new import profile.
After creating the import profile, one can run a job to import new records by clicking the ellipsis and selecting Run. One can also reload the records; reload will re-run all jobs in the history, so be cautious when using this option. To delete records, please see this guide.