Data from: Toward Enhanced Reusability: A Comparative Analysis of Metadata for Machine Learning Objects and Their Characteristics in Generalist and Specialist Repositories
Supplemental Table A - Characteristics of the UC San Diego Library Digital Collections
Description | Table based on the Generalist Repository Comparison Chart produced by Stall et al. 2020 (https://doi.org/10.5281/zenodo.3946720). |
Supplemental Table B - Metadata Crosswalk
Description | Equivalent properties across the eight repositories mapped onto a common property name. |
Supplemental Table C - Number of machine learning objects published by year and repository, 2000-2021 (Figure 1 data)
Description | Data used to generate Figure 1, machine learning objects by year published and repository, 2000-2021. |
Supplemental Table D - File Format Categories
Description | File extensions extracted from file names in metadata records grouped into format categories representing broader groups. |
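The grouping described for Supplemental Table D could be sketched as follows. This is a minimal illustration only: the mapping `FORMAT_CATEGORIES`, its category labels, and the helper names are hypothetical and are not taken from the dataset; the actual categories are those listed in Supplemental Table D.

```python
from pathlib import PurePosixPath

# Hypothetical extension-to-category mapping, for illustration only.
# The real groupings are defined in Supplemental Table D.
FORMAT_CATEGORIES = {
    ".csv": "tabular",
    ".tsv": "tabular",
    ".xlsx": "tabular",
    ".json": "structured text",
    ".py": "code",
    ".ipynb": "code",
    ".zip": "archive",
}


def categorize(file_name: str) -> str:
    """Map one file name to a broad format category via its extension."""
    ext = PurePosixPath(file_name).suffix.lower()
    if not ext:
        return "no extension"
    return FORMAT_CATEGORIES.get(ext, "other")


def categories_in_object(file_names):
    """Return the set of format categories present in one object's files."""
    return {categorize(name) for name in file_names}
```

Lower-casing the extension before the lookup keeps `Train.CSV` and `train.csv` in the same category, and unknown extensions fall into a catch-all group rather than raising an error.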
Supplemental Table E - Percentage of objects containing file formats (Figure 2 data)
Description | Data used to generate Figure 2, percentage of objects containing file format category in repositories. |
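A percentage of this kind can be computed by counting, for each format category, the objects that contain at least one file in that category. The sketch below assumes each object is already represented as a set of category labels; the function name and input shape are hypothetical and do not describe the authors' actual scripts.

```python
from collections import Counter


def percent_containing(objects):
    """Percentage of objects that contain each format category.

    `objects` is a list with one collection of category labels per object
    (a hypothetical input shape chosen for this sketch).
    """
    if not objects:
        return {}
    counts = Counter()
    for categories in objects:
        # Use a set so an object counts once per category,
        # however many of its files fall in that category.
        counts.update(set(categories))
    return {cat: 100.0 * n / len(objects) for cat, n in counts.items()}
```

Because one object can contain several categories, the percentages for a repository need not sum to 100.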
Supplemental Table F - License Normalization
Description | Licenses categorized broadly into one of fourteen types according to rights reserved, from least restrictive (“No rights reserved,” “Attribution,” “Attribution-NoDerivs,” etc.) to most restrictive (“All rights reserved”), as described in the SPDX License List (https://spdx.org/licenses/), published by the Linux Foundation. |
Code for Metadata Extracts and Data Analysis (GitHub Repository)
Description | Code used to extract metadata via API or web scraping from repositories. [Note that APIs and UIs change regularly - consult the website documentation before reusing this code.] Jupyter Notebooks and R scripts to process raw JSON metadata extracts and calculate summary statistics. Includes code to create Supplemental Tables C, D, and E, as well as code to create Figure 1, Figure 2, and Table 3 in Labou et al. 2024. |
Repository Metadata Extracts
Description | Full metadata extracts from seven repositories, in JSON format, and the UC San Diego Library Digital Collections extract in .xlsx format. |
Resource Types for Figshare and Zenodo Objects
Description | Full list of resource types for Figshare and Zenodo objects. Labou et al. 2024 limited analyses to the subset of Figshare objects tagged as “dataset”, “software”, or “model”, and to the subset of Zenodo objects classified as “Dataset” or “Software”. |
- Collection
- Cite This Work
-
Labou, Stephanie; Pennington, Abigail; Yoo, Ho Jung S.; Baluja, Michael (2024). Data from: Toward Enhanced Reusability: A Comparative Analysis of Metadata for Machine Learning Objects and Their Characteristics in Generalist and Specialist Repositories. UC San Diego Library Digital Collections. https://doi.org/10.6075/J0JS9QMH
- Description
-
This dataset contains data reported in the paper, Labou et al. 2024, which aims to understand how researchers are currently documenting ML research outputs for sharing, and the extent to which repository metadata fields enable reuse of ML objects. Contents of the dataset include: Supplemental Tables referenced in the paper, a snapshot of the code used to query or web scrape data repositories for ML objects, metadata extracts from the repositories, and a snapshot of the code used to analyze the extracts.
- Creation Date
- 2021 to 2023
- Date Issued
- 2024
- Technical Details
-
Code used to access all APIs was developed using Python 3.10.0, with tested compatibility back to Python 3.9.2. This implementation relies heavily on packages such as "requests" for the public APIs, and "selenium" and "beautifulsoup4" for web scraping when necessary. See requirements.txt files in Code for Metadata Extracts and Data Analysis for full details.
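As a rough illustration of the "requests"-based approach, the sketch below queries Zenodo's public records search endpoint and reduces each hit to a few fields. It is not the authors' code: the function names, the selected fields, and the page size are assumptions made for this example, and the response layout should be checked against current Zenodo documentation, since APIs change regularly.

```python
def fetch_page(query, page=1, size=25):
    """Fetch one page of search results from Zenodo's records endpoint."""
    # Imported here so the parsing helper below works without the dependency.
    import requests

    response = requests.get(
        "https://zenodo.org/api/records",
        params={"q": query, "page": page, "size": size},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def summarize_hits(payload):
    """Reduce a search response to a list of small dicts.

    Assumes hits are nested under payload["hits"]["hits"]; missing keys
    yield None instead of raising.
    """
    rows = []
    for hit in payload.get("hits", {}).get("hits", []):
        meta = hit.get("metadata", {})
        rows.append({
            "id": hit.get("id"),
            "title": meta.get("title"),
            "year": (meta.get("publication_date") or "")[:4] or None,
            "resource_type": meta.get("resource_type", {}).get("type"),
        })
    return rows
```

A call such as `summarize_hits(fetch_page("machine learning"))` would return one dict per record on the first page; keeping the parsing separate from the network call makes it possible to rerun the summary step on saved JSON extracts.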
- Funding
-
Librarians Association of the University of California (LAUC) 2020-2021; Research Data Curation Program, UC San Diego Library.
- Language
- English
- Identifier
-
Identifier: Abigail Pennington: https://orcid.org/0000-0002-9364-1995
Identifier: Ho Jung S. Yoo: https://orcid.org/0000-0001-9677-0947
Identifier: Stephanie Labou: https://orcid.org/0000-0001-5633-5983
- Related Resources
- Labou, Stephanie G., Abigail Pennington, Ho Jung S. Yoo, and Michael Baluja. 2024. "Toward Enhanced Reusability: A Comparative Analysis of Metadata for Machine Learning Objects and Their Characteristics in Generalist and Specialist Repositories." Journal of eScience Librarianship 13 (2): e685. https://doi.org/10.7191/jeslib.685
- Dryad (datadryad.org/stash)
- Figshare (figshare.com)
- Harvard Dataverse (dataverse.harvard.edu)
- Kaggle (kaggle.com/datasets)
- OpenML (openml.org)
- SPDX License List (spdx.org/licenses)
- UC San Diego Library Digital Collections (library.ucsd.edu/dc)
- UCI Machine Learning Repository (archive.ics.uci.edu)
- Zenodo (zenodo.org)
- Code for Metadata Extracts and Data Analysis (GitHub Repository): http://github.com/stephlabou/comparative-machine-learning-metadata
- PyCurator GitHub Repository: https://github.com/michaelbaluja/PyCurator
Primary associated publication
Source data
Other version
- License
-
Creative Commons Attribution 4.0 International Public License
- Rights Holder
- UC Regents
- Copyright
-
Under copyright (US)
Use: This work is available from the UC San Diego Library. This digital copy of the work is intended to support research, teaching, and private study.
Constraint(s) on Use: This work is protected by the U.S. Copyright Law (Title 17, U.S.C.). Use of this work beyond that allowed by "fair use" or any license applied to this work requires written permission of the copyright holder(s). Responsibility for obtaining permissions and any use and distribution of this work rests exclusively with the user and not the UC San Diego Library. Inquiries can be made to the UC San Diego Library program having custody of the work.
- Digital Object Made Available By
-
Research Data Curation Program, UC San Diego, La Jolla, 92093-0175 (https://lib.ucsd.edu/rdcp)
- Last Modified
2024-06-28