Data from: Exploration and Explanation in Computational Notebooks
Sample notebook data
Scope And Content | A smaller, starter dataset with 1000 randomly selected repositories containing ~6000 notebooks |
Notebook files - part 1
Scope And Content | Each .ipynb file was downloaded directly from GitHub using the url listed in 'notebooks.csv'. The files are named based on a 'nb_id' we assigned to each notebook in 'notebooks.csv'. |
Notebook files - part 2
Scope And Content | Each .ipynb file was downloaded directly from GitHub using the url listed in 'notebooks.csv'. The files are named based on a 'nb_id' we assigned to each notebook in 'notebooks.csv'. |
Notebook files - part 3
Scope And Content | Each .ipynb file was downloaded directly from GitHub using the url listed in 'notebooks.csv'. The files are named based on a 'nb_id' we assigned to each notebook in 'notebooks.csv'. |
Notebook files - part 4
Scope And Content | Each .ipynb file was downloaded directly from GitHub using the url listed in 'notebooks.csv'. The files are named based on a 'nb_id' we assigned to each notebook in 'notebooks.csv'. |
Notebook files - part 5
Scope And Content | Each .ipynb file was downloaded directly from GitHub using the url listed in 'notebooks.csv'. The files are named based on a 'nb_id' we assigned to each notebook in 'notebooks.csv'. |
Notebook files - part 6
Scope And Content | Each .ipynb file was downloaded directly from GitHub using the url listed in 'notebooks.csv'. The files are named based on a 'nb_id' we assigned to each notebook in 'notebooks.csv'. |
Notebook metadata
Scope And Content | Each JSON file is the result of a query to GitHub asking for all the Jupyter Notebooks within a certain range of byte sizes (e.g. 0 to 10 bytes). We had to subdivide this query by byte size because GitHub's API will only return up to 1000 results to a query at a time. GitHub further paginates results into pages with up to 100 results each. For example, a query returning 977 results would be broken up into ten different pages, the first nine with 100 results each and the last with 77 results. Each file name includes the byte size range and page number of the results it contains. Most of the relevant data from these files are summarized in the 'notebooks.csv' file. |
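The pagination arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' collection script: the function name `page_sizes` and its parameter names are hypothetical, and the limits of 100 results per page and 1000 results per query are taken from the description above.

```python
def page_sizes(total_results, per_page=100, max_results=1000):
    """Return how many results land on each page of a paginated query.

    per_page and max_results mirror the limits described in the text;
    the function itself is a hypothetical helper for illustration.
    """
    # Results beyond the per-query cap are never returned
    reachable = min(total_results, max_results)
    full_pages, remainder = divmod(reachable, per_page)
    sizes = [per_page] * full_pages
    if remainder:
        sizes.append(remainder)
    return sizes


sizes = page_sizes(977)
print(len(sizes), sizes[-1])  # 10 pages, the last holding 77 results
```

A query matching more than 1000 notebooks would be truncated at ten full pages, which is why the byte-size range has to be narrowed until each query stays under the cap.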
Repository metadata
Scope And Content | Each JSON file is GitHub's response to a query for metadata relating to a repository that included a notebook in our dataset. Each file is named using the repository id that GitHub assigns. This data is summarized in 'repositories.csv'; however, the raw JSON files include additional information, such as how many times the repository had been forked, starred, or watched. |
Repository READMEs
Scope And Content | Each JSON file is GitHub's response to a query for the top level README file associated with a repository containing a Jupyter Notebook. Some of these JSON files may be empty if there was not a README file in the repository's top directory. Each JSON file is named using the unique id that GitHub assigns each repository. This README data, including the content of each README file, is summarized in 'readmes.csv'. Note that the README content is Base64 encoded in the CSV and JSON files. |
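Because the README content is Base64 encoded, it has to be decoded before it can be read. The following is a minimal sketch, assuming each non-empty JSON file stores the encoded text under a `content` key, as GitHub's README API responses do. The helper name `decode_readme` is hypothetical, and the sample record is constructed for illustration rather than drawn from the dataset.

```python
import base64
import json


def decode_readme(raw_json):
    """Return the README text from one JSON response, or None if absent.

    Assumes the Base64 text sits under a "content" key (an assumption
    based on GitHub's README API response format).
    """
    if not raw_json.strip():
        return None  # empty file: no top-level README in the repository
    record = json.loads(raw_json)
    content = record.get("content")
    if not content:
        return None
    # b64decode ignores the line breaks GitHub inserts into the encoded text
    return base64.b64decode(content).decode("utf-8", errors="replace")


# Illustrative record built here, not taken from the dataset
sample = json.dumps({
    "encoding": "base64",
    "content": base64.b64encode(b"# Demo\n").decode("ascii"),
})
print(decode_readme(sample))
```

Decoding with `errors="replace"` is a design choice for the sketch: it keeps a batch run from failing on the occasional README that is not valid UTF-8.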
Summary CSV data
Scope And Content | CSV files summarizing and indexing the notebooks, repositories, and READMEs |
Log files
Scope And Content | Log files documenting when each file was downloaded |
Analysis scripts
Scope And Content | Scripts for our initial analysis of the dataset |
- Collection
- Cite This Work
-
Rule, Adam; Tabard, Aurélien; Hollan, James D. (2018). Data from: Exploration and Explanation in Computational Notebooks. UC San Diego Library Digital Collections. https://doi.org/10.6075/J0JW8C39
- Description
-
In July 2017, our team queried, downloaded, and analyzed approximately 1.25 million Jupyter Notebooks in public repositories on GitHub. By our calculation this was about 95% of all Jupyter Notebooks publicly available on GitHub at the time. This dataset includes:
~1.25 million Jupyter Notebooks
Metadata about each notebook
Metadata about each of the nearly 200,000 public repositories that contained a Jupyter Notebook
Top level README files for nearly 150,000 repositories containing a Jupyter Notebook
In addition to this core data, the dataset includes:
A smaller, starter dataset with 1000 randomly selected repositories containing ~6000 notebooks
CSV files summarizing and indexing the notebooks, repositories, and READMEs
Log files documenting when each file was downloaded
Scripts for our initial analysis of the dataset
- Date Collected
- July 2017
- Date Issued
- 2018
- Creators
- Technical Details
-
The following software was used in the querying and analysis of this data:
python==3.6.1
jupyter notebook==5.0.0
matplotlib==2.0.2
numpy==1.12.1
pandas==0.20.1
re==2.2.1
requests==2.14.2
scipy==0.19.0
seaborn==0.7.1
- Funding
-
This research was funded by NSF grants #1319829 and #1735234 as well as NLM grant #T15LM011271.
- Topics
- Language
- English
- Identifier
- Related Resources
- Rule A, Tabard A, and Hollan J. (2018). Exploration and Explanation in Computational Notebooks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI’18). ACM Press, New York, NY. https://doi.org/10.1145/3173574.3173606
- Analysis scripts on GitHub: https://github.com/activityhistory/jupyter_on_github
Primary associated publication
Reference
- License
-
Creative Commons Attribution 4.0 International Public License
- Rights Holder
- UC Regents
- Copyright
-
Under copyright (US)
Use: This work is available from the UC San Diego Library. This digital copy of the work is intended to support research, teaching, and private study.
Constraint(s) on Use: This work is protected by the U.S. Copyright Law (Title 17, U.S.C.). Use of this work beyond that allowed by "fair use" or any license applied to this work requires written permission of the copyright holder(s). Responsibility for obtaining permissions and any use and distribution of this work rests exclusively with the user and not the UC San Diego Library. Inquiries can be made to the UC San Diego Library program having custody of the work.
- Digital Object Made Available By
-
Research Data Curation Program, UC San Diego, La Jolla, 92093-0175 (https://lib.ucsd.edu/rdcp)
- Last Modified
2023-05-22