Data from: Exploration and Explanation in Computational Notebooks

Sample notebook data

File Size	1.46 GB
File Format	ZIP Format
Scope And Content	A smaller, starter dataset with 1000 randomly selected repositories containing ~6000 notebooks

Notebook files - part 1

File Size	43.7 GB
File Format	ZIP Format
Scope And Content	Each .ipynb file was downloaded directly from GitHub using the url listed in 'notebooks.csv'. The files are named based on a 'nb_id' we assigned to each notebook in 'notebooks.csv'.

Notebook files - part 2

File Size	42.3 GB
File Format	ZIP Format
Scope And Content	Each .ipynb file was downloaded directly from GitHub using the url listed in 'notebooks.csv'. The files are named based on a 'nb_id' we assigned to each notebook in 'notebooks.csv'.

Notebook files - part 3

File Size	44.9 GB
File Format	ZIP Format
Scope And Content	Each .ipynb file was downloaded directly from GitHub using the url listed in 'notebooks.csv'. The files are named based on a 'nb_id' we assigned to each notebook in 'notebooks.csv'.

Notebook files - part 4

File Size	41.7 GB
File Format	ZIP Format
Scope And Content	Each .ipynb file was downloaded directly from GitHub using the url listed in 'notebooks.csv'. The files are named based on a 'nb_id' we assigned to each notebook in 'notebooks.csv'.

Notebook files - part 5

File Size	43.1 GB
File Format	ZIP Format
Scope And Content	Each .ipynb file was downloaded directly from GitHub using the url listed in 'notebooks.csv'. The files are named based on a 'nb_id' we assigned to each notebook in 'notebooks.csv'.

Notebook files - part 6

File Size	56.6 GB
File Format	ZIP Format
Scope And Content	Each .ipynb file was downloaded directly from GitHub using the url listed in 'notebooks.csv'. The files are named based on a 'nb_id' we assigned to each notebook in 'notebooks.csv'.

Notebook metadata

File Size	492 MB
File Format	ZIP Format
Scope And Content	Each JSON file is the result of a query to GitHub asking for all the Jupyter Notebooks within a certain range of byte sizes (e.g. 0 to 10 bytes). We had to subdivide this query by byte size because GitHub's API will only return up to 1000 results to a query at a time. GitHub further paginates results into pages with up to 100 results each. For example, a query returning 977 results would be broken up into ten different pages, the first nine with 100 results each and the last with 77 results. Each file name includes the byte size range and page number of the results it contains. Most of the relevant data from these files are summarized in the 'notebooks.csv' file.
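
The pagination arithmetic described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' collection script: the function name `page_sizes` and its parameters are hypothetical, and the defaults (100 results per page, 1000 results per query) are taken from the description above.

```python
def page_sizes(total_results, per_page=100, max_results=1000):
    """Return how many results land on each page of one paginated query.

    per_page and max_results mirror the limits described in the text;
    the function itself is a hypothetical helper for illustration.
    """
    # Results beyond the per-query cap are never returned, which is why
    # the query had to be subdivided by byte size in the first place.
    reachable = min(total_results, max_results)
    full_pages, remainder = divmod(reachable, per_page)
    return [per_page] * full_pages + ([remainder] if remainder else [])


sizes = page_sizes(977)
print(len(sizes), sizes[-1])  # 10 pages, the last holding 77 results
```

A byte-size range whose query matches more than 1000 notebooks would lose the excess, so a range that large would need to be split into narrower ranges before querying.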

Repository metadata

File Size	229 MB
File Format	ZIP Format
Scope And Content	Each JSON file is GitHub's response to a query for metadata relating to a repository that included a notebook in our dataset. Each file is named using a repository id that GitHub assigns. This data is summarized in 'repositories.csv'; however, the raw JSON files include additional information such as how many times the repository had been forked, starred, or watched.
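
As a rough sketch of how the extra fields might be read from one of these raw files, the Python below pulls the fork, star, and watcher counts out of a repository record. It assumes the files keep the field names used by GitHub's repository API (`forks_count`, `stargazers_count`, `watchers_count`); the helper name `popularity` and the sample record are made up for illustration, and the actual files in this archive may be laid out differently.

```python
import json


def popularity(repo_json_text):
    """Extract popularity counts from one repository metadata record.

    Assumes GitHub's repository field names; missing keys come back as None.
    """
    repo = json.loads(repo_json_text)
    keys = ("forks_count", "stargazers_count", "watchers_count")
    return {key: repo.get(key) for key in keys}


# Synthetic record standing in for the contents of one JSON file.
sample = json.dumps(
    {"id": 1, "forks_count": 3, "stargazers_count": 12, "watchers_count": 12}
)
print(popularity(sample))
```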

Repository READMEs

File Size	209 MB
File Format	ZIP Format
Scope And Content	Each JSON file is GitHub's response to a query for the top-level README file associated with a repository containing a Jupyter Notebook. Some of these JSON files may be empty if the repository had no README file in its top directory. Each JSON file is named using the unique id that GitHub assigns each repository. This README data, including the content of each README file, is summarized in 'readmes.csv'. Note that the README content is Base64 encoded in the CSV and JSON files.
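
Because the README content is Base64 encoded, it has to be decoded before it can be read. A minimal Python sketch follows; the helper name `decode_readme` is hypothetical, the sample string is synthetic, and UTF-8 is an assumed text encoding (undecodable bytes are replaced rather than raising an error).

```python
import base64


def decode_readme(encoded):
    """Decode Base64-encoded README content into text.

    base64.b64decode ignores embedded newlines by default, which matters
    because Base64 payloads are often wrapped across several lines.
    """
    raw_bytes = base64.b64decode(encoded)
    # UTF-8 is an assumption; bad bytes are replaced instead of failing.
    return raw_bytes.decode("utf-8", errors="replace")


# Synthetic example standing in for one encoded README value.
sample = base64.b64encode("# My project\n".encode("utf-8")).decode("ascii")
print(decode_readme(sample))
```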

Summary CSV data

File Size	203 MB
File Format	ZIP Format
Scope And Content	CSV files summarizing and indexing the notebooks, repositories, and READMEs

Log files

File Size	6.7 MB
File Format	ZIP Format
Scope And Content	Log files documenting when each file was downloaded

Analysis scripts

File Size	965 KB
File Format	ZIP Format
Scope And Content	Scripts for our initial analysis of the dataset

Collection

Data from: Exploration and Explanation in Computational Notebooks

Cite This Work

Rule, Adam; Tabard, Aurélien; Hollan, James D. (2018). Data from: Exploration and Explanation in Computational Notebooks. UC San Diego Library Digital Collections. https://doi.org/10.6075/J0JW8C39

Description

In July 2017, our team queried, downloaded, and analyzed approximately 1.25 million Jupyter Notebooks in public repositories on GitHub. By our calculation, this was about 95% of all Jupyter Notebooks publicly available on GitHub at the time. This dataset includes:
~1.25 million Jupyter Notebooks
Metadata about each notebook
Metadata about each of the nearly 200,000 public repositories that contained a Jupyter Notebook
Top-level README files for nearly 150,000 repositories containing a Jupyter Notebook

In addition to this core data, the collection includes:
A smaller, starter dataset with 1000 randomly selected repositories containing ~6000 notebooks
CSV files summarizing and indexing the notebooks, repositories, and READMEs
Log files documenting when each file was downloaded
Scripts for our initial analysis of the dataset

Date Collected

July 2017

Date Issued

2018

Creators

Rule, Adam; Tabard, Aurélien; Hollan, James D.

Technical Details

The following software was used in the querying and analysis of this data:
python==3.6.1
jupyter notebook==5.0.0
matplotlib==2.0.2
numpy==1.12.1
pandas==0.20.1
re==2.2.1
requests==2.14.2
scipy==0.19.0
seaborn==0.7.1

Funding

This research was funded by NSF grants #1319829 and #1735234 as well as NLM grant #T15LM011271.

Language

English

Identifier

DOI: https://doi.org/10.6075/J0JW8C39

Related Resources

Primary associated publication

Rule A, Tabard A, and Hollan J. (2018). Exploration and Explanation in Computational Notebooks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI’18). ACM Press, New York, NY. https://doi.org/10.1145/3173574.3173606

Reference

Analysis scripts on GitHub: https://github.com/activityhistory/jupyter_on_github

License

Creative Commons Attribution 4.0 International Public License

Rights Holder

UC Regents

Copyright

Under copyright (US)

Use: This work is available from the UC San Diego Library. This digital copy of the work is intended to support research, teaching, and private study.

Constraint(s) on Use: This work is protected by the U.S. Copyright Law (Title 17, U.S.C.). Use of this work beyond that allowed by "fair use" or any license applied to this work requires written permission of the copyright holder(s). Responsibility for obtaining permissions and any use and distribution of this work rests exclusively with the user and not the UC San Diego Library. Inquiries can be made to the UC San Diego Library program having custody of the work.

Digital Object Made Available By

Research Data Curation Program, UC San Diego, La Jolla, 92093-0175 (https://lib.ucsd.edu/rdcp)

Last Modified

2023-05-22