
Monday Oct. 12, 2020, 5:45 p.m.–Oct. 12, 2020, 6 p.m. in Jupyter Community: Tools

Using Qri (“query”) to fetch, query, combine and publish datasets.

Brendan O'Brien

Audience level:
Novice

Brief Summary

Qri is an open-source tool for sharing version-controlled datasets, built on a decentralized network. With qri-python we can use qri directly from a Jupyter notebook, accessing countless community-published datasets for free exploration. In this talk we'll walk through loading version-controlled datasets into a dataframe, running an SQL join, & finally cleaning & publishing a dataset of our own.

Outline

Objective: To show attendees the power of applying principles of open-source to common data resources.

Outline: In this talk we’ll first show examples of pulling community-created datasets into qri. We’ll start by browsing http://qri.cloud, then pull down a relevant dataset to show single-command access to all qri data without leaving Jupyter. We’ll also point users to each dataset's issue queue for feedback & questions, where they can build an understanding of the dataset.

From there we’ll demonstrate publishing a dataset that others can use. We’ll walk through an example that executes an SQL query to join two qri datasets, performs additional cleanup & annotation, publishes a version, and views it on qri cloud.
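
To make the join-then-clean step concrete, here is a minimal sketch that uses only Python's standard-library sqlite3 module. It is not qri's API: the table names, column names, and row values are hypothetical placeholders standing in for two dataset bodies that have already been pulled, and the publish step is left out because it depends on qri itself.

```python
import sqlite3

# Placeholder rows standing in for two dataset bodies (toy values, not real data).
population_rows = [("AAA", 1000), ("BBB", 2000), ("CCC", 3000)]
income_rows = [("AAA", 50000.0), ("CCC", 90000.0), ("DDD", 10000.0)]


def join_datasets(population, income):
    """Inner-join two row sets on region_code using an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schemas; a real dataset would bring its own column names.
        conn.execute("CREATE TABLE population (region_code TEXT, residents INTEGER)")
        conn.execute("CREATE TABLE income (region_code TEXT, total_income REAL)")
        conn.executemany("INSERT INTO population VALUES (?, ?)", population)
        conn.executemany("INSERT INTO income VALUES (?, ?)", income)
        cursor = conn.execute(
            """
            SELECT p.region_code, p.residents, i.total_income
            FROM population AS p
            JOIN income AS i ON i.region_code = p.region_code
            ORDER BY p.region_code
            """
        )
        return cursor.fetchall()
    finally:
        conn.close()


joined = join_datasets(population_rows, income_rows)

# Cleanup and annotation: add a derived income-per-resident value to each joined row.
annotated = [
    (code, residents, total, total / residents)
    for code, residents, total in joined
]

for row in annotated:
    print(row)
```

Only regions present in both inputs survive the inner join, so rows without a match on either side are dropped before the derived column is computed. In a notebook, the resulting rows could be handed to a dataframe for further exploration.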

Finally, we’ll talk about some of the challenges of creating a robust data commons, and how qri uses decentralization to solve difficult problems of cost, data availability, and synchronization.

By the end we hope attendees will buy into a vision of a world where open data has the same support and tooling as open-source software.

Relevant source code: https://github.com/qri-io

Video: https://www.youtube.com/watch?v=P2qeY2nPK3Q