Qri is an open-source tool for sharing version-controlled datasets, built on a decentralized network. With qri-python we can use qri directly from a Jupyter notebook, giving us access to community-published datasets to explore freely. In this talk we'll walk through loading version-controlled datasets into a dataframe, running an SQL join, & finally cleaning & publishing a dataset of our own.
Objective: To show attendees the power of applying principles of open-source to common data resources.
Outline: In this talk we’ll first show examples of pulling community-created datasets into qri. We’ll start by browsing http://qri.cloud, then pull down a relevant dataset to show single-command access to all qri data without leaving Jupyter. We’ll also point users to the issue queue for feedback & questions, where they can build an understanding of a dataset.
From there we’ll demonstrate publishing a dataset that others can use. We’ll walk through an example that executes an SQL query to join two qri datasets, performs additional cleanup & annotation, publishes a version, and views it on qri cloud.
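As a rough sketch of the join-then-clean step, the snippet below uses only Python's standard-library sqlite3 module. It does not use qri or qri-python at all: the two tables stand in for two datasets that have already been pulled down, the rows are made up purely for illustration, and the names `population`, `area`, and `join_datasets` are hypothetical. Loading from qri and publishing a version are left out.

```python
import sqlite3

# Hypothetical stand-ins for two pulled datasets; the rows are invented
# for illustration and are not real qri data.
populations = [("nyc", 8_400_000), ("la", 3_900_000), ("chi", 2_700_000)]
areas_km2 = [("nyc", 783.8), ("la", 1213.9)]


def join_datasets(population_rows, area_rows):
    """Load both datasets into an in-memory database and inner-join them on city."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE population (city TEXT PRIMARY KEY, residents INTEGER)")
        conn.execute("CREATE TABLE area (city TEXT PRIMARY KEY, km2 REAL)")
        conn.executemany("INSERT INTO population VALUES (?, ?)", population_rows)
        conn.executemany("INSERT INTO area VALUES (?, ?)", area_rows)
        # Inner join: cities missing from either dataset are dropped.
        return conn.execute(
            """
            SELECT p.city, p.residents, a.km2
            FROM population AS p
            JOIN area AS a ON a.city = p.city
            ORDER BY p.city
            """
        ).fetchall()
    finally:
        conn.close()


joined = join_datasets(populations, areas_km2)

# Cleanup & annotation: name the columns and add a derived density field.
cleaned = [
    {"city": city, "residents": residents, "km2": km2,
     "density_per_km2": round(residents / km2, 1)}
    for city, residents, km2 in joined
]

for row in cleaned:
    print(row)
```

Because the query is an inner join, "chi" drops out of the result: it has no matching row in the second dataset. Surprises like that are worth checking before publishing a version.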
Finally, we’ll talk about some of the challenges of creating a robust data commons, and how qri uses decentralization to solve difficult problems of cost, data availability, and synchronization.
By the end we hope attendees will buy into a vision of a world where open data has the same support and tooling as open-source software.
Relevant source code: https://github.com/qri-io