
Code as Data – Notebooks for Software Analytics and Repository Mining

Andreas Schreiber, Lynn von Kurnatowski

Audience level:

Brief Summary

Software Analytics is the analysis of data about existing software, drawn from source code, the development process, or the people involved. You will learn how to use Jupyter notebooks with Python to get data from software repositories such as GitHub, store it in graph databases like Neo4j, and visualize it with libraries such as Bokeh or plot.ly.


Software Analytics is about analyzing data related to existing software artifacts, such as source code, development processes, or human-centered data from developers and users. The goal of Software Analytics is to gain insight into the development and status of software systems. These insights can support decisions, for example on how to improve the development process. You will learn how to use Python for:

  1. Getting data from software repositories with repository mining, from development processes by recording development provenance, and from developers or users with eye-tracking data.

  2. Storing and analyzing the software data with graph databases (mainly Neo4j, but support for AWS Neptune is in progress).

  3. Extracting knowledge from the software data using graph queries to get insights into the software and the development process.

  4. Visualizing the software data with 2D graph drawing and the statistical data with Python plotting libraries.
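As a rough illustration of steps 2 and 3, the sketch below models a few commits as a property graph and runs a two-hop query over it, using only the Python standard library. The sample records, the node labels (`Author`, `Commit`, `File`), and the relationship names (`AUTHORED`, `MODIFIED`) are assumptions made for this example, not the schema used in the talk. In a graph database the same question would be asked with a graph query instead of Python loops.

```python
from collections import defaultdict

# Hypothetical commit records; in practice these would come from repository mining.
commits = [
    {"sha": "a1", "author": "alice", "files": ["core.py", "io.py"]},
    {"sha": "b2", "author": "bob", "files": ["io.py"]},
    {"sha": "c3", "author": "alice", "files": ["core.py"]},
]


def build_graph(commits):
    """Return (nodes, edges) for a small property graph of authors, commits, and files."""
    nodes = set()
    edges = []
    for c in commits:
        author = ("Author", c["author"])
        commit = ("Commit", c["sha"])
        nodes.update([author, commit])
        edges.append((author, "AUTHORED", commit))
        for path in c["files"]:
            file_node = ("File", path)
            nodes.add(file_node)
            edges.append((commit, "MODIFIED", file_node))
    return nodes, edges


def authors_per_file(edges):
    """Two-hop traversal: Author -AUTHORED-> Commit -MODIFIED-> File."""
    # Map each commit sha to its author by reading the AUTHORED edges.
    commit_author = {dst[1]: src[1] for src, rel, dst in edges if rel == "AUTHORED"}
    result = defaultdict(set)
    for src, rel, dst in edges:
        if rel == "MODIFIED":
            result[dst[1]].add(commit_author[src[1]])
    return dict(result)


nodes, edges = build_graph(commits)
touched_by = authors_per_file(edges)
```

Here `touched_by` maps each file name to the set of authors who modified it, which is the kind of insight (for example, shared ownership of a file) that a graph query is meant to surface.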

Besides Jupyter, the Python libraries and tools used include pandas, PyGithub, py2neo, and more.
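A minimal sketch of the mining step could flatten commits into plain rows that are easy to load into a table or a graph. The helper `commit_rows` and the stand-in `fake_repo` are hypothetical names introduced here; the attribute names (`get_commits()`, `sha`, `commit.author.name`, `commit.author.date`, `commit.message`) follow the objects PyGithub returns, and the stand-in lets the example run without network access.

```python
from datetime import datetime
from types import SimpleNamespace


def commit_rows(repo, limit=100):
    """Flatten commits from a PyGithub-style repository object into plain dicts."""
    rows = []
    for i, commit in enumerate(repo.get_commits()):
        if i >= limit:
            break
        message = commit.commit.message
        rows.append({
            "sha": commit.sha,
            "author": commit.commit.author.name,
            "date": commit.commit.author.date,
            # Keep only the subject line of the commit message.
            "subject": message.splitlines()[0] if message else "",
        })
    return rows


def _fake_commit(sha, name, message):
    """Build an offline stand-in that mimics the shape of a PyGithub commit."""
    author = SimpleNamespace(name=name, date=datetime(2020, 1, 1))
    return SimpleNamespace(sha=sha, commit=SimpleNamespace(author=author, message=message))


# Offline stand-in for a repository object (assumption for this example only).
fake_repo = SimpleNamespace(get_commits=lambda: [
    _fake_commit("a1", "alice", "Add parser\n\nDetails"),
    _fake_commit("b2", "bob", "Fix typo"),
])

rows = commit_rows(fake_repo)
```

With a real repository, `fake_repo` would be replaced by the object returned from PyGithub's `Github().get_repo(...)`, and the resulting list could be passed to `pandas.DataFrame(rows)` for analysis in a notebook.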

To scale up, we show how to use cloud resources (currently AWS) for the repository mining and the graph database.