Learnings from the 2016 Data Science Jam with Mozilla Science Lab

By Editor
Data Science Lab
  Published 10 Aug 2016

Words by Anthony Ndung'u

As the 2016 Data Science Jam came to an end, the Data Science Lab was proud to host several members of the Mozilla Science Lab. The Science Lab is an initiative of the Mozilla Foundation made up of researchers, developers and librarians working to make research open and accessible. One of the ways they hope to achieve this is by advocating for open data: the idea that some data should be freely available for everyone to use and republish as they wish, without copyright restrictions.




Among the presenters was Joey Lee, who works with data and technology to make maps and other visual output about spaces and places. He presented a range of geography and design tools for mapping. One problem mappers face is finding tools that fit the data well and communicate what they are trying to say. Maps both shape and are shaped by the tools (and people) that make them, whether the maps are made for fun or for science. The process of collecting, analyzing and visualizing data through maps has changed over the past few years. For instance, drawing a circle on the screen once required a few lines of code, but it can now be done with drag and drop in one of several graphical user interfaces. One of these is QGIS, an open source GIS tool with many functions that can be applied to datasets. Another tool is GDAL, which performs geoprocessing from the command line and can be used to process large satellite imagery or make polygon imagery for your maps. BROC-CLI-GEO is an online command-line mapping tool that runs in the browser. RStudio also supports scripting and visualization through R packages such as GISTools, maptools, rgdal, rgeos and ggplot2. The interactive maps Joey presented included topographical maps of a city and trips rendered as flow patterns. Another map showed how much money each country owed the United States in unpaid parking tickets, and a similar one showed US agricultural exports by state.


Another Mozilla Science Lab fellow was Richard Smith-Unna, a computational plant biologist from Cambridge University. His topic was massive-scale reuse of scientific outputs. He highlighted the number of scientific publishing bodies that control the market. He argued that an aspiring scientist entering a field has no realistic way of reading everything that has been published on it. There are thousands of publications, far more than one person can work through, which makes it hard to find the sources that comprehensively cover a given research topic. To illustrate, he took the role of someone studying yellow fever. Using an API, he ran a query and selected a few of the results to download as PDFs, since there were too many for anyone to read even over several months. From that data he generated a plot of the top twenty genes associated with yellow fever. The top five genes were mentioned noticeably more often than the rest across all of the papers combined. Someone studying yellow fever could therefore focus on those top genes first instead of going through millions of papers. The method is not limited to yellow fever and can be applied to a wide variety of topics. It would help reduce both the time spent and the overdependence on the big scientific publishers that control the sector.
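The counting step behind that kind of plot can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling Richard used: it assumes the papers have already been downloaded and converted to plain text, and the function names, the sample text and the gene names (`GENE_A` and so on) are hypothetical placeholders.

```python
import re
from collections import Counter


def count_term_mentions(documents, terms):
    """Count whole-word, case-insensitive mentions of each term across all documents."""
    counts = Counter()
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        counts[term] = sum(len(pattern.findall(doc)) for doc in documents)
    return counts


def top_terms(documents, terms, n=20):
    """Return the n most-mentioned terms as (term, count) pairs, highest first."""
    return count_term_mentions(documents, terms).most_common(n)


# Hypothetical stand-ins for the text of downloaded papers and a list of gene names.
papers = [
    "GENE_A and GENE_B were both expressed; GENE_A was upregulated.",
    "No change in GENE_C, although GENE_A expression rose.",
]
gene_names = ["GENE_A", "GENE_B", "GENE_C"]

ranking = top_terms(papers, gene_names, n=2)
for gene, mentions in ranking:
    print(gene, mentions)
```

With real data, `papers` would hold the extracted text of each downloaded article and `gene_names` would come from a curated gene list; the ranked counts could then be passed to any plotting library.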


Christie Bahlai, a quantitative ecologist and researcher, spoke on Data Science in Ecology. She said that people spend a great deal of time and effort collecting data, and that the resulting ideas matter for developing the world, but once a student graduates and leaves, the data is essentially never used again. It should not be discarded, as it can help improve fields such as agriculture. Christie added that teaching is done in a 'silo' environment with little interaction with the outside world: a professor co-ordinates the undergraduates and most of the work is done within the lab. Another problem ecology researchers face is that they conform to methods used by the field's pioneers, which sometimes do not apply in a modern-day setting. For instance, a student may be required to use proprietary software and books that are not easily accessible. Because of this, any breakthrough discovery by an individual is seen as heroic. To improve things, Christie advocated re-evaluating how we train students, using project-based learning. The idea is to take real data and walk students through realistic approaches using data science on a global scale. Teaching documentation, statistics, basic data processing and plotting in this way helps students apply their skills, because they can see that what they are doing carries over to the real world.


Jason Bobe, a biotechnologist from the Icahn Institute, talked about research as it pertains to health and finding cures for diseases. He argued that when looking for cures, medical practitioners focus on the ill. They spend a great deal of time and effort collecting samples from patients and analyzing the underlying cause, which sometimes works. He offered a different approach: paying as much attention to healthy people, and even more to those who are resistant to a disease. One example was the story of Stephen Crohn, a man from Manhattan who carried a genetic mutation that made him resistant to HIV. He came forward years after realizing that he had been exposed multiple times yet showed no signs of infection. That fascinated doctors, who ran a series of tests on his blood and deduced that his cells had a mutation that prevented the virus from taking over. As a result, doctors were able to develop a medicine that produces a similar effect in other people's cells, helping to prevent the spread of the virus.

A similar case involved a man resistant to Alzheimer's. The disease was common in his family once members reached a certain age, so it came as a surprise to him when he showed no symptoms. Medical check-ups confirmed that he was indeed resistant to Alzheimer's. Jason concluded that there is much to learn from resistant individuals in tackling the ailments that affect our health. He pointed to bodies such as openhumans.org, where citizens can gather and contribute valuable data about themselves for use in medical research. By signing up, we can play an active role in eradicating diseases that would otherwise endanger the lives of many.


We'd like to thank everyone who came to the Data Science Jam, and the Mozilla Science Lab for their incredible contributions to our learning. It was a pleasure to host such brilliant minds. Look out for next year's edition and send any inquiries to [email protected]
