The Role of Data Science in Research
You’ve probably heard of Machine Learning, Cluster Computing, and the most famous of them all, Big Data Analytics. While some of these concepts have been around for years, it’s only recently that they’ve been put together to create the field of Data Science, a blessing to the research community. Data science methodologies now feature at the core of research processes. Here are a few reasons why.
1. Breaking Research Boundaries
Long-established research methodologies relied on data collected from study subjects through observation or inquiry. However, with the proliferation of new data types, such as sensor data, social media data, and geospatial data, new methodologies are being developed that reach beyond previous research domains.
In 2013, iHub Research conducted a study on the viability, validity, and verification (3Vs) of crowd-sourced information during the Kenyan election. A total of 2.6 million tweets were collected with the aim of identifying ‘newsworthy’ information. Through the use of machine learning algorithms, it was possible to separate the signal from the noise and study aspects of information propagation among Kenyans on Twitter (KoT).
Currently, iHub Research’s Data Science Lab is developing data analysis tools to enable Ma3Route to track and study accidents as reported by the crowd. Because there is no centralised accident database, mining Twitter for accident information provides a new landscape for mapping traffic infractions in the country.
2. Accelerating Scientific Discoveries
At the heart of recent scientific discoveries lie computational techniques and hardware that accelerate data analysis. High energy physics and gene sequencing rely on supercomputers to discover new particles and create DNA profiles respectively. With the advent of big data on social media, more techniques have been developed to enable the analysis of large volumes of data.
iHub Research’s Data Lab set up a High Performance Computing (HPC) cluster to provide analytics capabilities for computationally intensive problems. A recurring procedure in most analytical processes undertaken at the Data Lab involves comparing data points for similarities. The execution time for these processes grows with the size of the dataset (quadratically, when every pair of data points is compared), so analysing just 1 GB of data can take days. The HPC cluster provides parallel processing, which allows tasks that would otherwise run one after another, such as similarity tests, to be executed concurrently. One such problem was encountered when removing similar or duplicate tweets from a large corpus: a process that took 3 days to execute now completes in under 10 minutes.
The Data Lab is looking to deploy the HPC cluster in other computationally intensive processes such as image analysis and GIS.
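To give a feel for why duplicate removal parallelises so well, here is a minimal Python sketch of pairwise near-duplicate detection. It is an illustration only, not the Data Lab's actual code: the choice of Jaccard similarity over word sets, the 0.8 threshold, and the names tokens, jaccard, check_pair and deduplicate are all assumptions made for the example. Each pair comparison is independent of every other, which is what lets the work be spread across many processors.

```python
import re
from functools import partial
from itertools import combinations


def tokens(text):
    """Lowercase a tweet and return its set of word-like tokens."""
    return frozenset(re.findall(r"[a-z0-9#@']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def check_pair(pair, threshold):
    """Return the index of the later tweet if the pair is similar, else None."""
    (_, first), (j, second) = pair
    return j if jaccard(first, second) >= threshold else None


def deduplicate(tweets, threshold=0.8, map_fn=map):
    """Drop every tweet that is similar to an earlier tweet in the list.

    map_fn is an assumed hook: the built-in map runs the comparisons one
    after another, while a parallel map can run them concurrently.
    """
    indexed = list(enumerate(tokens(t) for t in tweets))
    pairs = combinations(indexed, 2)  # n * (n - 1) / 2 independent comparisons
    check = partial(check_pair, threshold=threshold)
    drop = {j for j in map_fn(check, pairs) if j is not None}
    return [t for i, t in enumerate(tweets) if i not in drop]


sample = [
    "Accident on Thika Road near Muthaiga",
    "accident on thika road near muthaiga!",
    "Heavy traffic on Mombasa Road",
]
unique = deduplicate(sample)
```

On a single multi-core machine, passing the map method of a concurrent.futures.ProcessPoolExecutor as map_fn would spread the comparisons over the available cores. A cluster would need its own job-distribution layer, which this sketch does not attempt to show.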
3. Behavioural Analysis in Social Sciences
Behavioural science is the long-standing study of human behaviour through natural observation and disciplined scientific inquiry. Social interactions on the web have provided a new landscape in which to evaluate human behaviour. Analytical techniques such as sentiment analysis, subjectivity analysis and network analysis furnish a wide array of methods for probing human behaviour.
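To make the first of these techniques concrete, here is a minimal Python sketch of lexicon-based sentiment scoring. It is purely illustrative and is not drawn from any iHub Research tool: the tiny word list, the one-word negation rule, and the function name polarity are all assumptions made for the example. Real systems use far larger lexicons or trained models.

```python
import re

# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
LEXICON = {
    "good": 1, "great": 1, "love": 1, "peace": 1,
    "bad": -1, "hate": -1, "terrible": -1, "angry": -1,
}
# Words that flip the score of the lexicon word immediately after them.
NEGATORS = {"not", "no", "never"}


def polarity(text):
    """Return the mean score of lexicon words in text, in the range [-1, 1].

    Returns 0.0 when no lexicon word is present.
    """
    words = re.findall(r"[a-z']+", text.lower())
    scores = []
    for k, word in enumerate(words):
        if word in LEXICON:
            score = LEXICON[word]
            if k > 0 and words[k - 1] in NEGATORS:
                score = -score  # "not good" counts as negative
            scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0


positive = polarity("I love this great country")
negated = polarity("This is not good")
neutral = polarity("Traffic on Mombasa Road")
```

Averaging keeps the score comparable between short and long posts, at the cost of ignoring intensity, sarcasm and context.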
iHub Research’s flagship project, ‘Umati’, seeks to study the propagation of dangerous speech online. Automating the analysis makes it possible to identify the influence of a speaker, the polarity of speech, and other factors that characterise hate speech. These features are built into the Umati Data Logger (currently in alpha) and will be available to the public later this year. (You can check out the codebase here).
We also convene monthly Data Science Meetups and run an annual Data Science Jam. Here’s a list of the services we offer at the Data Lab.
Want to engage us for any of our services and stay up to date on upcoming events? Contact us via data[at]ihub[co]ke.
(Image courtesy of www.dataeconomy.com)