Data Gathering, Cleanup and Analysis

It is often said that data is to the 21st century what oil was to the 20th. But just as crude oil must be refined before it is useful, raw data must be refined before it yields usable insight. Several steps lie between raw data and that insight.

One usually starts with data that is structured, semi-structured or unstructured. Structured data, as the name suggests, follows a fixed schema and is easy to make sense of; it typically lives in database tables or in structured files such as Excel sheets. Semi-structured data also has some structure but is not as clean, for example JSON or XML documents. Everything else is unstructured: log files, video and audio streams, free text and so on.
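As a quick illustration, here is how these three shapes of data might look in Python. This is only a sketch; the CSV contents, JSON record and log format are made up for the example:

```python
import io
import json

import pandas as pd

# Structured: tabular data maps straight onto rows and columns.
csv_text = "name,amount\nalice,120\nbob,80\n"
df = pd.read_csv(io.StringIO(csv_text))

# Semi-structured: JSON follows a structure, but fields can be nested
# or missing, so it needs more interpretation than a flat table.
record = json.loads('{"user": "alice", "tags": ["a", "b"], "age": null}')
print(record["tags"])    # a list nested inside one field

# Unstructured: a raw log line needs custom parsing to become usable.
log_line = "2024-05-01 12:00:03 ERROR disk full on /dev/sda1"
timestamp, level = log_line[:19], log_line.split()[2]
print(timestamp, level)
```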

Once we have the data, the first step is to filter out what is useful. The next step is to clean it up: replacing null values with meaningful ones (using the mean, median or mode), tidying text data, extracting keywords and tags, and so on. With the relevant data cleaned up (tidy data along with its metadata), we then need to draw a good sample set for our study.
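A minimal pandas sketch of the null-handling step; the DataFrame below is invented sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 31, 40, None],
    "income": [50_000, 62_000, None, 58_000, 61_000],
    "city":   ["Pune", "Delhi", None, "Delhi", "Pune"],
})

# Numeric columns: replace nulls with the median or mean.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].mean())

# Categorical columns: the mode (most frequent value) is a common choice.
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```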

A good example of sampling is the exit polls we see around election time. When predicting election results, a news agency cannot ask every voter, so it has to gather opinions from a sample set. Choosing that sample correctly is crucial for accurate results: if a constituency has 30% high-income, 40% middle-income and 30% low-income voters, but the agency draws 70% of its sample from the high-income group, its prediction will certainly be skewed.
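One way to avoid that skew is stratified sampling: draw the same fraction from each group so the sample mirrors the 30/40/30 split. Here is a pandas sketch on a made-up voter table:

```python
import pandas as pd

# Hypothetical voter table with an income_group column.
voters = pd.DataFrame({
    "voter_id": range(1000),
    "income_group": ["high"] * 300 + ["middle"] * 400 + ["low"] * 300,
})

# Take 10% from each income group, so the sample keeps the
# constituency's 30/40/30 proportions instead of over-weighting one group.
sample = voters.groupby("income_group").sample(frac=0.10, random_state=42)

print(sample["income_group"].value_counts(normalize=True))
```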

Another important aspect is data visualisation. You need to find the right form of visualisation so that the data reaches, and makes sense to, all stakeholders. Histograms, scatter plots, box plots, strip charts and the like are commonly used for this.
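A small matplotlib sketch of three of these chart types, drawn on randomly generated data:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=500)   # made-up measurements
x, y = rng.random(100), rng.random(100)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].hist(values, bins=30)      # distribution of a single variable
axes[0].set_title("Histogram")

axes[1].scatter(x, y, s=10)        # relationship between two variables
axes[1].set_title("Scatter plot")

axes[2].boxplot(values)            # median, quartiles and outliers
axes[2].set_title("Box plot")

plt.tight_layout()
plt.show()
```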

We have plenty of tools to help with the whole gathering, cleanup, analysis and visualisation process. For example, Kafka can bring the data into the system; Hadoop MapReduce or Spark can handle huge volumes of data and apply the relevant cleaning and organising algorithms; tools like Pig help clean the data while Hive stores it; and finally Mahout, R or Python read the data and produce the results used by stakeholders.
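As a rough sketch of the Spark step in such a pipeline, here is a minimal PySpark job. The inline sample data and column names are invented; in a real system the input might arrive via Kafka and land on HDFS first:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cleanup-demo").getOrCreate()

# Inline sample standing in for data ingested via Kafka / HDFS;
# the columns user_id and score are made up for this sketch.
events = spark.createDataFrame(
    [("u1", 10.0), ("u1", None), (None, 5.0), ("u2", 8.0)],
    ["user_id", "score"],
)

# Drop rows with no user id, fill missing scores with a default, and
# aggregate per user -- the kind of cleaning/organising step Spark
# distributes across a cluster when the data is huge.
cleaned = (
    events.dropna(subset=["user_id"])
          .fillna({"score": 0.0})
          .groupBy("user_id")
          .agg(F.avg("score").alias("avg_score"))
)

cleaned.show()
```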