Today, we are surrounded by data, which is why the data science field continues to grow.
All our activities generate data, but its true potential is unlocked only when it yields insights into the future. From optimising everyday processes to solving complex machine learning problems, there are many ways in which data can be leveraged to bring value to organisations. Converting raw data into insights is what a data scientist, like me, does.
Real-life data is often messy. It could be poorly designed sheets with manually entered information, or huge volumes of machine-generated data with too many data points for the human eye to make sense of, or anything in between. The reality is that data comes in varying levels of complexity and interpretability. The first step for anyone trying to extract information, therefore, is to read, process, and cleanse the available data. Only then do we come to analysing, visualising, and drawing data-based conclusions or making predictions.
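In Python terms, that flow looks roughly like the sketch below. The file name and column names are illustrative assumptions, not actual project data.

```python
import pandas as pd

# A minimal sketch of the read -> process -> cleanse -> analyse flow
# described above. File name and column names are illustrative assumptions.
df = pd.read_excel("jobs_raw.xlsx")            # read the available data
df = df.dropna(subset=["job_name"])            # cleanse: drop incomplete rows
df["job_name"] = df["job_name"].str.strip()    # cleanse: normalise text fields
print(df.describe())                           # analyse: first summary statistics
```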
I use various tools for this process, Python, R, and Tableau being the most common: Python for its versatility, R for its statistical computing capabilities, and Tableau for its interactive dashboards and ease of deployment and use. These are my go-to tools for providing customer-specific solutions to data science problems.
For example, in my experience, reading data is one of the most underrated steps of the process, yet in terms of complexity it can be a nightmare at times. I have worked with pretty-looking but extremely disorganised data files that were very hard to read into any programming or analysis tool as they were. Even if one finally figures out a way to import such data for analysis, it is only a temporary victory, because next week, month, or year one has to read data from a similar file again. Repeating all of those steps every time a similar situation arises is unproductive, and applying the changes manually invites errors, given the limitations of human memory. A good data scientist would instead propose a more reusable and sustainable data format to the client.
Often, the people creating data files do not realise that their methods are problematic: data that looks organised and easily readable to the human eye is frequently the complete opposite for a machine. Following data science best practices while recording data eases the process of generating valuable insights from it on a regular basis. Hence, raising client awareness in this very area is one of my aims.
One of my projects required an analysis of job durations and trends. This data was recorded in Excel format; a small snapshot of it is shown below.
Now, for any visual analysis, an actual number for the job duration is needed. However, extracting that number from this kind of format is very difficult, for starters because it is inconsistent. To represent seconds, the user used several formats, including “s”, “sec”, “sec.”, and “secs”, with or without spaces after the numbers. The same goes for hours and minutes.
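To give a sense of what taming this format involves, here is a minimal Python sketch that normalises such duration strings into seconds. The set of unit spellings and the duration_to_seconds helper are illustrative assumptions based on the variants above, not the exact code used in the project.

```python
import re

# Seconds per unit for the spellings encountered. The exact set of
# variants here is an illustrative assumption based on the examples above.
UNIT_SECONDS = {
    "h": 3600, "hr": 3600, "hrs": 3600, "hour": 3600, "hours": 3600,
    "m": 60, "min": 60, "mins": 60, "minute": 60, "minutes": 60,
    "s": 1, "sec": 1, "secs": 1, "second": 1, "seconds": 1,
}

# A number, optional whitespace, a unit word, and an optional trailing
# dot (to cover entries like "sec.").
TOKEN = re.compile(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\.?")

def duration_to_seconds(raw):
    """Convert a free-form duration string to seconds, or None if unparseable."""
    total, matched = 0.0, False
    for value, unit in TOKEN.findall(raw):
        factor = UNIT_SECONDS.get(unit.lower())
        if factor is None:
            return None  # unknown unit: flag the row for manual review
        total += float(value) * factor
        matched = True
    return total if matched else None

# All of these inconsistent entries collapse to the same number: 90.0
for entry in ["90s", "90 sec", "1 min 30 secs", "1min 30 sec."]:
    print(entry, "->", duration_to_seconds(entry))
```

Once every entry is a plain number of seconds, the usual analysis and visualisation tools can be applied directly.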
Once I understood how this data is produced, and having considered the team's constraints, I recommended a new format. Now, every month, the data is recorded in the format you see below. It is clear, consistent, and easier to work with.
The above-mentioned project is a good example of the real value my analysis brought to a client. It involves generating data visualisations and monthly job statistics reports for multiple teams at the end of every month. Based on the figures and the trends they uncovered, I made suggestions on how this information could be used to bring remarkable improvements in job performance. When our client implemented some of these suggestions, I showcased the results, comparing them with the existing process. This built trust and pointed the client in the right direction.
Above, you see a snapshot of one of the jobs that runs on selected days; the change in pattern is clear. While the average duration before the changes was around 26 minutes, the average after the recommendations was around 13 minutes, 50% less. When multiple jobs are optimised collectively, the overall run-time of the entire job stream is reduced, making the whole process more efficient.
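The comparison behind those figures is straightforward once the durations are numeric, roughly as in the sketch below, where the sample values, column names, and cut-over date are all hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical run records; in the project these came from the cleaned
# monthly files. Values are chosen to mirror the averages quoted above.
runs = pd.DataFrame({
    "run_date": pd.to_datetime(["2019-03-04", "2019-03-06",
                                "2019-04-08", "2019-04-10"]),
    "duration_min": [27.0, 25.0, 13.5, 12.5],
})

CUTOVER = pd.Timestamp("2019-04-01")  # assumed go-live date of the recommendations

before = runs.loc[runs["run_date"] < CUTOVER, "duration_min"].mean()
after = runs.loc[runs["run_date"] >= CUTOVER, "duration_min"].mean()
print(f"before: {before:.0f} min, after: {after:.0f} min, "
      f"reduction: {100 * (1 - after / before):.0f}%")
# -> before: 26 min, after: 13 min, reduction: 50%
```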
I then went a step further and implemented another solution: automating this entire process of monthly report generation, with the results published for the client's use. This allows me to focus on providing more result-oriented solutions rather than on the process itself, saving around 48 man-hours.
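In outline, such an automated report step can look like the following sketch, assuming data recorded in the agreed consistent format. The file names, column names, and output layout are illustrative assumptions, not the actual pipeline.

```python
import pandas as pd
import matplotlib.pyplot as plt

# A minimal sketch of an automated monthly report step. File names,
# column names, and the output layout are illustrative assumptions.
def build_monthly_report(csv_path, out_prefix):
    df = pd.read_csv(csv_path, parse_dates=["run_date"])

    # Per-job summary statistics for the month.
    summary = df.groupby("job_name")["duration_min"].agg(["mean", "min", "max", "count"])
    summary.to_csv(f"{out_prefix}_summary.csv")

    # One duration-trend chart per job, published alongside the table.
    for job, grp in df.groupby("job_name"):
        fig, ax = plt.subplots()
        grp.sort_values("run_date").plot(x="run_date", y="duration_min",
                                         ax=ax, legend=False)
        ax.set_title(f"{job}: run duration trend")
        ax.set_ylabel("duration (min)")
        fig.savefig(f"{out_prefix}_{job}.png", bbox_inches="tight")
        plt.close(fig)

# Example: build_monthly_report("jobs_2019_04.csv", "reports/2019_04")
```

Scheduled to run at the end of each month, a step like this turns report generation into a hands-off process.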
To sum it up, as a data scientist I get to be a storyteller who paints the whole picture with data. This allows clients to make key business and operational decisions, making me a problem-solver and an integral part of their business.
Founded in 2011, On Device Solutions is a specialist IT consultancy providing enterprise mobility products and services. We are an SAP Gold Partner helping customers get anywhere and anytime access to the information stored in their enterprise systems like SAP, unlocking their information’s value.