Most data scientists, and the organizations that employ them, don't seem to understand what data science actually is or how it is done. They jumped on the bandwagon without really understanding it, or why it should matter to them in a very visceral way.
Many organizations approach data science as though it were a marketing tool, relabeling things they already do as 'data science' simply because they involve data. That is not real data science, and it completely misses the point of engaging in it. It is the equivalent of comparing kids playing in a sandbox with the operations of the oil majors scouting for oil. The core value of data science, which appears to be overlooked, lies in the word science.
Science is not merely predictive; at its heart, it is explanatory as well as diagnostic. Science leads to engineering: a systematic, mathematical approach to creating technology solutions that exploit some natural phenomenon.
Winning Kaggle competitions is not data science, though it is a reasonable start, I suppose, even though the best models on Kaggle are actually built by machines running genetic algorithms, where natural selection drives the outcome. For all its limitations, Kaggle is certainly a good training ground for getting one's feet wet.
Data science is about understanding the underlying generative process, or mechanism, that results in the data that you observe. It is about exploiting that knowledge to derive statistically significant pockets of value, to drive operational change into an enterprise, resulting in the creation of measurable ROI. It is about systematically driving the decision-making process, in a repeatable, scalable, and iterative way.
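To make the distinction concrete, here is a minimal sketch. The scenario and every name in it are purely illustrative, not from the post: suppose the observed data are customer inter-arrival times. Curve fitting would merely smooth the numbers; understanding the generative process means positing a mechanism (here, a Poisson arrival process) and recovering its parameter from the data.

```python
import random
import statistics

random.seed(42)
rate = 2.0  # assumed "true" arrival rate of the hidden mechanism (illustrative)

# Data produced by the (normally unobserved) generative process:
# exponential inter-arrival times are the signature of Poisson arrivals.
interarrival_times = [random.expovariate(rate) for _ in range(10_000)]

# If the mechanism is modeled correctly, the maximum-likelihood estimate
# of the rate is simply the reciprocal of the mean inter-arrival time.
estimated_rate = 1.0 / statistics.mean(interarrival_times)

print(estimated_rate)
```

The point is that the estimate is interpretable and actionable (an arrival rate you can staff against), which a black-box curve fit over the same numbers would not give you.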
When you can translate business voodoo into an engineered revenue stream, you can claim you have done real data science: it means you fundamentally understand how your business works at a very granular level.
Yes, as is oft repeated, "80%+ of a data scientist's job is cleaning," but that is not some low-level, thoughtless chore. Cleaning intelligently requires you to understand the solution as you refine it iteratively, paying careful attention to what matters, why it matters, and how it matters. The word cleaning should be eliminated in favor of the word curation.
If you don't understand the endgame, you will inevitably stumble off the starting line, and then wonder why you see no results for all the work you've put in. You are constructing a well-curated data set that conforms to a certain standard of quality, so that your model reflects the simple truth you are trying to uncover, capture, or replicate. This requires some intuition about what you are modeling and its inherently complex, possibly layered, structure. Merely curve fitting and claiming that "you have a model" is barely table stakes, and certainly offers no sustainable competitive advantage. The real question is whether you understand the science of your business.
You need to know when you are throwing the baby out with the bathwater. There is a fine line between feature engineering and data cleansing: you might be cleansing out the most important stuff, the very thing telling you what is really going on! So, no, a random fresh graduate is unlikely to get this right; it's just not that simple. It is telling that many data scientists I interview don't understand that data cleansing is itself modeling in a very real sense, because to identify noise, you have to have a model of the signal! There's a reason companies still pay top dollar for the 0.1% talent pool.
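A hedged sketch of that last point: to label a point "noise," you first need a model of the signal. Here the assumed signal is a straight line, and points far from the fitted line become candidates for curation. The data, threshold, and variable names are all illustrative, not from the post.

```python
import random

random.seed(7)
xs = list(range(50))
ys = [3.0 * x + random.gauss(0, 1) for x in xs]  # signal plus mild noise
ys[10] = 500.0   # injected gross outlier
ys[40] = -200.0  # another

# First, fit the signal model (ordinary least squares for y = a*x + b):
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
b = mean_y - a * mean_x

# Only with that model in hand can "noise" even be defined: here, points
# whose residual exceeds three standard deviations of all residuals.
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
sigma = (sum(r * r for r in residuals) / n) ** 0.5
flagged = [x for x, r in zip(xs, residuals) if abs(r) > 3 * sigma]

print(flagged)
```

Note that without the fitted line there is no principled way to say which points are "dirty"; the cleansing step is inseparable from a model of what clean data should look like.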
The author, "Prad" for short, is a senior analytics executive and experienced data science practitioner with a distinguished track record of driving AI thought leadership, strategy, and innovation at enterprise scale. His focus areas are Artificial Intelligence, Machine/Deep Learning, Blockchain, IIoT/IoT, and Industry 4.0.