A two-part blog series by Pradyumna S. Upadrashta, Ph.D., Chief Data Scientist, Mastech InfoTrellis
Most data scientists, and the organizations that employ them, don’t seem to understand what data science actually is or how it is done. They jumped on the bandwagon without really understanding it, or why it matters to them in a very visceral way.
Science is not merely predictive — at its heart, it is explanatory as well as diagnostic. Science leads to engineering — a systematic mathematical approach to creating technology solutions based on the exploitation of some natural phenomenon.
Winning Kaggle competitions is not data science, though it is a reasonable start, I suppose, even though the best models in Kaggle are actually built by machines running genetic algorithms, where natural selection drives the outcome. For all its limitations, Kaggle is certainly a good training ground to get one’s feet wet.
Data science is about understanding the underlying generative process, or mechanism, that results in the data that you observe. It is about exploiting that knowledge to derive statistically significant pockets of value, to drive operational change into an enterprise, resulting in the creation of measurable ROI. It is about systematically driving the decision-making process, in a repeatable, scalable, and iterative way.
When you can translate business voodoo into an engineered revenue stream — that is when you can claim you have done real data science — it means you fundamentally understand how your business works at a very granular level.
Yes, “80%+ of the job of a data scientist is cleaning,” as is oft repeated, but that isn’t some low-level, thoughtless job. Cleaning intelligently requires you to understand the solution as you refine it iteratively, paying careful attention to what matters, why it matters, and how it matters. The word “cleaning” should be eliminated in favor of the word “curation.”
If you don’t understand the endgame, you will inevitably botch the launch off the starting line, and then wonder why you don’t see any results for all the work you’ve put in. You are constructing a well-curated data set that conforms to a certain standard of quality, to ensure that your model reflects the simple truth you are trying to uncover, capture, and/or replicate. This requires some intuition about what you are modeling and its inherently complex, possibly layered, structure. Merely curve fitting and claiming that “you have a model” is barely table stakes, and certainly offers no sustainable advantage over your competition. The real question is whether you understand the science of your business.
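To make the point about curation concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the data are synthetic (a steady growth trend with a small wobble and one injected bad reading), and the helper names `zscore_outliers` and `fit_line` are hypothetical. It compares a "cleaning" rule that knows nothing about the underlying process with one that first fits a simple model of that process and only then asks what looks unusual.

```python
import statistics


def zscore_outliers(values, threshold):
    """Return the indices whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Synthetic, assumed data: growth of 2 units per period plus a small
# deterministic wobble, with one genuinely anomalous reading at index 25.
xs = list(range(50))
ys = [2 * x + ((x * 7) % 5 - 2) * 0.5 for x in xs]
ys[25] += 15

# "Cleaning" with no model of the signal: z-score on the raw values.
naive_flags = zscore_outliers(ys, threshold=1.5)

# "Curation" with a model of the signal: fit the trend first, then
# apply the same rule to what the trend does not explain.
slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
model_flags = zscore_outliers(residuals, threshold=1.5)

print("flagged without a signal model:", naive_flags)
print("flagged with a signal model:   ", model_flags)
```

Under these assumptions, the rule applied to raw values flags the earliest and latest periods of the trend, which are real signal, and misses the bad reading at index 25. The same rule applied to the residuals flags only index 25. The threshold of 1.5 and the straight-line model are arbitrary choices for this sketch; a real business process would need its own model of the signal.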
You need to know when you are throwing out the baby with the bathwater. There is a fine line between feature engineering and data cleansing: you might just be cleansing out the most important stuff, the stuff that is telling you what is really going on! So, NO, any random fresh graduate is unlikely to get this right; it’s just not that simple. It is telling that many data scientists I interview don’t understand that data cleansing is also modeling in a very real sense, because to identify noise, you have to have a model of the signal! There’s a reason why companies still pay top dollar for the 0.1% talent pool.
…to be continued.
The author, “Prad” for short, is a senior analytics executive and experienced data science practitioner with a distinguished track record of driving AI thought leadership, strategy, and innovation at enterprise scale. His focus areas are Artificial Intelligence, Machine/Deep Learning, Blockchain, IIoT/IoT, and Industry 4.0.