Data Generation, Analysis, and Usage - Current Scenario
The last decade has seen an exponential increase in the data being generated from traditional as well as non-traditional data sources. An International Data Corporation (IDC) report says that the data generated in the year 2020 alone will be a staggering 40 zettabytes, which would constitute a 50-fold growth from 2010. The data generated per day has increased to 2.5 quintillion bytes, and with the advent of innovations like the Internet of Things, it is poised to grow even more rapidly. This increase in data generation, coupled with a growing ability to store the various types of data being generated, has resulted in a vast repository of data which is now available for scrutiny.
According to reports by wealth management firm Merrill Lynch, 80 percent of business-relevant information among all this data originates in unstructured form. Unstructured data refers to information which either does not conform to a pre-defined data model or is not organized in a pre-defined manner. It could be images, videos, emails, social media data or even sonar readings. Essentially, these are data points which cannot be captured in our traditional relational databases.
Analysis of Unstructured Data
As the ability to store varied data increased, so did our ability to analyze it and derive actionable insights from it. Companies started realizing the significance of analyzing unstructured data along with structured data and started investing more in it. As a result, the potential benefits which could be harnessed from this previously unused data became more apparent. The personalized loan offerings from banks, the customized offers from e-commerce sites and the exclusive loyalty discounts offered by retail chains are just a few examples of how organizations have started diving deep into unstructured data to come up with tailored offerings.
This blog post brings out the significance of two data storage repositories, namely the Data Warehouse and the Data Lake, compares them, and suggests the approach to adopt based on the implementation decision and architecture.
Traditional Data Warehouse Challenges
Storage and Performance:
A Data Warehouse is a conceptual architecture that helps to store structured, subject-oriented, time-variant, non-volatile data for decision making. Historical as well as real-time data from various sources are transformed and loaded in a structured form.
While a traditional Data Warehouse can act as a master repository for all the structured data across the organization, its inability to store unstructured data prevents it from acting as a unified data source for analytics, thereby hampering its ability to garner value from such huge volumes of data. Because unstructured data constitutes such a large chunk of business-related information, enterprises can no longer afford to neglect it, and leaving this data out of the purview of analytics could prove detrimental for companies.
Also, with the exponential increase in the data being generated each day, storing this data in traditional databases could prove expensive for organizations. And as a result of such large volumes of data being stored, performance also suffers unless we invest more heavily in hardware.
From an implementation standpoint, one of the main challenges a data warehousing project poses pertains to data quality. When we try to combine inconsistent data from disparate sources, the result is often duplicates, missing data and logical conflicts. Varied levels of standardization across different databases also add to the issue. These problems surface at a later stage and result in faulty reporting and analytics, thereby affecting optimal decision making.
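The data quality issues described above can be made concrete with a small profiling sketch. The records, field names and the `profile_quality` helper below are hypothetical and purely illustrative; a real warehouse project would typically rely on a dedicated data quality or ETL tool rather than hand-written checks.

```python
from collections import Counter

# Hypothetical customer records from two source systems (illustrative data only).
crm_records = [
    {"customer_id": "C001", "email": "ann@example.com", "country": "US"},
    {"customer_id": "C002", "email": None, "country": "IN"},
]
billing_records = [
    {"customer_id": "C001", "email": "ann@example.com", "country": "USA"},
    {"customer_id": "C003", "email": "raj@example.com", "country": "IN"},
]


def profile_quality(records):
    """Report duplicate keys, missing values and conflicting values per key."""
    key_counts = Counter(r["customer_id"] for r in records)
    duplicates = sorted(k for k, n in key_counts.items() if n > 1)

    missing = sorted(
        (r["customer_id"], field)
        for r in records
        for field, value in r.items()
        if value is None or value == ""
    )

    # A logical conflict: the same key carries different values for one field,
    # for example because the sources use different country code standards.
    seen = {}
    conflicts = set()
    for r in records:
        for field, value in r.items():
            if field == "customer_id" or value in (None, ""):
                continue
            previous = seen.setdefault((r["customer_id"], field), value)
            if previous != value:
                conflicts.add((r["customer_id"], field))

    return {
        "duplicates": duplicates,
        "missing": missing,
        "conflicts": sorted(conflicts),
    }


report = profile_quality(crm_records + billing_records)
print(report)
```

Here the combined data set contains one duplicated customer, one missing email and one conflict caused by inconsistent country codes, which are the three kinds of issue named in the paragraph above.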
By virtue of having data from across different databases, data warehouse projects often cater to varied reports and analytics as per user demand. Since data warehouses are ‘schema on-write’, such reporting and analytics need to be taken into design considerations upfront, because we need to define the schema before loading data into the databases. However, envisioning all such reports at the outset might be difficult for business users who are not exposed to the capabilities of the tools, and this will often result in rework for the technical team.
Because data warehouse projects are structure driven, they do not adapt easily to change. The effort and resources required to adapt to any such change are invariably high and will most likely drive up the cost significantly. For instance, if a new business requirement emerges at a later point and fundamentally changes the original data structure, it would necessitate remodeling the Data Warehouse, which can be extremely time-consuming.
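To illustrate what ‘schema on-write’ means in practice, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for a warehouse database. The `sales` table and its columns are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema on write: the table structure must exist before any data is loaded.
conn.execute(
    """
    CREATE TABLE sales (
        order_id INTEGER PRIMARY KEY,
        region   TEXT NOT NULL,
        amount   REAL NOT NULL
    )
    """
)
conn.execute(
    "INSERT INTO sales (order_id, region, amount) VALUES (?, ?, ?)",
    (1, "EMEA", 120.0),
)

# A record carrying a new attribute ('channel') is rejected
# because the schema does not know about it yet.
try:
    conn.execute(
        "INSERT INTO sales (order_id, region, amount, channel) VALUES (?, ?, ?, ?)",
        (2, "APAC", 80.0, "web"),
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# The schema has to be changed first; only then can the new data be loaded.
conn.execute("ALTER TABLE sales ADD COLUMN channel TEXT")
conn.execute(
    "INSERT INTO sales (order_id, region, amount, channel) VALUES (?, ?, ?, ?)",
    (2, "APAC", 80.0, "web"),
)

rows = conn.execute(
    "SELECT order_id, channel FROM sales ORDER BY order_id"
).fetchall()
print(rejected, rows)
```

Adding a single column is cheap in this toy case. In a real warehouse the same change usually ripples through ETL jobs, downstream reports and historical data, which is where the rework comes from.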
How can Data Lake solve some of these challenges?
Storage, Performance, and Analytics:
An enterprise-level Data Lake not only enables democratized access to data via a single, unified view of data across the organization, but its ability to store a vast amount of varied data (structured and unstructured) also makes it an ideal foundation for performing advanced analytics. The availability of complex data and the ability to process it enable the discovery of correlations and potential causalities across seemingly unrelated data sets, thereby producing more precise and higher quality analytics models.
The key feature which differentiates it from traditional data warehouses is its ability to store data regardless of type or size. So even if your enterprise is generating a huge amount of unstructured or semi-structured data, you can rest assured that a Data Lake will be able to store this data in its native form. Some types of data stored in the Data Lake might not be relevant at present, but because we have the capability to store them now, it opens up the possibility of processing them in the future, which could potentially create more value. Also, because a Data Lake provides such a large amount of storage space, data typically does not need to be deleted from it. This, in turn, helps if we have to go back to any point in time to do a detailed analysis.
Cost Advantage and Scalability:
Furthermore, data lakes are designed for low-cost storage of huge volumes of data and thereby provide a more economical way of storing it. A Data Lake can provide such large storage at minimal cost for two reasons. First, most of the underlying data technologies are open source software, which does away with licensing costs and benefits from community support. Second, most of these technologies are designed to be installed on low-cost commodity hardware. Off-the-shelf servers combined with cheap storage make a Data Lake more easily scalable than a traditional data warehouse. The processing of data is also decentralized and distributed across multiple servers, thereby enabling faster processing.
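The idea of decentralized processing can be sketched in a few lines: data is split into partitions, each partition is processed independently, and the partial results are merged. In the sketch below, threads on one machine stand in for separate servers, and the partitions and event names are made-up illustrations. It is not how any particular Big Data framework is implemented.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical event data, already split into partitions
# (in a real cluster each partition would live on a different server).
partitions = [
    ["view", "buy", "view"],
    ["buy", "buy"],
    ["view"],
]


def count_partition(events):
    """Process one partition independently of all the others."""
    return Counter(events)


# Each worker handles one partition; threads stand in for servers here.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_partition, partitions))

# Merge the partial results into the final answer.
total = sum(partial_counts, Counter())
print(total)
```

Because no partition depends on another, adding more workers (or servers) lets the same job handle more data, which is the scalability argument made above.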
Defining your Schema:
Since a Data Lake follows ‘schema on-read’ and not ‘schema on-write’, it is not required to define a schema before the data is ingested into the data lake. The raw data can be loaded as-is, and we give it shape and structure only when we are ready to use it. The schema is hence defined ad hoc when users query the data. And because the data is available in its unadulterated form, before it is transformed, cleansed or structured, it lends itself to novel ways of exploration, thereby enabling users to get to their results faster than with the traditional data warehouse approach.
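A minimal sketch of ‘schema on-read’ follows. The raw JSON events, the field names and the `read_with_schema` helper are all hypothetical. The point is only that the raw lines are stored untouched and each query brings its own schema.

```python
import json

# Raw events stored as-is in their native form (illustrative data only).
# Note that the records do not all share the same fields.
raw_lines = [
    '{"user": "u1", "action": "view", "price": "19.99", "device": "mobile"}',
    '{"user": "u2", "action": "buy", "price": "5"}',
    '{"user": "u1", "action": "buy", "price": "19.99", "coupon": "SAVE10"}',
]


def read_with_schema(lines, schema):
    """Apply a schema (field name -> converter) at read time.

    Fields absent from a raw record come back as None.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: (cast(record[field]) if field in record else None)
            for field, cast in schema.items()
        }


# Query 1: a revenue question defines the schema it needs.
purchase_schema = {"user": str, "action": str, "price": float}
purchases = [
    r for r in read_with_schema(raw_lines, purchase_schema) if r["action"] == "buy"
]
revenue = sum(r["price"] for r in purchases)

# Query 2: a later question reads the same raw data with a different schema.
coupon_schema = {"user": str, "coupon": str}
coupons = [
    r for r in read_with_schema(raw_lines, coupon_schema) if r["coupon"] is not None
]

print(revenue, coupons)
```

The second query uses a field (`coupon`) that nobody planned for when the data was ingested, and no reload or remodeling was needed to answer it.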
Data Lake Benefits:
As more and more enterprises embrace the Data Lake, we are able to see the results firsthand. A retail chain can identify the behavior pattern and spending capacity of a regular customer and offer them an exclusive additional discount on a product they are likely to be interested in, and a clothing store can display the exact dress a customer has been looking up on its website. These examples provide just a glimpse of what a managed Data Lake coupled with intelligent analytics can do.
Problems associated with Data Lake
With all this, are we saying the Data Lake is a bed of roses? Nope! It comes with its own set of challenges. Often organizations are so keen to join the Big Data bandwagon that they do not identify the exact use case which a Data Lake is going to address. Just dumping data from various sources into the Data Lake does not add any value. Instead, it becomes a Data Swamp, and it becomes extremely challenging to extract any value from it. Hence it is important to recognize the business problem we are trying to solve by using a Data Lake.
Another pitfall that people often miss is that a Data Lake, by its nature, provides tangible benefits only when multiple verticals or lines of business come together for its implementation. It makes more sense when cross-functional data is available in the lake, so that the business insights derived have a 360-degree view. However, in several cases only one domain or team is involved in the implementation of the Data Lake.
Retrieving information from a Data Lake is also more difficult than from a Data Warehouse, as the data is unstructured and not organized. And since Big Data technologies are relatively new, their security aspects are still maturing. Although there have been significant improvements during the last few years, some clients still have apprehensions regarding their data security.
While the above-mentioned factors do pose a few concerns, they are set to be addressed as the Big Data ecosystem evolves rapidly. The advantages that a Data Lake brings to the table far outweigh those of traditional data warehouses. Its reliance on commodity hardware and open source software ensures that it will continue to be the more economical option, and its potential to answer business-critical questions faster will make it difficult for organizations to look away. Industry experts believe that, in the short run, we will see more enterprises moving to a combination of Data Warehouses and Data Lakes, while the Data Warehouse might continue to be the preferred option for dealing with structured data alone. In the longer run, however, Data Lakes are expected to dominate once the Big Data ecosystem has evolved further and matured.
About the Author
Vaisakh, Project Manager at Mastech InfoTrellis, has five years of overall experience and has managed complex Big Data and Master Data Management projects.