Traditional data warehouses require significant time and resource to administer, especially for large datasets. In addition, the financial cost associated with building, maintaining, and growing self-managed, on-premise data warehouses is very high. As your data grows, you need to always exchange-off what data to stack into your data warehouse and what data to archive in storage so you can oversee costs, keep ETL complexity low, and deliver good performance.
This blog post details on how Amazon Redshift can make a significant impact in lowering the cost and operational overheads of a data warehouse, how to get started with Redshift, what are the steps involved in migration, prerequisites for migration, post migration activities.
Key business and technical challenges faced:
- What kind of analysis do the business users want to perform?
- Do you currently collect the data required to support that analysis?
- Where is the data?
- Is the data clean?
- What is the process for gathering business requirements?
Data Quality –Data comes from many disparate sources of an organization. When a data warehouse tries to combine inconsistent data from disparate sources, it runs into errors. Inconsistent data, duplicates, logic conflicts, and missing data all result in data quality challenges. Poor data quality results in faulty reporting and analytics necessary for optimal decision making.
Understanding Analytics – When building a data warehouse, analytics and reporting will have to be taken into design considerations. In order to do this, the business user will need to know exactly what analysis will be performed. Envisioning these reports is a great challenge.
Quality Assurance – The end user of a data warehouse makes use of Big Data reporting and analytics to make the best decisions possible. Consequently, the data must be 100 percent accurate. This high reliance on data quality makes testing a high priority issue that will require a lot of resources to ensure the information provided is accurate. Successful STLC process has to be completed which is a costly and time intensive process.
Performance –A data warehouse must be carefully designed to meet overall performance requirements. While the final product can be customized to fit the performance needs of the organization, the initial overall design must be carefully thought out to provide a stable foundation from which to start.
Designing the Data Warehouse – Lack of clarity in defining what is expected from a data warehouse by the business users’ result in miscommunication between the business users and the technicians building the data warehouse. Hence the expected end results are not delivered to the user which calls for fixes after delivery adding up to the existing development fees.
User Expectation – People are not keen to changing their daily routine especially if the new process is not intuitive. There are many challenges to overcome to make a data warehouse that is quickly adopted by an organization. Having a comprehensive user training program can ease this hesitation but will require planning and additional resources.
Cost – Building a data warehouse in house to save money though a great idea has multitude of hidden problems. The required levels of skill sets to deliver effective result is not feasible with few experienced professionals leading a team of non-BI trained technicians. The do it yourself efforts turn out costlier than expected.
Data Structuring and Systems Optimization - As you add more and more information to your warehouse; structuring data becomes increasingly difficult and can slow down the process significantly. In addition, it will become difficult for the system manager to qualify the data for analytics. In terms of systems optimization, it is important to carefully design and configure data analysis tools.
Selecting the right type of Warehouse – Choosing the right type of warehouse from the variety of warehouse types available in the market is challenging. You can choose a pre-assembled or customized warehouse. Choosing a custom warehouse saves time building a warehouse from various operational databases, but pre-assembled warehouses save time on initial configuration. Depending on the business model and specific goals the choice has to be made.
Data Governance and Master Data – Information being one of the crucial assets should be carefully monitored. Implementing data governance is mandatory because it allows organizations to clearly define ownership and ensures that shared data is both consistent and accurate.
Redshift is a managed data warehousing and analytics service from AWS, It will make it easy for developers and businesses to set up, operate and scale a clustered relational database engine suitable for complex analytic queries over large data sets. It is fast, utilizing columnar technology and compression to reduce IOs and spreading data across nodes and spindles to parallelize execution. It is disruptively cost-efficient, removing software licensing costs and supporting a pay-as-you-go and grow-as-you-need model. It is a managed service, greatly reducing the hassles of monitoring, backing up, patching and repairing a parallel, distributed environment. It is standards-based, using PostgreSQL as the basic query language and JDBC/ODBC interfaces, enabling a variety of tool integrations.
Amazon Redshift also includes Amazon Redshift Spectrum, allowing you to directly run SQL queries against exabytes of unstructured data in Amazon S3. No loading or transformation is required, and you can use open data formats.
For more details on Amazon Redshift Spectrum refer https://aws.amazon.com/redshift/spectrum/
Why Amazon Redshift?
- Cost optimization models
- Managed Service (Administration)
- Highly Secure (Enterprise Grade Encryption)
- Quicker Migration Time
- 160 + Security and compliance features
- Faster with better performance
- Ease of use
- No upfront investment
- Pay as you go
- Automatic patching and backing up
- Back-up storage for user defined retention period
Stay tuned for Part2 of this 3 part series on Data Warehouse Migration to Amazon Redshift. Please send us a note with your queries and feedback
About the Author
Prabhu Chennupati is an avid technologist with extensive experience in enterprise application design and development. He leads the technical teams in Master Data Management (MDM) space spread across multiple continents, delivering high quality solutions on-time.
Asha Shyam Sundar is a Senior Technical Writer at Mastech InfoTrellis and has 13 plus years of experience in technical documentation spanning across various technologies.