Development of a Cloud DWH

2022-07-06 | Article | Insights

With the dramatic increase in the amount of data generated in every company, data warehouses (DWH) have become a hot topic. Data is now generated at such a high rate, and from so many different sources, that the capabilities of traditional databases can no longer keep up. A new class of enterprise system emerged to overcome this: a data warehouse is a centralised integration of different data sources used for reporting and data analysis.

The use of cloud data warehouses has increased thanks to several major advantages:

  • It is a fully managed service that allows you to focus only on the data.
    • Removes the need for hardware maintenance
  • It can scale up and down as your business grows.
    • It is cost-effective because you only pay for what you use.
  • It supports data streaming that allows you to get insights in real-time.
    • Allows the user to track any changes instantly
  • It supports machine learning initiatives to get the most out of your data.
    • Implementing AI improves predictive insights

Stages

In general, the development of a Cloud Data Warehouse consists of four distinct stages, each with its own requirements and challenges.

1. Identify sources to work with

The first stage of building a Cloud Data Warehouse is to identify all the sources that your company has and wants to consolidate into a single source of truth (a single location for aggregated data that can be accessed by the information systems). It is crucial to have a complete list in order to organise and plan how to connect the data sources to the DWH. The types of source vary depending on the market, but can generally be summarised in the following categories:

  • Data located in other public clouds
  • On-premises first-party data (e.g. financials, HR data, etc.)
  • Digital advertising platforms
  • Social media platforms
  • Customer relationship management software (CRM)
  • Web analytics tools
  • Logistics
  • Ecommerce
  • Emails

After defining the sources, there are two additional actions to consider. Firstly, you need to define which metrics and/or dimensions you want to extract from each source. Once that is defined, you need to set up the schema of the raw data in the DWH. For this, you need to know what types of variables you are working with and whether different sources interact with each other.
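
The sketch below shows one way this schema setup could look, assuming BigQuery as the target DWH; the project, dataset, table, and field names are hypothetical placeholders.

```python
# Minimal sketch of defining a raw-data schema, assuming BigQuery as the DWH.
# The project, dataset, table, and field names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")

# Describe the raw fields expected from the source, with explicit types
schema = [
    bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("customer_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("order_value", "NUMERIC"),
    bigquery.SchemaField("currency", "STRING"),
    bigquery.SchemaField("created_at", "TIMESTAMP"),
]

table = bigquery.Table("my-analytics-project.raw_ecommerce.orders", schema=schema)
# Partitioning on the event timestamp keeps later backfills and queries cheaper
table.time_partitioning = bigquery.TimePartitioning(field="created_at")
client.create_table(table, exists_ok=True)
```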

2. Data ingestion

As soon as the full list of sources is ready, it is time to decide how the data will be ingested. You can build your own connectors and data pipelines, or use third-party tools that do this on your behalf. This article focuses on the first option, as the main goal is to develop a Cloud Data Warehouse from scratch.

Challenges

Building your own data pipelines and connectors can be troublesome at first. As mentioned earlier, the data sources can be stored in many different locations, so you have to create a specific connector for each of them. That means a lot of scripts interacting with different APIs, database connections, and cloud service providers. As a result, there are several factors to take into consideration:

  • The different types of authentication you have to deal with (a minimal connector sketch follows this list)
    • API key authentication, HTTP basic authentication, OAuth, HMAC request signing, etc.
  • Connections to different types of databases
  • Orchestration of the data extraction
    • Cadence: the output of one source might be the input of another
    • Timing: the extraction of one source might need to precede another
  • Interaction with other cloud service providers
  • When and how often the extraction will happen
  • Monitoring that the scripts run on time and complete successfully
  • How far back you want to backfill your dataset
    • The time frame needs to be defined clearly
  • Real-time streams can be difficult to manage, as you are more exposed to potential failures
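
As a concrete illustration of the authentication point above, here is a minimal connector sketch using API key authentication. The endpoint, header name, and response fields are hypothetical and would need to be adapted to the actual API you are extracting from.

```python
# Minimal connector sketch using API key authentication.
# The endpoint, header name, and response fields are hypothetical.
import os

import requests

API_KEY = os.environ["EXAMPLE_SOURCE_API_KEY"]  # keep secrets out of the code
BASE_URL = "https://api.example-source.com/v1/orders"  # hypothetical endpoint


def extract_orders(start_date: str, end_date: str) -> list[dict]:
    """Pull one date range from the source API, following simple pagination."""
    rows, page = [], 1
    while True:
        response = requests.get(
            BASE_URL,
            headers={"X-API-Key": API_KEY},
            params={"start": start_date, "end": end_date, "page": page},
            timeout=30,
        )
        response.raise_for_status()
        payload = response.json()
        rows.extend(payload.get("results", []))
        if not payload.get("next_page"):
            break
        page += 1
    return rows
```

Each additional source would get its own connector along these lines, with its own authentication scheme and pagination rules.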

Actively managing the way you bring your data together makes the challenges above easier to approach. The recommended way to tackle them is to use an orchestration tool such as Airflow, which gives you the ability to schedule, monitor, backfill, and orchestrate tasks, launch pods in a Kubernetes cluster, and much more.

In addition, you can deploy fully managed services in the Cloud so that you do not have to manage the underlying resources or infrastructure yourself. For instance, Cloud Composer (on Google Cloud) and Amazon Managed Workflows for Apache Airflow (on AWS) are the fully managed Airflow offerings. These tools allow you to tackle the challenges above in a centralised manner.
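
As a rough sketch of what such orchestration can look like, the DAG below schedules two hypothetical extraction tasks daily, retries them on failure, and enforces an ordering between them; the task callables and start date are placeholders.

```python
# Minimal Airflow DAG sketch for daily extractions; callables and dates are
# hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_crm(**context):
    ...  # call your CRM connector here


def extract_ads(**context):
    ...  # call your advertising platform connector here


with DAG(
    dag_id="daily_ingestion",
    start_date=datetime(2022, 1, 1),  # also the point backfills start from
    schedule_interval="@daily",       # when and how often extraction runs
    catchup=True,                     # re-run missed intervals (backfill)
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    crm = PythonOperator(task_id="extract_crm", python_callable=extract_crm)
    ads = PythonOperator(task_id="extract_ads", python_callable=extract_ads)
    crm >> ads  # the CRM extraction must precede the ads extraction
```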

Using real-time data is challenging, but depending on your use case you can build a single ingestion pipeline for these events, and there are dedicated services for tackling real-time data: for example, a real-time messaging service such as Pub/Sub or Kafka, or direct ingestion into the DWH if it supports it (e.g. BigQuery's streaming API can ingest millions of events per second).
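
As a hedged example of the second option, the snippet below streams events straight into BigQuery with its streaming insert API; the table and event fields are hypothetical.

```python
# Sketch of streaming events directly into the DWH via BigQuery's streaming
# insert API; the table and event fields are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-analytics-project.raw_events.pageviews"

events = [
    {"user_id": "u-123", "page": "/checkout", "event_time": "2022-07-06T10:15:00Z"},
]

errors = client.insert_rows_json(table_id, events)  # rows become queryable almost immediately
if errors:
    # Failed rows come back with error details; handle or retry them here
    raise RuntimeError(f"Streaming insert failed: {errors}")
```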

3. Data processing

Data must be processed and transformed before it can be analysed with a business intelligence platform. This is when data is prepared for analysis: during this phase data can be added, removed, aggregated, or even edited. For many years the standard order of operations for data delivery was ETL (extract, transform, load). However, in Data Warehouses hosted in the Cloud, the pipeline can be converted to ELT (extract, load, transform), and the transformation can be done purely in SQL inside the DWH (a minimal sketch follows the list below). There are many advantages to this approach:

  • Raw data retention creates a rich historical archive.
  • Transformation happens only on the subset of the raw data you need.
  • Works well with large datasets.
  • Supports real-time and flexible data analysis.
  • Lower cost and maintenance.
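
To make the ELT idea concrete, here is a minimal sketch assuming BigQuery: the raw data has already been loaded, and the transformation runs as SQL inside the DWH. The dataset, table, and column names are hypothetical.

```python
# Minimal ELT sketch: the transformation is plain SQL executed inside the DWH
# (BigQuery assumed); dataset, table, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

transform_sql = """
CREATE OR REPLACE TABLE analytics.daily_revenue AS
SELECT
  DATE(created_at)            AS order_date,
  currency,
  SUM(order_value)            AS revenue,
  COUNT(DISTINCT customer_id) AS customers
FROM raw_ecommerce.orders
WHERE created_at >= TIMESTAMP('2022-01-01')
GROUP BY order_date, currency
"""

client.query(transform_sql).result()  # wait for the transformation to finish
```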

There are still instances where you will need to use the ETL order, for example when dealing with personally identifiable information (PII), which needs to be removed or encrypted prior to the loading phase. Fortunately, there are good cloud-based ETL tools such as Apache Beam and Apache Spark that can be used for both batch and stream processing.
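
The sketch below illustrates this ETL exception with Apache Beam: a hypothetical PII field is masked before the data ever reaches the load step. The file paths and field names are placeholders, and the same pipeline shape works for batch or stream processing depending on the source.

```python
# Apache Beam sketch of the ETL exception: PII is masked before loading.
# Paths and field names are hypothetical placeholders.
import hashlib
import json

import apache_beam as beam


def mask_pii(record: dict) -> dict:
    """Replace the email with an irreversible hash before the load step."""
    record["email"] = hashlib.sha256(record["email"].encode()).hexdigest()
    return record


with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read raw export" >> beam.io.ReadFromText("gs://example-bucket/crm/*.json")
        | "Parse JSON" >> beam.Map(json.loads)
        | "Mask PII" >> beam.Map(mask_pii)
        | "Serialise" >> beam.Map(json.dumps)
        | "Write clean files" >> beam.io.WriteToText("gs://example-bucket/clean/crm")
    )
```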

Challenges

As with any phase in the development of a cloud data warehouse, there are a couple of challenges, the most important of which are data harmonisation and consistency. Data harmonisation refers to the work needed to combine data from different sources into a comparable view so that users can extract more value from it. This is not an easy task, and it can only be accomplished with a clear and detailed understanding of the data from the various sources, typically built up through multiple meetings with the different parties involved (e.g. stakeholders).

Data consistency refers to whether the data is represented the same way across all sources and is non-redundant. When aggregating data sources, you want to make sure that you are not duplicating data or introducing errors. For example, you might have financial data in different currencies that cannot be joined without a proper conversion, or timestamps recorded in different units (e.g. hours vs minutes). Again, having a detailed understanding of the data will help you limit such errors.
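
A tiny sketch of such a consistency fix is shown below: currency values and time units are normalised before sources are joined. The exchange rates and field names are purely illustrative; in practice the rates would come from a reference table in the DWH.

```python
# Illustrative consistency fix before joining sources: normalise currency and
# time units so values are comparable. Rates and field names are placeholders.
EUR_RATES = {"EUR": 1.0, "USD": 0.95, "GBP": 1.17}  # hypothetical daily rates


def normalise(record: dict) -> dict:
    record["order_value_eur"] = record["order_value"] * EUR_RATES[record["currency"]]
    record["duration_minutes"] = record["duration_hours"] * 60  # hours -> minutes
    return record
```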

4. Machine learning insights and activation

At this point, the data should be consolidated and free of redundancies and errors, meaning that it is aligned and stored in the DWH. This allows us to leverage AI platforms to implement machine learning algorithms on top of the DWH. These platforms are what add value and allow us to perform many types of analyses depending on the use case: for example, predicting the lifetime value of a specific user or the conversion probability of a customer, among many others.
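
As one hedged illustration, the sketch below trains a conversion-probability model directly inside the DWH with BigQuery ML and then queries its predictions. The dataset, table, and column names are hypothetical, and the label column `converted` is assumed to hold 0/1 values.

```python
# Sketch of training and querying a conversion model inside the DWH, assuming
# BigQuery ML; dataset, table, and column names are hypothetical, and the
# label column `converted` is assumed to hold 0/1 values.
from google.cloud import bigquery

client = bigquery.Client()

train_sql = """
CREATE OR REPLACE MODEL `analytics.conversion_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['converted']) AS
SELECT sessions, pageviews, days_since_signup, converted
FROM `analytics.user_features`
"""
client.query(train_sql).result()

predict_sql = """
SELECT
  user_id,
  (SELECT p.prob
   FROM UNNEST(predicted_converted_probs) AS p
   WHERE p.label = 1) AS conversion_probability
FROM ML.PREDICT(MODEL `analytics.conversion_model`,
                (SELECT * FROM `analytics.user_features_today`))
"""
for row in client.query(predict_sql).result():
    print(row.user_id, row.conversion_probability)
```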

Another benefit of DWHs is that they allow you to present meaningful content in a much more visual way: weekly reports, interactive dashboards, and visual representations for stakeholders, among many other things. Additionally, you can use these insights to aid your decision making and to set the parameters for activation activities. This is especially helpful for marketing and running ads, as it lets you narrow your target audience. All of this can be automated to streamline the process.

Final outcome

The increased demand for cloud DWHs is driven by the reduction in complexity they bring compared with on-premises systems. Combined with the growth in data output rates and the processing power of cloud services, this gives DWHs many advantages: focus, scalability, cost reductions, improved speed and performance, and better integration with machine learning and marketing. There are of course still challenges associated with DWHs, and these become more apparent as the number and variety of sources increases: more scripts are needed to integrate the different sources at the ingestion stage, and there is greater potential for data harmonisation and consistency issues at the processing stage.

Nevertheless, proactively tackling these problems and implementing the right tools greatly lowers the chance of them occurring and makes implementation much easier. Firstly, implementing the right orchestration tools and fully managed services in your cloud DWH will make setting up all of the data links and API calls easier. Secondly, having a good understanding of what you want to get from the data and which fields are necessary will make processing much more efficient. The goal of all of this is to allow your machine learning and activation algorithms and scripts to do their intended job efficiently, which in turn helps the organisations using the DWH achieve their goals.
