Prof Martin Kersten
A Cloud Data Warehouse Diagnosis
For over a decade, Cloud-enabled database systems have emerged. The generic architecture is:
The persistent store is the Cloud global file store, such as S3, GFS and Azure Files. It is the place where all data comes to rest. The middle layer consists of an adaptive cluster of machines to do the actual analytical work. They take any shape and reliability. An attached SSD storage device can act as a semi-permanent cache. The top layer controls the complete cluster, regulates user access, and balances the load.
The key selling point is that all users can see the complete database for querying.
It is easy to sign up for the service as all provisioning is done behind the scene on behalf of the user. Policies are in place to shut down extra compute nodes. The difficulty for the vendor lies in the load balancing over the number of compute nodes, which is still an art. The same holds for the elasticity aspect.
The business model is relatively straightforward. The vendor buys resources from the Cloud provider and sells them at a premium to the user. A factor of 5 in price increases is not uncommon. The second source of income is the tuning of resources, e.g. elasticity and caching. All traditional DBA tasks in enterprise environments are covered by SLAs when outsourced. Furthermore, paid consultancy support will leverage the cost of the army of marketeers/ sales teams.
Current DWH products are mostly designed from a technical perspective. It quickly leads to misinformation about the fitness of a solution. One of the overlooks is to ignore the many different user profiles for analytics. Especially in an amplifier glass marketing game where the results of a simple ‘ant’ query is turned into a powerful ‘elephant’ [Firebolt.] Multiple Venture investments are needed to keep marketing presence focused and fund the foreseeable years of core development to get a mature product. An industrial-strength database is more than glueing together several lego bricks of open-source projects.
Moreover, the query performance becomes less of an issue if we go down the list. For example, a Machine Learning training run can take hours, and then a sample query that runs within a few seconds is meaningless.
The Green Data Warehouse Manifesto
Looking into the future, we see a few decades unfolding where the primary question is how and where to reduce our footprint on climate and energy consumption. A self-settling ship reacts in seconds while its lawyers have years for the court case based on the collected data.
Although in our research field, the climate gains may not be gigantic, new systems should be designed from the ground up to anticipate the following requirements:
Respect the history of your database
A database is more than a pile of data appends
Time travel to older versions is an energy saver
Respect the data freshness needs of users
Freshness should recognise last-day-of-use
Freshness is a time-based dependency over query result sets
Respect the performance needs of users
There is a big difference in the needs for bulk loading, operational analytics, and data science
Subsecond response times is only needed in operational analytics. For analytics processing, a response time of a few seconds is acceptable
Respect the data resource budget limits of users
Avoid going back to the manager for more budget
Scalability should raise satisfaction, not deplete your cash
Respect the incremental nature of the workflow
The actions of yesterday cover 90% of the requests of today
Don't repeat the same query over stable data, but recognise the query resultset as a first-class citizen
Respect the need to communicate among users
Data warehouse results are shared between users
Seamlessly integrate with communication channels
Respect the time difference with reality in data
Avoid pretending to have zero-time difference with reality
Respect the self-service expertise of users
Respect the user’s need for pre-advice and hand-holding
Satisfaction depends on balancing the parameters, which in turn calls for pre-advise and the ‘what-if’ questions
Respect balance between cost/speed/freshness
Avoid high volatility of similar tasks and be energy savvy.
Next to "green", another factor is to "relax". In this society where everything must be more and faster, one actually can't continually improve the "user satisfaction" level by stressing out oneself. Instead, relax, take a step back, and concentrate on giving the users what they need; one can achieve more (results) with less (costs & waste).
April. 13, 2022