The Green Data Warehouse Manifesto

Prof Martin Kersten




A Cloud Data Warehouse Diagnosis

Cloud-enabled database systems have been emerging for over a decade. Their generic architecture consists of three layers:

The persistent store is the Cloud global file store, such as S3, GFS, or Azure Files. It is the place where all data comes to rest. The middle layer consists of an adaptive cluster of machines that do the actual analytical work. These machines can be of any shape and reliability. An attached SSD storage device can act as a semi-permanent cache. The top layer controls the complete cluster, regulates user access, and balances the load.
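The role of the attached SSD as a semi-permanent cache can be sketched as a read-through cache in front of the global file store. This is a minimal illustration, not how any particular product works: the class name `ComputeNodeCache`, the `fetch_from_store` callback, and the least-recently-used eviction policy are all assumptions made for the example.

```python
from collections import OrderedDict
from typing import Callable


class ComputeNodeCache:
    """Read-through LRU cache standing in for a compute node's attached SSD."""

    def __init__(self, fetch_from_store: Callable[[str], bytes], capacity: int) -> None:
        self._fetch = fetch_from_store  # hypothetical call into the global file store
        self._capacity = capacity       # maximum number of cached objects
        self._entries = OrderedDict()   # object key -> bytes, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, key: str) -> bytes:
        if key in self._entries:
            # Served locally: mark as most recently used.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        # Not cached: go to the persistent store and keep a local copy.
        self.misses += 1
        data = self._fetch(key)
        self._entries[key] = data
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used
        return data


# A dict stands in for the global file store in this sketch.
store = {"part-0": b"a", "part-1": b"b", "part-2": b"c"}
cache = ComputeNodeCache(store.__getitem__, capacity=2)
for key in ["part-0", "part-1", "part-0", "part-2", "part-1"]:
    cache.read(key)
print(cache.hits, cache.misses)
```

Because the cache is only semi-permanent, losing a compute node costs time but no data: everything can be fetched again from the persistent store.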

The key selling point is that all users can see the complete database for querying.


It is easy to sign up for the service, as all provisioning is done behind the scenes on behalf of the user. Policies are in place to shut down extra compute nodes. The difficulty for the vendor lies in balancing the load over the compute nodes, which is still an art. The same holds for the elasticity aspect.


The business model is relatively straightforward. The vendor buys resources from the Cloud provider and sells them at a premium to the user. A fivefold markup is not uncommon. The second source of income is the tuning of resources, e.g. elasticity and caching. All traditional DBA tasks in enterprise environments are covered by SLAs when outsourced. Furthermore, paid consultancy support helps to offset the cost of the army of marketeers and sales teams.


Current DWH products are mostly designed from a technical perspective. This quickly leads to misinformation about the fitness of a solution. One oversight is to ignore the many different user profiles for analytics, especially in a magnifying-glass marketing game where the result of a simple ‘ant’ query is turned into a powerful ‘elephant’ [Firebolt]. Multiple venture investments are needed to keep the marketing presence focused and to fund the foreseeable years of core development needed to reach a mature product. An industrial-strength database is more than gluing together several Lego bricks from open-source projects.

Moreover, query performance becomes less of an issue if we go down the list. For example, when a machine learning training run takes hours, a sample query that runs within a few seconds is meaningless.



The Green Data Warehouse Manifesto

Looking into the future, we see a few decades unfolding in which the primary question is how and where to reduce our climate footprint and energy consumption. A self-settling ship reacts in seconds while its lawyers have years for the court case based on the collected data.

Although the climate gains in our research field may not be gigantic, new systems should be designed from the ground up to anticipate the following requirements:


  1. Respect the history of your database

  2. A database is more than a pile of data appends

  3. Time travel to older versions is an energy saver

  4. Respect the data freshness needs of users

  5. Freshness should recognise last-day-of-use

  6. Freshness is a time-based dependency over query result sets

  7. Respect the performance needs of users

  8. There is a big difference in the needs for bulk loading, operational analytics, and data science

  9. Subsecond response times are only needed in operational analytics. For analytics processing, a response time of a few seconds is acceptable

  10. Respect the data resource budget limits of users

  11. Avoid going back to the manager for more budget

  12. Scalability should raise satisfaction, not deplete your cash

  13. Respect the incremental nature of the workflow

  14. The actions of yesterday cover 90% of the requests of today

  15. Don't repeat the same query over stable data, but recognise the query result set as a first-class citizen

  16. Respect the need to communicate among users

  17. Data warehouse results are shared between users

  18. Seamlessly integrate with communication channels

  19. Respect the time difference with reality in data

  20. Avoid pretending to have zero-time difference with reality

  21. Respect the self-service expertise of users

  22. Respect the user’s need for pre-advice and hand-holding

  23. Satisfaction depends on balancing the parameters, which in turn calls for pre-advice and ‘what-if’ questions

  24. Respect the balance between cost, speed, and freshness

  25. Avoid high volatility of similar tasks and be energy savvy.
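Several of these requirements (freshness as a time-based dependency over query result sets, and not repeating the same query over stable data) can be illustrated with a small result-set cache. The sketch below is one possible reading, not a prescribed design: the names `ResultSetCache`, `data_version`, and `max_staleness_s` are hypothetical, and the rule "reuse when the data is unchanged or the result is still fresh enough for this user" is an assumption.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class CachedResult:
    rows: Any
    computed_at: float  # when the result set was produced
    data_version: int   # version of the underlying data it was computed on


class ResultSetCache:
    """Treats query result sets as reusable, first-class objects."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._results: Dict[str, CachedResult] = {}
        self.executions = 0  # how often a query really had to run

    def query(self, sql: str, run_query: Callable[[str], Any],
              data_version: int, max_staleness_s: float) -> Any:
        now = self._clock()
        cached = self._results.get(sql)
        if cached is not None:
            unchanged = cached.data_version == data_version
            fresh_enough = now - cached.computed_at <= max_staleness_s
            if unchanged or fresh_enough:
                return cached.rows  # no energy spent on recomputation
        rows = run_query(sql)
        self.executions += 1
        self._results[sql] = CachedResult(rows, now, data_version)
        return rows


# Usage with a fake clock, so that the behaviour is deterministic.
now = [0.0]
cache = ResultSetCache(clock=lambda: now[0])
run = lambda sql: [("total", 42)]
sql = "SELECT count(*) FROM sales"

cache.query(sql, run, data_version=1, max_staleness_s=60)  # first run: executes
now[0] = 30.0
cache.query(sql, run, data_version=2, max_staleness_s=60)  # data changed, but fresh enough
now[0] = 120.0
cache.query(sql, run, data_version=2, max_staleness_s=60)  # changed and too old: executes
cache.query(sql, run, data_version=2, max_staleness_s=0)   # stable data: reused
print(cache.executions)
```

Here the user states a freshness need per request, so a dashboard that tolerates a minute of staleness does not trigger the same work as a request that insists on the latest data.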


Next to "green", another factor is to "relax". In a society where everything must be more and faster, one can't continually improve the "user satisfaction" level by stressing oneself out. Instead, relax, take a step back, and concentrate on giving the users what they need. That way, one can achieve more (results) with less (costs & waste).


martin.kersten@cwi.nl

April 13, 2022
