Data Lakehouse

A data lakehouse combines the best features of data warehouses and data lakes.

Data Lake

A data lake is the name given to a collection of tools that are often used together to process large amounts of data. It typically includes a storage system such as S3 or HDFS and a processing system such as Apache Spark or Hadoop MapReduce. A data lake is typically used to:

  • Store large volumes of data, often in raw, unprocessed form, as it arrives in near real time
  • Process a subset of the data in real time or in batch mode
  • Provide language-agnostic runtimes for data analysis
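The first two points can be illustrated with a minimal sketch. It assumes a local temporary directory as a stand-in for an object store such as S3 or HDFS, and plain Python in place of a processing engine such as Spark. The function names (`store_raw`, `batch_count_by_type`), the `date=` partition layout, and the event fields are hypothetical and chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path


def store_raw(lake_root: Path, partition: str, records: list[dict]) -> Path:
    """Append records, unmodified, as JSON lines under a partition directory."""
    part_dir = lake_root / f"date={partition}"  # hypothetical partition layout
    part_dir.mkdir(parents=True, exist_ok=True)
    path = part_dir / "events.jsonl"
    with path.open("a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


def batch_count_by_type(lake_root: Path, partition: str) -> dict[str, int]:
    """Batch-process a single partition, i.e. only a subset of the stored data."""
    counts: dict[str, int] = {}
    for path in sorted((lake_root / f"date={partition}").glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                event = json.loads(line)
                kind = event.get("type", "unknown")
                counts[kind] = counts.get(kind, 0) + 1
    return counts


def demo() -> dict[str, int]:
    """Store two days of raw events, then process only the first day."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        store_raw(root, "2024-01-01",
                  [{"type": "click"}, {"type": "view"}, {"type": "click"}])
        store_raw(root, "2024-01-02", [{"type": "view"}])
        return batch_count_by_type(root, "2024-01-01")


if __name__ == "__main__":
    print(demo())
```

The data is written exactly as received. Any interpretation, here counting events by type, happens later and only on the partition that is read.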

Data Warehouse

A data warehouse is usually where processed data is stored for direct use by downstream applications. Data warehouses are typically harder to scale than data lakes and apply considerably more validation and processing to the data they hold.

Data Lakehouse

A data lakehouse attempts to combine elements of both a data lake and a data warehouse. As with a data lake, it is typically the name given to a group of systems architected together to provide this functionality. It normally supports the Extract, Load, Transform (ELT) paradigm, in which raw data is loaded into storage first and transformed afterwards.
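The ordering in Extract, Load, Transform can be sketched as follows. This is an illustration only: it assumes an in-memory SQLite database as a stand-in for lakehouse storage, and the table names (`raw_events`, `spend_by_user`), the sample payloads, and the function names are all hypothetical.

```python
import json
import sqlite3

# Hypothetical raw payloads, kept as strings exactly as a source might emit them.
RAW_EVENTS = [
    '{"user_id": "a", "amount": "10.5"}',
    '{"user_id": "b", "amount": "4"}',
    '{"user_id": "a", "amount": "2.5"}',
]


def extract() -> list[str]:
    """Extract: pull raw payloads from the source without interpreting them."""
    return list(RAW_EVENTS)


def load(conn: sqlite3.Connection, payloads: list[str]) -> None:
    """Load: store the payloads as-is, before any transformation."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")
    conn.executemany(
        "INSERT INTO raw_events (payload) VALUES (?)",
        [(p,) for p in payloads],
    )


def transform(conn: sqlite3.Connection) -> None:
    """Transform: derive a cleaned, typed table from the already-loaded raw data."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS spend_by_user "
        "(user_id TEXT PRIMARY KEY, total REAL)"
    )
    totals: dict[str, float] = {}
    for (payload,) in conn.execute("SELECT payload FROM raw_events"):
        record = json.loads(payload)
        user = record["user_id"]
        totals[user] = totals.get(user, 0.0) + float(record["amount"])
    conn.executemany(
        "INSERT OR REPLACE INTO spend_by_user (user_id, total) VALUES (?, ?)",
        sorted(totals.items()),
    )


def run_elt() -> list[tuple[str, float]]:
    """Run the three steps in ELT order and return the derived table."""
    conn = sqlite3.connect(":memory:")
    load(conn, extract())
    transform(conn)
    rows = conn.execute(
        "SELECT user_id, total FROM spend_by_user ORDER BY user_id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(run_elt())
```

Because the raw table is kept, the transform step can be changed and rerun later without extracting the data again.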
