By IT Brew Staff
Definition:
A “data lake” is a collection of structured and unstructured data: PDFs, spreadsheets, audio, video—throw it all in!
Whatever’s in the pool is collected and available to data scientists, developers, and other IT pros to process for business insights, dashboards, AI-model training, and general user queries.
“What you gain is the ability to do cloud-native data analytics at multi-petabyte scale,” Matt Radolec, VP of incident response, cloud operations, and sales engineering at data security company Varonis, told IT Brew.
Lakes vs. warehouses vs. data lakehouses
A data lake differs from a “data warehouse,” which holds information that has already been categorized and readied for analysis.
Like neatly arranged aisles of pallet stacks, a data warehouse organizes information into tables and columns. Data has defined descriptors, and tables are organized by type, giving query tools a structure to look for and follow.
A hybrid “data lakehouse” combines the two: raw and structured data sit side by side, which can support AI and machine-learning workloads.
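To make the distinction concrete, here is a minimal sketch using only Python's standard library. It is an illustration, not any vendor's implementation: a temporary folder stands in for the lake, an in-memory SQLite table stands in for the warehouse, and the `orders` table, file names, and helper functions are all hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def store_in_lake(lake_dir, name, payload):
    """Lake-style write: keep the bytes as-is, with no schema check."""
    target = lake_dir / name
    target.write_bytes(payload)
    return target


def store_in_warehouse(conn, row):
    """Warehouse-style write: the row must fit a table defined in advance."""
    conn.execute(
        "INSERT INTO orders (order_id, customer, total) VALUES (?, ?, ?)",
        (row["order_id"], row["customer"], row["total"]),
    )


def demo():
    # Hypothetical warehouse table with named, typed columns.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, "
        "customer TEXT NOT NULL, "
        "total REAL NOT NULL)"
    )
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        # The lake accepts structured and unstructured files alike.
        order = {"order_id": 1, "customer": "acme", "total": 19.5}
        store_in_lake(lake, "order-1.json", json.dumps(order).encode())
        store_in_lake(lake, "call-recording.bin", b"\x00\x01\x02")
        lake_files = len(list(lake.iterdir()))
        # Only files that match the schema can be loaded into the table.
        for path in lake.glob("*.json"):
            store_in_warehouse(conn, json.loads(path.read_text()))
    warehouse_rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    conn.close()
    return lake_files, warehouse_rows


lake_files, warehouse_rows = demo()
```

In this sketch the lake ends up holding both files, while the warehouse table holds only the one record that fits its columns. Querying raw files and structured tables together is roughly the idea a lakehouse builds on.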
Some lake options
Vendors like Amazon, Google, and Microsoft offer cloud-based data lakes, on top of which customers can use additional tools to run functions like predictive analytics. Other vendor options include Snowflake and Databricks.
“The number-one use case for having a data lake is just to make sure that you have retention of the data in the easiest, lowest-cost form possible,” Alex Merced, head of developer relations at data lakehouse platform Dremio, told IT Brew in July 2024.
Still, stay organized
Data lakes are often structured in zones, depending on how the data will be consumed, according to Gartner.
The market-intelligence firm recommends data lake users prioritize metadata maintenance and an archiving strategy that ensures the most popular data is readily available for orchestrating data pipelines and traceability.
“A properly architected data lake avoids evolving into a data swamp,” read a recent report from Gartner.
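What zone-based organization and metadata maintenance might look like in practice can be sketched in a few lines of Python. Everything here is an assumption for illustration: the zone names (`raw`, `curated`, `archive`), the `CatalogEntry` fields, the sample paths, and the read-count threshold are hypothetical and do not come from Gartner's report.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CatalogEntry:
    """One metadata record per object stored in the lake (hypothetical fields)."""
    path: str
    zone: str              # assumed zone names: "raw", "curated", "archive"
    owner: str
    ingested: date
    reads_last_30d: int = 0


def plan_archive(entries, min_reads=1):
    """List paths outside the archive zone that were read fewer than min_reads times."""
    return [
        e.path
        for e in entries
        if e.zone != "archive" and e.reads_last_30d < min_reads
    ]


catalog = [
    CatalogEntry("raw/calls/2024-07-01.wav", "raw", "support", date(2024, 7, 1)),
    CatalogEntry("curated/orders.parquet", "curated", "finance", date(2024, 7, 2), reads_last_30d=42),
    CatalogEntry("archive/logs-2021.gz", "archive", "it", date(2021, 12, 31)),
]

# Rarely read data becomes a candidate for the archive zone; popular data stays put.
to_archive = plan_archive(catalog)
```

The point of the sketch is that an archiving decision depends on the catalog, not on the files themselves. Without up-to-date metadata, there is no way to tell popular data from forgotten data.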