What is the best way to easily ingest data into a Data Lake?
May 08, 2018
Why you need a Data Lake is a topic that has been discussed at length. Today, I want to answer a commonly asked customer question –
“What is the best way to easily ingest data into our Data Lake?”
Hint – the answer involves serverless, native AWS services and is deliberately simple to configure… but before I explain why-we-do-what-we-do (part of our core values), we need to understand why-we-don’t-do-what-we-don’t-do. Here are three rules for building a Data Lake and ingesting data into it…
1) We don’t model ALL the data. A highly normalised enterprise data model where we can fit ‘all the data’ is an attractive proposition. However, one of the downsides we see again and again is that it takes multiple months to ingest a new source of data into the enterprise data model and publish it to production. This gives us rule 1.
The time to configure the system to ingest a new file shall be days not months.
2) We don’t run Monolithic ETL servers. Traditional ETL tools are expensive and can become a bottleneck because all the data has to be pushed through their server. In addition, they are servers, with all the operational engineering, monitoring and costs required to keep them running reliably. Expanding on this gives us rule 2.
The ingestion process should be scalable and have low operational complexity; costs should be low and should scale linearly up and down with the data volume ingested.
3) We don’t create an opaque and insecure Data Lake. Sometimes the ease of use and scalability of cloud services can work against us: loading data into S3 without recording what is present and where it is located can create a mess of datasets. To satisfy our friends in security and risk, we also need to guarantee that access to data is controlled and appropriately granted. We also assume that all manual data governance processes will fail over time. This gives us rule 3.
All data entering the data lake shall be known, recorded and governed by default.
One solution that adheres to these rules is a serverless ingestion pipeline built on AWS native services.
The pipeline is built upon the following AWS services and open source software:
· AWS Step Functions to manage and orchestrate the event-driven process from file arrival to dataset publishing.
· AWS Lambda to execute short-running synchronous tasks.
· AWS Batch to execute long-running tasks in serverless containers.
· AWS Glue for writing the columnar files.
· AWS Athena for querying the published files with SQL.
· Customised open source libraries for data validation.
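To make the division of labour between these services a little more concrete, here is a minimal sketch of what a Step Functions definition for this kind of pipeline might look like, written as a Python dictionary in Amazon States Language. It is an illustration only, not our actual definition: the state names, the function name check-manifest, and the Batch and Glue job names are all hypothetical placeholders, and only three stages are shown.

```python
import json


def build_ingestion_state_machine() -> dict:
    """Return a hypothetical Amazon States Language definition.

    Every state, function, job and queue name here is a placeholder
    invented for illustration.
    """
    return {
        "Comment": "Illustrative ingestion flow: Lambda -> Batch -> Glue",
        "StartAt": "CheckManifest",
        "States": {
            # Short-running synchronous task, so Lambda is a natural fit.
            "CheckManifest": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {
                    "FunctionName": "check-manifest",
                    "Payload.$": "$",
                },
                "ResultPath": "$.manifest_check",
                "Next": "HashFile",
            },
            # Potentially long-running task, so it is submitted to Batch
            # and the state machine waits for the job to finish (.sync).
            "HashFile": {
                "Type": "Task",
                "Resource": "arn:aws:states:::batch:submitJob.sync",
                "Parameters": {
                    "JobName": "hash-file",
                    "JobDefinition": "hash-file-job-definition",
                    "JobQueue": "ingestion-queue",
                },
                "ResultPath": "$.hash_job",
                "Next": "WriteColumnar",
            },
            # Glue job that writes the columnar output.
            "WriteColumnar": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": "write-columnar"},
                "End": True,
            },
        },
    }


def dangling_transitions(definition: dict) -> list:
    """List StartAt/Next targets that do not name a defined state."""
    states = definition["States"]
    targets = [definition["StartAt"]]
    targets += [s["Next"] for s in states.values() if "Next" in s]
    return [t for t in targets if t not in states]


if __name__ == "__main__":
    definition = build_ingestion_state_machine()
    assert dangling_transitions(definition) == []
    print(json.dumps(definition, indent=2))
```

Building the definition in code and checking it for dangling transitions before deployment is one cheap way to catch typos in state names early.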
This diagram shows how to ingest a file.
1) We check the provided ingestion package to make sure we have our required manifest.
2) A batch job hashes the file to check integrity.
3) The hash is used to determine if this file is a unique file.
4) If required, various compliance checks are run (Anti-Virus, PI data sniffing etc.).
5) The file is moved to an appropriately controlled location.
6) The schema is checked, and field level data validation takes place.
7) We transform the file into a columnar format and prepare it for efficient querying by Athena.
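Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than our production code: it assumes SHA-256 as the hash, assumes the manifest carries an expected hash to compare against, and uses an in-memory set where a real pipeline would consult a persistent store. The function names are hypothetical.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_manifest(path: str, expected_hash: str) -> bool:
    """Integrity check: does the file hash equal the hash we were told?

    Assumes (hypothetically) that the manifest records an expected hash.
    """
    return sha256_of_file(path) == expected_hash.strip().lower()


def is_new_file(file_hash: str, seen_hashes: set) -> bool:
    """Uniqueness check: record the hash and report whether it was new.

    `seen_hashes` stands in for whatever store the pipeline would use.
    """
    if file_hash in seen_hashes:
        return False
    seen_hashes.add(file_hash)
    return True
```

Using the same hash for both the integrity check and the uniqueness check means the file only has to be read once.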
The only configuration required to ingest a dataset to the Raw Zone is to record its existence in the metadata store, so that the manifest describes the dataset and its access policies. Dataset validation is a simple matter of providing a schema and rules JSON file to the ingestion pipeline for files that are to be published to Athena.
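As a rough illustration of what schema and rules driven validation can look like, here is a minimal sketch in plain Python. It is not the format or the library our pipeline uses: the rule keys (name, type, required, minimum) and the field names are hypothetical, and the rules are written as a Python dictionary where a real setup would load them from the JSON file.

```python
# Hypothetical rules; in practice these would be loaded with json.load().
RULES = {
    "fields": [
        {"name": "customer_id", "type": "integer", "required": True},
        {"name": "email", "type": "string", "required": True},
        {"name": "age", "type": "integer", "required": False, "minimum": 0},
    ]
}

# Map schema type names to Python casts for values read as text.
CASTS = {"integer": int, "number": float, "string": str}


def validate_row(row: dict, rules: dict) -> list:
    """Return a list of human-readable errors for one row (empty if valid)."""
    errors = []
    for field in rules["fields"]:
        name = field["name"]
        raw = row.get(name)
        if raw is None or raw == "":
            if field.get("required", False):
                errors.append(f"{name}: missing required value")
            continue
        try:
            value = CASTS[field["type"]](raw)
        except ValueError:
            errors.append(f"{name}: expected {field['type']}, got {raw!r}")
            continue
        if "minimum" in field and value < field["minimum"]:
            errors.append(f"{name}: {value} is below minimum {field['minimum']}")
    # Schema check: flag columns the schema does not know about.
    known = {field["name"] for field in rules["fields"]}
    for name in sorted(set(row) - known):
        errors.append(f"{name}: column not in schema")
    return errors
```

Calling validate_row on each row of an incoming file and collecting the errors gives a simple pass or fail decision before anything is published.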
The end result is a system that meets our three rules and easily delivers data queryable in Athena or accessible in Spark/Glue for further transformation.
What types of insights are our customers deriving?
1. It could be a holistic view of customers to better target offers,
2. It could be combining records from a transactional system with their channel interactions to improve their customers’ experience, or
3. It could be merging internal and external data sources to build a dataset to measure their churn propensity
Our serverless ingestion pipeline accelerates the ability of the business to provision data for their data scientists and analysts. Combined with our Data Lake deployment pattern, this is a key accelerator that allows our customers to unlock their data, derive these types of insight, and base their decisions on facts rather than intuition.
Henry has worked with the whole spectrum of Data projects, from Data warehousing, to Big Data and cloud platforms to delivering machine learning outcomes. When he’s not shuffling ones and zeros around, you can probably find him kite surfing in Altona.