Use Delta Lake as the Master Data Management (MDM) Source for Downstream Applications
Data Engineering in Towards Data Science
by Manoj Kukreja
18h ago
In this article, we will try to understand how the output from the Delta Lake change feed can be used to feed downstream applications. Image by Satheesh Sankaran from Pixabay. As per the ACID rules, the isolation property states that “the intermediate state of any transaction should not affect other transactions”. Almost every modern database has been built to follow this rule. Unfortunately, until recently the same rule could not be effectively implemented in the big data world. What was the reason? Modern distributed processing frameworks such as Hadoop MapReduce and Apache Spark perform co ..read more
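For context, a minimal PySpark sketch of what reading the Delta Lake change feed looks like is shown below; the table name and starting version are hypothetical, and it assumes a Spark session with the Delta Lake package available.

```python
# A minimal sketch of consuming the Delta Lake change data feed; the table name
# ("customers") and starting version are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-cdf-demo")
    # These two configs enable Delta Lake SQL support on the session.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Enable the change data feed on an existing Delta table (one-time setting).
spark.sql("""
    ALTER TABLE customers
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read only the changes committed since a given table version. Each row carries
# _change_type (insert / update_preimage / update_postimage / delete),
# _commit_version and _commit_timestamp, which downstream applications can use
# to merge the deltas into their own stores.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .table("customers")
)
changes.show()
```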
How to Build an ELT with Python
Data Engineering in Towards Data Science
by Marie Truong
18h ago
Extracting, Loading and Transforming Data. Photo by JJ Ying on Unsplash. ELT (Extract, Load, Transform) is a modern approach to data integration that differs slightly from ETL (Extract, Transform, Load). ETL transforms data before loading it into the data warehouse, whereas in an ELT, the raw data is loaded directly into the data warehouse and transformed using SQL. Building ELTs is a very important part of data and analytics engineers’ jobs, and it can also be a useful skill for data analysts and scientists with a wider scope, or for job seekers building a complete portfolio. In this ..read more
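The preview does not show the article’s actual stack, but the ELT pattern itself can be sketched with pandas and SQLite standing in for the warehouse; the table names and sample data below are purely illustrative.

```python
# Minimal ELT sketch: extract raw data, load it untransformed into the
# "warehouse" (SQLite here, purely as a stand-in), then transform with SQL.
import sqlite3

import pandas as pd

# Extract: pull the raw records as-is (in practice, from an API or files).
raw = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": ["10.5", "20.0", "7.25"],
    "country": ["US", "FR", "US"],
})

# Load: write the raw data straight into the warehouse, no cleaning yet.
conn = sqlite3.connect("warehouse.db")
raw.to_sql("raw_orders", conn, if_exists="replace", index=False)

# Transform: shape the data with SQL after it has landed.
conn.executescript("""
    DROP TABLE IF EXISTS orders_by_country;
    CREATE TABLE orders_by_country AS
    SELECT country, SUM(CAST(amount AS REAL)) AS total_amount
    FROM raw_orders
    GROUP BY country;
""")
print(pd.read_sql("SELECT * FROM orders_by_country", conn))
conn.close()
```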
Easily Validate User-Generated Data Using Pydantic
Data Engineering in Towards Data Science
by Charles Mendelson
2d ago
How to use Pydantic to validate Excel data. Photo by DeepMind on Unsplash. As a data engineer, I frequently encounter situations where I have built pipelines and other automations based on user-generated data from Excel. Excel’s flexibility allows it to be used by a wide variety of users, but unfortunately, that flexibility leads to invalid data entering the pipeline. Before I discovered Pydantic, I wrote incredibly complex Pandas functions to check and filter data to make sure it was valid. What is Pydantic? Pydantic is a Python library th ..read more
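As a rough illustration of the idea (the article’s own models are not shown in the preview), a Pydantic model can replace hand-rolled Pandas checks; the field names below are hypothetical, and in practice the rows would come from pd.read_excel(...).to_dict(orient="records").

```python
# A minimal sketch of validating user-generated rows with Pydantic; the field
# names and the sample rows are hypothetical stand-ins for Excel data.
from pydantic import BaseModel, ValidationError

class OrderRow(BaseModel):
    order_id: int
    customer_email: str
    amount: float

rows = [
    {"order_id": 1, "customer_email": "a@example.com", "amount": "19.99"},
    {"order_id": "not-a-number", "customer_email": "b@example.com", "amount": 5},
]

valid, invalid = [], []
for row in rows:
    try:
        valid.append(OrderRow(**row))   # coerces "19.99" to 19.99
    except ValidationError as err:
        invalid.append((row, err))      # keep the bad row and the reason

print(f"{len(valid)} valid rows, {len(invalid)} rejected")
```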
How to Setup a Simple ETL Pipeline with AWS Lambda for Data Science
Data Engineering in Towards Data Science
by Brian Roepke
3d ago
How to set up a simple ETL pipeline with AWS Lambda that can be triggered via an API endpoint or a schedule and writes the results to an S3 bucket for ingestion. Photo by Rod Long on Unsplash. Introduction to ETL with AWS Lambda: when it comes time to build an ETL pipeline, many options exist. You can use a tool like Astronomer or Prefect for orchestration, but you will also need somewhere to run the compute. With this, you have a few options: a Virtual Machine (VM) like AWS EC2; container services like AWS ECS or AWS Fargate; Apache Spark like AWS EMR (Elastic Map Reduce); S ..read more
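A minimal sketch of such a Lambda handler is shown below, with a hypothetical bucket name and hard-coded sample records standing in for a real extract step.

```python
# A minimal sketch of a Lambda-style ETL handler that writes its output to S3.
# The bucket name, key prefix, and records are hypothetical; the function could
# be triggered by an API Gateway endpoint or an EventBridge schedule.
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "my-etl-results-bucket"   # assumed bucket name

def lambda_handler(event, context):
    # Extract + transform: replace with real source fetching and cleaning.
    records = [{"id": 1, "value": 42}, {"id": 2, "value": 7}]

    # Load: write the result as a timestamped JSON object for later ingestion.
    key = f"etl-output/{datetime.now(timezone.utc):%Y-%m-%d_%H%M%S}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(records))

    return {"statusCode": 200, "body": json.dumps({"written": key})}
```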
Data Integration Strategies for Time Series Databases
Data Engineering in Towards Data Science
by Yitaek Hwang
5d ago
Exploring popular data integration strategies for TSDBs, including ETL, ELT, and CDC. Photo by fabio on Unsplash. As digital transformation reaches more industries, the number of data points generated is growing exponentially. As such, data integration strategies to collect such large volumes of data from different sources in varying formats and structures are now a primary concern for data engineering teams. Traditional approaches to data integration, which have largely focused on curating highly structured data into data warehouses, struggle to deal with the volume and heterogeneity o ..read more
Think in SQL — Avoid Writing SQL in a Top to Bottom Approach
Data Engineering in Towards Data Science
by Chengzhi Zhao
6d ago
Think in SQL — Avoid Writing SQL in a Top-to-Bottom Approach. Write clear SQL by understanding the logical query processing order. Photo by Jeffrey Brandjes on Unsplash. You might find writing SQL challenging due to its declarative nature. Especially for engineers familiar with imperative languages like Python, Java, or C, switching to SQL requires a mental gear shift. Thinking in SQL is different from thinking in any imperative language, and it should not be learned and developed the same way. When working with SQL, do you write in a top-to-bottom approach? Do you start developing in SQL with t ..read more
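To make the logical query processing order concrete, here is a small runnable sketch (SQLite and made-up data are used only to keep it self-contained); reading the comments in processing order rather than top to bottom is the habit the article argues for.

```python
# A small runnable sketch of the logical query processing order:
# FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY.
# Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('east', 100), ('east', 50), ('west', 30), ('west', 5);
""")

# Reading the query in processing order rather than top to bottom:
#   1. FROM sales            -- pick the rows to scan
#   2. WHERE amount > 10     -- filter individual rows before grouping
#   3. GROUP BY region       -- collapse the surviving rows into groups
#   4. HAVING SUM(amount)... -- filter whole groups
#   5. SELECT ... AS total   -- compute the output columns
#   6. ORDER BY total DESC   -- sort the final result
rows = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE amount > 10
    GROUP BY region
    HAVING SUM(amount) > 40
    ORDER BY total DESC
""").fetchall()
print(rows)   # [('east', 150.0)]
```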
BigQuery Optimization Strategies 3: Table Flattening
Data Engineering in Towards Data Science
by Martin Weitzmann
1w ago
About Exploding Tables and Imploding Arrays. Arrays are one of the coolest features of an analytics database you can think of, because they can store additional information right at the place and time it happened. Let’s explore some basic examples and then have a look at arrays in Google Analytics 4. Photo by Torsten Dederichs on Unsplash. For storing sales history, for example, we can just store the products bought in an array together with the purchase event rather than in a separate table — saving all the SQL join hassle later in the analyses. And while arrays are not super intu ..read more
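As a rough sketch of flattening such an array with UNNEST (the project, dataset, table, and column names are hypothetical, and the call assumes valid Google Cloud credentials):

```python
# A minimal sketch of flattening ("exploding") an array column in BigQuery with
# UNNEST, using the google-cloud-bigquery client. Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT
      event_name,
      item.item_name,
      item.quantity
    FROM `my-project.analytics.purchase_events`,
         UNNEST(items) AS item          -- one output row per array element
    WHERE event_name = 'purchase'
"""

for row in client.query(query).result():
    print(row.event_name, row.item_name, row.quantity)
```

Each element of the items array becomes its own output row, joined back to the columns of its parent purchase event.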
How to Load Multiple CSV Files into a Pandas DataFrame
Data Engineering in Towards Data Science
by Giorgos Myrianthous
1w ago
Importing and concatenating multiple CSV files into one pandas DataFrame. Photo by Daniel K Cheung on Unsplash. CSV (Comma-Separated Values) is a popular file format used to store and exchange data. In fact, this type of source is commonly used for relatively small volumes of data. pandas is a commonly used Python package that lets developers process and transform data as part of analytical and data science tasks. However, before performing any task, pandas needs to load all the data into memory. This means that the package can only be used for relatively small volumes of data — well t ..read more
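A minimal sketch of the pattern, assuming the CSV files live in a hypothetical data/ directory and share the same columns:

```python
# Load every CSV in a folder and concatenate them into one DataFrame.
# The directory and filename pattern are hypothetical.
import glob

import pandas as pd

csv_files = glob.glob("data/*.csv")                    # all CSVs in the folder
frames = [pd.read_csv(path) for path in csv_files]     # read each file
combined = pd.concat(frames, ignore_index=True)        # one DataFrame, reindexed
print(combined.shape)
```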
FinOps: Four Ways to Reduce Your BigQuery Storage Cost
Data Engineering in Towards Data Science
by Xiaoxu Gao
1w ago
Don’t overlook the cloud storage cost. Photo by Nathan Dumlao on Unsplash. Given the current economic situation, it’s more important than ever to maximize our cash on hand and develop a series of cost optimization strategies. The growing use of cloud services has brought not only many opportunities for the business but also the potential for management challenges that can lead to cost overruns and other issues. FinOps, a newly introduced concept, is an evolving operational framework and cultural shift that allows organizations to get maximum business value by bringing ..read more
How to Compare Two Tables For Equality in BigQuery
Data Engineering in Towards Data Science
by Giorgos Myrianthous
1w ago
Compare tables and extract their differences with standard SQL. Photo by Zakaria Ahada on Unsplash. Comparing tables in BigQuery is a crucial task when testing the results of data pipelines and queries prior to productionizing them. The ability to compare tables allows for the detection of any changes or discrepancies in the data, ensuring that the data remains accurate and consistent. In this article, we will demonstrate how to compare two (or more) tables on BigQuery and extract the records that differ (if any). More specifically, we will showcase how to compare tables with identical ..read more
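One common way to do this with standard SQL is a symmetric EXCEPT DISTINCT; a minimal sketch using the google-cloud-bigquery client is shown below, with hypothetical table names and the assumption that both tables share the same schema.

```python
# A minimal sketch of comparing two BigQuery tables: rows present in one table
# but not the other, in either direction. Table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    (SELECT * FROM `my-project.dataset.orders_v1`
     EXCEPT DISTINCT
     SELECT * FROM `my-project.dataset.orders_v2`)
    UNION ALL
    (SELECT * FROM `my-project.dataset.orders_v2`
     EXCEPT DISTINCT
     SELECT * FROM `my-project.dataset.orders_v1`)
"""

diff = list(client.query(query).result())
print(f"{len(diff)} differing rows")   # 0 means the tables are identical
```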
