
Data Engineering in Towards Data Science
Read writing about Data Engineering in Towards Data Science. Your home for data science.
18h ago
In this article, we will try to understand how the output from the Delta Lake change data feed can be used to feed downstream applications
Image by Satheesh Sankaran from Pixabay
Among the ACID properties, isolation states that “the intermediate state of any transaction should not affect other transactions”. Almost every modern database has been built to follow this rule. Unfortunately, until recently the same rule could not be effectively implemented in the big data world. What was the reason?
Modern distributed processing frameworks such as Hadoop MapReduce and Apache Spark perform co…
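A hedged sketch of what consuming the change data feed might look like in PySpark, assuming a Delta table named my_table with the feature enabled (the table name and starting version are illustrative, not taken from the article):

```python
from pyspark.sql import SparkSession

# Assumes a Spark session configured with the Delta Lake extensions.
spark = SparkSession.builder.appName("cdf-demo").getOrCreate()

# Read the row-level changes recorded since version 0 of the table.
# Each row carries _change_type (insert / update_preimage /
# update_postimage / delete), _commit_version and _commit_timestamp.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("my_table")  # hypothetical table name
)

# Downstream applications typically filter on the change type,
# e.g. keep only fresh inserts and post-update images.
latest = changes.filter(
    changes["_change_type"].isin("insert", "update_postimage")
)
latest.show()
```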
18h ago
Extracting, Loading and Transforming Data
Photo by JJ Ying on Unsplash
ELT (Extract, Load, Transform) is a modern approach to data integration that differs slightly from ETL (Extract, Transform, Load). ETL transforms data before loading it inside the data warehouse, whereas in an ELT, the raw data is loaded directly inside the data warehouse and transformed using SQL.
Building ELTs is a very important part of data and analytics engineers’ jobs, and it can also be a useful skill for data analysts and scientists with a wider scope, or for job seekers building a complete portfolio.
In this…
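The excerpt is cut off, but the load-then-transform pattern it describes is easy to sketch. A minimal, hedged example using DuckDB as a stand-in warehouse (the file, table, and column names are invented):

```python
import duckdb

# Stand-in "warehouse": an in-process DuckDB database file.
con = duckdb.connect("warehouse.duckdb")

# Extract + Load: land the raw file as-is, no transformation yet.
con.execute("""
    CREATE OR REPLACE TABLE raw_orders AS
    SELECT * FROM read_csv_auto('orders.csv')
""")

# Transform: reshape the raw data with SQL inside the warehouse.
con.execute("""
    CREATE OR REPLACE TABLE daily_revenue AS
    SELECT order_date, SUM(amount) AS revenue
    FROM raw_orders
    GROUP BY order_date
""")
```

The key design point is that the transformation runs where the data already lives, which is what distinguishes ELT from ETL.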
2d ago
How to use Pydantic to validate Excel data
Photo by DeepMind on Unsplash
As a data engineer, I frequently encounter situations where I have built pipelines and other automations based on user-generated data from Excel. Excel’s flexibility allows it to be used by a wide variety of users, but unfortunately, that flexibility leads to invalid data entering the pipeline. Before I discovered Pydantic, I wrote incredibly complex Pandas functions to check and filter data to make sure it was valid.
What is Pydantic?
Pydantic is a Python library th…
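A minimal sketch of the pattern the article introduces, assuming Pydantic and pandas with an Excel engine such as openpyxl installed (the model fields and file name are invented):

```python
import pandas as pd
from pydantic import BaseModel, ValidationError

# Declare the shape each spreadsheet row is expected to have.
class OrderRow(BaseModel):
    order_id: int
    customer: str
    amount: float

df = pd.read_excel("orders.xlsx")  # hypothetical workbook

valid_rows, errors = [], []
for record in df.to_dict(orient="records"):
    try:
        # Pydantic coerces types where possible and raises otherwise.
        valid_rows.append(OrderRow(**record))
    except ValidationError as exc:
        errors.append((record, exc))

print(f"{len(valid_rows)} valid rows, {len(errors)} rejected")
```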
3d ago
How to set up a simple ETL pipeline with AWS Lambda that can be triggered via an API endpoint or a schedule and writes the results to an S3 bucket for ingestion
Photo by Rod Long on Unsplash
Introduction to ETL with AWS Lambda
When it comes time to build an ETL pipeline, many options exist. You can use a tool like Astronomer or Prefect for orchestration, but you will also need somewhere to run the compute. With this, you have a few options (a minimal Lambda sketch follows the list):
Virtual Machine (VM) like AWS EC2
Container services like AWS ECS or AWS Fargate
Managed Apache Spark like AWS EMR (Elastic MapReduce)
S…
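The last option is cut off above; what follows is a hedged sketch of a Lambda-based ETL handler using boto3 (the bucket name, key, and event payload shape are assumptions, not taken from the article):

```python
import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Extract: the trigger (API Gateway call or schedule) supplies the
    # payload; fall back to an empty list for scheduled invocations.
    records = event.get("records", [])

    # Transform: a trivial cleanup step as a stand-in for real logic.
    cleaned = [{k: str(v).strip() for k, v in r.items()} for r in records]

    # Load: write the result to S3 for downstream ingestion.
    s3.put_object(
        Bucket="my-etl-bucket",        # hypothetical bucket
        Key="output/result.json",
        Body=json.dumps(cleaned),
    )
    return {"statusCode": 200, "body": f"wrote {len(cleaned)} records"}
```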
5d ago
Exploring popular data integration strategies for time-series databases (TSDBs), including ETL, ELT, and CDC
Photo by fabio on Unsplash
As digital transformation reaches more industries, the number of data points generated is growing exponentially. As such, data integration strategies to collect such large volumes of data from different sources in varying formats and structures are now a primary concern for data engineering teams. Traditional approaches to data integration, which have largely focused on curating highly structured data into data warehouses, struggle to deal with the volume and heterogeneity o…
6d ago
Think in SQL — Avoid Writing SQL in a Top-to-Bottom Approach
Write Clear SQL by Understanding the Logical Query Processing Order
Photo by Jeffrey Brandjes on Unsplash
You might find writing SQL challenging due to its declarative nature. Especially for engineers familiar with imperative languages like Python, Java, or C, SQL demands a mental gear shift. Thinking in SQL is different from thinking in any imperative language, and it should not be learned and developed the same way.
When working with SQL, do you write in a top-to-bottom approach? Do you start developing in SQL with t…
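To make the contrast concrete, here is a small annotated query (table and column names are made up); the numbered comments show the order in which the engine logically evaluates the clauses, which is not the order they are written in:

```python
# The numbers mark logical evaluation order, not reading order.
query = """
SELECT   department, COUNT(*) AS headcount  -- 5. project columns, name aggregates
FROM     employees                          -- 1. pick the source rows
WHERE    hire_date >= '2020-01-01'          -- 2. filter individual rows
GROUP BY department                         -- 3. form groups
HAVING   COUNT(*) > 10                      -- 4. filter whole groups
ORDER BY headcount DESC                     -- 6. sort the final result
"""
```

This is why, for example, a column alias defined in SELECT cannot be referenced in WHERE: the filter runs long before the projection.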
1w ago
About Exploding Tables and Imploding Arrays
Arrays are one of the coolest features an analytics database can offer, because they let you store additional information right at the place and time it happened. Let’s explore some basic examples and then have a look at arrays in Google Analytics 4.
Photo by Torsten Dederichs on Unsplash
For storing sales history, for example, we can just store the products bought in an array together with the purchase event rather than in a separate table — this saves all the SQL join hassle later in the analyses.
And while arrays are not super intu…
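To make the idea concrete, here is a hedged BigQuery sketch using the Python client (project, dataset, and field names are invented): each purchase row carries its products as an items array, and UNNEST explodes them back into rows only when an analysis needs item-level detail.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials and project

# Each purchase row stores its products inline as an array of structs,
# so no join against a separate line-items table is needed.
sql = """
SELECT
  purchase_id,
  item.name,
  item.price
FROM `my_project.shop.purchases`,   -- hypothetical table
  UNNEST(items) AS item             -- explode the array into rows
WHERE purchase_date = '2023-01-01'
"""

for row in client.query(sql).result():
    print(row.purchase_id, row.name, row.price)
```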
1w ago
Importing and concatenating multiple CSV files into one pandas DataFrame
Photo by Daniel K Cheung on Unsplash
CSV (Comma-Separated Values) is a popular file format used to store and exchange data. In fact, this type of source is commonly used for relatively small volumes of data.
pandas is a commonly used Python package that lets developers process and transform data as part of analytical and data science tasks. However, before performing any task, pandas needs to load all the data into memory. This means that the package can only be used for relatively small volumes of data — well t…
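A minimal sketch of the import-and-concatenate pattern the title describes (the directory path is an assumption):

```python
import glob
import pandas as pd

# Collect every CSV in the data/ directory (path is illustrative).
paths = sorted(glob.glob("data/*.csv"))

# Read each file and stack them into a single DataFrame,
# resetting the index so rows are numbered continuously.
df = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)

print(df.shape)
```

Note that this assumes all files share the same columns; mismatched columns are filled with NaN by pd.concat.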
1w ago
Don’t overlook the cloud storage cost
Photo by Nathan Dumlao on Unsplash
In the current economic climate, it’s more important than ever to maximize our cash on hand and develop a series of cost optimization strategies. The growing use of cloud services has brought the business not only many opportunities but also management challenges that can lead to cost overruns and other issues.
FinOps, a newly introduced concept, is an evolving operational framework and cultural shift that allows organizations to get maximum business value by bringing…
1w ago
Compare tables and extract their differences with standard SQL
Photo by Zakaria Ahada on Unsplash
Comparing tables in BigQuery is a crucial task when testing the results of data pipelines and queries prior to productionizing them. The ability to compare tables allows for the detection of any changes or discrepancies in the data, ensuring that the data remains accurate and consistent.
In this article, we will demonstrate how to compare two (or more) tables in BigQuery and extract the records that differ (if any). More specifically, we will showcase how to compare tables with identical…
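The excerpt cuts off before the method, but one standard-SQL way to extract differing records is EXCEPT DISTINCT. A hedged sketch with the BigQuery Python client (dataset and table names are invented, and both tables are assumed to share a schema):

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials and project

# Rows present in one table but not the other, labelled by side.
# Both SELECT lists must match; table names are hypothetical.
sql = """
(SELECT 'only_in_a' AS side, * FROM `my_project.ds.table_a`
 EXCEPT DISTINCT
 SELECT 'only_in_a', * FROM `my_project.ds.table_b`)
UNION ALL
(SELECT 'only_in_b' AS side, * FROM `my_project.ds.table_b`
 EXCEPT DISTINCT
 SELECT 'only_in_b', * FROM `my_project.ds.table_a`)
"""

for row in client.query(sql).result():
    print(dict(row))
```

An empty result means the two tables contain the same distinct rows.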