Modeling Slowly Changing Dimensions
Data Engineering in Towards Data Science
by Giorgos Myrianthous
1w ago
A deep dive into the various SCD types and how they can be implemented in Data Warehouses. Photo by Pawel Czerwinski on Unsplash. In today’s dynamic and competitive landscape, modern organisations heavily invest in their data assets. This investment ensures that teams across the entire organisational spectrum — spanning leadership, product, engineering, finance, marketing, and human resources — can make informed decisions. Consequently, data teams play a pivotal role in enabling organisations to rely on data-driven decision-making processes. However, merely having a robust and scalable d ..read more
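To make the idea concrete, below is a minimal sketch of the most common variant, SCD Type 2, in plain Python; the dimension layout and column names (valid_from, valid_to, is_current) are assumptions for illustration, not the article's implementation.

```python
from datetime import date

# Minimal SCD Type 2 sketch (illustrative only): a dimension row is never
# overwritten; the old version is closed out and a new version is appended.
# The column names valid_from / valid_to / is_current are assumptions.

dim_customers = [
    {"customer_id": 1, "city": "London", "valid_from": date(2023, 1, 1),
     "valid_to": None, "is_current": True},
]

def apply_scd2_update(dim_rows, customer_id, new_city, change_date):
    """Expire the current row for the customer and append a new version."""
    for row in dim_rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"] = change_date
            row["is_current"] = False
    dim_rows.append({
        "customer_id": customer_id, "city": new_city,
        "valid_from": change_date, "valid_to": None, "is_current": True,
    })

apply_scd2_update(dim_customers, customer_id=1, new_city="Berlin",
                  change_date=date(2024, 5, 1))
# dim_customers now holds two versions: the expired London row and the current Berlin row.
```

In a warehouse this bookkeeping would typically be expressed as a MERGE statement, but the logic is the same: expire the old version and append the new one.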
How to Supercharge Your Python Classes with Class Methods
Data Engineering in Towards Data Science
by Siavash Yasini
1w ago
There could be more than one way into a Python class — Image by Midjourney, modified by author. Four advanced tricks to give your data science and machine learning classes the edge you never knew they needed. Python classes provide a great framework for creating objects that can handle complex data structures, processes, pipelines, algorithms, or machine learning models. Object-Oriented Programming (OOP) offers a ton of modularity and reusability, which enables data scientists and machine learning engineers to develop flexible and scalable codebases. Personally, I find structuring my c ..read more
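As a taste of what class methods buy you, here is a small sketch of the classic alternative-constructor pattern; the Dataset class and its fields are invented for illustration and are not from the article.

```python
import json

class Dataset:
    """Toy container; the class name and fields are illustrative only."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows

    @classmethod
    def from_json(cls, payload: str) -> "Dataset":
        # Alternative constructor: build the instance from a JSON string
        # instead of passing the fields directly.
        data = json.loads(payload)
        return cls(name=data["name"], rows=data["rows"])

ds = Dataset.from_json('{"name": "clicks", "rows": [{"user": 1}, {"user": 2}]}')
print(ds.name, len(ds.rows))  # clicks 2
```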
How to Build Data Pipelines for Machine Learning
Data Engineering in Towards Data Science
by Shaw Talebi
1w ago
A beginner-friendly introduction with Python code. This is the 3rd article in a larger series on Full Stack Data Science (FSDS). In the previous post, I introduced a 5-step project management framework for building machine learning (ML) solutions. While ML may bring to mind fancy algorithms and technologies, the quality of an ML solution is determined by the quality of the available data. This raises the need for data engineering (DE) skills in FSDS. This article will discuss the most critical DE skills in this context and walk through a real-world example. Photo by the blowup on Unsp ..read more
My First Billion (of Rows) in DuckDB
Data Engineering in Towards Data Science
by João Pedro
1w ago
First Impressions of DuckDB handling 450 GB in a real project. Duck blueprint, generated by Copilot Designer. Introduction: The fields of AI, Data Science, and Data Engineering are progressing at full steam. Every day new tools, new paradigms, and new architectures are created, always trying to solve the problems of the previous ones. In this sea of new opportunities, it’s interesting to know a little about the available tools to solve problems efficiently. And I’m not talking only about the technicalities, but about the scope of use, advantages, disadvantages, challenges, and opportunities, someth ..read more
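For readers who have not tried DuckDB yet, here is a minimal sketch of its Python API; the Parquet file name and columns are placeholders, and this is not the 450 GB project from the article.

```python
import duckdb

# Query a Parquet file in place with DuckDB, without loading it into memory
# as a whole first. The file path and column names below are placeholders.
con = duckdb.connect()  # in-memory database
result = con.execute(
    """
    SELECT category, COUNT(*) AS n, AVG(amount) AS avg_amount
    FROM read_parquet('events.parquet')
    GROUP BY category
    ORDER BY n DESC
    """
).fetchdf()  # returns a pandas DataFrame (requires pandas to be installed)
print(result.head())
```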
Models, MLFlow, and Microsoft Fabric
Data Engineering in Towards Data Science
by Roger Noble
1w ago
Fabric Madness, part 5. Image by author and ChatGPT. “Design an illustration, with imagery representing multiple machine learning models, focusing on basketball data” prompt. ChatGPT, 4, OpenAI, 25th April 2024. https://chat.openai.com. A huge thanks to Martim Chaves, who co-authored this post and developed the example scripts. So far in this series, we’ve looked at how to use Fabric for collecting data, feature engineering, and training models. But now that we have our shiny new models, what do we do with them? How do we keep track of them, and how do we use them to make predicti ..read more
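As context for the model-tracking discussion, here is a bare-bones MLflow tracking sketch; it is generic rather than Fabric-specific, and the experiment name, parameters, and metric values are placeholders.

```python
import mlflow

# Generic MLflow tracking sketch (not Fabric-specific): record parameters and
# a metric under a named experiment. All names and values are placeholders.
mlflow.set_experiment("basketball-models")

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("model_type", "logistic_regression")
    mlflow.log_param("C", 1.0)
    mlflow.log_metric("accuracy", 0.87)
    # mlflow.sklearn.log_model(model, "model")  # once an actual fitted model exists
```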
The Foundation of Data Validation
Data Engineering in Towards Data Science
by Chengzhi Zhao
1w ago
Discussing the basic principles and methodology of data validation. Photo by Vardan Papikyan on Unsplash. Although it may not be the most glamorous aspect of data work, data validation is crucial to any data-related task. Data validation can be tedious. When you think of data validation, what is the first thing that comes to mind? Endless spreadsheets? Multiple layers of SQL subqueries? I wish I could say data validation is fun. Unfortunately, it's not a paved-road process. If you are reading this blog post, you may have faced the challenge of data validation before, or you might ..read more
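To ground the discussion, here is a minimal sketch of rule-based validation with pandas; the DataFrame and check names are invented for illustration and are not the article's methodology.

```python
import pandas as pd

# Rule-based validation sketch: each check yields a boolean Series,
# and failing rows are surfaced for review. Data and checks are made up.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount": [10.0, -5.0, 30.0, None],
})

checks = {
    "order_id_unique": ~df["order_id"].duplicated(keep=False),
    "amount_not_null": df["amount"].notna(),
    "amount_non_negative": df["amount"].fillna(0) >= 0,
}

for name, passed in checks.items():
    failures = df[~passed]
    if not failures.empty:
        print(f"Check '{name}' failed for {len(failures)} row(s):")
        print(failures)
```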
The Stream Processing Model Behind Google Cloud Dataflow
Data Engineering in Towards Data Science
by Vu Trinh
1w ago
Balancing correctness, latency, and cost in unbounded data processing. Image created by the author. This was originally published at https://vutr.substack.com. Table of contents: Before we move on; Introduction from the paper; The details of the Dataflow model; Implementation and designs of the model. Intro: Google Dataflow is a fully managed data processing service that provides serverless, unified stream and batch data processing. It is the first choice Google would recommend when dealing with a stream processing workload. The service promises to ensure correctness ..read more
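The grouping idea at the heart of the Dataflow model, assigning unbounded events to event-time windows, can be sketched in a few lines of plain Python; this is a conceptual illustration only, not Google's API, and the events are made up.

```python
from collections import defaultdict

# Fixed (tumbling) event-time windows in plain Python: a conceptual sketch of
# the grouping idea behind the Dataflow model, not the Dataflow/Beam API.
events = [
    {"user": "a", "event_time": 12, "value": 1},
    {"user": "b", "event_time": 47, "value": 3},
    {"user": "a", "event_time": 65, "value": 2},
]

WINDOW_SIZE = 60  # seconds per window

def window_start(ts: int) -> int:
    # Map an event timestamp to the start of its fixed window.
    return (ts // WINDOW_SIZE) * WINDOW_SIZE

sums = defaultdict(int)
for e in events:
    sums[window_start(e["event_time"])] += e["value"]

print(dict(sums))  # {0: 4, 60: 2}
```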
Delta Lake — Type Widening
Data Engineering in Towards Data Science
by Vitor Teixeira
1w ago
Delta Lake — Type Widening. What is type widening and why does it matter? Photo by Luca Florio on Unsplash. Delta Lake is on its way to releasing a new major version, and, of course, it brings plenty of features that the community has been eagerly awaiting. One of them is called Type Widening, and this post will be dedicated to explaining what it is and why it is useful. “The only constant in life is change” — Heraclitus. Heraclitus’s quote applies not only to the world we live in but also to data, the information that describes it. We’re in an era of fast-paced ..read more
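To illustrate what "widening" means at the type level, here is a small PyArrow sketch; it demonstrates only the underlying type promotion and is not Delta Lake's Type Widening feature itself.

```python
import pyarrow as pa

# Type widening at the value level, illustrated with PyArrow (not Delta Lake):
# an int32 column can be promoted to int64 without losing any values.
narrow = pa.array([1, 2, 3], type=pa.int32())
wide = narrow.cast(pa.int64())

print(narrow.type, "->", wide.type)  # int32 -> int64
print(wide.to_pylist())              # [1, 2, 3], values preserved exactly
```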
Transform Data with Hyperbolic Sine
Data Engineering in Towards Data Science
by David Kyle
1w ago
Why handling negative values should be a cinch. Photo by Osman Rana on Unsplash. Many models are sensitive to outliers, such as linear regression, k-nearest neighbors, and ARIMA. Machine learning algorithms suffer from overfitting and may not generalize well in the presence of outliers.¹ However, the right transformation can shrink these extreme values and improve your model’s performance. Transformations for data with negative values include: Shifted Log, Shifted Box-Cox, Inverse Hyperbolic Sine, and Sinh-arcsinh. Log and Box-Cox are effective tools when working with positive data, but ..read more
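A quick NumPy sketch of the inverse hyperbolic sine shows why it is attractive for data that includes zeros and negatives; the sample values below are arbitrary.

```python
import numpy as np

# Inverse hyperbolic sine as a transform: unlike log, it is defined for zero
# and negative values, and it compresses large magnitudes symmetrically.
x = np.array([-1000.0, -10.0, 0.0, 10.0, 1000.0])

transformed = np.arcsinh(x)       # equals log(x + sqrt(x**2 + 1))
recovered = np.sinh(transformed)  # exact inverse, so the transform is reversible

print(np.round(transformed, 3))   # [-7.601 -2.998  0.     2.998  7.601]
print(np.allclose(recovered, x))  # True
```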
DuckDB and AWS — How to Aggregate 100 Million Rows in 1 Minute
Data Engineering in Towards Data Science
by Dario Radečić
2w ago
DuckDB and AWS — How to Aggregate 100 Million Rows in 1 Minute. Process huge volumes of data with Python and DuckDB — an AWS S3 example. Photo by Growtika on Unsplash. When companies need a secure, performant, and scalable storage solution, they tend to gravitate toward the cloud. One of the most popular platforms in the game is AWS S3 — and for a good reason — it’s an industry-leading object storage solution that can serve as a data lake. The question is — can you aggregate S3 bucket data without downloading it? And can you do it fast? The answer is yes to both question ..read more
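The pattern the title hints at can be sketched as follows, using DuckDB's httpfs extension to read Parquet straight from S3; the bucket path, region, and column names are placeholders rather than the article's dataset, and credential setup may vary with your DuckDB version and AWS configuration.

```python
import duckdb

# Aggregate Parquet files in S3 without downloading them first.
# The bucket path and columns are placeholders; AWS credentials are assumed
# to be available to DuckDB (e.g. via standard AWS environment variables).
con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("SET s3_region = 'eu-west-1';")  # adjust to your bucket's region

df = con.execute(
    """
    SELECT station, ROUND(AVG(temperature), 2) AS avg_temp
    FROM read_parquet('s3://my-bucket/measurements/*.parquet')
    GROUP BY station
    """
).fetchdf()
print(df.head())
```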
