Data Engineering in Towards Data Science on Feedspot

DuckDB and AWS — How to Aggregate 100 Million Rows in 1 Minute

Data Engineering in Towards Data Science

by Dario Radečić

3d ago

Process huge volumes of data with Python and DuckDB: an AWS S3 example. (Photo by Growtika on Unsplash.) When companies need a secure, performant, and scalable storage solution, they tend to gravitate toward the cloud. One of the most popular platforms in the game is AWS S3, and for good reason: it’s an industry-leading object storage solution that can serve as a data lake. The question is: can you aggregate S3 bucket data without downloading it? And can you do it fast? The answer is yes to both question...

Mastering SAP’s data models

by Ayoub El Outati

4d ago

If you are a curious reader interested in learning more about SAP data models, you’ve come to the right place! Hello Medium readers! I’m excited to share some learnings from a recent project where I dived into the complexity of SAP’s data models. For confidentiality reasons, I’m unable to share all the project details. However, I’ll discuss a challenge I faced regarding the complexity of SAP data models: what does the SAP data architecture look like, and how does everything fit into a coherent data model that makes sense for business users? In this project, my primary focus wa...

3 Best Practices for Bridging the Gap Between Engineers and Analysts

by Madison Schott

1w ago

Assigning code owners, hiring analytics engineers, and creating flywheels. (Photo by Alex Radelich on Unsplash.) As an analytics engineer, one of the most challenging problems I face is bridging the gap between engineering and analytics. Engineering and analytics are often siloed into their own teams, making cross-team collaboration quite difficult. Engineering pushes software and data changes that analytics knows nothing about. Analytics is then forced to pivot its work to accommodate these changes. Or worse, analytics must suggest a change that engineering then needs to fit into their tight sc...

Create an AI-Driven Movie Quiz with Gemini LLM, Python, FastAPI, Pydantic, RAG and more

by Volker Janz

1w ago

Discover the basics of using Gemini with Python via VertexAI, creating APIs with FastAPI, data validation with Pydantic, and the fundamentals of Retrieval-Augmented Generation (RAG). (Photo by Kenny Eliason on Unsplash.) In this article, I share some of the basics of creating an LLM-driven web application using technologies such as Python, FastAPI, Pydantic, VertexAI and more. You will learn how to create such a project from the very beginning and get an overview of the underlying concepts, including Retrieval-Augmented Generation (RAG). Disclaimer: I am using data from The Mo...

Feature Engineering with Microsoft Fabric and Dataflow Gen2

by Roger Noble

1w ago

Fabric Madness part 3. (Image by author and ChatGPT. “Design an illustration, featuring a Paralympic basketball player in action, this time the theme is on data pipelines” prompt. ChatGPT, 4, OpenAI, 15 April 2024. https://chat.openai.com.) In the previous post, we discussed how to use Notebooks with PySpark for feature engineering. While Spark offers a lot of flexibility and power, it can be quite complex and requires a lot of code to get started. Not everyone is comfortable writing code or has the time to learn a new programming language, which is where Dataflow Gen2 comes in. Wh...

Need for Speed: cuDF Pandas vs. Pandas

by Thomas Reid

3w ago

(Image by author, DALL-E 3.) A comparative overview. What is cuDF Pandas? If you’re a user of the Pandas library in Python and you want or need to reduce your program run times, you have a few options available. Most of these options revolve around external libraries that supplant existing Pandas operations and are optimised for data processing at scale and speed. Examples of these libraries are Vaex, Polars, DuckDB and others. The issue with these is that, in general, they require you to rewrite your code to a greater or lesser extent, which may not be something you w...

Fabric Madness

by Roger Noble

3w ago

Predicting basketball games with Microsoft Fabric. (Image by author and ChatGPT. “Design an illustration, focusing on a basketball player in action, the design integrates sports and data analytics themes in a graphic novel style” prompt. ChatGPT, 4, OpenAI, 28 March 2024. https://chat.openai.com.) A huge thanks to Martim Chaves, who co-authored this post and developed the example scripts. At the time of writing, it’s basketball season in the United States, and there is a lot of excitement around the men’s and women’s college basketball tournaments. The format is single elimination, so over t...

Navigating Your Data Platform’s Growing Pains: A Path from Data Mess to Data Mesh

by Mahdi Karabiben

3w ago

Unlike their software counterparts, data teams lack established methodologies for overcoming scalability challenges. This article offers a set of guiding principles to effectively scale your data platform while maximizing its business impact. (Photo by Jack Anstey on Unsplash.) When working on software components, developers can leverage a wide range of frameworks, design patterns, and principles to scale their products and seamlessly adjust their architecture to support new use cases and handle increasing usage and complexity. This allows software engineering teams to ensure optimized...

How I use Gen AI as a Data Engineer

by Hugo Lu

1M ago

(Me using AI. Image by the author.) Generative AI is all the rage. In this article we dive into some practical examples for Data Engineers. Introduction: Embedding Generative AI within Data Engineering Workflows and Data Pipelines is actually extremely straightforward and gratifying. As a bridge between software and business users, Data Teams are in an unrivaled position to quickly iterate on Generative AI use-cases that have important impacts on the business. Specifically, Generative AI can be used to summarise vast quantities of both structured and unstructured information, which both expand...

Four Data Engineering Projects That Look Great on your CV

by Mike Shakhomirov

1M ago

Data pipelines that would turn you into a decorated data professional. (AI-generated image using Kandinsky.) In this story, I would like to talk about data engineering career paths and data projects that look great on any CV. If you are an aspiring data practitioner who is not only willing to learn new tools and techniques but also aiming to build your own data project portfolio, this article is for you. During my career of more than 15 years in data and analytics, I have seen good and bad CVs showcasing data engineering skills. Data engineering projects you were involved in or responsible for are the ul...
