DuckDB and AWS — How to Aggregate 100 Million Rows in 1 Minute
Data Engineering in Towards Data Science
by Dario Radečić
3d ago
Process huge volumes of data with Python and DuckDB: an AWS S3 example. Photo by Growtika on Unsplash. When companies need a secure, performant, and scalable storage solution, they tend to gravitate toward the cloud. One of the most popular platforms in the game is AWS S3, and for good reason: it’s an industry-leading object storage solution that can serve as a data lake. The question is: can you aggregate S3 bucket data without downloading it? And can you do it fast? The answer is yes to both question ..read more
Visit website
Mastering SAP’s data models
Data Engineering in Towards Data Science
by Ayoub El Outati
4d ago
If you are a curious reader interested in learning more about SAP data models, you’ve come to the right place! Hello Medium readers! I’m excited to share some learnings from a recent project where I dived into the complexity of SAP’s data models. For confidentiality reasons, I’m unable to share all the project details. However, I’ll discuss a challenge I faced regarding the complexity of SAP data models: what does the SAP data architecture look like, and how does everything fit into a coherent data model that makes sense for business users? In this project, my primary focus wa ..read more
Visit website
3 Best Practices for Bridging the Gap Between Engineers and Analysts
Data Engineering in Towards Data Science
by Madison Schott
1w ago
Assigning code owners, hiring analytics engineers, and creating flywheels. Photo by Alex Radelich on Unsplash. As an analytics engineer, one of the most challenging problems I face is bridging the gap between engineering and analytics. Engineering and analytics are often siloed into their own teams, making cross-collaboration quite difficult. Engineering pushes software and data changes that analytics knows nothing about. Analytics is then forced to pivot its work to accommodate these changes. Or worse, analytics must suggest a change that engineering needs to then fight into their tight sc ..read more
Visit website
Create an AI-Driven Movie Quiz with Gemini LLM, Python, FastAPI, Pydantic, RAG and more
Data Engineering in Towards Data Science
by Volker Janz
1w ago
Discover the basics of using Gemini with Python via VertexAI, creating APIs with FastAPI, data validation with Pydantic, and the fundamentals of Retrieval-Augmented Generation (RAG). Photo by Kenny Eliason on Unsplash. In this article, I share some of the basics of creating an LLM-driven web application using various technologies, such as Python, FastAPI, Pydantic, VertexAI and more. You will learn how to create such a project from the very beginning and get an overview of the underlying concepts, including Retrieval-Augmented Generation (RAG). Disclaimer: I am using data from The Mo ..read more
Visit website
Feature Engineering with Microsoft Fabric and Dataflow Gen2
Data Engineering in Towards Data Science
by Roger Noble
1w ago
Fabric Madness part 3. Image by author and ChatGPT. “Design an illustration, featuring a Paralympic basketball player in action, this time the theme is on data pipelines” prompt. ChatGPT, 4, OpenAI, 15 April. 2024. https://chat.openai.com. In the previous post, we discussed how to use Notebooks with PySpark for feature engineering. While Spark offers a lot of flexibility and power, it can be quite complex and requires a lot of code to get started. Not everyone is comfortable with writing code or has the time to learn a new programming language, which is where Dataflow Gen2 comes in. Wh ..read more
Visit website
Need for Speed: cuDF Pandas vs. Pandas
Data Engineering in Towards Data Science
by Thomas Reid
3w ago
Image by Author (Dalle-3). A comparative overview. What is cuDF Pandas? If you’re a user of the Pandas library in Python, and you want or need to minimise your program run times, then you have a few options available to you. Most of these options revolve around the use of external libraries that supplant existing Pandas operations and are optimised for data processing at scale and speed. Examples of these libraries are Vaex, Polars, DuckDB and others. The issue with these is that, in general, they require you to rewrite your code to a greater or lesser extent, which may not be something you w ..read more
Visit website
Fabric Madness
Data Engineering in Towards Data Science
by Roger Noble
3w ago
Predicting basketball games with Microsoft Fabric. Image by author and ChatGPT. “Design an illustration, focusing on a basketball player in action, the design integrates sports and data analytics themes in a graphic novel style” prompt. ChatGPT, 4, OpenAI, 28 March. 2024. https://chat.openai.com. A huge thanks to Martim Chaves, who co-authored this post and developed the example scripts. At the time of writing, it’s basketball season in the United States, and there is a lot of excitement around the men’s and women’s college basketball tournaments. The format is single elimination, so over t ..read more
Visit website
Navigating Your Data Platform’s Growing Pains: A Path from Data Mess to Data Mesh
Data Engineering in Towards Data Science
by Mahdi Karabiben
3w ago
Unlike their software counterparts, data teams lack established methodologies for overcoming scalability challenges. This article offers a set of guiding principles to effectively scale your data platform while maximizing its business impact. Photo by Jack Anstey on Unsplash. When working on software components, developers can leverage a wide range of frameworks, design patterns, and principles to scale their products and seamlessly adjust their architecture to support new use cases and handle increasing usage and complexity. This allows software engineering teams to ensure optimized ..read more
Visit website
How I use Gen AI as a Data Engineer
Data Engineering in Towards Data Science
by Hugo Lu
1M ago
Me using AI. Image by the author. Generative AI is all the rage. In this article we dive into some practical examples for Data Engineers. Introduction. Embedding Generative AI within Data Engineering Workflows and Data Pipelines is actually extremely straightforward and gratifying. As a bridge between software and business users, Data Teams are in an unrivaled position to quickly iterate on Generative AI use-cases that have important impacts on the business. Specifically, Generative AI can be used to summarise vast quantities of both structured and unstructured information, which both expand ..read more
Visit website
Four Data Engineering Projects That Look Great on your CV
Data Engineering in Towards Data Science
by Mike Shakhomirov
1M ago
Data pipelines that would turn you into a decorated data professional. AI-generated image using Kandinsky. In this story, I would like to speak about data engineering career paths and data projects that look great on any CV. If you are an aspiring data practitioner not only willing to learn new tools and techniques but also aiming to build your own data projects portfolio, this article is for you. During my more than 15-year career in data and analytics, I witnessed good and bad CVs showcasing data engineering skills. Data engineering projects you were involved in or responsible for are the ul ..read more
Visit website
