Docker Fundamentals for Data Engineers
Start Data Engineering
3d ago
1. Introduction
2. Docker concepts
2.1. Define the OS and its configurations with an image
2.2. Use the image to run containers
2.2.1. Communicate between containers and local OS
2.2.2. Start containers with docker CLI or compose
2.2.3. Executing commands in your docker container
3. Conclusion
4. References

1. Introduction: Docker can be overwhelming to start with …
Data Engineering Best Practices - #2. Metadata & Logging
Start Data Engineering
1M ago
1. Introduction
2. Setup & Logging architecture
3. Data Pipeline Logging Best Practices
3.1. Metadata: Information about pipeline runs, & data flowing through your pipeline
3.2. Obtain visibility into the code’s execution sequence using text logs
3.3. Understand resource usage by tracking Metrics
3.4. Monitoring UI & Traceability
3.5. Rapid issue identification and resolution with actionable alerts
4. Conclusion
5. …
Uplevel your dbt workflow with these tools and techniques
Start Data Engineering
3M ago
1. Introduction
2. Setup
3. Ways to uplevel your dbt workflow
3.1. Reproducible environment
3.1.1. A virtual environment with Poetry
3.1.2. Use Docker to run your warehouse locally
3.2. Reduce feedback loop time when developing locally
3.2.1. Run only required dbt objects with selectors
3.2.2. Use prod datasets to build dev models with defer
3.2.3. Parallelize model building by increasing thread count
3.…
What is an Open Table Format & Why Use One?
Start Data Engineering
5M ago
1. Introduction
2. What is an Open Table Format (OTF)
3. Why use an Open Table Format (OTF)
3.0. Setup
3.1. Evolve data and partition schema without reprocessing
3.2. See previous point-in-time table state, aka time travel
3.3. Git-like branches & tags for your tables
3.4. Handle multiple reads & writes concurrently
4. Conclusion
5. Further reading
6. …
6 Steps to Avoid Messy Data in Your Warehouse
Start Data Engineering
6M ago
1. Introduction
2. Six Steps for a Clean Data Warehouse
2.1. Understand the business
2.2. Make data easy to use with the appropriate data model
2.3. Good input data is necessary for a good data warehouse
2.4. Define Source of Truth (SOT) and trace its usage
2.5. Keep stakeholders in the loop for a more significant impact
2.6. Watch out for org-level red flags
3. …
Data Engineering Best Practices - #1. Data flow & Code
Start Data Engineering
8M ago
1. Introduction
2. Sample project
3. Best practices
3.1. Use standard patterns that progressively transform your data
3.2. Ensure data is valid before exposing it to its consumers (aka data quality checks)
3.3. Avoid data duplicates with idempotent pipelines
3.4. Write DRY code & keep I/O separate from data transformation
3.5. Know the when, how, & what (aka metadata) of pipeline runs for easier debugging
3.…
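The idempotent-pipeline practice this post's outline mentions (3.3) is commonly implemented as a delete-then-insert (partition overwrite): rerunning a load for the same run date replaces that date's rows instead of appending duplicates. A minimal sketch in plain Python, where an in-memory list stands in for the warehouse table; all names here are illustrative, not taken from the post:

```python
from datetime import date

# Toy "warehouse table": a list of row dicts, partitioned by run_date.
warehouse_table = []

def run_pipeline(run_date, rows):
    """Idempotent load: delete the run_date partition, then insert.

    Re-running for the same run_date produces the same table state,
    so retries and backfills never create duplicate rows.
    """
    global warehouse_table
    # 1. Remove any rows left by a previous run for this partition.
    warehouse_table = [r for r in warehouse_table if r["run_date"] != run_date]
    # 2. Insert the freshly computed rows, stamped with the partition key.
    warehouse_table.extend({**row, "run_date": run_date} for row in rows)

d = date(2024, 1, 1)
run_pipeline(d, [{"user": "a"}, {"user": "b"}])
run_pipeline(d, [{"user": "a"}, {"user": "b"}])  # retry: still 2 rows, no duplicates
```

The same shape applies to a real warehouse: replace the list comprehension with a `DELETE ... WHERE run_date = :d` (or a partition overwrite) executed in the same transaction as the insert.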
What is a self-serve data platform & how to build one
Start Data Engineering
10M ago
1. Introduction
2. What is self-serve?
2.1. Components of a self-serve platform
3. Building a self-serve data platform
3.1. Creating dataset(s)
3.1.1. Gather requirements
3.1.2. Get data foundations right
3.2. Accessing data
3.3. Identify and remove dependencies
4. Conclusion
5. Further reading
6. References

1. Introduction: Most companies want to build a self-serve data platform …
How to become a valuable data engineer
Start Data Engineering
11M ago
1. Introduction
2. Skills
2.1. Business Impact
2.1.1. Know your business
2.1.2. Money & Time
2.2. Technical skills
3. Build impactful projects
4. Conclusion
5. Further reading

1. Introduction: So you are a new data engineer (or looking for a DE job) and want to improve as a data engineer. However, when you look at job postings or a company's tech stack, you are overwhelmed by the sheer number of tools you have to learn …
Data Pipeline Design Patterns - #2. Coding patterns in Python
Start Data Engineering
1y ago
Introduction
Sample project
Code design patterns
1. Functional design
2. Factory pattern
3. Strategy pattern
4. Singleton & Object pool patterns
Python helpers
1. Typing
2. Dataclass
3. Context Managers
4. Testing with pytest
5. Decorators
Misc
Conclusion
Further reading
References

Introduction: Using the appropriate code design pattern can make your code easier to read, extend, modify, and debug, and help developers onboard quicker …
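The Factory and Strategy patterns this post's outline lists combine naturally in a data pipeline: each transformation is an interchangeable callable with one shared signature (Strategy), and a factory picks one by a config-driven name. A minimal sketch with hypothetical `upper`/`lower` transforms — the names are illustrative, not taken from the post:

```python
from typing import Callable, Dict, List

# Strategy pattern: every transform is a callable with the same
# signature, List[str] -> List[str], so callers can swap them freely.
def upper_transform(rows: List[str]) -> List[str]:
    return [r.upper() for r in rows]

def lower_transform(rows: List[str]) -> List[str]:
    return [r.lower() for r in rows]

# Factory pattern: resolve a strategy from its name, e.g. read from
# pipeline config. Adding a transform means adding one dict entry;
# the calling code never changes.
_TRANSFORMS: Dict[str, Callable[[List[str]], List[str]]] = {
    "upper": upper_transform,
    "lower": lower_transform,
}

def get_transform(name: str) -> Callable[[List[str]], List[str]]:
    try:
        return _TRANSFORMS[name]
    except KeyError:
        raise ValueError(f"unknown transform: {name}") from None

result = get_transform("upper")(["a", "b"])  # → ["A", "B"]
```

The payoff is in the caller: a pipeline step can be configured as `{"transform": "upper"}` and resolved at runtime, with unknown names failing fast and loudly.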
Stitch S3 DB Integration
Start Data Engineering
1y ago
Given:
- Source S3 path and file delimiter
- Data warehouse connection details (endpoint, port, username, password, and database name)
- Data warehouse schema name and table name
- Run frequency

Steps:
1. Log into your Stitch account.
2. Click on the Destination tab and use the data warehouse connection details to establish a destination database.
3. Click the Add Integration button on your dashboard.
4. Select Amazon S3 CSV as the integration on the next page. …
