Last fall Amazon announced its search for a second headquarters, considering applications from a whopping 200 cities. Now, those initial 200 have been narrowed down to 20 “Amazon HQ2” hopefuls that span the country and include one Canadian city, Toronto. For these shortlisted cities, there is a lot at stake—the winning city stands to gain 50,000 high-paying jobs over the next few years and tens of billions of dollars’ worth of investment.
There has been much speculation as to which city has the best chance of being selected, but here at Trifacta, we decided to take a data-driven approach to predicting Amazon HQ2. Based upon publicly available Census data, and by leveraging Wrangler, we measured how the cities on Amazon’s shortlist stack up in terms of population, education, key occupational profiles and diversity in order to make an educated guess about the ideal HQ2.
Based on our analysis, we have concluded that the Washington DC metro area is the best choice for Amazon’s second headquarters. Within this area, we believe Northern Virginia to be the best sub-location. The rest of this article describes how we arrived at this conclusion.
We used data provided by the U.S. Census Bureau’s American Community Survey as the basis of our analysis, specifically pulling from the Public Use Microdata Sample (PUMS), which provides rich demographic detail. Relying solely on U.S. Census data meant excluding Toronto from consideration, but we agreed that was acceptable—given the political implications of an international choice, we believe the odds of Toronto winning are very low. That left us with 19 remaining candidates and, using Trifacta’s data preparation platform, we got to work.
The high-level steps that we took to prepare the data in Wrangler were as follows:
First, we imported the 5-year PUMS data for the entire U.S., which was split into four files.
Then, we brought in a mapping table that maps Public Use Microdata Areas (PUMAs) to Combined Statistical Areas (CSAs). This table was produced using a “spatial join” in QGIS, an open source geographical information system.
Next, we combined the four files from step 1 into a single dataset using the “union” function in Trifacta, then performed a lookup against the results of step 2 to add CSAs to the core dataset.
We limited the dataset to only the necessary rows and columns, and prepared a lookup table from the PUMS data dictionary. We used this lookup table to convert codes in the core dataset to human readable text.
Finally, we ran a job in Trifacta to produce a .TDE file for visualization in Tableau to assess the outcome of our data.
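For readers who prefer code, the same union-and-lookup flow can be sketched in plain Python. The records, mapping values, and code labels below are toy stand-ins, not real Census data; in Trifacta these steps are performed without any code:

```python
# Toy stand-ins for two of the four PUMS file parts (column names follow the
# PUMS schema; the values are illustrative, not real Census records)
part_a = [{"ST": "51", "PUMA": "51255", "SCHL": "21"}]
part_b = [{"ST": "11", "PUMA": "00101", "SCHL": "22"}]

# Union the file parts into a single dataset
pums = part_a + part_b

# Lookup: add a CSA to each record via a PUMA -> CSA mapping table
# (the real mapping was produced with a spatial join in QGIS)
puma_to_csa = {
    ("51", "51255"): "Washington-Baltimore-Arlington",
    ("11", "00101"): "Washington-Baltimore-Arlington",
}
for row in pums:
    row["CSA"] = puma_to_csa.get((row["ST"], row["PUMA"]))

# Convert codes to human-readable text using a data-dictionary lookup
schl_labels = {"21": "Bachelor's degree", "22": "Master's degree"}
for row in pums:
    row["education"] = schl_labels.get(row["SCHL"], "Unknown")
```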
Once in Tableau, the first step was rank-ordering these cities by population. Size is a leading indicator of whether a city can provide Amazon with a deep labor pool and adequate support infrastructure. To measure size comparably, we used the Combined Statistical Area construct defined by the United States Office of Management and Budget, which takes a major city and its surroundings into account in a consistent way.
Figure 2: Population
Note that some of the “cities” on Amazon’s shortlist (Newark, Montgomery County and Northern Virginia) are actually part of another Combined Statistical Area.
The next key criterion is the availability of a well-educated labor force, which is a fundamental requirement for a robust talent pool. Here is how the cities stack up in terms of proportion of population with bachelor’s degrees or higher.
Figure 3: Education
Washington DC ranks 3rd overall. But when you specifically look at STEM skills, Washington DC rises to the top. For a technology company like Amazon, we believe that access to a rich pool of STEM talent is a key enabler for future growth.
Figure 4: STEM Rank
The census data also provides an alternate perspective of the talent pool based on occupational profiles. In particular, there are five occupational profiles that Amazon has identified as being particularly important: software development, legal, accounting, management and administrative. Washington DC has a robust labor pool in all of these occupations.
Figure 5: Occupational Profile
Having a diverse and multicultural workforce can provide a distinct advantage for a company that wants to grow rapidly and encourage innovation. For this reason, we looked at diversity, both in terms of gender and racial identities. In both areas, Washington DC shines due to its vibrant and diverse workforce.
Washington DC tops the charts in terms of female talent in the “Computer & Math” occupational group. Washington DC also has a diverse population with many races and cultures.
Figure 6: Gender Diversity
Figure 7: Racial Diversity
For the reasons listed above, we believe the Washington DC metro area is the ideal location for Amazon’s second headquarters. Its Combined Statistical Area contains three of the cities on Amazon’s shortlist: Washington DC, Northern Virginia and Montgomery County (Maryland). While all three are excellent choices, we believe the selection between them comes down to qualitative factors. We picked Northern Virginia because it is already the site of one of Amazon’s largest data center regions (US-East) and enjoys relative proximity to top-notch Virginia universities that will provide an ongoing source of tech talent.
Please see this interactive dashboard for more details and the ability to drill into the data used in this article.
Guest Contributor: Christine is the Director of Business Program Management in Microsoft’s Partner Experience & Startup Team in the Cloud + Enterprise engineering division. The team’s charter is to win the hearts and minds of Microsoft’s cloud innovation partners and design the pathways for startups to partner successfully with Microsoft. Specifically, Christine leads partner go-to-market for marketplace, and go-to-market and co-sell motions for startups. Prior to this role, Christine held a variety of leadership roles in product and program management at SMART Technologies, Johnson & Johnson, and American Express. Christine is an MBA graduate of the London Business School with an undergraduate degree from Yale University. Outside of work Christine loves to run (although not quickly), play tennis and force friends to try her favorite restaurants.
With today’s fast moving technology and abundance of data sources, gaining a complete view of your customer is increasingly challenging and critical. This includes campaign interaction, opportunities for marketing optimization, current engagement, and recommendations for next best action. To continuously drive business growth, financial services organizations are especially focused on innovation and speed-to-market in this area, as they look to overcome the added challenge of implementing and integrating best-of-breed solutions jointly, to quickly gain that 360-degree view of the customer.
To address these needs in an accelerated way, Bardess is bringing together the technology of Cloudera, Qlik, and Trifacta, along with their own accelerators and industry expertise, to deliver rapid value to customers.
By combining Cloudera’s modern platform for machine learning and analytics, Qlik’s powerful, agile business intelligence and analytics suite, Trifacta’s data preparation platform, and Bardess accelerators, organizations can uncover insights and easily build comprehensive views of their customers across multiple touch points and enterprise systems.
The solution offers a complete platform for Customer 360 workloads available on Microsoft Azure in minutes, enabling:
Modernized data management, optimized for the cloud, to transform complex data into clear and actionable insights with Cloudera Enterprise.
Democratization of your analytics by empowering business users to prep their data for analysis using Trifacta.
Identification of patterns, relationships and outliers in vast amounts of data in visually compelling ways using Qlik.
Artificial Intelligence (AI), Machine Learning (ML), predictive, prescriptive and geospatial capabilities to fully leverage data assets using Cloudera Data Science Workbench.
Building, testing, deploying, and management of workloads in the cloud through Microsoft Azure.
Accelerated implementation and industry best practices through Bardess services.
Learn more about Customer 360 Powered by Zero2Hero on the Azure Marketplace, and look for more integrated solutions, offering best-of-breed technology via a unified buying and implementation experience enabled by a systems integrator, coming to the Azure Marketplace in July.
Connected devices, or what is more broadly referred to as the “Internet of Things (IoT),” have been a growing presence in our everyday lives. From watches that monitor our health to thermostats controlled by smartphones, these devices are increasing the amount of information available to us at our fingertips. Now, the insurance industry is using the same connected device model to alert policyholders about potential disasters. Dubbed “Connected Insurance,” it might include alerts about a fire or gas leak on your mobile device or real-time video interaction to help halt disasters in their tracks. These services rely on sensor equipment to pick up early stage disasters in the making, and can include emergency repair services that stop small-scale issues from turning into vast unwieldy claims.
The emergence of connected insurance will involve a shift from traditional ‘after the event’ policy models towards those that incorporate ways to preemptively reduce risk (and simultaneously improve the overall experience of policyholders). Selling premiums in isolation is a static market already in decline, yet connected insurance, with loss avoidance as the key driver, will continue to grow. This brave new world of connected insurance clearly relies on two things: good-quality data and timely data. Yet when I review some of my clients’ existing data sets, this fully integrated future can seem unattainable, a million miles away. But looking around, connected insurance services are already with us and successfully on the market, so how do they do it?
Connected Insurance Demands a New Approach to Data Preparation
Under the premium-led model, most data is historical and stored in traditional legacy system extracts via the usual Extract, Transform and Load (ETL) process. This rigid process adequately serves standard reporting efforts: business users submit data requirements to IT and, after a certain amount of time, IT delivers insights “after the event.” With the rise of connected insurance, however, new demands are placing a strain on this approach.
Connected insurance involves more complex data feeds at a higher frequency. Predictive models built with machine learning (ML) and artificial intelligence (AI) are no longer simply ‘nice to have’ but essential, which adds to the difficulties around data capture and integration. To thrive in this new environment, these extra data feeds need to be processed quickly and efficiently through an agile approach that does not bottleneck on the central IT team.
Organisations that have succeeded in quickly leveraging the complex data associated with connected insurance are those that have embraced new approaches to data preparation, such as Trifacta’s data preparation platform. This technology allows users to quickly incorporate additional data feeds into an end-to-end connected insurance solution, rather than waiting while the IT team incorporates new feeds and integration points within the enterprise data warehouse. These tools allow business users to add new functionality and add-on services independently with minimal support, which of course accelerates adoption.
Real-life Examples of Connected Insurance
To see connected insurance in action, we only have to look at commercial property insurance, where insurers are leveraging connected devices to insure an office building. Instead of issuing a standard policy, they now install multiple sensors within the building, such as fire alarm systems, carbon monoxide detectors and sprinklers, in order to provide a managed service. Each sensor plays an important role in building varied and rich data sets to determine risk and, as the service matures, the ability to add further data points and external lookups will greatly enhance the available features and offerings. Data preparation technologies are essential to wrangling this data and making such scenarios a reality.
The same principles apply to life and health insurance, where early identification of health risks, such as a heart attack or stroke, by way of efficient data preparation is quite literally saving lives. These amazing new services, powered by data from wearables, may become so integrated with the cover itself that the consumer no longer distinguishes between the provider of vital, life-saving data and the end insurer. Beyond the intended risk prevention, there are other, previously unimaginable uses of sensors that can have transformative outcomes. These incredible new insights, opened up via analytics, might show us new ways to live a long and healthy life.
The Future of Connected Insurance
In this new connected world, with sensors collecting data on potential risks, powerful real-time data flows will generate an entire host of beneficial new services, capable of providing tailored value for both individuals and organisations. The ability to control and master rich new data sets is vital to future services in connected insurance, and analytics are key to interpreting value in this brand new world. Both insurers and the InsureTechs now need to work together to successfully deliver a range of exciting new services, harnessing rich data sets and interpreting them with AI/ML, always conscious of their interdependence.
Some may be sceptical, thinking ‘we’ve heard it all before’, perhaps recalling the advent of telematics some years ago, when promises that the new technology would transform the face of motor insurance largely failed to materialise. This, however, is a fairly typical scenario in the introductory stages of new technology, which tends to pass through a ‘hype cycle’ of enormous initial excitement followed by an element of disillusionment. It is in the later phases, when the data becomes fully utilised, that real opportunities materialise. In this manner, risk prevention will prove a rich area for development and growth, capable of offering far greater opportunities than motor insurance has ever seen before. We know this is the way forward; we only have to look at the 300% increase in InsurTech investment in 2017 to understand the huge appetite for this new and exciting area of disruption.
Standalone software tools designed solely for data preparation have become a necessary part of a modern analytics stack. The rising size and complexity of datasets has resulted in a need for correspondingly powerful, flexible solutions. Trifacta was designed to help reduce data cleaning and preparation time by:
Enabling better up-front assessment of data sources
Supplying smart extraction that learns preferences over time
Providing easy-to-use, intelligent, beautiful visualizations that improve data understanding and speed up insights
Crunch the numbers faster
Combine rows from separate datasets to create a new dataset, or pick and choose only the columns you need from multiple datasets. A smaller working dataset means faster load times for your machine.
Wrangle big data
Work without fear of something becoming too big to open. Trifacta handles datasets of any size.
Point and click
Forget building and troubleshooting complex VLOOKUP formulas. Just point and click to perform data operations.
Easily manipulate data elements
Aggregate and multiply data, creating new columns to reference on the fly. For example, you can calculate a percent change and then store that as its own value to graph.
Debug as you go
Why wait until you’re done to see if you’re building something right? Trifacta’s tools update as you go so you can spot potential problems right away.
Collaborate with coworkers
Save and share joins and lookups with colleagues.
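As an illustration of the percent-change example above, here is a minimal plain-Python sketch with made-up monthly sales figures; in Trifacta, the same derived column is created through the visual interface rather than code:

```python
# Hypothetical monthly totals; derive a percent-change column to graph
rows = [{"month": "Jan", "sales": 200.0},
        {"month": "Feb", "sales": 250.0},
        {"month": "Mar", "sales": 225.0}]

# Compare each month with the previous one and store the result
# as its own column, ready to plot
for prev, cur in zip(rows, rows[1:]):
    cur["pct_change"] = round((cur["sales"] - prev["sales"]) / prev["sales"] * 100, 1)
rows[0]["pct_change"] = None  # no prior month to compare against
```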
We love Excel, and we use it here at Trifacta all the time, particularly the VLOOKUP function. But we know VLOOKUP is also sometimes painful and cumbersome to use when you’re in a hurry to finish that model your business partner needs or that report everyone seems to be waiting on. Some of the limitations you’ve probably already run into include:
Rigid data structure required
Values being “looked up” must sit to the left of the column where you want results, and return values can only come from columns to the right. When your work is structured differently, this is a major inconvenience.
Manual typing/copying is error prone
Data sources come in different styles and formats (date formats, for example), so each time you import new data, you have to reformat it all over again. Copying the formula to multiple columns is error prone, and each new import revives that risk.
Works only for simple data matching
VLOOKUP can only use one lookup value; if you need two, it’s difficult to set up and prone to error. Case is disregarded, so you can’t match on it. In addition, VLOOKUP defaults to an approximate match instead of an exact match, so you may get inaccurate data and not know it, especially in large files. With approximate match, you have to sort the data so that the first result is the true result, and each time you import data, you have to remember to sort (or write a macro).
Hard to share, maintain, and consume colleagues’ efforts
Today, you sort the sheet one way; tomorrow, your colleague sorts another way. Each time, the source data you use is being altered. When a new person arrives, they start all over again. Ultimately, it becomes impossible to read anyone else’s models, let alone maintain and support them. Version control can be a serious problem … and we’ve all been there when the backup failed.
Troubleshooting is extremely difficult
There’s no such thing as checking a VLOOKUP while you’re building it. You have to write the formula, hope you’ve done it perfectly, and let it run. Any mistakes you’ve built in will be hard to isolate, a situation that becomes even worse when you’re using a data source you’re even slightly unfamiliar with or that might contain errors.
Working across datasets is painful
Building a VLOOKUP that will run across separate Excel files is theoretically possible, but so hard to do that it’s easier to simply combine the files (usually through copying and pasting sheets)—which will only work until the datasets become too big to open.
Size limits and speed
Big data crunches and macros can slow down your machine (sometimes even just opening the file), and in some cases can’t be run at all; files past a certain size can’t be manipulated in Excel at all because of its row and column limits.
Big data getting bigger
The nature of some data sources means they grow in size on a yearly, monthly, or even daily basis. Eventually, your data ends up too big for Excel to handle.
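Several of the pitfalls above (the approximate-match default, the single lookup key, case insensitivity) disappear with an exact-match lookup. As a minimal plain-Python sketch, with product regions, codes, and prices made up for illustration:

```python
# A dict keyed on a tuple gives an exact-match, two-key, case-sensitive lookup:
# three things VLOOKUP makes difficult. All values here are invented examples.
price_table = {
    ("US", "sku-1"): 9.99,
    ("US", "SKU-1"): 12.50,  # case matters: a distinct entry from "sku-1"
    ("EU", "sku-1"): 8.75,
}

def lookup_price(region, sku):
    # Raises KeyError on a miss instead of silently returning an
    # "approximately" matching row
    return price_table[(region, sku)]
```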
VLOOKUP: powerful, but frustrating. As your business matures and the demands on your analysts grow, you need a tool that’ll scale with them—something robust but flexible, and able to handle a mounting array of disparate sources effectively. What’s the way forward? Find out in Part Two of our exploration coming out next week. Until then, you can start wrangling with Trifacta.
Digital ad spend is often a sizeable chunk of any marketing budget. This year, marketers are expected to allocate $97 billion for digital ads, including paid search, video, display ads and social media ads (with a heavy slant toward mobile), which is a 14% increase from 2017. The upswing in this marketing tactic makes sense. Despite ongoing debate over the effectiveness of digital ads, they remain a surefire way to meet consumers where they are: online, and usually by way of mobile devices.
The rise of digital ad fraud
But there’s another number around digital ads: $19 billion, which is the amount that marketers are expected to lose to ad fraud this year. Ad fraud is rampant across the board, and occurs under a number of scenarios. Digital ads may not be served up to the right customer demographics, which means they generate invalid traffic from the wrong views and clicks. Or, publishers can register fraudulent traffic when bots click or view digital advertisements. In either case, marketers end up paying for ads that quite simply are a waste of money.
Attempts to counteract fraud are notoriously difficult. Identifying the right customer demographics requires segmenting huge volumes of customer data, but this data isn’t always accessible, nor easy to prepare or analyze. Filtering out traffic generated by bots isn’t always straightforward, either. For example, Facebook made headlines in 2016 for removing viewers that watched ads hosted on its platform for less than three seconds, likely in an attempt to rule out accidental or fraudulent viewers. The decision backfired: it severely skewed the average viewing time and misled advertisers who weren’t aware that this meaningful subset of data had been removed. Facebook promised to better identify what constitutes “incorrect” metrics, but the difficulty of parsing out fraudulent data remained.
Given that the investment in digital ads isn’t waning, marketers need better technology to help identify target customer demographics and cull through bot-generated traffic. At Trifacta, we’re seeing companies leverage our data preparation platform as a first, critical step to conducting analysis.
Trifacta for combatting invalid ad traffic
Understanding the demographics of customers is key to better targeted (and more effective) ads. In the pursuit of better understanding the behavior and demographic profiles of its customers, The Royal Bank of Scotland leveraged Trifacta to build more nuanced segments of their customer base. The company was able to prepare huge volumes of customer data across many different databases, and join customer activity with a specific customer ID. Better yet, this work was done by marketing analysts, not data scientists, which allowed them to more appropriately group customer identifiers based upon the questions they needed to answer. The Royal Bank of Scotland is leveraging its enhanced understanding of customers to drive many marketing activities across the bank—product development and customer service training, to name a few—but it can also inform their messaging and digital ad targets.
Trifacta for ad fraud detection
Trifacta also aids in identifying data quality issues related to fraudulent clicks. With its visual interface, it’s easy to spot anomalies, such as spikes in traffic, that signal a bot may have been at work. Marketers can easily wade through data that is meaningful versus poor quality data that muddies analysis. It’s also an opportunity to visually address fraud with media suppliers and save the organization millions in wasted marketing budget.
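The kind of spike detection described above can be sketched in a few lines. This is a minimal illustration using a robust median-based threshold on made-up hourly click counts, not a description of Trifacta’s actual anomaly detection:

```python
import statistics

# Hypothetical hourly click counts for one ad; the last value is a bot-like spike
clicks = [120, 135, 128, 140, 122, 131, 900]

median = statistics.median(clicks)
# Median absolute deviation: robust to the very outliers we are hunting,
# unlike a mean/stddev threshold, which the spike itself would inflate
mad = statistics.median(abs(c - median) for c in clicks)
threshold = median + 10 * mad  # the multiplier is a tunable assumption

spikes = [c for c in clicks if c > threshold]
```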
By Stewart Bond, IDC Research Director, Data Integration and Data Integrity Software
It’s a new world we live and work in. In recent years we have seen dramatic changes in distributed computing, combined with a massive increase of data volume, variety, and velocity as big data fuels digital transformation. In parallel, exponential increases in computing power and advances in machine learning are enabling artificial intelligence (AI) to deliver new solutions that turn the impossible into the possible. One such example is the advent of business-friendly, AI-enabled data preparation software solutions that enable the business to take back its ownership of the data.
According to a 2017 survey conducted by IDC, the 80/20 rule is still in effect: data professionals spend 80% of their time searching for and preparing data compared to 20% of their time performing analytics. At the heart of digital transformation is data; not the data itself, but the ability to leverage the value of analytic and operational insights derived from the data, resulting in actions that improve engagement, experience, and the bottom line. Business-oriented data preparation software solutions can now reduce the time people spend preparing data, thus increasing the time available to uncover new insights.
This new class of data preparation software is also very disruptive in the data integration and integrity (DII) software market, reducing demand for IT-oriented solutions, where data integration vendors have been incumbents. Data preparation software continues to gain momentum, demonstrating growth rates three times that of the overall DII software market, and a forecasted compound annual growth rate of 19% for the five-year period from 2018 to 2022.
Many IT departments are interested in data preparation software as it removes the IT bottleneck in data analytics projects while still allowing a level of control that will satisfy internal policies and external regulations. Governance coupled with collaboration is a key capability required in modern data preparation solutions.
As with any new software implementation, organizations need to build a business case and measure return on investment (ROI). Metrics are critical for success because what isn’t measured cannot be improved. Data preparation software lets users parse, cleanse, standardize, transform, federate, and integrate data. These capabilities are the basis on which metrics can be defined:
Time spent accessing and preparing data
Overall health and quality of data
Relief from support and maintenance costs associated with legacy data preparation software
Time spent accessing and preparing data: Reducing the time spent locating, accessing, and preparing data will change the 80/20 rule. A recent IDC study on the business value of data preparation software measured the ROI associated with this metric. The study identified an 88% reduction in data preparation time spent by analysts and a 20% increase in analyst productivity. The resulting ROI was achieved in a few months. Data quality levels, trust, and compliance also improved as by-products of better data access, profiling, standardization, control, and increased collaboration among analysts. Putting data preparation into the hands of the business user will also relieve IT from the pressure of preparing data, allowing a focus on more innovative initiatives.
Overall health and quality of data: Improving the overall data quality in the organization will increase the level of data trust, ultimately resulting in better data-driven business decisions and outcomes. Data quality is defined and measured by timeliness, completeness, accuracy, and consistency, which can be impacted by data duplication and distribution, as well as errors made during creation, modification, and blending. These are metrics that can be defined and tracked across the organization, and data profiling within and across columns according to standard regular expressions and defined business rules can provide an overall data quality score. As business data owners are given more access and the ability to cleanse their data using data preparation software, the scores should improve over time, demonstrating the software’s value and ROI.
Relief from support and maintenance costs related to legacy data preparation software: Cost savings are also associated with reducing reliance on legacy data preparation technology and its attendant complexities. As data preparation shifts to the end user’s desktop browser, the capacity required in legacy ETL software may be reduced, thereby reducing infrastructure and licensing costs. Data preparation software also provides new opportunities to handle larger data sets and greater varieties of data sets, and to improve the quality of data to uncover deeper insights that can enable better business outcomes through new products, services, or new ways to go to market, which is digital transformation.
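The column-level quality scoring mentioned above, profiling values against a regular expression and tracking the share that conform, can be sketched in plain Python; the pattern rule and sample values below are invented for illustration:

```python
import re

# Business rule for this column: values should be ISO-format dates
iso_date = re.compile(r"^\d{4}-\d{2}-\d{2}$")

# Sample column contents, including a mis-formatted date and a blank
column = ["2018-01-15", "2018-02-03", "02/14/2018", "", "2018-03-22"]

valid = sum(1 for v in column if iso_date.match(v))
quality_score = valid / len(column)  # 0.0 to 1.0; track this over time
```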
If you are looking for opportunities to digitally transform with data, you may need data preparation software. You may also need it if:
Business users cannot access the data they need
IT is burdened with data preparation tasks that hold up business projects
The business no longer trusts the data to make decisions
IT’s budget is consumed supporting legacy data technologies that require specialized resources
Data is delivered too slowly for analysis to be valuable
Trifacta Wrangler, our free personal edition of Trifacta, is being used at hundreds of universities around the world for teaching data prep and data munging concepts. I’ve spoken with half a dozen instructors at colleges and universities in the US and Europe about how data prep fits into a larger data analytics curriculum for their students.
A common theme amongst university professors is the desire to teach students how to handle real-world data. For the instructors I’ve spoken with, that typically means using public data sources or running website scrapers to gather data for analysis; for both of these sources, the data is messy more often than not.
It’s widely known that data cleansing and data prep are the lion’s share of data analysis: 80% of an analyst’s time is spent simply preparing data for downstream analytics. It’s worth noting, too, that by “data analysts” we mean anyone who has to analyze a dataset.
“It was a nice, easy way to introduce data cleansing / data wrangling as a concept.”
John Lochner, Visiting Instructor, Hamline University
Data analytics as a concept is taught in business schools, health science curricula, journalism programs, and other departments whose students are not necessarily planning on full-time data analyst jobs. Rather, they see data analysis as a means to an end. Whether they head into careers as engineers, marketers, scientists, or public servants, analyzing data is now a core part of many jobs.
Providing students with clean and tidy datasets – no mismatched data types, no missing data, no spelling or formatting errors, and all formatted perfectly with the right column labels – is setting them up for a future of hurt. Real-world data is messy. Fortunately, most professors I’ve spoken with are having students download publicly available datasets for analysis – data that’s rife with issues and also very much akin to what’s found in real academic, business, or government jobs.
“Wrangler has spoiled us!”
Chris Claterbos, Lecturer and Business Analytics Program Associate, University of Kansas
Additionally, while students are more technically capable every year, not all of them are SQL wizards or Python/R coders, nor do they necessarily want to be. But they do want to be able to work with data in whatever form it comes. Microsoft Excel excels as a spreadsheet, but struggles as a data prep tool for multiple datasets that need to be blended together, or for large datasets, which can cause it to crash.
Trifacta Wrangler fits the bill: it can blend data, handle larger datasets, provide SQL-like capabilities such as JOIN and UNION without any coding, and export to CSV and other file types, and it allows students to visually wrangle data, step backwards and forwards through their recipe steps, and edit steps to refine their work.
And, of course, it’s free!
Despite Wrangler’s successes in the classroom, there’s always room for improvement. If you are teaching data prep in your classroom, whether you’re using Wrangler or not, we would love to hear from you: what are your challenges and needs for teaching data prep?
Lastly, if you’re not giving students an option to try Wrangler for the data prep portion of your course, we invite you to consider trying it yourself. Every single professor I spoke with said that students picked it up quickly through a short demo or our online tutorials. While not everyone required Wrangler — few, in fact, did — most found it was a great tool for students who were learning the basic concepts.
Accelerate your learning and put your skills to the test with Trifacta’s new training and certification programs. We’re excited to unveil all-new training to help people who prepare data for analysis take the next step in learning Trifacta. Data preparation skills are in high demand; stand out among your peers by becoming a certified Wrangler!
Our new training course covers the Trifacta basics; courses for more advanced training and certification will follow soon. This first course will help you learn how to use Trifacta to build robust recipes that transform your data from its raw, messy state into a clean, structured, and blended output ready for consumption in downstream tools. It also covers the basics of operationalizing your workflows in Trifacta. The only prerequisite is a version of Trifacta to work through the exercises in; if you do not have a Trifacta account, you can start with our free version here. While we recommend completing the Online Training before taking the Certification Exam, if you’re already a Trifacta expert, you can skip ahead to the exam.
The Certification Exam is split into two parts. Part One tests your conceptual knowledge with 15 multiple-choice questions. Part Two tests your practical skills by having you complete a use case and answer questions about it. Upon completing both parts, you will be granted a badge on your Trifacta community profile, and you can share your certification on your LinkedIn profile.
We are really excited to see our community of expert data wranglers grow and hope you join us on this journey! Check out the new training & certification.
At Trifacta, we are constantly looking for ways to improve the data preparation process by learning from our users and applying those lessons. A new capability in our Wrangler Pro & Enterprise editions was inspired directly by how our users solve common scenarios today, and it leverages visualization and machine learning to provide interactive assistance while wrangling.
Common Scenario – Data Onboarding
Let’s start off with a simple example. Many of our users belong to industries that rely on external parties for their source data. Before this data can be consumed by internal users or downstream customers, it has to be converted to match an existing schema or data model, a process often called “data onboarding”. For example:
Insurance → Policy details from freelance agents across the country
Retail → Supplier data coming in from 100s of individual vendors
Healthcare → Patient test results conducted by independent laboratories
The transformations needed to format and cleanse the data into a standardized structure are not complex, but there are a lot of details that are easy to miss.
Source Raw Data
Desired Target Result
In this example, some of the steps required are:
Merge names with a comma
Change the date format to DD/MM/YYYY
Standardize gender to M/F
Split street address into individual attributes
Re-order and rename columns
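To make these steps concrete, here is a minimal sketch of the same transformations in pandas. The data, column names, and exact target schema below are invented for illustration; they are not taken from the example above.

```python
import pandas as pd

# Hypothetical source data mirroring the scenario; all names are assumptions.
source = pd.DataFrame({
    "first_name": ["Ada", "Grace"],
    "last_name": ["Lovelace", "Hopper"],
    "dob": ["1815-12-10", "1906-12-09"],  # ISO dates in the source
    "gender": ["Female", "female"],
    "address": ["12 Main St, Springfield, IL", "34 Oak Ave, Dayton, OH"],
})

# 1. Merge names with a comma
source["name"] = source["last_name"] + ", " + source["first_name"]

# 2. Change the date format to DD/MM/YYYY
source["dob"] = pd.to_datetime(source["dob"]).dt.strftime("%d/%m/%Y")

# 3. Standardize gender to M/F
source["gender"] = source["gender"].str.strip().str[0].str.upper()

# 4. Split street address into individual attributes
source[["street", "city", "state"]] = source["address"].str.split(", ", expand=True)

# 5. Re-order and rename columns to match the target schema
target = source[["name", "dob", "gender", "street", "city", "state"]].rename(
    columns={"dob": "date_of_birth"}
)
```

Each step is simple on its own; the difficulty in practice is remembering all of them, in the right order, for every column.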
Learning from our users
In practice, users would directly compare the raw source data against the target result by opening each in separate tabs or screens and going back and forth, trying to match the data column by column. Using the structure, data, and format as a set of constraints, the target acted as an intuitive guide for transformation. Unfortunately, this was a tedious and error-prone process because it relied heavily on people remembering many details across many columns.
When asked if it was better to automatically apply transformations, users’ consistent feedback was that while some level of automation would reduce common errors, the source data varied often enough that they needed to apply manual tweaks to the transformation logic to make everything match correctly.
Taking this feedback, we decided to focus on three areas:
Display the target data structure, format, and details as a reference during wrangling
Use the constraints defined in the target to help validate the source data
Suggest transformations based on the target data that the user can preview and modify
Introducing Schema Alignment through Target Matching
Target Matching enables users to import constraints and data from a target destination into Trifacta for use during the transformation process. As a starting point, users can import common schema constraints from the target destination, such as:
Structure of the data (Order of columns, Column data types, Uniqueness)
Data Formatting (Dates – YYYYMMDD or DDMMYYYY)
System Constraints (Character limits, Timezone formats)
Trifacta uses these constraints to create a set of rules within a logical target that can be attached to a recipe. Within the transformer, Trifacta applies data validation and transformation suggestions to help users align the source dataset with the target destination schema.
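As a toy illustration of the rule idea (this is not Trifacta’s internal implementation, and every field name and rule below is an assumption), target constraints can be thought of as per-column predicates that each source row is checked against:

```python
import re

# Hypothetical per-column rules derived from a target schema.
target_rules = {
    "date_of_birth": lambda v: bool(re.fullmatch(r"\d{2}/\d{2}/\d{4}", v)),  # DD/MM/YYYY
    "gender": lambda v: v in {"M", "F"},
    "state": lambda v: len(v) <= 2,  # character limit constraint
}

def validate(row: dict) -> dict:
    """Return a column -> passed? map for the constraints that apply to this row."""
    return {col: rule(row[col]) for col, rule in target_rules.items() if col in row}

result = validate({"date_of_birth": "10/12/1815", "gender": "F", "state": "IL"})
```

A validation pass like this is what lets the tool flag, per column, how far the source currently is from the target.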
How to use the Target to align your source data in Trifacta
From the flow view, users can import a Target and associate it with a recipe. The Target can be imported from any file, relational database, or Big Data system (Hive / Redshift / SQL Data Warehouse) that Trifacta supports as a source dataset.
Once associated with a recipe, the Target appears in the transformer as an overlay above the source data. You can see the Target column names, order, data types and sample of the data itself in relation to the source data you are transforming.
Here’s another example based on real data:
This view makes it easy to spot mismatches between columns, with column-level indicators showing how closely the current source matches the Target.
When you hover over these column-level indicators, Trifacta suggests specific transformations for the column based on the Target. Here are a few examples:
Reordering the source column position to match the Target.
Converting the source data type format to the Target’s format.
If the user chooses to apply these transformation suggestions, they are added to the recipe as a step which can be edited or removed like any other recipe step.
When dealing with very wide datasets with many columns, the column browser view in the transformer provides a broader view of the mismatches.
In cases where many partial matches are found, Trifacta can provide transformation suggestions that apply across multiple columns by choosing “Apply Matches by Name.”
Applying the automatic algorithm across all the columns results in multiple steps. Each of these steps can be viewed in the recipe where users review, edit or undo the steps as needed.
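Here is a simplified sketch of what matching by name might look like. Trifacta’s actual matching algorithm is not described in this post, so treat this purely as an assumption-laden illustration: columns are matched on a normalized name, and each match emits rename and reorder steps that a user could then review or undo.

```python
def normalize(name: str) -> str:
    """Normalize a column name for comparison (lowercase, strip separators)."""
    return name.lower().replace("_", "").replace(" ", "")

def match_by_name(source_cols, target_cols):
    """Emit human-readable rename/reorder steps aligning source to target."""
    index = {normalize(c): c for c in source_cols}
    steps = []
    for pos, tcol in enumerate(target_cols):
        scol = index.get(normalize(tcol))
        if scol is None:
            continue  # no name match; fuzzy/partial matching omitted in this sketch
        if scol != tcol:
            steps.append(f"rename {scol} -> {tcol}")
        steps.append(f"move {scol} to position {pos}")
    return steps

steps = match_by_name(["First Name", "dob"], ["first_name", "dob"])
```

Emitting the result as discrete recipe steps, rather than applying it silently, is what preserves the user’s ability to review, edit, or undo each one.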
The goal is to increase efficiency and reduce errors in the data preparation workflow while still giving users full interaction with their data, transparency into the logic being generated, and the ability to easily make custom edits.
Over the next set of releases we’ll support additional constraints that can be captured as rules for validation during transformation. We are also investigating fuzzy matching and probabilistic confidence rankings to improve suggestions based on the target. Longer term, we are considering the ability to share Targets between users and to apply rules to publishing actions. Target Matching is useful in a wide range of use cases; faster schema alignment in this release is just the start. As we develop more functionality in this area, we’d love to learn about more applications from our users, so feel free to share your feedback via our community site.
Schema Alignment through Target Matching is immediately available in our Wrangler Pro & Enterprise editions. If you’d like to get your hands on it, schedule a demo with our team.