Trifacta’s May ‘19 Wrangler and Wrangler Pro release brings further improvements to the Join interface, along with the new RANK and DENSERANK functions. Both improvements come from feedback from our customers and community, and we are excited to make them available first in our Wrangler and Wrangler Pro products.

What’s new:

  • Join Improvements
  • RANK and DENSERANK functions

Join Improvements

With the new release of Trifacta, you are now able to select the dataset or recipe to join from a view of the Flow. This makes organization in more complicated flows much easier.

RANK and DENSERANK

RANK and DENSERANK let you rank numeric columns in a single step. RANK computes the rank of a numeric column; tied values are assigned the same rank, and the next rank is incremented by the number of tied values. DENSERANK ranks values the same way, except that the rank after a tie is incremented by one.

Examples:

Value    RANK    DENSERANK
3        1       1
6        2       2
6        2       2
9        4       3
12       5       4
12       5       4
15       7       5
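
If you want to sanity-check these semantics outside of Trifacta, the short pandas sketch below (illustrative only, not Trifacta syntax) reproduces the table above: rank(method="min") behaves like RANK and rank(method="dense") behaves like DENSERANK.

import pandas as pd

# The values from the example table above
df = pd.DataFrame({"Value": [3, 6, 6, 9, 12, 12, 15]})

# RANK: ties share a rank; the next rank skips ahead by the number of tied values
df["RANK"] = df["Value"].rank(method="min").astype(int)

# DENSERANK: ties share a rank; the next rank increments by exactly one
df["DENSERANK"] = df["Value"].rank(method="dense").astype(int)

print(df)  # matches the Value / RANK / DENSERANK table above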

To try out the new Join improvements, ranking functions and all of the other great features available in our free Wrangler edition, sign up here.

The post May ‘19 Release – Join Improvements and Ranking Functions appeared first on Trifacta.


So, you have successfully trained a machine learning model after choosing the best algorithm and high-quality training data. Time to celebrate, right?

Not quite!

The best machine learning algorithms and models are of no use until you actually apply them to make predictions on future outcomes. Then, you must incorporate those predictions into your decision making. The business value of machine learning is realized only when it alters behavior to produce positive outcomes.

In this blog post, we’ll look at how to derive value from a DataRobot model that is designed to answer the following question:

“Which patients are more likely to be readmitted to the hospital?”

Properly addressing this question requires you to periodically collect batches of new patient records from the hospital’s systems, scrub them for quality, make predictions, and feed the results back into the hospital’s systems to drive action.

Training, testing, and fine-tuning your model is the first part. The subsequent four steps are as follows:

1. Deploy the model

Make the model available for predictions. In DataRobot, you do this by creating a deployment, which involves selecting a model (usually the best-performing model from the training process). You also specify the training dataset, which is used as a baseline to detect model drift.

Switch to the “Integrations” tab and make a note of the deployment tokens and keys. You will need them in a later step.

2. Predict and decide

The next step is to build a production workflow that processes incoming data and gets predictions for new patients. We do this using Trifacta.

In Trifacta, create a new flow and import your data. In this case, we’ll load a CSV file containing details of new patients.

Once imported, you can access the URL path to the file. Copy the URL and insert it into the recipe as shown below.

To call the DataRobot API, you need the following information:

  1. API Token
  2. Deployment ID
  3. DataRobot Key
  4. Username

Add a recipe step that invokes an external function (UDF) and choose “DatarobotPredict”. Select the filename column and enter an argument in the form API_TOKEN=api_token,DEPLOYMENT_ID=deployment_id,DATAROBOT_KEY=datarobot_key,USERNAME=username
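
If you prefer to script the same call outside of Trifacta, the sketch below shows roughly what such a UDF does with those four values. Treat it as a hedged illustration: the prediction host, endpoint path, and auth scheme vary by DataRobot version and account, so copy the exact snippet from your deployment's Integrations tab rather than relying on the URL assumed here.

import requests

# The four values noted from the deployment's Integrations tab (placeholders)
API_TOKEN = "api_token"
DEPLOYMENT_ID = "deployment_id"
DATAROBOT_KEY = "datarobot_key"
USERNAME = "username"

# Assumed prediction endpoint; your Integrations tab shows the real host and path
PREDICTION_URL = (
    "https://example.prediction.datarobot.com"
    f"/predApi/v1.0/deployments/{DEPLOYMENT_ID}/predictions"
)

with open("new_patients.csv", "rb") as f:  # the batch of new patients from step 2
    response = requests.post(
        PREDICTION_URL,
        data=f,
        auth=(USERNAME, API_TOKEN),                 # assumed basic auth
        headers={
            "Content-Type": "text/csv; charset=UTF-8",
            "datarobot-key": DATAROBOT_KEY,         # assumed routing-key header
        },
    )

response.raise_for_status()
predictions = response.json()  # JSON results, parsed in the next step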

The results of the DataRobot API call are returned in JSON format. Now use Trifacta to parse out key pieces of information into their own fields. Use the flatten transformation to create individual rows in the output. Choose the prediction column along with a row identifier.

Now merge the prediction results with the original data to produce a combined output. This is accomplished by doing an inner join on rowId. If needed, you can perform additional transformations on this data to support analytics.
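
For readers who want to see the equivalent outside of Trifacta, here is a small pandas sketch of the flatten-and-join step. It assumes the JSON response is shaped like {"data": [{"rowId": 0, "prediction": ...}, ...]}; the field names and sample data are invented for the example.

import pandas as pd

# Example of the assumed response shape from the previous step
predictions = {"data": [{"rowId": 0, "prediction": 0.72},
                        {"rowId": 1, "prediction": 0.18}]}

# Flatten the nested JSON into one row per prediction (Trifacta's flatten step)
pred_df = pd.json_normalize(predictions["data"])[["rowId", "prediction"]]

# Original batch of new patients, with a row identifier to join on
patients = pd.DataFrame({"patient_id": ["A-101", "A-102"], "age": [64, 71]})
patients["rowId"] = range(len(patients))

# Inner join on rowId produces the combined output used for decision making
combined = patients.merge(pred_df, on="rowId", how="inner")

# Illustrative decision rule: flag higher-risk patients for nurse follow-up
combined["assign_nurse"] = combined["prediction"] > 0.5  # example threshold only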

You can now use the output of this process to drive decision making. In this case, patients with a higher readmission risk may be assigned to a nurse for additional checkups and preventive measures. All decisions and actions should be carefully recorded for later analysis.

After building the flow, you operationalize it using Trifacta. You do this by setting up a recurring schedule to process new patients on a daily basis. This pipeline feeds data back into the hospital’s systems for the next step.

3. Measure

It is critical to measure the accuracy of the predictions, as well as the effectiveness of the decisions taken based on those predictions.

The first part of this measurement can be implemented in DataRobot, which can detect drift in model predictions. The second part is accomplished by further analyzing patient data from the hospital’s systems and tracking trends in cost and health outcomes. This step is important for determining the ROI of the machine learning initiative, both in terms of cost savings and the well-being of patients.

4. Iterate

The process doesn’t end here. To derive continued benefits from your machine learning initiative, you must iterate on your models to address model drift, as well as to incorporate new insights and additional data gained during the journey. Having a tight feedback loop will ensure that the machine learning initiative continues to provide ROI for a long time.

Additional Resources
  • If you are interested in trying DataRobot with Trifacta, you can request a free trial here.
  • The Trifacta recipes and source code used in this post are available on Github. A video recording of the process is available on YouTube.
  • For more details on the hospital readmission risk model and its implementation in DataRobot, see this page.

The post Four Steps to Take After Training Your Model: Realizing the Value of Machine Learning appeared first on Trifacta.


Last week, Trifacta introduced a free trial for Amazon Redshift. As we’ve seen a growing number of organizations leveraging AWS for their data management and analytics initiatives, we have invested heavily in ramping up our capabilities on the Amazon stack. In the fall, we announced a new serverless architecture on AWS, giving users the ability to dynamically scale computing up or down on the fly. Additionally, we released Wrangler Pro, a multi-tenant SaaS edition optimized for Amazon that leverages S3, Redshift, EMR, and EC2, giving data teams the scale and AWS connectivity they need without any software or hardware to manage. Wrangler Pro is the ideal solution for small teams or departments and the perfect complement to Wrangler Enterprise, our customer-managed solution for AWS. On the business side, we’ve seen 4x growth in the number of customers deploying Trifacta on AWS, a trend that is only accelerating.

For many data analysts, data scientists, and data engineers, the majority of their work is tied up in this data preparation and cleaning process. Traditional methods might use a combination of tools like SQL and code to blend data from databases and file systems. This process can be manual and cumbersome, requiring IT teams and siloed groups to build code, visualize and analyze the results, and debug when errors inevitably arise. The explosion of cloud adoption led by AWS has only expanded the volume and variety of data organizations can utilize, but the data quality bottlenecks described above persist and are exacerbated by the speed at which the cloud enables businesses to move. Organizations investing in cloud data lakes are landing large volumes of raw data that need to be profiled, prepped, cleaned, and blended with disparate sources before being published to Amazon Redshift for analysis. Trifacta empowers data professionals of all levels of technical expertise to access, manage, and prepare data in Amazon Redshift and other AWS sources like Amazon S3.

Trifacta is being successfully used on AWS by a variety of different organizations – from some of the world’s largest financial firms, to emerging and innovative startups like Adaptive Analytics. Below are a few notable use cases:

  • Consensus Corporation (a Target subsidiary) leverages Trifacta to decrease the time it takes to feed new data and update their fraud detection models for their retail customers. The company’s fraud detection models help save retailers from selling merchandise to illegitimate customers, saving them massive amounts of otherwise lost revenue every year.
  • Deutsche Borse leverages Trifacta on AWS to drastically decrease the time it takes for their data science team to work with new data sources. Deutsche Borse’s data science team is then able to focus on driving improvements to their data marketplace for their customers.
  • Tipping Point Community leverages Trifacta to prepare a variety of publicly available datasets, which ultimately helps them analyze how to break the cycle of poverty for individuals and families in the San Francisco Bay Area. With Trifacta, the Tipping Point Community can quickly join together these data sets and import them directly into Tableau, greatly reducing their time spent on data preparation since switching from Excel.

Trifacta integrates natively with existing AWS services to provide a seamless experience within the AWS ecosystem. We leverage IAM roles to authenticate and to respect the data permissions set up within AWS. Trifacta can read data from and write to S3 and Redshift, and transformations scale to large volumes of data by leveraging Spark to execute on fully managed EMR clusters, providing all of the benefits of scale that EMR offers with none of the management overhead.
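
To make the moving pieces concrete, here is a hedged sketch of the same pattern (ours, not Trifacta's internals): Spark on EMR reads raw data from S3, writes the prepared output back to S3, and loads it into Redshift with a COPY that authenticates via an IAM role. Bucket, table, and role names are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-to-redshift-sketch").getOrCreate()

# Read raw data from S3 (EMR clusters can read s3:// paths natively)
raw = spark.read.csv("s3://my-data-lake/raw/patients/", header=True, inferSchema=True)

# ... wrangling / cleaning transformations would go here ...
cleaned = raw.dropDuplicates()

# Write the prepared output back to S3 as the staging area for Redshift
cleaned.write.mode("overwrite").parquet("s3://my-data-lake/prepared/patients/")

# Load into Redshift with a COPY command that authenticates via an IAM role
copy_sql = """
COPY analytics.patients
FROM 's3://my-data-lake/prepared/patients/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
FORMAT AS PARQUET;
"""
# (run copy_sql against Redshift with your SQL client or driver of choice)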

Trifacta’s free trial of Wrangler Pro for Amazon Redshift supports teams of up to 10 users and comes with a compute limit rather than a time limit. Try it out and let us know what you think.

The post Take a Free Test Drive of Trifacta on AWS appeared first on Trifacta.


Since rolling out the General Availability of Google Cloud Dataprep in Sept 2018, we’ve continued to see rapid growth in users and Dataprep jobs from around the world.  While you will see many improvements across the product in our latest release, there are two areas of particular focus: job customization and data quality capabilities.

Simple Dataprep Job Customization
With several hundred thousand jobs running per month, customers frequently asked for more options for tuning the Dataflow job that Dataprep creates. While this could be achieved through Dataflow templates, we wanted to make it much easier and more automated for any user. As a result, users can now choose regional endpoints, zones, and machine types directly in the Dataprep job UI or configure them as project-level settings.

Enhanced Data Quality Capabilities
The new suite of capabilities focuses on making data quality assessment, remediation and monitoring more intelligent and efficient. These new capabilities are designed to help address data quality issues that hinder the success of analytics, machine learning, and other data management initiatives within the Google Cloud Platform.

If you want to learn more about these new capabilities and experience them live, we’ll be demonstrating them at Google Next SF 2019 next week.  Please join us at booth S1623 next to the Serverless Analytics section.

Run Cloud Dataprep jobs in different regions or zones with customized execution options

Performance has always been the major driver for executing data processing as close as possible to where the data resides. More recently, last year’s rollout of GDPR raised the stakes by adding legal requirements around data locality.

Specifically, data locality requires that certain data (mostly customer data and Personally Identifiable Information, or PII) remain within the borders of a particular country or region. While not new, laws such as those in the European Union carry significant penalties, meaning the cost of non-compliance can be very high. As the majority of our Dataprep users are located outside of the US, it has become increasingly important to ensure that data is processed and stored within specific geographic regions. This also applies to companies in the US or other countries that process European customer data.

Prior to this release, customers who wanted to run their Dataprep jobs in a specific region were required to use Cloud Dataflow templates. While effective, this was cumbersome to set up and maintain, and it wasn’t available for scheduled jobs. Now, the most commonly used Cloud Dataflow settings are available directly in the Dataprep job UI. Users can configure the location where Dataprep jobs execute to match the data storage locations defined for Cloud Storage and BigQuery.

You can select the regional endpoint and the specific zone where the Dataprep job will be submitted to the Cloud Dataflow service to initialize and start processing the data. This ensures that the underlying Cloud Dataflow job executes in the same location as the source and target data, thereby improving network performance and maintaining geo-locality.

Expanding on these options, we’ve also enabled selecting the GCP Compute Engine machine types used by Cloud Dataflow. This is particularly useful for processing-intensive transformations such as joins, aggregations and flat aggregations, sessioning, windowing, unpivoting, or pivoting, which benefit from more power on a single node. Note: auto-scaling is turned on for all supported machine types.

All of these Dataflow execution options are saved for each Dataprep job so that scheduled jobs will automatically pick up the latest settings.  In addition, you can configure these settings at the project level within the user profile settings.
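
For readers who submit Dataflow jobs directly, the settings Dataprep now exposes map onto ordinary Dataflow pipeline options. The sketch below uses the Apache Beam Python SDK as an illustration; the project, bucket, zone, and machine type are placeholders, and option names can vary slightly between SDK versions (newer SDKs use worker_zone where older ones used zone).

from apache_beam.options.pipeline_options import PipelineOptions

# Roughly the knobs now exposed in the Dataprep job UI, as Dataflow pipeline options
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",            # placeholder project ID
    region="europe-west1",               # regional endpoint, e.g. to keep EU data in the EU
    worker_zone="europe-west1-b",        # specific zone within that region
    machine_type="n1-highmem-8",         # larger workers for heavy joins/aggregations/pivots
    temp_location="gs://my-bucket/tmp",  # placeholder staging bucket
)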

To learn more about Dataflow regional endpoints and zones, please see the product documentation here.

New and enhanced Data Quality capabilities

Gartner Inc. has determined that 40% of all failed business initiatives are a result of poor-quality data, and that data quality affects overall labor productivity by as much as 20%. As more data is utilized in analytics and other organizational initiatives, the risk grows that inaccurate data will be incorporated into analytic pipelines, leading to flawed insights. In order to truly capitalize on the unprecedented business opportunity of machine learning and AI, organizations must ensure their data meets high standards of quality.

The new/enhanced features to further support data quality initiatives on Google Cloud include:

Active Profiling
  • A new Selection Model creates a seamless experience that highlights data quality issues and offers interactive guidance on how to resolve these issues.
  • Column selection provides expanded histograms, data quality bars, and pattern information to offer immediate insight to column distributions and data quality issues. These visuals update with every change to the data and offer instant previews of every transformation step.
  • Interaction with profiling information drives intelligent suggestions and methods for cleaning that the user can choose from.

Smart Cleaning

Cluster Clean uses state-of-the-art clustering algorithms to group similar values and resolve them to a single standard value.

Pattern Clean handles composite data types like dates and phone numbers that often have multiple representations. It identifies the datatype patterns in the dataset and allows users to reformat all values to a chosen pattern with a single click.

We’re always interested in feedback about the product and engaging with users about their use cases on our community. Look for us at Google Next SF next week or schedule a 1:1 meeting with a data prep expert.

The post Bringing Enhanced Data Quality Capabilities and Easy Job Customization to Google Cloud Dataprep appeared first on Trifacta.


The following is a guest blog post from Bob Kelly, Managing Partner at Ignition Partners, a leading early-stage B2B investment firm with offices in Seattle and Silicon Valley.

As Google Cloud Next ’19 kicks off in San Francisco next week, I’ve been thinking about how far cloud capabilities have evolved since my early days at Azure. The AI/ML-driven application world has been promised for so long, and we’re only just beginning to see it unfold. Google’s focus on data-driven applications and its unique relationship with Trifacta are indicators, in my opinion, of the coming explosion of intelligent applications.

That last point is critical to us at Ignition Partners, as our investment focus has moved “up the stack” over the past few years to intelligent apps, either vertically focused or horizontal/role-based. As we think about these investment opportunities, we devote a lot of time to better understanding how data can create a competitive advantage for companies.

In the past, getting the data right was an open question. Today, however, this problem has largely been solved by Trifacta, one of our portfolio companies that is leading its category and experiencing tremendous growth. Trifacta was born out of the realization that data professionals historically spent 80% of their time preparing data, and only 20% of their time analyzing it. That’s not only backward–it’s highly inefficient.

At a time when data is one of the most valuable assets for companies in every industry, a limited ability to use and understand data is a serious disadvantage.

The data preparation pioneers 

Over the past several years, Trifacta has developed the leading platform for data wrangling, which is the process of curating and preparing data so that it can be analyzed. Trifacta gives anyone the ability to wrangle all kinds of data, from structured to unstructured. The “anyone” piece is a big deal, because for too long, the only people who could successfully wrangle data were data scientists and people who knew how to code.

Trifacta democratizes data by giving everyone the ability to work with it. The company believes that having access to huge data sets, and the ability to wrangle it and ask questions in natural language, is truly transformational. It removes the gating process that previously kept data confined to a small section of the business, and opens it up for anyone to use and learn from.

Google takes notice

We’re not the only ones excited about Trifacta’s potential; Google is on board, too. It made a corporate investment in Trifacta in 2018, and subsequently asked Trifacta to white-label its solution and offer it as Google Cloud Dataprep. That marked the first time Google licensed a 3rd-party product and made it part of its core business model. The solution was announced at Google Cloud Next ’17; the public beta went live around this time last year.

Today, there are 43.5K users on Cloud Dataprep, a number that has grown 77% since September 2018. In that same time frame, the number of total jobs executed on Cloud Dataprep grew by 182% to 1.96 million jobs. By the time Google Cloud Next begins, that number will be well past 2 million.

I can’t think of a better partner for Trifacta than the data company.

Getting to “ah-ha” faster

Collecting data isn’t the challenge. The challenge is what to do with it once you have it. Trifacta helps companies answer that question much more easily, by making it possible to curate data from different sources, clean it, and make sense of it–fast.

Google’s customers are using it to ask questions like: What ads are working? Which cohorts are they working for? What’s not working, and why?

Companies in other industries are asking different types of questions.

Better data, better decisions, better world

Adam Wilson, CEO of Trifacta, likes to say that the company is “the decoder ring for data” – and we’re seeing companies use it to decode customer insights, business opportunities, delivery models, and much more. Here are two examples of how that plays out.

GSK is a pharmaceutical company that wants to dramatically shorten clinical trials, which often drag on for 10-12 years. GSK believes that clean, raw data is critical in making this happen. They’re using Trifacta to allow their clinical researchers to capture and wrangle data themselves, rather than waiting for patients to self-report. The result: They can better understand what’s in the data, and how to use it – which speeds decision-making and allows their projects to move ahead faster.

NationBuilder offers a community outreach platform that helps power smarter campaigns and lower the barrier to leadership. Their data helps political candidates understand how to best prioritize their activities, which means data quality is a top priority. Trifacta is a critical part of the NationBuilder solution, giving their customers the confidence they need to make smart decisions.

Are you building an intelligent application?

If you have an application that would be improved with Trifacta, and are going to be at Google Cloud Next ’19 – drop me a line at Bob@Ignition.vc.  Our team at Ignition is dedicated to investing capital, time, and expertise to help companies like Trifacta build a better world. If you’re building something that will help transform the way people work, whether vertically focused on a specific industry or on a role or function that cuts across industries and companies, we’d love to hear from you.

The post How Trifacta is Helping Companies Realize the Massive Value in their Data appeared first on Trifacta.


As part of our expanded focus on data quality, Trifacta recently announced a new approach aimed at quickly and intuitively resolving data quality issues — Smart Cleaning. This blog will cover two features of Smart Cleaning: Cluster Clean and Pattern Clean. Cluster Clean allows users to quickly and flexibly resolve similar values to a single standard value, whether chosen from the existing data or entered by the user. Pattern Clean addresses issues with mismatched formats, for instance date or phone number formats. Within this framework, Trifacta Smart Cleaning brings scale and flexibility to the traditionally rigid process of standardization.

[Video: Smart Cleaning with Trifacta (YouTube)]

Smart Cleaning

Cluster Clean

We recently introduced a brand new way for users to standardize and clean misspelled or varied data in Trifacta: Cluster Clean. Many of our users struggle to deal with messy data that can be hard to reconcile. Sometimes it’s data that has been manually entered into systems; other times it’s data coming from multiple sources. We’ve found that traditional methods of clustering and standardizing similar values are slow and brittle. Users choose a clustering method and then spend lots of time manually moving values into the right cluster. When they get new data, they need to painstakingly reconcile it with their existing clusters. Our approach is different: we know that no single clustering method does a great job of catching every type of issue. We designed Cluster Clean with this in mind, allowing users to quickly explore multiple clustering options and catch new problems. It’s also resilient to new data, easily incorporating new values without being tied down to whatever clustering method happened to work best the first time.

We approached standardization with user flexibility in mind. Our initial model centered on letting users take advantage of two clustering algorithms, key collision (which compares string similarity) and metaphone (which compares pronunciation), while still being able to break out of these groupings if the algorithms did not produce the desired results. To accomplish this, we allowed the user not only to specify a value for the cluster to resolve to, but also to specify a new value for any of the individual values within the cluster.
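
To make the two algorithm families concrete, here is a small, illustrative Python sketch of key-collision and phonetic (metaphone) clustering, using the jellyfish library for the metaphone encoding. It is not Trifacta's implementation; the helper names and sample values are invented for the example.

import re
from collections import defaultdict

import jellyfish  # provides the metaphone() phonetic encoding


def fingerprint_key(value):
    # Key collision: lowercase, strip punctuation, sort the tokens, so
    # "Acme Inc." and "inc ACME" land in the same cluster
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))


def metaphone_key(value):
    # Phonetic key: values that sound alike (e.g. "Jon" / "John") collide
    cleaned = re.sub(r"[^\w\s]", "", value)
    return " ".join(jellyfish.metaphone(t) for t in cleaned.split())


def cluster(values, key_fn):
    groups = defaultdict(list)
    for v in values:
        groups[key_fn(v)].append(v)
    # Only groups containing more than one distinct spelling need standardizing
    return [vs for vs in groups.values() if len(set(vs)) > 1]


values = ["Acme Inc.", "acme inc", "ACME, Inc", "Jon Smith", "John Smith"]
print(cluster(values, fingerprint_key))  # groups the three "Acme" spellings
print(cluster(values, metaphone_key))    # also groups "Jon Smith" / "John Smith"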

Through user testing, we found that this model gave our users the flexibility they were looking for, but it was missing a key capability for when the clusters didn’t match the user’s expectations: bulk editing. Users felt it was tedious to pull multiple values out of a cluster one by one, so we added the ability to select multiple values and edit them in bulk. Through this iteration, we discovered that selecting values was generally very useful and intuitive, so we extended the pattern to work across clusters as well. Now, users can select a single cluster, multiple clusters, specific values within a single cluster, or specific values across multiple clusters, and resolve them appropriately.

Pattern Clean

Another data quality issue that Trifacta can quickly address is mismatched formatting. If you saw our blog on Active Profiling, you saw how Trifacta helps users identify and drill down on issues related to mismatched formatting. By interacting with that profiling information, Trifacta also gives users a powerful method to address those mismatched values: Pattern Clean.

More traditional approaches to these types of mismatched values often rely heavily on regular expressions and complex conditions. Identifying the issues and patterns can take an enormous amount of manual effort, and replacing them with the correct format takes just as much. Trifacta identifies the patterns for you, and by interacting with those patterns, Pattern Clean will predict the best way to resolve them.
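
As a rough illustration of the idea behind Pattern Clean (not Trifacta's actual pattern language), the sketch below abstracts each value into a datatype-like pattern, profiles which formats are present, and reformats every variant to one chosen pattern.

import re
from collections import Counter


def value_pattern(value):
    # Abstract a value into a pattern: digits become "9", letters become "A",
    # punctuation is kept as-is
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))


phones = ["(415) 555-0132", "415-555-0176", "415.555.0199", "(415) 555-0148"]

# Profile the formats present in the column
print(Counter(value_pattern(v) for v in phones))
# Counter({'(999) 999-9999': 2, '999-999-9999': 1, '999.999.9999': 1})


def to_standard(value):
    # Reformat any of the variants above to the chosen "(999) 999-9999" pattern
    digits = re.sub(r"\D", "", value)
    return "({}) {}-{}".format(digits[:3], digits[3:6], digits[6:])


print([to_standard(v) for v in phones])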

What’s Next

This is just the first step in our focus on making data cleaning more intelligent and efficient; we have a lot of exciting extensions of this work coming soon! We’ll make it possible to use an existing set of ‘gold standard’ values to speed the process up even more with Reference Clean. We also have new clustering methods planned, allowing for more ways to explore how values may be related. And we’ll be layering in more intelligence over time, giving users intelligent suggestions on how to resolve their values and clusters. Try it out today and let us know what you think!

The post Trifacta for Data Quality: Introducing Smart Cleaning appeared first on Trifacta.


Last week, we announced our expanded focus on bringing a modern approach to the process of data quality as part of our continued effort to build out a modern DataOps platform. Data quality is a hugely important piece of any organization’s data initiatives. Research shows that poor data quality costs organizations an average of $9.7 million in revenue per year, and we foresee that number only increasing as more and more organizations compete for better insights and more efficiency from AI and machine learning. By blending visual guidance, user interaction, and machine intelligence into an intuitive user interface, Trifacta’s Active Profiling enhances the ability to profile, discover, and validate data quality issues.

[Video: Active Profiling in Trifacta (YouTube)]

Active Profiling

A large piece of Active Profiling is surfacing relevant information and metadata to users as they interact with columns. We’ve built intelligent factories that take data (for example, the unique values in a column) and metadata (the overall data type of the column or the name of the column) about selected columns to determine which charts and metadata to show. For continuous types like numbers or dates, we can surface precisely binned distributions; for categorical types like phone numbers, we can surface phone number formats so users can work with and standardize heterogeneous data. We also allow users to drill down into larger datasets through in-depth details and unique-values panels. For example, when profiling a date column, our additional details panels include breakdowns of values by month, year, day of week, and so on. This intelligent, multi-step approach helps users gain a richer context for their data.
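
As a rough, outside-the-product analogy to the summaries described above, the pandas sketch below computes a binned distribution for a continuous column and month/day-of-week breakdowns for a date column (the sample data is made up).

import pandas as pd

# Hypothetical columns: an admission date and a continuous length-of-stay measure
df = pd.DataFrame({
    "admitted": pd.to_datetime(["2019-01-03", "2019-01-17", "2019-02-09", "2019-03-22"]),
    "length_of_stay": [2, 11, 5, 27],
})

# Continuous column: a binned distribution, akin to the histograms shown on selection
print(pd.cut(df["length_of_stay"], bins=3).value_counts().sort_index())

# Date column: drill-down breakdowns by month and day of week
print(df["admitted"].dt.month_name().value_counts())
print(df["admitted"].dt.day_name().value_counts())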

Alongside these changes, we’ve also enriched our format profiler. Now surfaced directly after a user’s column interaction and powered by a scalable, unsupervised ML clustering engine, the format profiler lets users discover the formats of thousands of unique values.

Through this rich and responsive interface, users are a single click away from flagging, filtering, and standardizing disparate formats.

Across all these charts, we provide common interactions like chart selections and responsive context menus to tie the user experience together. We’ve built a contextual selection model to understand and generalize user interactions. When users click on columns or interact with unique values or pick specific formats, we catalog the objects they’ve interacted with and determine what actions make sense for these objects. For example, when users interact with a specific set of unique values, we understand that they may want to filter down to those values but also replace those values with other values.

We surface potential tasks users might do through context menus which are present throughout the transformation experience and act as an anchor for users.

We also leverage our ML service to score these potential tasks based on profiles and other rich metadata about the selected objects, and provide ranked suggestions.

Finally, we’ve started exploring how data-oriented views like our transformation grid can surface metadata. As users interact with the surfaced profiles, we asynchronously compute the cross-section of the user’s selection in the profiles and the data in the grid. Powered by our in-memory Photon engine, we stitch together row- and cell-level information on which cells are selected. As user selections change and evolve, we re-render the transformation grid to reflect the metadata they’re interacting with. Using a combination of custom d3 and React, we’re able to refresh relevant pieces of the page while keeping the user’s experience stable; we can update what has changed within hundreds of milliseconds so that user interaction flows smoothly without interruptions.

What’s Next

Live feedback and constant validation are a guiding philosophy of our user experience, and we look forward to finding new and exciting ways to continue improving visual guidance and incorporating machine learning, making the task of understanding and resolving data quality issues intuitive to users of all backgrounds.

The post Trifacta for Data Quality: Introducing Active Profiling appeared first on Trifacta.


Data quality has been going through a renaissance recently.

As a growing number of organizations ramp up efforts to transition computing infrastructure to the cloud and invest in cutting-edge machine learning and AI initiatives, they are finding that the #1 barrier to success is the quality of their data.

The old adage “Garbage In, Garbage Out” has never been more relevant. With the speed and scale of today’s analytics workloads and the businesses that they support, the costs associated with poor data quality have also never been higher.

You’re seeing this reflected in a massive uptick in media coverage of the topic. Over the past few months, data quality has been the focus of feature articles in The Wall Street Journal, Forbes, Harvard Business Review, and MIT Sloan Management Review, among others. The common theme is that the success of machine learning and AI is completely dependent on data quality. To quote Thomas Redman, the author of the HBR article referenced above: “If Your Data Is Bad, Your Machine Learning Tools Are Useless.”

We’re seeing this increasing focus on data quality reflected in the work of our customers, including the Centers for Medicare and Medicaid (CMS), Deutsche Boerse, and GlaxoSmithKline. The need to accelerate data quality assessment, remediation, and monitoring has never been more critical, and organizations are finding that traditional approaches to data quality don’t provide the speed, scale, and agility required by today’s businesses.

This repeated pattern is what led to today’s announcement of our expansion into data quality and the unveiling of two major new platform capabilities: Active Profiling and Smart Cleaning. This is a big moment for our company because it’s the first time we’ve expanded our focus beyond data preparation. By adding new data quality functionality, we are advancing Trifacta’s capabilities to handle a wider set of data management tasks as part of a modern DataOps platform.

Legacy approaches to data quality involve many manual, disparate activities as part of a broader process. Dedicated data quality teams, often disconnected from the business context of the data they are working with, manage the process of profiling, fixing and continually monitoring data quality in operational workflows. Each step must be managed in a completely separate interface. It’s hard to iteratively move back-and-forth between steps such as profiling and remediation. Worst of all, the individuals doing the work of managing data quality often don’t have the appropriate context for the data to make informed decisions when business rules change or new situations arise.

Trifacta takes a different approach. Interactive visualizations and machine intelligence guide users by highlighting data quality issues and providing intelligent suggestions on how to address them. Profiling, user interaction, intelligent suggestions, and guided decision-making are all interconnected, and each drives the others. Users can seamlessly transition back and forth between steps to ensure their work is correct. This guided approach lowers the barrier to entry and helps democratize the work beyond siloed data quality teams, allowing those with the business context to own and deliver quality outputs with greater efficiency to downstream analytics initiatives.

In upcoming posts, my colleagues in product and engineering will provide a more detailed overview of the new capabilities we announced today including Active Profiling and Smart Cleaning. They’ll share not only what users are able to do with these new features but also context into the design and development of each function.

Keep in mind that this is just the first (albeit significant) step for our company into data quality. We have much more planned. Later in the year, Trifacta will be adding new capabilities to govern and monitor data quality in automated workflows, allowing users to isolate bad data, orchestrate workflows, and set and monitor data quality thresholds.

The post A New Approach to Data Quality for the Era of Cloud & AI appeared first on Trifacta.


The following is a guest blog post from Nate Ashton, Director of Accelerator Programs at Dcode.

It’s no secret that working with the US Government is complicated. In fact, a lot of high growth tech companies stay away from working with government because of all the barriers that exist around procurement, compliance, and more. It’s often said that the US government is generally ten years behind the private sector in terms of technology, but now more than ever the government can’t afford to be a decade behind. But the reality is, nobody has more data problems to solve than Uncle Sam. From complex auditing and compliance, to health care and benefits administration, there are countless major government programs sitting on massive volumes of untapped data.

At Dcode, we’re laser focused on trying to bridge that gap by running programs to make it a little bit easier for tech companies to work with government and a little easier for government to engage with tech.

For the third year in a row, we’re running a tech accelerator program focused on taking the best technologies from the private sector and bringing them to government. Including Trifacta in that group was a no-brainer for obvious reasons – nobody can tackle complex  data problems without wrangling that data, and there are many ways to do so in government:

Breaking down data silos: More and more agencies have set up data lakes and cloud environments to start bringing their data together, and Trifacta plays a key part in enabling that transition.

Bringing the government into the AI era: Investing in AI is a huge priority for the US government, and good AI starts with good, clean data.

Financial management and audits: Navigating disparate financial systems and business processes is complicated, to say the least, for an organization with an annual budget of $3.8 trillion. Trifacta is working today with one of the largest agencies in government to help them wrangle their financial data.

Science and research: From combating the opioid epidemic to enabling scientific research across the globe, the most important questions require the ability to leverage diverse data at scale.

Enabling data-driven decision-making: Data-driven government is more than just a buzzword. More and more government agencies are appointing Chief Data Officers as they seek to unleash the power of big data on their mission.

Citizen services: Agencies manage some of the biggest, most complex systems on the planet to provide citizen services ranging from government-backed loans to veterans benefits to social security. Unlocking their data has huge potential to improve the efficiency and quality of those services.

At Dcode, we’ll be wrapping up our accelerator program this week, and it’s fair to say that Trifacta is already making a splash in government. During Dcode’s 2019 Advanced Analytics accelerator, dozens of government programs across civilian, defense, and intelligence agencies joined us to sit with the Trifacta team and learn how to take their work to the next level. Several agencies are already using Trifacta’s Wrangler Enterprise to improve their operations, and we can’t wait to see what team Trifacta will do next!

Join Dcode and Trifacta for a happy hour reception in DC on Thursday March 14th: RSVP here.

The post Wrangling Big, Diverse Data in Government appeared first on Trifacta.


#1 rankings feel good no matter what.

#1 rankings especially feel good when they are driven by feedback from customers and users. This is exactly how the Dresner Advisory Services Wisdom of Crowds reports work. Howard Dresner and his team reach out to the users of a variety of different products related to business intelligence and analytics to understand the trends in specific market categories and who the top vendors are in each space.

Dresner published the first independent research report recognizing data preparation as a stand-alone space back in 2015 and has done so every year since – 2016, 2017, 2018, and now 2019. We are incredibly grateful to the Dresner team for helping kick off the data preparation market back in 2015, which has since grown to be recognized by a number of other leading analyst firms, including Forrester, Gartner, and Ovum, among others.

Since Dresner’s initial study back in 2015, we’ve been fortunate enough to have been ranked the #1 data preparation vendor. The recent publishing of the 2019 report continues this trend as we have once again achieved the top spot.

As the company that helped define the data preparation market following 20 years of joint academic research at Stanford and UC Berkeley, the evolution and growth of data prep has been extremely rewarding to be a part of. We invite you to review the results of the study by downloading the full 2019 Data Preparation Market Study from Dresner Advisory Services. Here are a few key takeaways from this year’s report:

  • Data Preparation is Rising in Importance – 87% of respondents to Dresner’s 2019 survey found that data preparation is at minimum “important” with 63% saying it’s either “critical” or “very important”
  • Usage of Data Prep Solutions is Improving Effectiveness – Over the past five years, as more organizations have adopted data preparation as part of their BI and analytics initiatives, their effectiveness has improved. The 2019 study showed that close to 80% of respondents found their current approach to data preparation at least “somewhat effective.”
  • Frequency of Data Preparation Picking Up – This year’s study showed that 66% of respondents either “constantly” or “frequently” make use of data preparation. Another 26% of respondents report at least “occasional” usage of data preparation.
  • Operationalization is Now Critical – While data preparation was once mostly considered an ad-hoc activity, it is now an operational process that organizations utilize for production workflows. This is proving out in the market, and the 2019 Dresner study showed it as well, with “schedule a process” leading as the most important feature of data preparation solutions.

Once again, we’re honored to have been ranked as the #1 data preparation solution for the 5th straight year and are excited to see what the next five years have in store for this space.

To download the full 2019 Data Preparation Market Study from Dresner Advisory Services, click here. To try our product out for yourself, sign up for our free Wrangler edition here.

The post 5 Years as #1 – The Results of Dresner’s 2019 Data Preparation Market Study appeared first on Trifacta.
