Today on July 19, we released Talend Summer ’18, which is jam-packed with cloud features and capabilities. We know you are going to love the Talend Cloud automated integration pipelines, Okta Single Sign-On, and the enhanced data preparation and data stewardship functions…there is so much to explore!
Taking DevOps to the Next Level with the Launch of Jenkins Maven Plug-in Support
DevOps has become a widely adopted practice that streamlines and automates the processes between Development (Dev) and IT operations (Ops), so that they can design, build, test, and deliver software in a more agile, frictionless, and reliable fashion. The perennial challenge, however, is that when it comes to DevOps, customers are tasked with finding not only the right people and culture, but also the right technology.
Data Integration fits into DevOps when it comes to building continuous data integration flows, as well as governing apps to support seamless data flows between apps and data stores. Selecting an integration tool that automates the process is critical. It will not only allow integration flows to be deployed and tested more frequently against different environments, increase code quality, and reduce downtime, but also free up the DevOps team’s time to work on new code.
Talend Cloud has transformed the way developers and ops teams collaborate to release software in the past few years. With the launch of Winter ’17, Talend Cloud accelerated the continuous delivery of integration projects by allowing teams to create, promote, and publish jobs in separate production environments. An increasing number of customers recognize the value that Talend Cloud brings for implementing DevOps practices. And now, with this Summer ’18 release, they can use the Talend Cloud Jenkins Maven plug-in, a feature that lets you automate and orchestrate the full integration process by building, testing, and pushing jobs to all Talend Cloud environments. This in turn further boosts the productivity of your DevOps team and reduces time-to-market.
Security and Compliance Made Simple: Enterprise Identity and Access Management (IAM) with 1 Click
If you are an enterprise customer, you are likely faced with the growing demands of managing thousands of users and partners who need access to your cloud applications, at any time and from any device. This adds to the complexity of Enterprise Identity and Access Management (IAM) requirements: meeting security and compliance regulations and audit policies, minimizing IT tickets, and giving only the right users access to the right apps. A Single Sign-On (SSO) feature helps address this challenge.
In the Summer ’18 release, Talend Cloud introduces Okta Single Sign-On (SSO) support. SSO permits a user to use one set of company login credentials to access multiple applications at once. This update ensures greater compliance with your company’s security and audit policies and improves user convenience. If you use another identity management provider, you can simply download a plug-in to leverage this SSO feature.
Other security and compliance features worth mentioning in this release are Hadoop User Impersonation for Jobs in the cloud integration app, and fine-grained permissions on semantic type definitions. Both provide greater data and user visibility for better compliance and auditing; see the release notes for details.
Better Data Governance at Your Fingertips: New Features in Talend Data Preparation and Data Stewardship Cloud Apps
The Summer ’18 release introduces several new data preparation and data stewardship functions. These include:
More data privacy and encryption functions with the new “hash data” function.
Finer grained access control in the dictionary service for managing and accessing the semantic types.
Improved management in Data Stewardship, now that you can perform mass import, export and remove actions on your data models and campaigns, allowing you to promote, back up or reset your entire environment configuration in just two clicks.
Enhancements to Salesforce.com connectivity that let you filter data in the source module by defining a condition directly in your Salesforce dataset, so you can focus on the data you need. This reduces the amount of data to be extracted and processed, making the use case of self-service cleansing and preparation of Salesforce.com data even more compelling.
Together, these capabilities make cloud data governance simpler and easier.
The day an IBM scientist invented the relational database in 1970 completely changed the nature of how we use data. For the first time, data became readily accessible to business users. Businesses began to unlock the power of data to make decisions and increase growth. Fast-forward 48 years to 2018, and all the leading companies have one thing in common: they are intensely data-driven.
The world has woken up to the fact that data has the power to transform everything that we do in every industry from finance to retail to healthcare – if we use it the right way. And businesses that win are maximizing their data to create better customer experiences, improve logistics, and derive valuable business intelligence for future decision-making. But right now, we are at a critical inflection point. Data is doubling each year, and the amount of data available for use in the next 48 years is going to take us to dramatically different places than the world’s ever seen.
Let’s explore the confluence of events that have brought us to this turning point, and how your enterprise can harness all this innovation – at a reasonable cost.
Today’s Data-driven Landscape
We are currently experiencing a “perfect storm” of data. The incredibly low cost of sensors, ubiquitous networking, cheap processing in the Cloud, and dynamic computing resources are increasing not only the volume of data, but also the enterprise imperative to do something with it. We can act on data in real time, and the number of self-service practitioners is tripling annually. The emergence of machine learning and cognitive computing has blown the possibilities for data up to completely new levels.
Machine learning and cognitive computing allow us to deal with data at an unprecedented scale and find correlations that no amount of brain power could conceive. Knowing we can use data in a completely transformative way makes the possibilities seem limitless. Theoretically, we should all be data-driven enterprises. Realistically, however, there are some roadblocks that make it seem difficult to take advantage of the power of data:
Trapped in the Legacy Cycle with a Flat Budget
The “perfect storm” of data is driving a set of requirements that is dramatically outstripping what most IT shops can do. Budgets are flat — increasing only 4.5% annually — leaving companies to feel locked into a set of technology choices and vendors. In other words, they’re stuck in the “legacy cycle”. Many IT teams are still spending most of their budget just trying to keep the lights on. The remaining budget is spent trying to modernize and innovate, and then a few years later, all that new modern stuff you bought is legacy all over again, and the cycle repeats. That’s the cycle of pain that we’ve all lived through for the last 20 years.
Lack of Data Quality and Accessibility
Most enterprise data is bad. Incorrect, inconsistent, inaccessible…these factors hold enterprises back from extracting the value from data. In a Harvard Business Review study, only 3% of the data surveyed was found to be of “acceptable” quality. That is why data analysts are spending 80% of their time preparing data as opposed to doing the analytics that we’re paying them for. If we can’t ensure data quality, let alone access the data we need, how will we ever realize its value?
Increasing Threats to Data
The immense power of data also increases the threat of its exploitation. Hacking and security breaches are on the rise; the global cost of cybercrime fallout is expected to reach $6 trillion by 2021, double the $3 trillion cost in 2015. In light of the growing threat, the number of security and privacy regulations is multiplying. Given the issues with data integrity, organizations want to know: Is my data both correct and secure? How can data security be ensured in the middle of this data revolution?
Vendor Competition is Intense
The entire software industry is being reinvented from the ground up and all are in a race to the cloud. Your enterprise should be prepared to take full advantage of these innovations and choose vendors most prepared to liberate your data, not just today, but tomorrow, and the year after that.
It might seem impossible to harness all this innovation at a reasonable cost. Yet, there are companies that are thriving amid this data-driven transformation. Their secret? They have discovered a completely disruptive way, a fundamentally new economic way, to embrace this change.
We are talking about the data disruptors – and their strategy is not as radical as it sounds. These are the ones who have found a way to put more data to work with the same budget. For the data disruptors, success doesn’t come from investing more budget in the legacy architecture. These disruptors take an approach with a modern data architecture that allows them to liberate their data from the underlying infrastructure.
Put More of Your Data to Work
The organizations that can quickly put the right data to work will have a competitive advantage. Modern technologies make it possible to liberate your data and thrive in today’s hybrid, multi-cloud, real-time, machine learning world. Here are three prime examples of innovations that you need to know about:
Cloud Computing: The cloud has created new efficiencies and cost savings that organizations never dreamed would be possible. Cloud storage is remote and fluctuates to deliver only the capacity that is needed. It eliminates the time and expense of maintaining on-premise servers, and gives business users real-time self-service to data, anytime, anywhere. There is no hand-coding required, so business users can create integrations between any SaaS and on-premise application in the cloud without requiring IT help. Cloud offers cost, capability and productivity gains that on-premise can’t compete with, and the data disruptors have already entrusted their exploding data volumes to the cloud.
Containers: Containers are quickly overtaking virtual machines. According to a recent study, the adoption of application containers will grow by 40% annually through 2020. Virtual machines require costly overhead and time-consuming maintenance, each with a full hardware stack and operating system (OS) that must be managed. Containers are portable with few moving parts and minimal maintenance required. A company using stacked container layers pays only for a small slice of the OS and hardware on which the containers are stacked, giving data disruptors unlimited operating potential, at a huge cost savings.
Serverless Computing: Deploying and managing big data technologies can be complicated, costly and requires expertise that is hard to find. Research by Gartner states, “Serverless platform-as-a-service (PaaS) can improve the efficiency and agility of cloud services, reduce the cost of entry to the cloud for beginners, and accelerate the pace of organizations’ modernization of IT.”
Serverless computing allows users to run code without provisioning or managing any underlying system or application infrastructure. Instead, the systems automatically scale to support increasing or decreasing workloads on-demand as data becomes available.
Its name is a misnomer; serverless computing still requires servers, but the cost covers only the server capacity actually used: companies are charged only for what they are running at any given time, eliminating the waste associated with on-premise servers. The system scales up as much as needed to solve the problem, runs it, then scales back down and turns off. The future is serverless, and its potential to liberate your data is limitless.
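To make the pay-for-use economics concrete, here is a small back-of-the-envelope comparison. Every rate and workload figure below is hypothetical, chosen only to illustrate the billing model:

```python
# Hypothetical cost comparison: an always-on server vs. a serverless
# function billed only for the compute seconds actually consumed.

def monthly_server_cost(hourly_rate, hours=730):
    """Flat cost of a server that runs whether or not work arrives."""
    return hourly_rate * hours

def monthly_serverless_cost(invocations, seconds_each, rate_per_second):
    """Pay only for execution time: invocations x duration x rate."""
    return invocations * seconds_each * rate_per_second

always_on = monthly_server_cost(hourly_rate=0.10)
on_demand = monthly_serverless_cost(invocations=50_000, seconds_each=2,
                                    rate_per_second=0.00005)
print(f"always-on: ${always_on:.2f}/month, serverless: ${on_demand:.2f}/month")
```

With these made-up numbers, a workload that runs for 100,000 compute-seconds a month costs a fraction of an idle-most-of-the-time server; the crossover point depends entirely on how bursty your workload is.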
Join the Data Disruptors
Now is the time to break free from the legacy trap and liberate your data so its potential can be maximized by your business. In the face of growing data volumes, the data disruptors have realized the potential of the latest cloud-based technologies. Their business and IT teams can work together in a collaborative way, finding an end-to-end solution to the problem, all in a secure and compliant fashion. Harness this innovation and create a completely disruptive set of data economics so your organization can efficiently surf the tidal wave of data.
Talend Data Integration is an enterprise data integration platform that provides visual design while generating simple Java. This lightweight, modular design approach is a great fit for containers. In this blog post, we’ll walk you through how to containerize your Talend job with a single click. All of the code examples in this post can be found on our Talend Job2Docker Git repository. The git readme also includes step-by-step instructions.
There are two parts to Job2Docker. The first part is a very simple Bash script that packages your Talend job zip file as a Docker image. During this packaging step, the script tweaks your Talend job launch command so it will run as PID 1. It does not modify the job in any other way. As a result, your Talend Job will be the only process in the container. This approach is consistent with the spirit and best practices for Docker. When you create a container instance, your Talend job will automatically run.
When your job is finished, the container will shut down. Your application logic is safely decoupled from the hosting compute infrastructure. You can now leverage container orchestration tools such as Kubernetes, OpenShift, Docker Swarm, EC2 Container Services, or Azure Container Instances to manage your job containers efficiently. Whether operating in the Cloud or on-premises, you can leverage the improved elasticity to reduce your total cost of ownership.
The second part of Job2Docker is a simple utility Job written in Talend itself. It monitors a shared directory. Whenever you build a Job from Studio to this directory, the Job2Docker listener invokes the job packaging script to create the Docker image.
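Conceptually, the listener's polling loop looks something like the sketch below. This is not the actual Job2Docker code (the real listener is itself a Talend job), and the packaging script name is invented for illustration:

```python
# A simplified, hypothetical sketch of the Job2Docker listener's behavior:
# watch a shared directory and hand each new Talend job zip to a packaging
# script. The script name "./job2docker_build.sh" is made up.
import subprocess
import time
from pathlib import Path

def package_new_jobs(shared_dir, seen, runner=subprocess.run):
    """One polling pass: invoke the packaging script for every Talend
    job zip that appeared in the shared directory since the last pass."""
    current = set(Path(shared_dir).glob("*.zip"))
    for job_zip in sorted(current - seen):
        runner(["./job2docker_build.sh", str(job_zip)], check=True)
    return seen | current

def watch(shared_dir, poll_seconds=5):
    """Monitor the shared build directory, as the listener job does."""
    seen = set()
    while True:
        seen = package_new_jobs(shared_dir, seen)
        time.sleep(poll_seconds)
```

The key design point is the same as the real listener's: the Studio user never calls the packaging script directly; simply building a job into the shared directory triggers image creation.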
All you need to run the examples are an instance of Talend Studio 7.0.1 (that you can download for free here) and a server running Docker.
If you run Studio on Linux, you can simply install Docker and select a directory to share your Talend Jobs with Docker.
If you run Studio on Windows, then you can either try Docker on Windows, or you can install Linux on a VM.
The examples here were run on a Linux VM while Talend Studio ran in Windows on the host OS. For the Studio and Docker to communicate, you will need to share a folder using your VM tool of choice.
Once you have installed these Job2Docker scripts and listener, the workflow is transparent to the Studio user.
1. Start the Job2Docker_listener job monitoring the shared directory.
2. Click “Build” in Talend Studio to create a Talend job zip file in the shared directory. That’s it.
3. The Talend Job2Docker_listener triggers the Job2Docker script to convert the Talend zip file to a .tgz file ready for Docker.
4. The Talend Job2Docker_listener triggers the Job2Docker_build script creating a Docker Image.
5. Create a container with the Docker Run command. Your job runs automatically.
Talend & Docker - Running Containers with Talend Jobs - YouTube
Once you have your Docker image you can publish it to your Docker registry of choice and stage your containerized application as part of your Continuous Integration workflow.
While this process is remarkably simple, it has big implications for how you can manage your workflows at scale. In our next post, we’ll show you how to orchestrate multiple containerized jobs using container orchestration tools like Kubernetes, EC2 Container Services, Fargate, or Azure Container Instances.
In the last 3 years, IT has moved to the center of many business-critical discussions taking place in the boardroom. Chief among these discussions are those focused on digital transformation, especially those around operational effectiveness and customer centricity. A critical component for the success of these initiatives is the ability for the business to make timely business decisions and to do so with confidence. To do this, you must have 100% confidence in your ability to access and analyze all the relevant data.
The Rise of Cloud
Among the major technology shifts that are fostering this digital transformation, the rise of cloud has provided companies with innovative and compelling ways to interact with their teams, customers and business partners.
However, organizations often find themselves limited by their capability to work with the vast amount of data available. Most IT organizations don’t have the ability to access all the useful data available in a timely manner to fulfill business organizations’ requests. This gap between IT teams and business groups limits organizations’ ability to take full advantage of the digital transformation.
Through the economics of a Cloud Data Warehouse, you can store significantly more information on a smaller budget, dramatically expand the types of data that can be analyzed to drive deeper insight, and give users self-service capabilities.
Talend Cloud and Snowflake together enable you to connect a broad set of sources and targets, covering both structured and unstructured data.
In the video below, you’ll learn how you can use Talend to quickly and easily load your data coming from various sources and bulk load that data directly into tables in a Snowflake data warehouse. Once the data has been bulk loaded into Snowflake, you can further use Talend to perform ELT (Extract/Load/Transform) functions on that data within Snowflake. This allows you to take full advantage of Snowflake’s processing power and ease-of-use SQL querying interface to transform data in place.
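Under the hood, the bulk-load-then-transform pattern boils down to two kinds of Snowflake SQL statements, sketched below. The table and stage names are invented for illustration; only the COPY INTO and INSERT ... SELECT statement forms are standard Snowflake SQL:

```python
# Sketch of the two SQL steps behind bulk load + in-warehouse ELT.
# "raw_orders", "orders_stage", and the column names are hypothetical.

def copy_into_sql(table, stage):
    """Step 1 - bulk load staged files into a table via COPY INTO."""
    return (f"COPY INTO {table} FROM @{stage} "
            f"FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)")

def elt_aggregate_sql(target, source):
    """Step 2 - transform in place with INSERT ... SELECT, so the heavy
    lifting runs inside Snowflake rather than on an external ETL server."""
    return (f"INSERT INTO {target} "
            f"SELECT customer_id, country, SUM(amount) "
            f"FROM {source} GROUP BY customer_id, country")

print(copy_into_sql("raw_orders", "orders_stage"))
print(elt_aggregate_sql("orders_by_country", "raw_orders"))
```

The point of the ELT approach is visible in step 2: the aggregation never leaves the warehouse, so it scales with Snowflake's compute rather than with your integration server.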
Snowflake + Talend: Using Talend to Bulk Load Data in Snowflake - YouTube
If you handle large volumes of data (up to petabytes) and need a fast and scalable data warehouse, Talend is the right solution to access and load data into a Cloud data warehouse and then give everyone in your organization access to that data when and where they need it! Start your free 30-day trial here.
Talend Spring ’18 (Talend 7.0) was Talend’s biggest release yet. With improvements to Cloud, big data, governance, and developer productivity, Talend Spring ’18 is everything you need to manage your company’s data lifecycle. For those who missed the live webinar, you can view it here. But as highlighted in the webinar, I realize that not everyone is going serverless today or looking to ingest streaming data… so what are some of the things in this release that most Talend users will start using immediately? In other words, the hidden gems that did not make the headlines.
#1 Hidden Gem – Unified database selection
For this scenario, you wired your database components into an integration pipeline, e.g. Oracle and Salesforce into AWS Redshift, and then a few months later you want to use a different database like Snowflake. But of course, this needs to be done across the many pipelines that you’ve implemented. With Talend Spring ’18, several unified database components, labeled tDB***, have been added into the Talend Studio as an entry point to a variety of databases. Instead of deleting the Redshift component, adding the Snowflake component and configuring the Snowflake connection, you just click on the Redshift component, a drop-down list appears, and you select Snowflake. Big time savings!
#2 Hidden Gem – Smart tMap fuzzy auto-mapping
For this scenario, you are using your favorite component, tMap. You select two data sources to map, and as long as the input and output column names are the same, the auto-mapping works great! But most of the time, since different developers created the tables, the column names differ, e.g. First_Name or FirstName or FirstName2 or FirstName_c (a custom field in Salesforce). With Talend Spring ’18 and Smart tMap Fuzzy Auto-mapping, tMap uses data quality algorithms (Levenshtein, Jaccard) to fuzzy-match the names, saving you time as similarly named columns are matched for you. Just imagine how much faster this will be when dealing with tables with hundreds of columns!
[Table: example input and output column names, showing which pairs auto-map in Talend 6.5 versus Talend 7.0]
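To see why edit-distance algorithms handle these naming variations well, here is a small illustration of Levenshtein-based column matching. This is not Talend's actual implementation, just a sketch of the idea:

```python
# Illustrative fuzzy column matching in the spirit of Smart tMap
# auto-mapping (not Talend's real algorithm).

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def best_match(input_col, output_cols):
    """Pick the output column closest to the input column, ignoring
    case and underscores as a crude normalization."""
    norm = lambda s: s.lower().replace("_", "")
    return min(output_cols, key=lambda c: levenshtein(norm(input_col), norm(c)))

print(best_match("First_Name", ["FirstName_c", "LastName", "Phone"]))
```

Here "First_Name" lands on "FirstName_c" because, after normalization, only one edit separates them, while every other candidate is several edits away.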
#3 Hidden Gem – Dynamic distribution support for Cloudera CDH
For this scenario, you just installed the latest release of Talend 7.0, which has Cloudera CDH 5.13 support, and you want to use some of the new features in Cloudera CDH 5.14. Bummer! Talend Spring ’18 enables you to quickly access the latest distros without upgrading Talend, through a feature called Dynamic Distributions (technical preview). Maven is used to decouple the job from the Big Data platform version, making it quicker and easier to adopt new Hadoop distributions, so you can switch to the latest release of a new Big Data distribution as soon as it becomes available. Initial support is provided for Cloudera CDH (our R&D team is working on others).
How does it work? When a new version of Cloudera CDH is released by Cloudera, Talend Studio will automatically list available versions. If you want to use it within your Talend jobs, then Studio will download all the related dependencies. This is an option in the project properties of the Studio (not in TAC or TMC) as this is only needed at design-time.
Not only are you able to use the features more efficiently, this update can also save you days of administration and upgrade time.
#4 Hidden Gem – Continuous Integration updates
For this scenario, your DevOps team operates like a well-oiled machine firing on all cylinders. While projects are getting completed faster, there were some Talend CI/CD command inconsistencies and if you changed a referenced job, you would need to recompile all of the connected jobs. With Talend Spring ’18, the CI features have been rewritten to use Maven standards and CI best practices.
You can now do incremental builds in the Talend Studio, so only the updated job is rebuilt, not all of them. Other features include broader Git support, including Bitbucket Server 5.x, Nexus 3 support for the Talend Artifact Repository, standard Maven commands application integration (technical preview), and the ability to easily extend the build process through Maven Plug-ins and custom Project Object Models (POMs). Translation: we have seen a 50% improvement in build time! How cool is that?
#5 Hidden Gem – Remote Cloud Debug
For this final scenario, we look at the process of debugging Talend Cloud integration jobs. Everything works in Talend Studio, and the next step is to publish to Talend Cloud and test in the QA environment. Talend Spring ’18 provides a free test engine and the ability to remotely debug jobs (data integration or big data) on either Talend Cloud Engines or Remote Engines. The new feature allows integration developers to run their pipelines from the Studio on a cloud engine or remote engine and see logs and debug designs locally, which increases productivity by cutting testing and debugging from minutes to seconds.
Now that you know all the hidden gems that Talend Spring ’18 can bring you, go put more data to work today. Enjoy!
To learn more about all the Talend Spring ’18 hidden gems, check out What’s New, or take it for a test drive.
A common practice in any development team is to conduct code reviews, or at least it should be. This is a process where multiple developers inspect written code and discuss its design, implementation, and structure to increase quality and accuracy. Whether you subscribe to the notion of formal reviews or a more lightweight method (such as pair programming), code reviews have proven to be effective at finding defects and/or insufficiencies before they hit production.
In addition, code reviews can help ensure teams are following established best practices. The collaboration can help identify new best practices to follow as well along the way! Not only that, regular code reviews allow for a level of information sharing and an opportunity for all developers to learn from each other. This is especially true for the more junior developers, but even senior developers learn a thing or two from this process.
While you aren’t writing actual code with Talend (it is a code generator), developing jobs shares many characteristics with line-by-line coding. After all, Talend is a very flexible platform that allows developers to build jobs in many ways. All the benefits of code reviews still apply, even if you are only reviewing job designs, settings, and orchestration flow.
The “Why” of Talend Job Reviews
If I had to summarize the goals of Talend job reviews in a single word, it would be ‘Quality’. Quality in the tactical sense that reviewed jobs will tend to perform better, have fewer defects, and be much easier to maintain over time. Quality in the more strategic sense that, over time, these reviews will naturally improve your developers’ skills and hone best practices, such that future jobs also perform better, have fewer defects, and are easier to maintain even before they hit review!
Have I sold you on Talend job design reviews yet? Attitudes vary on this; after all, there is a certain amount of ego involved in the jobs that developers build… and that’s OK. It’s important, however, to focus on the positive: treat reviews as an opportunity to learn from each other and to improve everyone’s skills. Always be thoughtful and respectful to the developers involved. Depending on your team culture, paired reviews may work better than formal full-team reviews. I do recommend meeting face-to-face, however, and not creating offline review processes, as there is a lot to be gained by collaborating during these reviews. Be pragmatic. Solicit the team for input on how to improve your code review process.
Quantitative & Qualitative
When reviewing Talend jobs, it’s important to think both Quantitatively and Qualitatively.
Qualitatively speaking, you want to adapt and adopt best practices. If you haven’t read Talend’s recommended best practices, then I highly recommend doing so. Currently the four-part Blog series can be found here:
In these documented best practices, we discuss some qualities of good job design. Recommendations on how to make your jobs easy to read, easy to write, and easy to maintain. In addition, you’ll find other foundational precepts to building the best possible jobs. Consider these:
And several others
Choosing and balancing these precepts is key!
You might also find the follow-on series (two parts so far) on successful methodologies and actual job design patterns interesting.
I think you’ll find all these blogs worth the read.
OK, great: now how should we quantitatively review jobs? In code review projects, there are often metric tools that teams use to assess the complexity and maintainability of code. Did you know that Talend has an audit tool that covers much of this? You’ll find it in the Talend Administration Center (TAC) here:
Here you will find all sorts of really great information on your jobs! This is a very often overlooked feature of TAC, so be sure to check it out. I recommend reading our Project Audit User Guide.
Talend Project Audit provides several functions for auditing a project by investigating different elements of the Jobs designed in the Studio. Talend Project Audit reviews:
Degree of difficulty of Jobs and components used in jobs
Job error management
Documentation and Job versioning
Usage of metadata items in Job designs
Layout-related issues in Job and subjob graphical designs
I’ve been asked in the past whether it is possible to use standard code metric tools with Talend. The answer is yes, although with some caveats. Code metric tools can be run after the code generation phase of the build, so you could create a Jenkins job for this or inject metrics collection right into your standard builds. A complex job in Talend will likely generate more complex code.
However, keep in mind that you are maintaining this code through a graphical user interface with prebuilt and reusable components. The complexity and maintainability metrics are fine to compare between jobs; however, the comparison is not always apples-to-apples with hand-coded programs. For that reason, I generally recommend using our Audit functionality.
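That said, if you want a quick hand-rolled number, you can compute a crude complexity proxy directly on the generated Java. The snippet below is only an illustration; dedicated metric tools and the Audit feature are far more thorough:

```python
# A rough cyclomatic-complexity proxy over generated Java source:
# 1 + the count of branch keywords. The Java fragment is made up to
# stand in for Talend-generated code.
import re

def complexity_proxy(java_source):
    """Rough cyclomatic-complexity proxy: 1 + count of branch keywords."""
    return 1 + len(re.findall(r"\b(?:if|for|while|case|catch)\b", java_source))

generated = """
    public void tMap_1Process() {
        for (Row r : rows) {
            if (r.isValid()) { emit(r); }
        }
    }
"""
print(complexity_proxy(generated))  # 3 (one 'for', one 'if', plus 1)
```

Numbers like this are only meaningful for comparing one generated job against another, which is exactly the caveat above about hand-coded programs.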
7 Best Practice Guidelines for Reviewing Jobs
Now that I’ve (hopefully) convinced you of the concept of Talend Job reviews, I want to conclude with some best practices and guidelines as you start to put this concept into practice.
Capture metrics and when you find areas of jobs to change, count them.
Use t-shirt size classification.
It’s important to try and quantify the impact of your review sessions.
Track your defects and see if your review efforts are paying off with fewer defects making it past the review process. Capture how long your reviews take.
Don’t review too many jobs at one time.
In fact, don’t even review too much of the same job at one time if it is complex.
Take your time to understand the job design and patterns.
Keep your jobs well described and annotated to ease the review process.
Label your components, data flows and subjobs.
Document poor patterns and create a watch list for future job reviews.
Create and share your discovered best practices.
Build a feedback mechanism into your process so that recommended changes and defect resolutions can be implemented on schedule.
Talend job reviews should be part of your development process, and should not be rushed in as an afterthought.
For any financial services organization, failure to comply with regulations is front-page news that can significantly damage brand reputation, customer loyalty, and the bottom line.
The drive for greater transparency over customers’ finance data has led to a number of regulations and legal standards such as PSD2, Open Banking and, most recently, GDPR being introduced to the mix. In this article, I will discuss how we should view regulations as an opportunity rather than a barrier to innovation.
The regulatory minefield known as 2018…
This year has been a milestone one for regulatory changes in financial services. Open Banking launched in January 2018 with a whimper more than a bang. One possible explanation for this was a reluctance to cause a panic among consumers. Research by Ipsos MORI found that while almost two thirds (63%) of UK consumers see the services enabled by Open Banking as ‘unique’, just 13% of them would be comfortable allowing third parties to access their bank data. These figures are likely to have been impacted by high-profile breaches affecting the finance industry, which soured attitudes towards data protection policies.
Open Banking is built on the second Payment Services Directive, more commonly known as PSD2. Despite its fame being somewhat dwarfed by that of the General Data Protection Regulation (GDPR), PSD2 is a data revolution in the banking industry across Europe. By opening up banks’ APIs to third-parties, consumers will be able to take advantage of smoother transactions, innovative new services and greater transparency in terms of fees and surcharges. In the UK, this is partly enabled through the Competition and Markets Authority’s (CMA) requirement for the largest current account providers to implement Open Banking.
Creating these experiences for consumers requires APIs which seamlessly draw together information from multiple datasets and sources. Step in GDPR, which has tightened the controls consumers have over their data and introduced greater financial ramifications for companies and organizations that do not adhere to it. The penalty for non-compliance is €20 million or 4% of global annual turnover, whichever is higher. One of the fundamental principles of GDPR compliance is providing greater transparency over where personal data is and how it is being used at all times.
PSD2 and Open Banking align with this because it is the consumer that has the control over whether their data is shared with third parties, as well as the power to stop it being shared. In addition, the concept of the ‘right-to-be-forgotten’ enshrined in GDPR means that consumers can demand that any data held by the third-party service provider be permanently deleted. Similarly, because GDPR puts the onus of data protection on both data controllers (i.e., banks) and data processors (i.e., PISPs and AISPs) it is in the interests of both to ensure that their data governance strategies and technology are fit for purpose. As has been pointed out by Deloitte and Accenture, there might be contradictions within these regulations, but the overriding message is that transparency and consent are key for banks who need good quality data to provide more innovative services.
Regulating the world’s most valuable commodity
Having untangled the web of data regulations facing the finance industry, we must remember that with the rise of big data, the cloud, and analytics based on machine learning, data is no longer something which clogs up your internal systems until it needs to be disposed of. Data is the world’s most valuable commodity – the rocket fuel that has powered the rise of Internet giants like Facebook, hyperscalers like AWS, and industry disruptors like Uber. To the finance industry, data is a matter of boom or bust, and given the vital role they play in society, consumers and businesses need banks to have data. This is why banks must take a proactive view towards data governance and treat it as an opportunity rather than a necessary evil.
EY’s 2018 annual banking regulatory outlook stresses the importance of banks staying on the front foot when it comes to regulatory compliance. It lists five key actions: achieving good governance, creating a culture of compliance, exerting command over data, investing in the ability to analyze data, and developing strategic partnerships. As these key points suggest, a proactive view of data governance does not stop at compliance. It’s about creating a virtuous cycle of data being analyzed and the insight gleaned from this analysis being turned into services which customers appreciate. This will make customers want to share their data, as they can see the hyper-personalized and customized services which they get as a result.
As a rule of thumb, the more information you give your bank, the more personalized the service it can provide. This is true across an entire range of services, such as calculating credit ratings and advising on savings and borrowing. However, this scenario works both ways, and regulations such as Open Banking, PSD2, and GDPR put the power firmly in consumers’ hands. So, the more data organizations ask for, the higher customers’ expectations of personalized services. Customers need to see what their data is being used for, so transparency is key if financial firms are to build and maintain trust with customers. Furthermore, to offer highly personalized products and services based on complex analysis of big data, organizations should already know where data is stored and how it is being used.
In summary, data protection regulations such as Open Banking, PSD2, and GDPR must be viewed as opportunities for financial services organizations to re-establish trust with consumers, which may have been eroded by high-profile data breaches in 2017. In a way, this brings us back to the basics of what financial services are all about: being a steward of people’s assets. When it comes to customer trust, financial leaders shouldn’t wait for regulators to keep their companies in check.
Understanding where data is and that it is managed correctly is not only fundamental to regulatory compliance and customer trust, but also to providing the highly personalized and predictive services that customers crave. Therefore, the requirements of the regulation are by no means at odds with the strategies of data-driven finance firms, but in actual fact perfectly aligned.
Major data breaches are becoming more common. Big data breaches may include password data (hopefully hashed), potentially allowing attackers to log in to and steal data from our other accounts, or worse.
The majority of people use very weak passwords and reuse them. A password manager assists in generating and retrieving complex passwords, potentially storing such passwords in an encrypted database or calculating them on demand.
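As a sketch of what "generating complex passwords" involves, Java's standard SecureRandom can draw characters uniformly from a large alphabet (the class name and alphabet below are illustrative, not taken from any particular password manager):

```java
import java.security.SecureRandom;

public class PasswordGen {
    // Alphabet of upper/lower-case letters, digits, and symbols.
    private static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789!@#$%^&*()-_=+";
    private static final SecureRandom RNG = new SecureRandom();

    // Returns a random password of the requested length drawn from ALPHABET.
    static String generate(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(RNG.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(generate(20));
    }
}
```

With 76 possible characters, a 20-character password of this kind has far more entropy than anything a person would memorize, which is exactly why it needs to live in a vault rather than in someone's head.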
In this article, we are going to show how easy it is to integrate Talend with enterprise password vaults like CyberArk. By doing this, no developer has direct access to sensitive passwords, Talend does not need to know the password at compile (design) time, and password management (changing or updating passwords) is done outside of Talend. This saves administrators time and improves the security of the overall environment.
An Introduction to Password Security with CyberArk
CyberArk is an information security company offering Privileged Account Security, a solution designed to discover, secure, rotate, and control access to privileged account passwords used to access systems throughout the enterprise IT environment.
Create a CyberArk Safe
To get started, we first need to build our safe. A safe is a container for storing passwords. Safes are typically created based on who will need access to the privileged accounts and whose passwords will be stored within the safe.
For instance, you might create a safe for a business unit or for a group of administrators. The safes are collectively referred to as the vault. Here’s a step-by-step guide on how to do that within CyberArk.
Login to CyberArk with your credentials.
Navigate to Policies -> Access Controls (Safes) -> click on “Add Safe”.
Create a safe.
You will need a CyberArk safe to store objects.
Creating the Application
Next, we need to create the application we will use in order to retrieve the credential from the vault. Applications connect to CyberArk using their application ID and application characteristics. Application characteristics are additional authentication factors for the applications created in CyberArk.
Each application should have a unique application ID (application name). It can be changed later, but doing so may also require a code change on the client side where the application ID is used to get the password; in our case, that is Talend code. Here’s how to do this.
Navigate to Applications and click on “Add Application”.
We now have a CyberArk account and application ID. Next, give the application ID permission to retrieve credentials: click on the Allowed Machines tab and enter the IPs of the servers from which Talend will retrieve credentials.
Granting the Application Access to the Safe
Navigate to Policies -> Access Controls (Safes) -> select your safe and click on Members.
Click on “Add Member”, search for your application, select it, check the appropriate access rights, and add it.
Now the next step is to install the credential provider in the development environment.
Installing the Credential Provider (CP)
In order to retrieve credentials, we need to install a CyberArk module on the same box where your application (the client) is running. This delivers a Java API that your application calls; the credential provider then talks to the CyberArk vault over CyberArk’s own proprietary protocol, retrieves the credentials you need, and delivers them back to your application through the Java API.
You will need to login to the CyberArk Support Vault in order to download the Credential Provider.
Retrieve password from Talend using Java API
Last but not least, we need to build a password retrieval mechanism with Talend. Create a Talend job with a tLibraryLoad and a tJavaFlex.
Configure tLibraryLoad with the path to “JavaPasswordSDK.jar”. This makes sure that “JavaPasswordSDK.jar” is added to the classpath by Talend during compilation.
On the tJavaFlex, navigate to the Advanced settings tab and make sure you import the necessary classes for the implementation.
In the Basic settings of the tJavaFlex, the following code calls CyberArk using the Java API.
// Build the request for CyberArk's credential provider
PSDKPasswordRequest passRequest = new PSDKPasswordRequest ();
PSDKPassword password = null;
try {
    // Application ID and safe as created in the earlier steps (example values)
    passRequest.setAppID ("TalendApp");
    passRequest.setSafe ("TalendSafe");
    passRequest.setObject ("Operating System-UnixSSH-myserver.mydomain.com-root");
    passRequest.setReason ("This is a demo job for password retrieval");
    // Send the request to get the password
    password = javapasswordsdk.PasswordSDK.getPassword (passRequest);
    // Analyze the response
    System.out.println ("The password UserName is : " + password.getUserName ());
    System.out.println ("The password Address is : " + password.getAddress ());
    context.dummy_password = password.getContent ();
    System.out.println ("password retrieved from CyberArk's vault is -- context.dummy_password ==>> " + context.dummy_password);
} catch (PSDKException ex) {
    System.out.println (ex.toString ());
}
In an always-on, competitive business environment, organizations are looking to gain an edge through digital transformation. Subsequently, many companies feel a sense of urgency to transform across all areas of their enterprise—from manufacturing to business operations—in the constant pursuit of continuous innovation and process efficiency.
Data is at the heart of all these digital transformation projects. It is the critical component that helps generate smarter, improved decision-making by empowering business users to eliminate gut feelings, unclear hypotheses, and false assumptions. As a result, many organizations believe building a massive data lake is the ‘silver bullet’ for delivering real-time business insights. In fact, according to a survey by CIO review from IDG, 75 percent of business leaders believe their future success will be driven by their organization’s ability to make the most of their information assets. However, only four percent of these organizations said they are set up to successfully benefit from their information with a data-driven approach.
Is your Data Lake becoming more of a hindrance than an enabler?
The reality is that all these new initiatives and technologies come with a unique set of generated data, which creates additional complexity in the decision-making process. To cope with the growing volume and complexity of data and alleviate IT pressure, some are migrating to the cloud.
But this transition—in turn—creates other issues. For example, once data is made more broadly available via the cloud, more employees want access to that information. Growing numbers and varieties of business roles are looking to extract value from increasingly diverse data sets, faster than ever—putting pressure on IT organizations to deliver real-time data access that serves the diverse needs of business users looking to apply real-time analytics to their everyday jobs. However, it’s not just about better analytics—business users also frequently want tools that allow them to prepare, share, and manage data.
To minimize tension and friction between IT and business departments, moving raw data to one place where everybody can access it sounded like a good move. The concept of the data lake, as coined by James Dixon, envisioned a large body of raw data in a more natural state, where different users come to examine it, delve into it, or extract samples from it. However, organizations are increasingly beginning to realize that all the time and effort spent building massive data lakes has frequently made things worse due to poor data governance and management, resulting in the formation of so-called “data swamps”.
Bad data clogging up the machinery
The same way data warehouses failed to manage data analytics a decade ago, data lakes will undoubtedly become “data swamps” if companies don’t manage them in the correct way. Putting all your data in a single place won’t in and of itself solve a broader data access problem. Leaving data uncontrolled, unenriched, unqualified, and unmanaged will dramatically hamper the benefits of a data lake, as it will still only be usable by a limited number of experts with a unique set of skills.
A successful system of real-time business insights starts with a system of trust. To illustrate the negative impact of bad data and bad governance, let’s take a look at Dieselgate. The Dieselgate emissions scandal highlighted the difference between real-world and official air pollutant emissions data. In this case, the issue was not a problem of data quality, but of ethics, since some car manufacturers misled the measurement system by injecting fake data. This resulted in fines for car manufacturers exceeding tens of billions of dollars, and in consumers losing faith in the industry. After all, how can consumers trust the performance of cars now that they know the system-of-measure has been intentionally tampered with?
The takeaway in the context of an enterprise data lake is that its value will depend on the level of trust employees have in the data contained in the lake. Failing to control data accuracy and quality within the lake will create mistrust amongst employees, seed doubt about the competency of IT, and jeopardize the whole data value chain, which then negatively impacts overall company performance.
A cloud data warehouse to deliver trusted insights for the masses
Leading firms believe governed cloud data lakes represent an adequate solution to overcoming some of these more traditional data lake stumbling blocks. The following four-step approach helps modernize a cloud data warehouse while providing better insights across the entire organization.
Unite all data sources and reconcile them: Make sure the organization has the capacity to integrate a wide array of data sources, formats and sizes. Storing a wide variety of data in one place is the first step, but it’s not enough. Bridging data pipelines and reconciling them is another way to gain the capacity to manage insights. Verify the company has a cloud-enabled data management platform combining rich integration capabilities and cloud elasticity to process high data volumes at a reasonable price.
Accelerate trusted insights to the masses: Efficiently manage data with cloud data integration solutions that help prepare, profile, cleanse, and mask data while monitoring data quality over time regardless of file format and size. When coupled with cloud data warehouse capabilities, data integration can enable companies to create trusted data for access, reporting, and analytics in a fraction of the time and cost of traditional data warehouses.
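To make "mask data" concrete, here is a minimal sketch of one common masking technique, partial redaction of an identifier. The class and method names are illustrative, not Talend's API, and the block assumes Java 11+ for String.repeat:

```java
public class Masking {
    // Keeps the last `visible` characters of a value and replaces the
    // rest with '*', a typical treatment for card or account numbers.
    static String mask(String value, int visible) {
        if (value == null || value.length() <= visible) {
            return value; // nothing to hide
        }
        int hidden = value.length() - visible;
        return "*".repeat(hidden) + value.substring(hidden);
    }

    public static void main(String[] args) {
        System.out.println(mask("4111111111111111", 4)); // prints ************1111
    }
}
```

Masking of this kind lets analysts work with realistic records, joinable on their visible suffix, without ever exposing the full sensitive value downstream.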
Collaborative data governance to the rescue: The old schema of a data value chain where data is produced solely by IT in data warehouses and consumed by business users is no longer valid. Now everyone wants to create content, add context, enrich data, and share it with others. Take the example of the internet and a knowledge platform such as Wikipedia where everybody can contribute, moderate and create new entries in the encyclopedia. In the same way Wikipedia established collaborative governance, companies should instill a collaborative governance in their organization by delegating the appropriate role-based, authority or access rights to citizen data scientists, line-of-business experts, and data analysts.
Democratize data access and encourage users to be part of the Data Value Chain: Without making people accountable for what they’re doing, analyzing, and operating, there is little chance that organizations will succeed in implementing the right data strategy across business lines. Thus, you need to build a continuous Data Value Chain where business users contribute, share, and enrich the data flow in combination with a cloud data warehouse multi-cluster architecture that will accelerate data usage by load balancing data processing across diverse audiences.
In summary, think of data as the next strategic asset. Right now, it’s more like a hidden treasure at the bottom of many companies. Once modernized, shared and processed, data will reveal its true value, delivering better and faster insights to help companies get ahead of the competition.
Digital Transformation, the application of digital capabilities across the organization to uncover new monetization opportunities, is the path that any company wishing to survive in today’s world must follow. Sectors like agriculture, healthcare, banking, retail, and transportation are exploring challenges and opportunities that have come with the digital revolution. As a result of this, new business models have emerged while IT departments have become the focus of digital transformation.
Today SaaS and PaaS applications are being combined with social, mobile, web, big data, and IoT. With this, an organization’s ability to integrate legacy data from siloed on-premises applications with new data from emerging digital technologies, like cloud, has become a deciding factor for a successful digital transformation.
The reality is that many organizations still use hand-coding or ad hoc integration tools to solve immediate and specific needs, often leading to an integration nightmare. They suffer the consequences of an uncontrolled set of access points, potential risks in regulatory compliance or audit practices, an inability to scale organically with the surge in data volume, and more. Organizations need an enterprise strategy to address the changes brought on by the hybrid era of cloud and on-premises integration scenarios, and through it to unlock the true value of their data to accelerate digital transformation. A key enabling component of such a strategy is a cloud integration platform (iPaaS).
iPaaS has started to go mainstream in recent years. Research firm MarketsandMarkets predicts that the market will be worth $2,998.3 million by 2021, growing at 41.5% between 2016 and 2021.
So what exactly is an iPaaS?
iPaaS (integration platform-as-a-service), in simplest terms, is a cloud-based platform that supports any combination of on-premises, cloud data, and application integration scenarios. With an iPaaS, users can develop, deploy, execute, and govern cloud and hybrid integration flows without installing or managing any hardware or middleware. There are many benefits iPaaS offers: faster time to value, reduced TCO, accelerated time to market, bridging the skillset gap, reduced DevOps headaches, etc.
However, if your organization has integration needs or challenges, how do you pick the right iPaaS? I’ll explore a few key questions to ask when choosing the right solution.
Key areas for consideration when choosing an iPaaS
Does it support big data and data lakes?
If you are embarking on an initiative to utilize your big data, you will need a solution that can handle the volume, velocity, and variety requirements of big data integration use cases. The solution should be able to support a data lake with complex data ingestion and transformation, and work well with the Hadoop ecosystem.
Does it support broad hybrid integration scenarios?
Although SaaS and cloud apps are widely adopted across organizations, on-premises apps are not likely to go away anytime soon. Because hybrid integration scenarios are still the norm, it’s important to have a solution that supports cloud to cloud, cloud-to-ground, ground-to-ground, multi-point, and multi-cloud integrations.
Can it empower my LoB and analyst teams in the project?
In most cases, an integration project is initiated to better serve business units to enable new business models, better serve customers, make more effective marketing and sales strategies, etc. LoBs are increasingly involved in integration work and therefore, to get your projects moving forward quickly, you need a solution with an easy-to-use UI, self-service capabilities like data prep, and data governance that don’t require extensive technical training.
Will multiple teams collaborate on the integration project?
If your integration projects require multiple teams, you may want to consider SDLC (software development life cycle) capabilities. This will allow you to create separate environments for each stage of SDLC like development, QA, production, etc. So your teams can plan and execute integrations more frequently and efficiently.
How do I ensure real-time data delivery?
Business agility, real-time decision making, and customer and partner interactions depend on analytics from data delivered in real time. To make fast and timely decisions, you will want to consider a solution that delivers data anytime, with both batch and real-time streaming data processing.
What about data quality?
A critical capability that you also want in an iPaaS is built-in data quality that allows you to create clean and trusted data for analytics and decision making. You will want a solution that can handle data of different types and formats and ensure consistent quality throughout your data’s lifecycle.
Talend Cloud is Talend’s iPaaS, offering broad connectivity, native big data support, built-in data quality, enterprise SDLC support, and much more. If you are interested in jumpstarting your integration projects with Talend Cloud, sign up for a 30-day free trial.