Over the past five decades, the focus for the mainframe has been on bigger, faster and cheaper processing power. This is no surprise, considering the technology advancements that have occurred. In Syncsort’s new eBook, “50 Years of Mainframe Innovations: Observations From a Long-Time Mainframer,” Ed Hallock talks about how amazing the past 50 years have been for the mainframe, with unbelievable advances in technology and business practices.
Not only have mainframes evolved, but organizations have evolved as well. That said, organizations struggle today with supporting growth, controlling IT budgets and costs, and ensuring the security of their IT infrastructure against potential threats and attacks.
Do you have a mainframe visibility problem? If so, you probably also have mainframe security and performance problems. Here’s how to improve visibility for your mainframe.
In IT, visibility means an understanding of what is happening in your systems and infrastructure.
Visibility is important because it provides the foundation for making informed decisions about security, performance optimization and the expansion of your infrastructure. If you lack visibility, you’re shooting in the dark when it comes to managing your hardware and software.
Ideally, visibility will be continuous, meaning it is ongoing and there are no gaps in your ability to work with your systems.
Overcoming Mainframe Visibility Hurdles
The tricky thing about visibility is that there is no one-stop solution for achieving it. An effective visibility strategy requires a mix of tools and processes. And it must be tailored to your organization, of course.
Visibility is also challenging to achieve because having visibility into one part of your infrastructure does not necessarily mean the rest of it is visible, too. You may do a great job of maintaining visibility into your storage systems but have much less visibility into front-end software, for example. Or you may have maximum visibility into commodity servers but none for your mainframes.
How do you address these challenges, particularly when you are working with a mainframe? The following tools and strategies can help you to overcome common mainframe visibility challenges:
Real-time system monitoring
Monitoring tools alert you to performance and availability issues as they occur, so that you have a decent shot at addressing them before they impact operations. Mainframe monitoring tools like Syncsort Ironstream Dataset Analyzer can be integrated with Splunk to provide easy visualization and interpretation of monitoring information.
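To make the idea concrete, here is a minimal sketch of threshold-based alerting in Python. The metric feed, threshold, and polling interval are all illustrative assumptions; in practice a tool like Ironstream forwards the real metrics into Splunk for you.

```python
# Minimal sketch of threshold-based alerting on a performance metric.
# get_cpu_busy_pct() is a hypothetical stand-in for a real metrics feed.
import random
import time

CPU_ALERT_THRESHOLD = 90.0  # percent busy; tune to your workload

def get_cpu_busy_pct() -> float:
    """Hypothetical stand-in: replace with your real monitoring feed."""
    return random.uniform(50, 100)

def monitor(polls: int = 10, poll_seconds: float = 1.0) -> None:
    for _ in range(polls):
        busy = get_cpu_busy_pct()
        if busy >= CPU_ALERT_THRESHOLD:
            print(f"ALERT: CPU {busy:.1f}% busy exceeds {CPU_ALERT_THRESHOLD}%")
        time.sleep(poll_seconds)

monitor()
```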
Security and network monitoring
General-purpose mainframe monitoring tools may provide some security insights, but to maximize security visibility for the mainframe you should use a dedicated security monitoring solution. Syncsort’s ZEN Suite provides mainframe security monitoring.
Optimizing mainframe performance requires you to keep tabs on your network, too. Mainframe network monitoring is another feature of Syncsort’s ZEN Suite.
Log aggregation and analysis
Logs can enable both real-time and historical visibility into your systems. They can help you detect problems as they occur, as well as investigate issues after the fact. Because mainframe log data is so vast, analyzing mainframe logs manually is usually not feasible. Instead, log aggregation and analytics tools provide the insight you need. For guidance on leveraging mainframe logs, check out Syncsort’s whitepaper on the topic.
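As a toy illustration of what aggregation buys you, the sketch below counts log records by severity so that spikes stand out. The record format is an assumption, and real mainframe logs (such as SMF records) require dedicated tooling at this scale.

```python
# Minimal sketch of log aggregation: group records by severity.
# The space-delimited record format here is an assumption for illustration.
from collections import Counter

sample_logs = [
    "2018-03-12T10:05:01 ERROR dataset open failed",
    "2018-03-12T10:05:02 INFO job JOB123 started",
    "2018-03-12T10:05:07 ERROR dataset open failed",
    "2018-03-12T10:05:09 WARN CPU usage high",
]

def summarize(lines):
    """Count records by severity so error spikes stand out."""
    return Counter(line.split()[1] for line in lines)

print(summarize(sample_logs))  # Counter({'ERROR': 2, 'INFO': 1, 'WARN': 1})
```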
Organizational context
In addition to monitoring and analyzing your mainframe itself, full visibility requires taking into account the organizational context of the mainframe. Understand who is responsible for maintaining the mainframe, who is on-call when something goes wrong and who has access to it. Having this information on hand ensures that you can troubleshoot issues as quickly as they arise.
Again, your mainframe visibility strategy should be tailored to your infrastructure and organization. In general, however, the tips above should help you achieve the continuous visibility that you need to keep your mainframe lean, mean and running well.
The big theme in this year’s Strata Data Conference in San Jose was how Data Science can be a business game changer. In his keynote remarks, Eric Colson from Stitch Fix pointed out that differentiation has always been key to doing business. Data Science now offers a new opportunity for organizations to set themselves apart.
New Data Sciences Will Provide New Business Insights
If a data scientist can come up with an insight that no one else has thought of yet, it could give the organization a considerable advantage over the competition. Eric suggested that the Data Science function should report directly to the CEO. This point was echoed by Mike Olson, Chief Strategy Officer at Cloudera. In his Executive Briefing Session on Machine Learning, Mike quoted a Harvard Business Review study reporting that less than 50% of all structured data and less than 1% of unstructured data collected by an organization are currently used to make business decisions. Now that the technology is available to analyze all this data, it is important to have the people gaining those insights communicate directly with the business decision makers to effect real change.
Execs at Strata Point to Transformative Data Science Use Cases that Drive Revenue!
Some examples where Data Science has already transformed business came up in various sessions and keynotes. Mike Olson mentioned the State of Kentucky, whose initiative to analyze sensor and weather data to do a better job of snow removal has been so successful that other states have bought the solution Kentucky developed. Navistar, a transportation company, has used predictive maintenance models to keep its vehicles on the road and was able to increase profits in a very competitive market by transforming its business: it can sell up-time instead of just selling vehicles. It has also identified a new revenue stream by selling its sensors.
Data Science Challenges for Humans and Machines
While there have been great advances in Data Science, there were also plenty of reminders about the remaining challenges. Computers are not yet as good at learning as humans, which can lead to disconcerting results. Hilary Mason from Cloudera Fast Forward Labs showed the results of some web searches that combined her information with that of an actress who shares her name (and little else). Janelle Shane maintains a blog of neural network training mishaps and shared some examples of misidentified images in her keynote. Humans also take their share of the blame. In his hilarious keynote, Seth Stephens-Davidowitz, author of ‘Everybody Lies’, compared how people completed the sentence “My husband is…” in two (very) different contexts: Facebook posts versus private Google searches.
Even for businesses that rely on data sets that are less “out in the wild”, such as sensor data, or customer transaction data, there are big challenges. Data needs to be fresh; models become obsolete after a certain amount of time and need to be retrained on new data. The ‘echo chamber’ effect is also hard to counter: if your data is not diversified enough, you may be drawing incorrect conclusions.
Collecting all your data in one place is still a challenge. Gwen Shapira from Confluent delivered a session on the evolution of ETL to a standing-room only audience. She gave an example of a hotel use case for processing streaming customer data to deliver relevant promotions to elite members. The traditional approach of enriching the data with a database lookup wouldn’t scale, since there were millions of weblog events coming in. Her proposal was to cache the customer data and use a Change Data Capture solution to keep it fresh.
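A minimal sketch of that pattern follows, with event shapes invented for illustration: the enrichment step reads from an in-memory cache instead of a database, and change-data-capture (CDC) events keep the cache current.

```python
# Minimal sketch of cache-based stream enrichment kept fresh via CDC,
# instead of a per-event database lookup that wouldn't scale.
# The event and change-record shapes are assumptions for illustration.

customer_cache = {"c1": {"tier": "elite"}, "c2": {"tier": "standard"}}

def apply_cdc(change):
    """A CDC consumer would call this for each database change event."""
    if change["op"] == "delete":
        customer_cache.pop(change["id"], None)
    else:  # insert or update
        customer_cache[change["id"]] = change["row"]

def enrich(event):
    """Join a weblog event with cached customer attributes -- no DB lookup."""
    customer = customer_cache.get(event["customer_id"], {})
    return {**event, "tier": customer.get("tier", "unknown")}

apply_cdc({"op": "update", "id": "c2", "row": {"tier": "elite"}})
print(enrich({"customer_id": "c2", "page": "/offers"}))
# {'customer_id': 'c2', 'page': '/offers', 'tier': 'elite'}
```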
If an industry is regulated, the data needs to be governed. Data lineage and governance was another challenge discussed in several sessions, such as the detailed GDPR executive briefing by Cloudera and the ODPi initiative presented by ING and the Linux Foundation. Syncsort CTO Dr. Tendü Yoğurtçu likened governance to having a farm-to-table view of your data.
Data Science Agility: Problem Infatuation versus Problem Solving
In order to take advantage of the business transformation promises of Data Science, it’s important to be agile. The quicker Data Scientists can get their hands on clean, trusted data, the quicker they can begin asking innovative questions. In his briefing, Mike Olson quoted research indicating that data scientists still spend 80% of their time on data preparation instead of analysis.
There is no one-size-fits-all solution. Tobias Ternstrom from Microsoft suggested in his keynote that when selecting a solution, it’s important to focus on how to bring value to your business and run an objective proof of concept, instead of getting attached to a solution that sounds exciting. This ‘fall in love with the problem, not the solution’ sentiment was echoed by Ted Malaska, who shared his experiences with technology selection and cautioned against letting your passions mislead you.
Having an ETL solution that understands the Big Data ecosystem as well as traditional enterprise business models and data sources can be a great asset in eliminating data silos, ensuring the data is clean and up-to-date, and providing governance. Syncsort DMX-h makes it easy to ingest and integrate data from all sources, including Mainframe, IBM i, relational databases, and streams such as Kafka, and to populate the data lake on any Big Data ecosystem. DMX-h supports all Hadoop distributions and works the same way on batch and streaming data, on premise and in the cloud, on MapReduce and Spark. In her interview with theCUBE, recapped below, Dr. Tendü Yoğurtçu discussed the trends complicating data governance, and the new Change Data Capture, data quality, and governance capabilities Syncsort has delivered to address these evolving needs, helping customers meet these challenges and make data part of their core strategy.
Dr. Tendu Yogurtcu, Syncsort | Big Data SV 2018 - YouTube
During Strata Data Conference 2018 in San Jose, California, Syncsort CTO Dr. Tendü Yoğurtçu sat down with theCUBE co-hosts George Gilbert and Lisa Martin at Big Data SV 2018. In the recorded interview, they discuss three key industry trends – Data Science, streaming and the Cloud – and how all of them create data governance challenges.
Watch the video to learn more about what organizations are doing as they work to make data their core strategy, and how Syncsort is working to help them.
Data Science Trends Complicate Data Governance
First, Tendü talks about how organizations are focused on preparing data for deep learning and artificial intelligence use cases. She also addresses how the data must be trusted for use with these technologies, heightening the importance of data integration and data quality to prepare, cleanse and match data. Tendü also points to the advantage Syncsort has in bringing domain expertise to bear: infusing machine learning algorithms and connecting data profiling with data quality capabilities, an approach that can help recommend business rules and automate mandated tasks.
Ensuring Data Governance Doesn’t Get “Cloudy”
Tendü explains that many organizations now have multiple workloads in hybrid clouds, creating governance challenges as well as necessitating more scoping and planning for the Cloud. She points out that Data Governance is the “umbrella focus for everything we are doing at Syncsort,” because these other trends and developing next-generation analytics environments require good data governance. The big driver is regulatory compliance, such as GDPR, which is “on the mind of every C-level exec” – not just for European companies – since most companies have European data sources in their environments. Security and availability of the data are key, and another critical aspect is delivering high-quality data to data scientists.
Tendü talks about the importance of Syncsort’s design once, deploy anywhere strategy to enable organizations to run the same applications, without requiring any changes, across all their environments.
Data Governance Must Swim Up and Down Stream
Tendü also discussed another macro trend – streaming with connected devices. So much data is being generated that there is a growing need to process and stream data on the edge. In addition, Kafka now serves as a streaming data bus, consuming data and publishing it to make it available for applications and analytics throughout the data pipeline. Syncsort helps meet the resulting data governance challenges by providing CDC and real-time data replication capabilities.
For more on how these trends in Data Science can be game-changing for IT organizations, be sure to check out our Strata Data Conference recap tomorrow!
Simply backing up your data, however, is not necessarily enough to achieve your availability goals. There is a reason why 75 percent of people who back up data are not able to restore all of it following a failure.
If your backup strategy is designed simply to back up data for backups’ own sake, rather than advancing a broader high availability agenda, you may as well not be doing backups at all.
Data backups won’t help you to avoid serious disruptions when something unexpected happens unless they’re complemented by the following considerations and strategies for achieving high availability.
Data Restoration Process
To minimize downtime and maximize availability during a crisis, you need to be able to restore data quickly from backups to production systems.
To do this, you must have a restoration plan in place before disaster strikes. You don’t want to wait until your business operations have been disrupted to start figuring out how you move data from backup locations to production systems.
This is why you should develop specific data restoration plans ahead of time. Although you can’t predict every variable that might be at play during a data recovery scenario, you can create general procedures that your team will follow when moving data from backups.
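One way to make such a plan concrete is to codify the runbook as ordered, testable steps, as in the sketch below. The step names and their contents are placeholders, since every environment’s procedure will differ.

```python
# Minimal sketch of a restoration runbook codified as ordered steps.
# Step names and actions are placeholders for your environment's procedure.

def stop_inbound_traffic():
    print("1. drained traffic from production")

def copy_backup_to_production():
    print("2. copied latest backup to production systems")

def verify_restored_data():
    print("3. verified integrity of restored data")

def resume_traffic():
    print("4. resumed production traffic")

RESTORE_PLAN = [
    stop_inbound_traffic,
    copy_backup_to_production,
    verify_restored_data,
    resume_traffic,
]

def run_restore():
    for step in RESTORE_PLAN:
        step()  # in practice: abort and escalate if any step fails

run_restore()
```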
You can also have data migration and transformation tools (like those in Syncsort’s Big Data solutions suite) preinstalled and preconfigured, if you’ll need them as part of the data restoration.
Automated Failover
Even better than having to restore data from a backup location is not having to restore it at all, because your workloads automatically move from one host environment to another in the event that the first host environment fails.
This type of functionality, called automated failover, is delivered by solutions like Trader’s high availability platform for IBM i systems.
Distributed Data Replication
Simply backing up your data somewhere is often not enough to achieve high availability. You must back it up in a way that maximizes its chances of remaining available in the event of disruption to your infrastructure.
One way to do this is to replicate your data automatically across a distributed environment of servers or storage locations. With automatic, distributed data replication, your data always exists in multiple locations at once. And because those locations are spread out — in the sense of including either multiple servers within your data center or, better, multiple data centers in different geographic locations — the data will remain intact even if some storage locations fail.
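The sketch below shows the core idea: every write goes to several locations, and success is declared only after all copies verify. Local directories stand in for what would really be separate servers or data centers.

```python
# Minimal sketch of distributed replication: write each object to multiple
# locations and confirm every copy matches before declaring success.
# Local directories stand in for separate servers or data centers.
import hashlib
import pathlib

LOCATIONS = [pathlib.Path("replica_a"), pathlib.Path("replica_b"),
             pathlib.Path("replica_c")]

def replicate(name: str, data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    for loc in LOCATIONS:
        loc.mkdir(exist_ok=True)
        (loc / name).write_bytes(data)
    # Verify every copy before declaring the write durable.
    assert all(hashlib.sha256((loc / name).read_bytes()).hexdigest() == digest
               for loc in LOCATIONS)
    return digest

replicate("orders.csv", b"id,amount\n1,42\n")
```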
3-2-1 Backup Strategy
Another handy way of maximizing data availability is to follow what is known as the 3-2-1 data backup rule. According to this rule, you should:
Have at least three distinct copies of your data at all times.
Back up your data to at least two different types of storage (such as an on-premise server and a cloud environment).
Keep at least one off-site copy of your data.
These procedures help to ensure that if one type of data storage fails, or your local storage is wiped out, your data will still be available.
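Compliance with the rule is simple enough to check automatically. Here is a minimal sketch; the backup inventory format is an assumption for illustration.

```python
# Minimal sketch: check a backup inventory against the 3-2-1 rule.
# The inventory format is an assumption for illustration.

copies = [
    {"location": "on-site",  "media": "disk"},
    {"location": "on-site",  "media": "tape"},
    {"location": "off-site", "media": "cloud"},
]

def satisfies_3_2_1(copies):
    enough_copies = len(copies) >= 3                          # 3 copies
    two_media     = len({c["media"] for c in copies}) >= 2    # 2 media types
    one_offsite   = any(c["location"] == "off-site" for c in copies)  # 1 off-site
    return enough_copies and two_media and one_offsite

print(satisfies_3_2_1(copies))  # True
```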
The 3-2-1 backup strategy may not be necessary if you already do automatic data replication across distributed systems. But if you lack the resources for that type of solution, the 3-2-1 approach is an easy and effective way to maximize data availability.
To learn even more about the state of disaster recovery preparedness in organizations today, read Syncsort’s full “State of Resilience“ report.
The importance of forecasting in Capacity Management cannot be overstated. Making statements or predictions about future events requires careful analysis of all available information, and those events can be anything from the state of resource consumption, to service levels, to computing environment changes at future points in time.
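As a simple illustration of the idea, the sketch below fits a linear trend to historical resource consumption and projects it forward. The data is made up, and real capacity planning must also account for seasonality, business events and confidence intervals.

```python
# Minimal forecasting sketch: fit a linear trend to past utilization
# and project it forward. The figures are invented for illustration.
import numpy as np

months   = np.arange(1, 13)                                   # past 12 months
cpu_util = 40 + 2.5 * months + np.random.normal(0, 1.5, 12)   # % busy, noisy

slope, intercept = np.polyfit(months, cpu_util, 1)
month_18 = slope * 18 + intercept                             # 6 months ahead
print(f"Projected CPU utilization in month 18: {month_18:.1f}%")
```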
In the last blog, “Mainframes Continue to Reign in 2018: Key Trends that Support Big Iron’s Sovereign Rule,” I addressed the results of our recently completed annual survey of IT and data analytics professionals to identify trends, challenges and opportunities faced by enterprises investing in the mainframe going into 2018. I discussed two very important and related findings from the survey – that the mainframe remains strategic to businesses and cost control is a priority. Another key finding was focused on the important role mainframe data plays in today’s advanced analytics. Let’s take a look at what we discovered.
The Mainframe’s Role as a Host for Revenue-Generating Services
As previously discussed, the IBM z/OS mainframe remains an important focus in many large organizations. The majority of respondents to our survey reported that the mainframe serves as the main hub for business-critical applications by providing high-volume transaction and database processing. A related and extremely interesting finding – a high number of respondents indicated they’ll use the mainframe to run revenue-generating services over the next 12 months. This is not only another clear indication that the mainframe remains strategic to the business, but also points to the high value of mainframe applications and data.
Integrating Mainframe Data with Enterprise-Wide Data for Next-Generation Analytics
Another key reason mainframe data is so important is the growing trend to enable enterprise-wide data views. 44% of survey respondents chose integrating mainframe data with modern analytics tools as a top organizational priority, and 23% said they already use Big Data tools (like Splunk and Hadoop) to monitor mainframe and other enterprise data together in a single dashboard. This trend to connect “Big Iron to Big Data” makes perfect sense – organizations need to access and integrate mainframe data with other enterprise-wide data sources to get a 360-degree view of data that drives better decision making, particularly in support of security and compliance initiatives.
This year’s survey asked what is most important for security on the mainframe. The answer? Unlocking and analyzing SMF data is the key!
Tracking Data Movement is a Priority
In fact, with the need to integrate so many diverse enterprise-wide data sources, mainframe organizations are more challenged than ever to track and understand data movement for good data governance, including data quality and compliance. 53% of organizations said they lack full visibility into their data movement, compared with 61% last year. While the downward trend is encouraging, this remains an area of risk that must be addressed to ensure security and compliance initiatives are met.
This year’s survey revealed that more than half of organizations don’t have a clear view of what data is moved, by whom, when and where.
The takeaway is clear – the mainframe is a gold mine of critical data, and organizations increasingly want to make that data, along with other key enterprise data, available to next-generation analytics tools to extract its true value for key business initiatives, gain competitive advantage and yes, even generate additional revenue!
For the past few years, Syncsort DMX-h has been helping large enterprises populate their data lakes by making it easy to access legacy data coming from the Mainframe or Enterprise Data Warehouse platforms such as Teradata or Oracle and integrate it with data in Hive, HDFS, Kafka, etc.
As a growing number of enterprises successfully deploy their newly populated data lakes, we have started to hear about the next pain point: Data Lineage and Governance. In Syncsort’s recently published Big Data survey, nearly 60% of respondents who are testing or in production with Hadoop or Spark identified “including the data lake in data governance initiatives and meeting regulatory compliance mandates” as a significant challenge.
Tracking Data Movement
Data lineage tracks, at a field level, data origination (source), what happens to it (transformations), and where it moves to over time (target). Data lineage also simplifies tracing errors back to their sources in a data analytics process.
Enterprises must track data movement throughout the organization for many types of use cases including regulatory reporting, security, and auditing. This might be part of a larger data governance practice in the organization. The challenges for the organizations are: addressing the volume and variety of their data sources (e.g., mainframe, DBMS, files, external); tracking and understanding the vast movement and transformation of data; and being able to “consume” this understanding presented via a graphical user interface, or integrated with tools or technologies already present within the organization.
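In its simplest form, a field-level lineage record captures exactly those three things: source, transformation, and target. The sketch below shows one possible shape; the field names and systems are invented for illustration.

```python
# Minimal sketch of a field-level lineage record: where a field came from,
# what transformed it, and where it landed. Names are illustrative only.
from dataclasses import dataclass

@dataclass
class LineageRecord:
    field: str           # e.g., "acct_balance"
    source: str          # origin system/dataset
    transformation: str  # what happened to it along the way
    target: str          # where it moved to

record = LineageRecord(
    field="acct_balance",
    source="mainframe:VSAM.ACCOUNTS",
    transformation="EBCDIC to ASCII; packed decimal to decimal",
    target="hive:finance.accounts",
)
print(record)
```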
Cloudera Navigator is Cloudera’s Data Governance solution for Hadoop. It automatically collects audit logs from across the entire platform and maintains a full history, with a unified, searchable audit dashboard for simple, point-in-time visibility.
Syncsort has partnered with Cloudera to extend Navigator’s reach beyond the Hadoop cluster. Not only does Syncsort DMX-h access data from the Mainframe, RDBMS, or other legacy sources and transform it into Hadoop-compatible formats, but now, with new, extended integration with Cloudera Navigator, it also makes the lineage information accessible to Navigator.
DMX-h is also used for data integration within the cluster: ETL jobs created in the DMX-h point-and-click interface can be run on MapReduce, Spark, or stand-alone Windows/Linux/Unix systems. And the best part is that now, all the details of that data processing can also be published to Navigator. This means that regardless of whether the data movement and transformation process ran inside Hadoop, outside of it, or some of both, Navigator shows the data lineage from beginning to end.
Syncsort DMX-h makes its lineage information available through an API that can be integrated into different Data Lineage and Governance solutions. When the first of our joint customers with Cloudera opted to use Navigator as their Data Governance solution, our engineering team worked very closely with our partners at Cloudera to implement the deep integration of DMX-h lineage with Navigator. The joint customers were involved in every step of the development process to provide feedback and ensure the integration would meet their needs.
For enterprises that are not using Cloudera Navigator, DMX-h makes the lineage information available through a REST-API that can be used to integrate with different governance solutions.
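Consuming such an API typically looks like the hedged sketch below. The endpoint URL, parameters, and response shape are hypothetical stand-ins, so consult the product documentation for the actual interface.

```python
# Hedged sketch of pulling lineage metadata over a REST API.
# BASE_URL and the /lineage endpoint are hypothetical, not the real DMX-h API.
import requests

BASE_URL = "https://dmx-server.example.com/api"  # hypothetical host

def fetch_lineage(job_name: str) -> dict:
    resp = requests.get(f"{BASE_URL}/lineage",
                        params={"job": job_name}, timeout=30)
    resp.raise_for_status()
    return resp.json()  # feed this into your governance tool of choice
```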
What is big data, really? Despite what the term big data implies, the definition of big data is not actually about the size of your data. It’s how you use the data.
When it comes to data, size is always relative.
True, the number of data sources and the amount of information that can be stored and analyzed have increased significantly over the past several years. This increase coincided with the entry of the term big data into the popular lexicon.
Yet it’s not as though large data sets didn’t exist before we started talking about big data. What we call big data today may involve more data than the data sets and workloads of the past, but it may not. Again, it’s all relative.
What Really Defines Big Data
If you can’t distinguish big data from traditional data sets in terms of size, then what does define big data?
The answer lies in how the data is used. The processes, tools, goals, and strategies that are deployed when working with big data are what set big data apart from traditional data.
Specifically, big data is defined by the following features:
Highly scalable analytics processes
Big data platforms like Hadoop and Spark have become popular due in large part to their ability to scale. The amount of data that they can analyze without a degradation in performance is virtually unlimited. This is what sets these big data tools apart from traditional methods of investigating data, such as basic SQL queries, which don’t scale unless you integrate them into a larger analytics framework.
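For instance, the same aggregation you might write as a SQL query can be expressed on a Spark DataFrame, which the cluster partitions and scales as the data grows. The path and column names below are assumptions.

```python
# Minimal PySpark sketch: a familiar aggregation expressed on a DataFrame
# that Spark can partition across a cluster. Path/columns are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("scalable-agg").getOrCreate()

events = spark.read.json("hdfs:///data/events")   # scales past one machine
daily = (events
         .groupBy("event_date")
         .agg(F.count("*").alias("events"),
              F.countDistinct("user_id").alias("users")))
daily.show()
```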
Flexible data management
Big data is flexible data. Whereas in the past all of your data might have been stored in a specific type of database using consistent data structures, today’s datasets come in many forms. Effective big data analytics strategies are designed to be highly flexible and to handle any type of data that is thrown at them. Fast data transformation is an essential part of big data, as is the ability to work with unstructured data.
Real-time analytics
Traditionally, organizations could afford to wait for data analytics results. In the world of big data, however, maximizing value means gaining insights in real time. After all, when you are using big data for tasks like fraud detection, results received after the fact are of little value.
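A toy sketch of the point: flag a transaction the moment it deviates sharply from an account’s recent history, rather than discovering it in a nightly batch report. The window size and threshold are made-up values.

```python
# Toy sketch of real-time fraud screening: flag a transaction the moment
# it far exceeds the account's recent average. Thresholds are made up.
from collections import defaultdict, deque

recent = defaultdict(lambda: deque(maxlen=20))  # last 20 amounts per account

def check(account: str, amount: float) -> bool:
    history = recent[account]
    suspicious = bool(history) and amount > 5 * (sum(history) / len(history))
    history.append(amount)
    return suspicious

for amt in [20, 35, 25, 30, 900]:
    if check("acct-1", amt):
        print(f"flag for review: {amt}")   # flags 900 immediately
```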
Machine learning applications
Machine learning is not the only way to leverage big data. It is, however, an increasingly important application in the big data world. Machine learning use cases set big data apart from traditional data, which was very rarely used to power machine learning.
Scale-out storage systems
Traditionally, data was stored on conventional tape and disk drives. Today, big data often relies on software-defined scale-out storage systems that abstract data away from the underlying storage hardware. Of course, not all big data is stored on modern storage platforms, which is why the ability to move data quickly between traditional storage and next-generation storage remains important for big data applications.
Attention to data quality
Data quality is important in any context. With the increasing complexity of big data, however, has come greater attention to the importance of ensuring data quality within complex data sets and analytics operations. Attention to data quality is a core feature of any effective big data workflow.
If you’re not striving to achieve these features in your big data, you’re not making the most of your data.
Cloud security is not a new topic, but it remains a very relevant one. Keeping cloud-based data secure is something all businesses should think about to avoid being the victim of the next cloud security breach.
In the early days of the cloud — which is to say, eight or ten years ago — organizations often assumed that the cloud was inherently less safe than on-premise computing. If you moved data to the cloud, it was because you decided that convenience and agility outweighed the security risks.
Today, that is no longer the case. You can have your cloud and eat it, too — meaning that you can store data in the cloud without compromising security.
However, cloud-based data won’t secure itself. Indeed, as Syncsort’s “State of Resilience” report found, cloud security remains the top security challenge for IT professionals.
How do you respond to the threat? By focusing on the following areas and opportunities, you can improve the security of your cloud-based data.
Granular Access Controls
One of the most common reasons for moving data to the cloud is to make it accessible from anywhere.
That’s an important advantage. However, cloud-based data that can be accessed from anywhere should not also be accessible to everyone.
To keep data safe, implement granular access controls. Particular data sets should be accessible only to the particular people who need to access them.
In other words, don’t treat your cloud data as a company-wide networked data share, unless everyone in your company should have access to the data.
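At its core, granular access control is an allow-list per data set with deny-by-default semantics, as in the minimal sketch below. Real deployments use the cloud provider’s IAM policies rather than application code, and the names here are illustrative.

```python
# Minimal sketch of granular access control: each data set lists exactly
# who may read it, and everything else is denied by default.
ACCESS_POLICY = {
    "finance/payroll.csv": {"alice", "cfo"},
    "marketing/leads.csv": {"bob", "alice"},
}

def can_read(user: str, dataset: str) -> bool:
    return user in ACCESS_POLICY.get(dataset, set())  # default deny

print(can_read("bob", "finance/payroll.csv"))  # False -- not on the list
print(can_read("alice", "marketing/leads.csv"))  # True
```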
Cloud Data Backups
When your data lives in the cloud, you may be tempted to assume that it can never disappear.
In reality, although cloud-based data is often more resilient to disruption than data that exists in a single on-premise data center, cloud data can and does become subject to theft, deletion and other forms of loss.
There’s an easy solution: back up your cloud data either to another cloud or to an on-premise location.
Encryption
Your data in the cloud may be secured via restrictions on access control, but there is no reason you shouldn’t encrypt it, too. Encryption provides another layer of protection to keep the data secure in the event that it is stolen.
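For example, a minimal client-side encryption sketch using the symmetric Fernet scheme from the third-party cryptography package might look like this; key management, the genuinely hard part, is out of scope here.

```python
# Minimal sketch: encrypt data before it leaves for the cloud, using the
# Fernet scheme from the "cryptography" package (pip install cryptography).
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # store this in a real key manager, never with the data
f = Fernet(key)

ciphertext = f.encrypt(b"account=12345, balance=9000")
# ...upload ciphertext to the cloud; a stolen copy is useless without the key
print(f.decrypt(ciphertext))  # b'account=12345, balance=9000'
```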
Multiple Cloud Regions
Most major cloud providers now offer you the ability to choose which geographic regions host your data. This choice can help to meet compliance and data sovereignty requirements, such as those associated with the GDPR. This is the most common motivation for users who choose specific cloud regions for their data.
However, cloud regions can also improve data reliability. If you spread your data between multiple regions — or, better, keep multiple copies of your data in different regions — you have greater assurance that your data will remain available in the event that one region is disrupted.
Know Your Cloud Provider
All cloud storage services may seem to be more or less the same. They let you upload data onto infrastructure that is managed by someone else, and download it when you need it.
However, the way in which cloud providers secure and manage your data can vary widely. When you’re choosing a cloud host, consider more than price. Does the provider have a history of security issues? Do they explain in detail what they do to keep data safe? The answers to these questions can help you to choose a cloud provider that delivers the best security for your data.
Download the TDWI Checklist Report, which lays out seven key considerations for organizations that are trying to decide what, if any, role the cloud should play in their data protection strategy.