Deep-Data-Mining is pleased to introduce the new book, Principles of Database Management – The Practical Guide to Storing, Managing and Analyzing Big and Small Data, by Lemahieu W., vanden Broucke S., and Baesens B. (ISBN: 9781107186125). The following is an interview with the authors about the book.




Why did you decide to write your book Principles of Database Management?

This book is the result of having taught an undergraduate database management class and a postgraduate advanced database management class for more than ten years at KU Leuven (Belgium).  Throughout these years, we found no textbook that covers the material in a comprehensive way without flooding the reader with theoretical detail and losing focus.  Hence, after teaming up, we decided to write a book ourselves.  This work aims to offer a complete and practical guide covering all the governing principles of database management, including:
  • End-to-end coverage, starting with legacy technologies and moving to emerging trends such as Big Data, NoSQL databases, analytics and data governance
  • A unique perspective on how lessons learnt from past data management remain relevant in today's technology setting (e.g., navigational access and its perils in CODASYL and XML/OO databases)
  • A critical reflection and accompanying risk management considerations when implementing the technologies covered, based on our own experiences from data and analytics projects with industry partners in a variety of sectors, from banking to retail and from government to the cultural sector
  • A solid balance between theory and practice, including various exercises, industry examples and case studies originating from a diversified and complementary mix of business practice, scientific research and academic teaching experience


We hear a lot of companies complaining about bad data quality these days.  How can database management help address this?
First of all, data quality (DQ) is often defined as ‘fitness for use,’ which implies the relative nature of the concept.  Data of acceptable quality in one decision context may be perceived to be of poor quality in another decision context, even by the same business user.  For instance, the extent to which data is required to be complete for accounting tasks may not be required for analytical sales prediction tasks.  Database management can contribute in various ways to improving data quality. 
A good conceptual data model capturing the business requirements as accurately as possible is the start of everything.  As discussed in the book, both EER and UML can be used for this purpose.  It is important to also list the semantic shortcomings of the conceptual models developed such that they can be followed up during application development. 
A next important activity concerns metadata management.  Just like raw data, metadata is data that needs to be properly modeled, stored and managed.  Hence, the concepts of data modeling should also be applied to metadata in a transparent way.  In a DBMS approach, metadata is stored in a catalog, sometimes also called a data dictionary or data repository, which constitutes the heart of the database system.
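As a concrete illustration, in a relational DBMS such as Oracle the catalog is itself queryable with SQL through data dictionary views; a minimal sketch (the table name CUSTOMER is a hypothetical example):

-- The catalog is queryable metadata: user_tables and user_tab_columns are
-- standard Oracle data dictionary views describing the current user's tables.
select table_name, num_rows, last_analyzed
from   user_tables;

select column_name, data_type, nullable
from   user_tab_columns
where  table_name = 'CUSTOMER';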
Finally, to manage and safeguard data quality, a data governance culture should be put in place assigning clear roles and responsibilities. The ultimate aim of data governance is to set up a company-wide controlled and supported approach towards data quality, accompanied by data quality management processes. The core idea is to manage data as an asset rather than a liability, and adopt a proactive attitude towards data quality problems. To succeed, it should be a key element of a company’s corporate governance and supported by senior management.

Should all companies start to invest in NoSQL databases?

The explosion in popularity of NoSQL databases should be put in perspective, considering their limitations.  Many NoSQL implementations have yet to prove their true worth in the field (most are very young and still in development).  Most implementations sacrifice ACID (atomicity, consistency, isolation and durability) concerns in favor of being eventually consistent, and the lack of relational support makes expressing some queries or aggregations particularly difficult, with map-reduce interfaces being offered as a possible, but harder to learn and use, alternative.  Combined with the fact that RDBMSs do provide strong support for transactionality, durability and manageability, quite a few early adopters of NoSQL were confronted with some sour lessons.  It would be an over-simplification to reduce the choice between RDBMSs and NoSQL databases to a choice between consistency and integrity on the one hand, and scalability and flexibility on the other.  The market of NoSQL systems is far too diverse for that.  Still, this trade-off will often come into play when deciding on taking the NoSQL route.

We see many NoSQL vendors focusing again on robustness and durability.  We also observe traditional RDBMS vendors implementing features that let you build schema-free, scalable data stores inside a traditional RDBMS, capable of storing nested, semi-structured documents, as this seems to remain the true selling point of most NoSQL databases, especially those in the document store category.  Expect the future trend to continue towards adoption of such "blended systems", except for use cases that require specialized, niche database management systems.  In these settings, the NoSQL movement has rightly taught users that the one-size-fits-all mentality of relational systems is no longer applicable and should be replaced by finding the right tool for the job.  For instance, graph databases have emerged as "hyper-relational" databases, which make relationships first-class citizens next to the records themselves, rather than doing away with them altogether.
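As an illustration of such a blended approach, recent Oracle releases (12c and later) can store and query schema-free JSON documents inside an ordinary relational table; a minimal sketch with hypothetical table and column names:

-- A schema-free document store inside an RDBMS: the doc column holds JSON
-- whose structure may vary from row to row, yet it remains queryable with SQL.
create table customer_docs (
  id  number primary key,
  doc clob constraint doc_is_json check (doc is json)
);

insert into customer_docs values (1, '{"name": "Bart", "city": "Leuven"}');

select json_value(doc, '$.city') as city
from   customer_docs;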

Many companies are investing in data lakes these days.  What is the difference with a data warehouse?

Much more recent than data warehouses, the data lake concept became known as part of the big data and analytics trend.  Although both data warehouses and data lakes are essentially data repositories, there are some clear differences as listed in the table below.


                 Data Warehouse            Data Lake
Data             Structured                Often unstructured
Processing       Schema-on-write           Schema-on-read
Storage          Expensive                 Low cost
Transformation   Before entering the DW    Before analysis
Agility          Low                       High
Security         Mature                    Maturing
Users            Decision makers           Data scientists

A key distinguishing property of a data lake is that it stores raw data in its native format, which could be structured, unstructured or semi-structured.  This makes data lakes fit for more exotic and ‘bulk’ data types that we generally do not find in data warehouses, such as social media feeds, clickstreams, server logs, sensor data, etc.  A data lake collects data emanating from operational sources ‘as is’, often without knowing upfront which analyses will be performed on it, or even whether the data will ever be involved in analysis at all.  For this reason, either no or only very limited transformations (formatting, cleansing, …) are performed on the data before it enters the data lake.  Consequently, when the data is tapped from the data lake to be analyzed, quite a bit of processing will typically be required before it is fit for analysis.  The data schema definitions are only determined when the data is read (schema-on-read) instead of when the data is loaded (schema-on-write) as is the case for a data warehouse.  Storage costs for data lakes are also relatively low because most of the implementations are open-source solutions that can be easily installed on low-cost commodity hardware.  Since a data warehouse assumes a predefined structure, it is less agile compared to a data lake which has no structure.  Also, data warehouses have been around for quite some time already, which automatically implies that their security facilities are more mature.  Finally, in terms of users, a data warehouse is targeted towards decision makers at middle and top management level, whereas a data lake requires a data scientist, which is a more specialized profile in terms of data handling and analysis.
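To make the schema-on-write versus schema-on-read distinction concrete, here is a minimal Oracle-flavored sketch (table, directory and file names are hypothetical; a real data lake would typically live on a distributed file system rather than inside an RDBMS):

-- Schema-on-write (data warehouse style): the structure is fixed and the data
-- is transformed before it is loaded.
create table dw_sales (
  sale_id   number,
  sale_date date,
  amount    number(10,2)
);

-- Schema-on-read (data lake style): the raw file stays untouched and a
-- structure is only imposed at query time, here via an external table.
-- raw_dir is an existing Oracle directory object pointing at the file location.
create table raw_sales_ext (
  sale_id   varchar2(20),
  sale_date varchar2(20),
  amount    varchar2(20)
)
organization external (
  type oracle_loader
  default directory raw_dir
  access parameters (
    records delimited by newline
    fields terminated by ','
  )
  location ('sales.csv')
);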

How do object-oriented DBMSs (OODBMSs) compare to relational DBMSs (RDBMSs)? Why haven't they been more successful?

OODBMSs offer several advantages.  First, they store complex objects and relationships in a transparent way.  The identity-based approach allows for improved performance when performing complex queries involving multiple interrelated objects, avoiding expensive joins.  By using the same data model as the programming language to develop database applications, the impedance mismatch problem is no longer an issue.  In addition, developers can be more productive, as they are confronted with only a single language and data model.
Still, the success of OODBMSs has been limited to niche applications characterized by complex, nested data structures where an identity-based, instead of a value-based, method of working pays off.  An example is the processing of scientific data sets by CERN in Switzerland, where data access follows predictable patterns.  The widespread use and performance of RDBMSs, however, proved hard to displace: the (ad hoc) query formulation and optimization facilities of OODBMSs are often inferior to those of relational databases, which all adopt SQL as their primary database language combined with a powerful query optimizer.  Compared to RDBMSs, OODBMSs are also less mature in terms of robustness, security, scalability and fault tolerance.  Nor do they provide a transparent implementation of the three-layer database architecture; in particular, most OODBMSs provide no support for defining external database models, such as views in the relational model.
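By way of contrast, defining such an external model in a relational system takes a single statement; a minimal sketch with hypothetical table and column names:

-- A view as an external model: it exposes only a subset of the columns of the
-- underlying employee table to a particular group of users.
create view v_employee_public as
select employee_id, emp_name, department
from   employee;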

Why are so many companies still struggling with database legacy instead of investing in newer technologies?

Many firms still struggle with legacy databases due to historical implementations and limited IT budgets.  First, knowing the basic characteristics of these legacy systems is essential for maintaining the corresponding database applications and for a potential migration to modern DBMSs.  Second, understanding the basics of these legacy models contributes to a better understanding of the semantic richness of newer database technologies.  Finally, the concepts of procedural DML and navigational access, originally introduced by these legacy models, have also been adopted by more recent databases such as OODBMSs.

This book covers a few decades of evolution in the database world.  Which developments or patterns struck you as the most remarkable in the long run?
The perseverance of SQL is really striking.  It survived object-oriented databases and XML databases, and after a while, many of its constructs were even incorporated in these paradigms.  More recently, there is considerable irony (and beauty) in the fact that one of the main tendencies of NoSQL development efforts is to mimic the convenient data retrieval and query facilities of, well, SQL-based databases.  Solutions for ‘SQL on top of NoSQL’ are all over the place.  And let’s not forget: SQL still remains a beautiful language.

Is there any additional material provided (e.g., for instructors or students)?
Yes, the book comes with the following additional material:
  • A website with additional information: www.pdbmbook.com
  • Free YouTube lectures for each of the 20 chapters, see https://www.youtube.com/watch?v=o36Z_OqC2ac&list=PLdQddgMBv5zHcEN9RrhADq3CBColhY2hl
  • PowerPoint slides for each of the 20 chapters, see http://www.pdbmbook.com/lecturers
  • A solutions manual with the solutions to all multiple choice and open questions.
  • An online playground with diverse environments, including MySQL for querying; MongoDB; Neo4j Cypher; and a tree structure visualization environment
  • Full-color illustrations throughout the text
  • Extensive coverage of important trending topics, including data warehousing, business intelligence, data integration, data quality, data governance, Big Data, and analytics
  • Hundreds of examples to illustrate and clarify the concepts discussed that can be reproduced on the book’s companion online playground
  • Case studies, review questions, problems, and exercises in every chapter



West Lake in Hangzhou, China

I published an article today on my LinkedIn profile, Using Artificial Intelligence to Calculate Characteristic Town Index. In this article, I described a milestone project that we recently accomplished and how it made me instantly famous in the field. The following is the content of the LinkedIn article.

Characteristic town development is a hot topic in China. I am extremely excited to announce that, using artificial intelligence, we have successfully developed a Characteristic Town Index for 3,700 towns in China. Based on the index, we are able to identify the most promising towns for development objectively and efficiently. This is the first time that artificial intelligence has been used in characteristic town evaluation. It is a cross-team effort under my leadership as the CTO of Hangzhou Jingli Company. I published an article, Use Big Data and Artificial Intelligence to Rationally Evaluate the Characteristic Town, in the influential "Economic Information Daily" (in Chinese) on January 10, 2018 (please use Google Translate if you are interested). It quickly became one of the most cited articles on the Internet in China in this field. On one website alone, the article has been read nearly 52 thousand times. Websites hosting the article are consistently ranked number one by Google and Baidu for the keywords "characteristic town assessment artificial intelligence" (in Chinese). After reading the article, a government official of a major city in China said, "Dr. Zhou's article should be read carefully and thoroughly. The points raised are thought-provoking." The article's main point is that data-driven artificial intelligence models have advantages over experience-based expert systems. The following are some highlights:
  • Multiple teams spent months collecting 69 variables relevant to the 3,700 towns. These variables cover climate, geography, economy, ecology, transportation, Internet infrastructure and so on.
  • Expert systems are subjective, rigid and static. It is hard to adjust them based on the discrepancies between their outputs and reality, because a feedback mechanism is lacking.
  • Data, machine learning models and applications should form closed loops. The data are used to train models, the models' outputs provide decision support for applications, and the data are updated and enhanced when the models are applied in the real world, e.g., by finding better target variables. These closed loops allow the models to be improved iteratively.
This article has made me instantly famous in this field in China. However, the success did not come instantly at all. The core ideas come from my award-winning academic paper published two decades ago, "Using genetic learning neural networks for spatial decision making in GIS". I feel fulfilled to have finally implemented these ideas in practical applications in the field of GIS.
Deep Data Mining Blog by Jay Zhou, PhD
I have gathered a list of leading data providers. This list is continuously expanding.

Thomson Reuters: Thomson Reuters provides financial data, including company, market, valuation and price data. In addition, Open Calais from Thomson Reuters is a tool that extracts structured data from text.

Three credit bureaus: The three credit bureaus, Experian, Equifax and TransUnion, offer consumer credit reports.

Acxiom: Acxiom's Personicx products divide consumers into clusters or segments according to different criteria to help companies better market their products to targeted audiences. Personicx products include Lifestage, Digital, Financial and Hispanic.

I have found an excellent book on SQL tailored for Data Miners/Scientists, Data Analysis Using SQL and Excel by Gordon S. Linoff. Unlike other technical SQL books, this one is highly practical with many examples that can be applied directly to business problems. I highly recommend it.

On January 31, 2017, I was invited by Prof. Wei Ding of the Department of Computer Science, University of Massachusetts Boston, to give 3 talks about my data science projects across different industries. The talks were extremely well received. The following is what Prof. Ding said about them.
"It was a fortune to have Jay come to our computer science department to share his experience in solving business problems with predictive analytics on February 28, 2017. What Jay presented in his 3 talks, each lasting for 1 hour on a different topic in data mining, was totally impressive and beyond our wildest expectation. Having built competition-winning predictive models for some of the biggest companies and produced hundreds of millions of dollars' savings, Jay shared the secret of his success with students and faculty without reservation. His strong presentations were such an inspiration for our computer science students and faculty, and his methodology was innovative and powerful, even for very seasoned data scientists among the audience. Jay, thank you so much for your hard work preparing and delivering these presentations!" The audience were particularly amazed by how I came up with solutions in an Oracle SQL environment. To share my expertise, I created the online course Oracle SQL for Data Science to show how to perform common data science tasks using Oracle SQL and the benefits of doing so. The description of the course can be found here.

I have gone through many steps to build predictive models on the Oracle instance on Amazon AWS and have presented the results to management. We decide to stop the Oracle instance so that we will not pay for instance hours (we are still charged for provisioned storage). The stopped instance can be restarted later.

The instance is being stopped.

I have a training data set of 136 records; 25 of them are positive examples and the remaining ones negative. The goal of the project is to build a predictive model that gives the probability of a data point being positive. A logistic regression model is selected for its structural and implementation simplicity. To make the model more robust and able to perform reasonably well on new data sets, I decide to build 20 logistic regression models, each based on a random sample, with replacement, of the original 136 records. The prediction probabilities produced by these 20 models are averaged to arrive at the final score. My first step is to generate 20 random samples of 136 records from the original training set. Each sampled set has the same size as the original, but a given record may be picked zero, one or more than one time. I write the following PL/SQL to do the 20 rounds of random sampling with replacement.



-- t_bagging records, for each of the 20 iterations, the 136 sampled record ids.
create table t_bagging (iter number, id number);

begin
  for i in 1..20
  loop
    -- For every row of t_train (136 rows), draw one random id between 1 and 136,
    -- so each iteration is a sample of 136 ids drawn with replacement.
    insert into t_bagging
    select i, 1 + mod(abs(dbms_random.random), 136) from t_train;
    dbms_output.put_line(i);
  end loop;
  commit;
end;
/

In the above script, t_train is the original training set of 136 records, each with a unique identifier running from 1 to 136. The function dbms_random.random generates a uniformly distributed random integer between -2^31 and 2^31. I make the random number non-negative by taking its absolute value with the abs() function, the mod() function then maps it into the range 0 to 135, and adding 1 shifts the range to 1 to 136. A quick sanity check on t_bagging is shown below; after that comes a second script that creates 20 views, each of which will be used as a new training set for building one of the 20 models.
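Each iteration should contain exactly 136 rows and, because the sampling is done with replacement, usually fewer than 136 distinct ids; a query along these lines can confirm that:

-- Sanity check on the sampling: 136 rows per iteration, typically fewer
-- than 136 distinct ids because records can be drawn more than once.
select iter, count(*) as n_rows, count(distinct id) as n_distinct
from   t_bagging
group  by iter
order  by iter;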


-- Create one view per iteration; each view joins the sampled ids back to
-- t_train, so an id sampled several times appears several times in the view.
declare
  sqlstr varchar2(512);
begin
  for i in (select distinct iter from t_bagging order by iter)
  loop
    sqlstr := 'create or replace view v_tr_bag' || i.iter || ' as select ' ||
              'a.* from t_train a, t_bagging c' ||
              ' where a.id = c.id and c.iter = ' || i.iter;
    --dbms_output.put_line(sqlstr);
    execute immediate sqlstr;
  end loop;
end;
/
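Once the 20 models have been trained on these views and scored, the final prediction is simply the average of the 20 probabilities. Assuming the individual predictions are collected in a hypothetical table t_scores(iter, id, prob), holding one predicted probability per model iteration and record, the averaging is a plain aggregation:

-- Average the 20 per-model probabilities into one final score per record.
-- t_scores is a hypothetical table; it is not created by the scripts above.
select id, avg(prob) as final_score
from   t_scores
group  by id
order  by id;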

I have a text file on my laptop that I want to load into an Oracle database on Amazon. The only tool available is the SQL*Plus client on my laptop, from which I can connect to the database and run queries. The text file has 3,735 records and 17 columns, as shown below.


1.0,65.79,329.0,0.76,-1,-1,-1,30.0,0.863,351.0,0.81,28.1,13.8,0.8055,-1,137000.0,4.0
2.0,224.9,146.0,-1,1949.5,-1,19.3,22.0,0.301,166.0,-1,28.0,7.9,4.6,-1,4670000.0,10.0
3.0,113.0,113.0,0.76,1114.9,123.8,-1,22.0,0.65,119.0,0.85,-1,-2.0,6.3,8.0,230000.0,9.0
4.0,175.4,28.0,0.74,1670.0,-1,22.9,29.0,0.283,35.0,-1,37.1,2.0,5.7,-1,390000.0,7.0
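Before the generated insert statements can run, the target table t_data must exist. A minimal sketch with 17 numeric columns (the column names are hypothetical):

-- Target table for the 3,735 records; 17 numeric columns with hypothetical names.
create table t_data (
  c01 number, c02 number, c03 number, c04 number, c05 number,
  c06 number, c07 number, c08 number, c09 number, c10 number,
  c11 number, c12 number, c13 number, c14 number, c15 number,
  c16 number, c17 number
);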
I decide to create a SQL script file that contains 3,735 "insert" SQL statements, connect from the SQL*Plus client to the database, and run the script file. As shown below, I use the Linux awk command to generate an insert statement for each record.

$ cat x.txt | awk '{print "insert into t_data values(",$0,");"}' > insert_data2.sql
$ cat insert_data2.sql
insert into t_data values( 1.0,65.79,329.0,0.76,-1,-1,-1,30.0,0.863,351.0,0.81,28.1,13.8,0.8055,-1,137000.0,4.0 );
insert into t_data values( 2.0,224.9,146.0,-1,1949.5,-1,19.3,22.0,0.301,166.0,-1,28.0,7.9,4.6,-1,4670000.0,10.0 );
insert into t_data values( 3.0,113.0,113.0,0.76,1114.9,123.8,-1,22.0,0.65,119.0,0.85,-1,-2.0,6.3,8.0,230000.0,9.0 );
insert into t_data values( 4.0,175.4,28.0,0.74,1670.0,-1,22.9,29.0,0.283,35.0,-1,37.1,2.0,5.7,-1,390000.0,7.0 );
.................
I open Notepad and add "begin" at the top of insert_data2.sql and "end;" followed by "/" on its own line at the bottom, so that the insert statements are wrapped in a single PL/SQL block. By doing so, all the insert statements are treated as one unit of work: if any statement fails, the whole block fails and its changes are rolled back, so either all 3,735 records are inserted or none at all. I don't want data partially inserted.

begin
insert into t_data values( 1.0,65.79,329.0,0.76,-1,-1,-1,30.0,0.863,351.0,0.81,28.1,13.8,0.8055,-1,137000.0,4.0 );
insert into t_data values( 2.0,224.9,146.0,-1,1949.5,-1,19.3,22.0,0.301,166.0,-1,28.0,7.9,4.6,-1,4670000.0,10.0 );
insert into t_data values( 3.0,113.0,113.0,0.76,1114.9,123.8,-1,22.0,0.65,119.0,0.85,-1,-2.0,6.3,8.0,230000.0,9.0 );
insert into t_data values( 4.0,175.4,28.0,0.74,1670.0,-1,22.9,29.0,0.283,35.0,-1,37.1,2.0,5.7,-1,390000.0,7.0 );
.................
end;
/
I connect to the Oracle database on Amazon and run insert_data2.sql. It takes less than 16 seconds to insert the 3,735 records. Not bad.

SQL> @insert_data2

PL/SQL procedure successfully completed.

Elapsed: 00:00:15.93
SQL> select count(1) from t_data;

COUNT(1)
----------
3735
Elapsed: 00:00:00.39

When I try to connect to the instance that I just started on Amazon, I get "ORA-12170: TNS:Connect timeout occurred".


$ sqlplus adminxxx/abcxxxx@xxxxx.ctbxjhstgq8o.us-east-1.rds.amazonaws.com:1521/orcl

SQL*Plus: Release 12.1.0.2.0 Production on Thu Aug 3 12:18:04 2017

Copyright (c) 1982, 2014, Oracle. All rights reserved.

ERROR:
ORA-12170: TNS:Connect timeout occurred


Enter user-name:
I suspect it has to do with the security group, so I click on the security group.
Then I add port 1521 to the inbound rules.
I save the setting and am able to connect to the instance from my laptop.

$ sqlplus adminxxx/abcxxxx@xxxxx.ctbxjhstgq8o.us-east-1.rds.amazonaws.com:1521/orcl

SQL*Plus: Release 12.1.0.2.0 Production on Thu Aug 3 15:20:17 2017

Copyright (c) 1982, 2014, Oracle. All rights reserved.

Last Successful login time: Thu Aug 03 2017 15:18:16 -04:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> select count(1) from user_tables;
COUNT(1)
----------
0

I am working on a project to analyze tourism data. The raw data are delivered to me as Microsoft Excel spreadsheet files. I open the files and find the data are messy. After I manage to combine the 6 spreadsheet files into a single sheet using Microsoft Access, I decide to load the single sheet into an Oracle database as a table. All columns will initially be defined as varchar2. My plan is to clean up those varchar2 columns, such as average annual temperature, and convert them to numeric using Oracle's rich set of string functions. I don't have access to any Oracle instance, so I decide to start one on Amazon's platform using RDS. It takes me about 2 minutes to fill in the necessary parameters, and the database is starting.

After about 5 minutes, the instance is ready to be connected. Cool!
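Once the sheet has been loaded into the staging table, the planned clean-up can look something like the query below; the table name t_tourism_raw and the column avg_annual_temp are hypothetical placeholders:

-- Clean a varchar2 column and convert it to a number: keep only digits,
-- the decimal point and the minus sign, then apply to_number().
select to_number(regexp_replace(avg_annual_temp, '[^0-9.-]', '')) as avg_annual_temp_num
from   t_tourism_raw;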
