This blog is about SQL Server, T-SQL, CLR, Service Broker, Integration Services, Reporting, Analysis Services, Business Intelligence, XML, SQL Scripts, best practices, database development, database administration, and programming.
The next 24 Hours of PASS will celebrate the first 20 years of the international association of data management professionals and experts. PASS stands for Professional Association for SQL Server; it was established on April 5th, 1999, and has provided training on the entire Microsoft Data Platform every year since!
For those who do not know the 24 Hours of PASS family of events: the event consists of a series of one-hour webinars. For this edition, the live sessions will start at 6:00 PM (UTC) on April 3rd, 2019 and will continue for 24 hours, during which international speakers will talk about past experiences, future visions, and how data will evolve over the next 20 years!
The main topics will be:
Database modernization and migration
Future-proofing your data architecture
Data in a world of enhanced security and privacy
The impact of AI on our management and usage of data
Dave Wells just gave this great definition, which clearly describes what has been happening in the data management world in recent years.
I am greatly enjoying Dave's session today at the Enterprise Data World summit and couldn't resist writing down a summary.
Everything we did in the last decade is now considered wrong. We used to believe that application logic runs faster and works better when it sits inside the database layer. Now that architecture is considered a poor choice. The same goes for data normalization and strong schemas. Some people even say that data warehouses are dead.
We need to rethink everything. The data schema used to be defined during the design phase. Now we define schema-on-read, after the data have been persisted. Good news - I have always believed this, and Dave just mentioned it - there is no schema-less data. Although we no longer get to design the schema up front, for Big Data we need to infer the schema from the existing data and define a separate schema model for each use case. The same goes for data quality rules, which are no longer generic: we need to figure out the data quality rules for each use case. Even data governance is changing. We cannot govern the data itself anymore; it is getting beyond our boundaries, out of our control. We can only govern what people do with the data.
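To make the schema-on-read idea concrete, here is a minimal sketch in PostgreSQL (the table and column names are hypothetical, not from any session): raw documents are persisted without an upfront schema, and each use case projects its own schema at query time.

```sql
-- Persist raw documents with no upfront schema (schema-on-write is skipped).
CREATE TABLE raw_events (payload jsonb);

INSERT INTO raw_events VALUES
  ('{"user": "alice", "clicks": 3}'),
  ('{"user": "bob", "device": "mobile"}');

-- Each use case defines its own schema at read time;
-- missing keys simply come back as NULL.
SELECT payload->>'user'          AS user_name,
       (payload->>'clicks')::int AS clicks
FROM raw_events;
```

A different use case could run a different projection over the same rows, which is exactly the "separate schema model per use case" point.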
In the modern data world we have more data sources and more types of data, many more ways to organize and store data, more uses for data, and more data consumers requiring fast, on-demand data delivery.
The data management world is changing.
The age of Data Warehousing and BI has come to an end.
We are now in the age of Big Data and Data Lakes, but they are slowly going away as well.
And our future is approaching fast, bringing with it the Data Catalog, Data Hub, and Data Fabric concepts.
I am still figuring out what it all is and how it fits together.
What a great speaker, awesome session and tons of learning ahead of me.
More and more companies are aiming to move away from managing their own servers and towards a cloud platform. Going serverless offers a lot of benefits, like lower administrative overhead and server costs. In a serverless architecture, developers work with event-driven functions that are managed by cloud services. Such an architecture is highly scalable and boosts developer productivity.
AWS Glue is an ETL service that runs in a fully managed Apache Spark environment. Glue ETL jobs can clean and enrich your data and load it into common database engines inside the AWS cloud (EC2 instances or the Relational Database Service), or write files to S3 storage in a great variety of formats, including Parquet.
I have recently published three blog posts on how to use the AWS Glue service to load data into SQL Server hosted on the AWS cloud platform.
The Database Migration Team of the SQL Server Product Group has created the following tools and services to facilitate migration between different versions of SQL Server, or between on-premises SQL Server and Azure SQL Database in the cloud (but not only)!
Azure Database Migration Service (Azure DMS)
Designed as a seamless, end-to-end solution for moving on-premises SQL Server databases to the cloud.
This is a perfect opportunity to remind you that on July 9, 2019, SQL Server 2008 and SQL Server 2008 R2 will exit the maintenance program: Microsoft will not release updates for these two products, not even security updates. You can find all the details here. This is the time to migrate!
You can also provide your feedback on Data Migration Tools and Services by writing to @Data_Migrations (datamigration at microsoft dot com).
Another SQL Saturday has been scheduled in Pordenone (Italy): it will be SQL Saturday #829, and like last year it will be an international event, with international speakers and some sessions in English!
The agenda of the day is divided into five tracks (one more than last year) that will deliver a total of 30 hours of free training on SQL Server, Power BI/Visualization, Cloud, Analytics, Enterprise Engine, and DevOps/Development! You can deepen your knowledge of the topics you use every day, or learn something about technologies you don't use yet.
Thanks to our Sponsors, the event will be free of charge for you, but registration is mandatory.
If you are around Pordenone at the end of February, or if you want to come to Italy for a weekend of training on the Microsoft Data Platform with #sqlfamily friends and good food :), you are welcome!
Many companies these days keep their data assets in multiple data stores. Many companies I have worked at used other database systems alongside SQL Server, such as PostgreSQL, Redis, Elasticsearch, or Couchbase. There are situations when an application that uses SQL Server as its main database needs to access data from another database system. Some data stores have ODBC/JDBC drivers, so you can easily add a linked server; some data stores do not.
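When an ODBC driver is available, a linked server can be added with just a couple of statements. Here is a minimal sketch, assuming an ODBC system DSN named PostgresDSN has already been configured on the SQL Server host (both the DSN and the linked server name are hypothetical):

```sql
-- Register a linked server over the generic ODBC provider (MSDASQL).
EXEC master.dbo.sp_addlinkedserver
     @server     = N'PG_LINK',      -- hypothetical linked server name
     @srvproduct = N'',
     @provider   = N'MSDASQL',
     @datasrc    = N'PostgresDSN';  -- hypothetical ODBC system DSN

-- Query the remote data store with a pass-through OPENQUERY,
-- so the statement executes on the remote side.
SELECT * FROM OPENQUERY(PG_LINK, 'SELECT 1 AS ok');
```

In practice you would also map credentials with sp_addlinkedsrvlogin before querying.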
Today I had to investigate a situation where my AWS Aurora PostgreSQL instance was at 100% CPU.
I started searching for the root cause at the database statistics views level. There is a view, pg_stat_activity, which shows information related to the current activity of each process, such as the client host IP, the last transaction start time, and wait information. There were no long-running transactions or long waits, and unfortunately this view does not have any counters for CPU or memory usage per process. Boom.
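The check I ran was along these lines (a sketch; the exact column set varies by PostgreSQL version):

```sql
-- Look for long-running transactions and current waits per backend.
SELECT pid,
       client_addr,                     -- client host IP
       state,
       wait_event_type, wait_event,     -- wait information
       now() - xact_start AS xact_age,  -- how long the transaction has run
       query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY xact_start;
```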
Another way to track down performance issues is to use the pg_stat_statements extension to see execution statistics. Each row in this view represents one query and provides information about how many times the query was executed, its execution time, the number of rows retrieved, etc. Again, no counters related to memory or CPU usage.
Some use the query below (source here), which is based on the query's total_time, assuming that a query that runs longer uses more CPU.
SELECT substring(query, 1, 50) AS query,
       round(total_time::numeric, 2) AS total_time,
       calls, rows,
       round(total_time::numeric / calls, 2) AS avg_time,
       round((100 * total_time / sum(total_time::numeric) OVER ())::numeric, 2) AS percentage_cpu
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
I am not sure that assumption holds in all cases, but it might point us to the root cause.
I could have used the plperlu extension, which can show the percentage of CPU and memory used by a particular session, but it is not supported on AWS RDS.
PostgreSQL is a process-based system: it starts a new process for each database connection. This is why you can see a database connection's memory and CPU usage only through OS facilities.
If we are using RDS, we have no access to the OS level and cannot run top to look at the processes.
Today I discovered Enhanced Monitoring for RDS instances, which monitors OS processes:
After enabling this monitoring, we can choose OS process list in the drop-down below:
Using the above, you can monitor all PostgreSQL processes and their resource consumption!
Pid 6723 contained some sensitive information, so I had to clean it up. The screenshot above was taken after the CPU peak was over, which is why the numbers are low.
Now I can go back to pg_stat_activity, check which host and which application own the specific connection, and see the executed query and its wait information:
select * from pg_stat_activity where pid = 6723;
Unfortunately, pg_stat_activity does not show the active nested statement, only the top-level one. And there is no way to join pg_stat_activity to pg_stat_statements to match the pid of the connection with the query history that pg_stat_statements shows.
We are halfway through our problem: we now know which processes are consuming CPU, but we do not really know which queries they have executed.
However, since we have the hostname and the application behind the problematic connections, we can go to the application developers, review together the query patterns they execute, and try to understand why their connections are so CPU-intensive.