Ruslan's R&D in DB notes
Friday, March 27, 2015
LinkedIn as a publishing platform
It has been a long time since I wrote my last post. I am not sure whether I will be more active in the future, but with luck I plan to publish my thoughts on my LinkedIn account.
Thursday, November 15, 2012
NuoDB - a cloud database
This article presents NuoDB as a cloud database. It also claims that NuoDB is ACID compliant. At the end of the article it is clarified that the company is still working on making NuoDB ACID compliant in the cloud.
According to the CAP theorem, scaling out a consistent database requires compromising availability, since the unavoidable synchronization between all database nodes cannot be performed fast enough over a network, if only because of latency.
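As a rough illustration of the latency argument, here is a tiny F# sketch (my own toy numbers, nothing NuoDB-specific): with synchronous replication a transaction cannot commit before the slowest replica acknowledges, so the network round trip puts a hard floor under commit latency, however fast the local engine is.

    // Commit cannot finish before the slowest replica acknowledges,
    // so latency is bounded below by the worst network round trip.
    let commitLatencyMs (replicaRttMs: float list) (localWorkMs: float) =
        localWorkMs + List.max replicaRttMs

    // e.g. 0.5 ms of local work with replicas at 1, 40 and 120 ms RTT:
    printfn "%.1f ms" (commitLatencyMs [1.0; 40.0; 120.0] 0.5)  // 120.5 ms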
If so, it will be interesting to see how NuoDB solves this problem.
Thursday, May 3, 2012
TPC benchmarks for OLTP databases
TPC stands for the Transaction Processing Performance Council, which develops benchmarks and publishes results. The benchmarks test both software and hardware. There are two active benchmarks for testing hardware and DBMSs running OLTP applications: TPC-C and TPC-E. TPC-C is rather old but still active, while TPC-E is quite new. It is interesting that, among the various database engines, TPC-E results cover only MS SQL Server, while TPC-C is popular with Oracle and DB2. The latest version of MS SQL Server presented in the TPC-C results is 2005. This odd participation of database vendors in benchmarking is a sign that something is wrong with these benchmarks. What are the problems with TPC-C and TPC-E? Let's have a look at TPC-C first.
TPC-C defines an application with a highly partitioned database, where each partition contains a small amount of data and is accessed by infrequent transactions. The TPC-C specification requires that only around 10 transactions per second be performed on a partition, and fewer than 20% of those transactions access data from more than one partition. Is this a realistic example? High-performing DBMSs can hardly show much better results than old DBMSs on the TPC-C benchmark. I believe that TPC-C is dead.
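As a back-of-the-envelope F# sketch (using the rough per-partition rate above, which is my reading of the spec rather than an exact figure), the cap implies that reporting a higher total throughput requires adding partitions, and therefore data, rather than a faster engine:

    // With a fixed per-partition rate cap, total throughput scales with
    // the number of partitions, not with the speed of the engine.
    let partitionsNeeded (targetTps: float) (capPerPartitionTps: float) =
        ceil (targetTps / capPerPartitionTps) |> int

    // e.g. reporting 1,000,000 tps at ~10 tps per partition:
    printfn "%d partitions" (partitionsNeeded 1000000.0 10.0)  // 100000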
It is important to pay attention when vendors publish their own TPC-C results that are not reported on the TPC web site. This usually means they run an unofficial implementation of TPC-C, heavily modified to achieve high throughput. Comparing such benchmark results is therefore meaningless, because the benchmarks differ too much. Moreover, it is not possible to run the same benchmark, e.g., some tweaked unofficial implementation of TPC-C, on different DBMSs and publish the results, because of the DeWitt Clause included in the license agreements of Oracle, MS SQL Server and the like.
The other OLTP benchmark, TPC-E, is much more complex than TPC-C and places a significant limitation on how an application and a DBMS are integrated. TPC-E requires that application simulators and backend databases communicate through modules whose code is provided as part of the TPC-E specification. Thus TPC-E assumes only the traditional architecture, in which the DBMS and the application code run in separate processes or on separate servers. This makes it impossible to exploit the VMDBMS concept, where the DBMS and the application code run in the same process to avoid unnecessary data movement and transformation between the DBMS and the application business logic.
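To make the architectural point concrete, here is a toy F# sketch of my own (not code from the TPC-E kit or from any VMDBMS): in the traditional architecture every result crosses a process boundary, so it is serialized, shipped and parsed back, while in-process access hands the application a direct reference to the data.

    // A record standing in for a stored object.
    type Account = { Id: int; Balance: decimal }

    // VMDBMS-style access: application and DBMS share a process, so the
    // application gets the object directly, with no copying or parsing.
    let lookupInProcess (store: Map<int, Account>) id =
        Map.tryFind id store

    // Traditional client-server access: the result is serialized into a
    // wire format, moved across the boundary and deserialized again.
    let lookupOverWire (store: Map<int, Account>) id =
        Map.tryFind id store
        |> Option.map (fun a ->
            let wire = sprintf "%d;%M" a.Id a.Balance   // serialize
            let parts = wire.Split ';'                  // "network" hop
            { Id = int parts.[0]                        // deserialize
              Balance = decimal parts.[1] })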
Since only Microsoft runs the TPC-E benchmark, it seems that TPC-E will never become popular.
Among all TPC benchmarks, the most popular and lively one is TPC-H, which is oriented towards data warehouses and databases supporting OLAP-style applications.
About an interview with Mike Stonebraker on ODBMS.org
ODBMS.org published an interview with Mike Stonebraker, a well-known scientist in database technologies and the founder of several database products.
Mike repeats his main statement that one size does not fit all, which he backs up by founding and developing database products specialized for different kinds of applications. Examples of such products are the in-memory OLTP database VoltDB and the DW/BI database Vertica. Mike argues that current data, operated on by ACID transactions, should be managed by an OLTP engine such as VoltDB, while historical data should be moved into an analytical database system such as Vertica, thus getting much better performance for the two different tasks. I fully support him.
For the benchmark comparison, Mike refers to the TPC-C benchmark and a comparison between VoltDB numbers and legacy DBMS numbers. Unfortunately, this comparison is unfair. VoltDB runs a modified version of TPC-C, which does not follow the TPC-C specification, and thus the benchmark results are not published on the TPC web site. VoltDB's implementation of the "TPC-C benchmark" is biased towards VoltDB, since VoltDB does not allow concurrency within a database partition; a toy sketch of that execution model follows below. Note that the original TPC-C is biased towards legacy databases and limits benchmark results by the underlying hardware. (I hope to find time to write a small post about the problems with TPC benchmarks for OLTP databases.)
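For readers unfamiliar with that model, here is a toy F# sketch of my own (not VoltDB code) of serial execution per partition: each partition owns a single work queue and runs transactions one at a time, which removes locking inside a partition but also forbids any concurrency there.

    open System.Collections.Concurrent

    // One partition = one queue + one worker. Transactions on the same
    // partition never run concurrently, so no locks or latches are needed.
    type Partition() =
        let queue = new BlockingCollection<unit -> unit>()
        do
            async { for tx in queue.GetConsumingEnumerable() do tx () }
            |> Async.Start
        member this.Submit(tx: unit -> unit) = queue.Add tx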
In general, Mike Stonebraker plays an important role in modern DBMS development. I highly recommend reading the interview, reading his papers and listening to his presentations.
Wednesday, April 25, 2012
Cool presentation of F#
I am investigating Visual F# as an additional tool to C# under the .NET Framework. To get a quick overview of F#, I watched a presentation by Luca Bolognese at PDC2008: An Introduction to Microsoft F#. This was a very good choice: a very cool presentation and a very good introduction to F#. I recommend watching it and having fun while getting a good overview.
Some comments about F#:
It is a functional language for .NET whose syntax makes it easy to use for the following (see the sketch after this list):
- Declarative processing of lists or sets in a pipeline fashion.
- Asynchronous and parallel processing.
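To give a flavour of both points, here is a minimal F# sketch in the spirit of what Luca demonstrates (my own toy code, not taken from the talk):

    // Declarative pipeline: filter, transform and reduce a list in one chain.
    let sumOfSquaresOfEvens =
        [1 .. 10]
        |> List.filter (fun n -> n % 2 = 0)   // keep even numbers
        |> List.map (fun n -> n * n)          // square them
        |> List.sum                           // 4+16+36+64+100 = 220

    // Asynchronous workflows: run independent computations in parallel
    // and collect their results.
    let slowSquare n = async {
        do! Async.Sleep 100                   // simulate I/O latency
        return n * n
    }

    let squares =
        [1 .. 4]
        |> List.map slowSquare
        |> Async.Parallel
        |> Async.RunSynchronously             // [|1; 4; 9; 16|]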
Friday, April 13, 2012
ACM Webinar on Security
Yesterday I attended the second ACM Webinar: Security: Computing in an Adversarial Environment, by Dr. Carrie Gates from CA Technologies.
It was a high-level introductory lecture on general security. I cannot say that it was a computer science (CS) lecture; the only connection to computer science was that security in the modern world is closely tied to computers and the Internet. Since the webinar came from the ACM, I was expecting a more technological CS lecture.
At the beginning, Dr. Gates asked why the security discipline is different from other major CS disciplines such as AI. My answer was:
Most CS disciplines help to solve the primary tasks or services of applications or software, while security is often hidden behind those tasks and services, in the background. Security does not help with solving primary tasks; instead, it is often the opposite.
This answer was not in the list of answers given at the end of the presentation, but it was indirectly mentioned during the lecture.
By the way, I was using Firefox during the webinar and noticed bad sound quality and could not download the slides. Later I switched to IE to get the slides, and the sound improved. I have not tested with Chrome.
Wednesday, April 4, 2012
SAP HANA database in SIGMOD Record
An article about the new SAP HANA database was published in SIGMOD Record last December: SAP HANA database: data management for modern business applications by F. Färber, S. Kyun Cha, J. Primsch et al.
I was expecting to get insights into HANA from this article. Unfortunately, it is just another white paper without technical details.
HANA is a main-memory database and implements both column-oriented and row-oriented storage. This mix is getting popular (see C-Store). In contrast to C-Store, which uses a row store for updates and a column store for analysis, HANA's article recommends using the row store for metadata and the column store for the actual data. Data in the column store is processed by both read-write and read-only transactions. Note that HANA supports ACID transactions. Therefore, combining a column store with ACIDness can potentially lead to performance issues, because (see the sketch after this list):
- Column stores are designed for scans and perform badly under many small insert transactions.
- Maintaining ACID is expensive if update or insert transactions are large.
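To illustrate the first point, here is a toy F# sketch of my own (not HANA's implementation): a single logical row insert into a column store must append to every per-column array, while a row store appends just one record.

    // Row store: one append per inserted row.
    type RowStore = ResizeArray<int * string * float>
    let insertRow (rows: RowStore) row = rows.Add row

    // Column store: the same logical insert touches every column array,
    // so many small insert transactions hit many separate structures.
    type ColumnStore =
        { Ids: ResizeArray<int>
          Names: ResizeArray<string>
          Prices: ResizeArray<float> }

    let insertColumnar (cs: ColumnStore) (id, name, price) =
        cs.Ids.Add id
        cs.Names.Add name
        cs.Prices.Add price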
The article does not say how the ACID property should be used in applications. Another missing point is how a database should be populated.
Another question related to database population and transactions is how scale-out is supported in the presence of updates. According to the CAP theorem, consistency, i.e., ACIDness, has to be relaxed for scale-out. It would be great to see how HANA handles data population and analysis at the same time.
Having different engines for different kinds of data, such as relational data, graph data and text data, under a single hood is a good idea. It will be interesting to see the performance and complexity of applications whose data is managed by several engines at the same time.
Details about "Beyond SQL" features are also missing.