MapReduce vs Parallel DBMS
In the article “MapReduce and Parallel DBMSs: Friends or Foes?”, Michael Stonebraker, Daniel J. Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin argue that cloud computing tasks should be performed by complex systems in which a MapReduce (MR) system sits upstream of a parallel relational database management system (DBMS), and that interfaces between the two systems should be developed.
Cloud computing is a new technology that uses a large number of processors working in parallel to perform calculations (Stonebraker et al., 2010). These processors reside in interconnected commodity computers that are viewed as a cluster; each such computer is called a node of the cluster (Stronberg, 1986). The tools for cluster programming include MR and parallel DBMSs. Some hold that the extreme scalability of MR gives it a huge competitive advantage over a parallel DBMS. Moreover, Facebook has used MR technology alone to implement its warehouse. Nevertheless, parallel DBMSs fully satisfy current customers’ scalability needs. Although any program that involves parallel processing can be written “as either a set of database queries or a set of MR jobs”, some classes of tasks are considered more suitable for the MR model than for a parallel DBMS.
Typically, a MR system is supposed to transform “raw data into useful information that is consumed by another storage system”. In this respect a MR system resembles an extract-transform-load (ETL) system. In practice, ETL for a modern DBMS is performed by separate products, while no ETL system is used “to do DBMS services”.
Analytical problems encountered in data mining require “multiple passes over the data”, so they cannot be programmed as “single SQL aggregate queries”. Instead, finding a numerical solution requires “a complex data flow program”, and the authors therefore recommend the MR model for this case.
Since MR systems do not require a schema for their data, they can work with data that have a varying number of attributes. In the relational model such data can be described by tables “with many attributes”, where attributes that a specific record does not need are assigned NULL values. Relational DBMSs that store such records row by row are called row-based DBMSs (Stonebraker et al., 2010). A column-based DBMS, in turn, reads only the attributes a query needs (Abadi, 2007). The authors believe that analytical queries on such data should be performed by a column-based DBMS, and that a MR system should be used when ETL has to be performed on them.
The start-up time of a MR system is significantly shorter than that of a DBMS, because a MR system is much easier to install (Pavlo et al., 2009). Besides, MR systems work with raw data by default, while DBMSs must first transform the data into the required formats. Therefore, MR systems are more suitable than DBMSs for quick approximate analyses of transient data.
Finally, most MR systems are available for free, while parallel DBMSs are expensive. Hence, MR systems are a better fit for “users with limited budgets” than parallel DBMSs (Stonebraker et al., 2010).
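The ETL role described above can be illustrated with a minimal, single-process sketch of the map, shuffle, and reduce steps. It is not Hadoop's API and not the authors' code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `parse_line`, `total_bytes`, `run_job`) and the log format `<user> <url> <bytes>` are assumptions made purely for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every raw record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def parse_line(line):
    """Hypothetical mapper: extract (user, bytes) and skip malformed lines."""
    parts = line.split()
    if len(parts) != 3 or not parts[2].isdigit():
        return []
    user, _url, size = parts
    return [(user, int(size))]


def total_bytes(_user, sizes):
    """Hypothetical reducer: total bytes transferred per user."""
    return sum(sizes)


def run_job(lines):
    """Run the whole job over raw, schema-less text lines."""
    return reduce_phase(shuffle(map_phase(lines, parse_line)), total_bytes)


# Invented sample input; note that no schema is declared anywhere.
RAW_LOG = [
    "alice /index.html 120",
    "bob /about.html 80",
    "corrupted line",
    "alice /data.csv 300",
]

if __name__ == "__main__":
    print(run_job(RAW_LOG))
```

The mapper tolerates malformed records without any prior schema definition, which is the property the essay attributes to MR systems working on raw data.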
A comparison of the real-life performance of MR systems and DBMSs can test these arguments.
Pavlo et al. (2009) conducted a study in which the open-source project Hadoop was chosen as a typical MR system. All parallel DBMSs of acceptable quality were commercial. Vertica and DBMS-X were chosen as typical parallel column-store and row-based relational databases, respectively. Performance was compared on three “tasks of increasing complexity”. In two of them Hadoop was expected to outperform the chosen databases. Nevertheless, the results indicate that, once the data are loaded, Vertica and DBMS-X solve all of these tasks much faster than Hadoop does. At the same time, data loading in these databases takes much longer than in Hadoop. It should be mentioned that Google's version of the MR system may be faster than Hadoop, but it was not available for this study. Hadoop's poor performance in the study can be explained by its inefficient architecture.
The architectural differences between a MR system and a parallel DBMS can be explained by the fact that the former is designed to perform “complex analytics and ETL tasks”, while the latter is designed for “efficient querying of large data sets”. Therefore, these technologies should complement each other, with a MR system placed upstream of a parallel DBMS. Since finding a numerical solution of an analytical problem often requires running a query on a large data set, interfaces between these systems need to be developed (Stonebraker et al., 2010).
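The upstream/downstream arrangement can be sketched as a two-stage pipeline: a MR-style job cleans and aggregates raw text, and its output is loaded into a relational store that answers queries. In this sketch an in-memory SQLite database stands in for the parallel DBMS, which is an assumption made only to keep the example runnable; the names `map_line`, `reduce_sum`, `mr_etl`, `load_and_query`, the `sales` table, and the sample data are likewise hypothetical.

```python
import sqlite3
from collections import defaultdict


def map_line(line):
    """Map step: parse 'region, amount' and drop malformed records."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 2 or not parts[1].isdigit():
        return []
    return [(parts[0], int(parts[1]))]


def reduce_sum(values):
    """Reduce step: total amount for one region."""
    return sum(values)


def mr_etl(raw_lines):
    """Upstream stage: MR-style ETL that turns raw text into clean rows."""
    groups = defaultdict(list)
    for line in raw_lines:
        for key, value in map_line(line):
            groups[key].append(value)
    return [(key, reduce_sum(values)) for key, values in sorted(groups.items())]


def load_and_query(rows, min_total):
    """Downstream stage: load the rows into a DBMS and answer a SQL query.

    SQLite is only a stand-in here for a parallel DBMS.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT PRIMARY KEY, total INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT region, total FROM sales WHERE total >= ? ORDER BY total DESC",
        (min_total,),
    ).fetchall()
    conn.close()
    return result


# Invented sample input for illustration.
RAW = ["north, 10", "south, 5", "garbage", "north, 7", "east, 20"]

if __name__ == "__main__":
    print(load_and_query(mr_etl(RAW), min_total=10))
```

The hand-off between `mr_etl` and `load_and_query` is the kind of interface the authors say needs to be developed; here it is simply a list of tuples.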
I do not agree with the evidence the authors use to argue that complex analytics should be handled by MR systems. Specifically, they state that solving an analytical task numerically involves multiple passes over the data, and that these passes cannot be structured as “single SQL aggregate queries” (Stonebraker et al., 2010). I have experience in numerically solving a system of linear algebraic equations with a tridiagonal matrix, a problem that can be classed as complex analytics. To find its numerical solution I used the Thomas method, which makes two passes over the input data. Nevertheless, this algorithm can be presented as a superposition of single passes over the data, each of which can be performed in parallel on many processors (Karniadakis & Kirby II, n.d.).
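For reference, the two passes of the Thomas method can be sketched as follows. This is the standard sequential form (forward elimination, then back substitution), not the parallel reformulation mentioned above and not the code used in the original work. The function name `thomas_solve` and the storage convention (four equal-length lists, with `lower[0]` and `upper[-1]` unused) are assumptions, and no pivoting is done, so the sketch assumes a well-conditioned system such as a diagonally dominant one.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system A x = rhs with the Thomas algorithm.

    lower[i] is the sub-diagonal entry of row i (lower[0] is unused),
    diag[i] is the main diagonal entry of row i,
    upper[i] is the super-diagonal entry of row i (upper[-1] is unused).
    """
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n

    # Pass 1: forward elimination removes the sub-diagonal.
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom if i < n - 1 else 0.0
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom

    # Pass 2: back substitution recovers the unknowns from last to first.
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


if __name__ == "__main__":
    # Invented 3x3 example:
    #  2x1 -  x2       = 1
    #  -x1 + 2x2 -  x3 = 0
    #       -  x2 + 2x3 = 1
    print(thomas_solve([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0], [1.0, 0.0, 1.0]))
```

Each pass carries a dependency from one row to the next, so parallelising the method means restructuring the recurrences rather than simply splitting the loops across processors.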
Thus, research is needed to find out whether all data mining algorithms can be adapted for parallel calculation and whether these calculations can be performed by parallel DBMSs. If both assumptions turn out to be true, DBMSs could be used for all complex analytics in a cloud computing system.
Other than that, the article presents a thorough analysis of the strengths and weaknesses of MR and parallel DBMS technologies. I completely agree with the authors' conclusion that in a complex system a MR subsystem should perform ETL for a parallel DBMS.