Big Data Hadoop Alternatives: What They Offer and Who Uses Them (Finding Alternatives to MapReduce and Hadoop)

Many people, particularly those new to the concept of Big Data, think of Big Data and Hadoop as almost one and the same. But there are frameworks other than Hadoop that are gaining popularity. The costs of implementing Hadoop can be quite substantial, and so organizations are exploring other options.



Alternatives to Hadoop for big and unstructured data are emerging.

The two top Hadoop vendors, Hortonworks and Cloudera, aren't exactly suffering from increased competition at this point, but more organizations are discovering that Big Data comprises more than the Hadoop ecosystem. The following are some of the Big Data alternatives to Hadoop.

Apache Spark

Apache Spark promises faster speeds than Hadoop MapReduce along with good application programming interfaces. This open source framework runs in-memory on a cluster and is not tied to the two-stage (map, then reduce) paradigm of Hadoop MapReduce, so repeated access to the same data is faster. It can also read data directly from the Hadoop Distributed File System (HDFS).
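To make the "two-stage paradigm" concrete, here is a minimal single-machine sketch of a word count written as a map stage followed by a reduce stage, with the grouping step a framework performs in between. It is plain Python, not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Stage 1: apply the mapper to every record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, as a framework does between the two stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Stage 2: collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def sum_reducer(key, values):
    """Add up the counts emitted for one word."""
    return sum(values)


def word_count(lines):
    """Run the full map -> shuffle -> reduce pipeline on a list of lines."""
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
```

Any job that needs more than one such pass has to be expressed as a chain of these pipelines, which is the constraint Spark is described as avoiding.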

It requires a lot of memory, however, because it loads data into memory and keeps it there unless told otherwise. Spark excels at iterative computations that pass over the same data multiple times, and it performs best when all of the data fits in memory. For one-pass extract-transform-load (ETL) jobs, however, MapReduce is still the stronger choice. Spark is also easier to program and has an interactive mode, but Hadoop MapReduce still has more security features than Apache Spark.
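The advantage for iterative work can be illustrated with a small sketch. It does not use Spark; CountingSource is a hypothetical stand-in for slow storage such as HDFS, and refine is an arbitrary illustrative update step. The point is only that an iterative job which keeps its input in memory reads storage once, while one that reloads on every pass reads it once per pass.

```python
class CountingSource:
    """Hypothetical stand-in for slow storage; counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def refine(estimate, data):
    """One iteration: move the estimate halfway toward the mean of the data."""
    mean = sum(data) / len(data)
    return estimate + 0.5 * (mean - estimate)


def iterate_reloading(source, passes):
    """Re-read the input on every pass, like a chain of separate batch jobs."""
    estimate = 0.0
    for _ in range(passes):
        estimate = refine(estimate, source.load())
    return estimate


def iterate_cached(source, passes):
    """Read the input once, keep it in memory, and reuse it on every pass."""
    data = source.load()
    estimate = 0.0
    for _ in range(passes):
        estimate = refine(estimate, data)
    return estimate
```

Both functions compute the same answer; only the number of reads differs. The sketch also shows the cost side of the trade: the cached version has to hold the whole data set in memory for the duration of the job.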

Cluster Map Reduce

Cluster Map Reduce was developed by Massachusetts-based online ad company Chitika. The company had been using HDFS with MapReduce, and then started using a file system called Gluster for its analytical data warehouse. Its engineers tried bridging Gluster with MapReduce using existing tools, but found they wanted a more efficient solution, so they built Cluster Map Reduce.

Cluster Map Reduce provides a Hadoop-like framework for MapReduce jobs run in a distributed environment. By simplifying the movement of data and minimizing dependencies that can slow data retrieval, Chitika was able to create something faster. Compared to Hadoop, it also offers:

• More straightforward construction of queries
• Lighter footprint compared to Hadoop
• Greater ability to customize future iterations in Perl or Python (or other languages)
• Resilience to failure in server nodes

Cluster Map Reduce makes better use of hardware, allowing the same workload to be completed on fewer nodes than Hadoop requires.



Some Hadoop alternatives move data more efficiently through analytical back-end processes.

High Performance Computing Cluster

A massively parallel processing platform, High Performance Computing Cluster (HPCC) is open source and incorporates a data refinery cluster called Thor, a query cluster called Roxie, plus middleware components, external communications, and client interfaces. An HPCC environment may include only Thor clusters, or both Thor and Roxie clusters.

Thor functions as a distributed file system with parallel processing spread across nodes. It consumes, transforms, links, and indexes data. Roxie offers separate high-performance online query processing as well as data warehousing capabilities. HPCC uses Enterprise Control Language (ECL), a language specifically suited to Big Data manipulation that is compiled and optimized into C++ and is easily extended using C++ libraries.
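The division of labor between a refinery step and a query step can be sketched in a few lines. This is plain Python rather than ECL, and it does not reflect HPCC's actual interfaces; the record fields ("name", "city") and the function names refine and lookup are assumptions chosen for illustration.

```python
def refine(raw_records):
    """Batch step: normalize fields, merge duplicates, and build an index by name."""
    index = {}
    for record in raw_records:
        name = record["name"].strip().lower()
        city = record["city"].strip().title()
        # A set links records that refer to the same name and drops repeats.
        index.setdefault(name, set()).add(city)
    return {name: sorted(cities) for name, cities in index.items()}


def lookup(index, name):
    """Query step: answer a request from the prebuilt index without rescanning raw data."""
    return index.get(name.strip().lower(), [])
```

The expensive cleaning and indexing happens once in the batch step, so each online query is a cheap lookup. That separation is the idea behind running refinery and query clusters side by side.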


Hydra

Hydra is a distributed task processing system developed by social bookmarking service AddThis. It's available under an open source Apache license and can tackle some Big Data tasks that Hadoop struggles with. The company needed a scalable distributed system to deliver real-time analysis of data to customers, and Hadoop wasn't an option for AddThis at the time, so they created Hydra.

Hydra supports streaming and batch operations using a tree-based data structure so it can store and process data across clusters that may have thousands of nodes. AddThis engineer Chris Burroughs describes Hydra thus: "It ingests streams of data (think log files) and builds trees that are aggregates, summaries, or transformations of the data. These trees can be used by humans to explore (tiny queries), as part of a machine learning pipeline (big queries), or to support live consoles on websites (lots of queries)." Hydra can use HDFS, but it also operates on native file systems.
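The idea of building trees of aggregates from a stream can be shown with a small sketch. This is not Hydra's API or job configuration format; the TreeNode class, the ingest, query, and build_tree functions, and the "date site page" log layout are all assumptions made for illustration.

```python
class TreeNode:
    """One node in the aggregate tree: a hit count plus named children."""

    def __init__(self):
        self.hits = 0
        self.children = {}


def ingest(root, path):
    """Add one event, incrementing the count at every level of its path."""
    root.hits += 1
    node = root
    for part in path:
        if part not in node.children:
            node.children[part] = TreeNode()
        node = node.children[part]
        node.hits += 1


def query(root, path):
    """Return the aggregated count at a path, or 0 if the path was never seen."""
    node = root
    for part in path:
        node = node.children.get(part)
        if node is None:
            return 0
    return node.hits


def build_tree(log_lines):
    """Consume log lines of the assumed form 'date site page' into a tree."""
    root = TreeNode()
    for line in log_lines:
        date, site, page = line.split()
        ingest(root, (date, site, page))
    return root
```

Because every level already holds its own total, a query such as "hits for one site on one day" is a short walk down the tree rather than a scan of the raw logs. New events can be ingested one at a time, so the same structure serves both streaming and batch input.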


For all its tremendous power and benefits, Hadoop does have drawbacks. The way it moves data is complex, and it is not always the most efficient way to process Big Data and unstructured data. The automatic association between Big Data and Hadoop is becoming looser as more alternatives to Hadoop are developed. Some have speed advantages, while others allow stream processing or make more efficient use of hardware. Hadoop alternatives are emerging, and those who deal with Big Data or unstructured data would be wise to evaluate them against their own needs.





