Apache Hadoop 3.3.0 comes with improvements for ARM platforms and more

After a year and a half of development, the Apache Software Foundation has released Apache Hadoop 3.3.0, a version that adds improvements for ARM platforms, support for scheduling container launches, and other changes.

Apache Hadoop is a free platform for organizing the distributed processing of large amounts of data using the MapReduce paradigm, in which a task is divided into many smaller isolated chunks, each of which can run on a separate cluster node.
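As a rough illustration of that split, here is a minimal single-process sketch of the map, shuffle, and reduce steps for a word count. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example, and on a real cluster each chunk would be mapped on a different node.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # Each chunk is mapped independently of the others; this is the part
    # that a cluster would spread across separate nodes.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle(mapped))


counts = word_count(["hadoop stores data", "hadoop processes data"])
print(counts)
```

Because the chunks never depend on each other during the map step, adding nodes lets more chunks be processed at the same time.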

Hadoop-based storage can span thousands of nodes and contain exabytes of data.

About Apache Hadoop

Hadoop includes an implementation of the Hadoop distributed file system (HDFS), which provides data redundancy automatically and is optimized for MapReduce applications.

A key feature is that, for effective work scheduling, each file system must know and report its location: the name of the rack (more precisely, of the network switch) where the worker node is.

Hadoop applications can use this information to run work on the node where the data is and, failing that, on the same rack / switch, thus reducing network traffic.
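The preference order described here (same node first, then same rack, then anywhere) can be sketched as a simple ranking. This is only an illustration under assumed names: `Node`, `locality_rank`, and `pick_node` are hypothetical and are not part of the Hadoop API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    rack: str  # rack / switch identifier reported for this node


def locality_rank(candidate, replicas):
    """Return 0 for node-local, 1 for rack-local, 2 for off-rack."""
    if candidate in replicas:
        return 0
    if any(candidate.rack == replica.rack for replica in replicas):
        return 1
    return 2


def pick_node(free_nodes, replicas):
    """Choose the free node closest to the data, or None if none is free."""
    return min(free_nodes, key=lambda n: locality_rank(n, replicas), default=None)


a1 = Node("a1", "/rack-a")
a2 = Node("a2", "/rack-a")
b1 = Node("b1", "/rack-b")
replicas = [a1]  # the data block lives on node a1

print(pick_node([b1, a2, a1], replicas).name)  # node-local choice
print(pick_node([b1, a2], replicas).name)      # falls back to the same rack
```

Work placed on `a1` reads the data from local disk, work on `a2` stays behind the same switch, and only work on `b1` has to cross racks.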

To simplify access to data in Hadoop storage, the HBase database and the SQL-like Pig language have been developed. Pig is a kind of SQL for MapReduce, whose queries can be parallelized and processed by several Hadoop platforms.

The project is considered completely stable and ready for production use. Hadoop is actively used in large industrial projects, providing capabilities similar to Google's Bigtable / GFS / MapReduce platform, and Google has officially granted Hadoop and other Apache projects the right to use patented technologies related to the MapReduce method.

Hadoop ranks first among the Apache repositories in the number of changes made and has the fifth-largest code base (approximately 4 million lines of code).

What's new in Apache Hadoop 3.3?

This new version of Hadoop is positioned as the first to support ARM-based platforms, and a binary for ARM is already available for those who want to deploy it on that architecture.

Another major change in this version is that the Protobuf (Protocol Buffers) format, used to serialize structured data, has been updated to version 3.7.1 because the protobuf-2.5.0 branch has reached the end of its life cycle.

In addition, the capabilities of the S3A connector have been expanded: it now supports authentication using tokens, handles caching of responses with a 404 code better, and offers higher S3Guard performance and improved operational reliability.

A DNS resolver service has also been added, which lets the client determine the servers via DNS from their host names, so the configuration no longer has to list every host.
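The general idea of replacing a host list with a single DNS name can be sketched as follows. This is not Hadoop's resolver or its configuration: `expand_hosts`, the stub resolver, the host name `namenodes.example.internal`, and the port are all assumptions made for the example. Only `socket.getaddrinfo` is a real standard-library call.

```python
import socket


def expand_hosts(name, port, resolve=socket.getaddrinfo):
    """Return the sorted, de-duplicated 'ip:port' endpoints behind one DNS name."""
    infos = resolve(name, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4 the sockaddr is an (address, port) tuple.
    addresses = {info[4][0] for info in infos}
    return sorted(f"{address}:{port}" for address in addresses)


def stub_resolve(name, port, family=0, type=0):
    """Hypothetical stand-in for DNS so the example runs offline."""
    return [
        (socket.AF_INET, socket.SOCK_STREAM, 6, "", ("10.0.0.2", port)),
        (socket.AF_INET, socket.SOCK_STREAM, 6, "", ("10.0.0.1", port)),
        (socket.AF_INET, socket.SOCK_STREAM, 6, "", ("10.0.0.1", port)),
    ]


endpoints = expand_hosts("namenodes.example.internal", 8020, resolve=stub_resolve)
print(endpoints)
```

Dropping the `resolve=` argument makes the function query the system resolver instead, so adding or removing a server becomes a DNS change rather than a configuration change on every client.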

Support has also been added for scheduling container launches through a centralized resource manager (ResourceManager), including the ability to distribute containers according to the load of each node.
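To give a feel for what load-aware distribution means, here is a minimal greedy sketch that always places the next container on the least-loaded node. It is not the algorithm the ResourceManager uses. The function `place_containers`, the notion of load as a running container count, and the per-node `capacity` limit are assumptions for illustration.

```python
def place_containers(node_load, requests, capacity):
    """Assign each requested container to the currently least-loaded node.

    node_load maps node name -> number of containers already running.
    Returns the list of chosen node names, one per placed container.
    """
    load = dict(node_load)  # work on a copy so the caller's dict is untouched
    placement = []
    for _ in range(requests):
        candidates = [node for node in load if load[node] < capacity]
        if not candidates:
            break  # every node is full; remaining requests stay unplaced
        # Ties are broken by node name to keep the result deterministic.
        node = min(candidates, key=lambda n: (load[n], n))
        load[node] += 1
        placement.append(node)
    return placement


placement = place_containers({"n1": 3, "n2": 1, "n3": 2}, requests=4, capacity=4)
print(placement)
```

A central scheduler can make this kind of choice because it sees the load of every node at once, which an individual node cannot.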

Other notable changes in this new version:

  • Problems with automatic tuning have been resolved in the ABFS file system.
  • Added native support for the Tencent Cloud COS file system to access COS object storage.
  • Full support for Java 11 was added.
  • Stabilized the HDFS RBF (Router Based Federation) implementation. Security controls have been added to the HDFS router.
  • Added a searchable YARN (Yet Another Resource Negotiator) application catalog.

Finally, if you want to know more, you can check the details of the new version in the original post.

Those interested in obtaining the new version can download the prebuilt binaries from the following link.