Setting up Tez on CDH Cluster

Summary

Cloudera offers no official support for the Tez execution engine; it steers customers toward Impala (or, more recently, Hive on Spark) instead. This article describes how we set up the Tez engine on a CDH cluster, including the Tez UI.

Install Java/Maven

Follow the official instructions to install Java. Ensure the version matches the one your CDH cluster is running on; in our case it was Java 7 update 67.

Follow the official instructions to install the latest version of Maven.
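
Before going further, a quick sanity check that the toolchain is in place (a minimal sketch; the exact version strings will differ in your environment):

java -version   # should report the same major version the CDH cluster runs (Java 7 here)
mvn -version    # confirms Maven is installed and shows the JDK it will build with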

Download Tez source

Download the source from the Apache Tez site. We built version 0.8.4 for use on a CDH 5.8.3 cluster.

Build Tez

As of this writing, Cloudera does not support the YARN timeline server that ships with its distribution. The timeline server is required if you want to host the Tez UI, so one choice is to run the Apache version of the timeline server. The timeline server that ships with Hadoop 2.6 and later uses version 1.9 of the Jackson jars, whereas CDH 5.8.3 uses version 1.8.8. When building Tez against CDH, we therefore need to pin the right Jackson dependencies so that Tez can post events to the Apache timeline server.

Build protobuf 2.5.0

Tez requires Protocol Buffers 2.5.0. Download the source from the protobuf release page and build the binaries locally.

tar xvzf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure --prefix $PWD
make
make check
make install

Verify the version like so:

bin/protoc --version
libprotoc 2.5.0

Setup git protocol

Our firewall blocked the git protocol, so we configured git to use HTTPS instead.

git config --global url.https://github.com/.insteadOf git://github.com/

Build against CDH repo

Since we want to maintain compatibility with the existing CDH installation, it’s a good idea to build against the CDH 5.8.3 repo. Let’s create a new Maven build profile for that in the top-level pom.xml.

    <profile>
      <id>cdh5.8.3</id>
      <activation>
        <activeByDefault>false</activeByDefault>
      </activation>
      <properties>
        <hadoop.version>2.6.0-cdh5.8.3</hadoop.version>
        <pig.version>0.12.0-cdh5.8.3</pig.version>
      </properties>
      <pluginRepositories>
        <pluginRepository>
          <id>cloudera</id>
          <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </pluginRepository>
      </pluginRepositories>
      <repositories>
        <repository>
          <id>cloudera</id>
          <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
      </repositories>
    </profile>

However, this repo pulls in version 1.8.8 of the Jackson jars by default, which will not work when Tez tries to post events to the Apache timeline server, so let’s pin those dependencies explicitly:

     <dependency>
        <groupId>org.codehaus.jackson</groupId>
        <artifactId>jackson-mapper-asl</artifactId>
        <version>1.9.13</version>
      </dependency>
      <dependency>
        <groupId>org.codehaus.jackson</groupId>
        <artifactId>jackson-core-asl</artifactId>
        <version>1.9.13</version>
      </dependency>
      <dependency>
        <groupId>org.codehaus.jackson</groupId>
        <artifactId>jackson-jaxrs</artifactId>
        <version>1.9.13</version>
      </dependency>
      <dependency>
        <groupId>org.codehaus.jackson</groupId>
        <artifactId>jackson-xc</artifactId>
        <version>1.9.13</version>
      </dependency>
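
To confirm the override actually takes effect, you can inspect the resolved Jackson versions with the Maven dependency plugin (a small sketch; the profile id matches the one defined above):

mvn dependency:tree -Dincludes=org.codehaus.jackson -Pcdh5.8.3
# every org.codehaus.jackson artifact should resolve to 1.9.13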

Here’s the complete pom.xml under the top-level Tez source directory for reference, in PDF format.

Finally, build Tez:

mvn -e clean package -DskipTests=true -Dmaven.javadoc.skip=true -Dprotoc.path=<location of protoc> -Pcdh5.8.3
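
When the build finishes, locate the distribution tarball that we will later upload to HDFS. In our build it ended up under tez-dist/target/; verify the exact path in your source tree:

ls tez-dist/target/
# expect tez-0.8.4.tar.gz (the full distribution) among the build artifacts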

Run Apache timeline server

We are going to run the timeline server that ships with Apache Hadoop 2.7.3. Install it on a separate server, copy core-site.xml from the CDH cluster, and change yarn-site.xml to include the right timeline server properties.

  <property>
    <name>yarn.timeline-service.store-class</name>
    <value>org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore</value>
  </property>
  <property>
    <name>yarn.timeline-service.leveldb-timeline-store.path</name>
    <value>/var/log/hadoop-yarn/timeline/leveldb</value>
  </property>
   <property>
     <name>yarn.timeline-service.bind-host</name>
     <value>0.0.0.0</value>
   </property>
   <property>
     <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
     <value>300000</value>
   </property>
   <property>
     <name>yarn.timeline-service.hostname</name>
     <value>ATS_HOSTNAME</value>
   </property>
   <property>
     <name>yarn.timeline-service.http-cross-origin.enabled</name>
     <value>true</value>
   </property>

Start the timeline server after setting the right environment variables.

yarn-daemon.sh start timelineserver
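
To verify the timeline server is up, hit its REST endpoint (a minimal check; 8188 is the default HTTP port unless you overrode yarn.timeline-service.webapp.address):

curl http://ATS_HOSTNAME:8188/ws/v1/timeline/
# a small JSON "About" response indicates the server is accepting requests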

Install httpd

We are going to use the Apache web server (httpd) to host the Tez UI.

yum install httpd

Extract the tez-ui.war file that was built as part of Tez into the directory pointed to by DocumentRoot (default /var/www/html).

cd /var/www/html
mkdir tez-ui
cd tez-ui
jar xf tez-ui.war
ln -s . ui   # the YARN UI looks for <base_url>/ui while the app is still running
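
Then start httpd and confirm the UI is being served (a quick sketch; UI_HOSTNAME is the host running httpd):

service httpd start
curl -s -o /dev/null -w "%{http_code}\n" http://UI_HOSTNAME/tez-ui/
# expect 200 once the war contents are in place under the document root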

Setup YARN on CDH

Add the following properties to the YARN configuration of the CDH cluster in “YARN Service Advanced Configuration Snippet (Safety Valve) for yarn-site.xml” and restart the service.

<property>
  <name>yarn.timeline-service.http-cross-origin.enabled</name>
   <value>true</value>
</property>
<property>
   <name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
   <value>true</value>
</property>
<property>
   <name>yarn.timeline-service.generic-application-history.enabled</name>
   <value>true</value>
</property>
<property>
   <name>yarn.timeline-service.enabled</name>
   <value>true</value>
</property>
<property>
   <name>yarn.timeline-service.hostname</name>
   <value>ATS_HOSTNAME</value>
</property>

Setup Tez

Copy the tez-0.8.4.tar.gz that was created when you built Tez to HDFS. Let’s say we copied it to /apps/tez. This tarball has all the files needed by Tez.

hdfs dfs -copyFromLocal tez-0.8.4.tar.gz /apps/tez

On the client, extract tez-0.8.4.tar.gz into an installation directory (let’s use /usr/local/tez/client).

ssh root@<client>
mkdir -p /usr/local/tez/client 
cd /usr/local/tez/client
tar xvzf <location on FS>/tez-0.8.4.tar.gz

Set up /etc/tez/conf/tez-site.xml on the client. The following are some of the properties needed to make Tez work properly.

  <property>
    <name>tez.lib.uris</name>
    <value>/apps/tez/tez-0.8.4.tar.gz</value>
  </property>
  <property>
    <name>tez.use.cluster.hadoop-libs</name>
    <value>false</value>
  </property>
  <property>
    <name>tez.tez-ui.history-url.base</name>
    <value>UI_HOSTNAME/tez-ui</value>
  </property>
  <property>
    <name>yarn.timeline-service.hostname</name>
    <value>ATS_HOSTNAME</value>
  </property>
  <property>
    <name>yarn.timeline-service.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>tez.task.launch.env</name>
    <value>LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native</value>
  </property>
  <property>
    <name>tez.am.launch.env</name>
    <value>LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native</value>
  </property>
  <property>
    <name>tez.history.logging.service.class</name>
    <value>org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService</value>
  </property>

Ensure HADOOP_CLASSPATH is set properly on the Tez client:

export TEZ_CONF_DIR=/etc/tez/conf
export TEZ_JARS=/usr/local/tez/client
export HADOOP_CLASSPATH=${TEZ_CONF_DIR}:${TEZ_JARS}/*:${TEZ_JARS}/lib/*:`hadoop classpath`
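
A quick way to confirm the Tez jars and configuration actually made it onto the Hadoop classpath (a minimal check):

hadoop classpath | tr ':' '\n' | grep -i tez
# expect /etc/tez/conf plus the /usr/local/tez/client entries in the output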

If you are using HiveServer2 clients (Beeline, Hue), ensure you have extracted the tez-0.8.4.tar.gz contents into /usr/local/tez/client (as an example) and created /etc/tez/conf/tez-site.xml on the HiveServer2 host, just as you did on the Tez client. In addition, set HADOOP_CLASSPATH with actual values in the Hive configuration on Cloudera Manager under “Hive Service Environment Advanced Configuration Snippet (Safety Valve)”. Once done, restart Hive.

HADOOP_CLASSPATH=/etc/tez/conf:/usr/local/tez/client/*:/usr/local/tez/client/lib/*:/etc/hadoop/conf:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop/.//*:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop-hdfs/./:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop-hdfs/.//*:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop-yarn/.//*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/.//*

Run clients

-bash-4.1$ hive

Logging initialized using configuration in jar:file:/apps/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/jars/hive-common-1.1.0-cdh5.8.3.jar!/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> set hive.execution.engine=tez;
hive> select count(*) from warehouse.dummy;
Query ID = nex37045_20170305002020_18680c2c-0df1-4f5c-b0a5-5d511da78c63
Total jobs = 1
Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1488498341718_39067)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 6.91 s     
--------------------------------------------------------------------------------
OK
1
Time taken: 19.114 seconds, Fetched: 1 row(s)
hive>

You should see details for this job in Tez UI now.


Building a rack-aware mirroring scheme in Greenplum

Summary

Customers using software-only installations of Greenplum have the option to configure mirrors that fit their availability and performance needs. This post describes an approach that maximizes availability and performance when a Greenplum cluster spans servers on two or more racks.

Rack/Server setup

In a typical data center, customers usually have servers in more than one rack. To mitigate the risk of a platform service (such as GPDB) going down because a node or an entire rack goes down, it’s recommended to configure the cluster on servers spread across multiple racks. By default, GPDB comes with a standby node for the master services (in active-passive mode). This host should be placed on a rack different from the master node. In addition, GPDB provides mirroring for protection against data loss due to node failures. These mirrors can be configured to protect against failures at the rack level (power failures, TOR switch issues, etc.), enhancing the availability of the cluster by a large degree. The following picture depicts how we did this at one of our customer sites:

Important things to notice:

  • Master and standby nodes are on separate racks. A VIP in front of them directs traffic to the ‘live’ master.
  • Segments on a worker node in a given rack have their mirrors spread across segment hosts on different racks, using the spread mirroring scheme (arrows depict the primary -> mirror relationship).
  • Every node in the cluster hosts the same number of mirrors, thus providing the same performance degradation in case of node failures (e.g., in a setup with 4 mirrors per segment host as depicted in the diagram, when a node fails, each of its mirror hosts picks up 1 mirror).

More details…

We used separate ports for the master (for incoming connection requests) and standby nodes (for replication traffic). In addition, an F5 hardware load balancer detected which node was the ‘real’ master by probing the port the master is supposed to be listening on. Finally, we had to manually move the mirrors with the ‘gpmovemirrors’ utility using a custom configuration file; a query for reviewing current segment placement is sketched after the list below. We tested the setup in three separate ways:

  1. Remove rack 3 by rebooting all the servers on that rack in the live cluster. Check that those segments fail over with minimal disruption.
  2. Remove rack 2 by rebooting all the servers on that rack in the live cluster. The cluster should still be available.
  3. Remove rack 1 by rebooting all the servers on that rack in the live cluster. Fail the master over to rack 2 and bring the cluster up. It should work just fine.
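
To review where primaries and mirrors currently live (before building the gpmovemirrors configuration and again after the move), here is a minimal sketch against the standard Greenplum catalog (adjust the database name for your environment):

psql -d gpadmin -c "SELECT content, role, hostname, port FROM gp_segment_configuration ORDER BY content, role;"
# role 'p' = primary, 'm' = mirror; check that each primary's mirror lands on a host in a different rack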

Performing VACUUM FULL in Greenplum efficiently

Summary

Greenplum is an MPP warehouse platform based on the PostgreSQL database. We discuss one of the most important and most common maintenance tasks that needs to be executed on a periodic basis on the platform.

What causes bloat?

The Greenplum platform is ACID compliant. The isolation property ensures that the concurrent execution of transactions results in a system state that would be obtained if the transactions were executed serially, i.e., one after the other. Providing isolation is the main goal of concurrency control. Greenplum (PostgreSQL, really) implements isolation with Multi-Version Concurrency Control (MVCC). In normal PostgreSQL operation, rows that are deleted or obsoleted by an update are not physically removed from their table; they remain present until a VACUUM is done. Therefore it’s necessary to run VACUUM periodically, especially on frequently updated tables. Failing to do this maintenance, which allows the space to be reused, causes the table’s data files to grow and therefore scans of the table take longer.

Options to cleanup the bloat

VACUUM FULL

This is the slowest method of the lot. When the VACUUM FULL command is executed, rows in the table are reshuffled; if there is a large number of rows, this repeated reshuffling of data leads to very unpredictable run times.

A bloated catalog table (i.e., any table under the pg_catalog schema) can only use this method to remove the bloat, so cleaning up highly bloated catalog tables can be time-consuming.
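
For example, a minimal sketch of running it against a catalog table (the table name is illustrative; expect an exclusive lock on the table for the duration):

psql -d gpadmin -c "VACUUM FULL pg_catalog.pg_attribute;"
psql -d gpadmin -c "ANALYZE pg_catalog.pg_attribute;"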

Create Table As Select (CTAS)

CTAS is a quick method to remove bloat. Using this method also avoids the exclusive table lock that the VACUUM FULL method acquires to do the operation, thus allowing end users to keep using the table while maintenance is being performed on it.

The disadvantage is that it involves many steps to perform the activity:

  1. Obtain the DDL of the table using the pg_dump -s -t <bloated-table-name> <database> command. This creates a script with all the sub-objects (such as indexes) that are involved with the table. It also provides the list of grants associated with the table.
  2. Once the DDL is obtained, replace <bloated-table-name> with <new-table-name> using find/replace in any editor of your choice, then execute the file in psql to create the new objects in the database.
  3. Then follow the steps in the code below:
INSERT INTO <schema-name>.<new-table-name> SELECT * FROM <schema-name>.<bloated-table-name>;
ALTER TABLE <schema-name>.<bloated-table-name> RENAME TO <table-name-to-drop>;
ALTER TABLE <schema-name>.<new-table-name> RENAME TO <bloated-table-name>;
-- Once users confirm everything is good.
DROP TABLE <schema-name>.<bloated-table-name>;

Backup/Restore

This is another way to clear up the bloat: back up the tables and then restore them. Tools that can be used to achieve this are listed below, followed by a short sketch of the COPY-based variant:

  1. gpcrondump / gpdbrestore
  2. gp_dump / gp_restore
  3. pg_dump / pg_restore
  4. COPY .. TO .. / COPY .. FROM ..
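
Here is a minimal sketch of the COPY-based variant (database, table and file paths are illustrative; the TRUNCATE is what gives the table fresh data files and releases the bloat):

# dump the table contents to a file on the master host
psql -d gpadmin -c "COPY myschema.bloated_table TO '/tmp/bloated_table.dat';"
# truncate and reload; run during a maintenance window since the data is briefly absent
psql -d gpadmin -c "TRUNCATE myschema.bloated_table; COPY myschema.bloated_table FROM '/tmp/bloated_table.dat';"
psql -d gpadmin -c "ANALYZE myschema.bloated_table;"
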
Redistribute

This is one of the quickest methods of the lot. When you execute a redistribute command, the database internally creates a new file and loads the existing data into it; once the load is done, it removes the old file, effectively eliminating the bloat. The following script automates this process by generating a redistribution script that preserves the distribution key order.

SELECT 'ALTER TABLE '||n.nspname||'.'||c.relname||' SET with (reorganize=false) DISTRIBUTED RANDOMLY;\n'||
 'ALTER TABLE '||n.nspname||'.'||c.relname||' SET with (reorganize=true) DISTRIBUTED'||
 CASE
 WHEN length(dist_col) > 0 THEN
 ' BY ('||dist_col||');\n'
 ELSE
 ' RANDOMLY;\n'
 END||'ANALYZE '||n.nspname||'.'||c.relname||';\n'
 FROM pg_class c, pg_namespace n,
 (SELECT pc.oid,
 string_agg(attname, ', ' order by colorder) AS dist_col
 FROM pg_class AS pc
 LEFT JOIN (SELECT localoid,
 unnest(attrnums) as colnum,
 generate_series(1, array_upper(attrnums, 1)) as colorder
 FROM gp_distribution_policy
 ) AS d
 ON (d.localoid = pc.oid)
 LEFT JOIN pg_attribute AS a
 ON (d.localoid = a.attrelid AND d.colnum = a.attnum)
 GROUP BY
 pc.oid
 ) AS keys
 WHERE c.oid = keys.oid
 AND c.relnamespace = n.oid
 AND c.relkind='r'
 AND c.relstorage='h'
 -- Following filter takes out all partitions, we care just about master table
 AND c.oid NOT IN (SELECT inhrelid FROM pg_inherits)
 AND n.nspname not in (
 'gp_toolkit',
 'pg_toast',
 'pg_bitmapindex',
 'pg_aoseg',
 'pg_catalog',
 'information_schema')
 AND n.nspname not like 'pg_temp%';

When we execute this script against a database, it produces SQL that will clean up the bloat when run against that database.

psql -t gpadmin -f fullvacuum.sql |sed 's/\s*$//'

ALTER TABLE public.testtable SET with (reorganize=false) DISTRIBUTED RANDOMLY;
ALTER TABLE public.testtable SET with (reorganize=true) DISTRIBUTED BY (n);
ANALYZE public.testtable;
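
To apply it, one approach (a small sketch; file names are illustrative) is to capture the generated statements into a file and feed them back through psql:

psql -t -d gpadmin -f fullvacuum.sql | sed 's/\s*$//' > fullvacuum_generated.sql
psql -d gpadmin -f fullvacuum_generated.sql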

Design Pattern: Avoid HAWQ creating too many small files on HDFS for tables

Summary

This post deals with a rather little-known problem HAWQ has on Hadoop clusters: creating too many small files on HDFS even when there is a negligible amount of data in the tables. It describes our encounter with this problem and how we overcame it. Bear in mind that the solution may not work for your use case, so you may have to get a little creative to work around the problem.

Problem

You have probably heard that for larger read-only tables, especially ones with lots of columns, using column orientation is a good idea. Column-oriented tables can offer better query performance on wide tables (lots of columns) where you typically access only a small subset of columns in your queries; it is one of the strategies used to reduce disk IO. At our install site, HAWQ is used as an archival platform for ERP databases, which tend to have lots and lots of tables with a negligible amount of data. The data is rarely queried, and when it is, the queries tend to be simple (minimal joins). Going by the above recommendation, architects had asked all internal customers to create tables using column orientation as the default storage format. After a few months, the name node started crashing because its JVM kept running out of memory. When we looked into the problem, the largest database, with 34,145 tables (all column-oriented), had 5,323,424 files on HDFS (the cluster has 116 segments). Overall, the number of files on HDFS across all databases in the cluster had reached 120 million, at which point the name node JVM kept running out of memory, making the cluster highly unstable.

In a nutshell, the problem arises because HAWQ, like the Greenplum database, implements columnar storage with one file per column per segment. As you may already know, Hadoop doesn’t do well with lots and lots of really small files. Let’s look at the technical details:

[gpadmin@hdm2 migration]$ psql manoj
psql (8.2.15)
Type "help" for help.
-- CREATE TABLE
manoj=# create table testhdfs (mynum1 integer, mynum2 integer, mytext1 char(2), mytext2 char(2)) 
manoj-# with (appendonly=true, orientation=column) distributed randomly;
CREATE TABLE

-- INSERT A ROW
manoj=# insert into testhdfs values (1, 2, 'v1', 'v2');
INSERT 0 1

Now, let’s look at how many files this table created. HAWQ stores the files for this table on HDFS under /hawq_data/<segment>/<tablespace ID>/<database ID>.

[gpadmin@hdm2 migration]$ hadoop fs -ls /hawq_data/gpseg0/16385/5699341
Found 7 items
-rw------- 3 postgres gpadmin 0 2015-02-12 18:35 /hawq_data/gpseg0/16385/5699341/5699342
-rw------- 3 postgres gpadmin 0 2015-02-12 18:58 /hawq_data/gpseg0/16385/5699341/5699432
-rw------- 3 postgres gpadmin 0 2015-02-12 18:58 /hawq_data/gpseg0/16385/5699341/5699432.1
-rw------- 3 postgres gpadmin 0 2015-02-12 18:58 /hawq_data/gpseg0/16385/5699341/5699432.129
-rw------- 3 postgres gpadmin 0 2015-02-12 18:58 /hawq_data/gpseg0/16385/5699341/5699432.257
-rw------- 3 postgres gpadmin 0 2015-02-12 18:58 /hawq_data/gpseg0/16385/5699341/5699432.385
-rw------- 3 postgres gpadmin 4 2015-02-12 18:35 /hawq_data/gpseg0/16385/5699341/PG_VERSION

As you can see, it created one file per column per segment (even when there’s no data for this particular segment). That is a HUGE problem at scale.

Solution

There are a couple of ways to deal with this. One option is to use the Parquet format to store the data:

with (appendonly=true, orientation=parquet);
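
For example, a minimal sketch reusing the earlier testhdfs columns (the table name is illustrative):

psql -d manoj <<'SQL'
-- one data file per segment instead of one file per column per segment
CREATE TABLE testhdfs_parquet (mynum1 integer, mynum2 integer, mytext1 char(2), mytext2 char(2))
WITH (appendonly=true, orientation=parquet)
DISTRIBUTED RANDOMLY;
INSERT INTO testhdfs_parquet VALUES (1, 2, 'v1', 'v2');
SQL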

This will create one file per segment for storing the data. You can use a nifty tool called “gpextract” to see where these files are stored on HDFS.

gpextract -o manoj.yaml -W testhdfs -dmanoj

[gpadmin@hdm2 migration]$ cat manoj.yaml 
DBVersion: PostgreSQL 8.2.15 (Greenplum Database 4.2.0 build 1) (HAWQ 1.2.0.1 build
 8119) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Apr
 23 2014 16:12:32
DFS_URL: hdfs://hdm1.gphd.local:8020
Encoding: UTF8
FileFormat: Parquet
Parquet_FileLocations:
 Checksum: false
 CompressionLevel: 0
 CompressionType: null
 EnableDictionary: false
 Files:
 - path: /hawq_data/gpseg0/16385/5699341/5699460.1
 size: 0
 - path: /hawq_data/gpseg1/16385/5699341/5699460.1
 size: 0
 - path: /hawq_data/gpseg2/16385/5699341/5699460.1
 size: 0
 - path: /hawq_data/gpseg3/16385/5699341/5699460.1
 size: 0
 - path: /hawq_data/gpseg4/16385/5699341/5699460.1
 size: 0
 - path: /hawq_data/gpseg5/16385/5699341/5699460.1
 size: 1371
 - path: /hawq_data/gpseg6/16385/5699341/5699460.1
 size: 0
 - path: /hawq_data/gpseg7/16385/5699341/5699460.1
 size: 0
 - path: /hawq_data/gpseg8/16385/5699341/5699460.1
 size: 0
 - path: /hawq_data/gpseg9/16385/5699341/5699460.1
 size: 0
 - path: /hawq_data/gpseg10/16385/5699341/5699460.1
 size: 0
 - path: /hawq_data/gpseg11/16385/5699341/5699460.1
 size: 0
 - path: /hawq_data/gpseg12/16385/5699341/5699460.1
 size: 0
 - path: /hawq_data/gpseg13/16385/5699341/5699460.1
 size: 0
 - path: /hawq_data/gpseg14/16385/5699341/5699460.1
 size: 0
 - path: /hawq_data/gpseg15/16385/5699341/5699460.1
 size: 0
 PageSize: 1048576
 RowGroupSize: 8388608
TableName: public.testhdfs
Version: 1.0.0

Another way is to simply store the data in flat files on HDFS and create external tables on top of them. We used this approach because it creates just one file per table in the whole cluster. We did not need the read performance, as this is an archival database, so it was an easy choice.

Conclusion

HAWQ provides a nice SQL layer on HDFS; use the tool the right way to solve your business problem. In addition, the Parquet format is easy to use in HAWQ and doesn’t lock you into a Pivotal HD/HAWQ-only solution: it is easy to use other tools like Pig or MapReduce to read the Parquet files in your Hadoop cluster.

Authorization in Hadoop using Apache Ranger

Summary

Over the past couple of years, Apache Hadoop has made great progress in the area of security. Security for any computing system falls into two categories:

  1. Authentication is the process of ascertaining that somebody really is who he claims to be. In Hadoop, this is achieved via Kerberization. This post will not cover any details about this as there’s ample material already available online.
  2. Authorization refers to rules that determine who is allowed to do what. E.g. Manoj may be authorized to create and delete databases, while Terence is only authorized to read the tables.

For folks coming from a traditional database world, authorization is a well-understood concept: it involves the creation of roles and grants at various levels on a variety of objects. The Apache Hadoop community is trying to emulate some of those concepts in the recently incubated Apache Ranger project. We recently implemented this at one of our customer sites and describe that experience here.

What does Ranger do?

Ranger provides a centralized way to manage security across various components in a Hadoop cluster. Currently, it’s capable of providing authorization as well as auditing for the HDFS, Hive, HBase, Knox and Storm services. At its core, Ranger is a centralized web application consisting of policy administration, audit and reporting modules. Authorized users can manage their security policies using the web tool or the REST APIs. These security policies are enforced within the Hadoop ecosystem using lightweight Ranger Java plugins, which run as part of the same process as the NameNode (HDFS), HiveServer2 (Hive), HBase server (HBase), Nimbus server (Storm) and Knox server (Knox). Thus there is no additional OS-level process to manage. This also means there is no single point of failure: for example, if the web application or the policy database goes down, security is not compromised; it just prevents the security administrator from pushing new policies.
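
As an illustration of the REST side, here is a hedged sketch of listing existing policies from the policy manager. The /service/public/api/policy path and the default admin:admin credentials are what the HDP 2.2-era Ranger exposed in our environment; treat both as assumptions and check the API docs for your Ranger version.

# list policies as the Ranger admin user (the portal listens on 6080 by default)
curl -s -u admin:admin "http://<portal host>:6080/service/public/api/policy"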

Components of Ranger

There are three main components of Ranger:

  1. Portal/Policy Manager is the central UI for security administration. Users can create and update policies, which are then stored in a policy database. Plugins within each component poll these policies at regular intervals.
  2. Plugins are lightweight Java programs that embed within the processes of each cluster component. For example, the plugin for HDFS runs as part of the namenode process. The plugins pull policies from the policy database at regular (configurable) intervals and store them locally in a file. Whenever a request is made for a resource, the plugin intercepts the request and evaluates it against the security policy in effect. Plugins also collect data for all requests and send it back to the audit server via a separate thread.
  3. User/Group Sync is a utility provided to synchronize users and groups from the OS, LDAP or AD. This information is used while defining policies (we’ll shortly see an example).

Setup

Ranger Admin

Ranger software is already included in the HDP 2.2 repos. Find the Ranger policy admin package (assuming you have set up your yum repos correctly):

yum search ranger
====================================================== N/S Matched: ranger =========================================
ranger_2_2_0_0_1947-admin.x86_64 : Web Interface for Ranger
ranger_2_2_0_0_1947-debuginfo.x86_64 : Debug information for package ranger_2_2_0_0_1947
ranger_2_2_0_0_1947-hbase-plugin.x86_64 : ranger plugin for hbase
ranger_2_2_0_0_1947-hdfs-plugin.x86_64 : ranger plugin for hdfs
ranger_2_2_0_0_1947-hive-plugin.x86_64 : ranger plugin for hive
ranger_2_2_0_0_1947-knox-plugin.x86_64 : ranger plugin for knox
ranger_2_2_0_0_1947-storm-plugin.x86_64 : ranger plugin for storm
ranger_2_2_0_0_1947-usersync.x86_64 : Synchronize User/Group information from Corporate LD/AD or Unix

Install the admin module

yum install ranger_2_2_0_0_1947-admin

In the installation directory (/usr/hdp/current/ranger-admin/), edit the install.properties file:

SQL_COMMAND_INVOKER=mysql
SQL_CONNECTOR_JAR=/usr/share/java/mysql-connector-java.jar

# DB password for the DB admin user-id
db_root_user=root
db_root_password=<password>
db_host=<database host>

#
# DB UserId used for the XASecure schema
#
db_name=ranger
db_user=rangeradmin
db_password=<password>

# DB UserId for storing audit log information
#
# * audit_db can be same as the XASecure schema db
# * audit_db must exist in the same ${db_host} as the xaserver database ${db_name}
# * audit_user must be a different user than db_user (as the audit user has access to only the audit tables)
#
audit_db_name=ranger_audit
audit_db_user=rangerlogger
audit_db_password=<password>

#
# ------- PolicyManager CONFIG ----------------
#

policymgr_external_url=http://<portal host>:6080
policymgr_http_enabled=true

#
# ------- UNIX User CONFIG ----------------
#
unix_user=ranger
unix_group=ranger

#
# ** The xasecure-unix-ugsync package can be installed after the policymanager installation is finished.
#
#LDAP|ACTIVE_DIRECTORY|UNIX|NONE
authentication_method=NONE

Run the setup as root:

export JAVA_HOME=<path of installed jdk version folder>
/usr/hdp/current/ranger-admin/setup.sh
service ranger-admin start
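
Once it starts, a quick check that the policy manager web application is listening (a minimal sketch; 6080 matches the policymgr_external_url set above):

curl -s -o /dev/null -w "%{http_code}\n" http://<portal host>:6080/
# a 200 (or a redirect code to the login page) means ranger-admin is up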

UserSync

Install the module

yum install ranger_2_2_0_0_1947-usersync

In the Ranger UserSync installation directory (/usr/hdp/current/ranger-usersync), update the install.properties file as appropriate for HDP 2.2:

POLICY_MGR_URL = http://<portal host>:6080

# sync source,  only unix and ldap are supported at present
# defaults to unix
SYNC_SOURCE =  ldap

# sync interval in minutes
# user, groups would be synced again at the end of each sync interval
# defaults to 5   if SYNC_SOURCE is unix
# defaults to 360 if SYNC_SOURCE is ldap
SYNC_INTERVAL=1

#User and group for the usersync process
unix_user=ranger
unix_group=ranger

# URL of source ldap
# a sample value would be:  ldap://ldap.example.com:389
# Must specify a value if SYNC_SOURCE is ldap
SYNC_LDAP_URL = ldap://<ldap host>:389

# ldap bind dn used to connect to ldap and query for users and groups
# a sample value would be cn=admin,ou=users,dc=hadoop,dc=apache,dc-org
# Must specify a value if SYNC_SOURCE is ldap
SYNC_LDAP_BIND_DN = <bind username>

# ldap bind password for the bind dn specified above
# please ensure read access to this file  is limited to root, to protect the password
# Must specify a value if SYNC_SOURCE is ldap
# unless anonymous search is allowed by the directory on users and group
SYNC_LDAP_BIND_PASSWORD = <password>
CRED_KEYSTORE_FILENAME=/usr/lib/xausersync/.jceks/xausersync.jceks

# search base for users
# sample value would be ou=users,dc=hadoop,dc=apache,dc=org
SYNC_LDAP_USER_SEARCH_BASE = <Value depends upon your LDAP setup>

# search scope for the users, only base, one and sub are supported values
# please customize the value to suit your deployment
# default value: sub
SYNC_LDAP_USER_SEARCH_SCOPE = sub

# objectclass to identify user entries
# please customize the value to suit your deployment
# default value: person
SYNC_LDAP_USER_OBJECT_CLASS = person

# optional additional filter constraining the users selected for syncing
# a sample value would be (dept=eng)
# please customize the value to suit your deployment
# default value is empty
SYNC_LDAP_USER_SEARCH_FILTER = <Value depends upon your LDAP setup>

# attribute from user entry that would be treated as user name
# please customize the value to suit your deployment
# default value: cn
SYNC_LDAP_USER_NAME_ATTRIBUTE=sAMAccountName

# attribute from user entry whose values would be treated as
# group values to be pushed into Policy Manager database
# You could provide multiple attribute names separated by comma
# default value: memberof, ismemberof
SYNC_LDAP_USER_GROUP_NAME_ATTRIBUTE=memberOf

#
# UserSync - Case Conversion Flags
# possible values:  none, lower, upper
SYNC_LDAP_GROUPNAME_CASE_CONVERSION=lower

NOTE: Customize SYNC_LDAP_USER_SEARCH_FILTER parameter to suit your needs.

Run the setup:

export JAVA_HOME=<path of installed jdk version folder>
/usr/hdp/current/ranger-usersync/setup.sh

service ranger-usersync start

Verify by visiting the Ranger portal and clicking the Users/Groups tab. You should see all LDAP users. Furthermore, if you add an LDAP/AD user or group, it should show up in the portal within SYNC_INTERVAL.

PlugIns

We will go over one of the plugins; a similar setup should be followed for any other plugins you are interested in.

HDFS

On the NameNode (in the case of an HA NameNode setup, on all NameNodes), install the plugin.

yum install ranger_2_2_0_0_1947-hdfs-plugin

In the plugin installation directory (/usr/hdp/current/ranger-hdfs-plugin), edit install.properties:

POLICY_MGR_URL=http://<portal host>:6080
SQL_CONNECTOR_JAR=/usr/share/java/mysql-connector-java.jar
#
# Example:
# REPOSITORY_NAME=hadoopdev
#
REPOSITORY_NAME=<This is the repo that'll be looked up when plugin is loaded>
XAAUDIT.DB.IS_ENABLED=true
XAAUDIT.DB.FLAVOUR=MYSQL
XAAUDIT.DB.HOSTNAME=<database host>
XAAUDIT.DB.DATABASE_NAME=ranger_audit
XAAUDIT.DB.USER_NAME=rangerlogger
XAAUDIT.DB.PASSWORD=<password>
XAAUDIT.HDFS.IS_ENABLED=true
XAAUDIT.HDFS.DESTINATION_DIRECTORY=hdfs://<NameNode>:8020/ranger/audit/%app-type%/%time:yyyyMMdd%
XAAUDIT.HDFS.LOCAL_BUFFER_DIRECTORY=/var/log/hadoop/%app-type%/audit
XAAUDIT.HDFS.LOCAL_ARCHIVE_DIRECTORY=/var/log/hadoop/%app-type%/audit/archive

Run the script to enable the plugin:

export JAVA_HOME=<path of installed jdk version folder>
/usr/hdp/current/ranger-hdfs-plugin/enable-hdfs-plugin.sh

Restart namenode(s) from Ambari or manually.

Test the setup

  • On the Ranger portal, click “Policy Manager”. Click the “+” sign on the HDFS tab and create a repository. Ensure the name of this repository is EXACTLY the same as the one you specified during installation (REPOSITORY_NAME).
  • Let’s take a test user “svemuri” and check his permissions on a test directory:
[svemuri@sfdmgctmn005 ~]$ hadoop fs -ls /user/mmurumkar
ls: Permission denied: user=svemuri, access=READ_EXECUTE,
inode="/user/mmurumkar":mmurumkar:sfdmgct_admin:drwxr-x---:user:mmurumkar:r--,user:ranger:---,user:rbolla:---,user:svemuri:---,group::r-x
  • Now, let’s create a policy called “TestPolicy” that allows “svemuri” all privileges on “/user/mmurumkar”.


  • Now the earlier command should work:
[svemuri@sfdmgctmn005 ~]$ hadoop fs -ls /user/mmurumkar
Found 9 items
drwxr-xr-x   - mmurumkar sfdmgct_admin          0 2014-11-20 02:22 /user/mmurumkar/.hiveJars
drwxr-xr-x   - mmurumkar sfdmgct_admin          0 2014-11-18 20:00 /user/mmurumkar/test
drwxr-xr-x   - mmurumkar sfdmgct_admin          0 2014-11-18 20:01 /user/mmurumkar/test1
drwxr-xr-x   - mmurumkar sfdmgct_admin          0 2014-11-18 20:08 /user/mmurumkar/test2
drwxr-xr-x   - rbolla    sfdmgct_admin          0 2014-11-18 20:09 /user/mmurumkar/test3
drwxr-xr-x   - rbolla    sfdmgct_admin          0 2014-11-18 20:10 /user/mmurumkar/test4
drwxr-xr-x   - ranger    sfdmgct_admin          0 2014-11-18 20:18 /user/mmurumkar/test5
drwxr-xr-x   - mmurumkar sfdmgct_admin          0 2014-11-20 18:01 /user/mmurumkar/test7
drwxr-xr-x   - ranger    sfdmgct_admin          0 2014-11-19 14:21 /user/mmurumkar/test8
  • Audit records will now show up in the Audit UI on the portal.

Conclusion

Apache Ranger is starting to fill critical security needs in the Hadoop environment, marking big progress toward making Hadoop an enterprise data platform.