This post deals with rather unknown problem that HAWQ has on hadoop clusters, specifically creating too many small files on HDFS even when there’s negligible amount of data in tables. This post talks about our encounter with this problem and we overcame that issue. Bear in mind that the solution may not work for your use case, so you may have to get little creative to workaround the problem.
I am sure you have heard that for larger read-only tables and especially the ones with lots of columns, using Column orientation is a good idea. Column-oriented tables can offer better query performance on wide tables (lots of columns) where you typically only access a small subset of columns in your queries. It is one of the strategies used to reduce Disk IO. At our install site, HAWQ is used as an archival platform for ERP databases that tend to have lots and lots of tables with negligible amount of data. The data is rarely queried and when it is queried, queries tend to be simple queries (with minimal joins). Going by the above recommendation, architects had asked all the internal customers to create tables using that as default storage format. After few months, name node started crashing because JVM started running out of memory. When we looked into the problem, largest database with 34145 tables (all with column orientation) has 5,323,424 files on HDFS (cluster has 116 segments). Overall, number of files on HDFS for all databases in the cluster had reached 120 million at which point name node JVM kept running out of memory making the cluster highly unstable.
Problem in a nutshell arises because, HAWQ, like Greenplum database, implements columnar storage with one file per column. As you may already know, Hadoop doesn’t do well with lots and lots of really small files. Let’s look at the technical details:
[gpadmin@hdm2 migration]$ psql manoj psql (8.2.15) Type "help" for help. -- CREATE TABLE manoj=# create table testhdfs (mynum1 integer, mynum2 integer, mytext1 char(2), mytext2 char(2)) manoj-# with (appendonly=true, orientation=column) distributed randomly; CREATE TABLE -- INSERT A ROW manoj=# insert into testhdfs values (1, 2, 'v1', 'v2'); INSERT 0 1
Now, let’s look at how many files this table has created. HAWQ stores files for this on HDFS at this location (/hawq_data/<segment>/<tablespace ID>/<database ID>).
[gpadmin@hdm2 migration]$ hadoop fs -ls /hawq_data/gpseg0/16385/5699341 Found 7 items -rw------- 3 postgres gpadmin 0 2015-02-12 18:35 /hawq_data/gpseg0/16385/5699341/5699342 -rw------- 3 postgres gpadmin 0 2015-02-12 18:58 /hawq_data/gpseg0/16385/5699341/5699432 -rw------- 3 postgres gpadmin 0 2015-02-12 18:58 /hawq_data/gpseg0/16385/5699341/5699432.1 -rw------- 3 postgres gpadmin 0 2015-02-12 18:58 /hawq_data/gpseg0/16385/5699341/5699432.129 -rw------- 3 postgres gpadmin 0 2015-02-12 18:58 /hawq_data/gpseg0/16385/5699341/5699432.257 -rw------- 3 postgres gpadmin 0 2015-02-12 18:58 /hawq_data/gpseg0/16385/5699341/5699432.385 -rw------- 3 postgres gpadmin 4 2015-02-12 18:35 /hawq_data/gpseg0/16385/5699341/PG_VERSION
As you can see, it created 1 file per column per segment (even where there’s no data for this particular segment). That is a HUGE problem.
There are couple of ways to deal with this. One option is to use parquet format to store the data.
with (appendonly=true, orientation=parquet);
This will create one file for storing data per segment. You can use a nifty tool called “gpextract” to see where these files are stored on HDFS.
gpextract -o manoj.yaml -W testhdfs -dmanoj [gpadmin@hdm2 migration]$ cat manoj.yaml DBVersion: PostgreSQL 8.2.15 (Greenplum Database 4.2.0 build 1) (HAWQ 184.108.40.206 build 8119) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Apr 23 2014 16:12:32 DFS_URL: hdfs://hdm1.gphd.local:8020 Encoding: UTF8 FileFormat: Parquet Parquet_FileLocations: Checksum: false CompressionLevel: 0 CompressionType: null EnableDictionary: false Files: - path: /hawq_data/gpseg0/16385/5699341/5699460.1 size: 0 - path: /hawq_data/gpseg1/16385/5699341/5699460.1 size: 0 - path: /hawq_data/gpseg2/16385/5699341/5699460.1 size: 0 - path: /hawq_data/gpseg3/16385/5699341/5699460.1 size: 0 - path: /hawq_data/gpseg4/16385/5699341/5699460.1 size: 0 - path: /hawq_data/gpseg5/16385/5699341/5699460.1 size: 1371 - path: /hawq_data/gpseg6/16385/5699341/5699460.1 size: 0 - path: /hawq_data/gpseg7/16385/5699341/5699460.1 size: 0 - path: /hawq_data/gpseg8/16385/5699341/5699460.1 size: 0 - path: /hawq_data/gpseg9/16385/5699341/5699460.1 size: 0 - path: /hawq_data/gpseg10/16385/5699341/5699460.1 size: 0 - path: /hawq_data/gpseg11/16385/5699341/5699460.1 size: 0 - path: /hawq_data/gpseg12/16385/5699341/5699460.1 size: 0 - path: /hawq_data/gpseg13/16385/5699341/5699460.1 size: 0 - path: /hawq_data/gpseg14/16385/5699341/5699460.1 size: 0 - path: /hawq_data/gpseg15/16385/5699341/5699460.1 size: 0 PageSize: 1048576 RowGroupSize: 8388608 TableName: public.testhdfs Version: 1.0.0
Another way is to just copy the data in a flat file format on HDFS and create extrenal tables on it. We used this approach as this creates just one file per table in the whole cluster. We did not need the read performance, as this is an archival database, so it was an easy choice.
HAWQ provides a nice SQL layer on HDFS, use the tool in right way to solve your business problem. In addition, parquet format is easy to use in HAWQ and doesn’t lock you into a Pivotal HD/HAWQ only solution. It is easy to use the other tools like Pig or MapReduce to read the Parquet files in your Hadoop cluster.
Fun fact: we created those tables, and ingested the data in about 40 hours via Talend ETL using 8 threads.