From discussions that spun off from the last article on “Data Modeling in the Big Data Era,” it became apparent that a discussion of the Hadoop Distributed File System (HDFS) was warranted, as it is essentially the physical implementation of any Hive or Impala model, and design decisions here also touch on several security concerns. As you know, security is implemented in layers, and this is the lowest layer of security, the last line of defense. This might be more of a Big Data Architect discussion, but it contains information that I believe a data modeler should consider as well.
As you read through this discussion, you might think that we are treading into the Hadoop Administrator’s role. Just as with an RDBMS, when a data modeler develops the conceptual and logical models, eventually the physical model must be deployed. This is where you develop the Data Definition Language (DDL) scripts, add fields to support the ETL framework, and decide on things like partitioning, whether to create views, any denormalization required for performance, table and column naming standards, and whether to use materialized views. This is not an exhaustive list, but these are all things that the physical data modeler is responsible for.
The same holds true in defining the physical data model for a Hadoop cluster, where part of the HDFS file system is the data model. As we will see below, Hive, Impala, Spark, and other components in the ecosystem treat directory structures as tables. How these directories are structured has implications for the data modeling and management processes, as well as for addressing potential data access restrictions. The file structure of HDFS where data is maintained is definitely something that a data modeler in the Big Data era should manage.
At first I was going to combine designing the HDFS layout with NoSQL design, but decided that it was too much information to cover adequately in one post, so I will begin with HDFS and discuss NoSQL design in the following post.
HDFS and Data Modeling:
Modeling in the Big Data era requires an understanding of the operating system, and more importantly, the file system. If you are familiar with Hadoop, and in particular how the Hadoop Distributed File System (HDFS) works, and you understand the importance of data modeling in a big data environment, this post might not be worth your time. However, feedback, suggested improvements, validation, or corrections are always welcome.
File Level Security:
To the end user, files in HDFS are managed much like files in a UNIX or Linux file system. HDFS runs on top of the OS file system: if the cluster is composed of Red Hat Linux servers, the underlying file system might be ext3; depending on the operating system flavor (Linux/UNIX/Windows), it could also be, for example, NTFS, HTFS, UFS, ext2, ext4, Veritas File System, VMFS, or ZFS. HDFS maintains all data across the cluster, and the physical location of the data is hidden from the end user. Instead, a global directory structure is presented, and executing the UNIX ls -al command is essentially the same as executing hdfs dfs -ls on HDFS. This global UNIX-like directory structure is displayed as a hierarchical tree (just like any directory tree). Everything in HDFS is contained under a root directory, and file ownership and permissions follow the UNIX style of access control lists (ACLs). So there is one level of security at the directory and file level that becomes part of the security architecture. Through the use of ACLs, a modeler, working with security architects, might make modeling changes that separate sensitive data into its own file. Restricting access to a complete table (directory/file) is much simpler and more secure than trusting a virtual-layer data masking application to enforce your security constraints.
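For example, a sensitive table’s directory can be locked down with standard permissions and, where finer control is needed, extended ACLs. The sketch below is illustrative; the paths, user, and group names are hypothetical:

    # Restrict a sensitive table's directory to its owning service account
    hdfs dfs -chown svc_claims:claims_admins /data/claims_pii
    hdfs dfs -chmod 750 /data/claims_pii

    # Grant one additional group read access via an extended ACL
    # (requires dfs.namenode.acls.enabled=true on the cluster)
    hdfs dfs -setfacl -m group:auditors:r-x /data/claims_pii

    # Verify the resulting ACL
    hdfs dfs -getfacl /data/claims_pii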
No Concept of Current Working Directory:
That said, there are important differences between HDFS and UNIX/Linux file systems. In HDFS there is no concept of a current working directory: you must specify the directory name, or the command defaults to the user’s home directory. The command above, hdfs dfs -ls, would show the contents of the user’s home directory; hdfs dfs -ls / would show the contents of the root directory (much like ls -al / on UNIX); and hdfs dfs -ls /user/hive/warehouse would show the contents of the Hive warehouse directory. In other words, if the path is not specified each time, the command defaults to the user’s home directory. Creating user accounts that have exclusive rights (for a service, for example) to a specific data set is yet another layer of security. An intruder would not only have to know the data, but also the user account authorized to access that data, as well as the groups allowed access.
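Collecting those commands in one place (the warehouse path shown is the common default; yours may differ):

    # No current working directory in HDFS: with no path, -ls shows
    # the user's home directory (typically /user/<username>)
    hdfs dfs -ls

    # List the root of the HDFS namespace
    hdfs dfs -ls /

    # List the Hive warehouse directory (default location for managed tables)
    hdfs dfs -ls /user/hive/warehouse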
HDFS is Write Once:
The other important concept to understand is that, by default, HDFS is “write once.” In newer HDFS releases you can set the dfs.support.append parameter in the hdfs-site.xml file to a value of “true” (in older releases the default was “false”). With appends enabled, tools such as Sqoop can append to files, but the data already written cannot be changed. There are ways around this with various Hadoop components by writing a new line and tagging records for deletion using things called “tombstone markers,” but as a general definition, HDFS is write once, and anything other than that is a special case.
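If your release requires it, the setting looks like the following in hdfs-site.xml (a minimal sketch; check the default for your specific Hadoop version):

    <!-- hdfs-site.xml: allow appends to existing files -->
    <property>
      <name>dfs.support.append</name>
      <value>true</value>
    </property>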
For the reasons discussed above, the design of the directory structure is actually part of the data modeling required for a Data Lake. In addition, a data modeler must develop data structures that will perform well. As discussed below, a few large tables (even more so than dimensional tables) perform much better than many tables with less data each. The trick is maintaining data integrity while denormalizing the data structures. In addition, the data modeler must consider the separation of sensitive and anonymized data.
Data Transformation During Ingestion:
With Sqoop, you can import all tables from a relational database by providing a user name, target directory, and connection information for an RDBMS schema. By default, Sqoop will create one directory for each table. For a staging area, this might be acceptable. You could have one user account for banking data, one for insurance, one for claims, and so forth. However, it means that when it comes time to do the transformations, they will be extremely slow: there will be one directory for every reference table, every fact table, and every association table. As will be discussed, Hadoop performs better with a few large tables. By large, this means wide as well as long tables; in other words, more columns with many rows.
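A minimal sketch of such an import (the connection string, account, and directories are hypothetical):

    # Import every table in the schema; Sqoop creates one HDFS
    # directory per table under the warehouse directory
    sqoop import-all-tables \
      --connect jdbc:mysql://dbhost/claims \
      --username etl_user \
      --password-file /user/etl/.dbpass \
      --warehouse-dir /user/etl/staging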
The other option is to land the data in another RDBMS, or to do the transformation during ingestion. To do this requires that you know how the data should look in the lake, which implies some engineering (data modeling and design) has taken place. This can be simplified by creating views of the data in the RDBMS and importing those, or the data can be pulled directly from the RDBMS and transformed using Apache Sqoop or some other ETL tool like Talend, Apache Falcon, IBM DataStage . . . it’s a long list (I personally think it is crazy to pay for what is free and just as good, if not better).
The decision point here is whether the data lake is simply a staging area, or will be used for reporting, analytics, or other purposes. Tools like Hive and Impala sit on top of the directory structure that has been created and treat the directories as tables. These tools provide a SQL-like query language that allows you to query the data. However, they actually transform these queries into MapReduce, Spark, or Tez jobs that then query the data (depending on the setting of the hive.execution.engine parameter in the hive-site.xml file). Again, given that Hadoop is very slow when dealing with a large number of small files, this is something to consider. MapReduce is quickly becoming obsolete, with Spark and Tez being much faster. However, even Spark requires that the data be read from disk into memory, and disk I/O is always the bottleneck when dealing with Hadoop.
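For reference, the engine can be set cluster-wide in hive-site.xml or overridden per session; a minimal sketch:

    -- Per-session override in Hive (values: mr, tez, or spark,
    -- depending on what your distribution supports)
    SET hive.execution.engine=tez;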
Therefore, it is critical for performance reasons that the data structures (directories) are designed in a way that is optimized for speed: tables are large and few where possible (see below), with limited joins required, while still satisfying business and security requirements.
Partitioning:
With RDBMS databases, you get a big boost in performance when tables are partitioned (as long as your data lends itself to partitioning properly*). Anything that segments your data into smaller chunks essentially has the effect of indexing the data; each partition holds a subset of the data in what amounts to its own table, making queries against that subset much faster. For example, if you have a large amount of claims data that is frequently queried by state, you can partition the data by state and reduce the data scanned to just the given state, not all 50 states. Another obvious partition key is date. There are any number of possibilities, and just as with RDBMS tables, this option should be considered in your physical data model for HDFS.
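In Hive or Impala this looks like the sketch below (table and column names are illustrative); each state value becomes its own subdirectory under the table’s directory:

    -- Partitioned claims table; the partition column is not stored
    -- in the data files, it is encoded in the directory path
    -- (e.g. .../claims/state=TX/)
    CREATE TABLE claims (
      claim_id     BIGINT,
      claim_amount DECIMAL(12,2),
      claim_date   STRING
    )
    PARTITIONED BY (state STRING)
    STORED AS PARQUET;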
With Hive and Impala, you will hear the terms dynamic partitioning and static partitioning. With dynamic partitioning, the partitions are automatically created by Hive or Impala as data is inserted into a table. As the name implies, static partitioning is a manual process: you create partitions yourself using ALTER TABLE ... ADD PARTITION, and the correct partition is selected during the load process. Dynamic and static refer only to the method used to create partitions during the load; once the data is loaded, there is no difference between the two.
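A hedged sketch of both load styles against the illustrative claims table above (the staging table is hypothetical):

    -- Static: declare the partition, then load into it explicitly
    ALTER TABLE claims ADD PARTITION (state = 'TX');
    INSERT INTO TABLE claims PARTITION (state = 'TX')
    SELECT claim_id, claim_amount, claim_date
    FROM claims_staging
    WHERE state = 'TX';

    -- Dynamic: let Hive derive the partitions from the data itself
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;
    INSERT INTO TABLE claims PARTITION (state)
    SELECT claim_id, claim_amount, claim_date, state
    FROM claims_staging;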
In my opinion, partitioning is an extremely underutilized concept. I have been to very few places that actually use it, but I can tell you that I personally implemented it on an Oracle RAC and performance improved fifty-fold; depending on the implementation, similar or better improvements should be seen on Hadoop and HDFS. How the data is partitioned has a lot to do with the gain, and improvements might not be linear, but queries will always be considerably faster than scanning the complete data set (barring the note below on small files). For example, the typical company will probably have more customers in the states with the greatest population. If the data is partitioned by state, queries running against the New York, California, or Texas partitions will not be as fast as those against the Wyoming partition. However, querying records from Texas would still be much faster than querying the entire data set.
*Note: Partitioning is not a panacea for managing large data sets. You must be careful when choosing your partitions and make sure that you do not create a “small files” problem. For example, partitioning on date by month might be fine, assuming you have enough records in each month to create reasonably sized files. Imagine if you accidentally partitioned on a date/timestamp field and ended up with one partition for each second (not good). The same could happen in the above example if you only had 10 customers in Wyoming.
Choosing the File Format:
One reason that Hadoop projects fail is that the designers don’t fully understand the impacts of their decisions. Out of the box, HDFS stores data in text format. While this format offers benefits that might be appropriate for some use cases, it is not optimal where performance is a primary concern. The good news is that you are not stuck with your original decision; the format can be changed at a later date.
For many of the Hadoop ecosystem components, the file format is specified during the creation of data structures. With Hive and Impala, the STORED AS clause defines the file format as tables are created; if omitted, the default is TEXTFILE. The other available file formats are SEQUENCEFILE, AVRO, and PARQUET. There are two additional types, RCFILE and ORCFILE, which are primarily used with Hive and have limited interoperability with other components. If you’re implementing a Hive-only solution, I will leave the research of those file formats to you.
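For instance (an illustrative table; any of the formats above can appear in the STORED AS clause):

    -- Same table definition, different on-disk format per the clause below
    CREATE TABLE events (
      event_id BIGINT,
      payload  STRING
    )
    STORED AS AVRO;   -- or TEXTFILE, SEQUENCEFILE, PARQUET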
As with anything involving design, you must apply your decisions to the defined use cases and select the appropriate file format. I have seen several projects stumble along until they finally discovered that the text file format was not optimal for their application. There are several factors to consider when choosing a file format.
Typical design questions: 1) What will the ingest pattern look like (e.g., how is the data loaded)? 2) Which tools are planned to be used, and which file formats do they support? 3) What is the expected lifetime of the data? 4) Are you optimizing for performance or for storage space? All of these questions should be answered before making this decision.
Text File Format:
As mentioned, text files are the most basic file type in the Hadoop ecosystem and are the default format. Text files are human readable and compatible with almost any application, but values represented as plain strings can be extremely inefficient. Representing numeric values as strings wastes space, storing binary objects requires techniques like Base64 encoding, and converting data back to its native types requires serialization and deserialization, which slows performance. Overall, text files are simpler in some cases, more complicated in others, and are not optimal for performance.
Sequence File Format:
The sequence file format was developed specifically for Hadoop as a viable alternative to text files. As we will discuss in some detail in the NoSQL design post, many things perform well using key-value pairs, and sequence files store key-value pairs in a binary container format that is efficient for both text and binary data. One consideration in using sequence files is that the format is tightly coupled to the Java programming language and is not widely supported outside of the Hadoop ecosystem. As a general rule, sequence files offer good performance with poor interoperability.
Avro File Format:
The most popular and widely used file format is Avro, an efficient data serialization framework that uses a binary container format similar to the sequence file format. The main advantage of Avro is its interoperability: it is widely used and supported throughout the Hadoop ecosystem and is designed to work across different programming languages, such as Java, Scala, Python, and R.
In addition to its sequence-file-like binary containers, Avro provides an efficient data serialization framework and uses optimized binary encoding for efficient storage. Avro distinguishes itself by embedding a schema definition that describes the file. As the schema evolves, the file definition evolves without changing the data: if you add or drop columns using Hive or Impala, you do not need to change the data already stored in Avro files. The obvious reason for this is that HDFS is ‘write once,’ so the data remains in place and a new column is simply added to the schema. The only thing that needs to change is the schema, which Avro manages automatically. As an architect/modeler, you might imagine how much fun this is to manage, but it does provide flexibility, and it can be managed with a little due diligence and policy enforcement.
An Avro schema is represented in JSON, and it is possible to specify the schema of an Avro table by including that JSON directly in the CREATE TABLE statement. Avro is one of the more compatible file formats in the ecosystem and provides excellent performance; for these reasons, it is one of the more widely deployed file formats for general-purpose data.
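In Hive this is typically done through the avro.schema.literal table property; a minimal sketch with a hypothetical record:

    CREATE TABLE customers_avro
    STORED AS AVRO
    TBLPROPERTIES ('avro.schema.literal' = '{
      "type": "record",
      "name": "Customer",
      "fields": [
        {"name": "customer_id", "type": "long"},
        {"name": "state",       "type": "string"}
      ]
    }');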
Parquet File Format:
The last format we will discuss is Parquet, a columnar format, which alone provides significant performance gains if you have large tables (which is more likely than not) and only want to look at a few columns out of, say, 100 in one query. All of the aforementioned formats are row-based, meaning that an entire row must be read and the selected columns then extracted. As you might imagine, in the right context Parquet can offer significant advantages.
Parquet is a top-level Apache project and is widely supported in the ecosystem, including MapReduce, Pig, Hive, Impala, Spark, and multiple programming languages. In addition to being columnar and highly interoperable, Parquet, like Avro, embeds the schema definition in the file. Parquet also provides advanced optimizations for storing data compactly that speed up query execution. Having the data loaded all at once or in large batches is optimal; this enables Parquet to take advantage of repeated patterns in the data to store it more efficiently. Overall, Parquet is a good choice if columnar data storage and retrieval is a consideration.
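To make the columnar benefit concrete, a query like the sketch below against a wide Parquet table reads only the two referenced columns from disk (table and columns as in the earlier illustrative example):

    -- Only the state and claim_amount column chunks are read;
    -- a row-based format would have to scan entire rows
    SELECT state, SUM(claim_amount) AS total_claims
    FROM claims
    GROUP BY state;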
File Compression:
Compression may seem like we are seriously treading into the Hadoop Administrator’s domain, and this may be more of a Big Data Architect decision, but compression can have a big impact on performance. As discussed above, different file formats have different performance characteristics; another key factor is compression. The smaller the data file, the less disk space it requires and the less time it takes to read from disk.
There are tradeoffs here: if the data is compressed, it takes less time to read it from disk, but it must then be decompressed, which costs CPU cycles and time. Likewise, if we compress a file as it is being stored, writing takes more time than storing it uncompressed, but the uncompressed file requires more disk space. Which is the optimal solution?
Disk I/O is almost always the limiting factor in computing, and particularly in Hadoop performance. Ultimately, it has been found that with large files and massive amounts of processing power and memory (the latter of which is much faster and cheaper now), performance actually improves with compression. Compression algorithms have also improved greatly over the years, which further strengthens the case for using them.
Compression is accomplished using codecs (compressors/decompressors), and the result is faster overall performance, even though some time is spent compressing and decompressing the data. One consideration here is compatibility, since not all components are compatible with all codecs. The most efficient codec in terms of space is BZip2, but it is also significantly slower than the others. If space is the main criterion, it might be a good option; however, for data that is constantly read from and written to disk, BZip2 would not be viable.
The two compression algorithms with the best balance between compression and decompression speed are Snappy and LZ4, and of these two, Snappy is the codec of choice simply because of its compatibility within the Hadoop ecosystem. When loading data with Sqoop, you simply use the --compression-codec option to select Snappy; if no codec is specified, the default is gzip.
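A hedged sketch of such an import (connection details and paths are hypothetical; the codec is usually given by its full class name, though some Sqoop versions accept the short alias snappy):

    # Import a single table, compressed with Snappy
    sqoop import \
      --connect jdbc:mysql://dbhost/claims \
      --username etl_user \
      --table claims \
      --target-dir /user/etl/claims_snappy \
      --compress \
      --compression-codec org.apache.hadoop.io.compress.SnappyCodec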
HDFS is Optimized for Streaming Reads of Large Files:
Another fact that must be taken into consideration when designing the data structures for a data lake, or a Hadoop cluster in general, and one mentioned several times throughout this post, is that HDFS is optimized for streaming reads of large files rather than random reads of many small files. Keep in mind that prior to Spark and Tez, the only way of retrieving data from Hadoop was with MapReduce programs. This meant that each node in the distributed cluster would be queried for the file data, the data reduced into the consolidated result set, and the results sent back to the user. The more files involved in the process, the slower the execution. In recent versions, Spark and Tez greatly reduce query times, but the principle still applies: fewer large files are preferable to many small files.
Conclusion:
So what does all of this mean? Data modeling in the Big Data era has introduced some new challenges, especially when we talk about the physical implementation. The conceptual and logical concerns remain basically the same, but even with these models, we must now consider ways of shaping the data into structures that optimize retrieval from HDFS.
As we have seen, components like Hive, Impala, and Spark depend on the HDFS directory structures, and how those directories are structured affects both the data modeling and management processes and the enforcement of data access restrictions. The HDFS file structure where data is maintained is definitely something that a data modeler in the Big Data era should manage.
In my opinion, nothing really changes from the data design perspective: first we gather requirements, then we document use cases, then we design the models needed to satisfy those requirements and use cases. Once the conceptual and logical models are complete, we are simply faced with a few unique challenges in the physical implementation.