Spark code will run faster with certain data lakes than others. For example, Spark will run slowly if the data lake uses gzip compression and has unequally sized files (especially if there are a lot of small files). The code will run fast if the data lake contains equally sized 1GB Parquet files that use Snappy compression.

This blog post outlines the data lake characteristics that are desirable for Spark analyses: use 1GB Parquet files with Snappy compression, and solve the small file problem.

Spark works with many file formats including Parquet, CSV, JSON, ORC, Avro, and text files. TL;DR: Use Apache Parquet instead of CSV or JSON whenever possible, because it's faster and better. JSON is the worst file format for distributed systems and should be avoided whenever possible.

Column oriented formats

CSV, JSON, and Avro are row oriented data formats. ORC and Parquet are column oriented data formats. Row oriented file formats require data from all the rows to be transmitted over the wire for every analysis.

Suppose you have a DataFrame (df) and would like to query the city column. If the data is persisted as a Parquet file, then df.select("city") only has to transmit one column's worth of data across the wire. If the data is persisted as a CSV file, then df.select("city") transmits all the data across the wire. The sketch at the end of this post illustrates the difference.

Lazy vs eager evaluation

As discussed in this blog post, CSV files are sometimes eagerly evaluated, so Spark needs to perform a slow process to infer the schema (JSON files are always eagerly evaluated). The Parquet file format makes it easy to avoid eager evaluation. Spark can easily determine the schema of Parquet files from metadata, so it doesn't need to go through the time-consuming process of reading the files and inferring the schema.

Ease of compression

Column oriented file formats are more compressible. Intuitively, data stored in columns is more compressible than data stored in rows: compression algorithms perform better on data with low information entropy (high data value locality).

Take, for example, a database table containing information about customers (name, phone number, e-mail address, snail-mail address, etc.). Storing data in columns allows all of the names to be stored together, all of the phone numbers together, and so on. Phone numbers are certainly more similar to each other than the surrounding text fields like e-mail addresses or names. Further, if the data is sorted by one of the columns, that column will be super-compressible (for example, runs of the same value can be run-length encoded).

Splittable compression algorithms

Files can be compressed with gzip, lzo, bzip2, and other compression algorithms.
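To make these recommendations concrete, here is a minimal Scala sketch of writing a data lake as Snappy-compressed Parquet files and then querying a single column. The bucket paths, the customers dataset and its city column, and the repartition count are all illustrative assumptions, not something from the original post; tune the partition count so each output file lands near the 1GB target.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-data-lake-sketch")
  .getOrCreate()

// Hypothetical row-oriented source data (path and schema are assumptions).
// inferSchema forces an eager pass over the CSV data to work out column types,
// which is exactly the slow step the post describes for CSV.
val customers = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3a://example-bucket/raw/customers.csv")

// Write the lake as Parquet with Snappy compression (Snappy is also Spark's
// default Parquet codec). repartition(10) is a placeholder: pick a partition
// count that yields roughly equally sized ~1GB files and avoids the small
// file problem.
customers
  .repartition(10)
  .write
  .option("compression", "snappy")
  .parquet("s3a://example-bucket/lake/customers")

// Because Parquet is column oriented and stores its schema in metadata, this
// read needs no schema inference and the query only transmits the city
// column's data; the same select on the CSV source would scan every column.
val cities = spark.read
  .parquet("s3a://example-bucket/lake/customers")
  .select("city")

cities.show()

spark.stop()
```

Swapping "snappy" for "gzip" in the write options would reproduce the slower gzip-compressed configuration described at the top of the post.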