1.Introduction
Linkis needs to store various types of data in files, for example Hive table data, while also preserving metadata such as field types, column names, and comments.
2.Storage supports a variety of file systems
3.Result Set - Parquet
3.1 Parquet composition
Parquet is only a storage format: it is language- and platform-independent and is not tied to any particular data processing framework. The components that currently work with Parquet include those listed below. Most commonly used query engines and computing frameworks have been adapted to it, and data produced by other serialization tools can easily be converted to Parquet.
Query Engines: Hive, Impala, Pig, Presto, Drill, Tajo, HAWQ, IBM Big SQL
Data Models: Avro, Thrift, Protocol Buffers, POJOs
The schema of each data model contains multiple fields, and each field can in turn contain nested fields. Each field has three attributes: repetition, data type, and field name. The repetition is one of three kinds: required (occurs exactly once), repeated (occurs zero or more times), or optional (occurs zero or one time). The data type of a field is either a group (complex type) or a primitive (basic type).
Primitive data types:
INT64, INT32, BOOLEAN, BINARY, FLOAT, DOUBLE, INT96, FIXED_LEN_BYTE_ARRAY
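The field attributes described above (repetition, data type, field name, with groups nesting further fields) can be modelled in a few lines. The following is only an illustrative sketch in plain Python, not Linkis or Parquet library code: the names `SchemaField` and `render_message` and the example schema `hive_row` are hypothetical, and the output merely imitates the style of Parquet's textual schema notation (a FIXED_LEN_BYTE_ARRAY would additionally need a length, which this sketch omits).

```python
from dataclasses import dataclass, field
from typing import List, Optional

REPETITIONS = {"required", "optional", "repeated"}
PRIMITIVES = {
    "INT64", "INT32", "BOOLEAN", "BINARY",
    "FLOAT", "DOUBLE", "INT96", "FIXED_LEN_BYTE_ARRAY",
}


@dataclass
class SchemaField:
    """Hypothetical model of one schema field: repetition, name, type."""
    repetition: str
    name: str
    primitive: Optional[str] = None  # set for primitive (basic) fields
    children: List["SchemaField"] = field(default_factory=list)  # set for groups

    def __post_init__(self) -> None:
        if self.repetition not in REPETITIONS:
            raise ValueError(f"unknown repetition: {self.repetition}")
        # A field is either a primitive or a group with at least one child.
        if (self.primitive is None) == (not self.children):
            raise ValueError("a field is either a primitive or a non-empty group")
        if self.primitive is not None and self.primitive not in PRIMITIVES:
            raise ValueError(f"unknown primitive type: {self.primitive}")

    def render(self, indent: int = 0) -> str:
        pad = "  " * indent
        if self.primitive is not None:
            return f"{pad}{self.repetition} {self.primitive.lower()} {self.name};"
        lines = [f"{pad}{self.repetition} group {self.name} {{"]
        lines += [child.render(indent + 1) for child in self.children]
        lines.append(f"{pad}}}")
        return "\n".join(lines)


def render_message(name: str, fields: List[SchemaField]) -> str:
    """Render a top-level schema as a 'message' with its fields indented."""
    body = "\n".join(f.render(1) for f in fields)
    return f"message {name} {{\n{body}\n}}"


example = render_message("hive_row", [
    SchemaField("required", "name", "BINARY"),
    SchemaField("optional", "course", children=[
        SchemaField("required", "course", "BINARY"),
        SchemaField("required", "score", "INT32"),
    ]),
    SchemaField("repeated", "work_locations", "BINARY"),
])
print(example)
```

Here `course` is a group (complex type) whose children are primitives, while `work_locations` uses the repeated repetition to express zero or more occurrences.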
3.2 Parquet Design
3.3 Parquet implementation
4.Result Set - ORC
4.1 ORC composition
Unlike Parquet, ORC does not natively support nested data formats, but supports nested formats through special processing of complex data types.
CREATE TABLE orcStructTable(
  name string,
  course struct<course:string,score:int>,
  score map<string,int>,
  work_locations array<string>  -- element type assumed to be string; it is missing from the source statement
)
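One way to picture how a columnar format handles such complex types is to flatten the table's type tree into a list of columns, numbering each node in pre-order so that every struct member, map key and value, and array element becomes its own column. The sketch below illustrates that idea for the table above. It is not ORC library code: the tuple-based type representation, the function name `flatten`, and the path labels `key`, `value`, and `element` are hypothetical, and the element type of `work_locations` is assumed to be string because the source statement does not show it.

```python
# Types are plain strings for primitives, or tuples for complex types:
#   ("struct", [(field_name, type), ...]), ("map", key_type, value_type),
#   ("array", element_type)
TABLE_TYPE = ("struct", [
    ("name", "string"),
    ("course", ("struct", [("course", "string"), ("score", "int")])),
    ("score", ("map", "string", "int")),
    ("work_locations", ("array", "string")),  # element type is an assumption
])


def flatten(node, path="<root>", out=None):
    """Return (column_id, path, kind) tuples in pre-order."""
    if out is None:
        out = []
    kind = node if isinstance(node, str) else node[0]
    # The id is the visit order: a parent is numbered before its children.
    out.append((len(out), path, kind))
    if kind == "struct":
        for name, child in node[1]:
            child_path = name if path == "<root>" else f"{path}.{name}"
            flatten(child, child_path, out)
    elif kind == "map":
        flatten(node[1], f"{path}.key", out)
        flatten(node[2], f"{path}.value", out)
    elif kind == "array":
        flatten(node[1], f"{path}.element", out)
    return out


columns = flatten(TABLE_TYPE)
for column_id, column_path, column_kind in columns:
    print(column_id, column_path, column_kind)
```

With this numbering the root struct is column 0 and the four top-level fields expand into nine further columns, so a reader that only needs `course.score` can address that single column.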
Like Parquet, ORC files are stored in a binary format, so they are not directly human-readable. ORC files are also self-describing: they contain a large amount of metadata, which is serialized with Protocol Buffers.
ORC file: an ordinary binary file stored on the file system. An ORC file can contain multiple stripes, and each stripe contains multiple records. Within a stripe, the records are stored column by column. A stripe corresponds to the row group concept in Parquet.
File-level metadata: including file description information PostScript, file meta information (including statistical information of the entire file), all stripe information and file schema information.
stripe: a group of rows forms a stripe. A file is read one stripe at a time. A stripe is generally about the size of an HDFS block and stores the index and data of each column.
stripe metadata: saves the position of the stripe, the statistics of each column in the stripe, and all stream types and positions.
row group: the smallest unit of the index. A stripe contains multiple row groups, each consisting of 10,000 values by default.
stream: A stream represents a valid piece of data in the file, including index and data. The index stream saves the position and statistical information of each row group, and the data stream includes various types of data, which are determined by the column type and encoding method.
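The role of the per-row-group statistics in the index can be shown with a small sketch: a column is split into row groups of 10,000 values, the minimum and maximum of each group are recorded, and a range filter then reads only the groups whose range overlaps the query. This is a simplified illustration in plain Python, not the ORC reader. The names `RowGroupStats`, `build_index`, and `row_groups_to_read` are hypothetical, and real index streams also store positions and other statistics.

```python
from dataclasses import dataclass
from typing import List

ROW_GROUP_SIZE = 10_000  # default number of values per row group


@dataclass
class RowGroupStats:
    start: int    # offset of the first value of the row group in the column
    count: int    # number of values in the row group
    minimum: int
    maximum: int


def build_index(values: List[int], row_group_size: int = ROW_GROUP_SIZE) -> List[RowGroupStats]:
    """Split one column into row groups and record min/max for each."""
    index = []
    for start in range(0, len(values), row_group_size):
        chunk = values[start:start + row_group_size]
        index.append(RowGroupStats(start, len(chunk), min(chunk), max(chunk)))
    return index


def row_groups_to_read(index: List[RowGroupStats], lo: int, hi: int) -> List[RowGroupStats]:
    """Keep only row groups whose [minimum, maximum] overlaps [lo, hi]."""
    return [rg for rg in index if rg.maximum >= lo and rg.minimum <= hi]


column = list(range(25_000))           # one integer column of a stripe
index = build_index(column)            # three row groups: 10,000 + 10,000 + 5,000
selected = row_groups_to_read(index, 12_000, 12_500)
print([(rg.start, rg.count) for rg in selected])
```

In this example only the second row group overlaps the requested range, so the other two can be skipped without reading their data streams.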
Comparison
Hive
For wide tables, ORC performs better than Parquet.
The ORC file format performs better in terms of storage space, data import speed, and query speed, and ORC supports ACID operations to a certain extent. Given the direction of community development, ORC is currently the more strongly recommended columnar storage format in Hive.
6.release
expected release 2022-03-31