Returns an estimated size of this relation in bytes. This information is used by the planner to decide when it is safe to broadcast a relation and can be overridden by sources that know the size ahead of time. By default, the system will assume that tables are too large to broadcast. This method will be called multiple times during query planning and thus should not perform expensive operations for each invocation.
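A minimal sketch of overriding this method in a custom relation, assuming a hypothetical KnownSizeRelation backed by a single local file; the class name, schema, and scan logic are illustrative, not part of the API:

import java.io.File

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical relation backed by a single local file whose size is known up front.
class KnownSizeRelation(path: String)(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(StructField("value", LongType, nullable = false) :: Nil)

  // Report the on-disk size so the planner can decide whether broadcasting is safe.
  // Kept cheap (a single file stat) because planning may call this several times.
  override def sizeInBytes: Long = new File(path).length()

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(new File(path).length())))
}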
An alternative to ParquetRelation that plugs in using the data sources API. This class is not currently intended as a full replacement for the Parquet support in Spark SQL, though it is likely to eventually subsume the existing physical plan implementation.
Compared with the current implementation, this class has the following notable differences:
Partitioning: Partitions are auto-discovered and must be in the form of key=value/ directories located at path (an example layout is sketched below). Currently only a single partitioning column is supported, and it must be an integer. This class supports both fully self-describing data, which contains the partition key, and data where the partition key is only present in the folder structure; the presence of the partitioning key in the data is auto-detected. The null partition is not yet supported.

Metadata: The metadata is automatically discovered by reading the first Parquet file present. There is currently no support for working with files that have different schemas. Additionally, when Parquet metadata caching is turned on, the FileStatus objects for all data will be cached to improve the speed of interactive querying. When data is added to a table, it must be dropped and recreated to pick up any changes.
Statistics: Statistics for the size of the table are automatically populated during metadata discovery.
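As an illustration of the partition layout described above, a hedged sketch; the paths, table name, and the parquetFile entry point are assumptions, and which entry point actually routes to this class depends on the Spark version and configuration:

import org.apache.spark.sql.SQLContext

// Assumed on-disk layout rooted at /data/events, with a single integer
// partition column "key" encoded in the directory names:
//
//   /data/events/key=1/part-00000.parquet
//   /data/events/key=2/part-00000.parquet
//
// The partition column may appear in the files themselves or only in the
// folder structure; its presence is auto-detected during discovery.
val sqlContext: SQLContext = ???  // obtained from an existing SparkContext

val events = sqlContext.parquetFile("/data/events")  // partitions key=1 and key=2 are discovered
events.registerTempTable("events")
sqlContext.sql("SELECT * FROM events WHERE key = 1").collect()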