pyspark.sql.DataFrameReader.parquet

DataFrameReader.parquet(*paths, **options)[source]

Loads Parquet files, returning the result as a DataFrame.

New in version 1.4.0.

Parameters:
pathsstr
Other Parameters:
mergeSchemastr or bool, optional

sets whether we should merge schemas collected from all Parquet part-files. This will override spark.sql.parquet.mergeSchema. The default value is specified in spark.sql.parquet.mergeSchema.

pathGlobFilterstr or bool, optional

an optional glob pattern to only include files with paths matching the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter. It does not change the behavior of partition discovery. # noqa

recursiveFileLookupstr or bool, optional

recursively scan a directory for files. Using this option disables partition discovery. # noqa

modification times occurring before the specified time. The provided timestamp must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)

modifiedBefore (batch only)an optional timestamp to only include files with

modification times occurring before the specified time. The provided timestamp must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)

modifiedAfter (batch only)an optional timestamp to only include files with

modification times occurring after the specified time. The provided timestamp must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)

Examples

>>> df = spark.read.parquet('python/test_support/sql/parquet_partitioned')
>>> df.dtypes
[('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]