ParquetSharp.Dataset 0.1.0-beta3

This is a prerelease version of ParquetSharp.Dataset.
dotnet add package ParquetSharp.Dataset --version 0.1.0-beta3                

ParquetSharp.Dataset

This is a work in progress and is not yet ready for public use.

ParquetSharp.Dataset supports reading datasets consisting of multiple Parquet files, which may be partitioned with a partitioning strategy such as Hive partitioning. Data is read using the Apache Arrow format.

Note that ParquetSharp.Dataset does not use the Apache Arrow C++ Dataset library, but is implemented on top of ParquetSharp, which uses the Apache Arrow C++ Parquet library.

Usage

To begin with, you will need a dataset of Parquet files that have the same schema:

/my-dataset/data0.parquet
/my-dataset/data1.parquet

You can then create a DatasetReader and read data from it as a stream of Arrow RecordBatch objects:

using ParquetSharp.Dataset;

var dataset = new DatasetReader("/my-dataset");
using var reader = dataset.ToBatches();
while (await reader.ReadNextRecordBatchAsync() is { } batch)
{
    using (batch)
    {
        // Use data in the batch
    }
}

Your dataset may be partitioned using Hive partitioning, where each partition directory is named with a field name and value in the form name=value:

/my-dataset/part=a/data0.parquet
/my-dataset/part=a/data1.parquet
/my-dataset/part=b/data0.parquet
/my-dataset/part=b/data1.parquet
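
To make the naming convention concrete, here is a minimal standalone sketch of how name=value directory segments could be pulled out of a path relative to the dataset root. The HivePathSketch class and ParsePartitionValues method are hypothetical names used only for illustration. They are not part of ParquetSharp.Dataset and do not describe how the library parses paths internally.

```csharp
using System;
using System.Collections.Generic;

public static class HivePathSketch
{
    // Hypothetical helper (not a ParquetSharp.Dataset API): collects the
    // "name=value" directory segments of a path relative to the dataset root.
    public static Dictionary<string, string> ParsePartitionValues(string relativePath)
    {
        var values = new Dictionary<string, string>();
        var segments = relativePath.Split('/', StringSplitOptions.RemoveEmptyEntries);

        // The last segment is the file name, so only directory segments are inspected.
        for (var i = 0; i < segments.Length - 1; i++)
        {
            var separator = segments[i].IndexOf('=');
            if (separator > 0)
            {
                var name = segments[i].Substring(0, separator);
                var value = segments[i].Substring(separator + 1);
                values[name] = value;
            }
        }

        return values;
    }

    public static void Main()
    {
        var values = ParsePartitionValues("part=a/data0.parquet");
        Console.WriteLine(values["part"]); // prints "a"
    }
}
```

In this sketch every value is kept as a string. A real partitioning scheme also has to decide on a type for each field, which is what a partitioning schema describes.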

To read Hive partitioned data, you can provide a HivePartitioning.Factory instance to the DatasetReader constructor, and the partitioning schema will be inferred by looking at the dataset directory structure:

var partitioningFactory = new HivePartitioning.Factory();
var dataset = new DatasetReader("/my-dataset", partitioningFactory);

Alternatively, you can specify the partitioning schema explicitly:

using Apache.Arrow;        // Field
using Apache.Arrow.Types;  // StringType

var partitioningSchema = new Apache.Arrow.Schema.Builder()
    .Field(new Field("part", new StringType(), nullable: false))
    .Build();
var partitioning = new HivePartitioning(partitioningSchema);
var dataset = new DatasetReader("/my-dataset", partitioning);

When creating a DatasetReader, the schema from the first Parquet file found will be inspected to determine the full dataset schema. This can be avoided by providing the full dataset schema explicitly:

var datasetSchema = new Apache.Arrow.Schema.Builder()
    .Field(new Field("part", new StringType(), nullable: false))
    .Field(new Field("x", new Int32Type(), nullable: false))
    .Field(new Field("y", new FloatType(), nullable: false))
    .Build();
var dataset = new DatasetReader("/my-dataset", partitioning, datasetSchema);

Filtering data

When reading data from a dataset, you can specify the columns to include and filter rows based on field values. Row filters may apply to fields from the data files or from the partitioning schema. When a filter excludes a partition directory, no files from that directory will be read.

var columns = new[] {"x", "y"};
var filter = Col.Named("part").IsIn(new[] {"a", "c"});
using var reader = dataset.ToBatches(filter, columns);
while (await reader.ReadNextRecordBatchAsync() is { } batch)
{
    using (batch)
    {
        // batch will only contain columns "x" and "y",
        // and only files in the selected partitions will be read.
    }
}
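
The effect of a partition filter can be illustrated with a small standalone sketch that selects which files would be opened for a given set of allowed partition values. PartitionPruningSketch and SelectFiles are hypothetical names, not ParquetSharp.Dataset APIs, and the sketch assumes every file sits under a directory for the filtered field. It only mirrors the behavior described above and is not the library's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PartitionPruningSketch
{
    // Hypothetical helper (not a ParquetSharp.Dataset API): keeps only the files
    // that have a "field=value" path segment whose value is in the allowed set.
    public static List<string> SelectFiles(
        IEnumerable<string> relativePaths, string field, ISet<string> allowedValues)
    {
        var prefix = field + "=";
        return relativePaths
            .Where(path => path.Split('/').Any(segment =>
                segment.StartsWith(prefix, StringComparison.Ordinal)
                && allowedValues.Contains(segment.Substring(prefix.Length))))
            .ToList();
    }

    public static void Main()
    {
        var files = new[]
        {
            "part=a/data0.parquet", "part=a/data1.parquet",
            "part=b/data0.parquet", "part=b/data1.parquet",
        };
        var selected = SelectFiles(files, "part", new HashSet<string> { "a", "c" });

        // Only the two files under part=a are selected; part=b is skipped entirely.
        foreach (var file in selected)
        {
            Console.WriteLine(file);
        }
    }
}
```

With the example layout, the value "c" matches no directory, so it contributes no files.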

Target frameworks: net6.0 is compatible. The platform-specific net6.0 variants and all net7.0 and net8.0 targets were computed as compatible.

Version Downloads Last updated
0.1.0-beta4 87 8/22/2024
0.1.0-beta3 85 4/10/2024
0.1.0-beta2 69 4/9/2024
0.1.0-beta1 67 4/4/2024