fileprepper-cli 0.6.0


FilePrepper

A powerful .NET library and CLI tool for data preprocessing. The Pipeline API chains transformations in memory, cutting file I/O by 67-90% compared to running separate CLI steps. Well suited to ML data preparation, ETL pipelines, and data analysis workflows.

🚀 Quick Start

SDK Installation

# Install FilePrepper SDK for programmatic use
dotnet add package FilePrepper

# Or install CLI tool globally
dotnet tool install -g fileprepper-cli
using FilePrepper.Pipeline;

// CSV Processing: Only 2 file I/O operations (read + write)
await DataPipeline
    .FromCsvAsync("data.csv")
    .Normalize(columns: new[] { "Age", "Salary", "Score" },
               method: NormalizationMethod.MinMax)
    .FillMissing(columns: new[] { "Score" }, method: FillMethod.Mean)
    .FilterRows(row => int.Parse(row["Age"]) >= 30)
    .ToCsvAsync("output.csv");

// Multi-Format Support: Excel → Transform → JSON
await DataPipeline
    .FromExcelAsync("sales.xlsx")
    .AddColumn("Total", row =>
        (double.Parse(row["Price"]) * double.Parse(row["Quantity"])).ToString())
    .FilterRows(row => double.Parse(row["Total"]) >= 1000)
    .ToJsonAsync("high_value_sales.json");

// Multi-File CSV Concatenation: Merge 33 files ⭐ NEW
await DataPipeline
    .ConcatCsvAsync("kemp-*.csv", "dataset/")
    .ParseKoreanTime("Time", "ParsedTime")  // Korean time format ⭐ NEW
    .ExtractDateFeatures("ParsedTime", DateFeatures.Hour | DateFeatures.Minute)
    .ToCsvAsync("processed.csv");

CLI Usage

# Normalize numeric columns
fileprepper normalize-data --input data.csv --output normalized.csv \
  --columns "Age,Salary,Score" --method MinMax

# Fill missing values
fileprepper fill-missing-values --input data.csv --output filled.csv \
  --columns "Age,Salary" --method Mean

# Get help
fileprepper --help
fileprepper <command> --help
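The MinMax method shown above rescales each value to the [0, 1] range via (x - min) / (max - min). A minimal Python sketch of that computation (illustrative only, not FilePrepper's implementation):

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([20, 30, 40, 60]))  # [0.0, 0.25, 0.5, 1.0]
```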

📦 Supported Formats

Process data in multiple formats:

  • CSV (Comma-Separated Values)
  • TSV (Tab-Separated Values)
  • JSON (JavaScript Object Notation)
  • XML (Extensible Markup Language)
  • Excel (XLSX/XLS files)

๐Ÿ› ๏ธ Feature Matrix (30 Tasks)

Each row lists the CLI command, the Task class, and a short description.

Data Transformation
  normalize            NormalizeData       MinMax, ZScore normalization
  scale                ScaleData           StandardScaler, MinMaxScaler, RobustScaler
  one-hot-encoding     OneHotEncoding      Categorical → binary columns
  convert-type         DataTypeConvert     Column data type conversion
  extract-date         DateExtraction      Extract Year, Month, Day, DayOfWeek
  datetime             DateTimeOps         Parse datetime and extract features
  string               StringOps           upper, lower, trim, substring, concat, replace
  conditional          Conditional         If-then-else column creation
  expression           Expression          Arithmetic expression-based columns

Data Cleaning
  fill-missing         FillMissingValues   Mean, Median, Mode, Forward, Backward, Constant
  drop-duplicates      DropDuplicates      Remove duplicate rows by key columns
  replace              ValueReplace        Replace values in columns
  remove-constants     RemoveConstants     Remove constant/near-constant columns
  clean                CSVCleaner          Thousand separators, whitespace, \r strip

Column Operations
  add-columns          AddColumns          Add computed columns
  remove-columns       RemoveColumns       Delete columns
  rename-columns       RenameColumns       Rename column headers
  reorder-columns      ReorderColumns      Change column order
  column-interaction   ColumnInteraction   Create interaction features between columns

Data Organization
  merge                Merge               Vertical (concat) / Horizontal (join), glob support
  merge-asof           MergeAsOf           Time-series merge with tolerance
  data-sampling        DataSampling        Random, Stratified, Systematic sampling
  convert-format       FileFormatConvert   CSV ↔ TSV ↔ JSON ↔ XML ↔ Excel
  unpivot              Unpivot             Wide → Long format reshape
  filter-rows          FilterRows          Row filtering by conditions

Data Analysis
  stats                BasicStatistics     Mean, Median, StdDev, ZScore
  aggregate            Aggregate           Group-by aggregations

Feature Engineering
  create-lag-features  CreateLagFeatures   Time-series lag features
  window               WindowOps           Resample, rolling aggregations

Common Options (all commands)
  --skip-rows, --has-header, --encoding, --ignore-errors
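As an illustration of the one-hot-encoding task in the matrix above: the transformation replaces a categorical column with one binary column per category. A self-contained Python sketch of the idea (not FilePrepper's implementation):

```python
def one_hot(rows, column):
    """Replace a categorical column with one binary column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded

rows = [{"City": "Paris"}, {"City": "Tokyo"}]
print(one_hot(rows, "City"))
# [{'City_Paris': 1, 'City_Tokyo': 0}, {'City_Paris': 0, 'City_Tokyo': 1}]
```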

🧪 ML Data Preparation Cookbook

Common scenarios for machine learning data preparation:

Large Dataset Sampling (100K+ rows → 10K sample)

# Random sampling with fixed seed for reproducibility
fileprepper data-sampling -i large_dataset.csv -o sampled.csv \
  --method Random --sample-size 10000 --seed 42

# Stratified sampling (preserve label distribution)
fileprepper data-sampling -i large_dataset.csv -o sampled.csv \
  --method Stratified --sample-size 10000 --stratify-column "label"
// Pipeline API
await DataPipeline
    .FromCsvAsync("large_dataset.csv")
    .Sample(10000, SamplingMethod.Random, seed: 42)
    .ToCsvAsync("sampled.csv");
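Stratified sampling keeps each label's share of the sample equal to its share of the full dataset. A rough Python sketch of the technique (function names are illustrative, not FilePrepper's API; per-group counts are rounded, so they may not sum exactly to the requested size):

```python
import random

def stratified_sample(rows, label_column, sample_size, seed=42):
    """Sample rows while (approximately) preserving the label distribution."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_column], []).append(row)
    sample = []
    for label, group in by_label.items():
        k = round(sample_size * len(group) / len(rows))
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample

rows = [{"label": "a"}] * 80 + [{"label": "b"}] * 20
sample = stratified_sample(rows, "label", sample_size=10)
# preserves the 80/20 split: 8 "a" rows and 2 "b" rows
```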

Merging X/Y Split Files (Features + Labels)

# Horizontal merge: combine X_train.csv (features) + Y_train.csv (labels) by row index
fileprepper merge -i X_train.csv Y_train.csv -o merged_train.csv --direction Horizontal
// Pipeline API
var features = await DataPipeline.FromCsvAsync("X_train.csv");
var labels = await DataPipeline.FromCsvAsync("Y_train.csv");
await features
    .Join(labels, JoinType.Full, leftKey: null, rightKey: null) // row-by-row join
    .ToCsvAsync("merged_train.csv");
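Conceptually, a horizontal row-by-row merge just zips the two files' rows by index, so both inputs must have the same row count. A minimal Python sketch (illustrative, not FilePrepper's code):

```python
def merge_horizontal(features, labels):
    """Pair feature rows with label rows by row index (row-by-row join)."""
    if len(features) != len(labels):
        raise ValueError("row counts must match for an index-based merge")
    return [{**f, **l} for f, l in zip(features, labels)]

X = [{"Age": "25"}, {"Age": "30"}]
y = [{"Label": "0"}, {"Label": "1"}]
print(merge_horizontal(X, y))
# [{'Age': '25', 'Label': '0'}, {'Age': '30', 'Label': '1'}]
```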

Multi-Row Header Files (Skip metadata rows)

# Skip first row (category header), use second row as actual column names
fileprepper filter-rows -i messy_data.csv -o clean_data.csv --skip-rows 1

# No header in file: use numeric column indices
fileprepper normalize -i raw.csv -o normalized.csv \
  --columns "0,1,2" --method MinMax --has-header false

Cleaning External Data (Mixed line endings)

# Strip \r from quoted fields + remove thousand separators
fileprepper clean -i external_export.csv -o cleaned.csv --strip-cr -s ','
// Pipeline API
await DataPipeline
    .FromCsvAsync("external_export.csv")
    .StripCarriageReturn()
    .ToCsvAsync("cleaned.csv");
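In outline, the cleaning step strips carriage returns and surrounding whitespace from each cell and removes thousand separators from number-like values while leaving ordinary text untouched. A hypothetical Python sketch of that per-cell logic (not FilePrepper's implementation):

```python
def clean_cell(value):
    """Strip stray carriage returns, whitespace, and thousand separators."""
    value = value.replace("\r", "").strip()
    # Remove commas only when the result still parses as a number,
    # so text like "Seoul, KR" is left alone.
    candidate = value.replace(",", "")
    try:
        float(candidate)
        return candidate
    except ValueError:
        return value

print(clean_cell("1,234.50\r"))  # "1234.50"
print(clean_cell("Seoul, KR"))   # "Seoul, KR"
```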

💡 Common Use Cases

Data Cleaning Pipeline (CLI)

# 1. Remove unnecessary columns
fileprepper remove-columns --input raw.csv --output step1.csv \
  --columns "Debug,TempCol,Notes"

# 2. Drop duplicates
fileprepper drop-duplicates --input step1.csv --output step2.csv \
  --columns "Email" --keep First

# 3. Fill missing values
fileprepper fill-missing-values --input step2.csv --output step3.csv \
  --columns "Age,Salary" --method Mean

# 4. Normalize numeric columns
fileprepper normalize-data --input step3.csv --output clean.csv \
  --columns "Age,Salary,Score" --method MinMax

Time-Series Processing

# 5-minute window aggregation for sensor data
fileprepper window --input sensor_current.csv --output aggregated.csv \
  --type resample --method mean \
  --columns "RMS[A]" --time-column "Time_s[s]" \
  --window 5T --header

# Rolling window for smoothing
fileprepper window --input noisy_data.csv --output smoothed.csv \
  --type rolling --method mean \
  --columns temperature,humidity --window-size 3 \
  --suffix "_smooth" --header
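For reference, a trailing rolling mean with window size 3 averages each value with the two before it, which is the smoothing the second command performs. A small Python sketch of the computation (illustrative only):

```python
def rolling_mean(values, window_size):
    """Trailing rolling mean: average each value with the preceding window."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - window_size + 1): i + 1]
        out.append(sum(window) / len(window))  # shorter window at the start
    return out

print(rolling_mean([10, 20, 30, 40], 3))  # [10.0, 15.0, 20.0, 30.0]
```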

ML Feature Engineering (SDK - Efficient!)

using FilePrepper.Pipeline;

// Single pipeline: Only 2 file I/O operations instead of 8!
await DataPipeline
    .FromCsvAsync("orders.csv")
    .AddColumn("Year", row => DateTime.Parse(row["OrderDate"]).Year.ToString())
    .AddColumn("Month", row => DateTime.Parse(row["OrderDate"]).Month.ToString())
    .Normalize(columns: new[] { "Revenue", "Quantity" },
               method: NormalizationMethod.MinMax)
    .FilterRows(row => int.Parse(row["Year"]) >= 2023)
    .ToCsvAsync("features.csv");

// 67-90% reduction in file I/O compared to CLI approach!

Format Conversion

# CSV to JSON
fileprepper file-format-convert --input data.csv --output data.json --format JSON

# Excel to CSV
fileprepper file-format-convert --input report.xlsx --output report.csv --format CSV

# CSV to XML
fileprepper file-format-convert --input data.csv --output data.xml --format XML

Data Analysis

# Calculate statistics
fileprepper basic-statistics --input data.csv --output stats.csv \
  --columns "Age,Salary,Score" --statistics Mean,Median,StdDev,ZScore

# Aggregate by group
fileprepper aggregate --input sales.csv --output summary.csv \
  --group-by "Region,Category" --agg-columns "Revenue:Sum,Quantity:Mean"

# Sample data
fileprepper data-sampling --input large.csv --output sample.csv \
  --method Random --sample-size 1000
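Group-by aggregation like the aggregate command above collects rows by a key column and reduces a numeric column per group. A minimal Python sketch of the idea (not FilePrepper's implementation):

```python
def aggregate(rows, group_by, value_column, func=sum):
    """Group rows by a key column and aggregate a numeric column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_by], []).append(float(row[value_column]))
    return {key: func(vals) for key, vals in groups.items()}

sales = [
    {"Region": "EU", "Revenue": "100"},
    {"Region": "EU", "Revenue": "250"},
    {"Region": "US", "Revenue": "300"},
]
print(aggregate(sales, "Region", "Revenue"))  # {'EU': 350.0, 'US': 300.0}
```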

🔧 Programmatic Usage (SDK)

FilePrepper provides a powerful SDK with Pipeline API for efficient data processing:

dotnet add package FilePrepper

Benefits: 67-90% reduction in file I/O, fluent API, in-memory processing

using FilePrepper.Pipeline;
using FilePrepper.Tasks.NormalizeData;

// Efficient: Only 2 file I/O operations (read + write)
await DataPipeline
    .FromCsvAsync("data.csv")
    .Normalize(columns: new[] { "Age", "Salary", "Score" },
               method: NormalizationMethod.MinMax)
    .FillMissing(columns: new[] { "Score" }, method: FillMethod.Mean)
    .FilterRows(row => int.Parse(row["Age"]) >= 30)
    .AddColumn("ProcessedDate", _ => DateTime.Now.ToString())
    .ToCsvAsync("output.csv");

// Or work in-memory without any file I/O
var result = DataPipeline
    .FromData(inMemoryData)
    .Normalize(columns: new[] { "Age", "Salary" },
               method: NormalizationMethod.MinMax)
    .ToDataFrame();  // Get immutable snapshot

Advanced Pipeline Features

// Chain multiple transformations
var pipeline = await DataPipeline
    .FromCsvAsync("sales.csv")
    .RemoveColumns(new[] { "Debug", "TempCol" })
    .RenameColumn("OldName", "NewName")
    .AddColumn("Total", row =>
        (double.Parse(row["Price"]) * double.Parse(row["Quantity"])).ToString())
    .FilterRows(row => double.Parse(row["Total"]) > 100)
    .Normalize(columns: new[] { "Total" }, method: NormalizationMethod.MinMax);

// Get intermediate results without file I/O
var dataFrame = pipeline.ToDataFrame();
Console.WriteLine($"Processed {dataFrame.RowCount} rows");

// Continue processing
await pipeline
    .AddColumn("ProcessedAt", _ => DateTime.UtcNow.ToString("o"))
    .ToCsvAsync("output.csv");

In-Memory Processing

// Work entirely in memory - zero file I/O
var data = new List<Dictionary<string, string>>
{
    new() { ["Name"] = "Alice", ["Age"] = "25", ["Salary"] = "50000" },
    new() { ["Name"] = "Bob", ["Age"] = "30", ["Salary"] = "60000" }
};

var result = DataPipeline
    .FromData(data)
    .Normalize(columns: new[] { "Age", "Salary" },
               method: NormalizationMethod.MinMax)
    .AddColumn("Category", row =>
        int.Parse(row["Age"]) < 30 ? "Junior" : "Senior")
    .ToDataFrame();

// Access results directly
foreach (var row in result.Rows)
{
    Console.WriteLine($"{row["Name"]}: {row["Category"]}");
}

Traditional Task API

using FilePrepper.Tasks.NormalizeData;
using Microsoft.Extensions.Logging;

var options = new NormalizeDataOption
{
    InputPath = "data.csv",
    OutputPath = "normalized.csv",
    TargetColumns = new[] { "Age", "Salary", "Score" },
    Method = NormalizationMethod.MinMax
};

var task = new NormalizeDataTask(logger);
var context = new TaskContext(options);
bool success = await task.ExecuteAsync(context);

See SDK Usage Guide for comprehensive examples and best practices.

📖 Documentation

  • Getting Started
  • SDK & Programming
  • Advanced Features
  • Use Cases

For more documentation, see the docs/ directory.

🎯 Use Cases

  • Machine Learning - Prepare datasets for training (normalization, encoding, feature engineering)
  • Time-Series Analysis - Window aggregations, resampling, lag features
  • Data Analysis - Clean and transform data for analysis
  • ETL Pipelines - Extract, transform, and load data workflows with minimal I/O overhead
  • Data Migration - Convert between formats and clean legacy data
  • Automation - Script data processing with SDK or CLI
  • In-Memory Processing - Chain transformations without file I/O costs

📋 Requirements

  • .NET 10.0 or later
  • Cross-platform - Windows, Linux, macOS
  • Flexible Usage - CLI tool (no coding) or SDK (programmatic)

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Compatible target framework: net10.0. Platform-specific TFMs (net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows) are computed.

This package has no dependencies.

Version Downloads Last Updated
0.6.0 74 3/20/2026
0.5.0 95 2/21/2026
0.4.9 110 1/10/2026
0.4.8 163 11/16/2025
0.4.7 265 11/14/2025
0.4.5 297 11/13/2025
0.4.3 285 11/10/2025
0.4.0 210 11/3/2025
0.2.2 187 1/17/2025
0.2.1 169 1/16/2025
0.2.0 179 1/11/2025
0.1.1 192 12/16/2024
0.1.0 206 12/6/2024