CSVComparer 1.0.0

There is a newer version of this package available.
See the version list below for details.
dotnet add package CSVComparer --version 1.0.0
NuGet\Install-Package CSVComparer -Version 1.0.0
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.
<PackageReference Include="CSVComparer" Version="1.0.0" />
For projects that support PackageReference, copy this XML node into the project file to reference the package.
paket add CSVComparer --version 1.0.0
#r "nuget: CSVComparer, 1.0.0"
#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.
// Install CSVComparer as a Cake Addin
#addin nuget:?package=CSVComparer&version=1.0.0

// Install CSVComparer as a Cake Tool
#tool nuget:?package=CSVComparer&version=1.0.0

Build Status CodeQL

CSVComparison

A tool to compare 2 CSV files. Results of the comparison are saved to an output file or can be interrogated via an API. See here for more information and a list of updates

Some terminology

  • Left Hand Side file The first CSV file for the comparison.
  • Right Hand Side file The second CSV file for the comparison.
  • Key Unique definition of a single CSV Row. This can be made from one or more Columns
  • Break A single difference between the files. There may be multiple breaks.
  • Orphan A row in the reference file but not in the candidate file. Or vice-versa.
  • Value Break A difference in a column value between a row with matching key on Reference and Candidate file

Features

  • Any number of columns in the CSV can be defined as keys
  • CSV Files do not require any pre-sorting
  • Supports any delimiter, including string delimiters, e.g. "=="
  • Columns can be excluded (for example timestamps)
  • Rows can be excluded based on Key pattern matching
  • Orphans can be excluded based on pattern matching
  • Additional rows at the start and end of file can be excluded, for example rows with footer information
  • Supports Numeric tolerance, relative and absolute
  • Double Quotes in a row supported e.g. A,B,"C,D are one column",E

How to use

Run the CSVComparison executable with the following arguments

"Path to left hand side csv file" "Path to right hand side reference file" "Path to configuration file" "Optional Path to directory to save output

If no output file is specified the console will list the breaks between the files

Key:C, LeftHandSide Row:2, Value:2.5 != RightHandSide Row:2, Value:2.61

The output file will list the configuration used, the input files, time taken to run the comparison and a tabular view of the differences

<?xml version="1.0" encoding="utf-8"?>
<ComparisonDefinition xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <Delimiter>,</Delimiter>
  <KeyColumns>
    <Column>COL A</Column>
  </KeyColumns>
  <HeaderRowIndex>0</HeaderRowIndex>
  <ToleranceValue>0.1</ToleranceValue>
  <IgnoreInvalidRows>false</IgnoreInvalidRows>
  <ToleranceType>Relative</ToleranceType>
  <ExcludedColumns />
</ComparisonDefinition>

Date run: 11/01/2021 18:32:48
LeftHandSide: C:\temp\LeftHandSideDirectory\Test.csv
RightHandSide: C:\temp\RightHandSideDirectory\Test.csv
Number of LeftHandSide rows: 100001
Number of RightHandSide rows: 100001
Comparison took 906ms
Number of breaks 5

Break Type,Key - COL A,Column Name,LHS Row,LHS Value,RHS Row,RHS Value
ValueMismatch,1,COL B,2,A,2,"A,X"
ValueMismatch,7,COL D,8,32.1,8,42.1
ValueMismatch,77,COL B,78,B,78,A
RowInRHS_NotInLHS,100000,,-1,,100000,
RowInLHS_NotInRHS,99,,100,,-1,

In tabular form (open in a spreadsheet)

Break Type Key - COL A Column Name LHS Row LHS Value RHS Row RHS Value
ValueMismatch 1 COL B 2 A 2 "A,X"
ValueMismatch 7 COL D 8 32.1 8 42.1
ValueMismatch 77 COL B 78 B 78 A
RowInRHS_NotInLHS 100000 -1 100000
RowInLHS_NotInRHS 99 100 -1

Configuration

The configuration is used to define how to treat the CSV files:

<?xml version="1.0" encoding="utf-8"?>
<ComparisonDefinition xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <Delimiter>,</Delimiter>
  <KeyColumns>
    <Column>COL A</Column>
    <Column>COL B</Column>
  </KeyColumns>
  <ExcludedColumns>
    <Column>COL D</Column>
    <Column>COL E</Column>
  </ExcludedColumns>
  <IgnoreInvalidRows>true</IgnoreInvalidRows>
  <HeaderRowIndex>0</HeaderRowIndex>
  <ToleranceType>Relative</ToleranceType>
  <ToleranceValue>0.1</ToleranceValue>
  <OrphanExclusions>
    <ExclusionPattern>RegexPattern</ExclusionPattern>
  </OrphanExclusions>
  <KeyExclusions>
    <ExclusionPattern>RegexPattern</ExclusionPattern>
  </KeyExclusions>
</ComparisonDefinition>

Delimiter Allows other separaters, i.e. pipe '|' to be used

KeyColumns Lists the columns required to obtain a unique key for each row

ExcludedColumns List the columns to be excluded from the comparison

IgnoreInvalidRows If a row doesn't have the same number of columns (it may be a descriptive footer for example) do not include in the comparison

HeaderRowIndex Set the row for header columns, if some non csv data occurs at the start of the file

ToleranceType How to compare numeric values

ToleranceValue The tolerance to use for numeric values

OrphanExclusions A list of Regex Patterns used to exclude orphans whose key matches the pattern

KeyExclusions A list of Regex Patterns used to exclude Value breaks whose key matches the pattern

Directory comparison

If the LHS and RHS paths are directories you can compare multiple files. If the files have different structures a configuration can be created that can define all comparisons.

The FilePattern element is a Regex pattern that is used to determine the configuration.

<MultipleComparisonDefinition>
  <FileComparisonDefinitions>
    <Comparison>
      <Key>Test</Key>
      <FilePattern>Test</FilePattern>
      <ComparisonDefinition xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
        <Delimiter>,</Delimiter>
        <KeyColumns>
          <Column>COL A</Column>
        </KeyColumns>
        <HeaderRowIndex>0</HeaderRowIndex>
        <ToleranceValue>0.1</ToleranceValue>
        <ToleranceType>Relative</ToleranceType>
      </ComparisonDefinition>
    </Comparison>
    <Comparison>
      <Key>Test2</Key>
      <FilePattern>File</FilePattern>
      <ComparisonDefinition xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
        <Delimiter>,</Delimiter>
        <KeyColumns>
          <Column>COL 1</Column>
          <Column>COL 2</Column>
        </KeyColumns>
        <HeaderRowIndex>0</HeaderRowIndex>
        <ToleranceValue>0.1</ToleranceValue>
        <ToleranceType>Relative</ToleranceType>
      </ComparisonDefinition>
    </Comparison>
  </FileComparisonDefinitions>
</MultipleComparisonDefinition>

The comparison will check each file in the reference directory. If a Comparison is found using the file pattern an exact file match is first attempted in the candidate directory. If an exact match does not happen then a search for a single file that matches the pattern will be performed.

The comparison will be performed only if:

  • A comparison definition is found
  • Exactly one candidate file is found, either by exact match or file pattern match

API

To run from your own C# code:

 # You can either deserialize the comparison definition xml or create your own in code
 var comparisonDefinition = new ComparisonDefinition() { Delimiter = "," };
 comparisonDefinition.KeyColumns.Add("ABC");
 comparisonDefinition.KeyColumns.Add("DEF");

 var csvComparer = new CSVComparer(comparisonDefinition);
 var comparisonResult = csvComparer.CompareFiles(referenceDataFilePath, targetDataFilePath);

 # Add code to interrogate the comparison result.
 foreach(var breakDetail in comparisonResult.BreakDetails)
 {
      Console.WriteLine($"{breakDetail.BreakType} - {breakDetail.BreakDescription}");
 }

BenchmarkDotNet

Running a comparison of two files (containing 1000) rows. Generated using the TestDataGenerator tool

From a Release Build, open a command prompt in CSVComparer\Benchmark\bin\Release\net6.0 and run Benchmark.exe


BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1413/22H2/2022Update/SunValley2)
11th Gen Intel Core i7-11800H 2.30GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.202
  [Host]     : .NET 6.0.15 (6.0.1523.11507), X64 RyuJIT AVX2
  DefaultJob : .NET 6.0.15 (6.0.1523.11507), X64 RyuJIT AVX2


Method Mean Error StdDev Ratio Gen0 Gen1 Allocated Alloc Ratio
CompareIdentical 1,405,817.46 ns 21,308.380 ns 17,793.462 ns 1.000 54.6875 3.9063 683792 B 1.000
CompareDifferent 1,402,760.11 ns 11,363.135 ns 9,488.733 ns 0.998 54.6875 5.8594 690979 B 1.011
StringSplit 60.82 ns 0.655 ns 0.613 ns 0.000 0.0178 - 224 B 0.000
StringSplitWithQuotes 138.74 ns 1.322 ns 1.236 ns 0.000 0.0267 - 336 B 0.000

Legends

  • Mean : Arithmetic mean of all measurements
  • Error : Half of 99.9% confidence interval
  • StdDev : Standard deviation of all measurements
  • Ratio : Mean of the ratio distribution ([Current]/[Baseline])
  • RatioSD : Standard deviation of the ratio distribution ([Current]/[Baseline])
  • Gen0 : GC Generation 0 collects per 1000 operations
  • Gen1 : GC Generation 1 collects per 1000 operations
  • Allocated : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
  • Alloc Ratio : Allocated memory ratio distribution ([Current]/[Baseline])

Before changing to use ReadOnlySpan<char> for StringSplitWithQuotes, the benchmark was:

Method Mean Error StdDev Ratio RatioSD Gen0 Gen1 Allocated Alloc Ratio
StringSplitWithQuotes 352.94 ns 6.950 ns 6.501 ns 0.000 0.00 0.0267 - 336 B 0.000
Product Compatible and additional computed target framework versions.
.NET net6.0 is compatible.  net6.0-android was computed.  net6.0-ios was computed.  net6.0-maccatalyst was computed.  net6.0-macos was computed.  net6.0-tvos was computed.  net6.0-windows was computed.  net7.0 was computed.  net7.0-android was computed.  net7.0-ios was computed.  net7.0-maccatalyst was computed.  net7.0-macos was computed.  net7.0-tvos was computed.  net7.0-windows was computed.  net8.0 was computed.  net8.0-android was computed.  net8.0-browser was computed.  net8.0-ios was computed.  net8.0-maccatalyst was computed.  net8.0-macos was computed.  net8.0-tvos was computed.  net8.0-windows was computed. 
Compatible target framework(s)
Included target framework(s) (in package)
Learn more about Target Frameworks and .NET Standard.
  • net6.0

    • No dependencies.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last updated
1.0.1 403 6/19/2023
1.0.0 155 5/11/2023

First published version