Combine Multiple CSV Files into Large CSV Files

I have a BUNCH of csv files that I'd like to condense into fewer files; however, I'd like to start a new file each time a file reaches 1,000,000 rows. So I'd end up with something like "outputfile-1.csv" containing the first million rows, then "outputfile-2.csv" containing the next million rows, and so on. How can this be accomplished?

I have (gc *.csv) | Set-Content output.csv

And that works, but I don't know how to make it write only 1 million rows per csv file.

August 31st, 2015 12:00pm

Import the CSV, then extract the rows:

$csv=Import-Csv filename
$csv[0..(1000000-1)]|Export-Csv file1

August 31st, 2015 12:18pm

I have a TON of csv files, so could I do this rather than specifying a file name?

And if so, how do I append a number to the end of each output file? Like output1.csv, output2.csv, etc.

$csv=(gc *.csv)

$csv[0..(1000000-1)] | Set-Content output.csv

August 31st, 2015 12:41pm

No. You cannot use text operations to manage a CSV file.

If all files are identical in structure, you can do this:

$csv = Import-Csv *.csv

August 31st, 2015 2:29pm

# Import every CSV in the folder into one in-memory collection of row objects
$csv = Get-ChildItem *.csv | Select -Expand FullName | Import-Csv
$rows = 1000000
$i = 0
$filenum = 1
do {
    # Build a zero-padded output name such as output001.csv
    $filename = "output{0:D3}.csv" -f $filenum
    # Export the next $rows rows; the index expression also advances $i by $rows
    $csv[$i..(($i+=$rows) - 1)] | Export-Csv $filename -NoType
    $filenum++
}
until ($i -ge $csv.Count - 1)

August 31st, 2015 3:04pm

Maybe what I am trying to do is simply not possible? I tried the code Leif-Arne posted. I left it running overnight, and 17 hours after I started it nothing has happened and PowerShell is "not responding".
September 1st, 2015 8:14am

I think you need a database or a bigger machine with more memory.

SQL Express can do millions of records on even small machines.

The problem with CSVs, arrays and other scripting structures is that they exist only in memory. This causes swapping, which slows the whole thing down to a crawl.

Always use a database for large datasets.

Back to the drawing board, eh?

September 1st, 2015 8:25am

Hi John,

jrv is absolutely right that this is really a job for a database. For the sake of PowerShell technique, though, here's an example of how you can do this without overloading your memory:

The basic concept is to keep it all within a single pipeline, so that the lines of text flow through and get emitted by the final function as they come. This means all the data may only flow through the Process block of each involved function/cmdlet. Here's an example:

function Select-Set
{
	[CmdletBinding()]
	Param (
		[Parameter(ValueFromPipeline = $true)]
		$InputObject,
		
		[int]
		$Count
	)
	
	Begin
	{
		$list = New-Object PSObject -Property @{ List = @() }
	}
	
	Process
	{
		foreach ($Object in $InputObject)
		{
			$list.List += $Object
			# Emit the set once it reaches the requested size, then start a fresh one
			if ($list.List.Length -ge $Count)
			{
				$list
				$list = New-Object PSObject -Property @{ List = @() }
			}
		}
	}
	
	End
	{
		# Emit whatever is left over in the final, partial set
		if ($list.List.Length -gt 0)
		{
			$list
		}
	}
}

function Set-SerialContent
{
	[CmdletBinding()]
	Param (
		[Parameter(ValueFromPipeline = $true)]
		$InputObject,
		
		[string]
		$Path,
		
		[string]
		$Name,
		
		[string]
		$Extension
	)
	
	Begin
	{
		$count = 0
	}
	
	Process
	{
		foreach ($Object in $InputObject)
		{
			# Write each incoming set to its own numbered file
			$Object.List | Set-Content "$Path\$($Name)$($count).$($Extension)"
			$count++
		}
	}
}


Get-Content "*.csv" | Select-Set -Count 1000000 | Set-SerialContent -Path "C:\Output" -Name "Csv_" -Extension "csv"

Please note that this is not designed to be the most aesthetic code, but a proof of concept. The two helper functions are designed according to the basic rule I explained above: all objects passed through only do so on the pipeline. By dint of necessity, Select-Set keeps up to the number of lines specified in memory (so you need that much memory available).

Furthermore, for a tailored fit, I could have combined the two functions into one (and saved myself some trouble with collections and pipelines), but I like to keep separate functionalities in separate functions. A bit of pointless vanity, perhaps?

Anyway, this may work, but even if it does, there's no way its performance will top (or even approach) that of a database.

Cheers,
Fred

September 1st, 2015 8:54am

I agree that a method that streams would work; however, these are CSVs with headers, and the individual files are not all 1,000,000 lines but need to be combined.

Yes, there is a streaming solution, but it is complex. A database would remove the complexity and provide the performance. Writing a solution is unnecessary as long as we have databases.

I can write this in a streaming format in about 15 minutes.  I won't.  Not interested.

Any takers?

Use a database.
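
For anyone who does want to attempt it, a rough, untested sketch of such a streaming approach might look like the following. It assumes every input file shares an identical header line, and the combined- output name is just a placeholder:

# Streaming sketch: read line by line, skip repeated headers, and roll over to a
# new output file every $maxRows data rows. Untested; all names are examples, and
# in practice the output should go to a different folder than the input.
$maxRows  = 1000000
$fileNum  = 1
$rowCount = 0
$header   = $null
$writer   = $null

foreach ($file in (Get-ChildItem *.csv)) {
    $firstLine = $true
    foreach ($line in [System.IO.File]::ReadLines($file.FullName)) {
        if ($firstLine) {
            # Capture the header from the first file; skip it in every other file
            if (-not $header) { $header = $line }
            $firstLine = $false
            continue
        }
        if (-not $writer -or $rowCount -ge $maxRows) {
            # Close the current chunk and start the next one, writing the header first
            if ($writer) { $writer.Close() }
            $outPath = Join-Path (Get-Location) ("combined-{0}.csv" -f $fileNum)
            $writer  = New-Object System.IO.StreamWriter($outPath)
            $writer.WriteLine($header)
            $fileNum++
            $rowCount = 0
        }
        $writer.WriteLine($line)
        $rowCount++
    }
}
if ($writer) { $writer.Close() }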

September 1st, 2015 9:08am

My end goal here is to get these into MS SQL. I couldn't find a way in MS SQL to import multiple csv files at once, so I wanted to get the number of files down to as few as possible. I also found that the MS SQL import parsed the headers better when the file had been converted to Excel, which is why I need the 1 million rows per file. Excel can only handle just over 1 million rows. :-)

I ended up getting it to work by doing small chunks at a time and using the original code that Leif-Arne posted. I just separated my csv files into folders containing 400 files per folder; the script can work through that in no time at all. (I'm running this on a pretty beefy system with lots of RAM and SSDs.)
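
Roughly, the batching can be done with something like this (the batch folder names below are just examples):

# Move the source files into subfolders of 400 so each batch stays small enough
# to import comfortably; the "batch" folder names are only examples.
$batchSize = 400
$files = Get-ChildItem *.csv
$batchNum = 1
for ($i = 0; $i -lt $files.Count; $i += $batchSize) {
    $folder = "batch$batchNum"
    New-Item -ItemType Directory -Path $folder -Force | Out-Null
    $last = [Math]::Min($i + $batchSize, $files.Count) - 1
    $files[$i..$last] | Move-Item -Destination $folder
    $batchNum++
}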

September 1st, 2015 10:29am

BulkCopy can load multiple CSV files at very high speed. I have loaded at more than one hundred thousand records per second. I built a system that bulk loaded batches of files; we loaded more than 250 million records a day between 6:00 and 7:00.

BulkCopy is what you want to use. Don't pay attention to the idiots who claim it cannot be used with CSV files.

September 1st, 2015 10:45am

SQL questions should be posted in a SQL forum rather than here (scripting forum).
September 1st, 2015 12:04pm

Here is another version that uses a CSV as an example:

Param (
	[parameter(Mandatory = $true)]
	[string]$CsvPath,
	[parameter(Mandatory = $true)]
	[string]$CsvFile,
	[parameter(Mandatory = $true)]
	[string]$TargetServer,
	[parameter(Mandatory = $true)]
	[string]$TargetDatabase,
	[parameter(Mandatory = $true)]
	[string]$TargetTable,
	[switch]$Truncate
	#$ColumnMappings
)

Begin {
	
	Function Get-CsvReader {
		Param (
			$csvPath,
			$csvFile
		)
		
		# The ACE text driver exposes each file as a "table" named like file#csv
		$csvFile = $csvFile.Replace('.', '#')
		$connStr = 'Provider=Microsoft.ACE.OLEDB.12.0;Data Source={0};Extended Properties="Text;HDR=Yes;FORMAT=Delimited"' -f $csvPath
		$csvConnection = New-Object System.Data.OleDb.OleDbConnection($connStr)
		$csvConnection.Open()
		$csvCommand = $csvConnection.CreateCommand()
		$csvCommand.CommandText = "select * from [$csvFile]"
		$csvCommand.ExecuteReader()
	}
}

Process {
	Try {
		$csvReader = Get-CsvReader -csvPath $CsvPath -csvFile $CsvFile
		$targetConnStr = "Data Source=$TargetServer;Initial Catalog=$TargetDatabase;Integrated Security=True;"
		$bulkCopy = New-Object Data.SqlClient.SqlBulkCopy($targetConnStr, [System.Data.SqlClient.SqlBulkCopyOptions]::KeepIdentity)
		$bulkCopy.DestinationTableName = $TargetTable
		$bulkCopy.WriteToServer($csvReader)
		Write-Host "Table $SrcTable in $SrcDatabase database on $SrcServer has been copied to table $DestTable in $DestDatabase database on $DestServer"
	} Catch {
		throw $_
	} Finally {
		$csvReader.Close()
		$bulkCopy.Close()
	}
}

Between the two you should be able to see how to do this.
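
For example, if the script above were saved as Import-CsvBulkCopy.ps1 (a made-up name), a call might look like this, with every value below being a placeholder:

# Hypothetical invocation; the script name and all parameter values are placeholders.
.\Import-CsvBulkCopy.ps1 -CsvPath 'C:\Data\Csv' -CsvFile 'output001.csv' `
    -TargetServer 'SQLSERVER01' -TargetDatabase 'Staging' -TargetTable 'dbo.ImportedRows'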

September 1st, 2015 1:40pm

So you are actually leveraging PowerShell to call the bulk import into SQL? Okay, I'll give this a shot. Next time I should post my end goal and not just the baby step that I thought I needed to take to get there. :-)
September 1st, 2015 3:13pm

You are going to have to modify the examples to fit your scenario.

September 1st, 2015 3:20pm
