Windows Server: Install and Configure Data Deduplication on Windows 2012

Timothy Conrad

Install and Configure Data Deduplication

Published: February 29, 2012

Updated: August 29, 2012

Applies To: Windows Server 2012, Windows Storage Server 2012

This document explains how to set up a server, enable data deduplication, optimize a volume, and carry out advanced deduplication operations.

Step 1: Set up the server

To install deduplication components on the server by using Server Manager

From the Add Roles and Features Wizard, under Server Roles, select File and Storage Services (if it has not already been installed).
Select the File Services check box, and then select the Data Deduplication check box.
Click Next until the Install button is active, and then click Install.

To install deduplication components on the server by using Windows PowerShell

Start Windows PowerShell. Right-click the Windows PowerShell icon on the taskbar, and then click Run as Administrator.

Run the following Windows PowerShell commands:

PS C:\> Import-Module ServerManager
PS C:\> Add-WindowsFeature -Name FS-Data-Deduplication
PS C:\> Import-Module Deduplication
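
Before moving on, it can help to confirm that the feature is actually installed. The following is a minimal sketch, assuming that the Get-WindowsFeature cmdlet in the ServerManager module reports the feature state through its Installed property, as it does on a default Windows Server 2012 installation:

```powershell
# Check whether the Data Deduplication feature is present; install it only if missing.
Import-Module ServerManager
$feature = Get-WindowsFeature -Name FS-Data-Deduplication
if ($feature.Installed) {
    Write-Output "Data Deduplication is already installed."
} else {
    Add-WindowsFeature -Name FS-Data-Deduplication
}
```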

Step 2: Enable data deduplication

To enable data deduplication by using Server Manager

From the Server Manager dashboard, right-click a data volume and choose Configure Data Deduplication. The Deduplication Settings page appears.
Select the Enable data deduplication check box, enter the number of days that should elapse from the date of file creation until files are deduplicated, enter the extensions of any file types that should not be deduplicated, and then click Add to browse to any folders with files that should not be deduplicated.
Click Apply to apply these settings and return to the Server Manager dashboard, or click the Set Deduplication Schedule button to continue to set up a schedule for deduplication.

To enable data deduplication by using Windows PowerShell

To enable deduplication on a volume, run the following Windows PowerShell command on the server. In this example, deduplication is enabled on volume E.
PS C:\> Enable-DedupVolume E:
Optionally, set the minimum number of days that must pass before a file is deduplicated by using the following command.
PS C:\> Set-DedupVolume E: -MinimumFileAgeDays 20
If you set MinimumFileAgeDays to 0, deduplication will process all files, regardless of their age. This is suitable for a test environment, where you want to exercise maximum deduplication. In a production environment, however, it is preferable to wait for a number of days (the default is 5 days), because files tend to change a lot for a brief period of time before the change rate slows. This allows for the most efficient use of your server resources.
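
The file-type and folder exclusions described for Server Manager can also be set from Windows PowerShell. The following is a minimal sketch, assuming the -ExcludeFileType and -ExcludeFolder parameters of Set-DedupVolume. The extensions and the folder path are hypothetical placeholders, and the exact extension format that the cmdlet expects should be checked with Get-Help Set-DedupVolume:

```powershell
# Hypothetical exclusions: adjust the extensions and folder to your environment.
Set-DedupVolume -Volume E: `
    -MinimumFileAgeDays 5 `
    -ExcludeFileType 'edb', 'tmp' `
    -ExcludeFolder 'E:\Scratch'

# Confirm that the settings were applied.
Get-DedupVolume -Volume E: |
    Format-List Volume, Enabled, MinimumFileAgeDays, ExcludeFileType, ExcludeFolder
```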

To return a list of the volumes that have been enabled for data deduplication by using Windows PowerShell

Run the following Windows PowerShell commands on the server.

PS C:\> Get-DedupVolume
PS C:\> Get-DedupVolume | Format-List

The first command returns summary information and the second returns details about the volume data deduplication settings.

Step 3: Set data deduplication jobs

In Windows Server 2012, data deduplication jobs can be run on demand (manually) or scheduled. There are three types of jobs that you can perform on a volume: Optimization, Data Scrubbing, and Garbage Collection.

Optimization jobs

The Data Deduplication feature comes with built-in jobs that will automatically launch and optimize the specified volume(s) on a regular basis. Optimization jobs deduplicate data and compress file chunks on a volume per the policy settings. After the initial optimization is complete, optimization jobs run on the files that are included in the policies, according to the job schedules that you have configured or the default job schedules that ship with the product.

You can trigger an optimization job on demand in Windows PowerShell by using the Start-DedupJob cmdlet. For example:

PS C:\> Start-DedupJob -Volume E: -Type Optimization

This command returns immediately and the job is launched asynchronously. If you want the command to wait until the job completes, add the -Wait parameter, like this:

PS C:\> Start-DedupJob E: -Type Optimization -Wait

You can query the progress of the job on the volume by using the Get-DedupJob cmdlet:

PS C:\> Get-DedupJob

The Get-DedupJob command shows current jobs that are running or are queued to run.
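
If you start a job without -Wait, a script can still follow its progress by polling Get-DedupJob. The loop below is a sketch, assuming that Get-DedupJob returns nothing once no jobs remain for the volume and that each job object exposes Type, State, and Progress properties. The 30-second interval is arbitrary:

```powershell
$volume = 'E:'
Start-DedupJob -Volume $volume -Type Optimization | Out-Null

# Poll until no running or queued jobs remain for the volume.
while ($jobs = Get-DedupJob -Volume $volume -ErrorAction SilentlyContinue) {
    $jobs | Format-Table Type, State, Progress -AutoSize
    Start-Sleep -Seconds 30
}
Write-Output "No deduplication jobs remain on $volume."
```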

You can query the key status statistics including the achieved savings on the volume by using the Get-DedupStatus cmdlet:

PS C:\> Get-DedupStatus | fl

The Get-DedupStatus command shows the free space, space saved, optimized files, InPolicyFiles (the number of files that fall within the volume deduplication policy, based on the defined file age, size, type, and location criteria), and the associated drive identifier.

Note
You can also view the deduplication savings in Server Manager on the Volumes page. From Server Manager, click File Services, and then click Volumes. Right-click the column heading to add Deduplication Savings.

Optimization job queuing

Optimization jobs are started in the following order:

Preemptive (manually run jobs that are not scheduled): Any manual jobs that include the -Preempt option will terminate any jobs that are currently running, and start immediately. (Note that the -Preempt option is ignored in scheduled jobs.)
StopWhenSystemBusy parameter: Jobs that contain this parameter will stop if resources are not available to run the job without interfering with the server’s workload.
Priority: Among jobs that have the same StopWhenSystemBusy setting, high priority jobs are queued first, normal jobs are queued second, and low priority jobs are queued last.
Manual or scheduled: Manual jobs are queued before scheduled jobs.

Memory settings are not considered as part of the optimization job queue algorithm.
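
The ordering above can be sketched as a multi-key sort, treating each criterion as a tie-breaker for the one before it. The job objects below are hypothetical and are not produced by any deduplication cmdlet. The list does not say which StopWhenSystemBusy setting is queued first, so the sketch assumes that jobs without the parameter go ahead of jobs with it:

```powershell
# Hypothetical job descriptions, for illustration only.
$jobs = @(
    [pscustomobject]@{ Name = 'ScheduledLow';  Preempt = $false; StopWhenSystemBusy = $true;  Priority = 'Low';    Manual = $false }
    [pscustomobject]@{ Name = 'ManualNormal';  Preempt = $false; StopWhenSystemBusy = $false; Priority = 'Normal'; Manual = $true  }
    [pscustomobject]@{ Name = 'ManualPreempt'; Preempt = $true;  StopWhenSystemBusy = $false; Priority = 'Normal'; Manual = $true  }
    [pscustomobject]@{ Name = 'ScheduledHigh'; Preempt = $false; StopWhenSystemBusy = $false; Priority = 'High';   Manual = $false }
)

$priorityRank = @{ High = 0; Normal = 1; Low = 2 }

# Sort keys in the order of the list; $false sorts before $true.
$ordered = $jobs | Sort-Object -Property `
    @{ Expression = { -not $_.Preempt } },
    @{ Expression = { $_.StopWhenSystemBusy } },
    @{ Expression = { $priorityRank[$_.Priority] } },
    @{ Expression = { -not $_.Manual } }

$ordered | ForEach-Object { $_.Name }
```

Memory is deliberately left out of the sort keys, because memory settings do not affect the queue order.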

Optimization metadata

Metadata provides you with evidence about savings that you gleaned from using optimization. There are three cmdlets that output this metadata: Update-DedupStatus, Get-DedupMetadata, and Measure-DedupFileMetadata. This metadata can help you assess the impact of some optimization configuration options.

Update-DedupStatus returns the following metadata:

Metadata	What it indicates
DedupSavedSpace	Difference between the logical size of the optimized files and the logical size of the store (the deduplicated user data plus deduplication metadata). This number changes continually.
DedupRate	Ratio of DedupSavedSpace to the logical size of all of the files on the volume, and it is expressed as a percentage. This number changes continually.
OptimizedFilesCount	Number of optimized files on the specified volume. Note that this number remains steady (instead of decreasing) as users delete files from or add files to the volume, until you run a garbage collection job. This count is most accurate after a full garbage collection job runs.
OptimizedFilesSize	Aggregate size of all optimized files on the specified volume. Note that this number remains steady (instead of decreasing) as users delete files from or add new files to the volume, until you run a garbage collection job. This number is most accurate after a full garbage collection job runs.
InPolicyFilesCount	Number of files that currently qualify for optimization. This number stays relatively constant between optimization jobs.
InPolicyFilesSize	Aggregate size of all files that currently qualify for optimization. This number stays relatively constant between optimization jobs.
LastOptimizationTime	Date and time when an optimization job was last run on the specified volume. This date and time stays constant between optimization jobs.
LastGarbageCollectionTime	Date and time when a garbage collection job was last run on the specified volume. This date and time stays constant between garbage collection jobs.
LastScrubbingTime	Date and time when a scrubbing job was last run on the specified volume. This date and time stays constant between scrubbing jobs.
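
As a worked example of how DedupRate relates to DedupSavedSpace, the sketch below applies the definition from the table to made-up numbers. The function name and the sample sizes are hypothetical; the real values come from Update-DedupStatus:

```powershell
# Hypothetical helper: DedupRate = saved space / logical size of all files, as a percentage.
function Get-SampleDedupRate {
    param(
        [double]$SavedSpaceBytes,
        [double]$LogicalSizeBytes
    )
    if ($LogicalSizeBytes -le 0) { return 0 }
    [math]::Round(100 * $SavedSpaceBytes / $LogicalSizeBytes, 1)
}

# 40 GB saved on a volume that holds 100 GB of logical file data gives a rate of 40 percent.
Get-SampleDedupRate -SavedSpaceBytes 40GB -LogicalSizeBytes 100GB
```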

Get-DedupMetadata returns the following metadata:

Metadata	What it indicates
DataChunkCount	Number of data chunks on the volume.
DataContainerCount	Number of containers in the data store.
DataChunkAverageSize	Data store size (not including chunk metadata) divided by the total number of data chunks in the data store.
StreamMapCount	Number of data streams on the volume.
StreamMapContainerCount	Number of containers in the stream map store.
StreamMapAverageChunkCount	Stream map store size divided by the total number of streams in the store.
HotspotCount	Number of “hotspot” chunks on the volume. A hotspot is a chunk that is referenced over 100 times. All hotspot chunks are duplicated on the volume to provide automatic data corruption recovery in the event that corruption occurs on the disk and impacts one of these popular chunks.
HotspotContainerCount	Number of hotspot containers.
CorruptionLogEntryCount	Number of corrupted items on the volume.
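
The three cmdlets can be run as in the following sketch. The folder path passed to Measure-DedupFileMetadata is a hypothetical placeholder:

```powershell
# Refresh and display the savings statistics for volume E.
Update-DedupStatus -Volume E: | Format-List

# Display chunk store metadata for volume E.
Get-DedupMetadata -Volume E: | Format-List

# Report deduplication metadata for the files under one folder (hypothetical path).
Measure-DedupFileMetadata -Path 'E:\Shares\Projects' | Format-List
```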

Data Scrubbing jobs

Data Deduplication has built-in data integrity features such as checksum validation and metadata consistency checking. It also has built-in redundancy for critical metadata and the most popular data chunks. As data is accessed or jobs process data, these features may encounter corruption, and they will record the corruption in a log file. Scrubbing jobs use these features to analyze the chunk store corruption logs, and when possible, to make repairs. Possible repair operations include using three sources of redundant data:

When a chunk is referenced more than 100 times, deduplication keeps a backup copy of it in an area called the hotspot. If the working copy is corrupted, deduplication will use the backup.
When using Storage Spaces in a mirrored configuration, deduplication can use the mirror image of the redundant chunk to serve the I/O and fix the corruption.
If a file is processed with a chunk that is corrupted, the corrupted chunk is eliminated, and the new incoming chunk is used to fix the corruption.

Scrubbing jobs output a summary report in the Windows event log located here:

Event Viewer\Applications and Services Logs\Microsoft\Windows\Deduplication\Scrubbing
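
The same report can be read from Windows PowerShell. This sketch assumes that the channel shown in Event Viewer is exposed under the log name Microsoft-Windows-Deduplication/Scrubbing; the first command lists the matching logs so that you can confirm the name before relying on it:

```powershell
# List deduplication-related event logs to confirm the channel name.
Get-WinEvent -ListLog 'Microsoft-Windows-Deduplication*' |
    Select-Object LogName, RecordCount

# Show the ten most recent scrubbing events (the log name is an assumption; see above).
Get-WinEvent -LogName 'Microsoft-Windows-Deduplication/Scrubbing' -MaxEvents 10 |
    Format-List TimeCreated, Id, LevelDisplayName, Message
```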

By default, Data Deduplication schedules a data integrity scrubbing job to run weekly, but you can also trigger one on demand by using the following command:

PS C:\> Start-DedupJob E: -Type Scrubbing

This initiates a job that attempts to repair all corruptions that were logged in the internal deduplication corruption log during I/O to deduplicated files. To check the data integrity of all the deduplicated data on the volume, use the -Full parameter:

PS C:\> Start-DedupJob E: -Type Scrubbing -Full

A full scrubbing job, also known as deep scrubbing, scrubs the entire set of deduplicated data and looks for all corruptions that are causing data access failures.

Garbage Collection jobs

Data Deduplication includes garbage collection jobs to process deleted or modified data on the volume so that any data chunks no longer referenced are cleaned up. Garbage collection jobs process previously deleted or logically overwritten optimized content to create usable volume free space. When an optimized file is deleted or overwritten by new data, the old data in the chunk store is not immediately deleted.

Note
Garbage collection is a processing-intensive operation, so you should allow the deletion load to reach a threshold and then manually run this job type, or schedule it for off hours.

Garbage collection can also be triggered on demand. For example:

PS C:\> Start-DedupJob E: -Type GarbageCollection

This command removes unreferenced chunks and compacts containers that have more than 5% unreferenced data. With the -Full parameter, the job compacts all containers to the maximum extent possible. For example:

PS C:\> Start-DedupJob E: -Type GarbageCollection -Full
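
To make the 5% rule concrete, the sketch below selects the containers that a regular garbage collection job would compact, and models a full job as considering every container that holds any unreferenced data. The container objects and sizes are hypothetical, and the modeling of the full job is an interpretation; the actual selection happens inside the deduplication engine:

```powershell
# Hypothetical containers; sizes are in MB and are made up for illustration.
$containers = @(
    [pscustomobject]@{ Id = 1; SizeMB = 1000; UnreferencedMB = 20  }   # 2% unreferenced
    [pscustomobject]@{ Id = 2; SizeMB = 1000; UnreferencedMB = 80  }   # 8% unreferenced
    [pscustomobject]@{ Id = 3; SizeMB = 1000; UnreferencedMB = 500 }   # 50% unreferenced
)

$threshold = 0.05

# Regular garbage collection: only containers above the 5% threshold are compacted.
$regular = $containers | Where-Object { ($_.UnreferencedMB / $_.SizeMB) -gt $threshold }

# Full garbage collection, modeled as: any container with unreferenced data.
$full = $containers | Where-Object { $_.UnreferencedMB -gt 0 }

"Regular: containers $($regular.Id -join ', ')"
"Full:    containers $($full.Id -join ', ')"
```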

Step 4: Set data deduplication schedules

Data Deduplication comes with three schedules that are set up immediately. Optimization runs every hour, and Garbage Collection and Scrubbing are set for once a week. You can view the schedules by using this Windows PowerShell command:

PS C:\> Get-DedupSchedule

Enabled    Type               StartTime              Days               Name
-------    ----               ---------              ----               ----
True       Optimization                                                 BackgroundOptimization
True       GarbageCollection  2:45 AM                Saturday           WeeklyGarbageCollection
True       Scrubbing          3:45 AM                Saturday           WeeklyScrubbing

Two additional schedules can be used immediately to add jobs. These job schedules run on all volumes on the server. If you want to run a job only on a particular volume, you must create a new job. You can create, modify, or view job schedules from the Deduplication Settings page in Server Manager, or by using the following Windows PowerShell commands:

PS C:\> Set-DedupSchedule <ScheduleName> <properties>
PS C:\> Remove-DedupSchedule <ScheduleName>
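
For example, the disabled ThroughputModeOptimization schedule could be turned on for a nightly window. This is a sketch, assuming the -Enabled, -Start, -DurationHours, and -Days parameters of Set-DedupSchedule. The start time, the duration, and the schedule name OldCustomSchedule are hypothetical:

```powershell
# Enable the built-in throughput schedule for a six-hour window on weeknights.
Set-DedupSchedule -Name 'ThroughputModeOptimization' `
    -Enabled $true `
    -Start '23:00' `
    -DurationHours 6 `
    -Days Monday, Tuesday, Wednesday, Thursday, Friday

# Remove a custom schedule that is no longer needed (hypothetical name).
Remove-DedupSchedule -Name 'OldCustomSchedule'

# Review the result.
Get-DedupSchedule
```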

Note
Deduplication only supports weekly job scheduling. If you want to create a schedule for a monthly job or any other time period, use Windows Task Scheduler. However, the Get-DedupSchedule cmdlet does not show custom job schedules that are created or modified with Task Scheduler. Such schedules are not migrated with server upgrades.

The built-in job schedules that Data Deduplication handles are:

BackgroundModeOptimization Use this job schedule to run an Optimization job with the following parameters:

Parameter	Value
Enabled	True
Priority	Low
Memory	25
ContinueWhenSystemBusy	False
ScheduledTask	\Microsoft\Windows\Deduplication\BackgroundModeOptimization
Start	00:00:00
Days	{Mon,Tues,Wed,Thurs,Fri,Sat,Sun}
Duration	0
Repeat	1

ThroughputModeOptimization Use this job schedule to run an Optimization job with the following parameters:

Parameter	Value
Enabled	False
Priority	Normal
Memory	50
ContinueWhenSystemBusy	False
ScheduledTask	\Microsoft\Windows\Deduplication\ThroughputModeOptimization
Start	00:00:00
Days	{Mon,Tues,Wed,Thurs,Fri,Sat,Sun}
Duration	4
Repeat	0

ThroughputModeOptimization2 Use this job schedule to run an Optimization job with the following parameters:

Parameter	Value
Enabled	False
Priority	Normal
Memory	50
ContinueWhenSystemBusy	False
ScheduledTask	\Microsoft\Windows\Deduplication\ThroughputModeOptimization
Start	00:00:00
Days	{Mon,Tues,Wed,Thurs,Fri,Sat,Sun}
Duration	4
Repeat	0

WeeklyGarbageCollection This default setting is scheduled to run a Garbage Collection job with the following parameters:

Parameter	Value
Enabled	True
Priority	Normal
Memory	50
ContinueWhenSystemBusy	False
ScheduledTask	\Microsoft\Windows\Deduplication\WeeklyGarbageCollection
Start	01:45:00
Days	{Sat}
Duration	0
Repeat	0

WeeklyScrubbing Use this job schedule to run a Scrubbing job with the following parameters:

Parameter	Value
Enabled	True
Priority	Normal
Memory	50
ContinueWhenSystemBusy	False
ScheduledTask	\Microsoft\Windows\Deduplication\WeeklyScrubbing
Start	02:45:00
Days	{Sat}
Duration	0
Repeat	0

Operational considerations

Some files cannot be read when the free disk space on a deduplicated volume approaches zero. To resolve this issue, do one of the following:

Run a Garbage Collection task to reclaim disk space.
Copy files elsewhere (if there is not a recent memory map of the files).
Run Robocopy.exe in non-cached Read mode to copy files elsewhere (if there is a recent memory map of the files). For more information about using Robocopy, see Robocopy.
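
A sketch of the first and third options follows. It assumes that the non-cached read mode corresponds to Robocopy's /J (unbuffered I/O) switch, and the source and destination paths are hypothetical:

```powershell
# Reclaim space from deleted or overwritten optimized content, and wait for the job to finish.
Start-DedupJob -Volume E: -Type GarbageCollection -Wait

# Copy files off the volume using unbuffered I/O (assumed equivalent of non-cached read mode).
robocopy 'E:\Shares\Projects' 'F:\Rescue\Projects' /E /J
```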

Advanced deduplication policy considerations

With some server configurations, you may want to speed up deduplication. Here are some scenarios that might warrant additional job scheduling:

Condition	Action to consider
Significant incoming data	Add additional throughput optimization jobs.
More volumes than CPU cores, with significant incoming data	Add additional throughput optimization jobs.
Data deletions exceed 50 GB per hour, and you want to get the free space back as quickly as possible.	Add additional garbage collection jobs to reclaim the free space.
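
Additional jobs of the kinds listed in the table can be added as new schedules. This sketch assumes the New-DedupSchedule cmdlet with its -Name, -Type, -Days, -Start, -DurationHours, -Priority, and -Memory parameters. The schedule names, times, and values are hypothetical and should be tuned to the server's workload:

```powershell
# Extra throughput optimization window on weekend mornings.
New-DedupSchedule -Name 'WeekendThroughput' -Type Optimization `
    -Days Saturday, Sunday -Start '06:00' -DurationHours 8 `
    -Priority Normal -Memory 50

# Extra mid-week garbage collection to return free space sooner.
New-DedupSchedule -Name 'MidweekGarbageCollection' -Type GarbageCollection `
    -Days Wednesday -Start '01:00' -DurationHours 4 -Priority Normal
```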