Go to ...

Techpository

A Better Technical Repository

RSS Feed

Windows Server: Install and Configure Data Deduplication on Windows 2012


Install and Configure Data Deduplication

Published: February 29, 2012

Updated: August 29, 2012

Applies To: Windows Server 2012, Windows Storage Server 2012

 This document explains how to set up a server, enable data deduplication, optimize a volume, and carry out advanced deduplication operations.

  1. From the Add Roles and Features Wizard, under Server Roles, select File and Storage Services (if it has not already been installed).
  2. Select the File Services check box, and then select the Data Deduplication check box.
  3. Click Next until the Install button is active, and then click Install.
  1. Start Windows PowerShell. Right-click the Windows PowerShell icon on the taskbar, and then click Run as Administrator.
  2. Run the following Windows PowerShell commands:
    Copy
    PS C:> Import-Module ServerManager
    PS C:> Add-WindowsFeature -name FS-Data-Deduplication
    PS C:> Import-Module Deduplication

 

 

  1. From the Server Manager dashboard, right-click a data volume and choose Configure Data Deduplication. The Deduplication Settings page appears.
  2. Select the Enable data deduplication check box, enter the number of days that should elapse from the date of file creation until files are deduplicated, enter the extensions of any file types that should not be deduplicated, and then click Add to browse to any folders with files that should not be deduplicated.
  3. Click Apply to apply these settings and return to the Server Manager dashboard, or click the Set Deduplication Schedule button to continue to set up a schedule for deduplication.
  1. To enable deduplication on a volume, run the following Windows PowerShell command on the server. In this example deduplication is enabled on volume E.
    Copy
    PS C:> Enable-DedupVolume E:
  2. Optionally, set the minimum number of days that must pass before a file is deduplicated by using the following command.
    Copy
    PS C:> Set-Dedupvolume E: -MinimumFileAgeDays 20

    If you set MinimumFileAgeDays to 0, deduplication will process all files, regardless of their age. This is suitable for a test environment, where you want to exercise maximum deduplication. In a production environment, however, it is preferable to wait for a number of days (the default is 5 days), because files tend to change a lot for a brief period of time before the change rate slows. This allows for the most efficient use of your server resources.

To return a list of the volumes that have been enabled for data deduplication by using Windows PowerShell

Run the following Windows PowerShell commands on the server.

Copy
PS C:> Get-DedupVolume
PS C:> Get-DedupVolume | format-list

The first command returns summary information and the second returns details about the volume data deduplication settings.

 

 

 

 

In Windows Server 2012, data deduplication jobs can be run on demand (manually) or scheduled. There are three types of jobs that you can perform on a volume: Optimization, Data Scrubbing, and Garbage Collection.

 

 

The Data Deduplication feature comes with built-in jobs that will automatically launch and optimize the specified volume(s) on a regular basis. Optimization jobs deduplicate data and compress file chunks on a volume per the policy settings. After the initial optimization is complete, optimization jobs run on the files that are included in the policies, according to the job schedules that you have configured or the default job schedules that ship with the product.

You can trigger an optimization job on demand in Windows PowerShell by using the Start-DedupJob cmdlet. For example:

Copy
PS C:> Start-DedupJob –Volume E: –Type Optimization

This command returns immediately and the job is launched asynchronously. If you want the job to complete at a later time , add the –wait parameter, like this:

Copy
PS C:> Start-DedupJob E: –Type Optimization -Wait

You can query the progress of the job on the volume by using the Get-DedupJob cmdlet:

Copy
PS C:> Get-DedupJob

The Get-DedupJob command show current jobs that are running or are queued to run.

You can query the key status statistics including the achieved savings on the volume by using the Get-DedupStatus cmdlet:

Copy
PS C:> Get-DedupStatus | f1

The Get-DedupStatus command shows the free space, space saved, optimized files, InPolicyfiles (the number of files that fall within the volume deduplication policy, based on the defined file age, size, type, and location criteria), and the associated drive identifier.

noteNote
You can also view the deduplication savings in Server Manager on the Volumes page. From Server Manager, click File Services, and then click Volumes. Right-click the column heading to add Deduplication Savings.

Optimization job queuing

Optimization jobs are started in the following order:

  1. Preemptive (manually run jobs that are not scheduled)Any manual jobs that include the –Preempt option will terminate any jobs that are currently running, and start immediately. (Note that the –Preempt option is ignored in scheduled jobs.)
  2. StopWhenSystemBusy parameterJobs that contain this parameter will stop if resources are not available to run the job without interfering with the server’s workload.
  3. PriorityAmong jobs that do not have the same StopWhenSystemBusy setting, high priority jobs are queued first, normal jobs are queued second, and low priority job are queued last.
  4. Manual or scheduledManual jobs are queued before scheduled jobs.

Memory settings are not considered as part of the optimization job queue algorithm.

Optimization metadata

Metadata provides you with evidence about savings that you gleaned from using optimization. There are three cmdlets that output this metadata: Update-DedupStatus, Get-DedupMetadata, and Measure-DedupFileMetadata. This metadata can help you assess the impact of some optimization configuration options.

Update-DedupStatus returns the following metadata:

Metadata What it indicates
DedupSavedSpace Difference between the logical size of the optimized files and the logical size of the store (the deduplicated user data plus deduplication metadata). This number changes continually.
DedupRate Ratio of DedupSavedSpace to the logical size of all of the files on the volume, and it is expressed as a percentage. This number changes continually.
OptimizedFilesCount Number of optimized files on the specified volume. Note that this number will remain steady (instead of decrease) as users delete files from or add files to the volume, until you run a Garbage Collection job. This count is most accurate after a full garbage collection job runs.
OptimizedFilesSize Aggregate size of all optimized files on the specified volume. Note that this number remains steady (instead of decreasing) as users delete files from or add new files to the volume, until you run a garbage collection job. This number is most accurate after a full garbage collection job runs.
InPolicyFilesCount Number of files that currently qualify for optimization. This number stays relatively constant between optimization jobs.
InPolicyFilesSize Aggregate size of all files that currently qualify for optimization. This number stays relatively constant between optimization jobs.
LastOptimizationTime Date and time when an optimization job was last run on the specified volume. This date and time stays constant between optimization jobs.
LastGarbageCollectionTime Date and time when a garbage collection job was run last on the specified volume. This date and time stays constant between optimization jobs.
LastScrubbingTime Date and time when a scrubbing job was run last on the specified volume. This date and time stays constant between optimization jobs.

Get-DedupMetadata returns the following metadata:

Metadata What it indicates
DataChunkCount Number of data chunks on the volume.
DataContainerCount Number of containers in the data store.
DataChunkAverageSize Data store size (not including chunk metadata) divided by the total number of data chunks in the data store.
StreamMapCount Number of data streams on the volume.
StreamMapContainerCount Number of containers in the stream map store.
StreamMapAverageChunkCount Stream map store size divided by the total number of streams in the store.
HotspotCount Number of “hotspot” chunks on the volume. A hotspot is a chunk that is referenced over 100 times. All hotspot chunks are duplicated on the volume to provide automatic data corruption recovery in the event that corruption occurs on the disk and impacts one of these popular chunks.
HotspotContainerCount Number of hotspot containers.
CorruptionLogEntryCount Number of corrupted items on the volume.

 

 

Data Deduplication has built-in data integrity features such as checksum validation and metadata consistency checking. It also has built-in redundancy for critical metadata and the most popular data chunks. As data is accessed or jobs process data, these features may encounter corruption, and they will record the corruption in a log file. Scrubbing jobs use these features to analyze the chunk store corruption logs, and when possible, to make repairs. Possible repair operations include using three sources of redundant data:

  1. Deduplication keeps backup copies of popular chunks when they are referenced over 100 times in an area called the hotspot. If the working copy is corrupted, deduplication will use the backup.
  2. When using Storage Spaces in a mirrored configuration, deduplication can use the mirror image of the redundant chunk to serve the I/O and fix the corruption.
  3. If a file is processed with a chunk that is corrupted, the corrupted chunk is eliminated, and the new incoming chunk is used to fix the corruption.

Scrubbing jobs output a summary report in the Windows event log located here:

Event ViewerApplications and Services LogsMicrosoftWindowsDeduplicationScrubbing

Data Deduplication defaults create a data integrity scrubbing job on a weekly basis, but you can also trigger one on demand by using the following command:

Copy
PS C:> Start-DedupJob E: –Type Scrubbing

This initiates a job that attempts to repair all corruptions that were logged in to the deduplication internal corruption log, during the I/O to deduplication files. To check the data integrity of all the deduplicated data on the volume, use the -full parameter:

Copy
PS C:> Start-DedupJob E: –Type Scrubbing -full

Also known as Deep Scrubbing, the -full parameter will scrub the entire set of deduplicated data and look for all corruptions that are causing data access failures.

 

 

Data Deduplication includes garbage collection jobs to process deleted or modified data on the volume so that any data chunks no longer referenced are cleaned up. Garbage collection jobs process previously deleted or logically overwritten optimized content to create usable volume free space. When an optimized file is deleted or overwritten by new data, the old data in the chunk store is not immediately deleted.

noteNote
Garbage collection is a processing-intensive operation, so you should allow the deletion load to reach a threshold and then manually run this job type, or schedule it for off hours.

Garbage collection can also be triggered on demand. For example:

Copy
PS C:> Start-DedupJob E: –Type GarbageCollection

This command removes unreferenced chunks and compacts containers that have more than 5% unreferenced data. By adding the –full parameter, the job will compact all containers to the maximum extent possible. For example:

Copy
PS C:> Start-DedupJob E: –Type GarbageCollection -full

 

 

Data Deduplication comes with three schedules that are set up immediately. Optimization runs every hour, and Garbage Collection and Scrubbing are set for once a week. You can view the schedules by using this Windows PowerShell command:

Copy
PS C:> Get-DedupSchedule

Enabled    Type               StartTime              Days               Name
-------    ----               ---------              ----               ----
True       Optimization                                                 BackgroundOptimization
True       GarbageCollection  2:45 AM                Saturday           WeeklyGarbageCollection
True       Scrubbing          3:45 AM                Saturday           WeeklyScrubbing

Two additional schedules can be used immediately to add jobs. These job schedules run on all volumes on the server. If you want to run a job only on a particular volume, you must create a new job. You can create, modify, or view job schedules from the Deduplication Settings page in Server Manager, or by using the following Windows PowerShell commands:

Copy
PS C:> Set-DedupSchedule <ScheduleName> <properties>
PS C:> Remove-DedupSchedule <ScheduleName>
noteNote
Deduplication only supports weekly job scheduling. If you want to create a schedule for a monthly job or any other time period, use Windows Task Scheduler. However, you will be unable to view custom job schedules that are created or modified with Task Scheduler by using the Get-DedupSchedule cmdlet. Such schedules are not migrated with server upgrades.

The built-in job schedules that Data Deduplication handles are:

  • BackgroundModeOptimization   Use this job schedule to run an Optimization job with the following parameters:

    Parameter Value
    Enabled True
    Priority Low
    Memory 25
    ContinueWhenSystemBusy False
    ScheduledTask MicrosoftWindowsDeduplicationBackgroundModeOptimization
    Start 00:00:00
    Days {Mon,Tues,Wed,Thurs,Fri,Sat,Sun}
    Duration 0
    Repeat 1
  • ThroughputModeOptimization   Use this job schedule to run an Optimization job with the following parameters:

    Parameter Value
    Enabled False
    Priority Normal
    Memory 50
    ContinueWhenSystemBusy False
    ScheduledTask MicrosoftWindowsDeduplicationThroughputModeOptimization
    Start 00:00:00
    Days {Mon,Tues,Wed,Thurs,Fri,Sat,Sun}
    Duration 4
    Repeat 0
  • ThroughputModeOptimization2    Use this job schedule to run an Optimization job with the following parameters:

    Parameter Value
    Enabled False
    Priority Normal
    Memory 50
    ContinueWhenSystemBusy False
    ScheduledTask MicrosoftWindowsDeduplicationThroughputModeOptimization
    Start 00:00:00
    Days {Mon,Tues,Wed,Thurs,Fri,Sat,Sun}
    Duration 4
    Repeat 0
  • WeeklyGarbageCollection   This default setting is scheduled to run a Garbage Collection job with the following parameters:

    Parameter Value
    Enabled True
    Priority Normal
    Memory 50
    ContinueWhenSystemBusy False
    ScheduledTask MicrosoftWindowsDeduplicationWeeklyGarbageCollection
    Start 01:45:00
    Days {Sat}
    Duration 0
    Repeat 0
  • WeeklyScrubbing   Use this job schedule to run a Scrubbing job with the following parameters:

    Parameter Value
    Enabled True
    Priority Normal
    Memory 50
    ContinueWhenSystemBusy False
    ScheduledTask MicrosoftWindowsDeduplicationWeeklyScrubbing
    Start 02:45:00
    Days {Sat}
    Duration 0
    Repeat 0

Operational considerations

Some files cannot be read when the free disk space on a deduplicated volume approaches zero. To resolve this issue, do one of the following:

  • Run a Garbage Collection task to reclaim disk space.
  • Copy files elsewhere (if there is not a recent memory map of the files).
  • Run Robocopy.exe in non-cached Read mode to copy files elsewhere (if there is a recent memory map of the files). For more information about using Robocopy, see Robocopy.

Advanced deduplication policy considerations

With some server configurations, you may want to speed up deduplication. Here are some scenarios that might warrant additional job scheduling:

Condition Action to consider
Significant incoming data Add additional throughput optimization jobs.
More volumes than CPU core processors with significant incoming data Add additional throughput optimization jobs.
Data deletions exceed 50 GB per hour, and you want to get the free space back as quickly as possible. Add additional garbage collection jobs to reclaim the free space.

By: Microsoft