AWS Partner Network (APN) Blog

How to Upgrade Large Windows 2008 R2 Workloads in Place

By Eran Sharon, Release Engineering Manager at Salesforce

As you are probably aware, Windows 2008 R2 reached the end of Microsoft support on January 14, 2020. Keeping your instances on this old version requires expensive extended support contracts and puts your workloads at security risk.

On the other hand, upgrading legacy systems in the cloud can seem daunting and time-consuming. However, with the right combination of AWS Systems Manager, AWS Directory Service, and a few simple AWS Lambda functions, you can move off your outdated Windows environments smoothly, seamlessly, and at scale.

At Salesforce, I worked as an engineer on a large project that ran in-place upgrades for legacy Windows 2008 R2 workloads running on Amazon Web Services (AWS). Salesforce is an AWS Advanced Technology Partner with the AWS DevOps Competency.

Due to the tight project timelines and the scale of the work, we had to come up with an automated process that would complete in-place upgrades using AWS Systems Manager with minimal service disruption.

The results were so good that I decided to share my experience. In this post, I will show you how to run an in-place upgrade of your Windows 2008 R2 production instances to Windows 2019.

Overview

By orchestrating the process correctly, I was able to achieve almost zero downtime. I moved large-scale production environments that had been running for many years on Windows 2008 R2 to Windows 2019, without the need to recreate them from scratch and risk change management issues.

To do this, I started with this out-of-the-box (OOTB) AWS Systems Manager document (login required) provided by AWS, and built on it in three steps to support large batch upgrades:

  1. Enhance the OOTB AWS Systems Manager document — I added additional steps to the document to achieve a working in-place upgrade flow.
  2. Separate the production workloads from the temporary upgrade workloads — I created a dedicated WindowsUpgrade virtual private cloud (VPC) with AWS Directory Service. I did this to provide full separation during the upgrade process between the temporary upgrade instance and the production source instance.
  3. Orchestrate the cutover — I used AWS Lambda functions to orchestrate the replacement of the original root volumes with the upgraded root volumes.

Following, I will walk through each step in this process.

Step 1: Enhance the OOTB AWS Systems Manager Document

The existing OOTB AWS Systems Manager document completes the upgrade process successfully, but only if you manually apply additional steps before and after it runs. Alternatively, you can enhance the OOTB document itself to perform those prerequisites automatically as part of the automation flow. That’s what I did, with two enhancements.
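If you want to take the same approach, one way to produce an enhanced copy is to export the content of the OOTB document and register it under your own name. The following is a minimal sketch, assuming the OOTB document is named AWSEC2-CloneInstanceAndUpgradeWindows (verify against the document linked above) and that you edit the exported YAML before registering it; the custom document name is illustrative.

# Hedged sketch: export the AWS-provided automation document and register an
# enhanced copy. The OOTB document name is an assumption; verify it first.
import boto3

ssm = boto3.client('ssm', region_name='us-east-1')

# Pull the OOTB document content as YAML.
ootb_content = ssm.get_document(
    Name='AWSEC2-CloneInstanceAndUpgradeWindows',  # assumed OOTB document name
    DocumentFormat='YAML',
)['Content']

# ... edit the YAML here to add steps such as RemoveServerRoles and
# FixRegistrySettings (described in the enhancements below) ...
enhanced_content = ootb_content

# Register the enhanced copy as your own automation document.
ssm.create_document(
    Content=enhanced_content,
    Name='MyWindows2008UpgradeAutomation',  # illustrative name
    DocumentType='Automation',
    DocumentFormat='YAML',
)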

First Enhancement

The first enhancement was to remove these server roles:

  • Remote Desktop Session Host (RDSH)
  • Remote Desktop Connection Broker (RDCB)
  • Remote Desktop Virtualization Host (RDVH)
  • Remote Desktop Web Access (RDWA)

In my version of the OOTB document, I created a separate RunCommand document named RemoveServerRolesBeforeWindows2008Upgrade, which performs this removal, and invoked it from the main automation document right before the actual upgrade starts.

This RunCommand document uses the aws:runPowerShellScript action and runs the following script (content shown in JSON):

"mainSteps": [
   {
      "action": "aws:runPowerShellScript",
      "name": "example",
      "inputs": {
         "runCommand": [
            "Import-Module Servermanager",
            "Remove-WindowsFeature RDS-RD-Server,RDS-Web-Access,RDS-Connection-Broker,RDS-Gateway -restart"
         ]
      }
   }
]

I added it to the main automation flow as a new step called RemoveServerRoles, invoked right before the step runWindowsServerUpgrade (in YAML):

- name: RemoveServerRoles
  action: 'aws:runCommand'
  inputs:
    DocumentName: RemoveServerRolesBeforeWindows2008Upgrade
    InstanceIds:
      - '{{ getServerUpgradeInstance.InstanceId }}'
  isCritical: 'true'
  onFailure: 'step:deleteServerUpgradeInstance'
  nextStep: sleepBeforeWindowUpgradeAndStart

Second Enhancement

The second enhancement was to add a step after the upgrade that restores the ability to copy and paste files when connecting to the instances over Microsoft Remote Desktop Protocol (RDP). By default, this capability is disabled after in-place operating system (OS) upgrades.

I added this capability by creating a RunCommand document named FixRegistryToAllowCopyPaste and adding it to the main automation flow right after the upgrade process completes.

This RunCommand document also uses the aws:runPowerShellScript action and runs the following script (content shown in JSON):

"mainSteps": [
   {
      "action": "aws:runPowerShellScript",
      "name": "example",
      "inputs": {
         "runCommand": [
            "Get-Command reg",
            "reg add \"HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Control\\Terminal Server\\WinStations\\RDP-Tcp\" /v fDisableCdm /t REG_DWORD /d 0 /f",
            "reg add \"HKEY_LOCAL_MACHINE\\SOFTWARE\\Policies\\Microsoft\\Windows NT\\Terminal Services\" /v fDisableCdm /t REG_DWORD /d 0 /f",
            "Restart-Computer"
         ]
      }
   }
]

I added it to the main automation flow as a new step called FixRegistrySettings, invoked right after the step deletePreBackupAMIFromDriverUpgrade (YAML):

- name: deletePreBackupAMIFromDriverUpgrade
  action: 'aws:deleteImage'
  inputs:
    ImageId: '{{ getPreBackUpAMIFromDriverUpgrade.ImageId }}'
  maxAttempts: 3
  isCritical: 'false'
  onFailure: Continue
  timeoutSeconds: 600
  nextStep: FixRegistrySettings
- name: FixRegistrySettings
  action: 'aws:runCommand'
  inputs:
    DocumentName: FixRegistryToAllowCopyPaste
    InstanceIds:
      - '{{ getServerUpgradeInstance.InstanceId }}'
  isCritical: 'true'
  onFailure: Continue
  nextStep: sleep1

These two enhancements expand the automation flow and reduce the manual steps needed before and after running the automation. In my case, I added even more steps, including backing up and restoring Internet Information Services (IIS) settings and upgrading the .NET version after the upgrade.

Step 2: Separate Production Workloads from Temporary Upgrade Workloads

When you run the upgrade process on instances that are part of an Active Directory domain and assign the temporary upgrade instance to the same subnet as the original instance, you wind up with two instances in the same VPC owning the same host name.

This creates a naming conflict that impacts your production workloads for the duration of the upgrade process, which can take several hours.

To avoid this problem, I took this approach:

  1. Created a dedicated VPC, subnets, and security groups for the Windows Upgrade (for example, WinUpgradeVPC, WinUpgradeSubnet1, WinUpgradeSubnet2).
  2. Created a managed Active Directory within the Windows Upgrade VPC using AWS Directory Service.
  3. Set up the AWS Systems Manager upgrade document to assign one of the Windows Upgrade subnets to the temporary instance that’s created by the automation flow to run the upgrade (see the sketch after Figure 1).

The idea behind this approach is to create a full separation between the original instances that are being upgraded and the temporary instances that are launched by the AWS Systems Manager automation.

This allows the upgrade process to progress smoothly in the dedicated VPC, registering the temporary parallel instances to the managed Active Directory without any conflict or interference with the production workloads being upgraded.

Figure 1 – WinUpgrade AD created within the WinUpgrade VPC and assigned to the WinUpgrade Subnets.
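To wire this separation into the automation, you pass the Windows Upgrade subnet to the automation execution so the temporary instance launches there rather than next to the source instance. Here is a minimal sketch of starting the execution with boto3; the document name and the InstanceId/SubnetId parameter keys are assumptions, so match them to the parameters your enhanced document actually exposes.

# Hedged sketch: start the upgrade automation so its temporary instance lands
# in the dedicated WinUpgrade subnet. Document and parameter names are
# assumptions; align them with your enhanced automation document.
import boto3

ssm = boto3.client('ssm', region_name='us-east-1')

execution = ssm.start_automation_execution(
    DocumentName='MyWindows2008UpgradeAutomation',  # illustrative name
    Parameters={
        'InstanceId': ['i-0123456789abcdef0'],      # source Windows 2008 R2 instance
        'SubnetId': ['subnet-0aaa1111bbb22222c'],   # WinUpgradeSubnet1, not the production subnet
    },
)
print('Automation started:', execution['AutomationExecutionId'])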

Step 3: Orchestrate the Cutover

If you have reached this step, it means by now you have an Amazon Machine Image (AMI) that contains an upgraded root volume for each and every one of your target instances.

Figure 2 – AMI with upgraded root volumes.

Everything I did up to this point happened in the background and did not interfere with production workloads. From this point forward, it was crunch time.

It’s time to orchestrate the replacement of the production instances’ root volumes with the upgraded ones. To have minimal impact on the business, the instances should be replaced as quickly as possible, with no mistakes.

I developed three simple Lambda functions that can orchestrate the root volume replacement process by using proper tagging:

  • 2008UpgradeDriveSwap_Prep function
  • 2008UpgradeDriveSwap_Execute function
  • 2008UpgradeDriveSwap_RollBack function

2008UpgradeDriveSwap_Prep Function

This function extracts the root volume from the AMI. It iterates over all the instances whose CustomerName tag matches the value entered in the Lambda parameters, searches for a corresponding AMI that has gone through the in-place upgrade process (matched by InstanceId), and creates a volume from the snapshot that contains the AMI root drive.

I use the CustomerName tag, but you can use any tag for that purpose as long as the Lambda function is modified accordingly.
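The post shows only the findami and createvolume helpers, so here is a hedged sketch of the handler that might drive them: it looks up every instance carrying the requested CustomerName tag value and calls findami with the instance ID and its Availability Zone. The event key, handler wiring, and globals are assumptions, not the author's exact code.

# Hedged sketch of the Prep Lambda entry point; the 'CustomerName' event key
# and handler wiring are assumptions.
import boto3
from os import environ

output = ''
dryrun = False

def lambda_handler(event, context):
    global output
    ec2 = boto3.client('ec2', region_name=environ["region"])
    # Find every instance carrying the requested CustomerName tag value.
    reservations = ec2.describe_instances(
        Filters=[{'Name': 'tag:CustomerName', 'Values': [event['CustomerName']]}]
    )['Reservations']
    for reservation in reservations:
        for instance in reservation['Instances']:
            # Create the upgraded volume in the same AZ as the target instance.
            findami(instance['InstanceId'],
                    instance['Placement']['AvailabilityZone'])
    return output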

# Assumes the global 'output' string and 'dryrun' flag are defined at the top
# of the Lambda function, along with these imports.
import boto3
from os import environ


def findami(instanceid, az):
    global output
    ec2 = boto3.client(
        'ec2',
        region_name=environ["region"],
    )
    response = ec2.describe_images(
        Filters=[
            {
                'Name': 'owner-id',
                'Values': [
                    'AccountNumber',  # replace with your AWS account ID
                ]
            },
            {
                'Name': 'name',
                'Values': ['AWSEC2_UPGRADED_AMI_FOR_INSTANCE_' + instanceid + '*']
            }
        ],
    )

    if len(response['Images']) == 1:
        if len(response['Images'][0]['BlockDeviceMappings']) == 1:
            output += "Found one AMI/block device for instance {}, creating volume<br>".format(instanceid)
            createvolume(instanceid, az, response['Images'][0]['BlockDeviceMappings'][0]['Ebs']['SnapshotId'])
        else:
            output += 'Found more than 1 block device for instance {}, skipping. Please check manually<br>'.format(instanceid)
    else:
        output += 'Unable to determine AMI to use for instance {}, skipping. Please check manually<br>'.format(instanceid)


def createvolume(instanceid, az, snapshot):
    global output
    ec2 = boto3.client(
        'ec2',
        region_name=environ["region"],
    )
    try:
        # Create an encrypted volume from the upgraded AMI's root snapshot and
        # tag it so the Execute function can find it later.
        response = ec2.create_volume(
            DryRun=dryrun,
            AvailabilityZone=az,
            Encrypted=True,
            SnapshotId=snapshot,
            TagSpecifications=[
                {
                    'ResourceType': 'volume',
                    'Tags': [
                        {
                            'Key': 'Name',
                            'Value': 'Upgraded_' + instanceid
                        },
                    ]
                }
            ]
        )
        output += 'Created {} for instance {}<br>'.format(response['VolumeId'], instanceid)
    except Exception:
        output += "Error creating volume for {} in {} with {}<br>".format(instanceid, az, snapshot)

2008UpgradeDriveSwap_Execute Function

This function detaches the root volume (Windows 2008 R2) from each instance whose CustomerName tag matches the value entered in the Lambda function parameters, attaches the volume prepared in advance by the 2008UpgradeDriveSwap_Prep function as the new root volume (Windows 2019), and starts the instance.

Be sure to stop the targeted instances before running this function.
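Stopping the targets can be scripted as well. A minimal sketch, assuming placeholder instance IDs, stops them and waits for the stopped state before the swap:

# Hedged sketch: stop the target instances and wait for them to reach the
# 'stopped' state before running 2008UpgradeDriveSwap_Execute.
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
instance_ids = ['i-0123456789abcdef0']  # placeholder IDs for the instances being upgraded

ec2.stop_instances(InstanceIds=instance_ids)
ec2.get_waiter('instance_stopped').wait(InstanceIds=instance_ids)
print('Instances stopped; safe to run the Execute function.')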

def findnewvolume(instanceid):
    global output
    ec2 = boto3.client(
        'ec2',
        region_name=environ["region"],
    )
    response = ec2.describe_volumes(
        Filters=[
            {
                'Name': 'tag:Name',
                'Values': [
                    'Upgraded_' + instanceid,
                ]
            },
        ],
    )
    if len(response['Volumes']) == 1:
        return response['Volumes'][0]['VolumeId']
    else:
        output += 'Found more than one volume for {} please check manually<br>'.format(instanceid)
        return  # Returns empty


def findoldvolume(volumes):
    for volume in volumes:
        if volume['DeviceName'] == '/dev/sda1':
            return volume['Ebs']['VolumeId']


def driveswap(instanceid, oldvolumeid, newvolumeid):
    global output
    output += 'Starting driveswap on {} with {}<br>'.format(instanceid, newvolumeid)
    ec2 = boto3.client(
        'ec2',
        region_name=environ["region"],
    )
    try:
        response = ec2.detach_volume(
            VolumeId=oldvolumeid,
            DryRun=dryrun
        )
        output += 'Detached volume {}<br>'.format(oldvolumeid)
        sleep(2)
        response2 = ec2.attach_volume(
            Device='/dev/sda1',
            InstanceId=instanceid,
            VolumeId=newvolumeid,
            DryRun=dryrun
        )
        output += 'Attached volume {}<br>'.format(newvolumeid)
        response = ec2.start_instances(
            InstanceIds=[
                instanceid,
            ],
            DryRun=dryrun
        )
        output += 'Started instance {}<br>'.format(instanceid)
    except Exception as e:
        output += 'Swap failed for {} please check manually:{}<br>'.format(instanceid, e)

2008UpgradeDriveSwap_RollBack Function

This function rolls back the 2008UpgradeDriveSwap_Execute function and adds the original volumes back as the root volumes. It’s useful when an upgrade process does not end well, and you need to roll back.

def findnewvolume(instanceid):
    global output
    ec2 = boto3.client(
        'ec2',
        region_name=environ["region"],
    )
    response = ec2.describe_volumes(
        Filters=[
            {
                'Name': 'tag:Pre2019Upgrade',
                'Values': [
                    instanceid,
                ]
            },
        ],
    )
    if len(response['Volumes']) == 1:
        return response['Volumes'][0]['VolumeId']
    else:
        output += 'Found more than one volume for {} please check manually<br>'.format(instanceid)
        return  # Returns empty


def findoldvolume(volumes):
    for volume in volumes:
        if volume['DeviceName'] == '/dev/sda1':
            return volume['Ebs']['VolumeId']


def driveswap(instanceid, oldvolumeid, newvolumeid):
    global output
    output += 'Starting driveswap on {} with {}<br>'.format(instanceid, newvolumeid)
    ec2 = boto3.client(
        'ec2',
        region_name=environ["region"],
    )
    try:
        response = ec2.detach_volume(
            VolumeId=oldvolumeid,
            DryRun=dryrun
        )
        output += 'Detached volume {}<br>'.format(oldvolumeid)
        sleep(2)
        response2 = ec2.attach_volume(
            Device='/dev/sda1',
            InstanceId=instanceid,
            VolumeId=newvolumeid,
            DryRun=dryrun
        )
        output += 'Attached volume {}<br>'.format(newvolumeid)
        response = ec2.start_instances(
            InstanceIds=[
                instanceid,
            ],
            DryRun=dryrun
        )
        output += 'Started instance {}<br>'.format(instanceid)
    except Exception as e:
        output += 'Swap failed for {} please check manually:{}<br>'.format(instanceid, e)

Conclusion

By combining AWS Systems Manager, AWS Directory Service, and a few simple AWS Lambda functions, you can run an in-place upgrade of your Windows 2008 R2 production instances to Windows 2019.

With proper instance tagging in place and the Lambda functions described above, you can orchestrate the root volume replacement process so it causes minimal impact to the business.

Be sure to read the out-of-the-box AWS Systems Manager document.

The content and opinions in this blog are those of the third party author and AWS is not responsible for the content or accuracy of this post.



Salesforce – AWS Partner Spotlight

Salesforce is an AWS Competency Partner and leading customer relationship management (CRM) platform that helps enterprises get more out of their customer data.

Contact Salesforce | Practice Overview
