vSAN Health Alarm Check Script (Using PowerCLI)


In the third and final part of this series I have taken my basic skeleton from the previous two blogs in order to solve the issue of bringing all of the vSAN Skyline Health checks into one central location using a vSAN Health Alarm Check Script.

My two previous blogs can be found here:
NSX Backup Check Script (Using the NSX Web API)
NSX Alarm Check Script (Using the NSX REST API)

Unfortunately this time, despite best efforts I was unable to get a suitable result using the vCenter REST API. Documentation is lacking and I was not able to get full results for the Skyline Health Checks. From asking around it seems that PowerCLI holds the answer for me, so it gave me an excuse to adapt the script again and get it to work with PowerCLI.

Again you might be asking ‘why not just use the vSAN Management Pack for vROps?’ but alas it does not keep pace with the vSAN Skyline Health and it is missing some alarms.

PowerCLI

For those not aware of everything PowerCLI can do you can find the full reference of the vSphere and vSAN cmdlets here:

https://developer.vmware.com/docs/powercli/latest/products/vmwarevsphereandvsan/

We are going to be using the Get-VSANView cmdlet in order to pull out the information from the vCenter.

We can get the health information from the “VsanVcClusterHealthSystem-vsan-cluster-health-system” Managed Object. Details of this can be found here:

https://vdc-download.vmware.com/vmwb-repository/dcr-public/3325c370-b58c-4799-99ff-58ae3baac1bd/45789cc5-aba1-48bc-a320-5e35142b50af/doc/vim.cluster.VsanVcClusterHealthSystem.html

The Code Changes

The Try Catch has been changed to connect to the vCenter first and then call a function to get the vSAN Health Summary for each cluster:

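try{
    # -ErrorAction Stop makes a failed connection a terminating error, so the catch block runs
    Connect-VIServer -Server $vCenter -Credential $encodedlogin -ErrorAction Stop
    $Clusters = Get-Cluster

    foreach ($Cluster in $Clusters) {
        Get-VsanHealthSummary -Cluster $Cluster
    }
}
catch {catchFailure}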

So let's have a look at the function itself.

The Get vSAN Cluster Health function

I have written a function which takes a cluster name as a parameter, finds the Managed Object Reference (MORef) for the cluster, queries the vCenter for the vSAN cluster health for that MORef and outputs any tests which are Yellow (Warning) or Red (Critical).

Function Get-VsanHealthSummary {

    param(
        [Parameter(Mandatory=$true)][String]$Cluster
    )
    
    $vchs = Get-VSANView -Id "VsanVcClusterHealthSystem-vsan-cluster-health-system"
    $cluster_view = (Get-Cluster -Name $Cluster).ExtensionData.MoRef
    $results = $vchs.VsanQueryVcClusterHealthSummary($cluster_view,$null,$null,$true,$null,$null,'defaultView')
    $healthCheckGroups = $results.groups
    $timestamp = (Get-Date).ToString("yyyy/MM/dd HH:mm:ss")

    foreach($healthCheckGroup in $healthCheckGroups) {

        $Health = @("Yellow","Red")
        # A group can contain several failing tests, so each one is handled separately
        $failedTests = $healthCheckGroup.grouptests | where TestHealth -in $Health

        foreach ($failedTest in $failedTests) {
            $healthCheckTestName = $failedTest.TestName
            $healthCheckTestShortDescription = $failedTest.TestShortDescription

            # Translate the Skyline Health colour into a severity for the log line
            if ($failedTest.TestHealth -eq "yellow") {
                $healthCheckTestHealthAlt = "Warning"
            }
            else {
                $healthCheckTestHealthAlt = "Critical"
            }

            Add-Content -Path $exportpath -Value "$timestamp [$healthCheckTestHealthAlt] $vCenter - vSAN Clustername $Cluster vSAN Alarm Name $healthCheckTestName Alarm Description $healthCheckTestShortDescription"
            Start-Sleep -Seconds 1
        }
    }
}
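
The colour-to-severity translation in the function can also be written as a lookup table. The following is only a minimal sketch and not part of the original script: the function name ConvertTo-Severity and the sample test objects are made up for illustration, and it runs without a vCenter connection.

```powershell
# Hypothetical helper: maps a Skyline Health colour to the severity used in the log line.
function ConvertTo-Severity {
    param([Parameter(Mandatory=$true)][string]$TestHealth)

    $map = @{ yellow = "Warning"; red = "Critical" }
    $key = $TestHealth.ToLower()
    if ($map.ContainsKey($key)) { return $map[$key] }
    return "Unknown"
}

# Made-up sample data standing in for $healthCheckGroup.grouptests
$sampleTests = @(
    [pscustomobject]@{ TestName = "Sample test A"; TestHealth = "green"  },
    [pscustomobject]@{ TestName = "Sample test B"; TestHealth = "yellow" },
    [pscustomobject]@{ TestName = "Sample test C"; TestHealth = "red"    }
)

# Keep only the tests that are yellow or red, as the function above does
$failing = @($sampleTests | Where-Object { $_.TestHealth -in @("yellow","red") })

$report = foreach ($test in $failing) {
    "[{0}] {1}" -f (ConvertTo-Severity -TestHealth $test.TestHealth), $test.TestName
}
$report
```

A table like this keeps the mapping in one place if further colours ever need to be handled.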

Saving Credentials

This time, as we are using PowerCLI and Connect-VIServer, we cannot use the encoded credentials we used last time for the Web and REST API. Instead we will use the cmdlet Export-Clixml, which creates an XML-based representation of an object and stores it in a file.

Further details of this utility can be found here:

https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/export-clixml?view=powershell-7.3

We will use Get-Credential to bring in the username and password to store, and then export the credential to the path defined in the variables at the top of the script.

if (-Not(Test-Path -Path  $credPath)) {
    $credential = Get-Credential
    $credential | Export-Clixml -Path $credPath

}

$encodedlogin = Import-Clixml -Path $credPath
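
To see what the export and import round trip does, here is a small sketch using a throwaway credential and a temporary file. The user name, password and file name are assumptions made up for this demonstration. On Windows, Export-Clixml protects the password with the Windows Data Protection API, so the file can only be read back by the same user on the same computer. In practice that means the scheduled task should run as the account that created the .cred file.

```powershell
# Assumption: a throwaway credential and a temp file, used only for this demonstration
$demoPath = Join-Path ([System.IO.Path]::GetTempPath()) "demo-vcenter.cred"
$securePw = ConvertTo-SecureString "NotARealPassword" -AsPlainText -Force
$demoCred = New-Object System.Management.Automation.PSCredential ("demo-user", $securePw)

# Save the credential object, then load it back the same way the script does
$demoCred | Export-Clixml -Path $demoPath
$loaded = Import-Clixml -Path $demoPath

# The user name is stored as plain text in the XML file
$loaded.UserName

# The password can be recovered by the account that exported it
$loaded.GetNetworkCredential().Password.Length

Remove-Item -Path $demoPath
```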

Handling the Outputs.

As per my previous scripts the outputs are formatted to be ingested into a syslog server (vRealize Log Insight in this case) which would then send emails to the appropriate places and allow for a nice dashboard for quick whole estate checks.
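
As an illustration of the log line layout used by these scripts (timestamp, severity in square brackets, source, then the message), here is a minimal sketch of a formatting helper. Format-LogLine, its parameters and the sample values are hypothetical and not part of the original script. It uses the invariant culture so that the "/" in the date pattern is not swapped for the regional date separator.

```powershell
# Hypothetical helper: builds one log line in the "timestamp [SEVERITY] source - message" layout
function Format-LogLine {
    param(
        [Parameter(Mandatory=$true)][string]$Severity,
        [Parameter(Mandatory=$true)][string]$Source,
        [Parameter(Mandatory=$true)][string]$Message,
        [datetime]$Time = (Get-Date)
    )
    # InvariantCulture keeps "/" as the date separator regardless of regional settings
    $timestamp = $Time.ToString("yyyy/MM/dd HH:mm:ss", [System.Globalization.CultureInfo]::InvariantCulture)

    # Collapse line breaks so each alarm stays on a single line in the log file
    $singleLine = $Message -replace "\s*[\r\n]+\s*", " "

    return "$timestamp [$Severity] $Source - $singleLine"
}

# Sample values are made up for illustration
$line = Format-LogLine -Severity "Warning" -Source "vcenter.example.local" `
    -Message "vSAN Clustername Cluster01 vSAN Alarm Name Sample alarm" `
    -Time ([datetime]"2023-01-02T03:04:05")
$line
```

The result could then be passed to Add-Content in the same way as the strings built inline in the script.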

The Final vSAN Health Alarm Check Script

I have put all the variables at the top and the script is designed to be run in a folder and to have another separate folder with the logs. This was done in order to manage multiple scripts logging to the same location, e.g.:
c:\scripts\NSXBackupCheck\NSXBackupCheck.ps1
c:\scripts\Logs\NSXBackupCheck.log

param ($vCenter)

$curDir = &{$MyInvocation.PSScriptRoot}
$exportpath = "$curDir\..\Logs\vSANAlarmCheck.log"
$credPath = "$curDir\$vCenter.cred"
$scriptName = &{$MyInvocation.ScriptName}

add-type @"
   using System.Net;
   using System.Security.Cryptography.X509Certificates;
   public class TrustAllCertsPolicy : ICertificatePolicy {
      public bool CheckValidationResult(
      ServicePoint srvPoint, X509Certificate certificate,
      WebRequest request, int certificateProblem) {
      return true;
   }
}
"@
[System.Net.ServicePointManager]::CertificatePolicy = New-Object TrustAllCertsPolicy

Function Get-VsanHealthSummary {

    param(
        [Parameter(Mandatory=$true)][String]$Cluster
    )
    
    $vchs = Get-VSANView -Id "VsanVcClusterHealthSystem-vsan-cluster-health-system"
    $cluster_view = (Get-Cluster -Name $Cluster).ExtensionData.MoRef
    $results = $vchs.VsanQueryVcClusterHealthSummary($cluster_view,$null,$null,$true,$null,$null,'defaultView')
    $healthCheckGroups = $results.groups
    $timestamp = (Get-Date).ToString("yyyy/MM/dd HH:mm:ss")

    foreach($healthCheckGroup in $healthCheckGroups) {

        $Health = @("Yellow","Red")
        # A group can contain several failing tests, so each one is handled separately
        $failedTests = $healthCheckGroup.grouptests | where TestHealth -in $Health

        foreach ($failedTest in $failedTests) {
            $healthCheckTestName = $failedTest.TestName
            $healthCheckTestShortDescription = $failedTest.TestShortDescription

            # Translate the Skyline Health colour into a severity for the log line
            if ($failedTest.TestHealth -eq "yellow") {
                $healthCheckTestHealthAlt = "Warning"
            }
            else {
                $healthCheckTestHealthAlt = "Critical"
            }

            Add-Content -Path $exportpath -Value "$timestamp [$healthCheckTestHealthAlt] $vCenter - vSAN Clustername $Cluster vSAN Alarm Name $healthCheckTestName Alarm Description $healthCheckTestShortDescription"
            Start-Sleep -Seconds 1
        }
    }

}

function catchFailure {
    $timestamp = (Get-Date).ToString("yyyy/MM/dd HH:mm:ss")
    if (Test-Connection -BufferSize 32 -Count 1 -ComputerName $vCenter -Quiet) {
        Add-Content -Path $exportpath -Value "$timestamp [ERROR] $vCenter - $_"
    }
    else {
        Add-Content -Path $exportpath -Value "$timestamp [ERROR] $vCenter - Host Not Found"
    }
exit
}

if (!$vCenter) {
    Write-Host "please provide parameter 'vCenter' in the format '$scriptName -vCenter [FQDN of vCenter Server]'"
    exit
    }

if (-Not(Test-Path -Path  $credPath)) {
    $credential = Get-Credential
    $credential | Export-Clixml -Path $credPath

}

$encodedlogin = Import-Clixml -Path $credPath


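try{
    # -ErrorAction Stop makes a failed connection a terminating error, so the catch block runs
    Connect-VIServer -Server $vCenter -Credential $encodedlogin -ErrorAction Stop
    $Clusters = Get-Cluster

    foreach ($Cluster in $Clusters) {
        Get-VsanHealthSummary -Cluster $Cluster
    }
}
catch {catchFailure}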

Disconnect-VIServer $vCenter -Confirm:$false

Overview

The final script above can be altered to be used as a skeleton for any other PowerShell or PowerCLI commands, as well as being adapted for REST APIs and Web APIs as per the previous blogs. Note that those use a different credential store function.

The two previous blogs can be found here:
NSX Backup Check Script (Using the NSX Web API)
NSX Alarm Check Script (Using the NSX REST API)

NSX Alarm Check Script (Using the NSX REST API)


In my previous blog I created a script to get the last backup status from NSX Manager in order to quickly check multiple NSX Managers. Today I had a need to bring the alarms raised in all of these NSX Managers into one single location, which necessitated creating an NSX Alarm Check Script.
‘But surely the NSX Management Pack would allow you to do this?’ you may ask. Unfortunately it is missing some of the alarms which get raised on the NSX Managers, such as passwords expiring. This one is an annoyance if you do not notice until after the password has expired and you are having LDAP issues.

Now luckily this time, we CAN use the NSX REST API to get these details, and I had a script lying around which could provide a skeleton for this. You can find that script here: NSX Backup Check Script

In order to adapt this script to use REST we need to change the Invoke-WebRequest to Invoke-RestMethod.

Interrogating NSX REST API

I used the documentation from VMware {code} to find this API and how to handle the results. Luckily this is a lot more detailed than the web API. You can find the NSX API details here:

https://developer.vmware.com/apis/547/nsx-t

So we want to request /api/v1/alarms in order to return a list of all alarms on the NSX Managers.

$result = Invoke-RestMethod -Uri https://$nsxmgr/api/v1/alarms -Headers $Header -Method 'GET' -UseBasicParsing

Handling the Outputs.

Running this command will give a response similar to this:

{
  "result_count": 4,
  "results": [
      {
        "id": "xxxx",
        "status": "OPEN",
        "feature_name": "manager_health",
        "event_type": "manager_cpu_usage_high",
        "feature_display_name": "Manager Health",
        "event_type_display_name": "CPU Usage High",
        "node_id": "xxxx",
        "last_reported_time": 1551994806,
        "description": "The CPU usage for the manager node identified by appears to be\nrising.",
        "recommended_action": "Use the top command to check which processes have the most CPU usages, and\nthen check \/var\/log\/syslog and these processes' local logs to see if there\nare any outstanding errors to be resolved.",
        "node_resource_type": "ClusterNodeConfig",
        "severity": "WARNING",
        "entity_resource_type": "ClusterNodeConfig"
      },
      ...
  ]
}

From this output I wanted to pull out the severity, status, alarm description and the node which was impacted, so I looped through the results array and assigned the items to variables.

$nsxAlarms = $result.results 
    foreach ($nsxAlarm in $nsxalarms) {
        $nsxAlarmCreated = (get-date 01.01.1970).AddSeconds([int]($nsxAlarm._create_time/1000)).ToString("yyyy/MM/dd HH:mm:ss")
        $timestamp = (Get-Date).ToString("yyyy/MM/dd HH:mm:ss")
        $nsxAlarmSeverity = $nsxAlarm.severity
        $nsxAlarmStatus = $nsxAlarm.status
        $nsxAlarmNode_display_name = $nsxAlarm.node_display_name
        $nsxAlarmDescription = $nsxAlarm.description

From here I wanted to only include any alarms which had not been marked acknowledged or resolved to avoid constantly reporting a condition which was known about.

if($nsxAlarm.status -ne "ACKNOWLEDGED" -and $nsxAlarm.status -ne "RESOLVED"){ 
    Add-Content -Path $exportpath -Value "$timestamp [$nsxAlarmSeverity] $NSXMGR - Alarm Created $nsxAlarmCreated Status $nsxAlarmStatus Affected Node $nsxAlarmNode_display_name Description  $nsxAlarmDescription"
}
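
The same filter can be tried without an NSX Manager by feeding a sample response through ConvertFrom-Json. This is only a sketch: the JSON below is made up and trimmed to the fields used, and the variable names $ignoredStatuses and $activeAlarms are my own.

```powershell
# Made-up sample response, trimmed to the fields used below
$sampleJson = @'
{
  "result_count": 3,
  "results": [
    { "id": "a1", "status": "OPEN",         "severity": "WARNING",  "description": "Sample alarm one" },
    { "id": "a2", "status": "ACKNOWLEDGED", "severity": "CRITICAL", "description": "Sample alarm two" },
    { "id": "a3", "status": "RESOLVED",     "severity": "MEDIUM",   "description": "Sample alarm three" }
  ]
}
'@

$sampleResult = $sampleJson | ConvertFrom-Json

# Drop anything already acknowledged or resolved, keeping every other status
$ignoredStatuses = @("ACKNOWLEDGED", "RESOLVED")
$activeAlarms = @($sampleResult.results | Where-Object { $_.status -notin $ignoredStatuses })

foreach ($alarm in $activeAlarms) {
    "[{0}] Status {1} Description {2}" -f $alarm.severity, $alarm.status, $alarm.description
}
```

Filtering on a list of ignored statuses keeps any status other than those two in the output, which matches the two -ne comparisons in the script.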

It is also possible to skip this filtering by asking the API for open alarms only with the following request, however I wanted to pull in all alarms for my specific use case.

GET /api/v1/alarms?status=OPEN

As per the previous script, this was wrapped in a try catch and the catch failure tested if the host was up. A full explanation can be found on the blog about this script here: NSX Backup Check Script.

The Final NSX Alarm Check Script

param ($nsxmgr)

$curDir = &{$MyInvocation.PSScriptRoot}
$exportpath = "$curDir\..\Logs\NSXAlarmCheck.log"
$credPath = "$curDir\$nsxmgr.cred"
$scriptName = &{$MyInvocation.ScriptName}

add-type @"
   using System.Net;
   using System.Security.Cryptography.X509Certificates;
   public class TrustAllCertsPolicy : ICertificatePolicy {
      public bool CheckValidationResult(
      ServicePoint srvPoint, X509Certificate certificate,
      WebRequest request, int certificateProblem) {
      return true;
   }
}
"@
[System.Net.ServicePointManager]::CertificatePolicy = New-Object TrustAllCertsPolicy

function catchFailure {
    $timestamp = (Get-Date).ToString("yyyy/MM/dd HH:mm:ss")
    if (Test-Connection -BufferSize 32 -Count 1 -ComputerName $nsxmgr -Quiet) {
        Add-Content -Path $exportpath -Value "$timestamp [ERROR] $NSXMGR - $_"
    }
    else {
        Add-Content -Path $exportpath -Value "$timestamp [ERROR] $NSXMGR - Host Not Found"
    }
exit
}

if (!$nsxmgr) {
    Write-Host "please provide parameter 'nsxmgr' in the format '$scriptName -nsxmgr [FQDN of NSX Manager]'"
    exit
    }

if (-Not(Test-Path -Path  $credPath)) {
    $username = Read-Host "Enter username for NSX Manager" 
    $pass = Read-Host "Enter password" -AsSecureString 
    $password = [System.Runtime.InteropServices.Marshal]::PtrToStringAuto([System.Runtime.InteropServices.Marshal]::SecureStringToBSTR($pass))
    $userpass  = $username + ":" + $password

    $bytes= [System.Text.Encoding]::UTF8.GetBytes($userpass)
    $encodedlogin=[Convert]::ToBase64String($bytes)
    
    Set-Content -Path $credPath -Value $encodedlogin
}

$encodedlogin = Get-Content -Path $credPath

$authheader = "Basic " + $encodedlogin
$header = New-Object "System.Collections.Generic.Dictionary[[String],[String]]"
$header.Add("Authorization",$authheader)

try{
    $result = Invoke-RestMethod -Uri https://$nsxmgr/api/v1/alarms -Headers $Header -Method 'GET' -UseBasicParsing

        $nsxAlarms = $result.results 
        foreach ($nsxAlarm in $nsxalarms) {
            
            $nsxAlarmCreated = (get-date 01.01.1970).AddSeconds([int]($nsxAlarm._create_time/1000)).ToString("yyyy/MM/dd HH:mm:ss")
            $timestamp = (Get-Date).ToString("yyyy/MM/dd HH:mm:ss")
            $nsxAlarmSeverity = $nsxAlarm.severity
            $nsxAlarmStatus = $nsxAlarm.status
            $nsxAlarmNode_display_name = $nsxAlarm.node_display_name
            $nsxAlarmDescription = $nsxAlarm.description

            if($nsxAlarm.status -ne "ACKNOWLEDGED" -and $nsxAlarm.status -ne "RESOLVED"){ 
                Add-Content -Path $exportpath -Value "$timestamp [$nsxAlarmSeverity] $NSXMGR - Alarm Created $nsxAlarmCreated Status $nsxAlarmStatus Affected Node $nsxAlarmNode_display_name Description  $nsxAlarmDescription"
            }
        
    }
 }
catch {catchFailure}

Overview

The final script above can be altered to be used as a skeleton for any other Invoke-RestMethod APIs as well as simply being adapted for Web API. I will be following up this post with further updates to adapt the script in order to use PowerCLI, which requires a different credential store.


NSX Backup Check Script (Using the NSX Web API)


I was recently asked for a way to have a simple check and report on the last backup status for a global company with multiple VMware NSX managers.

For some reason their NSX Managers were not reporting the backup status via syslog to VMware vRealize Log Insight (vRLI) and even if it was, they only have one vRLI cluster per site and wanted one simple place to do their daily checks.

So let’s make an NSX backup check script, PowerShell and REST API to the rescue! … right?

So … no, you cannot get the backup status via REST API

Great.

But you can via the WebAPI!

Hurrah! Let's throw in some Invoke-WebRequest and get the data we need.

After some basic checks, I got the info I wanted – now I need to schedule it and have it run a short period after the backup window.

This part resulted in a path of trying to figure out a way to hold account passwords in a usable manner without them being written in clear-text anywhere because that’s just no good. There are a few different ways to do this, but they either tie it to one user profile and computer, or don’t work with the basic auth needed to run against NSX to get the data via webrequest. I will go into how I achieved that further down, but first, the web API to get backup status.

Interrogating NSX Web API

After some looking around, I discovered the following URL called via Invoke-WebRequest would give us the backup results:

Invoke-WebRequest -Uri https://[nsxmgr]/api/v1/cluster/backups/overview -Headers $Header 

Now the big problem with Invoke-WebRequest is that you would assume it returns any response status, such as 403 Forbidden, for you to inspect. Nope! A failed request raises an error instead of returning a response.

You don’t get any helpful error catching; it either works or bombs out. Not much good for an unattended script that you want to tell you about any issues.

So the best fix I came up with was using a try and catch:

try { 
    $result = Invoke-WebRequest -Uri https://...
    }
catch {catchFailure}

I then created a function to run in the event of a failure. It pings the host to see if it’s online: if it is, it outputs the error; if it isn’t, it outputs that the host is unreachable.

if (Test-Connection -BufferSize 32 -Count 1 -ComputerName $nsxmgr -Quiet) {
    # host is online: output the error
} else {
    # host is offline: output that it is unreachable
}
exit

Job jobbed, no more bombing out with red text.
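
To make the pattern easier to try out, here is a sketch of a variant where the error and the ping result are passed in as parameters rather than read from inside the function. Get-FailureMessage, its parameters and the host name are hypothetical, and throw stands in for a failed Invoke-WebRequest call so the example runs without an NSX Manager.

```powershell
# Hypothetical variant of the failure handler: the error and the ping result are passed in
# as parameters, so the function can be exercised without a real NSX Manager.
function Get-FailureMessage {
    param(
        [Parameter(Mandatory=$true)][string]$Target,
        [Parameter(Mandatory=$true)][System.Management.Automation.ErrorRecord]$ErrorRecord,
        [Parameter(Mandatory=$true)][bool]$HostOnline
    )
    if ($HostOnline) {
        # Host answered the ping, so report the error that was caught
        return "[ERROR] $Target - $($ErrorRecord.Exception.Message)"
    }
    # Host did not answer the ping
    return "[ERROR] $Target - Host Not Found"
}

try {
    # Stand-in for the Invoke-WebRequest call; throw simulates a failed request
    throw "simulated 403 Forbidden"
}
catch {
    $onlineMessage  = Get-FailureMessage -Target "nsx.example.local" -ErrorRecord $_ -HostOnline $true
    $offlineMessage = Get-FailureMessage -Target "nsx.example.local" -ErrorRecord $_ -HostOnline $false
}
$onlineMessage
$offlineMessage
```

In a real run the -HostOnline value would come from the Test-Connection call with -Quiet shown above.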

Dealing with certificates

When you run Invoke-WebRequest against an NSX Manager with self-signed certificates you get the error "The underlying connection was closed: Could not establish trust relationship for the SSL/TLS secure channel".

The fix for this is to add in this code near the top of your script to resolve the error.

add-type @"
   using System.Net;
   using System.Security.Cryptography.X509Certificates;
   public class TrustAllCertsPolicy : ICertificatePolicy {
      public bool CheckValidationResult(
      ServicePoint srvPoint, X509Certificate certificate,
      WebRequest request, int certificateProblem) {
      return true;
   }
}
"@
[System.Net.ServicePointManager]::CertificatePolicy = New-Object TrustAllCertsPolicy

Password Management

Great stuff, I now have a working script, but ideally I want it to be scheduled and unattended.

This is where I spun around for a while trying different ways to store credentials in a secure format, because passwords in plain text are uncool.

I was initially trying to use the encoded credentials modules but was having little luck getting them passed as a header value, so I bugged a colleague (@pauldavey_79) for some help and ideas from his many years of experience prodding APIs.

What we came up with was to take the username and password as requested input via Read-Host, encode them in the Base64 format required to pass via the header in Invoke-WebRequest, and store that in a text file.

    $username = Read-Host "Enter username for NSX Manager"
    $pass = Read-Host "Enter password" -AsSecureString
    $password = [System.Runtime.InteropServices.Marshal]::PtrToStringAuto([System.Runtime.InteropServices.Marshal]::SecureStringToBSTR($pass))
    $userpass  = $username + ":" + $password

    $bytes= [System.Text.Encoding]::UTF8.GetBytes($userpass)
    $encodedlogin=[Convert]::ToBase64String($bytes)
    
    Set-Content -Path $credPath -Value $encodedlogin

This worked a charm.
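
One thing worth keeping in mind is that Base64 is an encoding rather than encryption, so anyone who can read the .cred file can turn it back into the username and password. The sketch below shows the round trip with made-up demo values. It suggests the file should be locked down with file system permissions so that only the account running the script can read it.

```powershell
# Assumption: demo values only, not real credentials
$demoUser = "demo-user"
$demoPass = "NotARealPassword"

# Encode "user:password" the same way the script does before storing it
$demoBytes = [System.Text.Encoding]::UTF8.GetBytes($demoUser + ":" + $demoPass)
$demoEncoded = [Convert]::ToBase64String($demoBytes)
$demoHeader = @{ Authorization = "Basic " + $demoEncoded }

# Decoding needs no key or password
$decoded = [System.Text.Encoding]::UTF8.GetString([Convert]::FromBase64String($demoEncoded))

$demoHeader.Authorization
$decoded
```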

Handling the Outputs.

With this script I wanted to feed the output into a syslog server (vRealize Log Insight in this case) which would then send emails to the appropriate places and allow for a nice dashboard for quick whole estate checks.

In order to achieve this, I used the Add-Content command to append the data to a .log file which was monitored by the Log Insight Agent and sent off to the Log Insight Server.

if($LatestBackup.success -eq $true){ 
  Add-Content -Path $exportpath -Value "$timestamp [INFO] $NSXMGR - Last backup successful. Start time $start End time $end"
} else{ 
  Add-Content -Path $exportpath -Value "$timestamp [ERROR] $NSXMGR - Last backup failed $start $end"
}

This gives us a nice syslog formatted output which can be easily manipulated within Log Insight. Hurrah.

One thing to note is that the NSX Web API returned the start and end times as Unix epoch time in milliseconds, so I needed to convert that to a more suitable human-readable date. That was done with the line:

 $var = (get-date 01.01.1970).AddSeconds([int]($LatestBackup.end_time/1000))
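
As an alternative sketch, .NET also has a built-in conversion for epoch milliseconds, assuming .NET Framework 4.6 or later (or PowerShell 7) is available. The sample value below is made up. Both this and the line above give the time in UTC, so a conversion to local time is shown in case the log timestamps should match the machine's time zone.

```powershell
# Sample value: epoch time in milliseconds (made up for illustration)
$epochMs = 1551994806000

# Built-in conversion from epoch milliseconds, read here as a UTC DateTime
$utc = [DateTimeOffset]::FromUnixTimeMilliseconds($epochMs).UtcDateTime

# Convert to the local time zone of the machine running the script, if preferred
$local = $utc.ToLocalTime()

$utc
$local
```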

I also needed to get my try-catch error collector to output the error messages in the same format so that was done as so:

Add-Content -Path $exportpath -Value "$timestamp [ERROR] $NSXMGR - $_"

Pulling all of that together we get the final script which can be used as a skeleton for any future work required. A few of them will be posted at a later date.

The Final NSX Backup Check Script

I have put all the variables at the top and the script is designed to be run in a folder and to have another separate folder with the logs. This was done in order to manage multiple scripts logging to the same location, e.g.:
c:\scripts\NSXBackupCheck\NSXBackupCheck.ps1
c:\scripts\Logs\NSXBackupCheck.log

param ($nsxmgr)

$curDir = &{$MyInvocation.PSScriptRoot}
$exportpath = "$curDir\..\Logs\NSXBackupCheck.log"
$credPath = "$curDir\$nsxmgr.cred"
$scriptName = &{$MyInvocation.ScriptName}

add-type @"
   using System.Net;
   using System.Security.Cryptography.X509Certificates;
   public class TrustAllCertsPolicy : ICertificatePolicy {
      public bool CheckValidationResult(
      ServicePoint srvPoint, X509Certificate certificate,
      WebRequest request, int certificateProblem) {
      return true;
   }
}
"@
[System.Net.ServicePointManager]::CertificatePolicy = New-Object TrustAllCertsPolicy

function catchFailure {
    $timestamp = (Get-Date).ToString("yyyy/MM/dd HH:mm:ss")
    if (Test-Connection -BufferSize 32 -Count 1 -ComputerName $nsxmgr -Quiet) {
        Add-Content -Path $exportpath -Value "$timestamp [ERROR] $NSXMGR - $_"
    }
    else {
        Add-Content -Path $exportpath -Value "$timestamp [ERROR] $NSXMGR - Host Not Found"
    }
exit
}

if (!$nsxmgr) {
    Write-Host "please provide parameter 'nsxmgr' in the format '$scriptName -nsxmgr [FQDN of NSX Manager]'"
    exit
    }

if (-Not(Test-Path -Path  $credPath)) {
    $username = Read-Host "Enter username for NSX Manager" 
    $pass = Read-Host "Enter password" -AsSecureString 
    $password = [System.Runtime.InteropServices.Marshal]::PtrToStringAuto([System.Runtime.InteropServices.Marshal]::SecureStringToBSTR($pass))
    $userpass  = $username + ":" + $password

    $bytes= [System.Text.Encoding]::UTF8.GetBytes($userpass)
    $encodedlogin=[Convert]::ToBase64String($bytes)
    
    Set-Content -Path $credPath -Value $encodedlogin
}

$encodedlogin = Get-Content -Path $credPath

$authheader = "Basic " + $encodedlogin
$header = New-Object "System.Collections.Generic.Dictionary[[String],[String]]"
$header.Add("Authorization",$authheader)

try{
    $result = Invoke-WebRequest -Uri https://$nsxmgr/api/v1/cluster/backups/overview -Headers $Header -UseBasicParsing
    if($result.StatusCode -eq 200) {
        $nsxbackups = $result.Content | ConvertFrom-Json
        $LatestBackup = $nsxbackups.backup_operation_history.cluster_backup_statuses
        $start = (get-date 01.01.1970).AddSeconds([int]($LatestBackup.start_time/1000))
        $end = (get-date 01.01.1970).AddSeconds([int]($LatestBackup.end_time/1000))
        $timestamp = (Get-Date).ToString("yyyy/MM/dd HH:mm:ss")
        if($LatestBackup.success -eq $true){ 
            Add-Content -Path $exportpath -Value "$timestamp [INFO] $NSXMGR - Last backup successful. Start time $start End time $end"
        } else{ 
            Add-Content -Path $exportpath -Value "$timestamp [ERROR] $NSXMGR - Last backup failed $start $end"
        }
    }
 }
catch {catchFailure}

Overview

The final script above can be altered to be used as a skeleton for any other Invoke-WebRequest APIs as well as simply being adapted for REST API. I will be following up this post with further updates to this script using the REST API and also an adaptation to use PowerCLI, which requires a different credential store.

The REST API Script can be found here: NSX Alarm Check Script

Registration failed: Log Insight Adapter Object Missing

I recently came across a problem at a client site with integrating Log Insight (vRLI) with vROps. The connection tests successfully and alert integration works, however launch in context returns the error “Registration failed: Log Insight Adapter Object Missing”.

After a discussion with GSS it was discovered that this is a known issue caused by the vROps cluster being behind a load balancer. The following errors are shown in the Log Insight log /storage/var/loginsight/vcenter_operations.log:

[2018-05-15 09:51:02.621+0000] ["https-jsse-nio-443-exec-3"/10.205.73.139 INFO] [com.vmware.loginsight.vcopssuite.VcopsSuiteApiRequest] [Open connection to URL https://vrops.domain.com/suite-api/api/versions/current]
[2018-05-15 09:51:02.621+0000] ["https-jsse-nio-443-exec-3"/10.205.73.139 INFO] [com.vmware.loginsight.vcopssuite.VcopsSuiteApiRequest] [http connection, setting request method 'GET' and content type 'application/json; charset=utf-8']
[2018-05-15 09:51:02.621+0000] ["https-jsse-nio-443-exec-3"/10.205.73.139 INFO] [com.vmware.loginsight.vcopssuite.VcopsSuiteApiRequest] [reading server response]
[2018-05-15 09:51:02.626+0000] ["https-jsse-nio-443-exec-3"/10.205.73.139 ERROR] [com.vmware.loginsight.vcopssuite.VcopsSuiteApiRequest] [failed to post resource to vRealize Operations Manager]
javax.net.ssl.SSLProtocolException: handshake alert:  unrecognized_name

This is caused by some security updates to the Apache Struts, JRE, kernel-default, and other libraries from vRealize Log Insight 4.5.1. These updated libraries affect the SSL Handshake that takes place when testing the vRealize Operations Manager integration.

To resolve this issue we needed to add the FQDN of the vROps load balancer as an alias to the apache2 config. This can be done by following these steps.

  1. Log into the vRealize Operations Manager Master node as root via SSH or Console.
  2. Open /usr/lib/vmware-vcopssuite/utilities/conf/vcops-apache.conf in a text editor.
  3. Find the ServerName ${VCOPS_APACHE_SERVER_NAME} line and insert a new line after it.
  4. On the new line enter the following:

ServerAlias vrops.domain.com

Note: Replace vrops.domain.com with the FQDN of vRealize Operations Manager’s load balancer.

  5. Save and close the file.
  6. Restart the apache2 service:

service apache2 restart

  7. Repeat steps 1-6 on all nodes in the vRealize Operations Manager cluster.

vRealize Log Insight 4.8 has been released


After months of waiting vRealize Log Insight 4.8 (vRLI 4.8) was released last night.

I’ve been waiting on this release as it fixes a number of minor CVEs (Java of course) and brings the major improvement which has been asked for by almost every customer I’ve spoken to – data retention configuration options based on time!

You now have the option to configure the data retention period based on your needs, from a few days to 12 months, instead of having to size the appliances exactly to guesstimate your retention needs.

Another major addition is a JSON parser, so that JSON logs can be easily sent and parsed into vRLI. Additionally the parser can be configured for conditional parsing: users can specify whether a parser should be applied based on the value of a parsed field.

There have been a number of minor security improvements including one which could delay upgrade for those with older SSL certificates. From 4.8, the minimum key size for the virtual appliance certificate must be 2048 bits or greater.

There are a few resolved issues which have bugged me (and clients) in the previous releases:

  • Launch in context for vROps is now working correctly.
  • Queries now support time-related terms that when entered are automatically translated to the current time.
  • The “From” date bug is fixed.

VMware are yet to update the Interoperability Matrix but hopefully there won’t be any major surprises in store.

So all in all, more minor evolution than revolution, as many were expecting the next release of vRLI to herald the change to PhotonOS like many other VMware appliances, but it is welcome all the same.

The download is already available on my.vmware.com, and as per usual you must be running vRealize Log Insight 4.7 or 4.7.1 to upgrade to 4.8. Follow my guide HERE for upgrading Log Insight.

The full release notes can be found HERE

How to Reset the vRLI 4.7.x Local Admin Password

The process to reset the vRealize Log Insight (vRLI) Admin password has been changed in vRLI 4.7 and above with the introduction of Cassandra authentication. This means if you need to reset the local admin password, the previously documented methods no longer work.

In order to reset the admin password you will need to connect to all of the vRLI nodes using the root account. Please follow the steps below.

1. Download the li-reset-admin-passwd.sh script from HERE

2. Copy li-reset-admin-passwd.sh to the following location on each node using Secure Copy (WinSCP in Windows) and overwrite the existing li-reset-admin-passwd.sh

/opt/vmware/bin

3. Log into the vRealize Log Insight Master node as root via SSH or Console.

4. Run the following command to set permissions on the script:

chmod 755 /opt/vmware/bin/li-reset-admin-passwd.sh

5. Repeat steps 3-4 on all nodes in the vRealize Log Insight cluster.

6. Run the following command to get the Cassandra credentials and note the user and password values:

/usr/lib/loginsight/application/lib/apache-cassandra-3.11.2/bin/credentials-look-up

Note: You will see output similar to:

<cassandra-user value="lisuper" />
<cassandra-password value="l337nuFvPbsWXlYIx2MsVqo4RotfgAXx" />

7. Run the following command to reset the admin password:

li-reset-admin-passwd.sh user password

Note: Replace user and password with the values noted in step 6 respectively.

Example

li-reset-admin-passwd.sh lisuper l337nuFvPbsWXlYIx2MsVqo4RotfgAXx

Log Insight Agent Buffer

I was recently asked about the Log Insight Agent buffer by a client who wasn’t sure what it was, how it works and how to configure it.

The buffer is used to locally store events and vRLI Agent logs when the vRLI cluster is unreachable.

By default it is set to a maximum of 200MB. When the specified max_disk_buffer is reached, the agent begins to drop new incoming events.

The buffer size is defined in the agent configuration file either locally on each agent by altering the liagent.ini file or centrally using an agent group on the vRLI Administration UI.

Insert the following above the system log definition section to set the buffer to 2GB:

[storage] 
max_disk_buffer=2000 

Prior to Agent 4.6 the maximum buffer size was 2000 MB (2GB).

From Agent 4.6 the maximum has been increased to 8GB.

VMware vCenter Security Log Events

I had a requirement from a customer to identify log events in order to create alerts for several threat scenarios. This post is intended to provide a high-level description of the results for the scenarios for future reference or in case anyone finds a use. Please see the earlier post on enabling additional vCenter and PSC logging. http://www.caenotech.co.uk/vmware/configuration-of-rsyslog-on-vcsa-and-psc/

Access to vCenter Administrator role

The objective of the following is to ensure nobody other than certain colleagues has access to the Cryptography operations within vCenter and that all work carried out on crypto operations is done under suitable change control.

As can be seen, the default syslog records the Administrator user logging in as VSPHERE.LOCAL\Administrator and the IP address the login originated from:

<datetime> <vCenterHostname> vcenter-server: User <Domain>\<Username>@<IPAddress> logged in as JAX-WS RI 2.2.9-b130926.1035 svn-revisions#<UID>

<datetime> <vCenterHostname> vpxd <eventID> - - Event [<LineID>] [1-1] [<datetime>] [vim.event.UserLoginSessionEvent] [info] [<Domain>\<Username>] [] [LineID] [User <Domain>\<Username>@<IPAddress> logged in as JAX-WS RI 2.2.9-b130926.1035 svn-revisions#<UID>]

<datetime> <vCenterHostname> vcenter-server: User <Domain>\<Username>@<IPAddress> logged out (login time: <datetime>, number of API invocations: <x>, user agent: JAX-WS RI 2.2.9-b130926.1035 svn-revisions#<UID>)

<datetime> <vCenterHostname> vpxd <eventID> - - Event [<LineID>] [1-1] [<datetime>] [vim.event.UserLogoutSessionEvent] [info] [<Domain>\<Username>] [] [LineID] [User <Domain>\<Username>@<IPAddress> logged out (login time: <datetime>, number of API invocations: <x>, user agent: JAX-WS RI 2.2.9-b130926.1035 svn-revisions#<UID>)]

The text strings “vim.event.UserLoginSessionEvent” and “vim.event.UserLogoutSessionEvent” can be used to alert on people logging in to and out of the vCenter.
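
As a rough illustration of how these strings could drive an alert, here is a Python sketch that pulls the user, source IP and direction out of a vpxd event line. The sample line follows the template above, but every value in it (hostname, IDs, IP address) is a made-up placeholder, and the regular expression is an assumption based only on the layout shown in this post.

```python
import re

# Matches the vpxd session events shown above. The direction is read from
# the message text ("logged in" / "logged out") as well as the event name.
SESSION_EVENT = re.compile(
    r"\[vim\.event\.(?P<event>UserLoginSessionEvent|UserLogoutSessionEvent)\]"
    r".*?\[User (?P<user>\S+?)@(?P<ip>[0-9a-fA-F.:]+) logged (?P<action>in|out)\b"
)

# Hypothetical sample line built from the template; all values are placeholders.
SAMPLE_LOGIN = (
    "2019-02-07T11:01:59 vcsa01 vpxd 1234 - - Event [5678] [1-1] "
    "[2019-02-07T11:01:59] [vim.event.UserLoginSessionEvent] [info] "
    "[VSPHERE.LOCAL\\Administrator] [] [5678] "
    "[User VSPHERE.LOCAL\\Administrator@192.0.2.10 logged in as "
    "JAX-WS RI 2.2.9-b130926.1035 svn-revisions#0000]"
)


def parse_session_event(line):
    """Return a dict describing a login/logout event, or None if no match."""
    match = SESSION_EVENT.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_session_event(SAMPLE_LOGIN))
```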


Alteration of vCenter Roles

Creation of a new vCenter role “newCryptoRole”

The default log shows that the new role was created, but not by whom or which privileges it was given.

<datetime> <vCenterHostname> vcenter-server: New role <roleName> created

<datetime> <vCenterHostname> vpxd <eventID> - - Event [<LineID>] [1-1] [<datetime>] [vim.event.RoleAddedEvent] [info] [] [] [LineID] [New role <roleName> created]

This is where the additional vpxd-svcs log is required, as it details who completed the action and which privileges were assigned to the role:

[tomcat-exec-176  INFO  AuthorizationService.AuditLog  opId=] Action performed by principal(name=VSPHERE.LOCAL\Administrator,isGroup=false):Add role Id=-922973018,Name=newCryptoRole,Description=,Tenant=Privileges=[System.Anonymous, System.Read, System.View, Cryptographer.Clone, Cryptographer.Encrypt, Cryptographer.Migrate, Cryptographer.RegisterVM, Cryptographer.ManageKeyServers, Cryptographer.Decrypt, Cryptographer.AddDisk, Cryptographer.ManageKeys, Cryptographer.ManageEncryptionPolicy, Cryptographer.Access, Cryptographer.Recrypt, Cryptographer.RegisterHost, Cryptographer.EncryptNew]

Modification of permissions to any vCenter role

<datetime> <vCenterHostname> vcenter-server: Role modified 
Previous name: <roleName>, new name <newRoleName>
Added privileges: <privilegesAdded>
Removed privileges: <privilegesRemoved>

<datetime> <vCenterHostname> vpxd <eventID> - - Event [<LineID>] [1-1] [<datetime>] [vim.event.RoleUpdatedEvent] [info] [] [] [LineID] [Role modified 
Previous name: <roleName>, new name <newRoleName>
Added privileges: <privilegesAdded>
Removed privileges: <privilegesRemoved>]

The default log shows that the role was modified and which privileges were added or removed, but not by whom. This is where the additional vpxd-svcs log is required, as it details who completed the action:

[tomcat-exec-17  INFO  AuthorizationService.AuditLog  opId=a794037d-a725-4b89-ab96-d3a23a58648c] Action performed by principal(name=VSPHERE.LOCAL\Administrator,isGroup=false):Update role Id=-922973018,Name=newCryptoRole,Description=,Tenant=Privileges=[System.Anonymous, Cryptographer.Clone, Cryptographer.Encrypt, Cryptographer.Migrate, Cryptographer.RegisterVM, Cryptographer.ManageKeyServers, Cryptographer.Decrypt, Cryptographer.AddDisk, Cryptographer.ManageKeys, Cryptographer.ManageEncryptionPolicy, System.View, Cryptographer.Access, Cryptographer.Recrypt, Cryptographer.RegisterHost, System.Read, Cryptographer.EncryptNew, Network.Assign, Network.Config, Network.Move, Network.Delete, Task.Create, Task.Update]

Deletion of a vCenter role

<datetime> <vCenterHostname> vcenter-server: Role <roleName> removed

<datetime> <vCenterHostname> vpxd <eventID> - - Event [<LineID>] [1-1] [<datetime>] [vim.event.RoleRemovedEvent] [info] [] [] [LineID] [Role <roleName> removed]

The default log shows that the role was removed, but not by whom. This is where the additional vpxd-svcs log is required, as it details who completed the action:

[tomcat-exec-2  INFO  AuthorizationService.AuditLog  opId=c0100be8-9114-4e60-9520-4cf1b6015793] Action performed by principal(name=VSPHERE.LOCAL\Administrator,isGroup=false):Delete role -922973018  

Assignment of User to a Role

Assigning a user to a role is not recorded in the default logs; this requires the additional vpxd-svcs log.

 [tomcat-exec-232  INFO  AuthorizationService.AuditLog  opId=] Action performed by principal(name=VSPHERE.LOCAL\Administrator,isGroup=false):Added access control [ Principal=Name=VSPHERE.LOCAL\newCryptoUser,isGroup=false,roles=[-922973018],propogating=true ] to document urn:acl:global:permissions
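
The AuditLog lines from vpxd-svcs.log shown in this section share a common shape, so one pattern can pull out who performed each action. The Python below is a sketch under that assumption: the regular expression and the action labels are derived only from the sample lines in this post, not from any VMware specification.

```python
import re

AUDIT_LINE = re.compile(
    r"AuthorizationService\.AuditLog\s+opId=(?P<op_id>[^\]]*)\]\s+"
    r"Action performed by principal\(name=(?P<principal>[^,]+),"
    r"isGroup=(?P<is_group>true|false)\):(?P<action>.*)"
)

# Prefixes seen in the sample lines above, mapped to illustrative labels.
ACTION_KINDS = (
    ("Add role", "role created"),
    ("Update role", "role updated"),
    ("Delete role", "role deleted"),
    ("Added access control", "permission added"),
)

# Sample lines copied from this post.
DELETE_LINE = (
    r"[tomcat-exec-2  INFO  AuthorizationService.AuditLog  "
    r"opId=c0100be8-9114-4e60-9520-4cf1b6015793] Action performed by "
    r"principal(name=VSPHERE.LOCAL\Administrator,isGroup=false):"
    r"Delete role -922973018"
)
ASSIGN_LINE = (
    r"[tomcat-exec-232  INFO  AuthorizationService.AuditLog  opId=] "
    r"Action performed by principal(name=VSPHERE.LOCAL\Administrator,"
    r"isGroup=false):Added access control [ Principal=Name=VSPHERE.LOCAL"
    r"\newCryptoUser,isGroup=false,roles=[-922973018],propogating=true ] "
    r"to document urn:acl:global:permissions"
)


def parse_audit_line(line):
    """Return principal, opId, action text and a coarse kind, or None."""
    match = AUDIT_LINE.search(line)
    if match is None:
        return None
    action = match.group("action").strip()
    kind = next(
        (label for prefix, label in ACTION_KINDS if action.startswith(prefix)),
        "other",
    )
    return {
        "principal": match.group("principal"),
        "op_id": match.group("op_id"),
        "kind": kind,
        "action": action,
    }


if __name__ == "__main__":
    for sample in (DELETE_LINE, ASSIGN_LINE):
        print(parse_audit_line(sample))
```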

If you attempt to assign a user to a role with higher permissions than your current user, you will receive an error message in the vCenter Web UI.

Additionally, the following event is recorded in the vpxd-svcs.log:

[tomcat-exec-293  WARN  com.vmware.cis.authorization.impl.AclPrivilegeValidator  opId=] User VSPHERE.LOCAL\newUser does not have privileges [System.Anonymous, Cryptographer.Clone, Cryptographer.Encrypt, Cryptographer.Migrate, Cryptographer.RegisterVM, Cryptographer.ManageKeyServers, Cryptographer.Decrypt, Cryptographer.AddDisk, Cryptographer.ManageKeys, Cryptographer.ManageEncryptionPolicy, System.View, Cryptographer.Access, Cryptographer.Recrypt, Cryptographer.RegisterHost, Authorization.ModifyPermissions, System.Read, Cryptographer.EncryptNew] on object urn%3Aacl%3Aglobal%3Apermissions

Adding user to Platform Services Controller SSO Groups

In order to capture logs showing a user being added to the “SystemConfiguration.BashShellAdministrators” group, we require the additional logs ssoAdminServer.log and vmdird-syslog.log.

./sso/ssoAdminServer.log:

pool-4-thread-1 opId=73c87e6b-746c-46f2-9b59-a5da95f5a1c1 INFO  com.vmware.identity.admin.vlsi.PrincipalManagementServiceImpl] [User {Name: Administrator, Domain: vsphere.local} with role 'Administrator'] Adding users to local group 'SystemConfiguration.BashShellAdministrators'

./vmdird/vmdird-syslog.log:

info vmdird  t@139993972463360: MOD 1,add,member: (CN=Administrator,CN=Users,DC=vsphere,DC=local)

info vmdird  t@139993972463360: Modify Entry (CN=SystemConfiguration.BashShellAdministrators,DC=vsphere,DC=local)(from 127.0.0.1)(by <PSCName>@vsphere.local)(via Ext)(USN 4974)


Cryptographic Components

The objective of these alerts is to ensure that vSAN encryption is not disabled (where enabled) or enabled (where it’s not).  Equally, any tampering with KMS (required for encryption) should be correlated back to change control / incident management.

A user with the “Administrator – No Cryptography” role who tries to disable encryption on vSAN is not given the option, due to a lack of privileges.

Disable vSAN Encryption

In this test, vSAN encryption was disabled.  This is considered a reconfiguration of vSAN and logged accordingly.

Default vCenter logs show that vSAN is being reconfigured:

<datetime> <vCenterHostname> vcenter-server: Task: Reconfigure vSAN cluster

However, this is not much help as it only indicates that a change has been made and gives no details of the change.

The ESXi host logs show, at the string [VsanSystemImpl::Update], that vSAN is being reconfigured with encryption set to ‘enabled=false’.

The result was a vSAN with no encryption.

Enabling vSAN encryption

In this test, vSAN encryption was enabled.  This is considered a reconfiguration of vSAN and logged accordingly.

Default vCenter logs show that vSAN is being reconfigured:

<datetime> <vCenterHostname> vcenter-server: Task: Reconfigure vSAN cluster

<datetime> <vCenterHostname> vpxd <eventID> - - Event [<LineID>] [1-1] [<datetime>] [vim.event.TaskEvent] [info] [<domain>\<username>] [<clusterName>] [LineID] [Task: Reconfigure vSAN cluster]

The ESXi host logs show, at the string [VsanSystemImpl::Update], that vSAN is being reconfigured with encryption set to ‘enabled=true’.

Adding a KMS Server

The event of adding an additional KMS is logged, specifying the KMS alias name and the KMS Cluster into which it is added.

<datetime> <vCenterHostname> vpxd <eventID> - - <date> info vpxd[<Rand>] [Originator@xxxx sub=CryptoManager opID-KmipServerPageMediator-add-xxxxx-ngc:<rand>] A new Kmip Server <KMSName> is registered in cluster <KMSCluster>

The string “A new Kmip Server” can be used to alert on a new KMS server being added to the KMS Cluster.

Delete a KMS Server

The KMS Server was unregistered from the VMware vCenter.

The following event describes the removal:

<datetime> <vCenterHostname> vpxd <eventID> - - <date> info vpxd[<Rand>] [Originator@xxxx sub=CryptoManager opID-KmipServerActionResolver-remove-xxxxx-ngc:<rand>] Kmip Server <KMSName> is removed from cluster <KMSCluster>
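
Both KMS events can be picked up with the same check. The following Python sketch assumes the message text matches the two templates above exactly; the sample lines use placeholder server and cluster names, and the function name is purely illustrative.

```python
import re

KMS_ADDED = re.compile(
    r"A new Kmip Server (?P<server>.+?) is registered in cluster (?P<cluster>.+)$"
)
KMS_REMOVED = re.compile(
    r"Kmip Server (?P<server>.+?) is removed from cluster (?P<cluster>.+)$"
)

# Hypothetical sample lines built from the templates; values are placeholders.
SAMPLE_ADD = (
    "info vpxd[12345] [Originator@xxxx sub=CryptoManager "
    "opID-KmipServerPageMediator-add-xxxxx-ngc:70001] "
    "A new Kmip Server kms01 is registered in cluster kms-cluster-a"
)
SAMPLE_REMOVE = (
    "info vpxd[12345] [Originator@xxxx sub=CryptoManager "
    "opID-KmipServerActionResolver-remove-xxxxx-ngc:70002] "
    "Kmip Server kms01 is removed from cluster kms-cluster-a"
)


def classify_kms_event(line):
    """Return the change type, KMS server and KMS cluster, or None."""
    line = line.strip()
    for change, pattern in (("added", KMS_ADDED), ("removed", KMS_REMOVED)):
        match = pattern.search(line)
        if match:
            return {
                "change": change,
                "server": match.group("server"),
                "cluster": match.group("cluster"),
            }
    return None


if __name__ == "__main__":
    print(classify_kms_event(SAMPLE_ADD))
    print(classify_kms_event(SAMPLE_REMOVE))
```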

vMotion

vMotion a VM from vSAN Datastore to Local Storage

The Test Virtual Machine (permbound1) was migrated from vSAN ‘vSANDatastore’ to local storage named ‘ds-local-ESXiHostnameLocalDS’

The following events were recorded by the default vCenter logs.

vcenter-server: Migrating <VMname> from <ESXiHostname>, <datastoreName> to <ESXiHostname>, <datastoreName> in <vCenterDatacenter>

The event notes the time, who carried out the migration (under the field “vc_username”), what was migrated, and the source and destination hosts and datastores.
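
To alert specifically on a VM leaving the vSAN datastore, the migration message can be split into its parts. This Python sketch assumes the message follows the template above and that host names contain no commas; the host and datacenter names in the sample are placeholders.

```python
import re

MIGRATION = re.compile(
    r"Migrating (?P<vm>.+?) from (?P<src_host>[^,]+), (?P<src_ds>.+?) "
    r"to (?P<dst_host>[^,]+), (?P<dst_ds>.+?) in (?P<datacenter>.+)$"
)

# Sample built from the template and the test described above.
# The ESXi host and datacenter names are placeholders.
SAMPLE = (
    "vcenter-server: Migrating permbound1 from esxi01.example.local, "
    "vSANDatastore to esxi01.example.local, ds-local-ESXiHostnameLocalDS in DC1"
)


def parse_migration(line):
    """Return the VM, hosts, datastores and datacenter, or None."""
    match = MIGRATION.search(line.strip())
    return match.groupdict() if match else None


def leaves_datastore(event, datastore):
    """True if the VM moved off the named datastore onto a different one."""
    return event["src_ds"] == datastore and event["dst_ds"] != datastore


if __name__ == "__main__":
    event = parse_migration(SAMPLE)
    print(event)
    print(leaves_datastore(event, "vSANDatastore"))
```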

Upgrading VMware Log Insight

This post covers upgrading existing VMware Log Insight (vRLI) appliances using the upgrade PAK method. The steps are for 4.6.0 to 4.6.1 but are applicable to all 4.x upgrades.

Overview

This details the steps required for the in-place upgrade procedure of a VMware vRealize Log Insight (vRLI) appliance.

Pre-requisites:

  • Verify that VMware Log Insight is properly configured.
  • Download the required upgrade files and update script.
    • Download Upgrade Files: VMware vRealize Log Insight 4.6.1 – Upgrade Package from my.vmware.com.
  • Upgrading must be done from the master node’s FQDN. Upgrading with the Integrated Load Balancer IP address is not supported.
  • When performing a manual upgrade, you must upgrade workers one at a time. Upgrading multiple workers at the same time causes an upgrade failure. When you upgrade the master node to vRealize Log Insight 4.6.1, a rolling upgrade occurs unless specifically disabled.
  • If the vRealize Log Insight upgrade (.pak file) has a new JRE version, then the user-installed certificates in a vRealize Log Insight setup (such as for event forwarding) become invisible after the upgrade.

Upgrade Method:

  1. Take snapshots of the VMware Log Insight nodes.
    1. Recommendation: Shut down the appliances before taking snapshots if you cannot guarantee application consistency.
  2. To apply the update, log in to the Log Insight appliance web interface and choose Administration in the upper right corner.
  3. In the navigation bar on the left side, select Management > Cluster > Upgrade Cluster.
  4. After clicking Upgrade Cluster, browse to the PAK file which was downloaded.
  5. After clicking “Upgrade”, the package will be uploaded to the appliance.
  6. Accept the EULA to start the update. The procedure will take a couple of minutes.
  7. After the appliance has been updated successfully, you’ll get a message with the now active version of vRealize Log Insight. There’s no need for a reboot.

Configuration of rSyslog on VMware vCenter Appliance VCSA and PSC for Logging Authentication and Authorisation Activities

Introduction

As part of a client’s environment, there was a requirement from the end customer to forward additional logging information beyond the default logs forwarded by vCenter Server and Platform Services Controller (PSC).

To provide these additional logs, rSyslog must be configured to monitor the relevant log files.

This post is intended to provide steps to implement these changes.

Additional logging available from non default vCenter logs

Single Sign-On Activities

  • Successful SSO Login
  • Successful SSO Logout
  • Successful SSO Active Directory Login
  • Successful SSO Active Directory Logout
  • Failed SSO Login
  • Failed SSO Login (User not found)
  • Failed SSO Active Directory Login
  • Failed SSO Active Directory Login (User not found)
  • SSO User Creation
  • SSO User Password Change
  • SSO User Deletion
  • SSO Group Creation
  • SSO Group Assignment
  • SSO Group Deletion
  • SSO Password policy update

vCenter Server Activities

  • Successful vCenter Server Login
  • Successful vCenter Server Logout
  • vSphere Permission Created
  • vSphere Permission Updated
  • vSphere Permission Deleted
  • vSphere Role Creation
  • vSphere Role Update
  • vSphere Role Deletion

In order to capture the above activities, you will need to forward the following log files:

  • /var/log/vmware/sso/vmware-sts-idmd.log
  • /var/log/vmware/sso/ssoAdminServer.log
  • /var/log/vmware/vpxd-svcs/vpxd-svcs.log
  • /var/log/vmware/vpx/vpxd.log

NOTE: I am not including the vpxd.log in my implementation below as it is an extremely verbose log and we did not require it for the security events we wished to capture. Additionally I don’t want someone blindly copying the config below without understanding it and accidentally upsetting their environment.

Implementation Steps

VMware Appliance Management Interface (VAMI)

Step 1 – Connect to the VAMI interface for all vCenters and PSCs on HTTPS with port 5480

https://<appliancename>:5480

Step 2 – Configure Syslog with the following settings.

  • Common Log Level
    • Info
  • Remote Syslog Host
    • <vRLI-LoadBalancer-VIP>
  • Remote Syslog Port
    • 6514
  • Remote Syslog Protocol
    • TLS

vCenter Server Appliance

Step 1 – SSH to the VCSA and open the following file /etc/rsyslog.conf for editing.

vi /etc/rsyslog.conf

Step 2 – Press [Insert] to put vi into insert mode and add the following entry towards the top of the file, at the bottom of the ###### Module declarations ###### section.

$ModLoad imfile

Step 3 – Add the following right below the “###### Rule declarations ######” section of the rsyslog configuration file

$InputFileName /var/log/vmware/vpxd-svcs/vpxd-svcs.log
$InputFileTag vpxd-svcs
$InputFileStateFile vpxd-svcs
$InputFileSeverity info
$InputFileFacility local7
$InputRunFileMonitor
$InputFilePollInterval 20

$InputFileName specifies the log file that we want to forward.

$InputFileTag is the appname that will show up when it is forwarded to your remote syslog server

$InputFileStateFile is the name of the state file rsyslog uses to keep track of its position in the monitored log file.

$InputFilePollInterval is set to 20 seconds; the default is 10 if you leave it out.

Step 4 – Save your changes by pressing [Esc], typing the following and pressing Enter:

 :wq

Step 5 – Restart the rsyslog service in the VCSA for the changes to go into effect by running the following command:

systemctl restart rsyslog
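
If more than one of the log files listed earlier needs forwarding, the block from Step 3 has to be repeated per file. The Python below is a sketch of generating those blocks. It assumes that each file needs its own $InputRunFileMonitor line, that the tag and state file name can simply be taken from the file name, and that one $InputFilePollInterval at the end is enough; the function names are made up, and the output should be reviewed before it goes anywhere near /etc/rsyslog.conf.

```python
from pathlib import PurePosixPath


def imfile_stanza(log_path, severity="info", facility="local7"):
    """Return the legacy imfile directives for a single log file."""
    # Assumption: tag and state file name are derived from the file name.
    tag = PurePosixPath(log_path).stem
    return "\n".join([
        f"$InputFileName {log_path}",
        f"$InputFileTag {tag}",
        f"$InputFileStateFile {tag}",
        f"$InputFileSeverity {severity}",
        f"$InputFileFacility {facility}",
        "$InputRunFileMonitor",
    ])


def imfile_config(log_paths, poll_interval=20):
    """Return one stanza per file followed by a single poll interval line."""
    blocks = [imfile_stanza(path) for path in log_paths]
    blocks.append(f"$InputFilePollInterval {poll_interval}")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # vpxd.log is left out on purpose, as explained in the note above.
    print(imfile_config([
        "/var/log/vmware/sso/vmware-sts-idmd.log",
        "/var/log/vmware/sso/ssoAdminServer.log",
        "/var/log/vmware/vpxd-svcs/vpxd-svcs.log",
    ]))
```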

Platform Services Controller Appliance

Step 1 – SSH to the PSC and open the following file /etc/rsyslog.conf for editing.

 vi /etc/rsyslog.conf

Step 2 – Press [Insert] to put vi into insert mode and add the following entry towards the top of the file, at the bottom of the ###### Module declarations ###### section.

$ModLoad imfile

Step 3 – Add the following right below the “###### Rule declarations ######” section of the rsyslog configuration file

$InputFileName /var/log/vmware/vpxd-svcs/vpxd-svcs.log
$InputFileTag vpxd-svcs
$InputFileStateFile vpxd-svcs
$InputFileSeverity info
$InputFileFacility local7
$InputRunFileMonitor
$InputFilePollInterval 20
  • $InputFileName specifies the log file that we want to forward.
  • $InputFileTag is the appname that will show up when it is forwarded to your remote syslog server.
  • $InputFileStateFile is the name of the state file rsyslog uses to keep track of its position in the monitored log file.
  • $InputFilePollInterval is set to 20 seconds; the default is 10 if you leave it out.

Step 4 – Save your changes by pressing [Esc], typing the following and pressing Enter:

 :wq

Step 5 – Restart the rsyslog service on the PSC for the changes to go into effect by running the following command:

systemctl restart rsyslog

Log Insight Agent Recursive Directory Support Limitations

Since vRealize Log Insight 4.5 the agent has had recursive directory support; however, the base path must be at least 2 levels deep.

If you configure a log directory less than 2 levels deep using a wildcard, it will be accepted in the Log Insight Agent config UI; however, on the endpoint you will see the following error in the LIAgent log directory (C:\ProgramData\VMware\Log Insight Agent\Logs):

2019-02-07 11:01:59.592025 0x000026e4 <error> FLogCollectorEx:452| Failed to initialize channel [filelog|com.microsoft.iis.IISWildcard] because of improper configuration. DirectoryMonitorEx::CheckBasePathEligible(): The base path should be at least 2 level(s) deep: [D:\Logfiles].

Unfortunately, at this time there is no workaround.
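
A directory setting can at least be checked before it is pushed out to agents. The Python below is a sketch that assumes "levels" means the number of folders below the drive root in the part of the path before the first wildcard, which is consistent with the error above rejecting D:\Logfiles. The function names are illustrative and not part of the agent.

```python
from pathlib import PureWindowsPath

MIN_LEVELS = 2  # taken from the error message above


def wildcard_base_parts(directory):
    """Return the folder names of the base path (before the first wildcard)."""
    path = PureWindowsPath(directory)
    # Drop the drive/root component (for example "D:\") if present.
    parts = path.parts[1:] if path.anchor else path.parts
    base = []
    for part in parts:
        if "*" in part:
            break
        base.append(part)
    return base


def base_path_is_eligible(directory):
    """True if the base path is at least MIN_LEVELS folders deep."""
    return len(wildcard_base_parts(directory)) >= MIN_LEVELS


if __name__ == "__main__":
    for candidate in (r"D:\Logfiles\*", r"D:\Logfiles\IIS\*"):
        print(candidate, base_path_is_eligible(candidate))
```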