Monday, October 10, 2011

Rerunning the “Configure” option for Array Managers throws the error: “Error occurred: XML document is empty.” after rebuilding VMware SRM at a site that suffered a host failure

Problem

You’ve just rebuilt VMware SRM (Site Recovery Manager) 4.1.1 after suffering a host failure and would like to rerun the Configure option for the Array Manager:

image

You navigate through the wizard to the Review Replicated Datastores window and decide to run the Rescan Arrays option:

image

The rescan option begins but errors out with the following error message:

Error occurred: XML document is empty.

image

Solution

A Rescan Arrays operation errors out with this message because the credentials for the array need to be re-entered.  Hit the Back button to return to the Protected Site Array Managers step in the wizard, highlight the array manager and click on the Edit button:

image

You’ll notice that the same error message will immediately pop up so continue and click the OK button:

image

Continue and re-enter the username and password to access the array:

image

Select the array at the site where you’ve rebuilt SRM:

image

Proceed back to the Review Replicated Datastores window and continue to run the Rescan Arrays option:

image

image

You will now notice that the scan completes without errors.

Thursday, October 6, 2011

Resetting a forgotten root password of an ESXi 4.1 server with a “Repair” install

I was asked by a client earlier this week for the root password of an ESXi 4.1 server one of my colleagues installed.  While we did get a response from my colleague before he flew off on vacation, we weren’t able to log in after numerous attempts with all the variations we could think of.  The next step was to reset the password and as most administrators know, resetting the password on an ESX server was fairly easy but on an ESXi server it wasn’t.  The only VMware supported way which had a public KB (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1317898) was to perform a “Repair” via an installation disk on the ESXi server.  Now before the reader begins wondering why it was important that we recovered the password, it’s because we had a local datastore on the local disks of the server that had virtual machines running on it.  With that being said, the alternative to performing a “Repair” was to boot from a Linux CD and fiddle around with the shadow file and because I hadn’t done this before, I wasn’t prepared to try it out on a production ESXi server.

While the client did agree that he was willing to take the safer route, which was to perform a “Repair” on the ESXi host, I faintly remembered that the last time I did such a repair, all of the configuration on the server was lost.  This prompted me to ask if we had information about all the port groups the ESXi server had and as expected, we didn’t.  What we did in the end was the unsupported method of resetting the password but this post will show the “how to” and what happens to an ESXi server when you perform a repair on the hypervisor.

Obtain an ESXi CD or DVD with the same major release (i.e. 4.1 or 4.1 U1 for a 4.1 server) and boot the server into the installer:

image

Choose the Repair option by hitting the R key:

image

------------------------------------------------------------------------------------------------------------------------------------------------------------------

Note:  Make sure you DO NOT use the Install option or you’ll see the following screens:

image

image

If you see the following message:

You have selected a disk that contains at least one partition with existing data.

If you continue the selected disk will be overwritten.

… then you have selected the WRONG option.

image

------------------------------------------------------------------------------------------------------------------------------------------------------------------

Proceed with agreeing to the EULA:

image

Select the partition on which you would like to install ESXi:

image

Note that you will now be presented with the following message:

Confirm Disk Selection

You have selected a disk that contains at least one partition with existing data.

The partition table on the disk will be examined before the recovery process begins. If any VMFS partitions are found an attempt will be made to preserve them. You will be notified if any potential problem is encountered before any destructive operations occur on the disk.

The message above is what we want to see.

image

Proceed to confirm with the install:

image

You’ll notice that the repair will now begin:

image

image

Upon completion, you will be asked to reboot the server:

image

image

The server will proceed to reboot and boot into ESXi:

image

Upon successfully booting into ESXi, you’ll notice that your management network IP is now back to 0.0.0.0:

image

Proceed with logging in via the root account with a blank password, navigate to your management network’s NICs and you’ll notice that your vmnics for the management network will be back to the defaults:

image

Logging into the server via the vSphere Client will show that you pretty much have a plain new install in evaluation mode:

image

image

However, you’ll notice that your VMFS volume was left untouched:

image

… and there you have it!  Make sure you have all the information you need to reconfigure your ESXi host if you are going to perform a repair on the hypervisor.

I’ll be writing another blog post about resetting the password via the shadow file when I get the chance.

Wednesday, October 5, 2011

Unable to relay mail to external domain recipient through Exchange Server 2010

I ran into an issue today that got me frustrated not because I couldn’t solve it right away but because I’ve come across this before and forgot what I did to solve it.  Now that I’ve resolved the issue, I figured I’d blog it so I can reference this post in the future.

I encountered the problem when I completed the setup of a new Kiwi Syslog server which allowed me to send VMware Data Recovery reports and logs to it and in turn email recipients so they would know whether the daily backups were successful or not.  The environment had Exchange Server 2010 as their messaging service and I’ve configured numerous Exchange Receive Connectors for mail relaying for various devices but as I completed my syslog configuration and executed a test email to be sent to my email address, the email never came.  I initially thought it was something minor such as a firewall rule or an IP address so I went ahead and checked the obvious.  After going through the configuration numerous times to verify I hadn’t made any mistakes, I realized that I wasn’t any closer to solving the problem.  Seeing how I wasn’t really getting anywhere, I began reviewing the logs from the syslog server which showed the following:

2011-10-05 16:45:10 PI Message to: tluk@ccs.bm, adminterence@domain.bm
2011-10-05 16:45:10 PI Message from: vdr01@domain.bm
2011-10-05 16:45:10 PI Subject: Syslog message from vc03 - vDR traffic
2011-10-05 16:45:10 PI Date: Wed, 05 Oct 2011 16:45:10 -0300
2011-10-05 16:45:20 PI Mail error: 550 5.7.1 Unable to relay

image

Then I proceeded to turn on verbose logging on the Exchange 2010 Hub Transport server’s receive connector handling the connection and navigated to the path:

C:\Program Files\Microsoft\Exchange Server\V14\TransportRoles\Logs\ProtocolLog\SmtpReceive

… where the log was written:

image

Opening up the logs showed the following:

#Software: Microsoft Exchange Server
#Version: 14.0.0.0
#Log-type: SMTP Receive Protocol Log
#Date: 2011-10-05T19:17:57.149Z
#Fields: date-time,connector-id,session-id,sequence-number,local-endpoint,remote-endpoint,event,data,context
2011-10-05T19:17:57.149Z,CAS01\VC03,08CE51AA847AC9C0,0,10.10.1.59:25,10.10.1.53:49654,+,,
2011-10-05T19:17:57.149Z,CAS01\VC03,08CE51AA847AC9C0,1,10.10.1.59:25,10.10.1.53:49654,*,SMTPSubmit SMTPAcceptAnySender SMTPAcceptAuthoritativeDomainSender AcceptRoutingHeaders,Set Session Permissions
2011-10-05T19:17:57.149Z,CAS01\VC03,08CE51AA847AC9C0,2,10.10.1.59:25,10.10.1.53:49654,>,"220 CAS01.domainnet.com Microsoft ESMTP MAIL Service ready at Wed, 5 Oct 2011 16:17:56 -0300",
2011-10-05T19:17:57.164Z,CAS01\VC03,08CE51AA847AC9C0,3,10.10.1.59:25,10.10.1.53:49654,<,EHLO vc03,
2011-10-05T19:17:57.164Z,CAS01\VC03,08CE51AA847AC9C0,4,10.10.1.59:25,10.10.1.53:49654,>,250-CAS01.domainnet.com Hello [10.10.1.53],
2011-10-05T19:17:57.164Z,CAS01\VC03,08CE51AA847AC9C0,5,10.10.1.59:25,10.10.1.53:49654,>,250-SIZE 10485760,
2011-10-05T19:17:57.164Z,CAS01\VC03,08CE51AA847AC9C0,6,10.10.1.59:25,10.10.1.53:49654,>,250-PIPELINING,
2011-10-05T19:17:57.164Z,CAS01\VC03,08CE51AA847AC9C0,7,10.10.1.59:25,10.10.1.53:49654,>,250-DSN,
2011-10-05T19:17:57.164Z,CAS01\VC03,08CE51AA847AC9C0,8,10.10.1.59:25,10.10.1.53:49654,>,250-ENHANCEDSTATUSCODES,
2011-10-05T19:17:57.164Z,CAS01\VC03,08CE51AA847AC9C0,9,10.10.1.59:25,10.10.1.53:49654,>,250-AUTH,
2011-10-05T19:17:57.164Z,CAS01\VC03,08CE51AA847AC9C0,10,10.10.1.59:25,10.10.1.53:49654,>,250-8BITMIME,
2011-10-05T19:17:57.164Z,CAS01\VC03,08CE51AA847AC9C0,11,10.10.1.59:25,10.10.1.53:49654,>,250-BINARYMIME,
2011-10-05T19:17:57.164Z,CAS01\VC03,08CE51AA847AC9C0,12,10.10.1.59:25,10.10.1.53:49654,>,250 CHUNKING,
2011-10-05T19:17:57.274Z,CAS01\VC03,08CE51AA847AC9C0,13,10.10.1.59:25,10.10.1.53:49654,<,RSET,
2011-10-05T19:17:57.274Z,CAS01\VC03,08CE51AA847AC9C0,14,10.10.1.59:25,10.10.1.53:49654,*,Tarpit for '0.00:00:05',
2011-10-05T19:18:02.290Z,CAS01\VC03,08CE51AA847AC9C0,15,10.10.1.59:25,10.10.1.53:49654,>,250 2.0.0 Resetting,
2011-10-05T19:18:02.290Z,CAS01\VC03,08CE51AA847AC9C0,16,10.10.1.59:25,10.10.1.53:49654,<,MAIL FROM: <vdr01@domain.bm>,
2011-10-05T19:18:02.290Z,CAS01\VC03,08CE51AA847AC9C0,17,10.10.1.59:25,10.10.1.53:49654,*,08CE51AA847AC9C0;2011-10-05T19:17:57.149Z;1,receiving message
2011-10-05T19:18:02.290Z,CAS01\VC03,08CE51AA847AC9C0,18,10.10.1.59:25,10.10.1.53:49654,>,250 2.1.0 Sender OK,
2011-10-05T19:18:02.305Z,CAS01\VC03,08CE51AA847AC9C0,19,10.10.1.59:25,10.10.1.53:49654,<,RCPT TO:<tluk@ccs.bm>,
2011-10-05T19:18:02.305Z,CAS01\VC03,08CE51AA847AC9C0,20,10.10.1.59:25,10.10.1.53:49654,*,Tarpit for '0.00:00:05',
2011-10-05T19:18:07.321Z,CAS01\VC03,08CE51AA847AC9C0,21,10.10.1.59:25,10.10.1.53:49654,>,550 5.7.1 Unable to relay,
2011-10-05T19:18:07.337Z,CAS01\VC03,08CE51AA847AC9C0,22,10.10.1.59:25,10.10.1.53:49654,<,QUIT,
2011-10-05T19:18:07.337Z,CAS01\VC03,08CE51AA847AC9C0,23,10.10.1.59:25,10.10.1.53:49654,>,221 2.0.0 Service closing transmission channel,
2011-10-05T19:18:07.337Z,CAS01\VC03,08CE51AA847AC9C0,24,10.10.1.59:25,10.10.1.53:49654,-,,Local

image

What I noticed to be consistent was the error:

550 5.7.1 Unable to relay
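If you would rather not eyeball the protocol log, a minimal Python sketch along these lines could pull out every rejected command.  It assumes the log keeps the comma-separated layout shown in the #Fields header above and treats any server reply starting with a 5xx code as a rejection.  The function name find_rejections is my own and is not part of Exchange.

```python
import csv

# Column names copied from the "#Fields:" header of the protocol log.
FIELDS = [
    "date-time", "connector-id", "session-id", "sequence-number",
    "local-endpoint", "remote-endpoint", "event", "data", "context",
]


def find_rejections(log_text):
    """Return (timestamp, last client command, server reply) for each 5xx reply."""
    # Drop blank lines and the "#..." header lines before handing rows to csv.
    lines = [line for line in log_text.splitlines()
             if line and not line.startswith("#")]
    last_command = {}  # session-id -> most recent line received from the client
    rejections = []
    for row in csv.DictReader(lines, fieldnames=FIELDS):
        session = row["session-id"]
        event = row["event"]
        data = row["data"] or ""
        if event == "<":
            # "<" marks a command received from the client.
            last_command[session] = data
        elif event == ">" and data[:3].isdigit() and data[0] == "5":
            # ">" marks a reply sent by the server; 5xx is a permanent failure.
            rejections.append(
                (row["date-time"], last_command.get(session, ""), data))
    return rejections
```

Fed the log above, this should pair the 550 5.7.1 Unable to relay reply with the RCPT TO command that triggered it, which is what points at the recipient rather than the sender.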

The hunch I immediately had was that Exchange was not allowing me to relay email out to a recipient at a domain that wasn’t internal so I went ahead to do a few telnet tests:

To an external email address

220 CAS01.domain.com Microsoft ESMTP MAIL Service ready at Wed, 5 Oct 2011 16:24:14 -0300
helo
250 CAS01.domain.com Hello [10.10.1.53]
mail from:vdr01@domain.bm
250 2.1.0 Sender OK
rcpt to:tluk@ccs.bm
550 5.7.1 Unable to relay

image

To an internal email address

220 CAS01.domain.com Microsoft ESMTP MAIL Service ready at Wed, 5 Oct 2011 16:26:49 -0300
helo
250 CAS01.domain.com Hello [10.10.1.53]
mail from:vdr01@domain.bm
250 2.1.0 Sender OK
rcpt to:adminerence@domain.bm
250 2.1.5 Recipient OK

image
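The two telnet tests can also be scripted.  The following is only a sketch using Python's standard smtplib module: it sends HELO, MAIL FROM and one RCPT TO per recipient, records the server's reply for each and never sends DATA, so no message is actually delivered.  The helper name probe_relay is hypothetical.

```python
import smtplib


def probe_relay(smtp, sender, recipients):
    """Return {recipient: (reply code, reply text)} without sending a message.

    `smtp` is expected to behave like a connected smtplib.SMTP object.
    """
    smtp.helo()
    smtp.mail(sender)
    results = {}
    for recipient in recipients:
        # A rejected RCPT TO does not abort the transaction, so several
        # recipients can be checked in the same session.
        code, message = smtp.rcpt(recipient)
        results[recipient] = (code, message.decode("ascii", "replace"))
    return results
```

To run it against a real server, open the connection with something like smtplib.SMTP("your-hub-transport-server", 25, timeout=10), where the host name is a placeholder for your own Hub Transport server, and pass that object in along with the sender and recipient addresses.  In my case an external recipient would come back with 550 and an internal one with 250.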

Solution

What the telnet tests revealed was that I had no issues relaying mail to domains that Exchange hosted internally which was no surprise because the receive connector I used was allowing other scanners and printers to relay email.  This was when I remembered that Exchange handled mail relay requests for internal and external domains based on the settings configured in the Authentication tab for the Receive Connector.  Opening the receive connector’s properties and navigating to the Authentication tab will show the following options:

  • Transport Layer Security (TLS)
    • Enable Domain Security (Mutual Auth TLS)
  • Basic Authentication
    • Offer Basic authentication only after starting TLS
  • Exchange Server authentication
  • Integrated Windows authentication
  • Externally Secured (for example, with IPsec)

image

Without dragging this on, the security mechanism to enable so relaying servers can send to external recipients is the option:

Externally Secured (for example, with IPsec)

With this option selected, you will notice that it is also mandatory to select the option:

Exchange servers

image

The configuration shown above

Authentication –> Externally Secured (for example, with IPsec)

Permissions Group –> Exchange servers

… will allow you to relay emails out to an external recipient when the sender has an internal email address.  If you were to use an external domain as the sender address, you’ll also need to check the option:

Permissions Group –> Anonymous

image

Hope this helps anyone who might be experiencing the same problem as I did today and I’m glad I’ll be able to reference this post when I get that “déjà vu” feeling again in the future when I’m bound to come across this again.

Monday, October 3, 2011

Recovering / reinstalling SRM (Site Recovery Manager) 4.1.1 after suffering a host failure

I’ve been meaning to write a post about recovering / reinstalling SRM 4.1 after having to rebuild one when a client suffered a host failure but never got the chance to until this weekend.  The incident happened during a planned datacenter move a few weeks ago where the environment had SRM 4.1 collocated with vCenter 4.1 on a physical server and someone decided to perform firmware upgrades during the move which resulted in the vCenter server continuously blue-screening after the upgrade.  The priority when the host failed was obviously not the recovery of SRM because vCenter was more important but I ended up going in to recover SRM a few days later.  What I noticed as I started the recovery was that I wasn’t able to find a public KB from VMware that clearly outlined steps for situations such as these and the closest KB article I was able to find was:

Migrating an SRM server to run on a different host
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1008426

image

So armed with this KB, I went on to reinstall SRM 4.1 onto the production vCenter 4.1 server (the protected site).

Assumptions

The following are the assumptions for the environment:

  1. There have been no changes made to the SAN replication and it is still in working order.
  2. You are using the same vCenter version as prior to the failure.
  3. You have a backup of the SRM database.
  4. SRM is using Microsoft SQL server for the database service.

Downloading the SRA (Storage Replication Adapter)

Before proceeding to reinstall SRM, you should download the SRA for the SAN so proceed with opening up a web browser and navigate to http://www.vmware.com/download/srm

image 

Click on the Show Details link to expand the list of downloads:

image

Proceed with scrolling down the list of downloads to the one for your SAN:

image image

Download and install Microsoft SQL server for SRM 4.1

The next step for the recovery process is to install Microsoft SQL and I can’t help but vent that I’ve come across way too many environments with the incorrect Microsoft SQL server installed.  While I have yet to see an install cease to function because an unsupported Microsoft SQL server was used, I still prefer to stick with what VMware has listed in the SRM Compatibility Matrix 4.x (srm_compat_matrix_4_x.pdf) so please refer to the following list for the supported SQL Server editions:

image

For the purpose of this demonstration, I will be using SQL Server 2005 Express Edition SP2 which can be downloaded here:

http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=22625

image

Proceed with the install by running the executable:

image

Configuring Microsoft SQL Server for SRM

Once SQL server has been installed, open the SQL Server 2005 Surface Area Configuration for the instance:

image

Navigate to Instance Name –> Database Engine –> Remote Connections and change Local connections only to Local and remote connections with the option Using TCP/IP only:

image image

Clicking on the Apply button will prompt you with a warning message that the changes will not apply until you restart the database services but you won’t need to restart the services just yet as there are still changes required to be made:

image

Proceed with opening SQL Server Configuration Manager and navigate to SQL Server 2005 Network Configuration (32bit) –> Protocols for SQLEXPRESS and enable TCP/IP:

image

image

You will again be prompted with a warning that you will need to restart the services for the changes to take effect but there are still changes to be made before a restart so proceed with the next steps:

image

If you’re using SQL Server Express as shown in this demonstration, you will need to remove the dynamic ports that the default installation sets so open SQL Server Configuration Manager and navigate to SQL Server 2005 Network Configuration (32bit) –> Protocols for SQLEXPRESS, then right click on TCP/IP and choose Properties:

image

image

Navigate to the IP Addresses tab and make sure you change all of the TCP Port fields to 1433 (default for Microsoft SQL) and all of the TCP Dynamic Ports fields to the value of 0:

image imageimage

Applying the changes will once again warn you that a service restart will be required for the changes to take effect but we’re not done with the changes so proceed on with the next steps:

image

There is no need for Shared Memory, Named Pipes or VIA to be enabled so disable the protocols if they’re still enabled:

image

With all of these changes made, proceed with restarting the services either in the service console:

image

… or the SQL Server 2005 Surface Area Configuration:

image
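Once the services have restarted, it is worth confirming that the instance is actually listening on the static port before moving on.  The following Python sketch only checks that a TCP connection to port 1433 can be opened; it does not log in to SQL Server or validate the instance in any other way, and the function name tcp_port_open is my own.

```python
import socket


def tcp_port_open(host, port=1433, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling tcp_port_open("localhost") on the SQL server itself, or passing the server's name from the SRM host, should return True if the TCP/IP changes above took effect.  A False result usually means the services were not restarted, a port field was missed or a firewall is in the way.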

Restoring SRM Database

Proceed with launching the Microsoft SQL Server Management Studio administration console:

image

From here, you have the options of:

  1. Restore a .bak file of your SRM database from a previous backup.
  2. Re-attach the .mdf and .ldf files for your SRM database.

In my situation, the client had the .mdf and .ldf files stored on a separate LUN so all I had to do was reattach the LUN to the server and reattach the database.  With that being said, if you don’t intend on restoring the master database to this SQL Express server such as what I’m doing here, the security logins for the server will be missing so prior to reattaching or restoring the database, you will need to configure the security account used for the DSN connection on the SQL server instance first.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

If you’re going to restore the master database, you can ignore the following step:

Navigate to localhost\SQLEXPRESS –> Security then right click on Logins and select New Login.  Within the Login – New window, select the account you used to connect to the SRM database prior to the reinstall:

image

Once you’ve configured the login, proceed with clicking the OK button and confirm that the login is now listed under the Logins node:

image

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

With the service account created under the SQL Express database’s login, proceed with restoring your SRM database.  The following demonstration will use the Attach feature:

image

image

image

With the database restored, we will now proceed with configuring the other requirements for the SRM database outlined in the deployment guide:

image

Open up a new SQL query and execute the following command:

CREATE SCHEMA VMW_SRM

Note that VMW_SRM is the name of the service account used in this demonstration, so your schema does not need to be named that way as long as it matches your own account.

image

With the schema created under the same name as the service account used by the DSN to access the database, open up the SRM database’s properties and configure the Default schema with the schema we created:

image

image

image

With the database’s configuration complete, proceed with opening the properties of the service account you’re using for the DSN connection and give it the bulkadmin and public roles:

image

Map the account to the SRM database:

image

Configuring the 32-bit SRM DSN

With the configuration for the database and user account completed, proceed with creating the 32-bit DSN for the SRM database.  I won’t go into too much detail but for more information, please refer to one of my vCenter / Update Manager posts (use the 32-bit instructions):

http://terenceluk.blogspot.com/2011/02/creating-vcenter-update-manager-41-sql.html

cd\Windows\syswow64

odbcad32.exe

image

image

Installing SRM 4.1

Now that all of the prerequisites have been installed and configured, proceed with running the installation binaries for VMware Site Recovery Manager:

image

image

Note that you’ll be warned that your production vCenter server already has an extension registered for SRM during the vCenter server registration section and since you’re recovering from a host failure, proceed with selecting Yes:

image

Note that it is important that you fill in the field Local Site Name with the same site name you used for the SRM site you are recovering or you’ll receive an error when you’ve completed the recovery:

image

image

Make sure you select the Use existing database option:

image

image

image

Reinstalling the SRA (Storage Replication Adapter)

With SRM reinstalled, proceed with installing the SRA you downloaded earlier:

image

image

image

image

image

Download and install the vCenter Site Recovery Manager plug-in

With the SRA installed, proceed with launching vCenter and install the vCenter Site Recovery Manager plug-in:

image

image

image

image

image

Launch Site Recovery Manager

With the plug-in for SRM installed and enabled, proceed with opening the Site Recovery plug-in:

image

Run the installcreds utility to register account credentials on the new host with the old DSN

Open up the command prompt as an administrator and change the directory to:

C:\Program Files (x86)\VMware\VMware vCenter Site Recovery Manager\bin>

… within the directory above, execute the following:

installcreds.exe -key db:srm -u domain\vmw_srm

For this demonstration, the database user name is a domain account named VMW_SRM so please change that to the appropriate domain and user account for your environment.

image

Run the srm-config utility to establish an authenticated connection to the local VirtualCenter server

Open up the command prompt as an administrator and change the directory to:

C:\Program Files (x86)\VMware\VMware vCenter Site Recovery Manager\bin>

… within the directory above, execute the following:

srm-config.exe -cmd updateuser -cfg ..\config\vmware-dr.xml -u VMW_SRM

For this demonstration, the database user name is VMW_SRM so please change that to the appropriate user account for your environment.

image
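If you find yourself repeating these two utilities on more than one rebuild, the command lines can be assembled in a small script to avoid typos.  This Python sketch only uses the switches shown above; the install path and the db:srm key are copied from this demonstration and are assumptions for any other environment, and the helper names are hypothetical.

```python
import ntpath
import subprocess

# Path copied from this post; adjust it if SRM was installed elsewhere.
SRM_BIN = r"C:\Program Files (x86)\VMware\VMware vCenter Site Recovery Manager\bin"


def build_commands(db_account, config_user, key="db:srm", bin_dir=SRM_BIN):
    """Return the installcreds and srm-config argument lists, in run order."""
    installcreds = [
        ntpath.join(bin_dir, "installcreds.exe"),
        "-key", key, "-u", db_account,
    ]
    srm_config = [
        ntpath.join(bin_dir, "srm-config.exe"),
        "-cmd", "updateuser",
        "-cfg", r"..\config\vmware-dr.xml",  # relative to the bin directory
        "-u", config_user,
    ]
    return [installcreds, srm_config]


def run_commands(commands, bin_dir=SRM_BIN):
    """Run each command from the bin directory, stopping at the first failure."""
    for argv in commands:
        # No output capture, so the console stays attached to the utility
        # in case it asks for input.
        subprocess.run(argv, cwd=bin_dir, check=True)
```

For the accounts used in this post, that would be run_commands(build_commands(r"domain\vmw_srm", "VMW_SRM")) from an elevated prompt on the SRM server.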

Review Protection Groups

Proceed with logging into the Site Recovery plug-in and verify that your protection groups are in good health:

image

… and we’re done!  I ran into more errors after bringing the protected site up but will separate those errors into other blog posts instead.