1. Diagnostics in the cloud
At some point you might need to debug your code, or
you’ll want to judge how healthy your application is while it’s running
in the cloud. We don’t know about you, but the more experienced we get
with writing code, the more we realize that our code is less than perfect.
We’ve drastically reduced the amount of debugging we need to do by using
test-driven development (TDD), but we still need to fire up the
debugger once in a while.
Debugging locally with the SDK
is easy, but once you move to the cloud you can’t debug at all; instead,
you need to log the behavior of the system. For logging, you can use
either the infrastructure that Azure provides or your own
logging framework. Logging, like in traditional environments, is going
to be your primary mechanism for collecting information about what’s
happening with your application.
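For example, if you stick with the standard System.Diagnostics tracing API, your existing log statements can flow into the Azure infrastructure unchanged. The sketch below assumes the SDK's DiagnosticMonitorTraceListener has been registered as a trace listener (typically in web.config); the CheckoutService class and its method are made up for illustration.

```csharp
using System;
using System.Diagnostics;

public class CheckoutService
{
    public void ProcessOrder(string orderId)
    {
        // Ordinary .NET tracing calls; when the Azure
        // DiagnosticMonitorTraceListener is registered (for example in
        // web.config), the diagnostic agent picks these up and ships them
        // to table storage on its scheduled transfer.
        Trace.TraceInformation("Processing order {0}", orderId);

        try
        {
            // ... order processing ...
        }
        catch (Exception ex)
        {
            Trace.TraceError("Order {0} failed: {1}", orderId, ex);
            throw;
        }
    }
}
```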
1.1. Using Azure Diagnostics to find what’s wrong
Logs are handy. They help you
find where the problem is, and can act as the flight data recorder for
your system. They come in handy when your system has completely burned
down, fallen over, and sunk into the swamp. They also come in handy when
the worst hasn’t happened, and you just want to know a little bit more
about the behavior of the system as it’s running. You can use logs to
analyze how your system is performing, and to understand better how it’s
behaving. This information can be critical when you’re trying to
determine when to scale the system, or how to improve the efficiency of
your code.
The drawback with logging is
that hindsight is 20/20. It’s obvious, after the crash, that you
should’ve enabled logging or that you should’ve logged a particular
segment of code. As you write your application, it’s important to
consider instrumentation as an aspect of your design.
Logging is much more than
just remote debugging, 1980s-style. It’s about gathering a broad set of
data at runtime that you can use for a variety of purposes; debugging is
one of those purposes.
1.2. Challenges with troubleshooting in the cloud
When
you’re trying to diagnose a traditional on-premises system, you have
easy access to the machine and the log sources on it. You can usually
connect to the machine with a remote desktop and get your hands on it.
You can parse through log files, both those created by Windows and those
created by your application. You can monitor the health of the system
by using Performance Monitor, and tap into any source of information on
the server. During troubleshooting, it’s common to leverage several
tools on the server itself to slice and dice the mountain of data to
figure out what’s gone wrong.
You simply can’t do this in the
cloud. You can’t log in to the server directly, and you have no way of
running remote analysis tools. But the bigger challenge in the cloud is
the dynamic nature of your infrastructure. On-premises, you have access
to a static pool of servers. You know which server was doing what at all
times. In the cloud, you don’t have this ability. Workloads can be
moved around; servers can be created and destroyed at will. And you
aren’t trying to diagnose the application on one server, but across a
multitude of servers, collating and connecting information from all the
different sources. The number of servers used in cloud applications can
swamp most diagnostic analysis tools. The sheer amount of data available
can cause bottlenecks in your system.
For example, a typical web
user, as they browse your website and decide to check out, can be
bounced from instance to instance because of the load balancer. How do
you truly determine the load on your system, or the cause of the slow
response a user experienced while checking out of your site? You need access to
all the data that’s available on terrestrial servers and you need the
data collated for you.
You also need close control over the diagnostic data producers. You need an easy way to dial the level of information from debug to critical.
While you’re testing your systems, you need all the data, and you need
to know that the additional load it places on the system is acceptable.
During production, you want to know only about the most critical issues,
and you want to minimize the impact of these issues on system
performance.
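Jumping ahead a little to the Windows Azure Diagnostics API covered in the next section, here's a rough sketch of what that dial can look like in a role's startup code. It assumes the classic DiagnosticMonitor API and the "DiagnosticsConnectionString" setting name used by the default project templates; treat it as an illustration rather than the only way to do it.

```csharp
using System;
using Microsoft.WindowsAzure.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        DiagnosticMonitorConfiguration config =
            DiagnosticMonitor.GetDefaultInitialConfiguration();

        // While testing: ship everything, and ship it often.
        config.Logs.ScheduledTransferLogLevelFilter = LogLevel.Verbose;
        config.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);

        // In production you might dial this back to only the worst news:
        // config.Logs.ScheduledTransferLogLevelFilter = LogLevel.Critical;

        DiagnosticMonitor.Start("DiagnosticsConnectionString", config);
        return base.OnStart();
    }
}
```

Flipping that filter from Verbose to Critical is the difference between drinking from the firehose during a test run and hearing only about disasters in production.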
For all these reasons, the
Windows Azure Diagnostics platform sits on top of what is already
available in Windows. The diagnostics team at Microsoft has extended and
plugged in to the existing platform, making it easy for you to learn,
and easy to find the information you need.
2. Diagnostics in the cloud is just like normal (almost)
With the challenges of
diagnostics at cloud-scale, it’s amazing that the solution is so simple
and elegant. Microsoft chose to keep everything that you’re used to in
its place. Every API, tool, log, and data source works the way it always has,
which keeps the data sources familiar and well documented. The diagnostics
team provides a small process called MonAgentHost.exe that’s started on
your instances.
The MonAgentHost
process is started automatically, and it acts as your agent on the box.
It knows how to tap into all the sources, and it knows how to merge the
data and move it to the correct locations so you can analyze it. You can
configure the process on the fly without having to restart the host
it’s running on. This is critical. You don’t want
to have to take down a web role instance just to dial up the amount of
diagnostic information you’re collecting. You can control data
collection across all your instances with a simple API. All the moving
parts of the process are shown in figure 1.
Your role instance must be running in full-trust mode to be able to run
the diagnostic agent. If your role instance is running in partial
trust, the agent won't be able to start.
As the developer, you’re always in control of what’s being collected and when it’s collected. You can communicate with MonAgentHost
by submitting a configuration change to the process. When you submit
the change, the process reloads and starts executing your new commands.
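As a rough sketch of what submitting such a change might look like from outside the role, the snippet below assumes the diagnostic management classes that ship in later versions of the SDK (DeploymentDiagnosticManager and friends); the storage connection string, deployment ID, role name, and instance ID are placeholders you'd replace with your own.

```csharp
using Microsoft.WindowsAzure.Diagnostics;
using Microsoft.WindowsAzure.Diagnostics.Management;

class DialUpLogging
{
    static void Main()
    {
        // Placeholder values -- substitute your own storage connection
        // string, deployment ID, role name, and instance ID.
        string storage = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...";
        var deployment = new DeploymentDiagnosticManager(storage, "deploymentId");

        var instanceManager = deployment.GetRoleInstanceDiagnosticManager(
            "WebRole", "WebRole_IN_0");

        // Pull the running configuration, change it, and push it back.
        // The diagnostic agent on that instance picks up the new settings
        // without the role being restarted.
        DiagnosticMonitorConfiguration config =
            instanceManager.GetCurrentConfiguration();
        config.Logs.ScheduledTransferLogLevelFilter = LogLevel.Verbose;
        instanceManager.SetCurrentConfiguration(config);
    }
}
```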
2.1. Managing event sources
The local diagnostic agent
can find and access any of the normal Windows diagnostic sources; then
it moves and collates the data into Windows Azure storage. The agent can
even handle full memory dumps in the case of an unhandled exception in
one of your processes.
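Crash dump collection is off by default, so you have to ask the agent for it. A minimal sketch, assuming the SDK's CrashDumps helper class, looks like this:

```csharp
using Microsoft.WindowsAzure.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // Ask the agent to capture full memory dumps when a process in
        // this role crashes; pass false for mini dumps instead. The dumps
        // are copied to BLOB storage by the diagnostic agent.
        CrashDumps.EnableCollection(true);

        return base.OnStart();
    }
}
```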
You must configure the agent
to have access to a cloud storage account. The agent will place all your
data in this account. Depending on the source of the data, it’ll either
place the information in BLOB storage (if the source is a traditional
log file), or it’ll put the information in a table.
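In code, that wiring usually amounts to one line in the role's startup. The sketch below assumes the DiagnosticMonitor API and the "DiagnosticsConnectionString" setting name that the default Visual Studio templates use; your setting name may differ.

```csharp
using Microsoft.WindowsAzure.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // The setting named here must resolve (in ServiceConfiguration.cscfg)
        // to the storage account where the agent will write its blobs and
        // tables. Development storage works for local testing.
        DiagnosticMonitor.Start("DiagnosticsConnectionString");

        return base.OnStart();
    }
}
```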
Some information is stored
in a table because of the nature of the data collection activity.
Consider when you’re collecting data from Performance Monitor. This data
is usually stored in a special file with the extension .blg.
Although this file could be created and stored in BLOB storage, you
would have the hurdle of merging several of these files to make any
sense of the data (and the information isn’t easily viewed in Notepad).
You generally want to query that data. For example, you might want to
find out what the CPU and memory pressure on the server was at a given
time, such as when a particular request failed to process.
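As a sketch of how you might collect that kind of data, assuming the SDK's PerformanceCounterConfiguration class and a couple of standard Windows counter names:

```csharp
using System;
using Microsoft.WindowsAzure.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        DiagnosticMonitorConfiguration config =
            DiagnosticMonitor.GetDefaultInitialConfiguration();

        // Sample CPU and available memory every five seconds...
        config.PerformanceCounters.DataSources.Add(new PerformanceCounterConfiguration
        {
            CounterSpecifier = @"\Processor(_Total)\% Processor Time",
            SampleRate = TimeSpan.FromSeconds(5)
        });
        config.PerformanceCounters.DataSources.Add(new PerformanceCounterConfiguration
        {
            CounterSpecifier = @"\Memory\Available MBytes",
            SampleRate = TimeSpan.FromSeconds(5)
        });

        // ...and push the samples to table storage once a minute, where
        // they can be queried by time range alongside your other logs.
        config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);

        DiagnosticMonitor.Start("DiagnosticsConnectionString", config);
        return base.OnStart();
    }
}
```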
Table 1
shows what the most common sources of diagnostic information are, and
where the agent stores the data after it’s collected. We’ll discuss how
to configure the sources, logs, and the (tantalizingly named) arbitrary
files in later sections.
Table 1. Diagnostic data sources
Data source | Default | Destination | Configuration
---|---|---|---
Arbitrary files | Disabled | BLOB | DirectoryConfiguration class
Crash dumps | Disabled | BLOB | CrashDumps class
Trace logs | Enabled | Azure table | web.config trace listener
Diagnostic infrastructure logs | Enabled | Azure table | web.config trace listener
IIS failed request logs | Disabled | BLOB | web.config traceFailedRequests
IIS logs | Enabled | BLOB | web.config trace listener
Performance counters | Disabled | Azure table | PerformanceCounterConfiguration class
Windows event logs | Disabled | Azure table | WindowsEventLogsBufferConfiguration class
The
agent doesn’t just take the files and upload them to storage. The agent
can also configure the underlying sources to meet your needs. You can
use the agent to start collecting performance data, and then turn the
source off when you don’t need it anymore. You do all this through
configuration.
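For instance, to pick up the "arbitrary files" from table 1, here's a sketch assuming the SDK's DirectoryConfiguration class; the container name and log folder are made up for illustration.

```csharp
using System;
using System.IO;
using Microsoft.WindowsAzure.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        DiagnosticMonitorConfiguration config =
            DiagnosticMonitor.GetDefaultInitialConfiguration();

        // Watch a folder of our own log files and copy anything written
        // there into the named BLOB container.
        config.Directories.DataSources.Add(new DirectoryConfiguration
        {
            Container = "wad-custom-logs",   // hypothetical container name
            Path = Path.Combine(Environment.GetEnvironmentVariable("TEMP"), "MyAppLogs"),
            DirectoryQuotaInMB = 128
        });
        config.Directories.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);

        DiagnosticMonitor.Start("DiagnosticsConnectionString", config);
        return base.OnStart();
    }
}
```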
2.2. It’s not just for diagnostics
We’ve been focusing pretty
heavily on the debugging or diagnostic nature of the Windows Azure
Diagnostics platform. Diagnostics is the primary goal of the platform,
but you should think of it as a pump of information about what your
application is doing. Now that you no longer have to manage
infrastructure, you can focus your attention on managing the application
much more than you have in the past.
Consider some of
the business possibilities you might need to provide for, and think about how the diagnostic tools can
make some of these scenarios possible.
There are the obvious
scenarios of troubleshooting performance and finding out how to tune the
system. The common process is to drive a load against the system and
monitor all of its characteristics to see how it
responds. This is a good way to find the limits of your code, and to
perform A/B tests on your changes. During an A/B test, you test two
possible options to see which leads to the better outcome.
Other scenarios aren’t
technical in nature at all. Perhaps your system is a multitenant system
and you need to find out how much work each customer does. In a medical
imaging system, you’d want to know how many images are being analyzed
and charge a flat fee per image. You could use the diagnostic system to
log an event each time an image is analyzed, and then once a day move those events to Azure storage to feed into your billing system.
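A sketch of what that logging call might look like; the event format, category name, and ImageAnalyzer class are invented for illustration.

```csharp
using System.Diagnostics;

public class ImageAnalyzer
{
    public void Analyze(string tenantId, string imageId)
    {
        // ... run the analysis ...

        // Record a billable event through the same trace pipeline the
        // diagnostic agent already ships to table storage; a downstream
        // job can aggregate these rows into per-tenant invoices.
        Trace.WriteLine(
            string.Format("ImageAnalyzed;{0};{1}", tenantId, imageId),
            "Billing");
    }
}
```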
Maybe in this same
scenario you need a rock-solid audit that tells you exactly who’s
accessed each medical record so you can comply with industry and
government regulations. The diagnostic system provides a clean way to
handle these scenarios.
An even more common scenario
might be that you want an analysis of the visitors to your application
and their behaviors while they’re using your site. Some advanced e-commerce
platforms know how their customers shop. With the mountains of data
collected over the years, they can predict that 80 percent of customers
in a certain scenario will complete the purchase. Armed with this data,
they can respond to a user’s behavior and provide a way to increase the
likelihood that they’ll make a purchase. Perhaps this is a timely
invitation to a one-on-one chat with a trained customer service person
to help them through the process. The diagnostics engine can help your
application monitor the key aspects of the user and the checkout
process, providing feedback to the e-commerce system to improve
business. This is the twenty-first-century version of a salesperson in a
store asking if they can help you find anything.
To achieve all of these feats of science with the diagnostic agent, you need to learn how to configure and use it properly.