Metron - Capacity Management: April 2015

Monday 20 April 2015

Automatic reporting and alerting (10 of 10)

What do we actually need? What do we actually want to report on? How often do we want to report on it?

If we are applying threshold-based alerting to our reports, we need to ensure that the correct values are set. These values may be the utilizations or response times stated within an SLA, or they may be based on maximum resource usage. Failure to set the correct values may lead to incorrect alerts being produced, leading to unnecessary investigations, stress and panic. By including availability and response time information within your capacity reports, you improve both the accuracy of and the confidence in your forecasts, whilst providing potential SLA breach information in advance.
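As a minimal sketch of threshold-based alerting, the Python below classifies each reported value against a warning level and an SLA breach level. The `Threshold` class, the metric names and every number are hypothetical, chosen for illustration only; they are not taken from any real SLA or reporting tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical alert levels for one reported metric."""
    warning: float  # early-warning level, set below the SLA value (assumed)
    breach: float   # value stated in the SLA (assumed)


def classify(value: float, threshold: Threshold) -> str:
    """Return 'breach', 'warning' or 'ok' for a single reported value."""
    if value >= threshold.breach:
        return "breach"
    if value >= threshold.warning:
        return "warning"
    return "ok"


def build_alerts(report: dict, thresholds: dict) -> dict:
    """Classify every metric in the report that has a threshold defined."""
    return {
        metric: classify(value, thresholds[metric])
        for metric, value in report.items()
        if metric in thresholds
    }


# Illustrative values only: CPU utilisation in percent, response time in seconds.
thresholds = {
    "cpu_utilisation_pct": Threshold(warning=70.0, breach=85.0),
    "response_time_s": Threshold(warning=1.5, breach=2.0),
}
report = {"cpu_utilisation_pct": 78.0, "response_time_s": 1.2}
alerts = build_alerts(report, thresholds)
print(alerts)
```

Setting the warning level below the SLA value is what gives the early warning; if the two were equal, the first alert would already be a breach.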

When creating forecast models, whether trend models, analytical models or both, it is important to make sure that the inputs to your model are as accurate as possible, so that the predictions can be relied upon to avoid costly or unnecessary performance issues and SLA breaches.

So let's ensure we have a Brighter Outlook. It is crucial that we get the information at all levels as described and store it, typically in a centralised database, so that we have it readily available. Producing ad hoc reports on infrastructure usage and the current performance of our applications, along with implementing automated reports based specifically on our SLA thresholds, enables us to produce early warning alerts on potential breaches and take appropriate action as necessary.

Guest and Host consolidation. If you have over-provisioned systems, compare the usage of your virtual machines against their configured resources to see if there is scope to consolidate your guests onto a smaller number of ESX hosts. You may also be able to run multiple applications within the same VM rather than have many VMs each running a single application.
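To make the consolidation check concrete, here is a rough sketch that estimates how many hosts the measured peak demand of the guests would need. The function name, the 75% target utilisation and the figures are all assumptions for illustration; a real exercise would also allow for HA failover headroom and per-guest placement, which this deliberately ignores.

```python
import math


def hosts_needed(peak_cpu_mhz, peak_mem_gb, host_cpu_mhz, host_mem_gb,
                 target_utilisation=0.75):
    """Estimate hosts required so neither CPU nor memory exceeds the target.

    All inputs are aggregate figures: the combined peak demand of the guests
    and the capacity of one (identical) host. The target is an assumed
    planning ceiling, not a VMware setting.
    """
    usable_cpu = host_cpu_mhz * target_utilisation
    usable_mem = host_mem_gb * target_utilisation
    by_cpu = math.ceil(peak_cpu_mhz / usable_cpu)
    by_mem = math.ceil(peak_mem_gb / usable_mem)
    # The scarcer resource decides how many hosts are needed.
    return max(by_cpu, by_mem)


# Hypothetical cluster: guests peak at 30,000 MHz and 300 GB in total,
# each host offers 20,000 MHz and 128 GB.
required = hosts_needed(30000, 300, 20000, 128)
print(required)
```

In this made-up case memory, not CPU, sets the host count, which is a common reason to look at both resources before deciding how far to consolidate.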

Plan ahead. Produce trend reports and analytical models to predict the likely impact on your infrastructure and on the performance of the applications running within it. Then make the necessary recommendations on upgrades or configuration changes that prevent you encountering any SLA breaches and associated impacts on services. All of this information should be included within a Service Capacity Plan, allowing us to make decisions on whether we need to upgrade, whether we need to standardise our hardware and what the associated costs are likely to be, so budgets can be accurately planned.
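A trend report of this kind can be sketched as a straight-line fit over past utilisation, extended forward to see when a threshold would be crossed. This is only one simple form of trending, assuming equally spaced samples and linear growth; the function name and sample figures are hypothetical.

```python
import math


def periods_until_threshold(history, threshold):
    """Fit a least-squares line to equally spaced samples and return how many
    periods after the last sample the trend reaches the threshold.

    Returns None when the trend is flat or falling, and 0 when the fitted
    line is already at or above the threshold.
    """
    n = len(history)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope  # x position where line hits threshold
    return max(0, math.ceil(crossing - (n - 1)))


# Hypothetical monthly peak CPU utilisation (%) against a 70% planning threshold.
months_left = periods_until_threshold([50, 52, 54, 56], 70)
print(months_left)
```

A result like this is the input to the upgrade recommendation: it says roughly how long you have, not what to buy.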

It can also help us decide on whether a more powerful and expensive server is actually required when maybe a less expensive, slightly less powerful server, will do just as good a job. Creating analytical models gives you the information you require to make those decisions.

Regular consultation and information sharing with Application teams and other Service Delivery teams will assist you in making the best decisions going forward and allow you to explain why you have made the stated recommendations.

I'll leave you with a quick look at monetary savings on Capital Expenditure (CAPEX) and Operational Expenditure (OPEX).

CAPEX

All the way through this series I have mentioned being able to possibly reduce the numbers of servers required to host all of your virtual machines. This enables you to possibly make savings on the number of licenses you require and on the actual hardware costs. As you start to reduce your physical estate and consolidate ESX hosts, you start to look at the possibility of reducing the size of your datacenter.

OPEX

Make savings by reducing the amount spent on maintenance and support as you reduce the numbers of servers required in your infrastructure hosting your applications and services. By performing application sizing, we can assist in accurately provisioning resource requirements and help eliminate any potential overspend by over provisioning. Further to this, we can actually reduce the physical server count leading to a reduction in the size of a datacenter.

The savings from this approach include power usage (servers, cooling and lighting) and a reduction in emissions, achieved through reduced power consumption, through fewer components as we consolidate servers and finally through usage, by optimally sizing and provisioning.

In my series I've covered what Cloud computing is and how it is underpinned by Virtualisation, the benefits it can provide, things we should be aware of and how putting in place effective Capacity Management can save you time and money.

If you'd like further information on Capacity Management there is a selection of papers available to download at http://www.metron-athene.com/_downloads/index.html and don't forget to register for my webinar 'Understanding VMware Capacity' http://www.metron-athene.com/services/training/webinars/index.html

Jamie Baker
Principal Consultant

Friday 17 April 2015

Old Habits & Potential Risks (9 of 10)

Gartner stated that "through to 2015, more than 70% of private Cloud implementations will fail to deliver operational energy and environmental efficiencies"

(Extract from “Does Cloud Computing Have a ‘Green’ Lining?” – Gartner Research 2010)

Why? Well, Gartner are referring to the need to implement an organizational structure with well-defined roles and responsibilities to manage effective governance and implementation of services, well-defined and standardized processes (such as change management and capacity management) and a well-defined and standardized IT environment. Cloud-computing technology (such as service governor, automation technology and platforms) itself is evolving, and thus, the efficiency promises will not be delivered immediately.

"Old habits" include departmental infrastructure or application hugging. Within an organisation what you tend to find most often is that departments develop a "silo mentality", i.e. mine is mine. They become afraid to share infrastructure through a lack of trust and confidence. Because they fear an impact on their own services, they adopt the "we need to protect ourselves from everybody else" approach.

The problem with this attitude is that it can lead us back to the "just in case" capacity mind set, where you end up always over provisioning systems. By using effective capacity management techniques, combined with the functionality that vSphere provides, you can get that right, sizing and provisioning more accurately and avoiding over provisioning. Performing application sizing at the first level will help you get the most efficient use out of your infrastructure and ultimately lead to achieving that equilibrium between service and cost.

So how can we "Avoid an Internal Storm" and "Ensure a Brighter Outlook"?

Firstly, ask yourself this question - What is different with Cloud Computing, in terms of Capacity Management, compared with how we've always done it?

Typically, we still need to apply the same capacity management principles such as getting the necessary information at the Business, Service and Component levels. But we have to take into consideration the likelihood that the Cloud is underpinned by Virtualization and more specifically the use of resource pools. Therefore in this case, we need to be aware of what Limits, Shares and Reservations are set and what VMs are running in which pools in our cluster(s). Earlier we displayed a chart of a priority guest resource pool and the CPU usage of the guests within it. We need to identify limits. How much are our virtual machines allowed to use? Do they have a ceiling? What is the limit? When we talk about utilization is it the utilization of a specific CPU limit for example?

Do they have higher priority shares over others? What about any guarantees? Are they guaranteed resources because they may be high priority guests? And are there any CPU affinities assigned?
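The question of what utilization is measured against can be shown with a small sketch: the same usage figure gives a very different percentage depending on whether a limit is set. The function and parameter names here are hypothetical, and using `None` to mean "unlimited" is an assumption of this example rather than a vSphere convention.

```python
def utilisation_pct(usage_mhz, limit_mhz, host_capacity_mhz):
    """Percentage utilisation measured against the effective ceiling.

    If a CPU limit is set, the ceiling is that limit (capped at the host's
    capacity); if the limit is None (treated here as unlimited), the ceiling
    is the host capacity itself.
    """
    if limit_mhz is None:
        ceiling = host_capacity_mhz
    else:
        ceiling = min(limit_mhz, host_capacity_mhz)
    return 100.0 * usage_mhz / ceiling


# Hypothetical guest using 1,500 MHz on a 20,000 MHz host.
against_limit = utilisation_pct(1500, 2000, 20000)   # limited to 2,000 MHz
against_host = utilisation_pct(1500, None, 20000)    # no limit set
print(against_limit, against_host)
```

A guest that looks almost idle against the host can be three quarters of the way to its ceiling once the limit is taken into account, which is why the limit needs to be captured alongside the usage data.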

Information we need.

• Business - how many users is the service supporting? What resources are required? Are we likely to experience growth in this service? If so, by how much and when? Monthly, Quarterly or Annually?

• Service – have the Service Level Requirements (SLR) been defined and / or have they been agreed (SLA) and what do they entail? What, if any, are the penalties for not meeting the terms stated in the SLA?

• Component – gather and store the configuration, performance and application data from the systems and applications hosting the service in a centralized database that can be readily and easily accessed to provide the key evidence on whether services are and will in future satisfy the requirements as stated in the SLAs.

ITIL v3 Capacity Management describes the kind of information you should be gathering at these levels. Having this information enables us to get the full picture of what is currently happening and allows us to forecast and plan ahead based on the business growth plans for the future.

Once we have all of the required information, we can implement automatic reporting and alerting and that’s what I’ll be covering in the last of my series on Monday....

Jamie Baker
Principal Consultant

Wednesday 15 April 2015

Organisations typically over provision by as much as 100% (8 of 10)

As mentioned previously organisations typically over provision by as much as 100%, so this could put a squeeze on other applications running within your infrastructure, within your internal cloud. Therefore when providing unlimited access to resources for our critical applications and setting high priority shares, it could impact on other applications.

But what impact is that likely to have? It may be insignificant but if it isn’t then we need to think about scaling out or up.

Scaling out is to scale "out" to existing systems of a similar specification. So what are the advantages of being able to do this? One of the advantages is that you may have some existing hardware lying around which has not been deployed or provisioned or some servers that are earmarked for removal that can be added rapidly to a cluster by using the golden reference host to install ESX.

Some benefits from this approach are:

· Short lead time. You don’t need to place an order and there is no need to wait for it to be built and delivered to you and then installed. There’s no need for a hardware upgrade because you’ve actually put extra resources into your cluster and then DRS can migrate virtual machines to load balance.

· Acceptable Costs. Because you are reusing hardware you already have, there is no new purchase to make.

· SLAs are met. But for how long? Are they only met for the short term? Are there enough resources and how much more do we actually need going forward as the business grows?

Some disadvantages of this approach are that the hardware currently available, or that has been decommissioned, will be older and slower than the servers currently on the market. It may not be 64-bit compatible, and it is unlikely to have the latest chip technology or the memory capacity of the latest servers. And while it satisfies our short term requirements, it may not meet our medium or long term objectives as the business grows and demand pressure on resources continues to grow.

You may also encounter some additional costs, such as the extra software licences that you’ll need. You may need to extend some hardware support because you have older systems, and you may also find that older servers have a higher power consumption, along with having to find space to host the servers.

Scaling Up.

So what benefits do we get from scaling upwards? This means we scale to the latest and greatest, normally bigger and more powerful, systems: faster CPUs, more physical memory capacity, 64-bit compatibility etc. We can also look at increasing our consolidation ratio on these more powerful systems in comparison to what we currently have in place. This helps reduce the number of systems within a data centre, leading to benefits such as smaller data centres, lower power consumption and a reduced carbon footprint, contributing to any green IT initiatives.

These newer systems may also support hyper-threading, which presents additional logical processors to vSphere so that the CPU scheduler has more execution threads on which to run your virtual CPUs.

We can make efficiency savings. After making the initial upgrade to a more powerful larger system, taking into consideration the cost of the upgrade, we may actually save over time against scaling out to older systems. It’s essential to perform a cost benefit analysis to see what savings we can make by scaling up.
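A very simple form of the cost benefit analysis above is to compare cumulative costs year by year and find when scaling up becomes the cheaper option. This sketch assumes a one-off upfront cost and a flat annual running cost for each option; the function names and all figures are made up for illustration.

```python
def cumulative_cost(upfront, annual, years):
    """Total spend after a number of years: purchase plus running costs."""
    return upfront + annual * years


def break_even_year(out_upfront, out_annual, up_upfront, up_annual,
                    horizon_years=10):
    """Return the first whole year in which scaling up has cost less in total
    than scaling out, or None if that never happens within the horizon."""
    for year in range(1, horizon_years + 1):
        scale_up = cumulative_cost(up_upfront, up_annual, year)
        scale_out = cumulative_cost(out_upfront, out_annual, year)
        if scale_up < scale_out:
            return year
    return None


# Hypothetical figures: reused older hosts cost nothing upfront but 30,000 a
# year in power, support and licences; new hosts cost 50,000 upfront and
# 10,000 a year to run.
year = break_even_year(0, 30000, 50000, 10000)
print(year)
```

If the function returns None for your own figures, the upgrade does not pay for itself within the planning horizon and scaling out may be the better choice.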

Some disadvantages are:

· Longer lead time. You have to wait for the systems to be built and delivered / configured and installed after making an upgrade purchase.

· Space. You may have to find space for your new systems and you may have to decommission some old servers within your data centre, which again extends the lead time. The associated costs would be the cost of purchasing new equipment and any associated licensing costs that you are likely to incur by scaling upwards.

One example of a scaling up benefit comes from eBay. In a keynote speech at the Gartner Data centre summit in Las Vegas, Mazen Rawashdeh, eBay's VP of Technology Operations, talked about the benefits of leasing servers. With leasing you get a complete package for an annual fee, and you trade in your existing leased servers for the latest available technology on a yearly cycle. This leads to benefits such as being greener, because more power-efficient servers mean lower CO2 emissions, and reducing our physical size, because we can have confidence in being able to host more VMs per ESX server and therefore increase the conversion of physical systems to virtual machines.

Other cost savings, such as lower power usage, less data centre space through a reduced physical footprint and a reduction in the number of software licenses, were among the key benefits Mazen gave for eBay's shift to leasing.

I’ve covered the benefits of scaling up and scaling out, but what about scaling in?

Scaling In

What is Scaling In? It is the process of using the spare capacity within our infrastructure. We previously mentioned ways of relieving the pressure on your virtual environment if you are experiencing capacity / performance issues. But what if your environment is performing as expected and you have over provisioned?

Do you have spare capacity within your infrastructure? Consider consolidating workloads (guests) onto fewer hosts to “sweat the assets” more. Also consider using periods of inactivity to migrate workloads between hosts. An example would be to run batch programs overnight on hosts running working day applications. By performing these actions you may be able to make use of the VMware Distributed Power Management (DPM) feature, which powers off unused hosts during these periods. This in turn will reduce power consumption and therefore energy costs. It may also lead to a reduction in the number of ESX hosts required and thus in the number of ESX host licenses needed. On Friday I'll be taking a look at old habits and potential risks....

Jamie Baker
Principal Consultant

Monday 13 April 2015

VMware vSphere - protect your critical applications (7 of 10)

Protect your loved ones.

Your loved ones are typically your critical applications. These are the applications that really need to be available 24/7 and whose associated service levels must be met. Typically strict financial penalties are involved if such SLAs are breached and events such as loss of service / loss of business are experienced, e.g. if you run an online trading application and it goes down, preventing users or customers from accessing your website and purchasing your products or services.

What are the implications? How long has it been down? Why is it down? Is it because you haven’t provisioned enough capacity and you are experiencing poor performance? How much money will you be losing during the time that application is down?

Your critical applications or "loved ones" should be given the highest priority. This is achieved through resource pool shares. Remember that shares only apply when there is resource contention on the host and highest priority pools will normally have no limits set. You may also use CPU Affinity to gain some extra performance benefit.

Reservations (guarantees) are most often set for highest priority pools. This ensures that all of the virtual machines running within that pool across your cluster will be guaranteed the CPU and Memory configured over any other virtual machines that are running within that cluster.

VMware also provide high availability (HA) clusters. If you lose an ESX host, through some kind of hardware or software failure, the virtual machines on that host will be powered up on another member of the cluster. Moreover, to mitigate some hardware failures even further and to protect your critical applications, VMware provide Fault Tolerance (FT) which creates a secondary virtual machine from the primary. This secondary virtual machine runs on a different ESX host within the cluster. If your primary VM is lost through an ESX host failure your secondary VM running in step with your primary takes over immediately and becomes the primary, providing no loss of service.

A couple of key points: FT only protects against hardware failure, so if you have a blue screen on one VM you’re going to have a blue screen on the other. Also, in vSphere 4.1, any FT VM can only have one virtual CPU configured.

Any trade-offs? Well typically in an organisation you’ll have this "just in case capacity management" mind set and this can impact on some other services.

What do we mean by "just in case"? Well, what we actually mean is that you overprovision, i.e. just in case we get large peaks and need the extra capacity. I’ll be looking at how this mentality affects the business on Wednesday.
Don't forget to register for my next webinar 'Understanding VMware Capacity' http://www.metron-athene.com/services/training/webinars/index.html

Jamie Baker
Principal Consultant

Friday 10 April 2015

VMware vSphere - Get the best resource usage balance across the cluster members (6 of 10)

Within vSphere we have the ability to use Affinities. By creating VM to VM and/or VM to Host affinities or anti-affinities using DRS Groups and Rules, we can keep specific virtual machines apart, for example if they have similar resource usage patterns, or together if they complement each other. The idea is to get the best resource usage balance across the cluster members.

CPU Affinity can also be set for performance gain, whereby some virtual machines are tied to certain host CPUs. The problem with this approach is that it affects vMotion: migrations will fail if CPU Affinity is set. You may also benefit from using CPU Affinity when it comes to software licensing.

When controlling resources within a cloud environment, the providers will use the Resource Pooling functionality to perform this. This ensures that shared resources which are available to the public domain come from a resource pool which is limited either by CPU or Memory or both.

Here is an example of a Priority Guests resource pool for CPU based on a single ESX host.

The top red line is the pool's limit, in CPU MHz. At the bottom is the stacked guest CPU usage in MHz. A couple of virtual machines are assigned to this priority guest resource pool, and the pool's CPU usage line follows the combined usage of those stacked guests.

Now if resource pools are used for organisational purposes but are not limited, or if you have expandable reservation set on your child resource pools only, then you will need to monitor your root resource pool. The root resource pool is the total host capacity.

What about your critical applications? I’ll talk about that on Monday.

Jamie Baker
Principal Consultant

Wednesday 8 April 2015

VMware vSphere - How do we get our On Demand Sizing just right? (5 of 10)

Firstly, we want to make the most efficient use of the cloud that we can whilst wholly satisfying our SLAs. We need to control resource usage by employing shares, limits and, if needed, reservations at either the Resource Pool level or the Virtual Machine level. Employing a chargeback model can also control usage and keep the provisioning of unnecessary virtual machines (VM Sprawl) in check.

Continuously monitor and tune the virtual infrastructure. This will bring the following benefits:

· Freeing up unused resources

· Ensuring that capacity is not under or over provisioned

· Identifying workloads for consolidation – including Idle Virtual Machines

· Load balancing Virtual Machines across hosts

· Identifying which ESX hosts can be powered down and when

Ultimately we are striving for the equilibrium between Cost and Service Impact. If we attempt to reduce costs too far it can have a significant impact on service. Moreover, if we overspend, the impact on the service may be mitigated but we do not get value for money. If you have the correct processes and tools in place, getting the right balance should be easily achievable.

Automation and Control

VMware provides the ability to provision VMs rapidly by using templates. These templates can be created for vanilla, domain-approved operating system builds. Further to this, virtual machines can be migrated between hosts either manually or automatically by the Distributed Resource Scheduler (DRS), based on rules set within the DRS cluster. DRS is enabled at the cluster level and allows for automated, partially automated or manual load balancing. Its internal algorithms determine whether any one cluster member (ESX host) is struggling to meet the demands of its guests, and DRS can then either migrate guests or recommend migrations to other members of the cluster.

At the ESX host level, rapid elasticity can be achieved if required by referencing a Golden Host when configuring a new server. This reduces setup time considerably, and the new host can be on the network and operational within an hour.

We can control what resources a virtual machine has access to, whether it be at a Cluster, Host or Virtual Machine level. Priorities for CPU and Memory access are controlled by shares, which only come into force if there is contention on the parent ESX host. CPU and Memory limits can be set to control the maximum amount a virtual machine can potentially have access to, preventing any one virtual machine from hogging resources.
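The effect of shares under contention can be illustrated with a simple proportional split. This is a deliberately simplified sketch, not the actual ESX scheduler: it ignores limits, reservations and each guest's real demand, and the pool names and share values are illustrative assumptions.

```python
def split_by_shares(contended_mhz, shares):
    """Divide contended CPU capacity in proportion to each pool's shares.

    `shares` maps a pool or VM name to its share value. Only the ratio
    between the values matters, not their absolute size.
    """
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("total shares must be positive")
    return {name: contended_mhz * value / total for name, value in shares.items()}


# Hypothetical pools with shares in a 4:2:1 ratio competing for 7,000 MHz.
entitlement = split_by_shares(7000, {"critical": 4000, "standard": 2000, "test": 1000})
print(entitlement)
```

When there is no contention none of this applies: every guest simply gets what it asks for, up to any limit that has been set.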

Use Resource Pools to structure your virtual infrastructure around either a service catalogue, department or geographical location.

On Friday I’ll take you through using Affinities to get the best resource usage balance across the cluster members.

Jamie Baker
Principal Consultant

Monday 6 April 2015

Capacity on Demand (4 of 10)

Today we’ll take a look at some of the implications around sizing our cloud resources, more specifically systems being Undersized and Oversized and what we should be doing to get it right.

As mentioned in the Cloud Computing introduction, in a cloud computing environment you should be able to self-provision computing capabilities. One of the primary steps to take before deployment is to perform Application Sizing. This process enables you to accurately provision the correct amount of resource required to support the application(s), avoiding any potential performance issues and any unnecessary costs. Without this you could run the risk of being undersized, which can lead to poor performance and Service Level Agreements (SLA) being affected.

Cloud providers will implement limits to restrict usage. These can of course be increased for a cost. Even if the initial resource usage costs are low, there could be higher costs from possible SLA penalties and / or loss of business or productivity due to poor application performance, and this in turn can damage the reputation of the business. Cloud bursting may be required if you are undersized and need some additional capacity very rapidly, and this approach can incur higher usage premiums.

Over sizing or overprovisioning. The key question here is whether you have provisioned too much capacity. You may have over provisioned to wholly satisfy an SLA due to strict penalties, but it could be costing you far more than is actually required.

Gartner state "that most organizations overprovision the resources in their IT environment by more than 100% with an average utilization of physical environments of only 10% and virtual environments of 35%" (extracted from “Does Cloud Computing Have a ‘Green’ Lining?” – Gartner Research 2010).

This is primarily to cope with peak workload demands, which typically do not occur very frequently. Save costs on resource usage by provisioning the optimum level to satisfy all criteria, i.e. SLAs. For example, software licensed per CPU costs more than it should if you have provisioned more CPUs than necessary.
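The per-CPU licensing point is simple arithmetic, sketched below. The function name, the CPU counts and the licence price are hypothetical; real licensing terms vary by vendor and may count cores, sockets or vCPUs.

```python
def licence_overspend(provisioned_cpus, required_cpus, cost_per_cpu):
    """Licence cost attributable to CPUs provisioned beyond what sizing says
    is required. Returns 0 when nothing is over provisioned."""
    excess_cpus = max(0, provisioned_cpus - required_cpus)
    return excess_cpus * cost_per_cpu


# Hypothetical case: 8 CPUs provisioned, sizing shows 4 are needed,
# and each CPU licence costs 1,500.
overspend = licence_overspend(8, 4, 1500)
print(overspend)
```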

How does over capacity affect other services in the cloud? Are other services restricted because capacity has been over provisioned? Could this lead to poor performance?

Single-threaded applications on a vSMP VM do not gain any performance benefit. Moreover, the extra resources are not only wasted, they also affect the performance of the underlying ESX host, because it has to manage and provide resources to the associated worlds (processes).

A virtual machine is a group of worlds: each vCPU assigned has a world, plus one for the Mouse, Keyboard and Screen (MKS) and one for the Virtual Machine Monitor (VMM).
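Taking the description above at face value (one world per vCPU, plus one each for MKS and VMM), the scheduling overhead of extra vCPUs can be counted with a one-line function. The function name is hypothetical and the world count follows only what is stated here; the exact set of worlds may differ between ESX versions.

```python
def worlds_for_vm(vcpus):
    """Number of worlds for one VM as described in the text: one per vCPU,
    plus one for MKS and one for the VMM."""
    if vcpus < 1:
        raise ValueError("a VM needs at least one vCPU")
    return vcpus + 2


# A single-threaded application on 1 vCPU versus the same application
# over provisioned with 4 vCPUs.
right_sized = worlds_for_vm(1)
over_sized = worlds_for_vm(4)
print(right_sized, over_sized)
```

The application does no more work in the second case, but the host has twice as many worlds to manage for that VM.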

Over provisioning leads to VM Sprawl.

Memory resources can be shared across virtual machines, however over provisioning puts further pressure on ESX to manage allocation of resources to virtual machines.

Our objective therefore, is to get our On Demand Sizing just right and I’ll be looking at how this can be achieved on Wednesday. Sign up for our Community where you can download free white papers and webinars on-demand http://www.metron-athene.com/_downloads/index.html

Jamie Baker
Principal Consultant

Friday 3 April 2015

Virtualization underpins Cloud Computing, through resource pooling and rapid elasticity (3 of 10)

As mentioned previously, to be considered a cloud a service must be On Demand and provide Resource Pooling and Rapid Elasticity. Whilst Virtualization in general can provide the majority of these features, the difference is that a private cloud using internet-based technologies can actually provide the mechanism for end users to self-provision virtual systems. Think of this as the ability to self-check in at an airport or print your boarding pass from home.
You log in through a browser with a reference code, plug in some personal details and print; or you walk up to a screen, enter a few details and voila, out pops the boarding card. Off you go (business or hand luggage only), otherwise it's off to bag drop before you go, virtually eliminating the need to queue up at a check-in desk to get your boarding card.

But it's more than just virtualizing systems and hosting them internally, it’s about giving control to the end user.

Now of course, some administrators may wince after reading this, but there are ways in which self-provisioned systems can be controlled by using the virtualization technology that underpins cloud technology. Using resource pools within your private cloud gives you the ability to control resources via limits and shares and / or reservations, so you can specify the amount of resources that users are allowed to provision. These control settings can also be changed very quickly to increase or decrease the amount of resources that are available within that pool. This helps prevent over specification and VM sprawl.

Another way to control resource deployment and / or usage is to internally charge. Users and their departments will soon rein back on creating over provisioned systems if they are charged on their system configuration rather than just on the usage itself.
It can be quite difficult to implement some form of internal charging. What do you charge with? Maybe by utilizing project codes or some other internal monetary system?
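One way to picture charging on configuration rather than usage is a flat rate per configured vCPU and per GB of configured memory. The rates, units and function name below are entirely hypothetical; they stand in for whatever project codes or internal currency an organisation chooses.

```python
# Hypothetical monthly internal rates, in whatever unit the organisation uses.
RATE_PER_VCPU = 20.0
RATE_PER_GB_RAM = 5.0


def configured_charge(vcpus, ram_gb):
    """Monthly charge based on what was provisioned, not on what was used."""
    return vcpus * RATE_PER_VCPU + ram_gb * RATE_PER_GB_RAM


# The same lightly used application, provisioned two different ways.
over_provisioned = configured_charge(4, 16)
right_sized = configured_charge(2, 8)
print(over_provisioned, right_sized)
```

Because the charge follows the configuration, the department pays double for the larger VM even if actual usage is identical, which is the incentive to size correctly.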
On Monday I’ll be looking at Capacity on demand and how you can get your sizing right.

Jamie Baker
Principal Consultant

Wednesday 1 April 2015

When we refer to a "cloud" what is it that we actually mean? (2 of 10)

We know that the cloud provides computing resources for customers to use and these resources are then charged at a monetary value accordingly.
Cloud providers deliver and manage services by using applications such as VMware's vCloud Director. These cloud applications provide benefits such as:

· Increased business agility by empowering users to deploy pre-configured services or custom-built services with the click of a button

· Maintaining security and control over a multi-tenant environment with policy-based user controls and security technologies

· Reducing costs by efficiently delivering resources to internal organizations as virtual datacenters to increase consolidation and simplify management

So, to be considered a Cloud it must be:

· On Demand - cloud-aware applications can in most cases automatically self-provision resources for themselves and release them back as necessary.

· Resource Pooling - freeing up unused resources provides the ability to move these resources between different consumers’ workloads, thus quickly and effectively satisfying demand.

· Rapid Elasticity - rapid means within seconds to minutes (not in days). In a Virtual Cloud Environment, to Scale Out or In would also cover the ability to provision new ESX hosts, rather than just scale to new virtual machines.

Virtualization technology encompasses these three requirements and underpins Cloud Computing.

Many businesses are now using these advantages to move away from overinvestment in rigid, legacy-based technology and adopt cloud-based services: highly scalable, technology-enabled computing consumed over a network on an as-needed basis.

Cloud Types

Cloud types provide the “computing as a service” model to help reduce IT costs and make your technology environment a responsive, agile, service-based system. These Cloud "types" or Service Delivery Models are commonly known as:

· Software as a Service (SaaS) - External Service Providers (ESP) such as Amazon can provide access to a specific software application, e.g. Business Mail or Service Desk, and charge as necessary. You would access this application via a "thin client", typically a web browser.

· Platform as a Service (PaaS) - This enables you to deploy supported applications into the Cloud with some degree of control over what environment settings are required. You do not have any control over the resources provided to host these applications.

· Infrastructure as a Service (IaaS) - This provides the ability to provision your own resources and you have full control over what operating systems, environment settings and applications are deployed. The cloud provider still retains full management and control over the infrastructure.

The Cloud or "Clouds" as we know them are categorized by location and ownership, typically referred to as Public / Private or Internal or External clouds. In addition there are Community and Hybrid clouds whereby a Community share the cloud and are bound by a common concern or interest and Hybrid where you have a composition of two or more Private or Public clouds. This allows for data and application portability between clouds to occur. VMware introduced the vApps functionality specifically for this.

Most organisations will tend to lean more towards having exclusive "Internal" cloud services and possibly "Hybrid" cloud services (a mixture of Public and Private clouds). You may find that critical or data sensitive applications are always kept within the organisation and in some cases Testing and Development applications are ported to the Public Cloud where it is more cost effective to do so. There may also be the use of SaaS within the organisation which would be external to the business.

Just to reiterate, Virtualization underpins Cloud Computing, through resource pooling and rapid elasticity and to avoid any confusion, I will be explaining the primary difference of say a Private or Internal Cloud over Virtualization on Friday.
In the meantime register for my next webinar 'Understanding VMware Capacity' http://www.metron-athene.com/services/training/webinars/index.html

Jamie Baker
Principal Consultant