Compact PCI hardware now provides the
capability to switch control between redundant host system
processors. To fully take advantage of the hardware capability,
operating systems, device drivers, and applications software must
be configured to handle the implications of a host processor
switch.
The ability to replace I/O boards in a system without shutting
off power (hot swap) provides a tremendous boost to the
maintainability and availability of a system. It simplifies the
process of replacement for failed boards, minimizes system down
time, and eliminates the need to reboot a system after board
replacement. Extending hot swappability to system boards and
providing redundant system boards can provide the further benefit
of allowing the system to be tolerant of both system software and
system board failures. If the active system board fails, the
replacement board simply gets swapped in, and the system continues
operation with minimal interruption.
Compact PCI systems already support the hot swapping of
non-system cards, power supplies, and peripheral components. Both
the hardware needs and the board software algorithms required for
the hot swapping of non-system slots are well described by the
PICMG Compact PCI Hot Swap Extensions standard (www.picmg.org). To allow
system processor slots to hot swap, several facilities must work in
concert. For one, the hardware must allow the Compact PCI bus
domains to have their control transferred from one processor to
another without disrupting the bus operation. The software must
also allow the transfer, and must do so at all levels of the
system, from the system controller through each of the system I/O
boards.
The hardware requirements have been solved. System chassis such
as the Motorola Computer Group CPX8216 and CPX8221 have the
hardware necessary to allow this bus takeover. However,
successfully performing a domain takeover requires some adjustment
to the system software.
Controlling the Bus
Before swapping host controllers, you must first either halt
activity on the system bus or ensure that the post-swap activity
will not cause failure of the new host. You can halt bus activity
by changing the functional configuration of boards in the system or
by using slot control signals as defined by the High-Availability
Hot Swap Standard. It is important to understand the effect of
these on a board in order to apply them properly.
Adjusting each function's configuration space is one way to stop
bus activity. Whether a board presents a bridge or a device as the
single PCI load in the slot, you have control over that function's
PCI mastering and target response capabilities. Specifically, you
can use the master enable bit in the command register to disable
origination of PCI bus cycles by a slot. Disabling a function's bus
mastering capability, however, may result in overruns or underruns
for that function. System software must therefore account for those
possibilities anytime a function or bus has traffic suspended for
longer than a few microseconds.
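As a minimal sketch of the command-register approach, the following Python model treats a function's configuration header as a byte array and clears or sets the master enable bit. The register offset (0x04) and bit positions follow the PCI specification; the byte-array stand-in and the helper names are assumptions made for illustration, since real access would go through the platform's configuration-space accessors.

```python
import struct

PCI_COMMAND_OFFSET = 0x04      # 16-bit command register in the PCI config header
PCI_COMMAND_IO = 0x0001        # bit 0: respond to I/O space accesses
PCI_COMMAND_MEMORY = 0x0002    # bit 1: respond to memory space accesses
PCI_COMMAND_MASTER = 0x0004    # bit 2: bus master enable


def read_command(cfg):
    """Return the 16-bit little-endian command register from a config header."""
    return struct.unpack_from("<H", cfg, PCI_COMMAND_OFFSET)[0]


def set_bus_master(cfg, enable):
    """Set or clear the master enable bit; return the previous command value."""
    old = read_command(cfg)
    new = (old | PCI_COMMAND_MASTER) if enable else (old & ~PCI_COMMAND_MASTER)
    struct.pack_into("<H", cfg, PCI_COMMAND_OFFSET, new & 0xFFFF)
    return old


# Hypothetical function with memory decoding and bus mastering enabled.
cfg_space = bytearray(64)
struct.pack_into("<H", cfg_space, PCI_COMMAND_OFFSET,
                 PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER)

saved = set_bus_master(cfg_space, False)   # quiesce: slot can no longer originate cycles
quiesced = read_command(cfg_space)
set_bus_master(cfg_space, True)            # resume once the new host is ready
```

Only the mastering bit changes here, so the function still responds as a target while it is quiesced. The sketch does not model the overruns or underruns that a real device may suffer during the pause.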
You can also stop bus activity by using one of two slot control
signals specified by the High-Availability Hot Swap Standard: BdSel
or PCIReset. The choice of signal has significant impact on how the
system recovers following a hot swap.
BdSel
Negating the BdSel signal removes back-end power from a Compact PCI
board. This moves a board into the H0 state of the hardware
connection as described in the standard. In this state, the board
is effectively powered off. Only early power, which is used to
stabilize the connection to the PCI bus signals in the floating
condition, remains active. The time to enter the H0 state after
BdSel is negated for a slot is unspecified; it is determined by the
hardware implementation of the slot payload.
Negating BdSel to a slot has the disadvantage of requiring that
the board go through a power-up sequence prior to returning to
service.
PCIReset
Asserting the PCIReset signal to a slot causes the PCI interface
for that slot to reset and float its electrical connections for the
duration of the reset. PCIReset will propagate onto the board's PCI
bus in accordance with the PCI specification, and may reset the
entire board or only the PCI bus, depending on hardware
implementation. The time from signal assertion until the PCI
interface is reset and floating is not specified and will be
determined by the hardware implementation.
Negating the reset allows the board to progress to the H2/S0
state. When the new host releases the board from reset, the normal
PICMG hot swap enumeration process begins. This process allows a
device driver to be configured and PCI resource allocations to be
made for the I/O board.
Using PCIReset to halt bus activity allows the board to maintain
its power, so volatile memory should not be lost. Whether the board
software can recover from PCIReset without complete initialization,
however, is a matter for its software designer to determine.
Processor Hot Swap Classifications
Once the bus traffic is quieted, host processor hot swap can
proceed in several ways. Processor hot swaps can be
classified along two orthogonal criteria: the relationship of the
two processors during the switchover and the maintenance of state
within the payload and its associated driver.
There are two possibilities for processor relationship during a
hot swap: cooperative and pre-emptive. If both system processors
are capable of participating in the bus domain switchover, then the
switchover is considered a cooperative switchover. Otherwise, the
switchover is considered to be pre-emptive.
In a cooperative switchover the claiming processor notifies the
current domain owner of the intent to switch and waits for the
owner's consent before claiming the bus domain. A pre-emptive
switchover is initiated in the same manner. However, if the
claiming system processor determines that the time allotted for the
cooperative switchover has elapsed prior to receiving the current
owner's consent, the bus domain is forcibly switched to the new
processor.
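The request-then-timeout behavior described above can be sketched as follows. This is an illustrative model only: the `threading.Event` used to represent the owner's consent, the function name, and the timeout values are assumptions, not part of any standard or product interface.

```python
import threading


def claim_domain(consent, timeout_s):
    """Wait for the owner's consent; return which kind of switchover results.

    `consent` is a threading.Event that the current owner sets when it
    agrees to release the bus domain (hypothetical signalling mechanism).
    """
    if consent.wait(timeout_s):
        return "cooperative"
    # Allotted time elapsed without consent: the domain is switched forcibly.
    return "pre-emptive"


# Owner that responds promptly to the request.
responsive = threading.Event()
threading.Timer(0.01, responsive.set).start()
result_ok = claim_domain(responsive, timeout_s=1.0)

# Owner that never responds (for example, a software fault).
hung = threading.Event()
result_forced = claim_domain(hung, timeout_s=0.05)
```

Both cases start with the same request. Only the expiry of the allotted time turns the switchover into a pre-emptive one.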
Cooperative switchovers are desired where possible. Certain
types of software faults, however, can cause the current owner to
not notice a simple request. To maximize the probability that the
current owner will take notice, even in the face of software
faults, the switchover request should trigger an interrupt.
A cooperative switchover procedure will attempt to notify all
I/O functions of the switchover and allow them to halt bus activity
before proceeding. Intelligent I/O functions may be allowed to
complete checkpoint transfers. Additionally, the current domain
owner may attempt to complete state checkpointing of drivers and
other items before consenting to the takeover.
By performing the notifications and checkpointing, the
switchover procedure is most likely to preserve the system state
and halt bus activity. Preserving the system state and halting the
bus maximizes the probability of a clean takeover and the
subsequent recovery and continuation of the system function.
Certain hardware or software faults may interfere with a
cooperative takeover. For example, the checkpoint link between
processors may have failed, preventing a clean checkpoint from
being established. Another possibility is that the current owner
may have established an interrupt-inhibited environment, causing it
to fail to recognize the takeover request. Other types of software
or hardware faults may have similar effects. The result is a
pre-emptive switchover.
A pre-emptive switchover is simply any switchover that did not
satisfy the conditions for a cooperative switchover. In a
pre-emptive switchover, the most recent checkpoint may be stale,
the I/O functions may not have been notified of the change, the bus
may not have halted, or any combination of the foregoing conditions
may be in effect.
Payload and Driver State
There are three levels of domain switchover related to payload and
driver state maintenance, designated cold, warm, and hot. In the
cold switchover, the I/O devices and their associated new drivers
do not maintain any state from before the switchover. In the warm
switchover, I/O devices maintain at least some state from before
the switchover and will be notified in some manner that a switch
has occurred. In the hot switchover, the I/O devices are unaware
that a switch has occurred.
Cold switchovers are accomplished by asserting PCIReset or
negating BdSel for each board. Because both signals act on
individual slots, cold vs. warm/hot strategies can be mixed on a
slot-by-slot basis. Following
the cold switchover, boards are sequenced through the normal I/O
hot swap sequences, allowing the standard enumeration procedures to
work.
Since there is no state maintained across a cold switchover,
very little needs to be done beyond the standard I/O Hot Swap
steps. Only the protocols for causing the processor to swap need be
added. Additionally, the lack of state maintenance means that a
cooperative switchover has little advantage over a pre-emptive
switchover. While applications may benefit from a cooperative
switchover, the non-system payload gains no benefit from
cooperation.
Warm switchovers are accomplished by disabling the I/O payload's
bus mastering capabilities following the bus exchange. The primary
mechanism for disabling the bus master capabilities is the PCI
configuration header command register. Additional mechanisms, such
as device CSRs, may be available on a device-dependent basis. The
primary requirement for warm switchover is that both the device and
its driver are capable of communications regarding the device's
state and usage of system resources. This communication must be
possible without the I/O device requiring bus mastership.
A communication and potential reconfiguration of PCI resources
takes place before the new driver permits the payload to again
perform bus master operation. The mastership hiatus permits any
necessary PCI reconfiguration to occur. Resources such as bus
numbers, PCI memory and I/O space allocations, and DMA buffer
allocations are done anew by the new device driver.
Device-to-driver communications protocols can be resynchronized,
and then bus mastership capabilities can be re-enabled.
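The ordering of the warm switchover steps can be sketched with a small model. The `Payload` class, its fields, and the example addresses are hypothetical; a real driver would perform each step through configuration-space and device-specific accesses.

```python
from dataclasses import dataclass, field


@dataclass
class Payload:
    """Hypothetical model of an intelligent I/O board."""
    bus_master: bool = True
    bar_base: int = 0x8000_0000
    dma_base: int = 0x0100_0000
    in_sync: bool = True
    log: list = field(default_factory=list)


def warm_switchover(dev, new_bar_base, new_dma_base):
    """Run the warm switchover steps in order on the modelled device."""
    # 1. Suspend bus mastering so the device cannot originate cycles.
    dev.bus_master = False
    dev.in_sync = False
    dev.log.append("master disabled")
    # 2. The new driver allocates PCI and DMA resources anew.
    dev.bar_base = new_bar_base
    dev.dma_base = new_dma_base
    dev.log.append("resources reallocated")
    # 3. Device-to-driver protocol is resynchronized (target accesses only).
    dev.in_sync = True
    dev.log.append("protocol resynchronized")
    # 4. Only now is bus mastership re-enabled.
    dev.bus_master = True
    dev.log.append("master enabled")
    return dev


board = warm_switchover(Payload(), new_bar_base=0x9000_0000,
                        new_dma_base=0x0200_0000)
```

Steps 2 and 3 happen while the device cannot master the bus, which is why the device and driver must be able to communicate without bus mastership.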
Cooperative switchovers have an advantage over pre-emptive
switchovers in the warm switchover mode. Cooperative switchovers
allow extant device status to be checkpointed to the new system
processor and the device to halt activity prior to switchover.
Devices may thereby avoid unexpected over/underruns.
Warm switchovers have the advantage over cold switchovers of
enabling system continuation without interrupting payload states.
This is quite desirable in systems where the payload intelligence
is a large part of the system intelligence, such as call switching
or cellular applications. In these applications, the existing calls
can be maintained.
Warm switchovers maintain state with little support from the
host operating systems, since the burden of managing the switchover
falls on the device intelligence and its associated driver.
However, this is also the drawback to warm switchover. The
protocols and checkpointing required to reallocate resources and
resynchronize driver and payload may be quite complex. It is
unlikely that standard payload downloads will be capable of such
operations.
Hot switchovers are accomplished by quickly switching a domain
to an identically configured system processor. The I/O devices
then resume operation without reconfiguration. While the devices
may be notified of the switchover as an aid to recovering from
potential under/overruns, basic operation of the device payload
remains undisturbed.
In a hot switchover, cooperative switchovers have an advantage
over pre-emptive switchovers. Cooperative switchovers allow extant
device status to be checkpointed to the new system processor and
the device to halt activity prior to switchover. Devices may
thereby avoid unexpected over/underruns.
To perform a successful hot switchover, the new system processor
must maintain a resource configuration identical to that of the
original system processor. This requires careful checkpointing of
system resource allocations such as PCI bus numbers, PCI I/O and
memory space addresses, and DMA buffer physical addresses. Most
operating systems will need modification to support this form of
system processor switchover. Additionally, the system processor
device drivers must be capable of configuration and checkpointing
without access to real hardware.
The primary advantage of a hot switchover is that it may be
implemented without modification to the payload devices' downloads.
Only the drivers for the host processor require modifications.
These drivers typically implement simple backplane packet
interfaces, rather than the complex protocols of the I/O devices,
and will deal only with status, service control and encapsulated
data packets. In an environment where complex protocols acquired
from third parties run on the payload devices, and the source code
is not available, the hot switchover may be a necessity.
Processor Hot Swap System Resource Management
The two processor relationships and three driver maintenance
levels yield six possible implementations for processor hot swap,
as shown in Figure 1. Each implementation must go through a
sequence of configuring the system, making the switchover, and
reconfiguring the system. The sequences for each implementation are
given in the Figure 1 links. Following the sequence is not all
there is to implementing a successful hot swap. You may also need
to carefully manage system resources.
Figure 1: The six examples of possible domain switchover
sequences for a given system are application, device, and driver
dependent. Detection of when a switchover should be performed is
not considered in these sequences. The examples assume that the
drivers, operating systems, and payloads have the requisite
capabilities to handle each class of switchover.
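The six combinations can be enumerated directly as the product of the two classification axes. The sketch below is only a bookkeeping aid: the class names are illustrative, and the handling summaries paraphrase the descriptions given earlier in the text.

```python
from enum import Enum
from itertools import product


class Relationship(Enum):
    COOPERATIVE = "cooperative"
    PREEMPTIVE = "pre-emptive"


class StateLevel(Enum):
    COLD = "cold"
    WARM = "warm"
    HOT = "hot"


# How each level treats the payload, paraphrased from the text above.
PAYLOAD_HANDLING = {
    StateLevel.COLD: "reset or power-cycle the slot, then re-enumerate",
    StateLevel.WARM: "suspend bus mastering, reconfigure, resynchronize",
    StateLevel.HOT: "resume on identically configured host, no reconfiguration",
}

implementations = list(product(Relationship, StateLevel))

for rel, level in implementations:
    print(f"{rel.value:12s} {level.value:5s} {PAYLOAD_HANDLING[level]}")
```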
Cold and warm domain switchovers require little in the way of
special resource management. This is because they allow PCI
reconfiguration between the switchover and I/O resumption. The same
cannot be said for hot switchovers. Because the device I/O is
allowed to continue without reconfiguration, every resource related
to I/O operations must be carefully managed in a hot switchover.
These resources include, but may not be limited to, PCI bus
numbers, PCI I/O space, PCI memory-mapped I/O space, PCI
prefetchable memory space, PCI interrupts, and DMA physical buffer
and control addresses. Additionally, device driver configuration
must be managed in the absence of physical hardware.
PCI Resources
Hot switchovers require considerable resource management. The
obvious management need is for the collective set of PCI resources.
These resources must be identical on both processors participating
in the hot switchover, yet most operating systems supporting PCI
Hot Swap have dynamic allocation mechanisms. For example, PCI bus
numbers are allocated as PCI-to-PCI bridges are encountered in the
enumeration process. Typically, bus numbers for I/O hot swap are
allocated in blocks to allow for subordinate bridges. The CPX8216
chassis, for instance, contains two domain bridges. After a small
allocation to allow for PMC bridges on the system processor, the
remaining bus numbers are divided equally between the two
domains.
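The block allocation can be illustrated with simple arithmetic. PCI bus numbers are 8 bits wide, giving 256 in total; the size of the host reservation below (8) is an assumed figure chosen for the example, not a value taken from the CPX8216 documentation.

```python
def partition_bus_numbers(total=256, host_reserved=8, domains=2):
    """Split PCI bus numbers into a host block plus equal per-domain blocks.

    Returns a dict mapping each domain index to an inclusive
    (first_bus, last_bus) range. `host_reserved` is an assumed figure.
    """
    per_domain = (total - host_reserved) // domains
    ranges = {}
    start = host_reserved
    for d in range(domains):
        ranges[d] = (start, start + per_domain - 1)
        start += per_domain
    return ranges


blocks = partition_bus_numbers()
```

With these assumed numbers, each domain receives 124 bus numbers: 8 through 131 for the first domain and 132 through 255 for the second.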
Typically, operating systems enumerate the PCI bus either
automatically through the receipt of the ENUM signal or on demand
by the system management interface. In either case, the results may
not be identical each time. When configuring for the hot switchover
of the system processor, the system not owning the domain must have
a means of tracking the allocations made by the owning system,
because independently made allocations are not guaranteed to match.
The key for performing bus number allocations for hot switchover
is to make sure that the domain bridges have identical allocations
based on the domain rather than based upon the PCI BDF (bus,
device, function) triple. This is a requirement that is not
accommodated by most currently available operating systems, which
generally just allocate in order as bridges are discovered, and the
discovery process normally proceeds based on the BDF triple.
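The difference between the two allocation strategies can be shown with a short comparison. The domain names, block boundaries, and function names here are assumptions for illustration; the point is only that an allocation keyed by domain is independent of discovery order, while an in-order allocation is not.

```python
# Assumed fixed plan: each domain bridge always gets the same bus block.
DOMAIN_BLOCKS = {"domain_a": (8, 131), "domain_b": (132, 255)}


def allocate_in_discovery_order(bridges, first_bus=8, block=124):
    """Typical behaviour: hand out blocks in the order bridges are found."""
    return {name: (first_bus + i * block, first_bus + (i + 1) * block - 1)
            for i, name in enumerate(bridges)}


def allocate_by_domain(bridges):
    """Hot-switchover-friendly: the block depends only on the domain identity."""
    return {name: DOMAIN_BLOCKS[name] for name in bridges}


# The two hosts may discover the domain bridges in a different order.
host1_order = ["domain_a", "domain_b"]
host2_order = ["domain_b", "domain_a"]

naive_matches = (allocate_in_discovery_order(host1_order)
                 == allocate_in_discovery_order(host2_order))
domain_matches = (allocate_by_domain(host1_order)
                  == allocate_by_domain(host2_order))
```

In this model the discovery-order allocations disagree between the two hosts, while the domain-keyed allocations agree.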
PCI I/O and Memory Allocations
PCI-to-PCI bridges used as domain bridges currently have only one
window for each of the three PCI windows: I/O, memory-mapped I/O,
and prefetchable memory spaces. This single window means that the
available address pool for each must be divided among the domain
bridges. The current recommendation is to expose the entire
resource pool through each domain bridge window. The effect of
dynamically changing the window size to accommodate insertion and
extraction is undetermined, and dependent on the bridge
implementation.
When subordinate allocations are made for devices downstream of
the domain bridges, the same allocation must be made in the other
host's virtual resource pool. This may be done by checkpointing the
allocations to the other processor as they are made. This
requirement is not yet accommodated by most available operating
systems, as the normal strategy is to only make allocations when
physical hardware is discovered. The operating system concept of
resource allocation must be extended to apply to virtual devices
not yet physically present.
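One way to picture checkpointed allocation is a pair of resource pools, one on each host, where every allocation made by the active host produces a record that the standby host applies to its own pool. The class, its bump-allocation policy, and the device names below are hypothetical; they stand in for whatever allocator and checkpoint transport a real operating system would use.

```python
class ResourcePool:
    """Simple bump allocator over a PCI address window (illustrative only)."""

    def __init__(self, base, size):
        self.base, self.limit = base, base + size
        self.next_free = base
        self.allocations = {}          # device id -> (address, length)

    def allocate(self, device, length):
        """Active host: allocate for a discovered device, return a checkpoint record."""
        addr = -(-self.next_free // length) * length   # align up to the region length
        if addr + length > self.limit:
            raise MemoryError("window exhausted")
        self.allocations[device] = (addr, length)
        self.next_free = addr + length
        return (device, addr, length)

    def apply_checkpoint(self, record):
        """Standby host: record the same allocation for a device it cannot see."""
        device, addr, length = record
        self.allocations[device] = (addr, length)
        self.next_free = max(self.next_free, addr + length)


active = ResourcePool(0x8000_0000, 0x1000_0000)
standby = ResourcePool(0x8000_0000, 0x1000_0000)

# Each allocation on the active host is checkpointed to the standby as it is made.
for dev, size in (("slot3.fn0", 0x1000), ("slot5.fn0", 0x10000)):
    standby.apply_checkpoint(active.allocate(dev, size))
```

The standby pool never inspects hardware. It only replays records, which is the sense in which its allocations apply to virtual devices.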
PCI interrupts are allocated according to the hardware wiring
for a given chassis. When an interrupt is allocated in the system
currently owning a domain, the logically equivalent interrupt lines
must be configured on the non-owning processor.
DMA Buffers
Because I/O devices may have pending DMA requests at the time of
domain hot switchover, it is necessary that the physical addresses
used for DMA by the active domain be similarly allocated in the
standby domain. This requirement is not normally met by current
operating systems.
Additionally, in order to manage allocations for multiple
domains, the available DMA memory pool must either be divided and
allocated into segments for each domain, or an MP-safe allocation
algorithm must be used to allow the two processors to communicate
their allocations as they occur. In any event, the DMA allocations
must be checkpointed to the standby system and device drivers.
The physical addresses of the DMA pools must be the same on both
system processors, even if the amount of memory on the two
processors differs. If virtual addresses are used in any of the
packet or control data exchanged between the device and the driver,
then the virtual address of such structures must also be identical
on the active and standby systems.
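The segmented-pool option can be sketched as a pure function of the pool parameters, so that both hosts derive the same physical addresses. The pool base, pool size, domain names, and host memory sizes are all assumed values for the example.

```python
def dma_layout(pool_base, pool_size, domains):
    """Divide one DMA pool into equal segments, one per bus domain.

    Returns {domain: (physical_base, length)}. The result depends only on
    the arguments, so both hosts derive the same physical addresses.
    """
    segment = pool_size // len(domains)
    return {name: (pool_base + i * segment, segment)
            for i, name in enumerate(domains)}


def pool_fits(pool_base, pool_size, host_memory_bytes):
    """The pool must lie inside the memory of the smaller host as well."""
    return pool_base + pool_size <= host_memory_bytes


POOL_BASE = 0x0400_0000          # assumed: starts at 64 MiB
POOL_SIZE = 0x0100_0000          # assumed: 16 MiB long
DOMAINS = ("domain_a", "domain_b")

active_layout = dma_layout(POOL_BASE, POOL_SIZE, DOMAINS)
standby_layout = dma_layout(POOL_BASE, POOL_SIZE, DOMAINS)

# Hosts with different amounts of memory (256 MiB and 128 MiB here).
both_fit = all(pool_fits(POOL_BASE, POOL_SIZE, mem)
               for mem in (256 << 20, 128 << 20))
```

Fixing the pool by physical address rather than by offset from the top of memory is what lets hosts with different memory sizes share one layout.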
Device Drivers
Previous requirements for resource allocations have implied that
device drivers have extended capabilities when used in hot
switchover configurations. They must be able to accept allocation
information via the system checkpoint protocols to ensure that the
active and standby drivers can maintain a mirrored device
model.
Additionally, device drivers must configure without real
hardware, and have the capability of acquiring or releasing the
physical hardware upon command from the system. The drivers may
also have to be extended to discover which buffers are in use at
the time of hot switchover, as that information may not have made
the last valid checkpoint.
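These driver requirements can be summarized in a small model. The class and method names are hypothetical and do not correspond to any particular operating system's driver interface; the sketch only shows a driver that accepts checkpointed configuration while it has no hardware and takes ownership on command.

```python
class MirroredDriver:
    """Hypothetical driver that can run with or without its physical device."""

    def __init__(self):
        self.resources = {}        # mirrored device model: name -> value
        self.owns_hardware = False

    def apply_checkpoint(self, update):
        """Accept allocation information sent by the active driver."""
        self.resources.update(update)

    def acquire_hardware(self):
        """Take over the physical device using the mirrored configuration."""
        if not self.resources:
            raise RuntimeError("no checkpointed configuration")
        self.owns_hardware = True

    def release_hardware(self):
        """Give up the physical device but keep the mirrored model."""
        self.owns_hardware = False


standby_driver = MirroredDriver()
standby_driver.apply_checkpoint({"bar0": 0x8000_0000, "irq": 5})
standby_driver.apply_checkpoint({"dma_ring": 0x0400_0000})
standby_driver.acquire_hardware()      # issued by the system at switchover
```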
Backplane Communication Issues
Intelligent I/O devices and the system processor device drivers
generally communicate via a shared memory packet interface. Warm or
hot switchovers demand that these protocols have certain features.
Protocols intended for warm switchover use must allow for
suspension and reconfiguration of communication addresses following
the warm switchover. The reconfiguration must be accomplished
without requiring the I/O device to access the domain bus.
All protocols intended for use in domain switchover
configurations should protect against lost or corrupted packets.
Packet errors may occur at any time over a complex PCI bus
configuration and are often difficult to localize. While the system
may be aware that an error has occurred, the exact location of the
error may be difficult to determine, especially before the
erroneous data can be used. Pre-emptive takeovers increase the
likelihood of packet errors.
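One common way to protect a packet interface against lost or corrupted packets is to carry a sequence number and a checksum in each packet header. The sketch below uses CRC-32 from Python's standard `zlib` module; the header layout and function names are assumptions for illustration, not a description of any specific backplane protocol.

```python
import struct
import zlib

# Assumed header layout: sequence number, payload length, CRC-32.
HEADER = struct.Struct("<IHI")


def encode(seq, payload):
    """Build a frame whose CRC covers the sequence, length, and payload."""
    crc = zlib.crc32(struct.pack("<IH", seq, len(payload)) + payload)
    return HEADER.pack(seq, len(payload), crc) + payload


def decode(frame, expected_seq):
    """Return the payload, or raise ValueError on corruption or a sequence gap."""
    seq, length, crc = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:HEADER.size + length]
    check = zlib.crc32(struct.pack("<IH", seq, length) + payload)
    if len(payload) != length or check != crc:
        raise ValueError("corrupted packet")
    if seq != expected_seq:
        raise ValueError("lost packet")
    return payload


good = encode(7, b"status")
received = decode(good, expected_seq=7)
```

The receiver checks the CRC before trusting the sequence number, so a damaged header is reported as corruption rather than as a lost packet.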
Future Considerations
This set of software considerations for processor hot swap or
domain takeover is a starting point and should not be considered
comprehensive. Ongoing implementations of high
availability systems will undoubtedly uncover new considerations.
Nonetheless, following these recommendations will help in creating
a system with host hot swap capability, allowing it to handle many
types of system processor faults.
The cost is that system and application software may require
significant changes. The extent of the modifications needed depends
on the type of switchover you choose. Given that some options
require changes to I/O drivers and applications software, your
choices for switchover type may be limited by the extent to which
you have control of that software.
About the Author
Mark is currently the Systems Architect for
High Availability Real-Time Operating Systems at the Motorola
Computer Group. He has been at Motorola for ten years. Previously,
Mark held a variety of software and hardware development and
management positions primarily related to networking and
communications systems. Mark graduated from Bucknell University
with a BSEE in 1975.