Provide BGP NGEN Mvpn support in contrail software.
Currently, multicast is supported using ErmVpn (edge-replicated multicast). That solution is limited to a single virtual-network, i.e., senders and receivers cannot span different virtual-networks. It is also not (yet) interoperable with any other known BGP implementation.
Use the NGEN-Mvpn design to provide inter-VN multicast capabilities, while still using ErmVpn underneath for packet replication in the data plane.
Some investigation was done into extending ErmVpn itself to support inter-VN multicast. There is no clean way to do this and, as mentioned above, it would not be interoperable either.
Mvpn can be enabled/disabled per BGP address family. The bgp_schema.xsd address-family enumeration is extended accordingly:
diff --git a/src/schema/bgp_schema.xsd b/src/schema/bgp_schema.xsd
--- a/src/schema/bgp_schema.xsd
+++ b/src/schema/bgp_schema.xsd
@@ -289,6 +289,8 @@
<xsd:enumeration value="inet-vpn"/>
<xsd:enumeration value="e-vpn"/>
<xsd:enumeration value="erm-vpn"/>
+ <xsd:enumeration value="inet-mvpn"/>
<xsd:enumeration value="route-target"/>
<xsd:enumeration value="inet6"/>
<xsd:enumeration value="inet6-vpn"/>
Also, in order to use this feature, MVPN must be enabled in the configuration file of every contrail-control node (/etc/contrail/contrail-control.conf). This is done by setting the mvpn_ipv4_enable boolean flag in contrail-control.conf.
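For example, a minimal snippet (assuming the flag is read from the DEFAULT section of contrail-control.conf; the exact section and accepted boolean syntax should be verified against the deployed release):

    [DEFAULT]
    mvpn_ipv4_enable=true

In addition, a new igmp-enable property is added to the schema: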
+<xsd:element name="igmp-enable" type="xsd:boolean" default="false" description="Enable IGMP."/>
+<xsd:annotation>
+ <xsd:documentation>
+    IGMP config parameters.
+ </xsd:documentation>
+</xsd:annotation>
+<!--#IFMAP-SEMANTICS-IDL
+ Property('igmp-enable', 'global-system-config', 'optional', 'CRUD',
+ 'IGMP mode at Global level.') -->
+<!--#IFMAP-SEMANTICS-IDL
+ Property('igmp-enable', 'virtual-network', 'optional', 'CRUD',
+ 'IGMP mode at VN level.') -->
+<!--#IFMAP-SEMANTICS-IDL
+ Property('igmp-enable', 'virtual-machine-interface', 'optional', 'CRUD',
+ 'IGMP mode at VMI level.') -->
+
Along with the mvpn_ipv4_enable flag, the igmp-enable field in the schema is required for Mvpn functionality to work. The flag is available at the virtual-machine-interface, virtual-network and global-system-config levels, and the igmp-enable configuration is chosen in that same order of precedence (virtual-machine-interface first, then virtual-network, then global-system-config).
Enable igmp-enable at any of the available levels for the VMI/VN/Contrail to participate in Mvpn at that level.
In order to use the Mvpn feature, users must enable mvpn in bgp and in the desired virtual networks. Mvpn must also be configured on any peering SDN gateway, such as an MX router (JUNOS). In order to support a multicast sender outside the cluster, the necessary S-PMSI tunnel information must also be configured inside the routing instance, as applicable.
The UI shall provide a way to configure/enable Mvpn for bgp and for virtual-networks. Specifically, mvpn for IPv4 (and later for IPv6) would be one of the families available under the bgp address-families configuration. There shall also be a knob under each virtual-network in order to selectively enable/disable mvpn functionality.
The UI shall provide a way to enable/disable IGMP for participation in Mvpn bgp at the virtual-machine-interface, virtual-network or global-system-config level.
#### Describe any log, UVE, alarm changes
When the mvpn AFI is configured, BGP shall exchange the capability and MCAST-VPN NLRI (AFI 1 / SAFI 5) for IPv4 multicast VN routes. This is not enabled by default.
MvpnManager:
1. There shall be one instance per vrf.mvpn.0 table.
2. Maintains the list of auto-discovered mvpn [bgp] neighbors.
3. Manages locally originated mvpn auto-discovery routes such as Type-1 (Intra-AS AD) and Type-2 (Inter-AS AD).
4. Manages locally originated Type-4 Leaf-AD routes (in response to received Type-3 S-PMSI routes).
5. Listens to all changes to the vrf.mvpn.0 table.
6. Handles initialization and cleanup when mvpn configuration is added or deleted in a virtual-network.
7. Provides data for inspection at run time via Introspect.
MvpnProjectManager:
1. There shall be one instance per <project/tenant>.mvpn.0 table.
2. Maintains an std::map<MvpnState::SG, MvpnState *>. This map contains the current Mvpn state of a particular <S,G> multicast route. Among other things, it holds pointers to all Mvpn routes of the corresponding <S,G>, such as Leaf-AD routes, the applicable GlobalErmVpnRoute, etc.
3. The MvpnState structure inside the map is ref-counted and stored inside the DB-State of the referred routes. This ensures that referred routes never get deleted until the DB-State goes away, and vice versa.
4. Listens to all changes to <project/tenant>.ermvpn.0 and notifies all applicable Type-4 Leaf-AD routes if a GlobalErmVpnRoute changes (add/change/delete).
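A minimal sketch of the relationship between the per-<S,G> state map and the route DB-States could look roughly as follows. The names mirror the description above (SG, MvpnState, MvpnProjectManager), but the members and ref-counting details are illustrative assumptions, not the exact implementation:

```cpp
#include <boost/intrusive_ptr.hpp>
#include <cstdint>
#include <map>
#include <set>

// Hypothetical sketch of the per-<S,G> state described above.
struct SG {
    uint32_t source;  // C-S
    uint32_t group;   // C-G
    bool operator<(const SG &rhs) const {
        return source != rhs.source ? source < rhs.source : group < rhs.group;
    }
};

class MvpnState {
public:
    explicit MvpnState(const SG &sg) : sg_(sg), refcount_(0) {}
    std::set<const void *> leaf_ad_routes;      // Leaf-AD routes of this <S,G>
    const void *global_ermvpn_route = nullptr;  // applicable GlobalErmVpnRoute

private:
    friend void intrusive_ptr_add_ref(MvpnState *s) { ++s->refcount_; }
    friend void intrusive_ptr_release(MvpnState *s) {
        if (--s->refcount_ == 0) delete s;
    }
    SG sg_;
    int refcount_;
};

// The DB-State attached to each referred route holds a reference to the
// MvpnState, so the state cannot be freed while any referring route (and its
// DB-State) still exists, and vice versa.
struct MvpnDBState {
    boost::intrusive_ptr<MvpnState> state;
};

// One such map per MvpnProjectManager, i.e. per <project/tenant>.mvpn.0 table.
typedef std::map<SG, MvpnState *> MvpnStatesMap;
```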
All mvpn and ermvpn routes for a particular group are always handled serially under the same db task. This is achieved by using the group as the only field used to hash and map a route to a particular DB table partition (see the sketch below). This simplifies the design and lets MvpnManager, MvpnProjectManager, ErmVpnManager and MvpnTable::Replicate() call directly into each other (using well-defined APIs) in a thread-safe manner, without the need to acquire a mutex.
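For illustration, a partition index computed from the group address alone (the function name and partition count below are assumptions) would guarantee that every mvpn and ermvpn route of a given group lands in the same DB table partition:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical sketch: hashing only the group address guarantees that all
// mvpn/ermvpn routes of one <S,G> map to the same DB table partition,
// regardless of the source address. kPartitionCount is an assumed value.
const size_t kPartitionCount = 64;

size_t GetTablePartitionIndex(uint32_t group_address) {
    return std::hash<uint32_t>()(group_address) % kPartitionCount;
}
```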
Note: Care must be taken to ensure data consistency whenever Mvpn neighbor information is accessed. This information is constructed off Type-1 and Type-2 routes, which do not contain any S,G information. Hence they do not necessarily fall into the same partition as other routes that are specific to an <S,G>. This is handled by protecting the neighbor map with a mutex. Callers are always provided a copy of the MvpnNeighbor structure.
| Route Type | Created When | Primary | Replicated When | Secondary | Notes |
|------------|--------------|---------|-----------------|-----------|-------|
| T1 AD | Mvpn configured | vrf.mvpn | Right away | bgp.mvpn, vrf[s].mvpn | Without Inclusive PMSI; sent only to I-BGP peers |
| T1 AD | Received via bgp | bgp.mvpn | Right away | vrf[s].mvpn | |
| T2 AD | Mvpn configured | vrf.mvpn | Right away | bgp.mvpn, vrf[s].mvpn | Without Inclusive PMSI; sent only to E-BGP peers |
| T2 AD | Received via bgp | bgp.mvpn | Right away | vrf[s].mvpn | |
| T5 Source-Active AD | Received via xmpp | vrf.mvpn | Right away | bgp.mvpn, vrf[s].mvpn | |
| T5 Source-Active AD | Received via bgp | bgp.mvpn | Right away | vrf[s].mvpn | |
| T6 C-<*, G> (Shared Tree Join) | Received via xmpp | vrf.mvpn | Source-Active/PIM-Register is present | bgp.mvpn, vrf[s].mvpn | |
| T6 C-<*, G> (Shared Tree Join) | Received via bgp | bgp.mvpn | Right away | vrf[s].mvpn | |
| T7 C-<S, G> (Source Tree Join) | Received via xmpp | vrf.mvpn | Source is resolvable | bgp.mvpn, vrf[s].mvpn | |
| T7 C-<S, G> (Source Tree Join) | Received via bgp | bgp.mvpn | Right away | vrf[s].mvpn | |
| T3 S-PMSI | T6/T7 created in vrf | vrf.mvpn | Right away | bgp.mvpn, vrf[s].mvpn | Sent with Leaf Info Required |
| T3 S-PMSI | Received via bgp | bgp.mvpn | Right away | vrf[s].mvpn | With Leaf Info Required |
| T4 Leaf-AD | T3 replicated in vrf | vrf.mvpn | GlobalErmRt available | bgp.mvpn, vrf[s].mvpn | With PMSI (Ingress Replication + GlobalTreeRootLabel + Encap: MPLS over GRE/UDP); GlobalErmRt updated with Input Tunnel Attribute |
| T4 Leaf-AD | Received via bgp or local replication | bgp.mvpn | Right away | vrf[s].mvpn | Xmpp update sent to the ingress vrouter with PMSI info |
In Phase 1, receivers are always inside the contrail cluster and the sender is always outside the cluster. Only source-specific multicast C-<S,G> is supported (no support for C-<*, G> ASM), and there is no support for Inclusive I-PMSI either.
| Route Type | Created When | Primary | Replicated When | Secondary | Notes |
|------------|--------------|---------|-----------------|-----------|-------|
| T1 AD | Mvpn configured | vrf.mvpn | Right away | bgp.mvpn, vrf[s].mvpn | Without Inclusive PMSI; sent only to I-BGP peers |
| T1 AD | Received via bgp | bgp.mvpn | Right away | vrf[s].mvpn | |
| T2 AD | Mvpn configured | vrf.mvpn | Right away | bgp.mvpn, vrf[s].mvpn | Without Inclusive PMSI; sent only to E-BGP peers |
| T2 AD | Received via bgp | bgp.mvpn | Right away | vrf[s].mvpn | |
| T7 C-<S, G> (Source Tree Join) | Received via xmpp | vrf.mvpn | Source is resolvable | bgp.mvpn, vrf[s].mvpn | |
| T3 S-PMSI | Received via bgp | bgp.mvpn | Right away | vrf[s].mvpn | With Leaf Info Required |
| T4 Leaf-AD | T3 replicated in vrf | vrf.mvpn | GlobalErmRt available | bgp.mvpn, vrf[s].mvpn | With PMSI (Ingress Replication + GlobalTreeRootLabel + Encap: MPLS over GRE/UDP); GlobalErmRt updated with Input Tunnel Attribute |
Note: Whenever a route is replicated into bgp.mvpn.0, it is expected to be sent to all remote bgp peers (subject to route-target filtering) and imported into vrf[s].mvpn.0 based on the export route-target of the route and the import route-targets of the table.
The Mvpn workflow is essentially driven by route change notifications. MvpnManager listens to route change notifications on the mvpn table. Primary paths are created and managed by MvpnManager. Secondary paths are created and managed inside the virtual function MvpnTable::RouteReplicate(). MvpnProjectManager listens to route change notifications on the <project/tenant>.ermvpn.0 table and takes the appropriate actions on all mvpn routes applicable to that <S,G>.
Each MvpnManager of a vrf.mvpn.0 table originates Type-1 and Type-2 AD routes inside vrf.mvpn.0 whenever mvpn is configured/enabled in the VN. (Note: no Inclusive PMSI information is encoded when these routes are created/advertised.)
The control-node's own IP address, router-id and ASN are used wherever originator information is encoded.
Type-1 AD prefix format
1:RD:OriginatorIpAddr (RD in asn:vn-id or router-id:vn-id format)
1:self-control-node-router-id:vn-id:originator-control-node-ip-address
The export route target is the export route-target of the routing-instance. These routes would get imported into all mvpn tables whose import route-targets list contains this exported route-target (similar to how vpn-unicast routes get imported; aka JUNOS auto-export).
Type-2 AD prefix format
2:RD:SourceAs
2:self-control-node-router-id:vn-id:source-as
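For illustration, assuming a control-node router-id of 10.1.1.1, a vn-id of 5 and AS 64512 (all hypothetical values), the locally originated AD routes would look like:
1:10.1.1.1:5:10.1.1.1
2:10.1.1.1:5:64512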
These routes first get replicated into bgp.mvpn.0 and then get advertised to all BGP neighbors with whom the mvpn AFI was exchanged as part of the initial capability negotiation. This is the classic bgp-based mvpn-site auto-discovery.
Note: The Intra-AS (Type-1) route is advertised only to IBGP neighbors and the Inter-AS (Type-2) route is advertised only to EBGP neighbors. Since those neighbors are already part of distinct peer groups on the outbound side, simple hard-coded filtering can be applied to get this functionality.
Auto-discovered Mvpn neighbors are stored inside a map<IpAddress, MvpnNeighbor *> in each MvpnManager object (i.e., per virtual-network). Access to this map shall be protected using a mutex. Clients get a copy of the MvpnNeighbor structure upon a call to MvpnManager::findNeighbor(). This allows the neighbors map to be updated inline, in the RouteListener() method, after Type-1/Type-2 routes are replicated/deleted in vrf[s].mvpn.0 and are notified.
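A minimal sketch of that access pattern follows. The class and member names are illustrative stand-ins for the real code, and std::mutex is used here only for self-containedness:

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

// Hypothetical sketch of the mutex-protected neighbor map described above.
struct MvpnNeighbor {
    std::string originator;  // learnt from the Type-1/Type-2 route
    uint32_t asn = 0;
    uint32_t vrf_id = 0;
};

class MvpnNeighborMap {
public:
    // Returns a *copy* of the neighbor so that callers never hold a pointer
    // into the map while another partition updates it.
    bool findNeighbor(const std::string &originator, MvpnNeighbor *copy) const {
        std::lock_guard<std::mutex> lock(mutex_);
        std::map<std::string, MvpnNeighbor>::const_iterator it =
            neighbors_.find(originator);
        if (it == neighbors_.end())
            return false;
        *copy = it->second;
        return true;
    }

    // Called inline from the route listener when Type-1/Type-2 routes are
    // replicated or deleted in vrf[s].mvpn.0 and notified.
    void updateNeighbor(const MvpnNeighbor &nbr) {
        std::lock_guard<std::mutex> lock(mutex_);
        neighbors_[nbr.originator] = nbr;
    }

private:
    mutable std::mutex mutex_;
    std::map<std::string, MvpnNeighbor> neighbors_;
};
```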
The Agent sends IGMP joins received over a VMI as C-<S,G> routes over XMPP and keeps track of a map of all <S,G> routes => list of VMIs (and VRFs) (for mapping to a tree-id). These C-<S,G> routes are added to the vrf.mvpn.0 table with source protocol XMPP as Type-7 routes. They shall have the zero RD as the source-root-rd and 0 as the root-as ASN (since these values are not applicable for the primary paths).
Format of Type 7 <C-S, G> route added to vrf.mvpn.0 with protocol local/Mvpn
7:<zero-router-id>:<vn-id>:<zero-as>:<C, G>
MvpnManager, which is a listener to this route, does the following upon route change notification:
- Registers the /32 Source address for resolution in the PathResolver.
- Creates/updates the DB-State in MvpnManager::RouteStatesMap inside the (S,G)-specific MvpnManagerPartition object of the project.
Whenever this address becomes resolvable (or stops being resolvable), the Type-7 route is notified and the route replicator tries to replicate the path.
- If the route is resolvable (over BGP), the next-hop address and the rt-import extended route-target community associated with the route are retrieved from the resolver data structure (which holds a copy of the necessary information, such as the path attributes).
- If the next-hop (i.e., the multicast Source) is resolvable, then this Type-7 path is replicated into all applicable vrfs, including bgp.mvpn.0.
- If the route is no longer resolvable, then any already replicated Type-7 path is deleted (by simply returning NULL from MvpnTable::RouteReplicate()). A sketch of this decision follows.
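The sketch below illustrates that replication decision, including the check (described further below) that the resolved nexthop must also be an auto-discovered mvpn bgp neighbor. All structures and names here are simplified stand-ins for the real bgp classes, not the actual implementation:

```cpp
#include <set>
#include <string>

// Hypothetical, simplified sketch of the Type-7 replication decision.
struct ResolvedSource {
    bool resolvable = false;
    std::string nexthop;     // resolved nexthop towards C-S
    std::string source_rd;   // RD advertised by the ingress PE
    std::string rt_import;   // rt-import extended community of that route
};

struct SecondaryPath {
    std::string rd;
    std::string export_target;
};

// Returns true and fills 'path' if the Type-7 join should be replicated into
// bgp.mvpn.0; returning false models MvpnTable::RouteReplicate() returning
// NULL, which deletes any previously replicated secondary path.
bool ReplicateType7(const ResolvedSource &src,
                    const std::set<std::string> &mvpn_neighbors,
                    SecondaryPath *path) {
    if (!src.resolvable)
        return false;                               // source not resolvable
    if (mvpn_neighbors.find(src.nexthop) == mvpn_neighbors.end())
        return false;                               // nexthop is not an mvpn neighbor
    path->rd = src.source_rd;                       // 7:<source-root-rd>:<root-as>:<C,G>
    path->export_target = src.rt_import;            // ingress PE's rt-import target
    return true;
}
```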
The format of the replicated path for a Type-7 C-<S, G> route is shown below.
7:<source-root-rd>:<root-as>:<C, G>
7:source-root-router-id:vn-id:<root-as>:<C, G>
export route-target should be rt-import route-target of route towards source
(as advertised by ingress PE)
Note: If the resolved nexthop is not an Mvpn BGP neighbor, then no join must be sent out. Hence, this Type-7 join route is replicated into bgp.mvpn.0 only if there is a Type-1/Type-2 route received from the resolved-nexthop bgp neighbor.
On the other hand, if received Type-1/Type-2 routes get withdrawn, all replicated Type-7 (and Type-4?) routes for that PE must no longer be replicated and must instead be deleted from the secondary tables.
If Type-1/Type-2 routes come in later (auto-discovery happens afterwards), then all Type-7 routes must be replicated and advertised again to the newly discovered mvpn bgp neighbor.
Concurrency: Since BGP AD related events do not carry any C,G information, we cannot safely walk all the C,Gs from one partition. Instead, we can simply launch a walk of the mvpn db tables and notify all Type-7 and Type-4 routes (see the sketch below). Inside RouteReplicate(), new Type-7/Type-4 paths shall be replicated, or already replicated Type-7/Type-4 paths shall be deleted, based on the presence/absence of the BGP Mvpn auto-discovered neighbor.
Q: Should we instead maintain a set of all Type-7 and Type-4 paths (on a per-partition basis)? Changes to the auto-discovery state are quite rare in nature, though. Hence it seems this scenario can be handled simply by walking the entire vrf.mvpn.0 table.
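Under those assumptions, the neighbor-change handling could be as simple as the sketch below (the route stub and function name are illustrative; the real code would use the DBTable walk and notification infrastructure):

```cpp
#include <vector>

// Hypothetical sketch: when the auto-discovered neighbor set changes, walk the
// whole mvpn table and notify every Type-7 and Type-4 route so that
// RouteReplicate() re-evaluates (adds or deletes) their secondary paths.
struct MvpnRouteStub {
    int type;            // 1..7, per mvpn route type
    bool needs_revisit;  // stands in for DBTable route notification
};

void OnMvpnNeighborSetChanged(std::vector<MvpnRouteStub> &mvpn_table) {
    for (std::vector<MvpnRouteStub>::iterator it = mvpn_table.begin();
         it != mvpn_table.end(); ++it) {
        if (it->type == 7 || it->type == 4)
            it->needs_revisit = true;  // triggers replication re-evaluation
    }
}
```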
Once replicated into bgp.mvpn.0, the secondary path will be advertised to all other mvpn neighbors (route-target filtering will ensure that it is only sent to the ingress PE). For Phase 1 this route only needs to be replicated into bgp.mvpn.0 (the sender is always outside the contrail cluster); there is no need to replicate this route into any other vrf locally. That is required only when we add support for a sender in one virtual-network and receivers in another within the contrail cloud.
Any change to the reachability of the Source shall be handled as a delete of the old Type-7 secondary path and an add of a new Type-7 secondary path.
Note: This requires advertising IGMP routes as XMPP routes into a different table, <project/tenant>.mvpn.0 (in addition to vrf.mvpn.0), and hence requires changes to the agents. Agents should refcount and advertise an <S,G> once to <project/tenant>.mvpn.0 as new receivers join (or old receivers leave).
All mvpn prefixes across all VNs reside in the bgp.mvpn.0 table. Hence the routes are made distinct by prepending the appropriate route distinguisher, whose format would be <router-id>:<vrf-id>. The router-id could be the self (control-node) router-id or a remote peer router-id, as appropriate. The vrf-id, though, must be the unique virtual-network id associated with that router-id. For remote peers, this id can be retrieved from the Type-1/Type-2 auto-discovery routes. Locally, it is allocated during RoutingInstance construction.
As mentioned in the previous section, when a C-<S, G> route is received and installed in the vrf.mvpn.0 table (protocol: XMPP), the Source needs to be resolved in the vrf.inet.0 table in order to reach the ingress PE. Hence, Type-7 paths added in BgpXmppChannel::ProcessMvpnItem() are always marked as "ResolutionRequested".
This ensures that route resolution is started automatically by the current implementation when those routes are added to the BgpTable. When Type-7 routes are deleted, the current code automatically stops the resolution requests as well.
The PathResolver code should be slightly modified though, in order to:
- Support longest-prefix match.
- Add resolved paths into the route only conditionally. Mvpn does not need explicit resolved paths in the route; it only needs unicast route resolution.
- Notify the Type-7 join route whenever the unicast RPF Source becomes resolved or unresolved. Inside MvpnTable::Replicate(), the Type-7 C-<S,G> route can then be replicated into the bgp.mvpn.0 table with the correct RD and export route-target (or deleted if it was replicated before and the Source is no longer resolvable, or if the resolution parameters change). A check must also be made to ensure that the resolved nexthop is also an active mvpn bgp neighbor.
When this <S,G> Type-7 route sent by the egress PE is successfully imported into vrf.mvpn.0 by the ingress PE (by matching the auto-generated rt-import route-target) and a provider-tunnel is configured, then the ingress PE should generate a Type-3 C-<S, G> S-PMSI AD route into its vrf.mvpn.0 table. From there, this route would be replicated into bgp.mvpn.0 and advertised to the remote (egress) PEs.
e.g. Ingress (Sender) PE S-PMSI configuration in JUNOS
set routing-instances vrf provider-tunnel selective group 232.2.0.0/16 source 1.2.3.4/32 ingress-replication label-switched-path
This is not targeted for Phase 1. Please refer to Section 4.3 for the general route flow of these ASM routes.
3:<root-rd>:<C-S,G>:<Sender-PE-Router-Id> (Page 240)
Target would be the export target of the vrf (so it would get imported into vrf.mvpn.0 in the egress pe)
PMSI Flags: Leaf Information Required = 1 (so that the egress can initiate the join for the pmsi tunnel; similar to RSVP-based S-PMSI, Page 251). TunnelType is expected to be only Ingress-Replication and the Label is always expected to be 0.
This route is originated whenever Type-6/Type-7 customer join routes are received and replicated in vrf.mvpn.0 (at the ingress). Origination of this route in the control-node is not required for Phase 1.
4:<S-PMSI-Type-3-Route-Prefix>:<Receive-PE-Router-ID>
Route target: <Ingress-PE-Router-ID>:0
PMSI Tunnel: Tunnel Type: Ingress-Replication and label:Tree-Label
When the Type-3 route is imported into the vrf.mvpn.0 table at the egress (such as a control-node), the secondary Type-3 route is notified.
MvpnManager, upon notification for any Type-3 route, shall do the following.
On Add/Change:
- Originate a new Type-4 route in the vrf.mvpn.0 table corresponding to the newly arrived Type-3 route and add it to the S-G specific map inside the associated MvpnProjectManager object.
- No PMSI Tunnel attribute is encoded (yet) in this Type-4 path.
- Notify the route.
On Delete, any Type-4 path originated earlier is retrieved from the DB-State and deleted.
This Type-4 path is replicated into bgp and other tables if the PMSI information can be computed (in MvpnTable::RouteReplicate()). The PMSI information is retrieved via the API ErmVpnManager::GetGlobalRouteForestNodePMSI(). Specifically, Mvpn needs the input label and the IP address of the multicast forwarder (vrouter) that is the forest node of the global (level-1) ermvpn tree. If this PMSI info is present, the Type-4 Leaf-AD route is replicated into bgp and other tables as appropriate.
Also, the GlobalErmVpnRoute is notified so that the forest-node xmpp route can be updated with the appropriate input tunnel information (provided by the Mvpn module; this is the IP address present in the prefix of this originated Leaf-AD route).
If the PMSI information cannot be retrieved (e.g., the global level-1 ermvpn tree is not computed yet), then the Type-4 path is not replicated into other tables.
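A simplified sketch of that gate is shown below. The lookup mirrors ErmVpnManager::GetGlobalRouteForestNodePMSI() conceptually, but the types and names here are illustrative assumptions:

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch of gating Type-4 Leaf-AD replication on the PMSI
// information of the forest node of the global (level-1) ermvpn tree.
struct PmsiInfo {
    std::string forwarder_address;  // IP of the forest-node vrouter
    uint32_t label = 0;             // its input (ingress) label
};

struct LeafAdPath {
    std::string tunnel_endpoint;
    uint32_t tunnel_label = 0;
    bool ingress_replication = false;
};

// Returns true (and fills the replicated path) only if the level-1 ermvpn tree
// has been computed and PMSI information is available; otherwise the Type-4
// path stays local and is re-evaluated when the global tree changes.
bool ReplicateLeafAd(const PmsiInfo *pmsi, LeafAdPath *path) {
    if (pmsi == nullptr || pmsi->label == 0)
        return false;                      // global ermvpn tree not ready yet
    path->tunnel_endpoint = pmsi->forwarder_address;
    path->tunnel_label = pmsi->label;
    path->ingress_replication = true;      // tunnel type: Ingress-Replication
    return true;
}
```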
On the other hand, when the global tree is computed or updated, the MvpnProjectManager listener gets called. In this callback:
- Walk the list of all Type-4 paths already originated for this S-G and notify them.
- RouteReplicate() for each Type-4 path then runs as before and can now [un-]replicate the Type-4 path based on the [un-]availability of the PMSI information.
MvpnManager shall provide an API to ErmVpnManager to encode the input tunnel attribute for the forest node of the (self) local tree.
MvpnManager::GetInputTunnelAttr(ErmVpnRoute *global_ermvpn_tree_route) const;
The ErmVpn manager shall call this API whenever the forest-node route (olist) is computed. MvpnManager can find this info from the associated db state. If the route has indeed been replicated successfully, then the ingress PE router address can be provided as the input tunnel attribute (which is part of the Type-4 route prefix).
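A rough sketch of the producer side of this API is shown below; the class and member names are illustrative stand-ins, not the real MvpnManager interface:

```cpp
#include <string>

// Hypothetical sketch of providing the input tunnel attribute for the forest
// node of the local tree, as described above.
struct InputTunnelAttr {
    std::string endpoint;  // ingress PE address taken from the Leaf-AD prefix
    std::string encap;     // MPLS over GRE or MPLS over UDP
};

class MvpnInputTunnelProvider {
public:
    // Mirrors MvpnManager::GetInputTunnelAttr(): returns false if no Type-4
    // Leaf-AD route has been replicated successfully for this <S,G> yet.
    bool GetInputTunnelAttr(InputTunnelAttr *attr) const {
        if (ingress_pe_address_.empty())
            return false;
        attr->endpoint = ingress_pe_address_;
        attr->encap = encap_;
        return true;
    }

    std::string ingress_pe_address_;
    std::string encap_ = "gre";  // assumed default, for illustration only
};
```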
Reference: Page 254
Open question: does it go into the correct vrf only because the prefix itself is copied?
This is mainly applicable for inter-as mvpn stitched over segmented tunnels. This is not targeted for Phase 1.
Build one tree for each unique C-<S, G> combination under each tenant/project. Note: <S,G> is expected to be unique within a tenant (project) space.
Tree can be built completely using existing ErmVpn feature.
The master control-node, which builds the level-1 tree, shall also be solely responsible for advertising mvpn routes. Today, this is simply decided by choosing the control-node with the lowest router-id.
Agent shall continue to advertise routes related to BUM traffic over xmpp to vrf.ermvpn.0 table. However, all mvpn routes (which are based on IGMP joins or leaves) are always advertised over XMPP to <project/tenant>.mvpn.0 table.
The control-node shall build an edge-replicated multicast tree as it does already. PMSI tunnel information is gathered from the forest node of the locally computed tree (the bottom-most and right-most leaf node in the replication tree).
When advertising the Type-4 Leaf-AD route (Section 4.7), the egress must also advertise PMSI tunnel information for forwarding to happen. The Mvpn manager shall use the root-node ingress label as the label value in the PMSI tunnel attribute.
Whenever the tree is recomputed (and only if the forest node changes), the Type-4 route is updated and re-advertised as necessary. If a new node is selected as the root, the Type-4 route can simply be updated and re-advertised (implicit withdraw followed by a new update).
The ErmVpn tree built inside the contrail cluster is fully bidirectional and self-contained. A vrouter floods packets within the tree only so long as the packet was originated from one of the nodes inside the tree (in the oif list).
This is no longer true when a single node of the tree is stitched with the SDN gateway using ingress replication. The stitched node must be programmed to accept multicast packets from the SDN gateway (over the GRE/UDP tunnel) and then flood them among all the nodes contained inside the tree. This is done by encoding a new "Input Tunnel Attribute" in the xmpp route sent to the agent. This attribute shall contain the IP address of the tunnel end point (the SDN GW) as well as the tunnel type (MPLS over GRE or MPLS over UDP) as appropriate.
The vrouter should relax its checks and accept multicast packets received from this SDN gateway even though the incoming interface may not be part of the oif list.
The vrf.inet.0 table must be available when Type-7 routes are added to vrf.mvpn.0 so that route resolution for the source address can be started. At present, the BgpXmppChannel module defers all table subscribes/unsubscribes and route additions/deletions from the agent until the RoutingInstance and the tables are created (via configuration processing). This dependency logic must be extended to span all families (tables).
This design also relies on the presence of a <project/tenant>-specific virtual network. This routing-instance is created based on configuration. No assumption shall be made about the order of this configuration processing. Instead, the MvpnManager module shall handle the scenario in which <project/tenant>.mvpn.0 may not be present at a given time and may come into existence later.
The Agent implements the IGMP router functionality. Even though the IGMP implementation in 'services/grpmgmt' handles IGMPv1, v2 and v3, only IGMPv3 is of interest in contrail. More specifically, only IGMPv3, which can derive <S,G> in include mode, is handled. <*,G> is not supported in this phase.
IGMP packets and <S,G> notifications from the IGMP router implementation are handled in the same task and instance context. Reports are also sent in the same task and instance context.
Every IPAM per VN is registered with the IGMP module, using the gateway or the DNS server address, for handling of IGMP packets on the compute. IGMP packets from the Agent do not get sent out of the compute/vrouter when IGMP is enabled on the VMI.
The Agent expects IGMP packets from the local VMIs only, and such IGMP packets should be sourced from a valid IP address belonging to an active VN-IPAM.
<S,G> joins and leaves are tracked on a per-VMI basis. The first <S,G> join by any VMI on a compute results in a multicast route addition in the VMI's vrf. The route is also added to the ip-fabric vrf IPv4 multicast table if it does not already exist. A Mvpn subscription notification is sent to contrail-control for participation of the compute in Mvpn for the <S,G>. A notification is also sent to contrail-control when the route is added to the ip-fabric vrf, for participation in the ermvpn tree.
The last <S,G> leave from any VMI results in deletion of the route from the VMI's vrf. It may also result in deletion of the <S,G> route from the ip-fabric vrf IPv4 multicast table if no other VNs are interested in the <S,G>. Unsubscribe notifications are sent to contrail-control for both Mvpn and ermvpn as needed.
Even though a route is created in the Agent for the <S,G>, it is not downloaded to the vrouter, since the source is not expected to be inside the contrail cluster in this phase. Only a composite nexthop entry is updated in the vrouter. The nexthop in the ip-fabric vrf per <S,G> contains the list of local VMs that have shown interest via IGMP as well as the list of tunnel nexthops derived from ermvpn on the contrail control-node.
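A sketch of that agent-side bookkeeping is shown below; the container and class names are assumptions made for illustration:

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical sketch of per-<S,G> tracking of interested VMIs: route add on
// the first join, route delete (and unsubscribe) on the last leave.
typedef std::pair<std::string, std::string> SG;          // <source, group>
typedef std::map<SG, std::set<std::string> > SgVmiMap;   // <S,G> -> set of VMIs

class IgmpSgTracker {
public:
    // Returns true if this is the first join for <S,G>, i.e. routes must be
    // added and Mvpn/ErmVpn subscriptions sent to contrail-control.
    bool Join(const SG &sg, const std::string &vmi) {
        std::set<std::string> &vmis = sg_vmi_map_[sg];
        bool first = vmis.empty();
        vmis.insert(vmi);
        return first;
    }

    // Returns true if this was the last leave for <S,G>, i.e. routes must be
    // withdrawn and unsubscribe notifications sent as needed.
    bool Leave(const SG &sg, const std::string &vmi) {
        SgVmiMap::iterator it = sg_vmi_map_.find(sg);
        if (it == sg_vmi_map_.end())
            return false;
        it->second.erase(vmi);
        if (!it->second.empty())
            return false;
        sg_vmi_map_.erase(it);
        return true;
    }

private:
    SgVmiMap sg_vmi_map_;
};
```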
## 5.2 Forwarding performance
This feature requires automatic creation and management of a project- or tenant-specific virtual-network in the configuration. Care must be taken to ensure that old configurations which do not have such a virtual-network continue to work correctly.
#### If this feature deprecates any older feature or API then list it here.
- Project specific virtual-network to manage forwarding of multicast data
- IGMP support in contrail vrouter agent
- Advertisement of IGMP routes over XMPP to vrf.mvpn.0 and .mvpn.0
- Ingress Replication over MPLS-over-GRE/MPLS-over-UDP tunnels in SDN (JUNOS)