Implemented persistency models - Cache and Persistent
With this change, PMEM volumes are created in the CreateVolume() call,
irrespective of the persistency model. This means there is no 'delayed volume'
creation in ControllerPublishVolume.

In the normal persistent volume case, the driver creates a pmem volume on a
free node, chosen in the order of
'CreateVolumeRequest.TopologyRequirement.Preferred', and locks the newly
created volume to that node by using 'Volume.Topology'. The container
orchestrator has to make sure that the application which claims this volume
runs on that node.

Unlike a normal persistent volume, a cache volume consists of a set of pmem
volumes created on a number of different nodes, each with its own local data,
chosen in the order of 'CreateVolumeRequest.TopologyRequirement.Preferred'.
Applications are started on those nodes and then get to use the volume on
their node. Data persists across application restarts. This is useful when the
data is only cached information that can be discarded and reconstructed at any
time and the application can reuse existing local data when restarting.

Introduced new volume/StorageClass parameters:
- "persistencyModel", which shall be set to "cache" for cache volumes, and
- "cacheSize", which requests the number of nodes on which the created volume
  can be used.

Updated README to describe the possible persistency models and the Kubernetes
usage of cache and persistent volumes.

Removed node locking from the volume tests, as full topology support is now
implemented and provisioning in a cluster is supposed to work.
avalluri committed Mar 1, 2019
1 parent 6c9e6a0 commit d24d610
Showing 16 changed files with 554 additions and 215 deletions.
42 changes: 37 additions & 5 deletions README.md
@@ -125,16 +125,48 @@ The [`test/setup-ca-kubernetes.sh`](test/setup-ca-kubernetes.sh) script shows how

A production deployment can improve upon that by using some other key delivery mechanism, like for example [Vault](https://www.vaultproject.io/).

### Dynamic provisioning

The following diagram illustrates how the PMEM-CSI driver performs dynamic volume provisioning in Kubernetes:
![sequence diagram](/docs/images/sequence/pmem-csi-sequence-diagram.png)
<!-- FILL TEMPLATE:
* Target users and use cases
* Design decisions & tradeoffs that were made
* What is in scope and outside of scope
-->

### Volume Persistency

In a typical CSI deployment, volumes are provided by a storage backend that is independent of a particular node. When a node goes offline, the volume can be mounted elsewhere. But PMEM volumes are *local* to a node and thus can only be used on the node where they were created. This means that applications using a PMEM volume cannot freely move between nodes. This limitation needs to be considered when designing and deploying applications that are to use *local storage*.

Below are the volume persistency models considered for implementation in PMEM-CSI to serve different application use cases:

* Persistent Volumes
A volume gets created independently of the application, on some node where there is enough free space. Applications using such a volume are then forced to run on that node and cannot run when the node is down. Data is retained until the volume gets deleted.

* Ephemeral Volumes
Each time an application starts to run on a node, a new volume is created for it on that node. When the application stops, the volume is deleted. The volume cannot be shared with other applications. Data on this volume is retained only while the application runs.

* Cache Volumes
Volumes are pre-created on a certain set of nodes, each with its own local data. Applications are started on those nodes and then get to use the volume on their node. Data persists across application restarts. This is useful when the data is only cached information that can be discarded and reconstructed at any time *and* the application can reuse existing local data when restarting.

Volume | Kubernetes | PMEM-CSI | Limitations
--- | --- | --- | ---
Persistent | supported | supported | topology aware scheduling<sup>1</sup>
Ephemeral | [in design](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20190122-csi-inline-volumes.md#proposal) | in design | topology aware scheduling<sup>1</sup>, resource constraints<sup>2</sup>
Cache | supported | supported | topology aware scheduling<sup>1</sup>

<sup>1 </sup>[Topology aware scheduling](https://github.com/kubernetes/enhancements/issues/490)
ensures that an application runs on a node where the volume was created. For CSI-based drivers like PMEM-CSI, Kubernetes >= 1.13 is needed. On older Kubernetes releases, pods must be scheduled manually onto the right node(s).

<sup>2 </sup>The upstream design for ephemeral volumes currently does not take [resource constraints](https://github.com/kubernetes/enhancements/pull/716#discussion_r250536632) into account. If an application gets scheduled onto a node and then creating the ephemeral volume on that node fails, the application on the node cannot start until resources become available.

#### Usage on Kubernetes

Kubernetes cluster administrators can expose the above-mentioned [volume persistency types](#volume-persistency) to applications using [`StorageClass Parameters`](https://kubernetes.io/docs/concepts/storage/storage-classes/#parameters). An optional `persistencyModel` parameter differentiates how the provisioned volume can be used.

* If no `persistencyModel` parameter is specified in the `StorageClass`, the volume is treated as a normal Kubernetes persistent volume. In this case, PMEM-CSI creates a PMEM volume on a node, and the application that claims this volume is supposed to be scheduled onto that node by Kubernetes. The choice of node depends on the StorageClass `volumeBindingMode`: with `volumeBindingMode: Immediate`, PMEM-CSI chooses a node randomly, whereas with `volumeBindingMode: WaitForFirstConsumer` Kubernetes first chooses a node for scheduling the application, and PMEM-CSI creates the volume on that node. Applications which claim a normal persistent volume have to use the `ReadWriteOnce` access mode in their `accessModes` list. This [diagram](/docs/images/sequence/pmem-csi-persistent-sequence-diagram.png) illustrates how a normal persistent volume gets provisioned in Kubernetes using the PMEM-CSI driver. See the sketches after this list for what such a StorageClass and the resulting node-pinned volume might look like.

* `persistencyModel: cache`
Volumes of this type shall be used in combination with `volumeBindingMode: Immediate`. In this case, PMEM-CSI creates a set of PMEM volumes, each on a different node. The number of PMEM volumes to create can be specified by the `cacheSize` StorageClass parameter. Applications which claim a `cache` volume can use `ReadWriteMany` in their `accessModes` list. Check the provided [cache StorageClass](deploy/kubernetes-1.13/pmem-storageclass-cache.yaml) example. This [diagram](/docs/images/sequence/pmem-csi-cache-sequence-diagram.png) illustrates how a cache volume gets provisioned in Kubernetes using the PMEM-CSI driver.
**NOTE**: Cache volumes are local to a node, not to a Pod. If two Pods using the same cache volume run on the same node, they will not get their own local volumes; instead, they end up sharing the same PMEM volume. Applications have to consider this and use available Kubernetes mechanisms like [anti-affinity](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity) while deploying. Check the provided [cache application](deploy/kubernetes-1.13/pmem-app-cache.yaml) example.
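
For comparison with the provided cache `StorageClass`, a class for normal persistent volumes simply omits the `persistencyModel` parameter. Below is a minimal sketch; the name `pmem-csi-sc-persistent` is hypothetical and no such file is shipped with this commit:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pmem-csi-sc-persistent  # hypothetical name, for illustration only
provisioner: pmem-csi
reclaimPolicy: Delete
# Let Kubernetes choose a node for the application first;
# PMEM-CSI then creates the volume on that node.
volumeBindingMode: WaitForFirstConsumer
```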
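
Once created, a persistent volume stays pinned to its node: the topology that the driver reports in `Volume.accessible_topology` is translated by the external-provisioner into node affinity on the `PersistentVolume` object. A sketch of the relevant excerpt, assuming the volume landed on a node named `node-x` (field values assumed, not taken from a real cluster):

```yaml
# Excerpt of a provisioned PersistentVolume (sketch).
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: pmem-csi.intel.com/node  # topology key used by PMEM-CSI
          operator: In
          values: [ node-x ]
```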

## Prerequisites

### Software required
@@ -160,7 +192,7 @@ The driver does not create persistent memory Regions, but expects Regions to exist

The PMEM-CSI driver implements CSI specification version 1.0.0, which is only supported by Kubernetes versions >= v1.13. The driver deployment in a Kubernetes cluster has been verified on:

| Branch | Kubernetes branch/version | Required Alfa feature-gates |
| Branch | Kubernetes branch/version | Required alpha feature-gates |
|-------------------|--------------------------------|---------------------------- |
| devel | Kubernetes 1.13 | CSINodeInfo, CSIDriverRegistry |
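
On Kubernetes 1.13 these features are alpha and disabled by default; they can typically be enabled by passing `--feature-gates=CSINodeInfo=true,CSIDriverRegistry=true` to the relevant cluster components (the exact mechanism depends on how the cluster was set up).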

42 changes: 42 additions & 0 deletions deploy/kubernetes-1.13/pmem-app-cache.yaml
@@ -0,0 +1,42 @@
apiVersion: apps/v1beta2
kind: ReplicaSet
metadata:
  name: my-csi-app
spec:
  selector:
    matchLabels:
      app: my-csi-app
  replicas: 2
  template:
    metadata:
      labels:
        app: my-csi-app
    spec:
      # make sure that no two Pods run on the same node
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values: [ my-csi-app ]
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: my-frontend
        image: busybox
        command: [ "/bin/sh" ]
        args: [ "-c", "touch /data/$(POD_NAME); sleep 100000" ]
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        volumeMounts:
        - mountPath: "/data"
          name: my-csi-volume
      volumes:
      - name: my-csi-volume
        persistentVolumeClaim:
          claimName: pmem-csi-pvc-cache
11 changes: 11 additions & 0 deletions deploy/kubernetes-1.13/pmem-pvc-cache.yaml
@@ -0,0 +1,11 @@
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pmem-csi-pvc-cache
spec:
  accessModes:
  - ReadWriteMany # cache volumes are multi-node volumes
  resources:
    requests:
      storage: 8Gi
  storageClassName: pmem-csi-sc-cache # defined in pmem-storageclass-cache.yaml
10 changes: 10 additions & 0 deletions deploy/kubernetes-1.13/pmem-storageclass-cache.yaml
@@ -0,0 +1,10 @@
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pmem-csi-sc-cache
provisioner: pmem-csi
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  persistencyModel: cache
  cacheSize: "2"
170 changes: 170 additions & 0 deletions docs/diagrams/sequence-cache.wsd
@@ -0,0 +1,170 @@
@startuml "pmem-csi-cache-sequence-diagram"

title \nDynamic provisioning of pmem-csi "cache" volume\n

skinparam BoxPadding 40

actor Admin as admin #red
actor User as user
entity Kubernetes as k8s
box "Master node"
entity kubelet as masterkubelet
participant "external-provisioner" as provisioner
participant "external-attacher" as attacher
participant "pmem-csi-driver" as masterdriver
endbox

box "Compute node X"
entity kubelet as nodekubeletX
participant "pmem-csi-driver" as nodedriverX
endbox

box "Compute node Y"
entity kubelet as nodekubeletY
participant "pmem-csi-driver" as nodedriverY
endbox

== Driver setup ==
admin->k8s:Label nvdimm nodes: <b>storage=nvdimm</b>
k8s->admin

' deploy driver
admin->k8s:deploy driver\nkubectl create -f pmem-csi.yaml
k8s->admin
k8s->masterkubelet:start driver pod
masterkubelet-->provisioner:start container
masterkubelet-->attacher:start container
masterkubelet-->masterdriver:start container
note right of masterdriver
listen on tcp port 10000
end note
k8s-->nodekubeletX:start driver pod
nodekubeletX-->nodedriverX:start container
note left of nodedriverX
* prepare logical volume groups
* listen on port 10001
* listen on unix socket:
/var/lib/kubelet/plugins/pmem-csi/csi.sock
end note
nodedriverX->masterdriver:RegistryServer.RegisterNodeController(\n{nodeId:"node-x", endpoint:"http://ip:10001"})

k8s-->nodekubeletY:start driver pod
nodekubeletY-->nodedriverY:start container
note left of nodedriverY
* prepare logical volume groups
* listen on port 10001
* listen on unix socket:
/var/lib/kubelet/plugins/pmem-csi/csi.sock
end note
nodedriverY->masterdriver:RegistryServer.RegisterNodeController(\n{nodeId:"node-y", endpoint:"http://ip:10001"})

' install a storage class
admin->k8s:create StorageClass\nkubectl create -f pmem-storageclass-cache.yaml
note left of k8s
metadata:
  name: pmem-csi-sc-cache
volumeBindingMode: <b>Immediate
parameters:
  persistencyModel: cache
  cacheSize: "2"
end note
k8s->admin

' provision a cache volume
== Volume provisioning ==
admin->k8s:create PVC object\nkubectl create -f pmem-pvc-cache.yaml
note left of k8s
metadata:
  name: pmem-csi-pvc-cache
spec:
  storageClassName: pmem-csi-sc-cache
end note
k8s->admin
k8s-->provisioner:<<Event>>\nPersistentVolumeClaim created
activate provisioner
provisioner->masterdriver:CSI.Controller.CreateVolume()
masterdriver->nodedriverX:csi.Controller.CreateVolume()
nodedriverX->nodedriverX:create pmem volume
nodedriverX->masterdriver:success
masterdriver->nodedriverY:csi.Controller.CreateVolume()
nodedriverY->nodedriverY:create pmem volume
nodedriverY->masterdriver:success
masterdriver->provisioner:success
note left of masterdriver
prepare Topology information:
Volume{
  accessible_topology: [
    segments:{ "pmem-csi.intel.com/node":"node-x"},
    segments:{ "pmem-csi.intel.com/node":"node-y"} ]
}
end note
provisioner->k8s:Create PV object
deactivate provisioner

== Volume usage ==
' Start an application
user->k8s:Create application pod
note left of k8s
volumes:
- name: my-csi-volume
  persistentVolumeClaim:
    claimName: pmem-csi-pvc-cache
end note

k8s->user:success

k8s->nodekubeletX:schedules pod on node-x
note right of k8s
Kubernetes might choose <b>node-x</b> or <b>node-y</b>.
end note

k8s-->nodekubeletX:make volume available to pod
nodekubeletX->nodedriverX:csi.Node.StageVolume()
activate nodedriverX
nodedriverX->nodedriverX:mount pmem device
nodedriverX->nodekubeletX:success
deactivate nodedriverX

nodekubeletX->nodedriverX:csi.Node.PublishVolume()
activate nodedriverX
nodedriverX->nodedriverX:bind mount pmem device
nodedriverX->nodekubeletX:success
deactivate nodedriverX

' deprovision a cache volume
== Volume Deletion ==
' stop pod
user->k8s:stop application pod
k8s->user:success
k8s->nodekubeletX:stop pod containers

nodekubeletX->nodedriverX:csi.Node.UnPublishVolume()
activate nodedriverX
nodedriverX->nodedriverX:unmount pod's bind mount
nodedriverX->nodekubeletX:success
deactivate nodedriverX

nodekubeletX->nodedriverX:csi.Node.UnstageVolume()
activate nodedriverX
nodedriverX->nodedriverX:unmount pmem device
nodedriverX->nodekubeletX:success
deactivate nodedriverX

'''''''''''''''''''''''''''
admin->k8s:Delete PVC object\nkubectl delete pvc pmem-csi-pvc-cache
k8s->admin
k8s-->provisioner:<<Event>>\nPersistentVolumeClaim deleted
activate provisioner
provisioner->masterdriver:CSI.Controller.DeleteVolume()
masterdriver->nodedriverX:csi.Controller.DeleteVolume()
nodedriverX->nodedriverX:delete pmem volume
nodedriverX->masterdriver:success
masterdriver->nodedriverY:csi.Controller.DeleteVolume()
nodedriverY->nodedriverY:delete pmem volume
nodedriverY->masterdriver:success
masterdriver->provisioner:success
provisioner->k8s:Delete PV object
deactivate provisioner


@enduml
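
' Note: the PNG referenced from the README (pmem-csi-cache-sequence-diagram.png)
' can presumably be regenerated from this source with the PlantUML tool, e.g.
' "plantuml docs/diagrams/sequence-cache.wsd" (assuming a local PlantUML
' installation; the exact build integration is not shown in this commit).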
