这段时间工作中用到了Docker以及Kubernetes（简称K8S），现在整理下我学习Docker以及K8S过程中看的一些比较好的资料，方便自己回顾，也希望能给容器小白一些帮助。给自己定一个小目标，二月底之前完成。

这是本系列的第五篇文章，将介绍Docker的原理之Namespace。(整理自：http://coolshell.cn/articles/17010.html)

docker容器本质上是宿主机上的进程。Docker 通过namespace实现了资源隔离。

一、简介

Linux Namespace是linux提供的一种内核级别环境隔离的方法。不知道你是否还记得很早以前的Unix有一个叫chroot的系统调用（通过修改根目录把用户jail到一个特定目录下），chroot提供了一种简单的隔离模式：chroot内部的文件系统无法访问外部的内容。Linux Namespace在此基础上，提供了对UTS、IPC、mount、PID、network、User等的隔离机制。

举个例子，我们都知道，Linux下的超级父亲进程的PID是1，所以，同chroot一样，如果我们可以把用户的进程空间jail到某个进程分支下，并像chroot那样让其下面的进程看到的那个超级父进程的PID为1，于是就可以达到资源隔离的效果了（不同的PID namespace中的进程无法看到彼此）

šLinux Namespace 有如下种类，官方文档在这里《Namespace in Operation》

分类	系统调用参数	相关内核版本
Mount namespaces	CLONE_NEWNS	Linux 2.4.19
šUTS namespaces	CLONE_NEWUTS	Linux 2.6.19
IPC namespaces	CLONE_NEWIPC	Linux 2.6.19
PID namespaces	CLONE_NEWPID	Linux 2.6.24
Network namespaces	CLONE_NEWNET	始于Linux 2.6.24 完成于 Linux 2.6.29
User namespaces	CLONE_NEWUSER	始于 Linux 2.6.23 完成于 Linux 3.8)

UTS: hostname
IPC: 进程间通信
PID: "chroot"进程树
NS: 挂载点，首次登陆Linux
NET: 网络访问，包括接口
USER: 将本地的虚拟user-id映射到真实的user-id

主要是š三个系统调用

šclone() – 实现线程的系统调用，用来创建一个新的进程，并可以通过设计上述参数达到隔离。
šunshare() – 使某进程脱离某个namespace
šsetns() – 把某进程加入到某个namespace

二、几种Namespace

clone()系统调用

首先，我们来看一下一个最简单的clone()系统调用的示例，（后面，我们的程序都会基于这个程序做修改）：

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>
/* 定义一个给 clone 用的栈，栈大小1M */
#define STACK_SIZE (1024 * 1024)
staticcharcontainer_stack[STACK_SIZE];
char* constcontainer_args[] = {
    "/bin/bash",
    NULL
};
intcontainer_main(void* arg)
{
    printf("Container - inside the container!\n");
    /* 直接执行一个shell，以便我们观察这个进程空间里的资源是否被隔离了 */
    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return1;
}
intmain()
{
    printf("Parent - start a container!\n");
    /* 调用clone函数，其中传出一个函数，还有一个栈空间的（为什么传尾指针，因为栈是反着的） */
    intcontainer_pid = clone(container_main, container_stack+STACK_SIZE, SIGCHLD, NULL);
    /* 等待子进程结束 */
    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return0;
}

从上面的程序，我们可以看到，这和pthread基本上是一样的玩法。但是，对于上面的程序，父子进程的进程空间是没有什么差别的，父进程能访问到的子进程也能。

下面，让我们来看几个例子看看，Linux的Namespace是什么样的。

UTS Namespace

下面的代码，我略去了上面那些头文件和数据结构的定义，只有最重要的部分。

intcontainer_main(void* arg)
{
    printf("Container - inside the container!\n");
    sethostname("container",10); /* 设置hostname */
    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return1;
}
intmain()
{
    printf("Parent - start a container!\n");
    intcontainer_pid = clone(container_main, container_stack+STACK_SIZE,
            CLONE_NEWUTS | SIGCHLD, NULL); /*启用CLONE_NEWUTS Namespace隔离 */
    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return0;
}
 

运行上面的程序你会发现（需要root权限），子进程的hostname变成了 Container。

hchen@ubuntu:~$ sudo./uts
Parent - start a container!
Container - inside the container!
root@container:~# hostname
container
root@container:~# uname -n
container

IPC Namespace

IPC全称 Inter-Process Communication，是Unix/Linux下进程间通信的一种方式，IPC有共享内存、信号量、消息队列等方法。所以，为了隔离，我们也需要把IPC给隔离开来，这样，只有在同一个Namespace下的进程才能相互通信。如果你熟悉IPC的原理的话，你会知道，IPC需要有一个全局的ID，即然是全局的，那么就意味着我们的Namespace需要对这个ID隔离，不能让别的Namespace的进程看到。

要启动IPC隔离，我们只需要在调用clone时加上CLONE_NEWIPC参数就可以了。

intcontainer_pid = clone(container_main, container_stack+STACK_SIZE,
            CLONE_NEWUTS | CLONE_NEWIPC | SIGCHLD, NULL);

首先，我们先创建一个IPC的Queue（如下所示，全局的Queue ID是0）

hchen@ubuntu:~$ ipcmk -Q
Message queue id: 0
hchen@ubuntu:~$ ipcs -q
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xd0d56eb2 0          hchen      644        0            0

如果我们运行没有CLONE_NEWIPC的程序，我们会看到，在子进程中还是能看到这个全启的IPC Queue。

hchen@ubuntu:~$ sudo./uts
Parent - start a container!
Container - inside the container!
root@container:~# ipcs -q
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xd0d56eb2 0          hchen      644        0            0

但是，如果我们运行加上了CLONE_NEWIPC的程序，我们就会下面的结果：

root@ubuntu:~$ sudo./ipc
Parent - start a container!
Container - inside the container!
root@container:~/linux_namespace# ipcs -q
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

我们可以看到IPC已经被隔离了。

PID Namespace

我们继续修改上面的程序：

intcontainer_main(void* arg)
{
    /* 查看子进程的PID，我们可以看到其输出子进程的 pid 为 1 */
    printf("Container [%5d] - inside the container!\n", getpid());
    sethostname("container",10);
    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return1;
}
intmain()
{
    printf("Parent [%5d] - start a container!\n", getpid());
    /*启用PID namespace - CLONE_NEWPID*/
    intcontainer_pid = clone(container_main, container_stack+STACK_SIZE,
            CLONE_NEWUTS | CLONE_NEWPID | SIGCHLD, NULL);
    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return0;
}

运行结果如下（我们可以看到，子进程的pid是1了）：

hchen@ubuntu:~$ sudo./pid
Parent [ 3474] - start a container!
Container [    1] - inside the container!
root@container:~# echo $$
1

你可能会问，PID为1有个毛用啊？我们知道，在传统的UNIX系统中，PID为1的进程是init，地位非常特殊。他作为所有进程的父进程，有很多特权（比如：屏蔽信号等），另外，其还会为检查所有进程的状态，我们知道，如果某个子进程脱离了父进程（父进程没有wait它），那么init就会负责回收资源并结束这个子进程。所以，要做到进程空间的隔离，首先要创建出PID为1的进程，最好就像chroot那样，把子进程的PID在容器内变成1。

但是，我们会发现，在子进程的shell里输入ps,top等命令，我们还是可以看得到所有进程。说明并没有完全隔离。这是因为，像ps, top这些命令会去读/proc文件系统，所以，因为/proc文件系统在父进程和子进程都是一样的，所以这些命令显示的东西都是一样的。

所以，我们还需要对文件系统进行隔离。

Mount Namespace

下面的例程中，我们在启用了mount namespace并在子进程中重新mount了/proc文件系统。

intcontainer_main(void* arg)
{
    printf("Container [%5d] - inside the container!\n", getpid());
    sethostname("container",10);
    /* 重新mount proc文件系统到 /proc下 */
    system("mount -t proc proc /proc");
    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return1;
}
intmain()
{
    printf("Parent [%5d] - start a container!\n", getpid());
    /* 启用Mount Namespace - 增加CLONE_NEWNS参数 */
    intcontainer_pid = clone(container_main, container_stack+STACK_SIZE,
            CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return0;
}

运行结果如下：

hchen@ubuntu:~$ sudo./pid.mnt
Parent [ 3502] - start a container!
Container [    1] - inside the container!
root@container:~# ps -elf
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
4 S root         1     0  0  80   0 -  6917 wait   19:55 pts/2    00:00:00/bin/bash
0 R root        14     1  0  80   0 -  5671 -      19:56 pts/2    00:00:00 ps-elf

上面，我们可以看到只有两个进程，而且pid=1的进程是我们的/bin/bash。我们还可以看到/proc目录下也干净了很多：

root@container:~# ls /proc
1          dma          key-users  net            sysvipc
16         driver       kmsg        pagetypeinfo   timer_list
acpi       execdomains  kpagecount  partitions     timer_stats
asound     fb           kpageflags  sched_debug    tty
buddyinfo  filesystems  loadavg     schedstat      uptime
bus        fs           locks       scsi           version
cgroups    interrupts   mdstat      self           version_signature
cmdline    iomem        meminfo     slabinfo       vmallocinfo
consoles   ioports      misc        softirqs       vmstat
cpuinfo    irq          modules     stat           zoneinfo
crypto     kallsyms     mounts      swaps
devices    kcore        mpt         sys
diskstats  keys         mtrr        sysrq-trigger

下图，我们也可以看到在子进程中的top命令只看得到两个进程了。

这里，多说一下。在通过CLONE_NEWNS创建mount namespace后，父进程会把自己的文件结构复制给子进程中。而子进程中新的namespace中的所有mount操作都只影响自身的文件系统，而不对外界产生任何影响。这样可以做到比较严格地隔离。