KMview module interface specifications
From VirtualSquare
KMview Kernel Module provides a general system call tracing/modification interface. Here is the technical description of the interface specifications.
kmview_module interface
-----------------------
Renzo Davoli and Andrea Gasparini
Version: 2 (2010)
kmview_module has been designed as a kernel support for view-os on linux but
it is effectively an efficient support for any virtualization based on
system call interception and transformation.
kmview_module could be effectively used also by security tools based on
system call interposition.
There are two main entities in kmview: tracer and traced processes.
A tracer process cannot trace itself but it can be a traced process of
another tracer.
All the traced processes are in the offspring of their tracer,
when a process is traced there is no way to exit from the control
of the tracer.
A tracer process first open a read only connection to /dev/kmview
(major=10,minor=233 - officially assinged)
fd=open("/dev/kmview",O_RDONLY);
Before starting its first traced process, the tracer can set some flags
to set some extra features in this way:
ioctl(fd, KMVIEW_SET_FLAGS, flags);
This ioctl must be called when there are no traced processes otherwise
it returns EACCES (to prevent inconsistencies).
A "root" traced process is started in this way:
if (fork() == 0) {
ioctl(fd, KMVIEW_ATTACH);
close(fd);
..... code of the traced process, e.g. exec of some program
}
The root traced process must register itself as a traced process and close
the tracing file. If the traced process forks (or clones) other processes
they will be traced, too. No further direct interaction will take place
between the traced process and their tracer.
If a tracer dies (or it closes the fd) all the traced processes will be
killed (SIGKILL).
A tracer receive all events related to its traced processes using
a "read" or by the magicpoll technique (see over).
The received data follows the struct kmview_event specification:
struct kmview_event {
unsigned long tag;
union {
.... data for specific events ...
}
}
There are four basic events identified by the following tags:
KMVIEW_EVENT_NEWTHREAD: a new traced thread/process has just started
KMVIEW_EVENT_TERMTHREAD: a new traced thread/process terminated
KMVIEW_EVENT_SYSCALL_ENTRY: a traced process started a syscall
KMVIEW_EVENT_SYSCALL_EXIT: a syscall for a traced process completed its
execution.
In order to provide a fast interaction between kernel, module and tracer
each layer keeps its own id for processes.
The kernel identifies each process by its pid, the module has its own
identifier named kmpid and the tracer can use its own identified, the
umpid. (km stands for kernel-mode, um stands for user-mode).
In this way each layer can use its identifier as an index within an array:
there is not any waste of time to scan into tables or waste of code to
keep hash tables. Technically speaking the whole system scales as O(1)
(no extra costs related to the number of processes).
All the events reported by the module to the tracer carry the umpid, (except
KMVIEW_EVENT_NEWTHREAD). All the requests sent (through ioctl) from the
tracer to the module carry the kmpid. If a tracer tries to send an ioctl
for a process handled by another tracer it gets an error (EPERM).
(a process handled by several nested tracers has a different kmpid for
each tracer).
Basic Events:
------------------------------------------------------------------------------
KMVIEW_EVENT_NEWTHREAD:
struct kmview_event_newthread{
pid_t kmpid;
pid_t pid;
pid_t umppid;
unsigned long flags;
} newthread;
A new thread has just started. The tracer must store its kmpid.
umppid is the umpid of the parent (forking/cloning) process: the tracer can
use this field to keep trace of the hierarchy in its data structures.
flags are the cloning flags (as described in clone(2)).
Before reading other events the tracer must send the umpid of this new
thread to the module in this way:
struct kmview_ioctl_umpid {
pid_t kmpid;
pid_t umpid;
};
ioctl(fd, KMVIEW_UMPID, & {struct kmview_ioctl_umpid var} );
If a tracer wants to use pid or kmpid instead of having its own identifiers
it should copy pid or kmpid respectively to the umpid field.
------------------------------------------------------------------------------
KMVIEW_EVENT_TERMTHREAD:
struct kmview_event_termthread{
pid_t umpid;
unsigned long remaining;
} termthread;
The process/thread identified by umpid terminated. No further event will be
reported for that process/thread.
The field remaining contains the overall number of processes handled by
this tracer. Many tracers shut down when remaining==0;
------------------------------------------------------------------------------
KMVIEW_EVENT_SYSCALL_ENTRY:
struct kmview_event_ioctl_syscall{
union {
pid_t umpid;
pid_t kmpid;
unsigned long just_for_64bit_alignment;
} x;
unsigned long scno;
unsigned long args[6];
unsigned long pc;
unsigned long sp;
} syscall;
(forget the just_for_64bit_alignment field, it is not used.)
x.umpid is the umpid identifier, scno is the syscall number, args are
the arguments, pc is the program counter and sp the stack pointer.
When the tracer receive the event the traced process is
quiescent (as defined in utrace): it is waiting in a state very close to
the user state. While a kmview traced process is quiescent the process
can be restarted by its tracer or killed.
There are three different ways to restart a quiescent process for
a KMVIEW_EVENT_SYSCALL_ENTRY event:
1- KMVIEW_SYSRESUME:
ioctl(fd,KMVIEW_SYSRESUME,kmpid).
the syscall gets retarted as is. The tracer will not receive any
KMVIEW_EVENT_SYSCALL_EXIT event for this call.
2- KMVIEW_SYSVIRTUALIZED:
ioctl(fd,KMVIEW_SYSVIRTUALIZED,& {struct kmview_event_ioctl_sysreturn var})
struct kmview_event_ioctl_sysreturn{
union {
pid_t umpid;
pid_t kmpid;
unsigned long just_for_64bit_alignment;
} x;
long retval;
long errno;
} sysreturn;
the call has been virtualized. This system call will not be executed by
the linux kernel. Kmpid must be set and the return value, errno will be
those specified here.
The tracer will not receive any KMVIEW_EVENT_SYSCALL_EXIT event for this call.
3- KMVIEW_SYSMODIFIED/KMVIEW_SYSARGMOD:
ioctl(fd,KMVIEW_SYSMODIFIED,& {struct kmview_event_ioctl_syscall var})
ioctl(fd,KMVIEW_SYSARGMOD,& {struct kmview_event_ioctl_syscall var})
The call may have been modified. The kernel will execute the syscall
(maybe a different one) as stated by the registers.
Registers, scno, will be changed.
KMVIEW_SYSMODIFIED call causes a KMVIEW_EVENT_SYSCALL_EXIT event after
the syscall execution (to restore the original values of registers if needed).
After a KMVIEW_SYSARGMOD there is not a KMVIEW_EVENT_SYSCALL_EXIT notification.
------------------------------------------------------------------------------
KMVIEW_EVENT_SYSCALL_EXIT:
struct kmview_event_ioctl_sysreturn syscall;
With this event the tracer can get the result (return value or error)
of the syscall executed by the linux kernel.
The traced process is quiescent when the tracer receives the event.
To restart the process the tracer can use KMVIEW_SYSRESUME or
with the same syntax described above for KMVIEW_EVENT_SYSCALL_ENTRY or
KMVIEW_SYSRETURN:
ioctl(fd,KMVIEW_SYSRETURN,& {struct kmview_event_ioctl_sysreturn var})
by this latter call, return value and errno can be changed.
------------------------------------------------------------------------------
A minimal tracer appear as follows:
#include <stdio,h>
#include <fcntl.h>
#include <kmview.h>
void dowait(int signal)
{
int w;
wait(&w);
}
main(int argc, char *argv[])
{
int fd;
struct kmview_event event;
fd=open("/dev/kmview",O_RDONLY);
signal(SIGCHLD,dowait);
if (fork()) {
while (1) {
read(fd,&event,sizeof(event));
switch (event.tag) {
case KMVIEW_EVENT_NEWTHREAD:
{
struct kmview_ioctl_umpid ump;
printf("new process %d\n",event.x.newthread.pid);
ump.kmpid=event.x.newthread.kmpid;
/* we use umpid == kmpid */
ump.umpid=event.x.newthread.kmpid;
ioctl(fd, KMVIEW_UMPID, &ump);
break;
}
case KMVIEW_EVENT_TERMTHREAD:
printf("Terminated proc %d (%d left)\n",
event.x.termthread.umpid,
event.x.termthread.remaining);
if (event.x.termthread.remaining == 0)
exit (0);
break;
case KMVIEW_EVENT_SYSCALL_ENTRY:
printf("Syscall %d->%d\n",
event.x.syscall.x.umpid,
event.x.syscall.scno);
ioctl(fd, KMVIEW_SYSRESUME, event.x.syscall.x.umpid);
break;
}
}
} else { /* traced root process*/
ioctl(fd, KMVIEW_ATTACH);
close(fd);
argv++;
execvp(argv[0],argv);
}
}
------------------------------------------------------------------------------
DATA TRANSFER
The tracer can transfer data from/to the traced process by ioctl:
struct kmview_ioctl_data {
pid_t kmpid;
long addr;
int len;
void *localaddr;
};
ioctl(fd,KMVIEW_READDATA,& {struct kmview_ioctl_data var})
ioctl(fd,KMVIEW_READSTRINGDATA,& {struct kmview_ioctl_data var})
ioctl(fd,KMVIEW_WRITEDATA,& {struct kmview_ioctl_data var})
KMVIEW_READDATA copy 'len' bytes from the traced process address 'addr' to
the tracer address 'localaddr';
KMVIEW_READSTRINGDATA copy at most 'len' bytes from the traced process address
'addr' to the tracer address 'localaddr'. if stops copying after
the first null byte transferred.
KMVIEW_WRITEDATA copy 'len' bytes from the tracer address 'localaddr' 'to
traced process address 'addr'.
------------------------------------------------------------------------------
MAGICPOLL
When a tracer is a virtual machine monitor (an hypervisor, lending the word
from xen), often it does not keep waiting on a read as there are many
source of events, file descriptors or signals.
If the hypervisor uses a ppoll it can wake up as soon as something happens.
Unfortunately this means that for the standard virtualization cycle it
needs two context switches to get the system call (or other event) from
the kmview module.
It the tracer sends the address of a buffer by the KMVIEW_MAGICPOLL ioctl,
any select/poll like call will direcly tranfer the event to the buffer,
thus when the return value of poll/select returns the availability of data
for reading (e.g. POLLIN for poll), the data is already in the buffer.
This reduces the number of context switches as there is no need to call
read.
In order to further decrease the number of context switches per system
call, it is possible to use an array of struct kmview_event as a
magicpoll buffer.
If there are several pending events (at most one per traced process),
the kmview module fills in several elements of the array.
The magic poll ioctl is the following one:
struct kmview_magicpoll {
long magicpoll_addr;
long magicpoll_cnt;
};
ioctl(fd,KMVIEW_MAGICPOLL,& {struct kmview_magicpoll var} );
magicpoll_addr is the address of the buffer, magicpoll_cnt is the number of
elements in the array.
When poll returns either the array is full of pending events or
the array element after the last significant pending event is tagged as
KMVIEW_EVENT_NONE (0).
------------------------------------------------------------------------------
KMVIEW_FLAG_SOCKETCALL
Linux supports the Berkeley socket interface, but in many architectures
instead of defining several different system calls (e.g. one for socket(2),
one for connect, listen, accept etc.) is has just one system call
(__NR_socketcall) with two parameters: the number of the call
(as defined in /usr/include/linux/net.h) and a pointer to the
array of parameters.
It is the case of several widely used architectures like i386 or powerpc.
Other architectures like x86_64, ia64 or alpha, has several system calls
one for each Berkeley socket call.
To speed up the virtualization on architectures with __NR_socketcall,
kmview provides the KMVIEW_FLAG_SOCKETCALL option.
When KMVIEW_FLAG_SOCKETCALL flag is set by KMVIEW_SET_FLAGS, the tracer
receives a KMVIEW_EVENT_SOCKETCALL_ENTRY instead of the event
KMVIEW_EVENT_SYSCALL_ENTRY when the system call __NR_socketcall is
starting.
The kmview_event_socketcall structure is the following one:
struct kmview_event_socketcall{
union {
pid_t umpid;
unsigned long just_for_64bit_alignment;
} x;
unsigned long scno;
unsigned long args[6];
unsigned long pc;
unsigned long sp;
unsigned long addr;
} socketcall;
scno is the number of the socket call (the number of the call listed in
/usr/include/linux/net.h), args are the socket call args, pc and sp as
usual, the final addr is the address of the argument array.
This prevents the hypervisor from spending one more context switch to grab
the parameters.
The system call can be restarted by KMVIEW_SYSRESUME, KMVIEW_SYSVIRTUALIZED.
Currently the module does not provide any KMVIEW_FLAG_SOCKETMODIFIED call:
it is possible to use KMVIEW_FLAG_SYSMODIFIED to start another system call
instead of a socket call, parameters of the same socket call can be changed
by rewriting them using addr.
Changing a socket call with another is rare, and must be done by hand
carefully as the buffer pointed by addr can have not enough space to fit
the new arguments.
------------------------------------------------------------------------------
KMVIEW_FLAG_FDSET
There are several system calls that have a file descriptor in the
first argument. It is for example the case of read, write, fstat etc.
When a virtual machine monitor virtualize just some files it is useless
to notify the tracer/monitor for fd relatied to non virtualized files.
When KMVIEW_FLAG_FDSET is set by KMVIEW_SET_FLAGS ioctl,
all fd related calls are not notified to the tracer by default.
The tracer can add and delete file descriptors to the set of
traced fd by using the following calls:
struct kmview_fd {
pid_t kmpid;
int fd;
};
ioctl(fd,KMVIEW_ADDFD,& {struct kmview_fd var});
ioctl(fd,KMVIEW_DELFD,& {struct kmview_fd var});
The tracer receives system call (and socket call) events only when
the first argument file descriptor belongs to the set of traced fd.
The set of traced fd is automatically inherited during a clone/fork of
a traced process.
There are two flags that can be specified together with KMVIEW_FLAG_FDSET:
KMVIEW_FLAG_EXCEPT_CLOSE and KMVIEW_FLAG_EXCEPT_FCHDIR.
These are notable exception that the tracer may specify.
When KMVIEW_FLAG_EXCEPT_CLOSE, the close system call (as well as shutdown)
always cause an event to the tracer. KMVIEW_FLAG_EXCEPT_FCHDIR has the
same effect for fchdir.
These two calls cause a change of the system state: some virtual machine
monitors (like our kmview) need to trace these changes.
------------------------------------------------------------------------------
KMVIEW_FLAG_PATH_SYSCALL_SKIP
When this flag is set (without any further configuration), all the system
calls involving pathnames (like open, lstat or link) are not forwarded
to the virtual machine monitor (tracer).
This call minimizes the amount of messages to the virtual machine monitor
when there are no active virtualizations of the file system.
It is possible to define exceptions for processes or for subtrees
(prefixes for absolute pathnames)
When the tracer needs to receive the notification of the path system calls
requested by a process, it runs the followin ioctl request:
ioctl(fd,KMVIEW_SET_CHROOT,kmpid)
This exception is inherited during a clone/fork of a traced process and can be
undefined by:
ioctl(fd,KMVIEW_CLR_CHROOT,kmpid)
Kmview manages up to 64 prefixes for ghost mountpoints. A ghost mount is a
very fast virtualization for file systems. Ghost mounted filesystems can be
reached only by absolute pathnames. In fact, relative pathnames and symbolic
links cannot be resolved by the kmview kernel module.
The profixes must be stored in struct ghosthash64:
#define GH_SIZE 64
#define GH_TERMINATE 255
#define GH_DUMMY 254
struct ghosthash64 {
unsigned char deltalen[GH_SIZE];
unsigned int hash[GH_SIZE];
};
all ghost mount pathnames get converted into hash signatures and stored in
the hash array sorted by increasing pathname length. Each element of deltalen
contains the difference betweeen the current element and the previous one.
When a ghosthash64 contains less than 64 ghost mount hash values, GH_TERMINATE
is stored in the deltalen element after the last (longest) element as
a termination tag. The value of deltalen elements cannot exceed GH_DUMMY,
fake dummy elements must be added when needed.
Kmview uses the following hash signature:
signature = 0
for each char in the path:
signature = signature ^ ((signature << 5) + (signature >> 2) + char)
The following ioctl loads a new array of hash values in the kenrnel module:
ioctl(fd.KMVIEW_GHOSTMOUNTS,&gh);
where gh is a struct ghosthash64 variable.
e.g.:
struct ghosthash64 mygh={{2,2,6,GH_TERMINATE},{0x633,0x193694,0x5f710af7}};
ioctl(fd.KMVIEW_GHOSTMOUNTS,&mygh);
force the kmview module to forward to the tracer all the syscall using
absolute path beginning by '/1','/1/2' or '/3/4567890'.
In fact, the first path is 2 characters long, the second 4 (two more than the
previous one), the third is 10 (+6).
0x633,0x193694,0x5f710af7 are the three hash values computed by the function
above.
The file ghosthash.example.c in the source tree is an example of library
for the management of ghosthash64 data structures.
Please note that this technique may generate false positives (although
it happens very rarely: 1/2^32), but it is very fast to select which
system calls must be forwarded to the tracer.
------------------------------------------------------------------------------
KMVIEW_SYSCALLBITMAP (ioctl)
This ioctl selects the system calls for the tracer.
int bitmap[INT_PER_MAXSYSCALL];
ioctl(fd.KMVIEW_SYSCALLBITMAP,bitmap);
Each bit of bitmap corresponds to a specific system call, when a bit is
set in the bitmap, the system call is *not* forwarded to the tracer.
In other word this bitmap encodes the system calls which are *not* useful
for the tracer.
Kmview.h file includes some inline functions to handle system call bitmaps:
Initialize bitmaps (all ones or all zeros)
static inline void scbitmap_fill(unsigned int *bitmap);
static inline void scbitmap_zero(unsigned int *bitmap);
Add/delete a bit in a bitmap
static inline void scbitmap_set(unsigned int *bitmap,int scno);
static inline void scbitmap_clr(unsigned int *bitmap,int scno);
Test if a bit in a bitmap is set:
static inline unsigned int scbitmap_isset(unsigned int *bitmap,int scno);
(C) 2007 Renzo Davoli and Andrea Gasparini.
updated 2009 Renzo Davoli
updated 2010 Renzo Davoli
Document released under the CC Attribution ShareAlike license.

