
Nastya flog2 #20

Open
anastasiamarkina wants to merge 249 commits into cyrillos:criu-dev from anastasiamarkina:nastya-flog2

Conversation

@anastasiamarkina

No description provided.

xemul and others added 30 commits May 16, 2019 03:24
So, here's the enhanced version of the first try.

Changes are:

1. The wrapper name is criu-ns instead of crns.py
2. The CLI is absolutely the same as for criu, since the script
   re-execl-s criu binary. E.g.
	   scripts/criu-ns dump -t 1234 ...
   just works
3. Caller doesn't need to care about substituting CLI options,
   instead, the script analyzes the command line and
   a) replaces -t|--tree argument with virtual pid __if__ the
      target task lives in another pidns
   b) keeps the current cwd (and root) __if__ switches to another
      mntns. A limitation applies here -- cwd path should be the
      same in target ns, no "smart path mapping" is performed. So
      this script is for now only useful for mntns clones (which
      is our main goal at the moment).

Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Looks-good-to: Andrey Vagin <avagin@openvz.org>
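
The -t|--tree rewrite described in item 3a can be sketched as follows. This is not the criu-ns code: `virtual_pid` and `rewrite_tree_arg` are hypothetical helpers, and the sketch assumes the innermost pid of the target is taken from the last entry of the `NSpid:` line in /proc/<pid>/status.

```python
def virtual_pid(status_text):
    """Return the innermost pid from the NSpid line of /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            # Entries run from the outermost to the innermost pid namespace.
            return int(line.split()[-1])
    raise ValueError("no NSpid line found")


def rewrite_tree_arg(argv, to_virtual):
    """Return a copy of argv with the -t/--tree value mapped through to_virtual."""
    out = list(argv)
    for i, arg in enumerate(out):
        if arg in ("-t", "--tree") and i + 1 < len(out):
            out[i + 1] = str(to_virtual(int(out[i + 1])))
        elif arg.startswith("--tree="):
            out[i] = "--tree=" + str(to_virtual(int(arg.split("=", 1)[1])))
    return out
```

A wrapper would call `rewrite_tree_arg(sys.argv[1:], ...)` only when the target lives in another pidns, then exec the criu binary with the result; all other options pass through unchanged.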
This is the case when the in/out files are image cache/proxy sockets.

Signed-off-by: Rodrigo Bruno <rbruno@gsd.inesc-id.pt>
Signed-off-by: Katerina Koukiou <k.koukiou@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
This patch introduces the --remote option and the necessary code changes to
support it. This leaves the user the option to decide whether the checkpoint data is to
be stored on disk or sent through the network (through the image-proxy).
The latter forwards the data to the destination node where image-cache
receives it.

The overall communication is performed as follows:
src_node: CRIU dump    -> (sends images through UNIX sockets)    -> image-proxy
                                                                         |
                                                                         V
dst_node: CRIU restore <- (receives images through UNIX sockets) <- image-cache

Communication between image-proxy and image-cache is done through a single
TCP connection.

Running criu with --remote option is like this:

dst_node# criu image-cache -d --port <port> -o /tmp/image-cache.log
dst_node# criu restore --remote -o /tmp/image-cache.log
src_node# criu image-proxy -d --port <port> --address <dst_node> -o /tmp/image-proxy.log
src_node# criu dump -t <pid> --remote -o /tmp/dump.log

    [ xemul:
here's the list of what should be done with the cache/proxy
in order to have them merged into master.

0. Document the whole thing :)
   Please, add articles for newly introduced actions and options to
   https://criu.org/CLI page.
   Also, it would be good to have an article describing the protocols
   involved.

1. Make the unix sockets reside in work-dir.
   The good thing is that we've got rid of the socket name option :)
   But looking at do_open_remote_image() I see that it fchdir-s to
   image dir before connecting to proxy/cache. Better solution is to
   put the socket into workdir.

   1a. After this the option -D|--images-dir should become optional.
       Provided the --remote is given CRIU should work purely on the
       work-dir and not generate anything in the images-dir.

2. Tune up the image_cache and image_proxy commands to accept the
   --status-fd and --pidfile options.
   Presumably the very cr_daemon() call should be equipped with
   everything that should be done for daemonizing and proxy/cache
   tasks should just call it :)

3. Fix local connections not to generate per-image threads. There
   can be many images and it's not nice to stress the system with
   such amount of threads. Please, look at how criu/uffd.c manages
   multiple descriptors with page-faults using the epoll stuff.

   3a. The accept_remote_image_connections() seems not to work well
       with opts.ps_socket scenario as the former just calls accept()
       on whatever socket is passed there, while the opts.ps_socket
       is already an established socket for data transfer.

4. No strings in protocol. Now the hard-coded "RESTORE_FINISH" string
   (and DUMP_FINISHED one) is used to terminate the communication.
   Need to tune up the protobuf objects to send boolean (or integer)
   EOF sign rather than the string.

5. Check how proxy/cache works with incremental dumps. Looking at the
   skip_remote_bytes() I think that image-cache and -proxy still do not
   work well with stacked pages images. Probably for those we'll need
   the page-server or lazy-pages -like protocol that would request the
   needed regions and receive it back rather than read bytes from
   sockets simply to skip those.

6. Add support for cache/proxy into go-phaul code. I haven't yet finished
   with the prototype, but plan to do it soon, so once the above steps
   are done we'll be able to proceed with this one.

]

Signed-off-by: Rodrigo Bruno <rbruno@gsd.inesc-id.pt>
Signed-off-by: Katerina Koukiou <k.koukiou@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
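
Item 4 of the list above (an integer EOF sign instead of a terminating string) can be illustrated with a minimal framing sketch. It uses `struct` as a stand-in for the protobuf objects; the header layout and the names `pack_frame` and `unpack_frames` are assumptions made for illustration, not the CRIU wire format.

```python
import struct

# Hypothetical header: payload length (uint32) and an EOF flag (uint8),
# both in network byte order.
HEADER = struct.Struct("!IB")


def pack_frame(payload=b"", eof=False):
    """Serialize one frame; the EOF flag travels in the header, not the payload."""
    return HEADER.pack(len(payload), 1 if eof else 0) + payload


def unpack_frames(data):
    """Yield (payload, eof) pairs and stop after the frame whose EOF flag is set."""
    off = 0
    while off < len(data):
        length, eof = HEADER.unpack_from(data, off)
        off += HEADER.size
        payload = data[off:off + length]
        off += length
        yield payload, bool(eof)
        if eof:
            return
```

With the flag carried out of band, a payload that happens to contain the bytes "RESTORE_FINISH" is ordinary data and cannot end the session by accident.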
The current patch brings the implementation of the image proxy and image cache.
These components are necessary to perform in-memory live migration of processes
using CRIU. The image proxy receives images from CRIU Dump/Pre-Dump (through
UNIX sockets) and forwards them to the image cache (through a TCP socket). The
image cache caches images in memory and sends them to CRIU Restore (through
UNIX sockets) when requested.

Signed-off-by: Rodrigo Bruno <rbruno@gsd.inesc-id.pt>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Rodrigo Bruno <rbruno@gsd.inesc-id.pt>
Signed-off-by: Katerina Koukiou <k.koukiou@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
To suppress protobuf's warning:
> [libprotobuf WARNING google/protobuf/compiler/parser.cc:546]
> No syntax specified for the proto file: remote-image.proto.
> Please use 'syntax = "proto2";' or 'syntax = "proto3";'
> to specify a syntax version. (Defaulted to proto2 syntax.)

Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Rodrigo Bruno <rbruno@gsd.inesc-id.pt>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: rodrigo-bruno <rbruno@gsd.inesc-id.pt>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: rodrigo-bruno <rbruno@gsd.inesc-id.pt>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
I see no need to do dynamic init here.

Cc: Rodrigo Bruno <rbruno@gsd.inesc-id.pt>
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
OK, so we have pr_perror() for cases where errno is set (and it makes
sense to show it), and pr_err() for other errors. A correct function
is to be used, depending on the context.

1. pthread_mutex_*() functions don't set errno, therefore pr_perror()
   should not be used.

2. accept() sets errno => makes sense to use pr_perror().

3. read_header() arguably sets errno => use pr_err().

4. open_proc_rw() already prints an error message, there is no need
   for yet another one.

Cc: Rodrigo Bruno <rbruno@gsd.inesc-id.pt>
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
1. Use xmalloc() where possible.

2. There is no need to print an error message, as xmalloc()
   has already printed it for you.

Cc: Rodrigo Bruno <rbruno@gsd.inesc-id.pt>
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
In those error paths where we don't have errno set,
don't use pr_perror(), use pr_err() instead.

Cc: Rodrigo Bruno <rbruno@gsd.inesc-id.pt>
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
When --remote option is specified, read_local_page tries to pread from a
socket, and fails with "Illegal seek" error.
Restore single pread call for regular image files case and introduce
maybe_read_page_img_cache version of maybe_read_page method.

Generally-approved-by: Rodrigo Bruno <rbruno@gsd.inesc-id.pt>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
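
The "Illegal seek" failure is easy to reproduce outside CRIU: positional reads are not defined for sockets, so pread fails with ESPIPE. A minimal sketch (the helper name `pread_errno` is hypothetical):

```python
import os


def pread_errno(fd, length=1, offset=0):
    """Return the errno raised by os.pread on fd, or None if the read succeeds."""
    try:
        os.pread(fd, length, offset)
    except OSError as exc:
        return exc.errno
    return None
```

On a regular file the helper returns None; on one end of a `socket.socketpair()` it returns `errno.ESPIPE`, whose message is "Illegal seek". That is why a socket-backed image needs a sequential read path instead of pread.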
There is no real need to have both.

Signed-off-by: Omri Kramer <omri.kramer@gmail.com>
Signed-off-by: Lior Fisch <fischlior@gmail.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
It's simply impossible (yet), so emit a warning.

Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The opts.remote is always false in this code.

Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We have two places to check for parent via page server -- as
a part of _OPEN req and explicit req. Make the latter code
be in-sync with the opening one.

Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Those may not support sendfiles, so use read/write-s instead

Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
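
The general idea of preferring plain read/write where sendfile is unsupported can be sketched like this. It is a Linux-oriented illustration under stated assumptions, not the code from the patch: `copy_fd` is a hypothetical helper, and treating EINVAL/ENOSYS as "sendfile unsupported for this descriptor" is an assumption.

```python
import errno
import os


def copy_fd(out_fd, in_fd, count, chunk=65536):
    """Copy up to count bytes from in_fd to out_fd; return the bytes copied."""
    copied = 0
    use_sendfile = hasattr(os, "sendfile")
    while copied < count:
        want = min(chunk, count - copied)
        if use_sendfile:
            try:
                # offset=None: read from the current position of in_fd (Linux).
                n = os.sendfile(out_fd, in_fd, None, want)
            except OSError as exc:
                if exc.errno not in (errno.EINVAL, errno.ENOSYS):
                    raise
                use_sendfile = False  # unsupported descriptor type: fall back
                continue
        else:
            data = os.read(in_fd, want)
            n = len(data)
            view = memoryview(data)
            while view:  # os.write may be partial, so loop until drained
                view = view[os.write(out_fd, view):]
        if n == 0:
            break  # end of input
        copied += n
    return copied
```

The caller gets the same result on either path; only the syscall used differs.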
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Drop the constants for default cache host/port and page size because
they are not used anywhere.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
1) fix sfle memory leak on get_fle_for_scm error
2) fix gfd open descriptor leak on get_fle_for_scm error
3-6) fix buf memory leak on read and pwrite errors

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
rst0git and others added 21 commits July 31, 2019 14:12
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
In the Fedora tests we install python3-pip only to install flake8.

This is not necessary as there is a Fedora package for flake8.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
The following error is falsely reported by flake8:

lib/py/images/pb2dict.py:266:24: F821 undefined name 'basestring'

This error occurs because `basestring` is not available in Python 3,
however the if condition on the line above ensures that this error
will not occur at run time.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
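
A guard of roughly the following shape triggers the same report. The function name `is_string` is hypothetical, and the `# noqa: F821` comment is one common way to silence flake8 on that line; the commit message does not say which approach was taken.

```python
import sys


def is_string(value):
    """Version-guarded string check that works on both Python 2 and 3."""
    if sys.version_info[0] >= 3:
        return isinstance(value, str)
    # Only reached on Python 2, where basestring exists; flake8 running
    # under Python 3 cannot see that and flags the name as undefined.
    return isinstance(value, basestring)  # noqa: F821
```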
When Python 2 is not installed we assume that /usr/bin/python refers to
version 3 of Python and the executable /usr/bin/python2 does not exist.

This commit also resolves a compatibility issue with Popen where in
Py2 file descriptors will be inherited by the child process and in
Py3 they will be closed by default.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
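
The Popen difference can be observed directly: Python 3 defaults to `close_fds=True`, while Python 2 defaulted to `close_fds=False`. The sketch below is only a demonstration of that behavior, not the change made in the commit; `child_sees_fd` and `CHECK` are hypothetical names.

```python
import os
import subprocess
import sys

# Child snippet: exits non-zero if the fd number in argv[1] is not open.
CHECK = "import os, sys; os.fstat(int(sys.argv[1]))"


def child_sees_fd(fd, **popen_kwargs):
    """Return True if a child Python process can fstat the given fd number."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHECK, str(fd)],
        stderr=subprocess.DEVNULL,
        **popen_kwargs
    )
    return proc.wait() == 0
```

For an inheritable descriptor, `child_sees_fd(fd)` is False under Python 3 defaults, `child_sees_fd(fd, close_fds=False)` mimics the Python 2 default, and `child_sees_fd(fd, pass_fds=(fd,))` hands over exactly the descriptors the child needs.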
Reduce code duplication by taking setup_swrk() function into a separate
module that can be reused in multiple places.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Doesn't change uapi, but makes it a bit friendlier and documents
which loglevel means what for foreign users.

Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Provide a way to set gettimeofday() function for an infected task.
CRIU's parasite & restorer are very verbose, as more logs are better
than fewer when investigating bugs.
In all modern kernels there is a way to get time without entering
kernel: vdso. So, add a way to reduce the cost of logging without making
it less valuable.

[I'm not particularly fond of the std_log_set_gettimeofday() name, so
 if someone can come up with a better name - I'm up for a change]

Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
vdso will be used in restorer for timings in logs - try to keep it
during restore process.

Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
For simplicity, make them always valid in restorer.
rt->vdso_start will be used to calculate gettimeofday() address.

Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Also slight refactor.

Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
We need to distinguish compatible (ia32) vdso maps from x86_64 ones,
since that dictates the ABI of the vdso code.
Based on that, decide whether (or not) to use gettimeofday() from vdso in
the 64-bit restorer.

Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Omit calling raw syscalls and use vdso for the purpose of logging.
That will eliminate as much as one-syscall-per-PIE-message.
Getting time without switching to kernel will speed up C/R,
keeping logs as informative as they were.

Fixes: #346

I haven't enabled vdso timings for ia32 applications as it needs more
changes and complexity.. Maybe later.

Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
I've mentioned the problem that after c/r each inotify receives one or
more unexpected events.

This happens because our algorithm mixes setting up an inotify watch on
the file with opening and closing it.

We mix inotify creation and watched file open/close because we need to
create the inotify watch on the file from another mntns (generally). And
we do a trick opening the file so that it can be referenced in current
mntns by /proc/<pid>/fd/<id> path.

Moreover, if we have several inotifies on the same file, then the queue gets
even more events than the single one that appears in the simple case.

note: For now we don't have a way to c/r events in the queue, but we need to
at least leave the queue clean of events we generated ourselves.

Still, it looks harder to rewrite wd creation without this proc-fd
trick than to remove unexpected events from queues.

So just cleanup these events for each fdt-restorer process, for each of
its inotify fds _after_ restore stage (at CR_STATE_RESTORE_SIGCHLD).
This is the closest place where, for an _alive_ process, we know that all
prepare_fds() are done by all processes. This means we need to do the
cleanup in PIE code, so we need to add sys_ppoll definitions for PIE and
divide the process into two phases: first collect and transfer fds, then do
the real cleanup.

note: We still do prepare_fds() for zombies. But zombies have no fds in
/proc/pid/fd, so we collect none in collect_fds() and therefore have none
in prepare_fds(); thus there is no need to clean up inotifies for
zombies.

v2: adapt to multiple unexpected events
v3: do not cleanup from fdt-receivers, done from fdt-restorer
v4: do without additional fds restore stage
v5: replace sys_poll with sys_ppoll and fix minor nits

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

use ppoll always and remove poll
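
The effect described above, and the drain-until-empty cleanup, can be reproduced in user space. This is a Linux-only sketch through `ctypes`, not the PIE code: `watch` and `drain_events` are hypothetical helpers, and `select.poll` with a zero timeout stands in for the sys_ppoll call.

```python
import ctypes
import os
import select
import struct

IN_CLOSE_NOWRITE = 0x10   # values from <sys/inotify.h>
IN_OPEN = 0x20
IN_ALL_EVENTS = 0xFFF
EVENT_HEADER = struct.Struct("iIII")  # wd, mask, cookie, len

libc = ctypes.CDLL(None, use_errno=True)
libc.inotify_add_watch.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_uint32]


def watch(path):
    """Create a non-blocking inotify fd that watches path for all events."""
    fd = libc.inotify_init1(os.O_NONBLOCK)
    if fd < 0:
        raise OSError(ctypes.get_errno(), "inotify_init1 failed")
    if libc.inotify_add_watch(fd, os.fsencode(path), IN_ALL_EVENTS) < 0:
        err = ctypes.get_errno()
        os.close(fd)
        raise OSError(err, "inotify_add_watch failed")
    return fd


def drain_events(fd):
    """Read and discard every queued event; return the list of their masks."""
    masks = []
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    while poller.poll(0):  # zero timeout: only consume what is already queued
        buf = os.read(fd, 4096)
        off = 0
        while off < len(buf):
            _wd, mask, _cookie, name_len = EVENT_HEADER.unpack_from(buf, off)
            masks.append(mask)
            off += EVENT_HEADER.size + name_len
    return masks
```

Opening and closing a watched file read-only queues IN_OPEN (0x20) and IN_CLOSE_NOWRITE (0x10), the same masks that show up as unexpected events in the inotify04 test output; calling `drain_events` once leaves the queue empty.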
Just create two inotify watches on a testfile and do nothing except
c/r; it is expected that there are no events in the queue after this.

before "inotify: cleanup auxiliary events from queue":

[root@snorch criu]# ./test/zdtm.py run -t zdtm/static/inotify04
=== Run 1/1 ================ zdtm/static/inotify04
======================== Run zdtm/static/inotify04 in h ========================
 DEP       inotify04.d
 CC        inotify04.o
 LINK      inotify04
Start test
./inotify04 --pidfile=inotify04.pid --outfile=inotify04.out --dirname=inotify04.test
Run criu dump
Run criu restore
Send the 15 signal to  60
Wait for zdtm/static/inotify04(60) to die for 0.100000
=============== Test zdtm/static/inotify04 FAIL at result check ================
Test output: ================================
18:37:14.279:    60: Event       0x10
18:37:14.280:    60: Event       0x20
18:37:14.280:    60: Event       0x10
18:37:14.280:    60: Read 3 events
18:37:14.280:    60: FAIL: inotify04.c:105: Found 3 unexpected inotify events (errno = 11 (Resource temporarily unavailable))

<<< ================================

v2: make two inotifies on the same file

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

zdtm: inotify04 add another inotify on the same file
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
For now we will use static buffer to keep all messages.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Just an example of decoding messages with "criu check" command.
The flogs messages are printed to stdout after --- FLOG --- message.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
The new --binlog option can be used in addition to the older --log-file filename option to use the binary log format.
The new --print-log option allows using criu as a reader tool for binlog files. The filename of the log file is passed as usual via the --log-file option.
The Documentation/binlog.txt file describes the binlog file format and the options used.