19c Instant Client and Docker
You have probably landed here after searching for “ORA-12637: Packet receive failed” and “Docker”. Note that you can get the same error on old versions of VirtualBox, and maybe on other virtualized environments that do not correctly forward out-of-band data.
TL;DR: There are two workarounds for this:
- get out-of-band data correctly handled, with a recent version of your hypervisor or by disabling the userland-proxy in Docker
- disable Out-Of-Band breaks by setting DISABLE_OOB=ON in sqlnet.ora (client and/or server side); see the quick sketch just after this list
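For the impatient, here is a condensed sketch of both workarounds (they are detailed, with their context, later in this post; the sqlnet.ora path assumes a server-side ORACLE_HOME):
# workaround 1: let iptables handle the published ports instead of docker-proxy
echo '{ "userland-proxy": false }' > /etc/docker/daemon.json
systemctl restart docker
# workaround 2: disable Out-Of-Band breaks at the SQL*Net level
echo DISABLE_OOB=ON >> $ORACLE_HOME/network/admin/sqlnet.ora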
But this post is also a good occasion to explain all of this a bit more.
Out-Of-Band breaks
You have all experienced this. You run a long query and want to cancel it. Sometimes, just hitting ^C stops it immediately. Sometimes, it takes very long. And that has to do with the way this break command is handled.
Here is an example where Out-Of-Band is disabled (I have set DISABLE_OOB=ON in sqlnet.ora):
09:55:44 SQL> exec dbms_lock.sleep(10);
^C
BEGIN dbms_lock.sleep(10); END;

*
ERROR at line 1:
ORA-01013: user requested cancel of current operation
ORA-06512: at "SYS.DBMS_LOCK", line 215
ORA-06512: at line 1

Elapsed: 00:00:06.02
Here is an example where Out-Of-Band is enabled (the default DISABLE_OOB=OFF in sqlnet.ora):
10:13:05 SQL> exec dbms_lock.sleep(10);
^C
BEGIN dbms_lock.sleep(10); END;

*
ERROR at line 1:
ORA-01013: user requested cancel of current operation
ORA-06512: at "SYS.DBMS_LOCK", line 215
ORA-06512: at line 1

Elapsed: 00:00:00.43
You see the difference immediately: with Out-Of-Band enabled, the cancel is immediate. Without it, it takes some time for the server to cancel the statement. In both cases, I hit ^C at a time when the client was waiting for the server to answer a call. And that’s the point: when the client sends a user call, it waits for the answer. And the server is busy and will not read from the socket until it has something to send.
Here is how it works: on some platforms (which explains why this ^C is not immediate when your client or server is on Windows), the “break” message is sent on an Out-Of-Band channel of TCP/IP, with the URG flag. On the server, the process is then interrupted with a SIGURG signal and can cancel immediately. Without it, this “break/reset” communication goes through the normal socket channel when it is available.
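You can observe this yourself by tracing the client while it cancels a long call. A minimal sketch, reusing the Instant Client path and connection string from the demo below: run a long query, hit ^C, and look for the one-byte “!” sent with the MSG_OOB flag:
# trace the network sends of the client; ^C during a long call should show:
# sendto(..., "!", 1, MSG_OOB, NULL, 0) = 1
ORACLE_HOME=/usr/lib/oracle/19.3/client64 strace -f -e trace=sendto \
 /usr/lib/oracle/19.3/client64/bin/sqlplus -L system/oracle@//localhost:1521/pdb1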
What is new in 19c
When the connection is established, the 19c client checks that out-of-band break messages are supported, by sending a message with the MSG_OOB flag. This check can be disabled by the new parameter DISABLE_OOB_AUTO.
But if the network stack does not handle this properly (because of a bug in proxy tunneling or in the hypervisor), the connection hangs for a few minutes and then fails with: ORA-12637: Packet receive failed
Note that I don’t think the problem itself is new; the new connection-time check just makes it immediately visible. But when OOB is not working properly, a ^C will also hang. This means that setting DISABLE_OOB_AUTO=TRUE is not a solution: it just postpones the problem. The solution is DISABLE_OOB, which was already there in previous versions.
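To make the distinction concrete, here is how the two parameters would look in a client-side sqlnet.ora (a sketch; set one or the other, not both):
# only skips the 19c connect-time OOB check: the connection succeeds,
# but a later ^C will still hang, so this just postpones the problem
DISABLE_OOB_AUTO=TRUE
# disables OOB breaks entirely, falling back to in-band breaks;
# this is the actual workaround, and it also works with previous versions
DISABLE_OOB=ON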
ORA-12637: Packet receive failed
Here is what strace shows on my sqlplus client trying to connect to my Docker container through the forwarded port, when using the default docker-proxy: just after sending the MSG_OOB packet, the client waits 400 seconds and then fails with ORA-12637. More about this 400-second timeout later.
Here is how I reproduced this:
- I’m on CentOS 7.7, but I’m quite sure you can do the same on OEL
- I have Docker docker-ce-19.03.4
- I installed Tim Hall’s dockerfiles:
git clone https://github.com/oraclebase/dockerfiles.git
- I added the Oracle 19.3 install zip into the software subdirectory:
ls dockerfiles/database/ol7_19/software/LINUX.X64_193000_db_home.zip
- and apex:
ls dockerfiles/database/ol7_19/software/apex_19.1_en.zip
- I built the image:
cd dockerfiles/database/ol7_19
docker build -t ol7_19:latest .
- I installed 19.3 instant client:
wget https://download.oracle.com/otn_software/linux/instantclient/193000/oracle-instantclient19.3-basic-19.3.0.0.0-1.x86_64.rpm
wget https://download.oracle.com/otn_software/linux/instantclient/193000/oracle-instantclient19.3-sqlplus-19.3.0.0.0-1.x86_64.rpm
yum install -y oracle-instantclient19.3-basic-19.3.0.0.0-1.x86_64.rpm
yum install -y oracle-instantclient19.3-sqlplus-19.3.0.0.0-1.x86_64.rpm
- Created some volumes:
mkdir -p /u01/volumes/ol7_19_con_u02
chmod -R 777 /u01/volumes/ol7_19_con_u02
- and network:
docker network create my_network
- and created a container based on the image, redirecting port 1521 from the container to the host:
docker run -dit --name ol7_19_con \
 -p 1521:1521 -p 5500:5500 \
 --shm-size="1G" \
 --network=my_network \
 -v /u01/volumes/ol7_19_con_u02/:/u02 \
 ol7_19:latest
- I waited until the database was created:
until docker logs ol7_19_con | grep "Tail the alert log file as a background process" ; do sleep 1 ; done
- Now, if I connect from the host to localhost port 1521 (the redirected one):
ORACLE_HOME=/usr/lib/oracle/19.3/client64 /usr/lib/oracle/19.3/client64/bin/sqlplus -L system/oracle@//localhost:1521/pdb1
I get ORA-12637: Packet receive failed after 400 seconds.
- If I disable OOB, for example on the server:
docker exec -t ol7_19_con bash -ic 'echo DISABLE_OOB=ON > $ORACLE_HOME/network/admin/sqlnet.ora'
Then the connection is ok.
Note that if I only set DISABLE_OOB_AUTO=TRUE (disabling the OOB detection introduced in 19c), the connection is ok, but the client hangs if I later hit ^C.
If I disable the docker-proxy then all is ok without the need to disable OOB (and I verified that ^C immediately cancels a query):
systemctl stop docker
echo ' { "userland-proxy": false } ' > /etc/docker/daemon.json
systemctl start docker
docker start ol7_19_con
docker exec -t ol7_19_con bash -ic 'rm $ORACLE_HOME/network/admin/sqlnet.ora'
The connection is ok and ^C cancels immediately.
Now, about the 400 seconds: this is the connect timeout (INBOUND_CONNECT_TIMEOUT_LISTENER) set in this image.
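The parameter would sit in the listener.ora of the image; a sketch, assuming the value matches the 400 seconds observed:
# listener.ora: the listener gives up on an incomplete connection after 400 seconds
INBOUND_CONNECT_TIMEOUT_LISTENER = 400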
Tracing
I discussed this with the Oracle Database Client product manager and the SQL*Net developers. They cannot do much about it except document the error, as the problem is in the network layer. I did not open an issue with Docker because I could not easily reproduce it with a simple send/recv test case.
What I’ve observed is that the docker-proxy error appeared between Docker versions 18.09.8 and 18.09.9. If someone wants to build a small test case and open a Docker issue, here is what I traced with the Oracle Instant Client.
- I traced the client, server, and docker-proxy processes:
pid=$(ps -edf | grep docker-proxy | grep 1521 | awk '{print $2}')
strace -fyytttTo docker.strace -p $pid &
pid=$(ps -edf | grep tnslsnr | grep LISTENER | awk '{print $2}')
strace -fyytttTo oracle.strace -p $pid &
ORACLE_HOME=/usr/lib/oracle/19.3/client64 strace -fyytttTo sqlplus.strace /usr/lib/oracle/19.3/client64/bin/sqlplus -L demo/demo@//localhost/pdb1 </dev/null &
Here is the client sending the OOB check (sqlplus.strace) and then the other data:
13951 1573466279.299345 read(4<TCP:[127.0.0.1:54522->127.0.0.1:1521]>, "\0-\0\0\2\0\0\0\1>\fA\0\0\0\0\1\0\0\0\0-AA\0\0\0\0\0\0\0\0"..., 8208) = 45 <0.001236>
13951 1573466279.300683 sendto(4<TCP:[127.0.0.1:54522->127.0.0.1:1521]>, "!", 1, MSG_OOB, NULL, 0) = 1 <0.000047>
13951 1573466279.300793 write(4<TCP:[127.0.0.1:54522->127.0.0.1:1521]>, "\0\0\0\n\f \0\0\2\0", 10) = 10 <0.000033>
13951 1573466279.301116 write(4<TCP:[127.0.0.1:54522->127.0.0.1:1521]>, "\0\0\0\237\6 \0\0\0\0\336\255\276\357\0\225\0\0\0\0\0\4\0\0\4\0\3\0\0\0\0\0"..., 159) = 159 <0.000047>
13951 1573466279.301259 read(4<TCP:[127.0.0.1:54522->127.0.0.1:1521]>, "", 8208) = 0 <60.029279>
Everything was ok (the last read returned 45 bytes) until the MSG_OOB byte was sent. Then the client continued to send, but received nothing for 60 seconds.
On the server side, I see the 45 bytes sent and then a poll() waiting for 60 seconds with nothing received:
13954 1573466279.300081 write(16<TCP:[4101540]>, "\0-\0\0\2\0\0\0\1>\fA\0\0\0\0\1\0\0\0\0-AA\0\0\0\0\0\0\0\0"..., 45 <unfinished ...>
13954 1573466279.300325 <... write resumed> ) = 45 <0.000092>
13954 1573466279.300424 setsockopt(16<TCP:[4101540]>, SOL_SOCKET, SO_KEEPALIVE, [1], 4 <unfinished ...>
13954 1573466279.301296 <... setsockopt resumed> ) = 0 <0.000768>
13954 1573466279.301555 poll([{fd=16<TCP:[4101540]>, events=POLLIN|POLLPRI|POLLRDNORM}], 1, -1 <unfinished ...>
13954 1573466339.297913 <... poll resumed> ) = ? ERESTART_RESTARTBLOCK (Interrupted by signal) <59.996248>
I also traced SQL*Net (later, so do not try to match the timestamps):
2019-11-12 06:56:54.852 : nsaccept:Checking OOB Support
2019-11-12 06:56:54.853 : sntpoltsts:fd 16 need 43 readiness event, wait time -1

*** 2019-11-12T06:57:54.784605+00:00 (CDB$ROOT(1))
2019-11-12 06:57:54.784 : nserror:entry
2019-11-12 06:57:54.785 : nserror:nsres: id=0, op=73, ns=12535, ns2=12606; nt[0]=0, nt[1]=0, nt[2]=0; ora[0]=0, ora[1]=0, ora[2]=0
2019-11-12 06:57:54.785 : nttdisc:entry
2019-11-12 06:57:54.786 : nttdisc:Closed socket 16
2019-11-12 06:57:54.787 : nttdisc:exit
2019-11-12 06:57:54.787 : sntpoltsts:POLL failed with 4
2019-11-12 06:57:54.788 : sntpoltsts:exit
2019-11-12 06:57:54.788 : ntctst:size of NTTEST list is 1 - not calling poll
2019-11-12 06:57:54.788 : sntpoltst:No of conn to test 1, wait time -1
2019-11-12 06:57:54.789 : sntpoltst:fd 16 need 6 readiness events
2019-11-12 06:57:54.789 : sntpoltst:fd 16 has 2 readiness events
2019-11-12 06:57:54.790 : sntpoltst:exit
2019-11-12 06:57:54.790 : nttctl:entry
2019-11-12 06:57:54.790 : ntt2err:entry
2019-11-12 06:57:54.791 : ntt2err:soc -1 error - operation=5, ntresnt[0]=530, ntresnt[1]=9, ntresnt[2]=0
2019-11-12 06:57:54.791 : ntt2err:exit
2019-11-12 06:57:54.791 : nsaccept:OOB is getting dropped
2019-11-12 06:57:54.792 : nsprecv:entry
2019-11-12 06:57:54.792 : nsprecv:reading from transport...
2019-11-12 06:57:54.792 : nttrd:entry
2019-11-12 06:57:54.793 : ntt2err:entry
2019-11-12 06:57:54.793 : ntt2err:soc -1 error - operation=5, ntresnt[0]=530, ntresnt[1]=9, ntresnt[2]=0
2019-11-12 06:57:54.793 : ntt2err:exit
2019-11-12 06:57:54.794 : nttrd:exit
2019-11-12 06:57:54.794 : nsprecv:error exit
2019-11-12 06:57:54.794 : nserror:entry
2019-11-12 06:57:54.795 : nserror:nsres: id=0, op=68, ns=12535, ns2=12606; nt[0]=0, nt[1]=0, nt[2]=0; ora[0]=0, ora[1]=0, ora[2]=0
2019-11-12 06:57:54.795 : nsaccept:error exit
2019-11-12 06:57:54.795 : nioqper: error from niotns: nsaccept failed...
2019-11-12 06:57:54.796 : nioqper: ns main err code: 12535
2019-11-12 06:57:54.796 : nioqper: ns (2) err code: 12606
2019-11-12 06:57:54.796 : nioqper: nt main err code: 0
2019-11-12 06:57:54.797 : nioqper: nt (2) err code: 0
2019-11-12 06:57:54.797 : nioqper: nt OS err code: 0
2019-11-12 06:57:54.797 : niotns:No broken-connection function available.
2019-11-12 06:57:54.798 : niomapnserror:entry
2019-11-12 06:57:54.798 : niqme:entry
2019-11-12 06:57:54.798 : niqme:reporting NS-12535 error as ORA-12535
2019-11-12 06:57:54.799 : niqme:exit
2019-11-12 06:57:54.799 : niomapnserror:exit
2019-11-12 06:57:54.799 : niotns:Couldn't connect, returning 12170
The server process receives nothing for one minute (neither the OOB nor the normal data).
And here is the docker-proxy trace, copying the messages with splice() between the two sockets:
12972 1573466279.300434 splice(5<TCP:[172.18.0.1:55476->172.18.0.2:1521]>, NULL, 8<pipe:[4101550]>, NULL, 4194304, SPLICE_F_NONBLOCK <unfinished ...>
12972 1573466279.300482 <... splice resumed> ) = 45 <0.000028>
12972 1573466279.300514 splice(7<pipe:[4101550]>, NULL, 3<TCPv6:[::ffff:127.0.0.1:1521->::ffff:127.0.0.1:54522]>, NULL, 45, SPLICE_F_NONBLOCK <unfinished ...>
12972 1573466279.300581 <... splice resumed> ) = 45 <0.000051>
12972 1573466279.300617 splice(5<TCP:[172.18.0.1:55476->172.18.0.2:1521]>, NULL, 8<pipe:[4101550]>, NULL, 4194304, <unfinished ...>
12972 1573466279.300666 <... splice resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000032>
12972 1573466279.300691 epoll_pwait(6<anon_inode:[eventpoll]>, [], 128, 0, NULL, 2) = 0 <0.000016>
12972 1573466279.300761 epoll_pwait(6<anon_inode:[eventpoll]>, <unfinished ...>
12972 1573466279.300800 <... epoll_pwait resumed> [{EPOLLOUT, {u32=2729139864, u64=140263476125336}}], 128, -1, NULL, 2) = 1 <0.000028>
12972 1573466279.300836 epoll_pwait(6<anon_inode:[eventpoll]>, [{EPOLLIN|EPOLLOUT, {u32=2729139864, u64=140263476125336}}], 128, -1, NULL, 2) = 1 <0.000017>
12972 1573466279.300903 futex(0x55dc1af342b0, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000021>
12972 1573466279.300963 splice(3<TCPv6:[::ffff:127.0.0.1:1521->::ffff:127.0.0.1:54522]>, NULL, 10<pipe:[4102384]>, NULL, 4194304, SPLICE_F_NONBLOCK <unfinished ...>
12972 1573466279.301007 <... splice resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000026>
12972 1573466279.301058 epoll_pwait(6<anon_inode:[eventpoll]>, <unfinished ...>
12972 1573466279.301096 <... epoll_pwait resumed> [], 128, 0, NULL, 2) = 0 <0.000025>
12972 1573466279.301128 epoll_pwait(6<anon_inode:[eventpoll]>, <unfinished ...>
12972 1573466279.301206 <... epoll_pwait resumed> [{EPOLLIN|EPOLLOUT, {u32=2729139864, u64=140263476125336}}], 128, -1, NULL, 2) = 1 <0.000066>
12972 1573466279.301254 splice(3<TCPv6:[::ffff:127.0.0.1:1521->::ffff:127.0.0.1:54522]>, NULL, 10<pipe:[4102384]>, NULL, 4194304, SPLICE_F_NONBLOCK <unfinished ...>
12972 1573466279.301306 <... splice resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000031>
12972 1573466279.301346 epoll_pwait(6<anon_inode:[eventpoll]>, [], 128, 0, NULL, 2) = 0 <0.000016>
12972 1573466279.301406 epoll_pwait(6<anon_inode:[eventpoll]>, <unfinished ...>
12972 1573466339.329699 <... epoll_pwait resumed> [{EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=2729139656, u64=140263476125128}}], 128, -1, NULL, 2) = 1 <60.028278>
I didn’t get further. The difference between Docker 18.09.8 and 18.09.9 also includes a Go version change from 1.10.8 to 1.11.13, so there are many layers to troubleshoot, and I have already spent too much time on this.
The solution
Maybe the best solution is to not run the Oracle Database in a Docker container at all. Docker makes things simple only for simple software; with software as complex as an Oracle Database, it brings more problems than it solves.
For Docker, the best solution is to set the following in /etc/docker/daemon.json and restart Docker:
{
"userland-proxy": false
}
This uses iptables to handle the port redirection rather than the docker-proxy, which copies the messages with splice(), adding CPU overhead and latency anyway.
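To check that the proxy is really out of the picture after the restart, something like this should do (the DNAT rule in the DOCKER chain replaces the listening docker-proxy process):
# no docker-proxy process should be handling the published port anymore
ps -edf | grep docker-proxy | grep 1521
# the redirection is now a plain iptables DNAT rule
iptables -t nat -L DOCKER -n | grep 1521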
For VirtualBox, there is a “Network: scrub inbound TCP URG pointer, working around incorrect OOB handling” fix in version 6.0.12.
If you have no solution at the network level, then you will need to disable Out-Of-Band breaks at the SQL*Net level, by adding DISABLE_OOB=ON in sqlnet.ora on the client or database side (not the listener’s one, the database’s one, and this does not need any restart). Hitting ^C is then handled by an in-band break, where the server has to stop at regular intervals to check for a break request.
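For example, on the client side with the Instant Client used above, TNS_ADMIN can point to any directory holding this sqlnet.ora (a sketch, using a hypothetical ~/tns directory):
mkdir -p ~/tns
echo DISABLE_OOB=ON > ~/tns/sqlnet.ora
TNS_ADMIN=~/tns ORACLE_HOME=/usr/lib/oracle/19.3/client64 \
 /usr/lib/oracle/19.3/client64/bin/sqlplus -L system/oracle@//localhost:1521/pdb1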