Java Memory Allocation on Linux 2 AMI

Taylor

Member
Hello

We've been testing concurrent streams on the media server to make sure it can handle our expected load; however, during testing we found that our streams did not finish recording correctly.

We had 16 concurrent streams recording to the server, which is our benchmark. Once the streams were told to stop, the media server did not respond for about 60 seconds. It did eventually return the Stream Unpublished event, but calling stream.getRecordInfo() on the returned stream object gave a null value instead of the recorded stream name, which is what it would normally return.

The last time we tested this was in response to a similar issue here, where we tested WCS 5.2.673 using the same instance type and JVM options. That setup handled 16 streams easily, which is why we use it as our benchmark.

This time we suspect the Java heap memory, as we're not sure it's being configured correctly on the new Amazon Linux 2 AMI.

WCS Version: 5.2.780 (Amazon Linux 2 AMI)
WebSDK: 0.5.28.2753.154
Amazon Instance Type: t3a.medium

WCS-Core Properties:
Code:
### JVM OPTIONS ###
-Xms2g -Xmx2g
#-Xcheck:jni

# Can be a better GC setting to avoid long pauses
-XX:+UseConcMarkSweepGC
-XX:NewSize=256m
#-XX:+CMSIncrementalMode
#-XX:+UseParNewGC
I've been unable to confirm whether the Java heap size is actually being set to 2 GB. Can you please confirm whether the settings are correct, including whether we should still be setting the garbage collector?

Also, can you recommend a way for us to confirm the heap size setting? I tried the command below, and its output is what made me concerned that the setting isn't being applied.
Cheers.

Linux Printout:
Code:
java -XX:+PrintFlagsFinal -version | grep -iE 'HeapSize|PermSize|ThreadStackSize'
     intx CompilerThreadStackSize                  = 1024                                   {pd product} {default}
   size_t ErgoHeapSizeLimit                        = 0                                         {product} {default}
   size_t HeapSizePerGCThread                      = 43620760                                  {product} {default}
   size_t InitialHeapSize                          = 65011712                                  {product} {ergonomic}
   size_t LargePageHeapSizeThreshold               = 134217728                                 {product} {default}
   size_t MaxHeapSize                              = 1023410176                                {product} {ergonomic}
   size_t MinHeapSize                              = 8388608                                   {product} {ergonomic}
    uintx NonNMethodCodeHeapSize                   = 5826188                                {pd product} {ergonomic}
    uintx NonProfiledCodeHeapSize                  = 122916026                              {pd product} {ergonomic}
    uintx ProfiledCodeHeapSize                     = 122916026                              {pd product} {ergonomic}
   size_t SoftMaxHeapSize                          = 1023410176                             {manageable} {ergonomic}
     intx ThreadStackSize                          = 1024                                   {pd product} {default}
     intx VMThreadStackSize                        = 1024                                   {pd product} {default}
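(I suspect the printout above just reflects a fresh JVM started by `java -version`, which wouldn't read wcs-core.properties, so it may only show the ergonomic defaults. If that's the case, checking the running WCS process directly might be more reliable, roughly like the sketch below; matching the process by the WebCallServer name is an assumption, and jcmd should be run as the user that owns the process.)
Code:
# find the WCS Java process and print the heap-related VM flags it is actually running with
PID=$(pgrep -f WebCallServer | head -n 1)
jcmd "$PID" VM.flags | tr ' ' '\n' | grep -iE 'HeapSize'
# or simply inspect the command line the process was started with
ps -o command= -p "$PID" | tr ' ' '\n' | grep -iE '^-Xm[sx]'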
 

Max

Administrator
Staff member
Hello!

We tested your case.
Indeed, when using an AWS t3.medium instance there is a freeze when recording of 16 streams is stopped, regardless of the WCS core settings: it occurs both when the memory allocated to the heap is increased and when ZGC is configured.

Changing the heap and ZGC settings is ineffective; the bottleneck in this case is the CPU.

We recommend using instances with at least 4 vCPUs.

Screenshot from t3.medium (2 vCPU, 4 GB RAM, -Xms2g -Xmx2g, ZGC) when recording 16 streams:
1610356842307.png


Screenshot from t3.xlarge (4 vCPU, 16 GB RAM, default settings) when recording 16 streams:
1610356822371.png
 

Taylor

Member
Hi Max

Thank you for answering, but upgrading the instance type to t3.xlarge isn't ideal.

We are currently running t3a.medium for our media servers at a cost of $0.085 per hour (EC2 instance plus software license).
Upgrading to t3.xlarge would cost us $0.377 per hour, which is 443.5% of the initial amount.
Even upgrading to t3a.xlarge would cost us $0.34 per hour, which is 400% of the initial amount.

It would make more sense to deploy two t3a.medium media servers, which would only cost 200% of the initial amount.

Even so, it doesn't make sense to use the new media server build if the original servers we are running can handle the load.

The original servers run 5.2.673 on the Amazon Linux AMI, which we recently tested and confirmed can handle the load of 16 streams while using only 65-75% of the vCPUs.
The new servers run 5.2.780 on the Amazon Linux 2 AMI, which, according to the data above, appears to be nearly half as efficient and thus cannot handle 16 streams.

We suspect the biggest change causing this drop in efficiency is the different JDK: the migration goes from the Amazon Linux AMI, which uses java version "1.8.0_66", to the new Amazon Linux 2 AMI, which uses openjdk version "14.0.1".

Sorry if I'm being blunt, but from our position it just doesn't make sense to upgrade our production to 5.2.780.
 

Max

Administrator
Staff member
We suspect the biggest change causing this drop in efficiency is the different JDK: the migration goes from the Amazon Linux AMI, which uses java version "1.8.0_66", to the new Amazon Linux 2 AMI, which uses openjdk version "14.0.1".
Unfortunately, the Java version is not the biggest change. Amazon Linux is based on an old Fedora, while Amazon Linux 2 is based on CentOS 7.3, so the OS has changed too: a new kernel, glibc, systemd, etc. Amazon plans to deprecate Amazon Linux in the near future, so we have to move the AMI to Amazon Linux 2.
We have raised ticket WCS-3035 to investigate this and find where the bottleneck is. We will let you know the results here.
On your side, you can deploy an instance based on the 5.2.673 AMI, then update WCS in that instance to 5.2.780 as described here. This lets you test the newer WCS build on the same Java and OS. If the recording performance remains the same as in 5.2.673, you can upgrade your production this way.
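Roughly, the in-place update could look like this (a sketch only; the download URL pattern and the installer script name are assumptions, so please follow the update guide linked above for the exact steps):
Code:
# sketch only - follow the linked update guide; the URL pattern and installer name are assumptions
cd /tmp
wget https://flashphoner.com/downloads/builds/WCS/5.2/FlashphonerWebCallServer-5.2.780.tar.gz
tar -xzf FlashphonerWebCallServer-5.2.780.tar.gz
cd FlashphonerWebCallServer-5.2.780
sudo ./install.sh
sudo systemctl restart webcallserver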
Another option is to consider migration to Google Cloud, which may be cheaper than AWS. Please read this article to compare.
 

Taylor

Member
Hi Max

Thanks for the advice. We decided to go with the old AMI and update WCS to 5.2.780, but when testing we found the CPU usage was still quite high.

Test details:
Record 16 concurrent streams, each 3 minutes long.
Check that all 16 receive the Unpublished status event so they can proceed normally.
Run the 'top' command on the server at the same time to observe CPU usage.

WCS 5.2.673 - Linux AMI (Old):
16 out of 16 streams finish correctly
673_OldAMI.png


WCS 5.2.780 - Linux AMI (Old):
12 out of 16 streams finish correctly
780_OldAMI.png


WCS 5.2.780 - Linux 2 AMI (New):
1 out of 16 streams finish correctly
780_NewAMI.png


The first two tests were done on the same AMI, with the only difference being the WCS version, and the difference in CPU% is quite substantial.

Note: the only configuration difference was `video_incoming_buffer_size=100` being set in the first test but not the second. I ran out of testing time today but can retest with that setting tomorrow to double-check if need be.

The last test was strange: it showed only 0.3% CPU usage, but the low number of videos finishing successfully should have corresponded to high CPU usage. It's probably an issue with the command on the new AMI.
 

Max

Administrator
Staff member
Please use htop or per-process top results (not the integral figure); they should make the comparison clearer.
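For example, something like this (a sketch; matching the process by the WebCallServer name is an assumption, adjust it to your setup):
Code:
# per-process view: -p limits top to the WCS Java process, -H additionally shows per-thread usage
top -H -p $(pgrep -f WebCallServer | head -n 1)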
We are performing our own tests under ticket WCS-3035 and will let you know the results.
 

Max

Administrator
Staff member
We performed recording tests and confirm the simultaneous recording issue in 5.2.780 compared with 5.2.673 on low-powered servers.
We'll try to find the bottleneck under ticket WCS-3035 and let you know the results here.
Please do not upgrade your production to 5.2.780 until a fix is found.
 

Max

Administrator
Staff member
Good day.
We investigated the issue.
CPU load is higher in 5.2.780 because recording of 2 audio channels is enabled by default. To reduce the load on a t3a.medium instance, add the following parameter to the flashphoner.properties file:
Code:
record_audio_codec_channels=1
With this setting, the load should not exceed 75% per CPU core, and simultaneous recording of 16 streams should finish within 1-2 seconds after the streams are stopped.
Our tests in AWS EC2 show the same result for both the Amazon Linux AMI with JDK 8 and the Amazon Linux 2 AMI with JDK 14.0.1. So with this setting applied, you can upgrade to the new WCS marketplace image based on Amazon Linux 2 with WCS 5.2.780.
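Applying the parameter could look roughly like this (a sketch; the configuration path and the webcallserver service name are assumptions based on a default installation):
Code:
# append the setting (default install path assumed) and restart WCS so it takes effect
echo "record_audio_codec_channels=1" | sudo tee -a /usr/local/FlashphonerWebCallServer/conf/flashphoner.properties
sudo systemctl restart webcallserver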
 

Taylor

Member
Hi Max

Thank you for investigating the issue. We've implemented the change and can confirm that it reduces CPU usage dramatically and allows us to run 16 simultaneous streams without issue:
Screen Shot 2021-01-21 at 4.43.18 pm.png

Test results on Amazon Linux 2 AMI, recording 16 streams on "Several Streams Recording" demo page.

One interesting thing we found during testing: if the records directory, where the streams are saved, is mounted on a Network File System, the 'top' and 'htop' commands display CPU usage incorrectly, which is why our previous tests on the new AMI didn't show the correct CPU usage.
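For anyone hitting the same thing, one way to cross-check CPU usage when top and htop look suspicious might be pidstat, roughly like this (a sketch; it assumes the sysstat package is installed and that the WCS process can be matched by the WebCallServer name):
Code:
# sample per-process CPU for the WCS Java process every 5 seconds (pidstat is part of sysstat)
pidstat -u -p $(pgrep -f WebCallServer | head -n 1) 5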

Cheers
 

Max

Administrator
Staff member
One interesting thing we found during testing: if the records directory, where the streams are saved, is mounted on a Network File System, the 'top' and 'htop' commands display CPU usage incorrectly, which is why our previous tests on the new AMI didn't show the correct CPU usage.
This is probably an NFS driver issue in Amazon Linux 2. Unfortunately, we cannot influence the low-level I/O implementation from the Java process in any way; only standard open-write-close operations are used.
 

Taylor

Member
Hi Max

Thank you for the help before.

We recently ran another stress test and found that 16 streams failed to record safely; this time, however, the issue seems to be related to Amazon's EFS (the Network File System).

Our tests once again involve having 16 concurrent streams using the demo page:
Testing on 5.2.673 (Old Amazon Linux AMI): CPU usage reads about 65% and 16 videos are safely recorded.
Testing on 5.2.780 (Old Amazon Linux AMI): CPU usage reads about 65% and 16 videos are safely recorded.
Testing on 5.2.780 (New Amazon Linux 2 AMI): CPU usage reads between 20% and 50%, due to the issue with the Network File System and 'htop' mentioned above, but the 16 videos fail to finish recording safely, which is usually caused by CPU overload that should have read as 100%.
Testing on 5.2.780 (New Amazon Linux 2 AMI) but without EFS: CPU usage reads about 65% and 16 videos are safely recorded.

N.B. For 5.2.780 I use the record_audio_codec_channels=1 setting, as you mentioned above.

We mount the 'records' directory, where the streams are saved, on Amazon Elastic File System (EFS).
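For context, the mount is set up roughly like this (a sketch; the file system ID and region are placeholders, the records path assumes a default install, and the options are the ones AWS generally recommends for EFS over NFSv4.1):
Code:
# mount the EFS file system on the WCS records directory (placeholder file system ID and region)
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport \
    fs-12345678.efs.us-east-1.amazonaws.com:/ /usr/local/FlashphonerWebCallServer/records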

I've opened a support case with Amazon to see if they can work out what is going on and whether they can help.

This issue feels like it's related to Amazon and not Flashphoner, so I don't expect Flashphoner to work on a solution at all, but I figured this may be important for you to know, and if you have any ideas as to why this is occurring and how to fix it, that would be wonderful.

Cheers.
 

Max

Administrator
Staff member
This issue feels like it's related to Amazon and not Flashphoner, so I don't expect Flashphoner to work on a solution at all, but I figured this may be important for you to know, and if you have any ideas as to why this is occurring and how to fix it, that would be wonderful.
Unfortunately, on the WCS side there is no way to detect the low-level file system implementation, so we cannot distinguish a network file system from a local hard disk.
There is a parameter which can be used to spread recording across threads (4 threads are used by default). It may help to reduce the NFS effect. Try changing this setting to 8, for example:
Code:
file_recorder_thread_pool_max_size=8
Note that server performance may drop if too many threads are reserved for recording.
 

Taylor

Member
Hi Max

Sorry for the late reply.

I tested your solution of increasing the thread pool size for the file recorder. This made it possible for us to record 16 streams and have them all finish safely:
Screen Shot 2021-02-05 at 12.30.10 pm.png


Thank you for the suggestion. I will hopefully get some information back from Amazon regarding EFS and the new AMI that could also be of help.

Cheers
 

Taylor

Member
Hi Max

I've got some updates regarding the differences between the Linux AMIs and how they're affecting stream recording with Amazon EFS (the Network File System).

After some testing, they've found behavioural differences in the number of NFS IO operations. In short, the new AMI (Amazon Linux 2) makes more IO requests to the NFS.

In addition, I did further testing on the 'Several Streams Recording' demo page using file_recorder_thread_pool_max_size=16 and these are the results:
• 8 min length - server responded 10 seconds after the streams finished - 16/16 streams successful
• 2 min - 33 seconds to respond - 14/18 streams successful (1st, 2nd, 2nd-last and last streams unsuccessful)
• 3 min - 34 seconds to respond - 14/18 streams successful (1st, 2nd, 2nd-last and last streams unsuccessful)
• 2 min - 43 seconds to respond - 16/20 streams successful (1st, 2nd, 3rd and 4th streams unsuccessful)
• 3 min - 1 minute to respond - 14/20 streams successful (1st, 2nd, 3rd, 4th, 4th-last and 3rd-last streams unsuccessful)

N.B. The streams that are unsuccessful do return an 'UNPUBLISHED' status but calling stream.getRecordInfo() on the returned stream object yields no file name.

Increasing the thread count benefits the NFS IO operations by spreading them across parallel threads; when each recording has its own dedicated thread, the NFS operations don't have to queue up serially and delay the stream handling. The tests above show that when multiple streams share a single thread, there is a high risk that the response time increases and causes issues with the returned 'UNPUBLISHED' status and stream object.

AWS is still cross-checking the data and I'm still providing test data when possible, but this is as far as our confidence goes so far. Some factors AWS wouldn't know about are how Flashphoner's code works, what internal calls it may make, and whether there are important differences in what it has to do between the old and new Amazon images.

It would certainly help if you have any further ideas or information regarding this, as it could help explain why getRecordInfo() can't be called successfully on the returned stream object. I would be very grateful for any further assistance, and if you need any collected logs (not just tcpdump but also the NFS IO stats) I would be happy to provide them.

Just to reiterate the problem:
The issue is that getRecordInfo() doesn't always work when the stream status returns 'UNPUBLISHED', and it only occurs:
  • on the new image: Amazon Linux 2 AMI
  • when NFS is set up on the '/records' directory
  • when 2 (or more) streams share the same thread

Cheers
 

Max

Administrator
Staff member
N.B. The streams that are unsuccessful do return an 'UNPUBLISHED' status but calling stream.getRecordInfo() on the returned stream object yields no file name.
Let's explain how recording of an H264 stream to MP4 works.
1. A temporary file (mp4.tmp) is created in which the stream data are stored.
2. When the stream is stopped, the recording file (mp4) is created, then the MP4 header is calculated (the moov and mdat atoms are defined) and written to the recording file.
3. The recording data are written to the recording file.
4. The recording file is closed.
5. The STREAM_STATUS.UNPUBLISHED event is sent to the client with the recording file name.
So the problem is that at steps 2-3 we read from one file and write to another, and therefore have twice as many IO operations. On an NFS volume this may take more than 15 seconds per stream. In that case the timer that waits for recording finalization (15 seconds by default) expires, and the UNPUBLISHED event is sent before the recording file is ready, so the file name is not sent to the client.
The workaround is to use a local disk (HDD) for the /records folder and an on_record_hook script to copy the recording files to the NFS volume, as sketched below.
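A minimal sketch of such a hook, assuming it is invoked with the stream name and the recorded file name as arguments, and that the EFS volume is mounted at /mnt/efs/records (the argument layout, paths and log location are assumptions; check the recording hook documentation for the exact interface):
Code:
#!/bin/bash
# on_record_hook.sh - copy a finished recording from the local records folder to the NFS/EFS volume.
# NOTE: the argument layout and paths below are assumptions; verify them against the WCS recording hook docs.
STREAM_NAME="$1"                                        # stream name (assumed)
FILE_NAME="$2"                                          # recorded file name (assumed)
LOCAL_DIR=/usr/local/FlashphonerWebCallServer/records   # local disk records folder (default install path assumed)
NFS_DIR=/mnt/efs/records                                # where the EFS/NFS volume is mounted (assumed)

cp "$LOCAL_DIR/$FILE_NAME" "$NFS_DIR/" \
  && echo "$(date) copied $FILE_NAME to $NFS_DIR" >> /var/log/on_record_hook.log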
You can also increase the timer setting to 30 seconds or more:
Code:
record_stop_timeout=30
We have also raised ticket WCS-3080 to add a separate setting for the folder where temporary recording files are placed, which is supposed to be set to /tmp or a RAM disk.
 

Max

Administrator
Staff member
Good day.
Since build 5.2.963, it is possible to set a separate folder for temporary files using the following parameter
Code:
record_tmp_dir=/tmp
Please read the details here.
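If a RAM disk is preferred over /tmp, setting one up and pointing the new parameter at it could look roughly like this (a sketch; the mount point, size and configuration path are assumptions):
Code:
# example: create a 1 GB RAM disk for temporary recording files (mount point and size are examples)
sudo mkdir -p /mnt/wcs-rec-tmp
sudo mount -t tmpfs -o size=1g tmpfs /mnt/wcs-rec-tmp
# point WCS at it (default flashphoner.properties path assumed), then restart the service
echo "record_tmp_dir=/mnt/wcs-rec-tmp" | sudo tee -a /usr/local/FlashphonerWebCallServer/conf/flashphoner.properties
sudo systemctl restart webcallserver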
 