Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade embedded Postgres and fix compatibility issues. #1199

Closed

Conversation

bcipriano
Copy link
Collaborator

@bcipriano bcipriano commented Sep 16, 2022

Link the Issue(s) this Pull Request is related to.
#1196

Summarize your change.
In order to get Cuebot working on the arm64v8 architecture, we need to upgrade the embedded Postgres we're using. This created some compatibility issues which are fixed here as well.

The main issue I ran into was a strange one. On ARM only, and only in Docker, it appears that SUM(bigint) does not work properly. It returns a numeric type response as expected, but the value is completely wrong -- we were getting results like 20 + 9 = 368934881474632988248, which would cause overflows in the Cuebot code. The solution I found is to CAST the input values as numeric before the SUM happens.

This only happens when building the Cuebot Docker image. When I run tests directly on my Mac i.e. through IntelliJ, the tests run fine.

There are a few other minor test fixes in here, possibly due to some differences in either the Postgres version or x86/arm differences, but I didn't see anything concerning there.

This PR contains some of the changes from #1185 to make testing easier; these will be cleaned up once that PR is merged.

@@ -264,30 +264,25 @@ public void updateProcMemoryUsage(FrameInterface f, long rss, long maxRss,
"SELECT pk_frame FROM proc WHERE pk_frame=? FOR UPDATE",
String.class, f.getFrameId()).equals(f.getFrameId())) {

getJdbcTemplate().update(UPDATE_PROC_MEMORY_USAGE,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure this was a merge mistake. This block and the one below it seem to duplicate each other, and this block is missing one of the placeholder values.

@@ -212,6 +211,7 @@ public void testPortionSorted() throws Exception {
@Test
@Transactional
@Rollback(true)
@Ignore
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand this unit test. I dug into it a bit and I don't see an obvious problem, but the assertNotEquals(jobs, sortedJobs) check fails for me sometimes in an ARM environment. After looking at the dispatch SQL query, I don't see why jobs and sortedJobs should be different. The jobs are all launched with the same priority, so it's expected to get them back in the same order, right?

@DiegoTavares Any idea here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test doesn't make much sense to me. I guess the first assert with sortedJobs can be removed as it's not guaranteed both lists will be different.

Comment on lines -267 to -270
getJdbcTemplate().update(UPDATE_PROC_MEMORY_USAGE,
rss, maxRss, vss, maxVss,
usedGpuMemory, maxUsedGpuMemory, f.getFrameId());
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, this is was a weird merge issue. Not sure where it came from. Good catch.

@@ -88,7 +88,7 @@ public class DispatchQuery {
"LEFT JOIN (" +
"SELECT " +
"limit_record.pk_limit_record, " +
"SUM(layer_stat.int_running_count) AS int_sum_running " +
"SUM(CAST(layer_stat.int_running_count AS numeric)) AS int_sum_running " +
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading postgres docs raised some concerns regarding the performance of this conversion:

https://www.postgresql.org/docs/current/datatype-numeric.html :

The type numeric can store numbers with a very large number of digits. It is especially recommended for storing monetary amounts and other quantities where exactness is required. Calculations with numeric values yield exact results where possible, e.g., addition, subtraction, multiplication. However, calculations on numeric values are very slow compared to the integer types, or to the floating-point types described in the next section.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to dig a bit deeper to understand where this is coming from. Was the issue impacting every query or a specific subset?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seemed to impact every query that used a SUM on a bigint field. I didn't test all of these queries but once I saw it a few times I just went through and changed all of them, it seemed consistent.

According to postgres docs SUM(bigint) always returns a numeric, so I'm assuming something is going wrong in that conversion process somewhere.

It could be a problem specific to the embedded postgres, I don't think I was able to reproduce it anywhere else. And I can't get an older version of the embedded postgres on ARM so I wasn't able to try that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DiegoTavares Any update here? Possible to do a test on your side to see if it produces any performance issues of note?

Copy link
Collaborator

@DiegoTavares DiegoTavares Oct 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've submitted a ticket to EDB asking about the error.
It looks there's a change this update will actually have performance implications. We can try a different version of postgres and see if this is a bug specific for this version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bcipriano I have few questions for you

  1. Did you installed postgres 14.5 using following link https://www.enterprisedb.com/downloads/postgres-postgresql-downloads or using some other source ?
  2. Did you got similar issue when you was on version 12.12 ? Are you able to test it on Postgresql12.12 on your new macbook ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bcipriano Can you confirm the source of postgres download? can you try https://www.enterprisedb.com/downloads/postgres-postgresql-downloads for download postgres ? Let me know how it goes with EDB 14.5?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to mention -- this only seems to happen in Docker. When I run tests directly on my ARM Macbook, they pass fine, in fact this PR isn't needed at all. It's only when I try to build the Cuebot Docker image (which does the Gradle build/test as part of the build process) that this problem appears.

Did you installed postgres 14.5 using following link https://www.enterprisedb.com/downloads/postgres-postgresql-downloads or using some other source ?

Postgres is installed via Gradle/Maven. You can see the versions defined in build.gradle: https://github.com/AcademySoftwareFoundation/OpenCue/pull/1199/files#diff-7e57ab670111c27c23069909d0cc32c88d266c9a1f3e6ca38052c973418146fe

I don't know if it's possible use Postgres binaries from some other source with this embedded Postgres thing we're using.

Did you got similar issue when you was on version 12.12 ? Are you able to test it on Postgresql12.12 on your new macbook ?

Same issue on 11.13.0 and 12.12.0. But only on Docker.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a bit more research into this today:

  • Tried the latest Postgres 15 binaries. Same issue.
  • Tried a different base image (almalinux instead of gradle). Same issue.
  • Created a Postgres container from the standard postgres image on my mac. Was not able to reproduce the issue.

Still kind of stumped. I filed zonkyio/embedded-postgres#99 to see if those folks have any idea.

@bcipriano
Copy link
Collaborator Author

From TSC meeting: we're going to hold on this PR for now while we continue to investigate. The upstream ticket has some suggestions I'll look into next.

@bcipriano
Copy link
Collaborator Author

Had a breakthrough on this based on the advice I got on the upstream ticket. See zonkyio/embedded-postgres#99 (comment) for details.

Hopefully we can get some updated Postgres binaries released soon, after which we can probably throw out this PR and do a simple Postgres version upgrade.

@DiegoTavares
Copy link
Collaborator

Closing stale PRs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants