Efficiently querying data in batches using Absinthe

May 14, 2021

Building APIs with Absinthe and GraphQL gives you great power, but with great power comes great responsibility. (Or something like that.) It can be easy to make a bit of a mess and go about fetching data in very slow and inefficient ways, and it's up to us as developers to notice it happening and nip it in the bud.

Consider the following scenario:

Our app contains records for Courses, such as courses you might take at a college or university, and each Course has many associated Enrolments. Each Enrolment has a type field, holding either the value "auto" or the value "manual".

How would we query in GraphQL for a list of all courses, with its enrolment counts for each type of enrolment?

To make the API really simple, you might want to support this query:

query Courses {
  courses {
    id
    code
    name
    autoEnrolmentCount
    manualEnrolmentCount
  }
}

So far so good. Now how could we load that data efficiently on the backend?

🔗 The naive approach

⚡️ I've set up a GitHub repo with a Phoenix app and the code from this post - you can check it out here. ⚡️

Assuming you already have your Elixir app set up with Absinthe, you might have a course object type defined as part of your schema that looks like this:

object :course do
  field :id, non_null(:id)
  field :code, non_null(:string)
  field :name, non_null(:string)
end

If you add fields for :auto_enrolment_count and :manual_enrolment_count, they don't exist as part of the Course schema. The default resolution method for any field you list is to do Map.get(object, field_name) so these would always just be null.

Well, counts shouldn't ever be null, and you'd need to get the real value in there anyway, so you can add custom resolvers for those fields.

alias CourseLoaderWeb.GraphQL.Resolvers # Use the correct name for your resolver!

object :course do
  field :id, non_null(:id)
  field :code, non_null(:string)
  field :name, non_null(:string)

  field :auto_enrolment_count, non_null(:integer) do
    resolve(&Resolvers.course_auto_enrolment_count/3)
  end

  field :manual_enrolment_count, non_null(:integer) do
    resolve(&Resolvers.course_manual_enrolment_count/3)
  end
end

And in the resolver module:

def course_auto_enrolment_count(_, _, _), do: {:ok, 0}
def course_manual_enrolment_count(_, _, _): do: {:ok, 0}

This works, and you can query courses and find out that none of them have any enrolments.

To get the right number of enrolments, you could change the resolver to call a context function:

def course_auto_enrolment_count(course, _, _) do
  {:ok, Courses.count_course_enrolments(course, "auto")}
end

def course_manual_enrolment_count(course, _, _) do
  {:ok, Courses.count_course_enrolments(course, "manual")}
end

And in your context module, use Ecto to count the enrolments for the specified course ID and enrolment type:

def count_course_enrolments(%{id: id}, enrolment_type) do
  from(
    e in CourseLoader.Courses.Enrolment,
    select: count(e.course_id),
    where: e.course_id == ^id and e.type == ^enrolment_type
  )
  |> CourseLoader.Repo.one()
end

It works! But it's inefficient. This count_course_enrolments function will get called every time you need a value in the query response - so twice (once for each enrolment type) for each course in the response. For a lot of courses (or a more complicated query), this can add up fast.

# Oh no...
[debug] QUERY OK source="courses" db=0.5ms idle=1345.0ms
SELECT c0."id", c0."code", c0."name" FROM "courses" AS c0 []
[debug] QUERY OK source="enrolments" db=0.3ms idle=1345.8ms
SELECT count(e0."course_id") FROM "enrolments" AS e0 WHERE ((e0."course_id" = $1) AND (e0."type" = $2)) [1, "auto"]
[debug] QUERY OK source="enrolments" db=1.0ms idle=1346.3ms
SELECT count(e0."course_id") FROM "enrolments" AS e0 WHERE ((e0."course_id" = $1) AND (e0."type" = $2)) [1, "manual"]
[debug] QUERY OK source="enrolments" db=0.5ms idle=1347.8ms
SELECT count(e0."course_id") FROM "enrolments" AS e0 WHERE ((e0."course_id" = $1) AND (e0."type" = $2)) [2, "auto"]
[debug] QUERY OK source="enrolments" db=1.3ms idle=1349.0ms
SELECT count(e0."course_id") FROM "enrolments" AS e0 WHERE ((e0."course_id" = $1) AND (e0."type" = $2)) [2, "manual"]
...
...

It's a classic N+1 query problem, and you can use Absinthe's batching to solve it.

🔗 The batching approach!

Absinthe can take a batching approach to resolving values for fields, building a list of all the data that needs to be fetched, and then fetching it all at once before sending the query response.

For example, if you were fetching enrolment counts for courses 1, 2, 4 and 5, you can tell Absinthe "collate a list of all of the course IDs we need enrolment counts for, then call a function to get all of the data at once".

In practice, it might look something like this in your resolver:

import Absinthe.Resolution.Helpers, only: [batch: 3]

def course_auto_enrolment_count(course, _, _) do
  batch({Courses, :count_course_enrolments, "auto"}, course.id, fn results ->
    {:ok, Map.get(results, course.id, 0)}
  end)
end

def course_manual_enrolment_count(course, _, _) do
  batch({Courses, :count_course_enrolments, "manual"}, course.id, fn results ->
    {:ok, Map.get(results, course.id, 0)}
  end)
end

This is telling Absinthe, "every time I request a course_auto_enrolment_count value, remember that you'll need to call the function identified by the tuple {Courses, :count_course_enrolments, "auto"} (the first argument to batch/3) for this course.id (the second argument to batch/3)."

Absinthe is nodding and saying yes yes, writing it down, I will call that function... eventually. It does everything else it needs to do, loading a list of courses it needs to return, processing each course to get the values for each of the fields... but it doesn't actually do anything about all these batch requests it has sitting there waiting.

Only at the very end does Absinthe say oh yeah, about those enrolment counts you wanted! You wanted me to call {Courses, :count_course_enrolments, "auto"}, so that's the function Courses.count_course_enrolments. I'll give it the first argument "auto", and then also the list of all the course IDs you wanted counts for. Got it!

(Note: the third argument in the tuple can be anything, or nothing at all, it could be just a two-element tuple that calls the specified function with only the list of IDs.)

With these two resolver functions, our Courses.count_course_enrolments/2 function will only get called twice, no matter how many courses we request data for - once for each enrolment type. The results of the function get passed to the third argument to batch/3, which is where we can separate out the data for each course to build the proper response.

So you can rewrite the query that gets the enrolment count for a single enrolment type and course, to fetch counts for all of the requested courses at once.

def count_course_enrolments(enrolment_type, course_ids) do
  from(
    e in CourseLoader.Courses.Enrolment,
    select: {e.course_id, count(e.course_id)},
    group_by: e.course_id,
    where: e.course_id in ^course_ids and e.type == ^enrolment_type
  )
  |> CourseLoader.Repo.all()
  |> Enum.into(%{})
end

This could really return any kind of value you wanted, but in this example I've chosen a map, with the keys being course IDs, and the values being the requested enrolment count.

The resolver callback functions get called for all of the requested fields, and they can just pluck out the right value for the course ID in question (defaulting to 0 if there are no enrolments, therefore no key in the map.)

This is much more efficient on the database:

[debug] QUERY OK source="courses" db=0.2ms idle=583.3ms
SELECT c0."id", c0."code", c0."name" FROM "courses" AS c0 []
[debug] QUERY OK source="enrolments" db=1.5ms queue=0.1ms idle=584.2ms
SELECT e0."course_id", count(e0."course_id") FROM "enrolments" AS e0 WHERE (e0."course_id" = ANY($1) AND (e0."type" = $2)) GROUP BY e0."course_id" [[6, 5, 4, 3, 2, 1], "manual"]
[debug] QUERY OK source="enrolments" db=1.8ms queue=0.1ms idle=584.0ms
SELECT e0."course_id", count(e0."course_id") FROM "enrolments" AS e0 WHERE (e0."course_id" = ANY($1) AND (e0."type" = $2)) GROUP BY e0."course_id" [[6, 5, 4, 3, 2, 1], "auto"]
# And that's all!

And everyone is much happier for it.

You could likely also do a similar thing using the Dataloader library, but in practice I've found it simple enough to use this pattern to batch up queries. Maybe that will be the subject of another blog post! We'll see! 😀