TL;DR: The long poll fallback in Phoenix LiveView requires distribution. All Phoenix nodes must be connected to each other.
The issue
In a Phoenix 1.7 LiveView production app, we received reports of users experiencing a reload loop. The page loads, and after 7 seconds a flash error message appears: "Something went wrong! Hang in there while we get back on track". The page is then force reloaded, and this repeats indefinitely.
The Phoenix app is running in AWS ECS, with Cloudflare in front.
Investigation
First, I had to reproduce the issue. I found out that this was related to long poll, and it was easy to reproduce by setting the session storage keys:
sessionStorage.setItem("phx:fallback:ge", true);
sessionStorage.setItem("phx:fallback:LongPoll", true);
Long poll worked just fine in our staging environment, and I couldn’t reproduce it locally either. I went through the network requests using the developer tools in Safari, Firefox, and Chrome. In all browsers, the request was just canceled. No errors, no reason, nothing.
Production and staging are essentially identical. At first, we did wonder if there were some infrastructure differences that could cause odd behavior, but nothing looked out of the ordinary.
It was always the same three requests, with the third being canceled. The difference between production and staging was that in staging the second request returned {"status": 200}, while in production it returned {"status": 410}.
Enabling debugging with liveSocket.enableDebug() showed it was a timeout error:
[Log] phx-GAIBpoiUGJCzqWuC error: unable to join - – {reason: "timeout"} (app-500773fd7d20466d8a29a0f19ba374a4.js, line 2)
Cloudflare confirmed that it was the browser closing the connection, as it logged the request with a 499 Client Closed Request HTTP status.
So the issue had to be somewhere in Phoenix itself.
Phoenix.Transports.LongPoll
I looked into what could return a 410 status code. It happens when a new session is set up:
defp new_session(conn, endpoint, handler, opts) do
  # ...

  case DynamicSupervisor.start_child(Phoenix.Transports.LongPoll.Supervisor, spec) do
    :ignore ->
      conn |> put_status(:forbidden) |> status_json()

    {:ok, server_pid} ->
      data = {:v1, endpoint.config(:endpoint_id), server_pid, priv_topic}
      token = sign_token(endpoint, data, opts)
      conn |> put_status(:gone) |> status_token_messages_json(token, [])
  end
end
If a previous session can’t be resumed, a new session is set up:
defp dispatch(%{method: "GET"} = conn, endpoint, handler, opts) do
  case resume_session(conn, conn.params, endpoint, opts) do
    {:ok, new_conn, server_ref} ->
      listen(new_conn, server_ref, endpoint, opts)

    :error ->
      new_session(conn, endpoint, handler, opts)
  end
end
To resume an existing session, Phoenix.Transports.LongPoll will try to get the server reference from the PID stored in the token and then broadcast a subscribe message on that server reference:
defp resume_session(%Plug.Conn{} = conn, %{"token" => token}, endpoint, opts) do
  case verify_token(endpoint, token, opts) do
    {:ok, {:v1, id, pid, priv_topic}} ->
      server_ref = server_ref(endpoint.config(:endpoint_id), id, pid, priv_topic)

      new_conn =
        Plug.Conn.register_before_send(conn, fn conn ->
          unsubscribe(endpoint, server_ref)
          conn
        end)

      ref = make_ref()
      :ok = subscribe(endpoint, server_ref)
      broadcast_from!(endpoint, server_ref, {:subscribe, client_ref(server_ref), ref})

      receive do
        {:subscribe, ^ref} -> {:ok, new_conn, server_ref}
      after
        opts[:pubsub_timeout_ms] -> :error
      end

    _ ->
      :error
  end
end
That’s it!
The difference between WebSocket and LongPoll transports is that LongPoll requires all nodes to be connected, since to resume a session it needs to broadcast from the PID that the session was initiated on.
The difference between staging and production was that staging had just one server running, while production had two, and the AWS load balancer was consistently flipping between the two servers for each long poll request.
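To make that concrete, here is a minimal sketch (not the actual Phoenix internals; MyApp.PubSub and the topic name are made up) of why the resume fails when the follow-up poll lands on a different, unconnected node:

# Node A handled the first request; its LongPoll.Server process is
# subscribed to the session's private topic:
Phoenix.PubSub.subscribe(MyApp.PubSub, "phx:lp:abc123")

# Node B receives the next poll request. Resuming the session broadcasts a
# subscribe message on that same topic and waits for a reply:
Phoenix.PubSub.broadcast_from!(MyApp.PubSub, self(), "phx:lp:abc123", {:subscribe, make_ref()})

# Phoenix.PubSub only delivers the broadcast to node A if the two nodes are
# connected. Otherwise the receive in resume_session/4 times out after
# :pubsub_timeout_ms and returns :error, so the client is handed a brand new
# session ({"status": 410}) instead of the one it was trying to resume.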
So the solution is to enable distribution so that the LongPoll fallback works.
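One way to do this (a sketch, not our exact setup; the DNS query and node names are placeholders) is libcluster with DNS-based discovery, which can work with ECS service discovery:

# config/runtime.exs — assumes the release is started with distribution
# enabled (e.g. RELEASE_DISTRIBUTION=name) and a shared cookie.
config :libcluster,
  topologies: [
    ecs: [
      strategy: Cluster.Strategy.DNSPoll,
      config: [
        polling_interval: 5_000,
        query: "my-app.internal",
        node_basename: "my_app"
      ]
    ]
  ]

# In the application supervisor, start the cluster supervisor alongside the endpoint:
children = [
  {Cluster.Supervisor,
   [Application.get_env(:libcluster, :topologies, []), [name: MyApp.ClusterSupervisor]]},
  # ...
  MyAppWeb.Endpoint
]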
Conclusion
While I’m not sure of the exact reasons why LongPoll requires distribution, I suspect it is because a WebSocket’s persistent connection makes it very simple to keep process state on a single node, while LongPoll has to tie multiple isolated HTTP requests together to achieve the same process state. In any case, you must enable distribution so that PubSub messages reach all nodes and your Phoenix app functions correctly.
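Once distribution is enabled, it is easy to verify from a remote shell on one of the production nodes (the app and node names below are only illustrative):

# e.g. via `bin/my_app remote` on one of the ECS tasks
iex> Node.self()
:"my_app@10.0.1.23"
iex> Node.list()
[:"my_app@10.0.2.45"]  # every other Phoenix node should show up here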
I hope this post can save some time for others in a similar situation.