Tmdb: A Caching Wrapper for The Movie Database

honey the codewitch

Sep 5, 2019

CPOL

Easily and efficiently query api.themoviedb.org/3/ using this wrapper

Introduction

While experimenting with my JSON libraries and building some new code around them, I decided to create a real-world use case by plugging them into the content at themoviedb.org and accessing its API through the provided JSON/REST interface. It started as a simple demo app.

That use case exploded into over 30 wrapper classes covering nearly all aspects of the API, and a virtually automatic in-memory caching system for JSON backed objects.

Disclaimer: This project started as a demo, and I haven't written tests for it. Fortunately, it's easy to maintain, and the bugs most likely to turn up should be easy to fix. Use at your own risk. I am maintaining it, though, so if you have bugs to report, I'll get right on them.

Background

themoviedb.org catalogs TV shows, movies, and actors so that you can retrieve all kinds of information on them. It exposes a JSON/REST API you can use to query the content. This data is useful for all kinds of things, from presenting friendly show information in programs like Plex to organizing and cataloging your own digital show and movie collection (if you're the type to rip your content like I am - I'm too lazy to swap discs!).

JSON, meanwhile, is a simple interchange format that has pretty much supplanted XML as the data interchange format du jour on the web. It's lighter than XML, shockingly easy to parse, and very intuitive. It's as simple as humanly possible, which can be a blessing but is sometimes a curse (there is no built-in schema information, for example).

This project covers both, but we'll start with a quick rundown of the JSON API that ships with it.

Using the JSON API

This is just an overview. For the full writeup, including the caching RPC mechanism, see my article on this.

This TMDb library is actually just the demo project for that JSON library, but I wanted something real-world to hit it with.

Conceptualizing this Mess

Most of the time, the JSON isn't represented as text, but rather as a nested object graph. Each JSON element maps to some .NET type: the JSON object {} maps to IDictionary<string,object>, the JSON array [] maps to IList<object>, the JSON string and boolean types map to their respective .NET types, and null maps to a null System.Object reference. The JSON numeric type maps to System.Int32, System.Int64, System.Numerics.BigInteger, or System.Double, depending on which will hold the value, with the first one that fits being preferred.
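To make the numeric rule concrete, here is a minimal sketch of a "narrowest type that fits" conversion. It is not the library's code: JsonNumberSketch and ParseNumber are hypothetical names, and the real parser may well decide differently in edge cases.

```csharp
using System;
using System.Globalization;
using System.Numerics;

static class JsonNumberSketch
{
    // Hypothetical helper: returns the number boxed as the first of
    // int, long, BigInteger, or double that can hold it.
    public static object ParseNumber(string text)
    {
        var inv = CultureInfo.InvariantCulture;
        if (int.TryParse(text, NumberStyles.AllowLeadingSign, inv, out int i)) return i;
        if (long.TryParse(text, NumberStyles.AllowLeadingSign, inv, out long l)) return l;
        if (BigInteger.TryParse(text, NumberStyles.AllowLeadingSign, inv, out BigInteger b)) return b;
        // anything with a fraction or exponent ends up here
        return double.Parse(text, NumberStyles.Float, inv);
    }

    public static void Main()
    {
        foreach (var s in new[] { "42", "5000000000", "123456789012345678901234567890", "1.5" })
            Console.WriteLine(s + " -> " + ParseNumber(s).GetType());
    }
}
```

One consequence of a scheme like this is that the runtime type of a number depends on its magnitude, so code that reads numeric fields should not assume they are always System.Int32.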

Mapping the objects this way creates a tree of objects out of the JSON. If we interconnect those trees, and recycle branches that are the same, we create a web or a graph, since one node can have multiple parents. That's why we call it an object graph rather than a tree. If we serialize the graph, it will be represented as a tree, with "recycled" branches being copied into all locations where they are referenced. If you create a parent that is referenced by one of its descendants in these graphs (a loop) and then try to serialize it, the behavior is undefined and generally very very naughty.

Coding this Mess

JsonObject doesn't have to be used, but it thinly wraps a dictionary and provides dynamic/DLR call sinking where the keys are property names. It also provides some methods for reading, writing, and manipulating JSON, like CopyTo(), Get(), CreateSubtree(), and Select(), the last of which takes a JSONPath expression and queries the graph with it. Once again, if you try to query a recursive graph, the results are undefined, and typically bad. As a general rule, don't try to make your JSON recursive. Just because you can do something doesn't mean you should. Equality is done using value semantics. That is to say, two objects are considered equal if they contain the same data. This allows entire JSON trees to be used as dictionary keys, but use it carefully. That feature is helpful, but not fast at all.
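As an illustration of what value semantics means for these graphs, here is a rough sketch of a deep comparison over plain dictionaries and lists. It is not how JsonObject implements equality; JsonValueEquality and DeepEquals are hypothetical names. It also shows why this is slow (the whole tree is walked) and why a recursive graph is trouble (the walk never ends).

```csharp
using System;
using System.Collections.Generic;

static class JsonValueEquality
{
    // Hypothetical helper: true if two JSON graphs contain the same data.
    // Warning: never terminates on a graph that contains a loop.
    public static bool DeepEquals(object a, object b)
    {
        if (ReferenceEquals(a, b)) return true;
        if (a == null || b == null) return false;
        if (a is IDictionary<string, object> da && b is IDictionary<string, object> db)
        {
            if (da.Count != db.Count) return false;
            foreach (var kv in da)
                if (!db.TryGetValue(kv.Key, out var other) || !DeepEquals(kv.Value, other))
                    return false;
            return true;
        }
        if (a is IList<object> la && b is IList<object> lb)
        {
            if (la.Count != lb.Count) return false;
            for (int i = 0; i < la.Count; i++)
                if (!DeepEquals(la[i], lb[i])) return false;
            return true;
        }
        // strings, booleans, and numbers compare by value
        return a.Equals(b);
    }

    public static void Main()
    {
        var x = new Dictionary<string, object> { ["id"] = 1, ["tags"] = new List<object> { "a", "b" } };
        var y = new Dictionary<string, object> { ["tags"] = new List<object> { "a", "b" }, ["id"] = 1 };
        Console.WriteLine(ReferenceEquals(x, y)); // different instances
        Console.WriteLine(DeepEquals(x, y));      // same data
    }
}
```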

JsonArray similarly wraps a list and provides value semantics, along with a ToArray() method that helps you convert the array to a more usable type. One of the overloads takes a lambda that lets you create each destination array element from the source JSON element.

// merge src into dst
var src = JsonObject.Parse("{\"id\":2,\"foo\":{\"foobar\":\"baz\"}}");
Console.WriteLine(src);
var dst = JsonObject.Parse("{\"id\":1,\"foo\":{\"id\":3} }");
Console.WriteLine(dst);
JsonObject.CopyTo(src, dst);
Console.WriteLine(dst);

CreateSubtree() takes a JSON object and a series of path segments and creates that path in the JSON if it doesn't already exist. Either way, the path is followed and the final element is returned. This can be used to quickly and easily create or traverse paths in the JSON data. It throws if you try to replace something that already exists but isn't an object, such as when you try to create a path through an existing array.

// create /foobar/baz under obj
var obj = JsonObject.Parse("{\"foo\":\"bar\"}");
var obj2 = JsonObject.CreateSubtree(obj, "foobar", "baz");
obj2.Add("result", 1);
Console.WriteLine(obj);
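For readers who want to see the create-or-follow behavior spelled out, here is a sketch of the same idea over plain dictionaries. This is an assumption about the general approach, not the library's implementation; SubtreeSketch is a hypothetical name, and the real method may treat nulls or other edge cases differently.

```csharp
using System;
using System.Collections.Generic;

static class SubtreeSketch
{
    // Hypothetical helper: follows the path, creating missing objects on the
    // way, and returns the object at the end of the path.
    public static IDictionary<string, object> CreateSubtree(
        IDictionary<string, object> root, params string[] path)
    {
        var current = root;
        foreach (var segment in path)
        {
            if (!current.TryGetValue(segment, out var child))
            {
                // nothing there yet, so create an empty object
                var created = new Dictionary<string, object>();
                current[segment] = created;
                current = created;
            }
            else if (child is IDictionary<string, object> existing)
                current = existing; // already an object, just follow it
            else
                throw new InvalidOperationException(
                    "\"" + segment + "\" already exists and is not an object");
        }
        return current;
    }

    public static void Main()
    {
        var root = new Dictionary<string, object> { ["foo"] = "bar" };
        var leaf = CreateSubtree(root, "foobar", "baz");
        leaf["result"] = 1;
        // a second call follows the existing path instead of recreating it
        Console.WriteLine(ReferenceEquals(leaf, CreateSubtree(root, "foobar", "baz")));
    }
}
```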

Then there's Parse(), LoadFrom(), ReadFrom(), and LoadFromUrl(), each taking some form of input (like some text or a file) and returning a JSON object tree from it. These complement ToString(), SaveTo(), and WriteTo() respectively, though there is no straightforward way to save to a URL! See the above code for Parse(), but all of the loading and reading methods work the same way, and the writing and saving methods work that way in reverse. Simple.

Alternatively, you can use the JsonTextReader to stream JSON without creating an in-memory model of the entire document, loading only the parts you choose. This works very much like XmlReader/XmlTextReader. However, it also includes ParseSubtree(), which returns the current subtree as a JSON object, SkipSubtree(), which quickly skips the subtree, and SkipToField(), which advances to the specified field.

The library also includes a simple JSON/REST based RPC method with caching, used extensively by the Tmdb project.

Using the Tmdb API

Conceptualizing the root API

This API loosely wraps The Movie Database API described here. Most of the names have been kept the same, but adapted to .NET naming conventions. For example, the JSON field "imdb_id" becomes ImdbId.
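The naming convention can be illustrated with a tiny converter. This is purely a sketch: NameSketch and ToPascalCase are hypothetical, and nothing here implies the wrapper generates its property names this way rather than having them written by hand.

```csharp
using System;
using System.Text;

static class NameSketch
{
    // Hypothetical helper: "imdb_id" -> "ImdbId"
    public static string ToPascalCase(string jsonName)
    {
        var sb = new StringBuilder(jsonName.Length);
        bool upperNext = true; // the first letter is always capitalized
        foreach (char c in jsonName)
        {
            if (c == '_') { upperNext = true; continue; } // drop the underscore
            sb.Append(upperNext ? char.ToUpperInvariant(c) : c);
            upperNext = false;
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(ToPascalCase("imdb_id"));        // ImdbId
        Console.WriteLine(ToPascalCase("first_air_date")); // FirstAirDate
    }
}
```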

Rooting this entire mess is Tmdb, a static class that exposes methods for accessing the TMDb API endpoint, like SearchMovies(), as well as the ApiKey property. Go here once you have created an account at themoviedb.org and create your API key. Another property is Language, which allows you to set the language to your desired ISO language code, like "en-US" for English. Neither of these properties should be changed once you've started making requests, so set them first. The Json property gives you access to the thread-local cache for TMDb entities (see the section on caching).

Aside from the API access functions, there are several cache functions (covered later) and some helpers, including GetImageUrl(), which allows you to map an image "path"** you get from TMDb to a real URL.

** Not actually a path. It's just part of a path, and it's invalidly rooted as returned from TMDb, so it's not usable by itself.

Conceptualizing the Dual Level Caching

For performance reasons, there are two levels of cache.

Primary: The primary cache is in-memory, per-thread, and very aggressive. It never expires its contents. The only time it updates is when you do something that would otherwise cause it to refresh some content from the server. It's designed to be used for a batch and then thrown away by manually expiring the cache with Tmdb.ClearCache(). Some of the data returned from the server is never cached, like the sessions or search queries (however, in the latter case, the individual results are cached, but not the search itself). Generally, cached items are accessed via properties and uncached items are accessed via methods. It's also possible to load, save, and merge whole caches with an eye toward making this distributed (like for a server farm).

Secondary: The secondary cache is URL-based, and basically works like your web browser's cache. It will periodically recheck with the server, though you can change how frequently it does. When an item is said to be uncached, it can still be cached at this level. This means that "uncached", as it's used in this article and the source comments, refers exclusively to the primary cache. The secondary cache is global to the application and automatically managed by the OS and runtime environment. You can configure it through the Tmdb.CacheLevel property.

Conceptualizing Tmdb Entities

Each of the significant items of data in the TMDb API derives from TmdbEntity. This class simply represents the base of all entity objects in the library. Each entity wraps a particular set of data from TMDb. For example, there is TmdbShow to represent a TV show, and TmdbMovie to represent a movie. The entity by itself doesn't do much. Mainly, it's just a contract: it stipulates a constructor that takes normalized JSON data, and it exposes its data via the Json property. Implied is that it backs all of its state with that Json object. This is important.

Some entities can retrieve themselves from the remote endpoint and from the cache. These entities have a unique identity called PathIdentity represented as a series of path segments. This is the location of the object in both the remote repository and the local cache. They (mostly) share data layout so it stays simple. These objects have protected methods for managing and retrieving items from the cache. See the code for TmdbCachedEntity.

Furthermore, some entities may also support a numeric or string identifier (id). When a cached entity does, it is represented by TmdbCachedEntityWithId for integer ids and TmdbCachedEntityWithId2 for string ids. These classes just provide an overloaded constructor that takes the id, which is exposed through the Id property.

Generally, an entity supports rooting itself in the Tmdb.Json object cache, and fetching itself from the remote TMDb API endpoint. An entity simply wraps the Json underneath it. These wrappers can be used and thrown away repeatedly. It's the underlying JSON that sticks around and is important. Every entity has an associated object graph representing its JSON, exposed by the Json property. The other properties on the object simply wrap this data. If the requested data is not found in its Json graph, the entity goes to the remote server to fetch what it needs, which it then stores in the Json graph. Each Json graph is rooted somewhere under Tmdb.Json which, again, serves as the root of the entire graph.

Some entities cannot fetch themselves from a server. Maybe they just represent some child data of another entity and they don't have their own TMDb id. In this case, the object is cached as part of its parent entity's caching, and the entity can only be created with a full JSON graph representing all its properties. This is because obviously, it cannot fetch data from the server as needed, since it has no way to identify itself, so all of its data must be present at construction. This is needed because much of the TMDb API returns queries with multilevel nested data underneath, and all of this data must be represented. Some entities are never cached. Entities like the TmdbSession simply don't make sense to cache.

Conceptualizing a TmdbEntity Derivation

If we create a new TmdbShow object (a wrapper for a TMDb TV show), we can pass an id to the constructor. This creates a JSON object that initially contains only the id. The other JSON properties, like "name" and "first_air_date", will be fetched as needed when you query the show's Name and FirstAirDate respectively. Note that typically, fetching one property retrieves most of the properties for an entity, because TMDb returns a lot of data for each query. We use it all, throwing it in the cache to avoid hitting the remote server more than necessary.

using TmdbApi;
...
Tmdb.ApiKey = "myApiKey";         // change this to something valid
...
var show = new TmdbShow(2919);    // This is the TV show "Burn Notice"
// doing this creates no network traffic yet
Console.WriteLine(show.Json);     // currently only writes {"id":2919}
// the following will cause an http request
Console.WriteLine(show.Name);     // writes "Burn Notice" to the console.
// the following will *not* cause a request because the last one
// already fetched this data.
Console.WriteLine(show.Overview); // writes the burn notice overview
Console.WriteLine(show.Json);     // writes a lot of json
...
var show2 = new TmdbShow(2919);
// none of the following causes network traffic
Console.WriteLine(show==show2);   // writes "True" since they have the same id.
Console.WriteLine(show2.TotalSeasons); // writes 7
Console.WriteLine(show2.Json);    // same as writing show.Json - both point to the same place.
...
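The behavior above can be modeled in a few lines. The following is a simplified, self-contained sketch of a throwaway wrapper over shared, lazily filled JSON. It is not the actual TmdbShow code: ShowSketch, its Cache dictionary, and FetchFromServer (a fake that stands in for the HTTP request) are all hypothetical, and the real primary cache is per-thread and rooted under Tmdb.Json.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for an entity like TmdbShow
class ShowSketch
{
    // stand-in for the shared cache root: one JSON object per id
    static readonly Dictionary<int, IDictionary<string, object>> Cache =
        new Dictionary<int, IDictionary<string, object>>();

    public static int FetchCount; // how many fake "requests" were made

    // fake remote call: returns every field at once, as the server tends to
    static IDictionary<string, object> FetchFromServer(int id)
    {
        FetchCount++;
        return new Dictionary<string, object>
            { ["id"] = id, ["name"] = "Burn Notice", ["overview"] = "(overview text)" };
    }

    public IDictionary<string, object> Json { get; }

    public ShowSketch(int id)
    {
        // reuse the cached JSON for this id, or root a new {"id": id} object
        if (!Cache.TryGetValue(id, out var json))
        {
            json = new Dictionary<string, object> { ["id"] = id };
            Cache[id] = json;
        }
        Json = json;
    }

    object GetField(string key)
    {
        if (!Json.TryGetValue(key, out var value))
        {
            // cache miss: fetch and merge everything the server returned
            foreach (var kv in FetchFromServer((int)Json["id"]))
                Json[kv.Key] = kv.Value;
            Json.TryGetValue(key, out value);
        }
        return value;
    }

    public string Name => (string)GetField("name");
    public string Overview => (string)GetField("overview");

    public static void Main()
    {
        var show = new ShowSketch(2919);
        Console.WriteLine(show.Name);     // first access triggers the fake fetch
        Console.WriteLine(show.Overview); // served from the merged JSON
        var show2 = new ShowSketch(2919); // new wrapper, same underlying JSON
        Console.WriteLine(ReferenceEquals(show.Json, show2.Json));
        Console.WriteLine(FetchCount);    // one fetch in total
    }
}
```

The point of the design is that the wrapper objects hold no state of their own, so creating a second wrapper for the same id costs nothing and sees everything the first one fetched.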

Conceptualizing Paged Functions

Nobody wants to fetch movies and shows by id. You'll usually use SearchShows()/DiscoverShows() to find TV shows (and SearchMovies()/DiscoverMovies() to find movies). These are what are called "paged functions".

// fetch the top result (no error checking)
var show = Tmdb.SearchShows("Star Trek", minPage: 0, maxPage: 0)[0];

Paged functions return portions of a query one or more pages at a time. These functions take a minPage and maxPage. If unspecified, all results are returned. Pages are one-based on the server, but zero-based in this library, so the first page is 0, not 1. Each page requires its own HTTP request, so returning several pages can generate a lot of traffic.
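To make the page numbering and the one-request-per-page cost explicit, here is a small sketch of a paging loop. It assumes a caller-supplied fetch delegate in place of the HTTP request; PagingSketch and FetchPages are hypothetical names, not part of the Tmdb class.

```csharp
using System;
using System.Collections.Generic;

static class PagingSketch
{
    // Hypothetical helper: minPage and maxPage are zero-based and inclusive.
    // fetchServerPage stands in for one HTTP request and takes the server's
    // one-based page number.
    public static List<string> FetchPages(
        Func<int, IList<string>> fetchServerPage, int minPage, int maxPage)
    {
        var results = new List<string>();
        for (int page = minPage; page <= maxPage; page++)
            results.AddRange(fetchServerPage(page + 1)); // one call per page
        return results;
    }

    public static void Main()
    {
        int requests = 0;
        Func<int, IList<string>> fake = serverPage =>
        {
            requests++;
            return new List<string> { "item from server page " + serverPage };
        };
        var results = FetchPages(fake, 0, 1); // library pages 0 and 1
        Console.WriteLine(string.Join(", ", results));
        Console.WriteLine(requests); // two pages, two requests
    }
}
```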

Paged functions are not currently cached, though they may be down the road. The individual results they return are each cached, item by item, but the search request itself is not stored, so if it is run again, the search will be run again. For example, if you call SearchMovies() with "Hunting", each movie returned (like "Good Will Hunting") will be stored in the cache, but running the search again will still result in at least one HTTP request. Therefore, minimize your use of these functions where you can.

For documentation on the individual fields, please refer to the TMDb API documentation here.

History

5th September, 2019 - Initial submission