During my third co-op, I developed and shipped a scalable cloud metadata search service at Yelp. The core service is written in Python, and I used Terraform and Puppet to provision and configure the infrastructure, keeping it reproducible and scalable. It is hosted on AWS using a serverless architecture built on API Gateway and Lambda.

What is a cloud metadata search service? Put simply, it keeps an inventory of all the servers that Yelp has and the facts about them. Examples of facts include the software provisioned on a machine, operating system information, available memory, geographical region, and hundreds of other fields.

This service presents the following features:

  • Expressive discovery using custom DSL querying capabilities in 2 languages (JSON and query string)
  • Significantly faster querying (<100ms per query, real-time event-based invalidation, 5 minute refreshes)
  • Highly scalable serverless architecture that is reproducible and configured as code
  • Deployment process using S3 and Jenkins, unit tests and end-to-end tests using Docker

My favourite part of this project is that I went through all stages of the development process, and got to see something created from scratch go through to production!

Formerly, there was a solution using MCollective that would broadcast each request to all known hosts and collect their responses. The issue with that approach is that query time is essentially a function of the slowest host to respond, since it waited for every host to reply before filtering the results.

The inventory service works in the other direction: each host runs a cron job that pushes that machine's metadata to the inventory service.
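To give a sense of what that push looks like, here is a minimal, hypothetical sketch of the host-side agent. The endpoint, field names, and helpers are illustrative assumptions, not the actual Yelp implementation.

import json
import os
import platform
import socket
import time
import urllib.request

# Hypothetical endpoint for illustration only.
INVENTORY_URL = "https://inventory.example.com/v1/hosts"

def collect_facts():
    """Gather a small sample of metadata about this machine."""
    return {
        "hostname": socket.getfqdn(),
        "os": platform.system(),
        "os_release": platform.release(),
        "cpu_count": os.cpu_count(),
        "reported_at": int(time.time()),
    }

def push_facts(facts):
    """POST the facts document to the inventory service."""
    req = urllib.request.Request(
        INVENTORY_URL,
        data=json.dumps(facts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    push_facts(collect_facts())

A cron entry would simply run this script on a schedule, so the inventory stays fresh without the service ever having to reach out to hosts.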

The service has its own DSL query language, which can be provided in JSON:

{
    "filter": {
        "type": "AND",
        "args": [{
            "type": "matching",
            "arg": "hostname: interndev1-us[east,west]1dev*"
        }, {
            "type": "matching",
            "arg": "cpu_count >= 16"
        }]
    },
    "sort": "uptime asc, hostname desc",
    "limit": 1000,
    "offset": 10,
    "index_epoch_threshold": 2000
}

Alternatively, the same query can be written using the string syntax (inspired by Apache Solr's streaming syntax):

$ search(filter(AND("hostname: interndev1-us[east,west]1dev*", "cpu_count >= 16")), sort("uptime asc, hostname desc"), limit(1000), offset(10), index_epoch_threshold(2000))

As part of this project, I also implemented query parsers for both formats that convert a query into an Elasticsearch URI query. This included some interesting optimizations along with a basic n-ary preorder tree traversal to assemble the query string.
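As a simplified illustration of that traversal (the real parser also rewrites field names and operators, which the sketch below does not), the JSON filter tree from the example above can be rendered into a boolean query string with a small preorder walk:

def to_query_string(node):
    """Render a filter node as a boolean expression via a preorder walk."""
    if node["type"] in ("AND", "OR"):
        joined = f" {node['type']} ".join(to_query_string(child) for child in node["args"])
        return "(" + joined + ")"
    if node["type"] == "matching":
        # The production parser also handles field/operator rewriting here;
        # this sketch passes the argument through untouched.
        return node["arg"]
    raise ValueError("unknown node type: " + node["type"])

filter_tree = {
    "type": "AND",
    "args": [
        {"type": "matching", "arg": "hostname: interndev1-us[east,west]1dev*"},
        {"type": "matching", "arg": "cpu_count >= 16"},
    ],
}

print(to_query_string(filter_tree))
# (hostname: interndev1-us[east,west]1dev* AND cpu_count >= 16)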

One of the greatest challenges was dealing with a schemaless design. Since there are hundreds of fields and teams are constantly adding or removing them, it isn't manageable to manually declare every field in a fixed schema. Luckily, Elasticsearch supports a dynamic schema, where it infers each field's type. However, this quickly proved unreliable because we experienced type collisions (e.g. a field initially indexed as bool would later arrive as int). Such a type collision would raise errors when ingesting metadata, later dropping that instance from search results and returning incomplete data!

The solution was a dynamic templated mapping with four types: *_str, *_bool, *_long, *_float. On ingestion, the service appends to each field name the type it infers for that field's value. The suffix always conforms to the equivalent Elasticsearch type, so every ingestion request goes through. The remaining concern is that a field's values can end up more sparse, scattered across multiple suffixed fields, which is at worst a 4x performance hit. However, this is rarely the case, and guaranteeing that documents will be ingested is more critical than slightly slower retrieval.
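A rough sketch of the ingest-side renaming, assuming Python's native types map onto the four suffixes (the actual inference logic is more involved):

def suffix_for(value):
    """Pick the type suffix for a value; bool is checked first since it is an int subclass."""
    if isinstance(value, bool):
        return "_bool"
    if isinstance(value, int):
        return "_long"
    if isinstance(value, float):
        return "_float"
    return "_str"

def normalize_document(doc):
    """Return a copy of a metadata document with type-suffixed field names."""
    return {field + suffix_for(value): value for field, value in doc.items()}

print(normalize_document({"cpu_count": 16, "is_prod": False, "region": "us-west-1"}))
# {'cpu_count_long': 16, 'is_prod_bool': False, 'region_str': 'us-west-1'}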

How does this work for retrievals? Each retrieval query needs to be tokenized, run through type introspection against the index mapping, and adjusted accordingly. This includes expanding any numerical field query into (field_long:val OR field_float:val). While this is a tricky process across all the arguments the inventory service accepts, the strict DSL, a careful implementation, and lots of unit tests made it work well.
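For example, the numeric expansion step might look roughly like this, assuming the field and operator have already been tokenized (a simplification of the real tokenizer):

def expand_numeric(field, operator, value):
    """Expand a numeric comparison across both numeric type suffixes."""
    clauses = [f"{field}{suffix}:{operator}{value}" for suffix in ("_long", "_float")]
    return "(" + " OR ".join(clauses) + ")"

print(expand_numeric("cpu_count", ">=", 16))
# (cpu_count_long:>=16 OR cpu_count_float:>=16)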

The final benchmarks using Apache Bench and AWS CloudWatch show <150ms p99 latency for indexing queries and <90ms p99 for search queries, while serving over one million API requests daily.

To conclude, I learned a lot this term with an amazing mentor who was extremely helpful throughout the entire process - something which I think is invaluable and definitely made this one of my favourite terms.