Architecture and Overview
High-level
Mozilla builds versions of Firefox, Fennec, etc., and the built files are uploaded to an S3 bucket. With each build, a buildhub.json file is created that contains all the information we intend to store and make searchable.
When a new file is added (or edited) in S3, it triggers an event notification that goes to an AWS SQS queue.
The daemon consumes the SQS queue and looks for an exact file name match. Since the SQS message only contains the name of the S3 object, the daemon triggers a function that downloads that file, validates its content, and stores it in PostgreSQL and in Elasticsearch.
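As a rough illustration of the step above, the sketch below pulls the S3 object names out of an SQS message body and checks them for an exact file name match. It assumes the message body is the standard S3 event notification JSON (a `Records` list with `s3.bucket.name` and `s3.object.key`); the function names `iter_s3_objects` and `is_buildhub_json` are hypothetical and not taken from the Buildhub code.

```python
import json
from typing import Iterator, Tuple
from urllib.parse import unquote_plus


def iter_s3_objects(message_body: str) -> Iterator[Tuple[str, str]]:
    """Yield (bucket, key) for each S3 record in an SQS message body."""
    payload = json.loads(message_body)
    # Messages without a "Records" list (e.g. S3 test events) yield nothing.
    for record in payload.get("Records", []):
        s3 = record.get("s3")
        if not s3:
            continue  # not an S3 record, so there is nothing to download
        bucket = s3["bucket"]["name"]
        # S3 URL-encodes object keys in event notifications.
        key = unquote_plus(s3["object"]["key"])
        yield bucket, key


def is_buildhub_json(key: str) -> bool:
    """Hypothetical matcher: the last path segment must be exactly buildhub.json."""
    return key.rsplit("/", 1)[-1] == "buildhub.json"


if __name__ == "__main__":
    body = json.dumps({
        "Records": [
            {"s3": {"bucket": {"name": "example-bucket"},
                    "object": {"key": "pub/firefox/buildhub.json"}}},
            {"s3": {"bucket": {"name": "example-bucket"},
                    "object": {"key": "pub/firefox/target.zip"}}},
        ]
    })
    for bucket, key in iter_s3_objects(body):
        print(bucket, key, is_buildhub_json(key))
```

Only the keys that pass the matcher would need to be downloaded; everything else can be ignored without touching S3.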
The four parts of Buildhub are:
The Django web server
The SQS consumer daemon script
PostgreSQL and Elasticsearch, which make it possible to search
A create-react-app based React app for the UI, which essentially runs SearchKit
Flow
TaskCluster builds, for example, a Firefox-79-installer.exe and a buildhub.json.
TaskCluster uploads these files into S3.
An S3 configuration triggers an SQS event that puts this S3-write into the queue.
Buildhub2 processor daemon polls the SQS queue and gets the file creation event.
Buildhub2 processor daemon downloads the buildhub.json file from S3 using Python boto3.
Buildhub2 processor daemon reads its payload and validates it against the JSON Schema.
Buildhub2 processor daemon inserts the JSON into PostgreSQL using the Django ORM.
The JSON is then inserted into Elasticsearch.
Validation
Before anything is stored, the data in the buildhub.json file is checked against the schema.yaml file. Since TaskCluster builds the buildhub.json file and this service picks it up asynchronously and with a delay, there is at the moment no easy way to know that an invalid buildhub.json file was built.
If you want to change the schema.yaml, make sure it matches the schema used inside mozilla-central when the buildhub.json files are created.
Keys
The following keys are tracked in the code, each with a different purpose. Generally, every key starts with a “context” keyword followed by an underscore, for example sqs_. That prefix is primarily there so a key can be traced back to the source code, but it also serves as a form of namespace.
sqs_process_buildhub_json_key
Timer.
How long it takes to consider a buildhub.json S3 key. This involves both downloading it from S3 and attempting to insert it into our database. That “attempt to insert” means the hash is calculated and looked up; if it is not already present the build is inserted, otherwise nothing is done.
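The "attempt to insert" described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say which hash function or serialization is used, so SHA-256 over key-sorted JSON is an assumption, the `store` dict stands in for the PostgreSQL table, and `build_hash` and `insert_if_new` are hypothetical names.

```python
import hashlib
import json
from collections import Counter


def build_hash(build: dict) -> str:
    """Hash a canonical JSON form so that key order does not change the result.

    Assumption: SHA-256 over key-sorted JSON; the real hashing scheme may differ.
    """
    canonical = json.dumps(build, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def insert_if_new(build: dict, store: dict, metrics: Counter) -> bool:
    """Insert the build unless its hash is already known.

    `store` is an in-memory stand-in for the database table.
    Returns True if the build was inserted, False if it was already present.
    """
    digest = build_hash(build)
    if digest in store:
        metrics["sqs_not_inserted"] += 1
        return False
    store[digest] = build
    metrics["sqs_inserted"] += 1
    return True


if __name__ == "__main__":
    store, metrics = {}, Counter()
    insert_if_new({"build": {"id": "1"}, "target": {"version": "79.0"}}, store, metrics)
    # Same content with a different key order is treated as already stored.
    insert_if_new({"target": {"version": "79.0"}, "build": {"id": "1"}}, store, metrics)
    print(dict(metrics))
```

Hashing the content rather than the S3 key means a re-uploaded but unchanged file does not produce a duplicate row.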
sqs_inserted
Incr.
Count of inserts that were new and actually inserted into the database coming from the SQS queue.
sqs_not_inserted
Incr.
Count of builds, from a buildhub.json, that we attempted to insert but that were rejected because they were already in the database.
sqs_messages
Incr.
This is a count of messages received by consuming the SQS queue. Assume this to be equal to the number of messages deleted from the SQS queue. Fewer messages can be deleted than received in the unexpected cases where a message triggers a Python exception (caught in Sentry).
Note! The total of sqs_inserted + sqs_not_inserted is not equal to sqs_messages, because some files do not match what we’re looking to process.
sqs_key_matched
Incr.
Incremented every time an S3 record is received whose S3 key we match. Expect this number to match sqs_inserted + sqs_not_inserted.
sqs_not_key_matched
Incr.
Every message received (see sqs_messages) can contain multiple types of messages. We only look into the S3 records. Of those, some S3 keys we can quickly ignore as not matched. That is what this increment is counting. So roughly, this number is sqs_messages minus sqs_inserted minus sqs_not_inserted.
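To make the relationships between the sqs_ counters concrete, here is a small simulation. It is a simplified sketch, not the daemon's logic: each message is modelled as a plain list of S3 keys, deduplication is done on the key itself rather than on a content hash, and the `tally` function is hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def tally(messages: List[List[str]], already_stored: Iterable[str] = ()) -> Counter:
    """Count the sqs_* keys for a batch of messages.

    Each message is a list of S3 keys (a simplification of the real payload).
    """
    metrics = Counter()
    stored = set(already_stored)
    for keys in messages:
        metrics["sqs_messages"] += 1
        for key in keys:
            if key.rsplit("/", 1)[-1] != "buildhub.json":
                metrics["sqs_not_key_matched"] += 1
                continue
            metrics["sqs_key_matched"] += 1
            if key in stored:
                metrics["sqs_not_inserted"] += 1
            else:
                stored.add(key)
                metrics["sqs_inserted"] += 1
    return metrics


if __name__ == "__main__":
    metrics = tally([
        ["a/buildhub.json"],
        ["a/target.zip"],
        ["a/buildhub.json"],  # same key again, so it is not inserted twice
        ["b/buildhub.json"],
    ])
    print(dict(metrics))
    # Holds for any input: every matched key is either inserted or not inserted.
    assert metrics["sqs_key_matched"] == (
        metrics["sqs_inserted"] + metrics["sqs_not_inserted"]
    )
```

In this example each message carries exactly one S3 key, so sqs_not_key_matched also equals sqs_messages minus sqs_inserted minus sqs_not_inserted. With several records per message that second relationship only holds approximately, which is why the text says "roughly".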
api_search
Timer.
How long it takes to fulfill each /api/search request. This time includes the Django request/response overhead as well as the time it takes to send the query to Elasticsearch and receive the response.
api_search_records
Gauge.
A count of the number of builds found by Elasticsearch in each /api/search request.
api_search_requests
Incr.
Measurement of the number of requests received to be proxied to Elasticsearch. Note that every incr is accompanied by a tag of the form method:$METHOD, for example method:POST.
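For illustration, this is what a tagged increment could look like on the wire. The text does not say which metrics backend is used, so the DogStatsD-style line format below is an assumption, and `format_incr` and `search_request_incr` are hypothetical helpers rather than Buildhub functions.

```python
from typing import Mapping, Optional


def format_incr(key: str, value: int = 1,
                tags: Optional[Mapping[str, str]] = None) -> str:
    """Format a counter increment as a DogStatsD-style line (assumed backend)."""
    line = f"{key}:{value}|c"
    if tags:
        line += "|#" + ",".join(f"{name}:{val}" for name, val in tags.items())
    return line


def search_request_incr(method: str) -> str:
    """Build the api_search_requests increment tagged with the HTTP method."""
    return format_incr("api_search_requests", tags={"method": method.upper()})


if __name__ == "__main__":
    print(search_request_incr("POST"))  # api_search_requests:1|c|#method:POST
```

Tagging by method lets the same key be broken down into, say, GET and POST traffic without introducing a separate key for each.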
backfill_inserted
Incr.
When a build is inserted from the backfill job that we did not already have. If this number goes up it means the SQS consumption is failing.
backfill_not_inserted
Incr.
When running the backfill, we iterate through all keys in the S3 bucket. To avoid having to download every single matched key, we maintain the keys’ full paths and ETags in the database to make the lookups faster. If a key and ETag are not recognized, we attempt to download and insert the build; if the insert turns out not to be needed, this increment goes up. Expect this number to stay very near zero in a healthy environment.
backfill_listed
Incr.
When running the backfill, this is a count of the number of S3 objects we download per page. To get an insight into the number of S3 objects considered, in total, use this number but over a window of time.
backfill_matched
Incr.
When running the backfill, we quickly filter all keys, per batch, down to the ones that we consider. This is a count of that. It’s an increment per batch. Similar to backfill_listed, to get an insight into the total, look at this count over a window of time.
backfill
Timer.
How long it takes to run the whole backfill job. This includes iterating over every single S3 key.
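The backfill counters described in this section can be sketched together as one loop. This is a minimal sketch under assumptions: pages are lists of (key, ETag) pairs, the `known` dict stands in for the database table of already seen keys and ETags, and `download_and_insert` is a hypothetical callable that returns True when a new build was actually inserted.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

S3Object = Tuple[str, str]  # (key, etag)


def backfill(pages: Iterable[List[S3Object]],
             known: Dict[str, str],
             download_and_insert: Callable[[str], bool],
             metrics: Counter) -> None:
    """Walk S3 listing pages and download only unrecognized key/ETag pairs."""
    for page in pages:
        metrics["backfill_listed"] += len(page)
        matched = [(key, etag) for key, etag in page
                   if key.rsplit("/", 1)[-1] == "buildhub.json"]
        metrics["backfill_matched"] += len(matched)
        for key, etag in matched:
            if known.get(key) == etag:
                continue  # this exact object was seen before; skip the download
            if download_and_insert(key):
                metrics["backfill_inserted"] += 1
            else:
                metrics["backfill_not_inserted"] += 1
            known[key] = etag  # remember it so the next run can skip it


if __name__ == "__main__":
    pages = [
        [("a/buildhub.json", "e1"), ("a/target.zip", "e2")],
        [("b/buildhub.json", "e3")],
    ]
    metrics = Counter()
    known = {"a/buildhub.json": "e1"}
    backfill(pages, known, lambda key: True, metrics)
    print(dict(metrics))
```

Because the key and ETag are recorded after every attempt, a second run over the same bucket downloads nothing, which matches the expectation that backfill_not_inserted stays near zero.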
kinto_migrated
Incr.
When we run the migration from Kinto, a count of the number of messages (per batch) that we received from batch fetching from the legacy Kinto database.
kinto_inserted
Incr.
A count of the number of builds that are inserted from the Kinto migration. One use of this is that you can run the Kinto migration repeatedly until this number no longer increments.