Mailgun
API for email
Optimized Delivery
Receiving, Parsing & Storage
Tracking
Events
New customer logs
Full text search.
Filtering on selected properties (API).
Storing events for 30+ days (limited retention).
Extendable by adding nodes.
Resilient to failing nodes.
Elasticsearch:
Near real-time
Apache Lucene
Balancing Shards
Replicas
Index Design
One big index
Per day
Per customer
TTLs
Dropping indices
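A minimal sketch of the per-day approach with a fixed retention window, assuming the official elasticsearch Python client and a hypothetical events-YYYY.MM.DD naming scheme:
from datetime import datetime, timedelta
from elasticsearch import Elasticsearch
es = Elasticsearch(["localhost:9200"])
RETENTION_DAYS = 30
def index_for(day):
    # one index per day, e.g. "events-2013.09.18"
    return "events-%s" % day.strftime("%Y.%m.%d")
def drop_expired_indices(now=None, scan_back_days=60):
    # dropping a whole dated index is much cheaper than per-document TTLs
    now = now or datetime.utcnow()
    for age in range(RETENTION_DAYS, scan_back_days):
        es.indices.delete(index=index_for(now - timedelta(days=age)), ignore=404)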
Mappings
Field types
Templates
index.mapping.ignore_malformed
Types
string, integer, float, boolean
array, object, multi-field
ip, geo_point / geo_shape
attachment (blob)
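A sketch of what such a template might look like, via the Python client; the field names are made up, the types come from the list above:
from elasticsearch import Elasticsearch
es = Elasticsearch(["localhost:9200"])
es.indices.put_template(name="events", body={
    "template": "events-*",                           # applied to every per-day index
    "settings": {
        "index.mapping.ignore_malformed": True,       # one bad value doesn't reject the event
    },
    "mappings": {
        "event": {
            "properties": {
                "timestamp":  {"type": "date"},
                "account_id": {"type": "string", "index": "not_analyzed"},
                "client_ip":  {"type": "ip"},
                "location":   {"type": "geo_point"},
                # multi-field: analyzed for full text search, raw for exact filters
                "recipient": {
                    "type": "string",
                    "fields": {"raw": {"type": "string", "index": "not_analyzed"}},
                },
            }
        }
    },
})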
Analysis
Analyzer: Composed of tokenizer and filters.
Tokenizer: Splits a string into tokens
Filter: Case folding, stopwords, synonyms
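For illustration, a custom analyzer defined as index settings; the analyzer name and synonym list are invented:
analysis_settings = {
    "analysis": {
        "filter": {
            "event_synonyms": {
                "type": "synonym",
                "synonyms": ["bounce, bounced", "deliver, delivered"],  # hypothetical
            },
        },
        "analyzer": {
            "event_text": {
                "type": "custom",
                "tokenizer": "standard",            # splits the string into tokens
                "filter": ["lowercase",             # case folding
                           "stop",                  # stopwords
                           "event_synonyms"],       # synonyms
            },
        },
    }
}
# goes into the index settings (or the template above) and is referenced
# from a string field via "analyzer": "event_text"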
Shipping to Elasticsearch
Graphite
vör
Cluster stats
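A rough sketch of what that shipping amounts to: poll Elasticsearch cluster stats and push a few numbers to Graphite's plaintext port (host, port, and metric names are placeholders; vör itself does more than this):
import socket
import time
from elasticsearch import Elasticsearch
es = Elasticsearch(["localhost:9200"])
GRAPHITE = ("graphite.example.internal", 2003)      # placeholder host, plaintext port
def ship_cluster_stats():
    stats = es.cluster.stats()
    now = int(time.time())
    metrics = {
        "es.docs.count":  stats["indices"]["docs"]["count"],
        "es.store.bytes": stats["indices"]["store"]["size_in_bytes"],
    }
    sock = socket.create_connection(GRAPHITE)
    for name, value in metrics.items():
        sock.sendall(("%s %s %d\n" % (name, value, now)).encode("ascii"))
    sock.close()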
Shipping to Elasticsearch
Select ES type based on field
Flatten JSON user-defined fields
Flush size
Use fields to form statsd metric name
Timings
→ show Graphite dashboard
→ show Elasticsearch dashboard
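A sketch of the indexing side of those steps: flatten user-defined JSON fields into dotted keys and bulk-flush once the buffer reaches a flush size (field names and the flush size are illustrative):
from datetime import datetime
from elasticsearch import Elasticsearch, helpers
es = Elasticsearch(["localhost:9200"])
FLUSH_SIZE = 500
buffer = []
def flatten(value, prefix="", out=None):
    # {"user-variables": {"a": 1}} -> {"user-variables.a": 1}
    out = {} if out is None else out
    if isinstance(value, dict):
        for key, child in value.items():
            flatten(child, "%s.%s" % (prefix, key) if prefix else key, out)
    else:
        out[prefix] = value
    return out
def ship(event):
    index = "events-%s" % datetime.utcnow().strftime("%Y.%m.%d")
    buffer.append({"_index": index, "_type": "event", "_source": flatten(event)})
    if len(buffer) >= FLUSH_SIZE:
        helpers.bulk(es, buffer)        # one bulk request per flush
        del buffer[:]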
API authorization
Filtering proxy
vulcan
Authentication service
Custom headers
API gateway → ES API
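The filtering idea sketched as a function: whatever query a customer's request carries, the gateway wraps it so it can only match that customer's own events (the account_id field and the filtered-query form are assumptions):
def scope_query(account_id, user_query):
    # account_id comes from the authentication service, never from the request body
    return {
        "query": {
            "filtered": {
                "query": user_query,
                "filter": {"term": {"account_id": account_id}},
            }
        }
    }
# the gateway then forwards the scoped body to the ES search API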
Mailgun stats
60-80 million events per day
30 days log retention
30 indexes
5 shards, 1 replica
2x 90-110 GB per index (primary + replica copies)
≈6 TB total (30 indexes × ~200 GB)
Previous setup
Data: 8x 30GB RAM, 8vCPUs, 1TB disk
Logstash: 3x 8GB RAM, 4vCPUs
Vulcan/API: 2x 4GB RAM, 2vCPUs
Rackspace Cloud boxes. Note new performance boxes.
Current setup
Data+Logstash: 4x 64GB RAM, 24 cores, 2TB disk RAID 0
Vulcan/API: 2x 4GB RAM, 2vCPUs
ES config
Cluster name
Discovery
Lock 32GB JVM heap
indices.fielddata.cache.size: 40%
gateway.expected_nodes
discovery.zen.minimum_master_nodes
Discovery: multicast vs. unicast
Minimum master nodes: (#nodes / 2) + 1
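Put together as an elasticsearch.yml excerpt (1.x-era setting names; cluster name and hosts are made up, node count matches the four data nodes described below):
cluster.name: mailgun-events                       # never ship the default cluster name
bootstrap.mlockall: true                           # lock the JVM heap (ES_HEAP_SIZE=32g) in RAM
indices.fielddata.cache.size: 40%
gateway.expected_nodes: 4
discovery.zen.ping.multicast.enabled: false        # unicast, not multicast
discovery.zen.ping.unicast.hosts: ["es1", "es2", "es3", "es4"]
discovery.zen.minimum_master_nodes: 3              # (4 / 2) + 1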
ES cluster settings
cluster.routing.allocation.node_concurrent_recoveries (2)
cluster.routing.allocation.cluster_concurrent_rebalance (2)
indices.recovery.max_bytes_per_sec (20M)
indices.recovery.concurrent_streams (3)
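These are dynamic settings, so a sketch of applying them at runtime through the cluster settings API (values from the slide, Python client assumed):
from elasticsearch import Elasticsearch
es = Elasticsearch(["localhost:9200"])
es.cluster.put_settings(body={
    "persistent": {
        "cluster.routing.allocation.node_concurrent_recoveries": 2,
        "cluster.routing.allocation.cluster_concurrent_rebalance": 2,
        "indices.recovery.max_bytes_per_sec": "20mb",
        "indices.recovery.concurrent_streams": 3,
    }
})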
Routing
Routing on account_id
Watch for hotspots
Parallel vs. sequential searches
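A sketch of routing on account_id with the Python client (index name and account id are made up): the same routing value at index and search time keeps each account on one shard per index, so a search hits one shard instead of all five.
from elasticsearch import Elasticsearch
es = Elasticsearch(["localhost:9200"])
# index time: every event for this account lands on the same shard of the day's index
es.index(index="events-2013.09.18", doc_type="event", routing="acme",
         body={"account_id": "acme", "event": "delivered"})
# search time: only one shard per index is queried (beware of hotspots for huge accounts)
es.search(index="events-*", routing="acme",
          body={"query": {"term": {"account_id": "acme"}}})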
Connectivity failure
2 cloud machines lost connectivity with the cluster
Replica shards promoted to primary
New replicas allocated
Unallocated shards recovered upon reconnect
Rebalance after replication
Application logging
ssh
Find the log file
tail -f
sudo tail -F
udplog
udplog daemon on localhost
Log events as UDP datagrams: category, a colon, then a JSON blob (max ~64k)
Daemon ships to Scribe/Redis/RabbitMQ
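What a raw datagram boils down to, as a sketch (the port is a placeholder for wherever the local udplog daemon listens):
import json
import socket
def send_raw(category, fields, port=55647):         # placeholder port
    # wire format: category, a colon, then a JSON blob; one UDP datagram, so < ~64k
    datagram = "%s:%s" % (category, json.dumps(fields))
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(datagram.encode("utf-8"), ("127.0.0.1", port))
send_raw("python_logging", {"message": "Oops!", "logLevel": "ERROR"})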
Python logging facility
import logging
import socket
import warnings
from udplog.udplog import UDPLogger, UDPLogHandler
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.DEBUG)
logging.captureWarnings(True)
udplogger = UDPLogger(defaultFields={
'appname': 'example',
'hostname': socket.gethostname(),
})
root = logging.getLogger()
root.setLevel(logging.DEBUG)
root.addHandler(UDPLogHandler(udplogger, category="python_logging"))
Python logging facility
logger.debug("Starting!")
logger.info("This is a simple message")
logger.info("This is a message with %(what)s", {'what': 'variables'})
extra_logger = logging.LoggerAdapter(logger, {'bonus': 'extra data'})
extra_logger.info("Bonus ahead!")
a = {}
try:
print a['something']
except:
logger.exception("Oops!")
warnings.warn("Don't do foo, do bar instead!", stacklevel=2)
Python logging facility
{
"appname": "example",
"category": "python_logging",
"excText": "Traceback (most recent call last):\nFile \"doc/examples/python_logging.py\", line 39, in main\nprint a['something']\nKeyError: 'something'",
"excType": "exceptions.KeyError",
"excValue": "'something'",
"filename": "doc/examples/python_logging.py",
"funcName": "main",
"hostname": "localhost",
"lineno": 41,
"logLevel": "ERROR",
"logName": "__main__",
"message": "Oops!",
"timestamp": 1379508311.437895
}
→ Kibana screen for udplog
Thoughts
Reputation scoring
Elasticsearch for time series
Aggregations
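For instance, the kind of time-series question aggregations answer in one query: events per day, broken down by event type, for one account (field names reused from the sketches above):
from elasticsearch import Elasticsearch
es = Elasticsearch(["localhost:9200"])
resp = es.search(index="events-*", body={
    "size": 0,                                     # aggregations only, no hits
    "query": {"term": {"account_id": "acme"}},
    "aggs": {
        "per_day": {
            "date_histogram": {"field": "timestamp", "interval": "day"},
            "aggs": {"by_event": {"terms": {"field": "event"}}},
        }
    },
})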