Configuration Reference

This section covers configuring the cluster components (locators, leads, and servers), the SnappyData Smart Connector, environment settings, logging, off-heap memory sizing, and firewall and port usage.

Configuring Cluster Components

Configuration files for locators, leads, and servers must be created in the conf folder located in the SnappyData home directory, with the names locators, leads, and servers.

To do so, you can copy the existing template files servers.template, locators.template, and leads.template, and rename them to servers, locators, and leads. These files should contain the hostnames of the nodes (one per line) on which you intend to start the members. You can modify the properties to configure individual members.
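
For example, the template files can be copied into place from the conf directory (<SnappyData_home> is a placeholder for your installation path):

$ cd <SnappyData_home>/conf
$ cp locators.template locators
$ cp servers.template servers
$ cp leads.template leads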

Tip

  • For system properties (set in the conf/leads, conf/servers, and conf/locators files), -D and -XX: can be used. The -J prefix is NOT required for -D and -XX options.

  • Instead of starting the whole SnappyData cluster, you can start and stop individual components locally on a system, as shown in the example below.
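
A minimal sketch of starting and stopping a single locator and data server locally with the snappy command (directory paths and port values are placeholders; the exact option list may vary by release):

$ ./bin/snappy locator start -dir=/tmp/snappy/locator1 -peer-discovery-port=10334
$ ./bin/snappy server start -dir=/tmp/snappy/server1 -locators=localhost:10334

$ ./bin/snappy server stop -dir=/tmp/snappy/server1
$ ./bin/snappy locator stop -dir=/tmp/snappy/locator1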

Configuring Locators

Locators provide the discovery service for the cluster. Clients (for example, JDBC) connect to a locator and discover the lead and data servers in the cluster; upon initial connection, clients are automatically connected to the discovered data servers. Cluster members (data servers and lead nodes) also discover each other using the locator. Refer to the Architecture section for more information on the core components.

It is recommended to configure two locators (for HA) in production using the conf/locators file located in the <SnappyData_home>/conf directory.

In this file, you can specify:

  • The hostname on which a SnappyData locator is started.

  • The startup directory where the logs and configuration files for that locator instance are located.

  • SnappyData specific properties that can be passed.

You can refer to the conf/locators.template file for some examples.

Example: To start two locators on node-a:9999 and node-b:8888, update the configuration file as follows:

$ cat conf/locators
node-a -peer-discovery-port=9999 -dir=/node-a/locator1 -heap-size=1024m -locators=node-b:8888
node-b -peer-discovery-port=8888 -dir=/node-b/locator2 -heap-size=1024m -locators=node-a:9999

Configuring Leads

Lead nodes primarily run the SnappyData-managed Spark driver. There is one primary lead node at any given time, but there can be multiple secondary lead node instances on standby for fault tolerance. Applications can run jobs using the REST service provided by the lead node. Most SQL queries are automatically routed to the lead node to be planned and executed through a scheduler. You can refer to the conf/leads.template file for some examples.

Create the configuration file (leads) for leads in the <SnappyData_home>/conf directory.

Note

In the conf/spark-env.sh file, set the SPARK_PUBLIC_DNS property to the public DNS name of the lead node. This enables the Member Logs to be displayed correctly to users accessing the SnappyData Monitoring Console from outside the network.
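
For example, in conf/spark-env.sh (the DNS name below is a placeholder for the lead node's public DNS name):

export SPARK_PUBLIC_DNS=lead.example.com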

Example: To start a lead on node-l, set spark.executor.cores to 10 on all servers, and change the Spark UI port from 5050 to 9090, update the configuration file as follows:

$ cat conf/leads
node-l -heap-size=4096m -spark.ui.port=9090 -locators=node-b:8888,node-a:9999 -spark.executor.cores=10

Configuring Secondary Lead

To configure secondary leads, you must add the required number of entries in the conf/leads file.

For example:

$ cat conf/leads
node-l1 -heap-size=4096m -locators=node-b:8888,node-a:9999
node-l2 -heap-size=4096m -locators=node-b:8888,node-a:9999

In this example, two leads (one on node-l1 and another on node-l2) are configured. When you launch the cluster using sbin/snappy-start-all.sh, one of them becomes the primary lead and the other becomes the secondary lead.
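
For example, once both entries are in place, the whole cluster can be launched with the script mentioned above:

$ sbin/snappy-start-all.sh
# Companion stop script (assumed to follow the same naming convention)
$ sbin/snappy-stop-all.sh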

Configuring Data Servers

Data servers host data, embed a Spark executor, and also contain a SQL engine capable of executing certain queries independently and more efficiently than the Spark engine. Data servers use intelligent query routing to either execute a query directly on the node or pass it to the lead node for execution by Spark SQL. You can refer to the conf/servers.template file for some examples.

Create the configuration file (servers) for data servers in the <SnappyData_home>/conf directory.

Example: To start two servers on node-c, update the configuration file as follows:

$ cat conf/servers
node-c -dir=/node-c/server1 -heap-size=4096m -memory-size=16g -locators=node-b:8888,node-a:9999
node-c -dir=/node-c/server2 -heap-size=4096m -memory-size=16g -locators=node-b:8888,node-a:9999

List of Properties

Refer to the list of SnappyData properties.

Specifying Configuration Properties using Environment Variables

SnappyData configuration properties can be specified using the environment variables LOCATOR_STARTUP_OPTIONS, SERVER_STARTUP_OPTIONS, and LEAD_STARTUP_OPTIONS for locators, servers, and leads respectively. These environment variables are useful for specifying properties that are common to all locators, servers, or leads. They can be set in the conf/spark-env.sh file, which is sourced when the SnappyData system is started. A template file, conf/spark-env.sh.template, is provided in the conf directory for reference. You can copy this file and use it to configure properties.

For example:

# create a spark-env.sh from the template file
$ cp conf/spark-env.sh.template conf/spark-env.sh

# Following example configuration can be added to spark-env.sh, 
# it shows how to add security configuration using the environment variables

SECURITY_ARGS="-auth-provider=LDAP -J-Dgemfirexd.auth-ldap-server=ldap://192.168.1.162:389/ -user=user1 -password=password123 -J-Dgemfirexd.auth-ldap-search-base=cn=sales-group,ou=sales,dc=example,dc=com -J-Dgemfirexd.auth-ldap-search-dn=cn=admin,dc=example,dc=com -J-Dgemfirexd.auth-ldap-search-pw=password123"

#applies the configuration specified by SECURITY_ARGS to all locators
LOCATOR_STARTUP_OPTIONS="$SECURITY_ARGS"
#applies the configuration specified by SECURITY_ARGS to all servers
SERVER_STARTUP_OPTIONS="$SECURITY_ARGS"
#applies the configuration specified by SECURITY_ARGS to all leads
LEAD_STARTUP_OPTIONS="$SECURITY_ARGS"


Configuring SnappyData Smart Connector

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). In Smart Connector mode, a Spark application connects to the SnappyData cluster to store and process data. SnappyData currently works with Spark version 2.1.1. To work with the SnappyData cluster, a Spark application must set the snappydata.connection property when it starts.

  • snappydata.connection: The SnappyData cluster's locator host and the JDBC client port on which the locator listens for connections. This property must be specified when starting a Spark application.

Example:

$ ./bin/spark-submit --deploy-mode cluster --class somePackage.someClass \
    --master spark://localhost:7077 --conf spark.snappydata.connection=localhost:1527 \
    --packages 'SnappyDataInc:snappydata:1.1.1-s_2.11'
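
The same connection property can also be passed to an interactive spark-shell session; a minimal sketch, assuming a locator whose JDBC client port is listening on localhost:1527 and the package coordinates shown above:

$ ./bin/spark-shell --master local[*] \
    --conf spark.snappydata.connection=localhost:1527 \
    --packages 'SnappyDataInc:snappydata:1.1.1-s_2.11'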

Environment Settings

Any Spark- or SnappyData-specific environment settings can be configured by creating a snappy-env.sh or spark-env.sh file in SNAPPY_HOME/conf.
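
A minimal sketch of such a file; the JDK path below is a placeholder, and any standard Spark environment variable (for example, SPARK_PUBLIC_DNS from the lead configuration note above) can be exported in the same way:

$ cat conf/spark-env.sh
# Placeholder path; adjust for your environment
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk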

Hadoop Provided Settings

If you want to run SnappyData with an existing custom Hadoop cluster, such as MapR or Cloudera, download the SnappyData distribution without Hadoop from the download link. This allows you to provide Hadoop at runtime.

To do this, add an entry to $SNAPPY_HOME/conf/spark-env.sh as shown below:

export SPARK_DIST_CLASSPATH=$($OTHER_HADOOP_HOME/bin/hadoop classpath)

Logging

Currently, log files for SnappyData components go inside the working directory. To change where log files are written, specify the -log-file property with the desired log file location.
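
For example, a hypothetical entry in conf/leads that redirects the lead's log output (the path is a placeholder):

$ cat conf/leads
node-l -log-file=/var/log/snappydata/lead1.log -locators=node-b:8888,node-a:9999
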
The logging levels can be modified by adding a conf/log4j.properties file in the product directory.

$ cat conf/log4j.properties 
log4j.logger.org.apache.spark.scheduler.DAGScheduler=DEBUG
log4j.logger.org.apache.spark.scheduler.TaskSetManager=DEBUG

Note

For a set of applicable class names and default values see the file conf/log4j.properties.template, which can be used as a starting point. Consult the log4j 1.2.x documentation for more details on the configuration file.

Auto-Configuring Off-Heap Memory Size

Off-Heap memory size is auto-configured by default in the following scenarios:

  • When the lead, locator, and server are set up on different host machines:
    In this case, off-heap memory size is configured by default on the host machines running the servers. The total size of heap and off-heap memory does not exceed 75% of the total RAM. For example, if the RAM is greater than 8 GB, the heap memory is between 4-8 GB and the remainder becomes the off-heap memory.

  • When a lead and one of the server nodes are on the same host:
    In this case, off-heap memory size is configured by default and is adjusted based on the number of leads that are present. The total size of heap and off-heap memory does not exceed 75% of the total RAM. However, here the heap memory is the total heap size of the server as well as that of the lead.

Note

The off-heap memory size is not auto-configured when the heap memory and the off-heap memory are explicitly configured through properties or when multiple servers are on the same host machine.
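
For example, a server entry that sets both sizes explicitly (placeholder values), in which case neither size is auto-configured:

$ cat conf/servers
node-c -heap-size=8192m -memory-size=24g -locators=node-b:8888,node-a:9999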

Firewalls and Connections

Connection problems can result from running a firewall on your machine.

SnappyData is a network-centric distributed system, so if you have a firewall running on your machine it could cause connection problems. For example, your connections may fail if your firewall places restrictions on inbound or outbound permissions for Java-based sockets. You may need to modify your firewall configuration to permit traffic to Java applications running on your machine. The specific configuration depends on the firewall you are using.

As one example, firewalls may close connections to SnappyData due to timeout settings. If a firewall senses no activity in a certain time period, it may close a connection and open a new connection when activity resumes, which can cause some confusion about which connections you have.

Firewall and Port Considerations

You can configure and limit port usage for situations that involve firewalls, for example, between client-server or server-server connections.

Make sure your port settings are configured correctly for firewalls. For each SnappyData member, there are two different port settings you may need to be concerned with regarding firewalls:

  • The port that the server or locator listens on for client connections. This is configurable using the -client-port option to the snappy server or snappy locator command.

  • The peer discovery port. SnappyData members connect to the locator for peer-to-peer messaging. The locator port is configurable using the -peer-discovery-port option to the snappy server or snappy locator command.

    By default, SnappyData servers and locators discover each other on a pre-defined port (10334) on localhost.
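
For example, both ports can be fixed in the conf files so that firewall rules can reference them (host names and port numbers are placeholders):

$ cat conf/locators
node-a -peer-discovery-port=10334 -client-port=1527

$ cat conf/servers
node-c -client-port=1528 -locators=node-a:10334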

Limiting Ephemeral Ports for Peer-to-Peer Membership

By default, SnappyData utilizes ephemeral ports for UDP messaging and TCP failure detection. Ephemeral ports are temporary ports assigned from a designated range, which can encompass a large number of possible ports. When a firewall is present, the ephemeral port range usually must be limited to a much smaller number, for example six. If you are configuring peer-to-peer communications through a firewall, you must also set the TCP port for each process and ensure that UDP traffic is allowed through the firewall.
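
A sketch of limiting the range in conf/servers, assuming the membership-port-range property listed in the tables below can be passed as a per-member option (the range and hosts are placeholders):

$ cat conf/servers
node-c -membership-port-range=59000-59005 -locators=node-a:10334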

Properties for Firewall and Port Configuration

Store Layer

The following store-layer properties are potentially involved in firewall behavior; a brief description of each is given below. The Configuration Properties section contains detailed information for each property.

  • locators (peer-to-peer configuration): The list of locators used by system members. The list must be configured consistently for every member of the distributed system.

  • membership-port-range (peer-to-peer configuration): The range of ephemeral ports available for unicast UDP messaging and for TCP failure detection in the peer-to-peer distributed system.

  • -J-Dgemfirexd.hostname-for-clients (member configuration): The IP address or host name that this server/locator sends to JDBC/ODBC/thrift clients to use for the connection.

  • -client-port option to the snappy server and snappy locator commands (member configuration): The port that the member listens on for client communication.

  • snappy locator command (locator): The default peer discovery port for a locator is 10334.
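
For example, a server entry that fixes the client port and advertises a public address to clients (the port and address are placeholders):

$ cat conf/servers
node-c -client-port=1528 -J-Dgemfirexd.hostname-for-clients=203.0.113.10 -locators=node-a:10334
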
Spark Layer

The following Spark properties can be set to configure the ports required for the Spark infrastructure.
Refer to Spark Configuration in the official documentation for detailed information.

  • spark.blockManager.port (default: random): Port for all block managers to listen on. These exist on both the driver and the executors.

  • spark.driver.blockManager.port (default: value of spark.blockManager.port): Driver-specific port for the block manager to listen on, for cases where it cannot use the same configuration as executors.

  • spark.driver.port (default: random): Port for the driver to listen on. This is used for communicating with the executors and the standalone Master.

  • spark.port.maxRetries (default: 16): Maximum number of retries when binding to a port before giving up. When a port is given a specific value (non-zero), each subsequent retry increments the port used in the previous attempt by 1 before retrying. This essentially allows it to try a range of ports from the start port specified to port + maxRetries.

  • spark.shuffle.service.port (default: 7337): Port on which the external shuffle service runs.

  • spark.ui.port (default: 4040): Port for your application's dashboard, which shows memory and workload data.

  • spark.ssl.[namespace].port (default: none): The port on which the SSL service listens. The port must be defined within a namespace configuration; see SSL Configuration for the available namespaces. When not set, the SSL port is derived from the non-SSL port for the same service. A value of "0" makes the service bind to an ephemeral port.

  • spark.history.ui.port (default: 18080): The port to which the web interface of the history server binds.

  • SPARK_MASTER_PORT (default: 7077): Starts the master on a specific port.

  • SPARK_WORKER_PORT (default: random): Starts the Spark worker on a specific port.

Locators and Ports

The ephemeral port range and TCP port range for locators must be accessible to members through the firewall.

Locators are used in the peer-to-peer cache to discover other processes. They can be used by clients to locate servers as an alternative to configuring clients with a collection of server addresses and ports.

Locators have a TCP/IP port that all members must be able to connect to. They also start a distributed system and so need to have their ephemeral port range and TCP port accessible to other members through the firewall.

Clients need only be able to connect to the locator's locator port. They don't interact with the locator's distributed system; clients get server names and ports from the locator and use these to connect to the servers. For more information, see Using Locators.