
MongoDB

Certified

Important Capabilities

  • Detect Deleted Entities: optionally enabled via stateful_ingestion.remove_stale_metadata
  • Platform Instance: enabled by default
  • Schema Metadata: enabled by default

This plugin extracts the following:

  • Databases and associated metadata
  • Collections in each database and schemas for each collection (via schema inference)

By default, schema inference samples 1,000 documents from each collection. Setting schemaSamplingSize: null will scan the entire collection. Moreover, setting useRandomSampling: False will sample the first documents found without random selection, which may be faster for large collections.

Note that schemaSamplingSize has no effect if enableSchemaInference: False is set.

Very large schemas are truncated to a maximum of 300 fields. This limit is configurable via the maxSchemaSize parameter.
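To make the sampling and truncation behavior concrete, here is a minimal, illustrative sketch of schema inference over a sample of documents. The function name and logic are hypothetical simplifications, not the plugin's actual code:

```python
from typing import Any, Dict, List

def infer_schema(documents: List[Dict[str, Any]], max_schema_size: int = 300) -> Dict[str, str]:
    """Infer a flat field -> type-name mapping from sampled documents,
    truncating to at most max_schema_size fields (mirrors maxSchemaSize)."""
    schema: Dict[str, str] = {}
    for doc in documents:
        for field, value in doc.items():
            if field not in schema:
                schema[field] = type(value).__name__
            if len(schema) >= max_schema_size:
                return schema
    return schema

# A small sample, standing in for the documents schemaSamplingSize would select:
sample = [
    {"_id": 1, "name": "ada", "age": 36},
    {"_id": 2, "name": "grace", "email": "g@example.com"},
]
print(infer_schema(sample))
# → {'_id': 'int', 'name': 'str', 'age': 'int', 'email': 'str'}
```

Note how fields seen in any sampled document are merged into one schema, which is why a larger schemaSamplingSize generally yields a more complete schema for heterogeneous collections.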

CLI based Ingestion

Install the Plugin

pip install 'acryl-datahub[mongodb]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: "mongodb"
  config:
    # Coordinates
    connect_uri: "mongodb://localhost"

    # Credentials
    username: admin
    password: password
    authMechanism: "DEFAULT"

    # Options
    enableSchemaInference: True
    useRandomSampling: True
    maxSchemaSize: 300

sink:
  # sink configs
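The sink section depends on where you send metadata. As one example (assuming a local DataHub instance listening on port 8080), a datahub-rest sink looks like:

```yaml
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```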

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

  • authMechanism (string): MongoDB authentication mechanism.
  • connect_uri (string): MongoDB connection URI. Default: mongodb://localhost
  • enableSchemaInference (boolean): Whether to infer schemas. Default: True
  • hostingEnvironment (enum): Hosting environment of MongoDB. Currently supported values are SELF_HOSTED, ATLAS, and AWS_DOCUMENTDB. Default: SELF_HOSTED
  • maxDocumentSize (integer): Default: 16793600
  • maxSchemaSize (integer): Maximum number of fields to include in the schema. Default: 300
  • options (object): Additional options to pass to pymongo.MongoClient(). Default: {}
  • password (string): MongoDB password.
  • platform_instance (string): The instance of the platform that all assets produced by this recipe belong to.
  • schemaSamplingSize (integer): Number of documents to sample when inferring the schema. If set to null, all documents will be scanned. Default: 1000
  • useRandomSampling (boolean): Whether documents for schema inference should be randomly selected. If False, documents are selected from the start of the collection. Default: True
  • username (string): MongoDB username.
  • env (string): The environment that all assets produced by this connector belong to. Default: PROD
  • collection_pattern (AllowDenyPattern): Regex patterns for collections to filter in ingestion. Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True}
      • collection_pattern.allow (array(string))
      • collection_pattern.deny (array(string))
      • collection_pattern.ignoreCase (boolean): Whether to ignore case during pattern matching. Default: True
  • database_pattern (AllowDenyPattern): Regex patterns for databases to filter in ingestion. Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True}
      • database_pattern.allow (array(string))
      • database_pattern.deny (array(string))
      • database_pattern.ignoreCase (boolean): Whether to ignore case during pattern matching. Default: True
  • stateful_ingestion (StatefulStaleMetadataRemovalConfig): Base specialized config for stateful ingestion with stale metadata removal capability.
      • stateful_ingestion.enabled (boolean): Defaults to True if the datahub-rest sink is used or if datahub_api is specified; otherwise False. Default: False
      • stateful_ingestion.remove_stale_metadata (boolean): When stateful ingestion is enabled, soft-deletes entities that were present in the last successful run but are missing in the current run. Default: True
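As an illustration of how the allow/deny filtering behaves, the following is a simplified sketch (not DataHub's actual AllowDenyPattern implementation): a name is ingested when it matches at least one allow regex and no deny regex.

```python
import re
from typing import List

def is_allowed(name: str, allow: List[str], deny: List[str], ignore_case: bool = True) -> bool:
    """Simplified allow/deny check mirroring collection_pattern /
    database_pattern semantics: deny patterns win over allow patterns."""
    flags = re.IGNORECASE if ignore_case else 0
    if any(re.match(pattern, name, flags) for pattern in deny):
        return False
    return any(re.match(pattern, name, flags) for pattern in allow)

# Allow everything except collections whose names start with "tmp_":
print(is_allowed("orders", allow=[".*"], deny=["tmp_.*"]))     # → True
print(is_allowed("tmp_cache", allow=[".*"], deny=["tmp_.*"]))  # → False
```

The default pattern ({'allow': ['.*'], 'deny': [], 'ignoreCase': True}) therefore admits every collection and database unless you add deny rules or narrow the allow list.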

Code Coordinates

  • Class Name: datahub.ingestion.source.mongodb.MongoDBSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for MongoDB, feel free to ping us on our Slack.