
Last updated: November 13, 2024

Creating a connection to a data source and profiling data

This guide shows how to connect a data source to DQOps, import the metadata, and start data quality assessment.

Creating a connection to a data source

After installing and starting DQOps, this guide describes how to import a locally stored CSV file using the user interface, initiate automatic monitoring, and collect basic statistics.

For a full description of how to add a data source connection to other providers or add a connection using the DQOps Shell, see the Working with DQOps section.

You can find more information about navigating the DQOps user interface here.

How to analyze other data sources

Instead of importing a CSV file, you can follow the manuals for connecting to other data sources and continue this getting started guide from the import metadata using the user interface step.

DQOps supports integration with relational databases, data lakes, cloud data warehouses, object storage services, data processing frameworks, query engines and flat files. You can find the full list of supported data sources here.

Links to some supported data sources are shown below.

Athena, Redshift, BigQuery, Snowflake, PostgreSQL, Oracle, MySQL, SQL Server, Spark, Databricks, Parquet

Download the sample CSV file

In the example, we will use a sample CSV file named orders.csv. This file contains a fragment of the public dataset of TheLook, a fictitious e-commerce clothing site, but feel free to use your own CSV file if you prefer.

To download the sample CSV file, open the GitHub webpage, click on the three dots button in the upper right corner, and select the Download command.

Download sample CSV file from GitHub

In our example, we have downloaded the file to a folder named demo_files. Remember the absolute path to the file, as we will need it when configuring the connection.

In our example, the path is C:\demo_files\orders.csv

Add a connection

Below is a table that shows a fragment of the data included in the sample CSV file.

| order_id | user_id | status | gender | created_at | returned_at | shipped_at | delivered_at | num_of_item |
|----------|---------|--------|--------|------------|-------------|------------|--------------|-------------|
| 1 | 3 | Shipped | F | 8/12/2019 15:15:00 | | 8/15/2019 14:08:00 | | 1 |
| 2 | 5 | Shipped | M | 1/20/2022 9:17:00 | | 1/22/2022 2:25:00 | | 1 |
| 3 | 6 | Processing | F | 7/23/2022 11:33:00 | | | | 4 |
| 4 | 6 | Completed | F | 3/1/2020 11:33:00 | | 3/2/2020 15:57:00 | 3/4/2020 16:32:00 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
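If you want a quick look at the file before connecting it, the optional Python sketch below previews it with pandas. It assumes pandas is installed and that the file was saved to the example path C:\demo_files\orders.csv; adjust the path to your own location.

```python
# Optional sanity check of the sample file before adding the connection.
# Assumes pandas is installed and the file sits at the example path below.
import pandas as pd

orders = pd.read_csv(
    r"C:\demo_files\orders.csv",
    parse_dates=["created_at", "returned_at", "shipped_at", "delivered_at"],
)

print(orders.head())    # the same fragment as in the table above
print(orders.dtypes)    # column types inferred by pandas
print(f"{len(orders)} rows, {orders.shape[1]} columns")
```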

Add a connection to a CSV file using the user interface

To navigate to the CSV connection settings:

  1. Go to the Data Sources section and click the + Add connection button in the upper left corner.

    Adding connection

  2. Select CSV connection type.

Selecting CSV database type

Fill in the connection settings

After opening the CSV connection settings, fill in the connection details.

If you are importing a CSV file stored locally, you only need to complete the settings described below. For information on other settings and how to import CSV files from AWS S3, refer to the CSV connection documentation.

Adding connection

| CSV connection settings | Description |
|-------------------------|-------------|
| Connection name | The name of the connection that will be created in DQOps. This will also be the name of the folder where the connection configuration files are stored. The name of the connection must be unique and consist of alphanumeric characters. |
| Virtual schema name | An alias for the parent directory with data. We leave the default value "files". |
| Path | The path prefix to the parent directory with data. The path must be absolute. |

After entering the connection settings, click on the Test Connection button to test the connection. It will inform you if the path to the CSV file is correct.

If the test is successful, click the Save button to add a new connection. If the test fails, you can review the details to identify the issue.

Import metadata using the user interface

When you add a new connection, it will appear in the tree view on the left. You will then be redirected to the Import metadata screen, where you can import files.

  1. Import the "files" schema by clicking the Import tables button.

    Importing schemas

  2. Import the "orders.csv" file by clicking on the Import all tables button in the upper right corner.

    Importing tables

DQOps will create a table from the orders.csv file in the virtual schema "files", which will allow you to profile it and monitor its data quality.
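DQOps reads file-based connections such as CSV through an embedded DuckDB engine. If you are curious what the table built from orders.csv contains, you can run a similar ad-hoc query yourself with the standalone duckdb Python package; the sketch below is only an illustration of that idea, not how you normally interact with DQOps, and it assumes the sample file is at the example path.

```python
# Roughly what querying the imported CSV as a table looks like. This standalone
# duckdb sketch is an illustration only; DQOps runs its own queries internally.
import duckdb

result = duckdb.sql(
    "SELECT status, COUNT(*) AS orders "
    "FROM read_csv_auto('C:/demo_files/orders.csv') "
    "GROUP BY status ORDER BY orders DESC"
)
print(result)
```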

Start data quality assessment

DQOps simplifies the data quality assessment process, which is a routine activity for data governance and data quality teams.

Upon import, you will receive information that new tables have been imported. You can then begin collecting basic statistics and profiling data by running default data profiling checks. Simply click on the Start profiling button to initiate this process.

Collect basic statistics and profile data with default profiling checks
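Basic statistics give you a first overview of each column: row count, null counts, distinct counts, and value ranges. DQOps computes them by running queries against the data source; as a rough illustration of the kind of numbers collected, the pandas sketch below derives similar per-column statistics for the sample file. It is not the DQOps implementation and assumes the example path from earlier.

```python
# Illustration only: the kind of per-column statistics collected during
# basic profiling (row count, nulls, distinct values, min/max).
import pandas as pd

orders = pd.read_csv(r"C:\demo_files\orders.csv")

stats = pd.DataFrame({
    "nulls_count": orders.isna().sum(),
    "nulls_percent": orders.isna().mean() * 100,
    "distinct_count": orders.nunique(),
    "min_value": orders.min(numeric_only=True),
    "max_value": orders.max(numeric_only=True),
})

print(f"Row count: {len(orders)}")
print(stats)
```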

Automatically activated checks

Once new tables are imported, DQOps automatically activates checks defined in data policies. These checks detect volume anomalies, data freshness anomalies, empty tables, table availability, schema changes, anomalies in the count of distinct values, and null percent anomalies. By default, the profiling checks are scheduled to run at 12:00 p.m. on the 1st day of every month, and the monitoring checks are scheduled to run daily at 12:00 p.m.

Profiling checks are designed to assess the initial data quality score of a data source. Profiling checks are also useful for exploring and experimenting with various types of checks and determining the most suitable ones for regular data quality monitoring.

Monitoring checks are standard checks that monitor the data quality of a table or column. They can also be referred to as Data Observability checks. These checks capture a single data quality result for the entire table or column.

Default data profiling checks

Below is a list of the data profiling checks automatically activated according to default Table and Column-level data quality policies.

| Target | Check name | Description |
|--------|------------|-------------|
| column | profile contains usa phone percent | Detects USA phone numbers inside text columns. |
| column | profile contains email percent | Detects emails inside text columns. |
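As an illustration of what a check like "profile contains usa phone percent" measures, the sketch below computes the percentage of values in a text column that contain a US phone number pattern. The sample column and the regular expression are simplified assumptions made for this example; they are not the actual sensor that DQOps runs.

```python
# Illustration of a "contains USA phone percent" style measurement: the share
# of non-null values in a text column that contain a US phone number.
# The regex is a simplified assumption, not the exact DQOps sensor.
import re
import pandas as pd

# Hypothetical text column used only for demonstration.
notes = pd.Series([
    "call me at (555) 123-4567",
    "no contact details",
    "office: 555-987-6543",
    None,
])

usa_phone = re.compile(r"(\+?1[\s.-]?)?(\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}")

non_null = notes.dropna()
percent = non_null.str.contains(usa_phone).mean() * 100
print(f"contains_usa_phone_percent = {percent:.1f}%")   # 66.7% in this sample
```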

Default daily monitoring checks

Below is a list of the daily monitoring checks automatically activated according to default Table and Column-level data quality policies.

| Target | Check name | Description |
|--------|------------|-------------|
| table | daily row count | Captures the row count of the table every day and identifies empty tables. |
| table | daily row count anomaly | Detects day-to-day anomalies in the table volume. Raises a data quality issue when the increase or decrease in the row count is in the top 1% of most significant changes. |
| table | daily row count change | Detects significant day-to-day changes in the table volume. Raises a data quality issue when the increase or decrease in the row count is greater than 10%. |
| table | daily freshness anomaly | Measures the freshness of the table and detects anomalous (outstanding) delays in data freshness. The table must be properly configured to support timeliness checks. |
| table | daily table availability | Verifies that a table exists, can be accessed, and can be queried without errors. Detects corrupted tables and expired credentials to data sources. |
| table | daily column count | Retrieves the metadata of the monitored table from the data source and counts the number of columns. |
| table | daily column count changed | Detects whether the number of columns in a table has changed since the last time the check (checkpoint) was run. |
| column | daily nulls count | Counts null values in every column without raising any data quality issues. |
| column | daily nulls percent | Measures the percentage of null values in every column without raising any data quality issues. |
| column | daily nulls percent anomaly | Measures the percentage of null values in every column and detects anomalous changes in that percentage. Raises a data quality issue for the top 1% biggest day-to-day changes. |
| column | daily not nulls count | Verifies that a column contains a minimum number of non-null values. |
| column | daily not nulls percent | Measures the percentage of non-null values in every column without raising any data quality issues. |
| column | daily empty column found | Detects empty columns that contain only null values. |
| column | daily distinct count anomaly | Counts distinct values in every column and detects anomalous changes in that count. Raises a data quality issue for the top 1% biggest day-to-day changes of the count of distinct values. |
| column | daily detected datatype in text changed | Scans all values in a text column and detects the data type of all values in the column. Raises a data quality issue when the detected type changes, for example, when a column that always contained numeric values suddenly contains a non-numeric text value. |
| column | daily column exists | Reads the metadata of the monitored table and verifies that the column still exists in the data source. |
| column | daily column type changed | Detects if the data type of the column has changed since the last time it was retrieved. |
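To make the logic of a check such as "daily row count change" concrete, the sketch below compares the current row count with the previous day's reading and flags changes above 10%, mirroring the rule described in the table. It is a conceptual illustration, not the DQOps rule implementation.

```python
# Conceptual sketch of the daily row count change rule: compare today's row
# count with yesterday's checkpoint and flag changes greater than 10%.
def row_count_change_exceeded(previous_count: int, current_count: int,
                              max_percent: float = 10.0) -> bool:
    """Return True when the day-to-day row count change exceeds max_percent."""
    if previous_count == 0:
        return current_count != 0   # any growth from an empty table is significant
    change_percent = abs(current_count - previous_count) / previous_count * 100
    return change_percent > max_percent

print(row_count_change_exceeded(10_000, 10_500))   # 5% change  -> False
print(row_count_change_exceeded(10_000, 12_500))   # 25% change -> True
```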

For more general information about checks, see the DQOps concepts section.

Review scheduling

To review scheduling for profiling and daily monitoring checks:

  1. Go to the Data sources section.
  2. Select the connection from the tree view on the left.
  3. Select the Schedule tab where you can review scheduling settings for the added connection.

Review scheduling

The scheduling is enabled by default. You can turn it off by clicking the Notification icon in the upper right corner and then clicking the Job scheduler toggle button.
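The default times mentioned above correspond to the cron expressions in the sketch below. These expressions are an assumption based on the stated schedule, so verify the actual values on the Schedule tab; the optional third-party croniter package is used only to show when each schedule would fire next.

```python
# Default schedules described above, expressed as cron strings (assumed from
# the stated times; check the Schedule tab for the real values).
# Requires: pip install croniter
from datetime import datetime
from croniter import croniter

monthly_profiling = "0 12 1 * *"   # 12:00 p.m. on the 1st day of every month
daily_monitoring = "0 12 * * *"    # 12:00 p.m. every day

now = datetime(2024, 11, 13)
print(croniter(monthly_profiling, now).get_next(datetime))  # 2024-12-01 12:00:00
print(croniter(daily_monitoring, now).get_next(datetime))   # 2024-11-13 12:00:00
```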

Reviewing data source details

Explore the connection-level tabs in the Data sources section

The Data sources section contains several tabs whose content depends on the element selected in the tree view on the left (connection, schema, table, or column). The following tabs are shown at the connection level:

  • Connection: Provides details about the connection parameters.
  • Schedule: Allows you to configure the check execution schedule at the connection level.
  • Comments: Allows adding comments to your connection.
  • Labels: Allows adding labels to your connection.
  • Schemas: Allows importing schemas and tables.
  • Data quality summary: Displays summaries of the data quality status. You have the option to choose one of two subtabs:
    • Tables: This subtab provides a summary of the data quality status for tables in this connection.
    • Columns: This subtab provides a summary of the data quality status for columns in this connection.
  • Default grouping template: Allows setting up data grouping globally at the data source level. Learn how to configure data grouping.
  • Incident grouping: Allows configuring the incident grouping level. Learn more about incidents that let you keep track of the issues that arise during data quality monitoring.
  • Notifications: Allows configuring incident notifications and webhooks. Learn how to configure notifications whenever a new incident is created or modified.

Explore the table-level tabs in the Data sources section

At the table level in the Data sources section, there are the following tabs:

Reviewing table details

Next step

Now that you have connected a data source and initiated data assessment, it is time to review the results, automatically configure data quality checks to detect the most common data quality issues, and create an incident notification.