Last updated: November 13, 2024
Creating a connection to a data source and profiling data
This guide shows how to connect a data source to DQOps, import the metadata, and start data quality assessment.
Creating a connection to a data source
After installing and starting DQOps, this guide describes how to import a locally stored CSV file using the user interface, initiate automatic monitoring, and collect basic statistics.
For a full description of how to add a data source connection to other providers or add a connection using the DQOps Shell, see the Working with DQOps section.
You can find more information about navigating the DQOps user interface here.
How to analyze other data sources
Instead of importing a CSV file, you can follow the manuals for connecting to other data sources and continue this getting started guide from the Import metadata using the user interface step.
DQOps supports integration with relational databases, data lakes, cloud data warehouses, object storage services, data processing frameworks, query engines, and flat files. You can find the full list of supported data sources here.
Links to some supported data sources are shown below.
Download the sample CSV file
In the example, we will use a sample CSV file named orders.csv. This file contains a fragment of the public dataset of a fictitious ecommerce clothing site TheLook, but feel free to use your own CSV file if you prefer.
To download the sample CSV file, open the GitHub webpage, click on the three dots button in the upper right corner, and select the Download command.
In our example, we have downloaded the file to a folder named demo_files. Remember the absolute path to the file, as we will need it when configuring the connection.
In our example, the path is C:\demo_files\orders.csv
Below is a table that shows a fragment of the data included in the sample CSV file.
order_id | user_id | status | gender | created_at | returned_at | shipped_at | delivered_at | num_of_item |
---|---|---|---|---|---|---|---|---|
1 | 3 | Shipped | F | 8/12/2019 15:15:00 | | 8/15/2019 14:08:00 | | 1 |
2 | 5 | Shipped | M | 1/20/2022 9:17:00 | | 1/22/2022 2:25:00 | | 1 |
3 | 6 | Processing | F | 7/23/2022 11:33:00 | | | | 4 |
4 | 6 | Completed | F | 3/1/2020 11:33:00 | | 3/2/2020 15:57:00 | 3/4/2020 16:32:00 | 3 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
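If you would like to inspect the sample file locally before connecting it, the short Python sketch below previews its shape and contents. It assumes the pandas package is installed and that the file was saved to the example path; it is optional and not required for any DQOps step.

```python
# A minimal sketch, assuming pandas is installed and the file sits at the example path.
import pandas as pd

orders = pd.read_csv(r"C:\demo_files\orders.csv")

print(orders.shape)   # number of rows and columns
print(orders.dtypes)  # inferred column types
print(orders.head())  # first few rows, matching the fragment above
```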
Add a connection to a CSV file using the user interface
Navigate to the connection settings
To navigate to the CSV connection settings:
- Go to the Data Sources section and click the + Add connection button in the upper left corner.
- Select CSV connection type.
Fill in the connection settings
After opening the CSV connection settings, fill in the connection details.
If you are importing a CSV file stored locally, you only need to complete the settings described below. For information on other settings and how to import CSV files from AWS S3, refer to the CSV connection documentation.
CSV connection settings | Description |
---|---|
Connection name | The name of the connection that will be created in DQOps. This will also be the name of the folder where the connection configuration files are stored. The name of the connection must be unique and consist of alphanumeric characters. |
Virtual schema name | An alias for the parent directory with data. We leave the default value "files". |
Path | The path prefix to the parent directory with data. The path must be absolute. |
After entering the connection settings, click on the Test Connection button to test the connection. It will inform you if the path to the CSV file is correct.
If the test is successful, click the Save button to add a new connection. If the test fails, you can review the details to identify the issue.
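If the connection test fails, you can also verify the file outside DQOps. The optional sketch below assumes the duckdb Python package is installed and the file sits at the example path; since DQOps reads CSV files through an embedded DuckDB engine, a successful query here suggests the path itself is correct.

```python
# A minimal sketch, assuming the duckdb package is installed and the file
# is located at the example path; used only to verify the path outside DQOps.
import duckdb

path = r"C:\demo_files\orders.csv"  # the "Path" prefix plus the file name

con = duckdb.connect()
row_count = con.execute(f"SELECT COUNT(*) FROM read_csv_auto('{path}')").fetchone()[0]
print(f"{path} is readable and contains {row_count} rows")
```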
Import metadata using the user interface
When you add a new connection, it will appear in the tree view on the left. You will then be redirected to the Import metadata screen, where you can import files.
- Import the "files" schema by clicking the Import tables button.
- Import the "orders.csv" file by clicking the Import all tables button in the upper right corner.
DQOps will create a table from the orders.csv file in the virtual schema "files", which will allow you to profile it and monitor its data quality.
Start data quality assessment
DQOps simplifies the data quality assessment process, which is a routine activity for data governance and quality teams.
Upon import, you will receive a notification that new tables have been imported. You can then begin collecting basic statistics and profiling data by running default data profiling checks. Simply click on the Start profiling button to initiate this process.
Automatically activated checks
Once new tables are imported, DQOps automatically activates checks defined in data quality policies. These checks detect volume anomalies, data freshness anomalies, empty tables, table availability, schema changes, anomalies in the count of distinct values, and null percent anomalies. By default, the profiling checks are scheduled to run at 12:00 p.m. on the 1st day of every month, and the monitoring checks are scheduled to run daily at 12:00 p.m.
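Expressed in standard cron syntax, these defaults correspond to expressions such as `0 12 1 * *` (profiling, monthly) and `0 12 * * *` (monitoring, daily). The optional sketch below only illustrates when such schedules would next fire; it assumes the third-party croniter package and is not part of DQOps configuration.

```python
# A minimal sketch, assuming the third-party croniter package is installed;
# it previews the next run times of the default schedules described above.
from datetime import datetime
from croniter import croniter

schedules = {
    "profiling (1st day of month, 12:00 p.m.)": "0 12 1 * *",
    "monitoring (daily, 12:00 p.m.)": "0 12 * * *",
}

now = datetime.now()
for name, expression in schedules.items():
    next_run = croniter(expression, now).get_next(datetime)
    print(f"{name}: next run at {next_run}")
```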
Profiling checks are designed to assess the initial data quality score of a data source. Profiling checks are also useful for exploring and experimenting with various types of checks and determining the most suitable ones for regular data quality monitoring.
Monitoring checks are standard checks that monitor the data quality of a table or column. They can also be referred to as Data Observability checks. These checks capture a single data quality result for the entire table or column.
Default data profiling checks
Below is a list of the data profiling checks automatically activated according to default Table and Column-level data quality policies.
Target | Check name | Description |
---|---|---|
column | profile contains usa phone percent | Detects USA phone numbers inside text columns. |
column | profile contains email percent | Detects emails inside text columns. |
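To give a sense of what a "contains email percent" style check measures, the optional sketch below computes a similar percentage with pandas on a hypothetical text column (the sample orders.csv has no email-like column); it is an illustration only, not how DQOps evaluates the check.

```python
# A minimal sketch, assuming pandas; mimics the idea of detecting emails
# inside a text column and reporting the percentage of matching values.
import pandas as pd

# Hypothetical text column used only for illustration.
comments = pd.Series(["contact me at jane@example.com", "no email here", "bob@mail.org asked"])

email_pattern = r"[\w.+-]+@[\w-]+\.[\w.-]+"
contains_email_percent = comments.str.contains(email_pattern, regex=True).mean() * 100
print(f"contains email percent: {contains_email_percent:.1f}%")
```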
Default daily monitoring checks
Below is a list of the daily monitoring checks automatically activated according to default Table and Column-level data quality policies.
Target | Check name | Description |
---|---|---|
table | daily row count | Captures the row count of the table every day and identifies empty tables. |
table | daily row count anomaly | Detects day-to-day anomalies in the table volume. Raises a data quality issue when the increase or decrease in the row count is in the top 1% of most significant changes. |
table | daily row count change | Detects significant day-to-day changes in the table volume. Raises a data quality issue when the increase or decrease in the row count is greater than 10%. |
table | daily freshness anomaly | Measures the freshness of the table and detects anomalous (outstanding) delays in data freshness. The table must be properly configured to support timeliness checks. |
table | daily table availability | Verifies that a table exists, can be accessed, and queried without errors. Detects corrupted tables and expired credentials to data sources. |
table | daily column count | Retrieves the metadata of the monitored table from the data source and counts the number of columns. |
table | daily column count changed | Detects whether the number of columns in a table has changed since the last time the check (checkpoint) was run. |
column | daily nulls count | Counts null values in every column without raising any data quality issues. |
column | daily nulls percent | Measures the percentage of null values in every column without raising any data quality issues. |
column | daily nulls percent anomaly | Measures the percentage of null values in every column and detects anomalous changes in the percentage of null values. Raises a data quality issue for the top 1% biggest day-to-day changes. |
column | daily not nulls count | Verifies that a column contains a minimum number of non-null values. |
column | daily not nulls percent | Measures the percentage of non-null values in every column without raising any data quality issues. |
column | daily empty column found | Detects empty columns that contain only null values. |
column | daily distinct count anomaly | Counts distinct values in every column and detects anomalous changes in the count of distinct values. Raises a data quality issue for the top 1% biggest day-to-day changes of the count of distinct values. |
column | daily detected datatype in text changed | Scans all values in a text column and detects the data type of all values in a column. Raises a data quality issue when the type of texts changes. For example, when a column that always contained numeric values starts containing a text value. |
column | daily column exists | Reads the metadata of the monitored table and verifies that the column still exists in the data source. |
column | daily column type changed | Detects if the data type of the column has changed since the last time it was retrieved. |
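To build intuition for what several of these checks measure, the optional sketch below computes similar statistics (row count, null percentage, distinct counts, empty columns) with pandas on the sample file. It assumes pandas is installed and the example path; it is an illustration only, not how DQOps evaluates checks internally.

```python
# A minimal sketch, assuming pandas and the example file path; it mirrors
# a few of the metrics listed above for intuition only.
import pandas as pd

orders = pd.read_csv(r"C:\demo_files\orders.csv")

row_count = len(orders)                             # cf. daily row count
null_percent = orders.isna().mean() * 100           # cf. daily nulls percent, per column
distinct_counts = orders.nunique()                  # cf. daily distinct count, per column
empty_columns = [c for c in orders.columns if orders[c].isna().all()]  # cf. daily empty column found

print(f"row count: {row_count}")
print("null percent per column:", null_percent.round(2).to_dict())
print("distinct counts per column:", distinct_counts.to_dict())
print("empty columns:", empty_columns)
```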
For more general information about checks, see the DQOps concepts section.
Review scheduling
To review scheduling for profiling and daily monitoring checks:
- Go to the Data source section.
- Select the connection from the tree view on the left.
- Select the Schedule tab where you can review scheduling settings for the added connection.
The scheduling is enabled by default. You can turn it off by clicking the Notification icon in the upper right corner and then clicking the Job scheduler toggle button.
Explore the connection-level tabs in the Data sources section
There are several tabs to explore in the Data sources section. They differ depending on which element is selected in the tree view on the left (connection, schema, table, or column). The following tabs are shown at the connection level:
- Connection: Provides details about the connection parameters.
- Schedule: Allows you to configure the check execution schedule at the connection level.
- Comments: Allows adding comments to your connection.
- Labels: Allows adding labels to your connection.
- Schemas: Allows importing schemas and tables.
- Data quality summary: Displays summaries of the data quality status. You have the option to choose one of two subtabs:
- Tables: This subtab provides a summary of the data quality status for tables in this connection.
- Columns: This subtab provides a summary of the data quality status for columns in this connection.
- Default grouping template: Allows setting up data grouping globally at the data source level. Learn how to configure data grouping.
- Incident grouping: Allows configuring the incident grouping level. Learn more about incidents that let you keep track of the issues that arise during data quality monitoring.
- Notifications: Allows configuring incident notifications and webhooks. Learn how to configure notifications whenever a new incident is created or modified.
Explore the table-level tabs in the Data sources section
At the table level in the Data sources section, there are the following tabs:
- Table: Provides details about the table and allows you to add a filter, priority, or stage name (for example, "Ingestion").
- Schedule: Allows setting a schedule for running checks. Learn how to configure schedules.
- Comments: Allows adding comments to your tables.
- Labels: Allows adding labels to your tables.
- Data groupings: Allows setting up data grouping at the table level. Learn how to configure data grouping.
- Date and time columns: Allows configuring event and ingestion timestamp columns for timeliness checks, as well as a date or datetime column for partition checks.
- Incident configuration: Allows configuring incidents. Learn more about incidents that let you keep track of the issues that arise during data quality monitoring.
Next step
Now that you have connected a data source and initiated the data quality assessment, it is time to review the results, automatically configure data quality checks to detect the most common data quality issues, and create an incident notification.