Bottom line: A data connector copies data from one or more sources to a data warehouse (usually a single location where it is easy to run the operations you need).
What is a data connector?
A data connector is the logic that enables developers (or data-science teams) to transfer data, once it has been extracted, from a source to a destination. For example, suppose you need to build a model that predicts a particular student's attendance, along with the likely reason for absence. You would fetch basic details such as attendance records from the LMS (Learning Management System), and for the leave reasons you might need to look into another application. Once you have the required data, you can feed it into your ML (machine learning) or statistical model and make the prediction. In this example, the LMS is the “source” and the software you use to predict attendance is the “destination”.
Source: The source is whatever we fetch the data from. Sources usually include databases, digital documents (spreadsheets, PDFs, docs, etc.), and SaaS applications such as an LMS.
Destination: The destination is the tool (usually a data warehouse such as Redshift) in which the calculations are performed and the desired results are obtained. Data warehouses let analysts and developers store large amounts of data while still running queries over it quickly.
Running on a schedule: Data connectors usually transfer data on a schedule. The data involved is often too large to copy in a single transfer, and it may also be changing continuously, so it is better to fix a schedule on which the data is copied periodically. For example, on a two-hourly schedule, the data is copied every two hours.
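To make the two-hourly example concrete, here is a minimal Python sketch that lists the times at which a connector would run. The function name scheduled_runs and the start date are assumptions made for illustration; a real connector would normally rely on a scheduler such as cron rather than computing run times by hand.

```python
from datetime import datetime, timedelta


def scheduled_runs(start, interval, count):
    """Return the first `count` run times, spaced `interval` apart from `start`."""
    return [start + i * interval for i in range(count)]


# Hypothetical schedule: first run at midnight, then every two hours.
runs = scheduled_runs(datetime(2024, 1, 1, 0, 0), timedelta(hours=2), 4)
for run in runs:
    print(run.isoformat())
```

Each printed timestamp is one run of the connector, and consecutive runs are exactly two hours apart.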
Extraction of data (from source):
We first need to fetch (extract) the data from the data source so that we can use it for further work. There are two modes in which data can be extracted.
1. Snapshot mode: In this mode, the entire dataset is fetched in one go. If you are running on an hourly schedule, the entire dataset (including any updates) is fetched every hour.
2. Incremental mode: In this mode, only the data that has been added or updated since the last run is fetched, instead of the whole dataset. Let’s say someone added the details of a new employee (Sushil) to the employee details table. When we extract the data in incremental mode, only Sushil’s record is fetched and passed on to the destination (usually a data warehouse). However, not all sources support both modes of operation.
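The difference between the two extraction modes can be sketched in a few lines of Python. Everything here is assumed for illustration: the in-memory SOURCE_ROWS list stands in for a real source, the updated_at counter stands in for a change timestamp, and the names Asha and Ravi are made up. Only Sushil comes from the example above.

```python
# Hypothetical in-memory "source": each row carries an updated_at counter.
SOURCE_ROWS = [
    {"id": 1, "name": "Asha", "updated_at": 1},
    {"id": 2, "name": "Ravi", "updated_at": 2},
    {"id": 3, "name": "Sushil", "updated_at": 3},
]


def extract_snapshot(rows):
    """Snapshot mode: return a copy of every row, changed or not."""
    return [dict(row) for row in rows]


def extract_incremental(rows, cursor):
    """Incremental mode: return rows changed after `cursor`, plus the new cursor."""
    changed = [dict(row) for row in rows if row["updated_at"] > cursor]
    new_cursor = max((row["updated_at"] for row in changed), default=cursor)
    return changed, new_cursor


everything = extract_snapshot(SOURCE_ROWS)
# The previous run stopped at updated_at == 2, so only Sushil's row is new.
new_rows, cursor = extract_incremental(SOURCE_ROWS, cursor=2)
print(len(everything), [row["name"] for row in new_rows], cursor)
```

The returned cursor is what the connector would store between runs, so that the next incremental run starts where this one stopped.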
Writing of data (to destination):
After the data has been fetched, the next step is to write it to the destination (the data warehouse). This can be done in three different ways.
1. Replace mode: In this mode, all of the old data is replaced with the new data on every successful run.
2. Append mode: In this mode, the new data is added to the existing data on every successful run.
3. Merge mode: In this mode, the new data is upserted: a row whose primary key already exists in the destination is updated, and a row with no existing match is inserted. A primary key is mandatory in this mode.
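The three write modes can be illustrated with plain Python lists standing in for a destination table. This is only a sketch under assumptions: the function names, the "id" primary key and the sample rows are hypothetical, and a real warehouse would perform these operations with SQL statements rather than in application code.

```python
def write_replace(existing, incoming):
    """Replace mode: discard the old rows and keep only the incoming ones."""
    return list(incoming)


def write_append(existing, incoming):
    """Append mode: keep the old rows and add the incoming ones after them."""
    return list(existing) + list(incoming)


def write_merge(existing, incoming, key="id"):
    """Merge (upsert) mode: update rows whose key exists, insert the rest."""
    merged = {row[key]: row for row in existing}
    for row in incoming:
        merged[row[key]] = row  # overwrite on a key match, otherwise insert
    return list(merged.values())


existing = [{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ravi"}]
incoming = [{"id": 2, "name": "Ravi K."}, {"id": 3, "name": "Sushil"}]

print(write_replace(existing, incoming))
print(write_append(existing, incoming))
print(write_merge(existing, incoming))
```

Note that append mode leaves two rows with id 2 in the table, while merge mode keeps a single, updated row. This is why merge mode needs a primary key.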
Which database is the best:
Although there are many databases to choose from when using a data connector, MongoDB is the one recommended here, because its dynamic schema makes it flexible when the shape of the incoming data changes.
Practical example of a data connector:
Suppose you have an API that exposes all the fields you need, and as a developer you need to transform some of those fields into a more user-friendly format. To use those fields in the application you are developing, you need a data connector that pulls the data from the API so that your application can consume it. Someone has to build the data connector around the fields you want to fetch from the API. After the required changes are implemented in the data connector, a JAR file is usually generated in some folder (let’s say “folder1”). The next step is to import the data connector from folder1 into your project (the application you are working on) and consume the services provided by the API through that data connector. Keep in mind that this is just one example; the situation can differ depending on the project requirements and the technologies used for that particular project.
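The same idea can be sketched in Python, without the JAR packaging step described above. All names here are hypothetical: fetch_students stands in for a real API call and returns a hard-coded record, and the field names are invented for illustration.

```python
def fetch_students():
    """Stand-in for an API call; a real connector would issue an HTTP request here."""
    return [
        {"student_id": 7, "first_name": "sushil", "attendance_pct": 0.915},
    ]


def to_friendly(record):
    """Reshape one raw API record into the fields the application displays."""
    return {
        "id": record["student_id"],
        "name": record["first_name"].title(),
        "attendance": f"{record['attendance_pct'] * 100:.1f}%",
    }


def run_connector(fetch=fetch_students):
    """Pull records from the source and return them in the friendly format."""
    return [to_friendly(record) for record in fetch()]


print(run_connector())
```

Passing the fetch function as a parameter keeps the transformation logic separate from the API access, so the connector can be pointed at a different source or tested without a network connection.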