Data Access with Knime

Reading Time: 5 minutes

One of the most common challenges faced by aspiring data scientists and data enthusiasts is to handle data science components and it’s coding simultaneously. KNIME is a solution to that challenge. It is a GUI based tool to create workflows without worrying about the coding behind it. It allows the user to perform various functions ranging from basic I/O to data manipulations, transformations, and data mining without coding.

Of course, the first step to start any workflow is to read data from different sources. KNIME Analytics Platform offers many nodes to access different data sources: formatted text files, binary files, SOAP and REST web services, databases, Big Data Platforms. But before going through the nodes, the user needs to know the different ways by which you can specify the path from which data is to be read.

Absolute and Relative Paths

Every data access node requires access to a file path. The user can specify the file’s full path or relative path.

  • Local Path: The user can specify the file’s full path of its local machine, like:
    /home/knoldus/Desktop/acted_in.csv
    A disadvantage with this approach is that while sharing workflow with other users, it is not necessary that the file will be present in the path specified for their machine.
  • Absolute URI: The user can use Absolute Paths when using knime:// protocol for accessing files relative to the Knime workspace. These paths are relative to the specific mount point, like LOCAL for local workspace.
    knime://LOCAL/My_Project/data/sales_2008-2011.csv
  • Mountpoint – relative URI: If the user is deploying a workflow to the server, the LOCAL in the file path is not possible. It is better to replace it with knime.mountpoint.
    knime.mountpoint refers to the uppermost folder level, which can be local or the mount ID of the server.
    knime://knime.mountpoint/My_Project/data/sales_2008-2011.csv
  • Workflow-relative Path: It is the most flexible file path and specifies the file path relative to the currently executing workflow. Using this the user can access data files in workflows in local workspaces on different systems, or on a server, as long as the folder structure between the workflow and the data file is the same.
    For example, while working in the Local workspace, if we want to access data files present in the folder named DataWrangler, then a relative path can be specified like:
    knime://knime.workflow/../DataWrangler/data/CustomerInfo.csv
    Here, /../ denotes an upper folder level.

Now, that we know the path of the different way to a file can be set, we can look at the various nodes for Data Access.

Knime Data Access Node

File Reader Node

The most versatile node to access a text file with any format is File Reader Node.

File Reader Node Configuration

Apart from selecting the path of the file, the user can configure the node to select the delimiter for the file, indicate whether there are column headers present or not, and whether there is a separate column for unique Row Ids in the data or not.

Table Reader Node

Another way of reading data is to use Knime’s proprietary .table format. Though this format is optimized for speed and size, it is not portable to other platforms. The Table Reader node reads files in the .table format. It already contains information about the file structure and the user only needs to specify the file path to read.

Excel Reader Node

Most Data Analytics Application reads some of its data from an excel file. Knime provides an Excel Reader Node to read Excel files.

Excel Reader Node Configuration

In the node’s configuration, we specify the file path. Once the file path is specified, the node auto-fills the sheet names in the window. The user can select only one of the sheets. Also, the user can select the columns to read data from, the column to be used for unique Row IDs (if any) and determine how to handle missing values in a column. A preview option is also present which allows the user to see what the node is currently reading.

CSV Reader Node

This knime node reads CSV files. It is recommended to be used when the workflow is to be used in a server or batch environment. Upon executing, the node scans the input file to determine number and types of the columns. Then, it outputs a table with the auto-guessed structure. If the workflow is not for server, then the user can use the File Reader node only.

CSV Reader Node Configuration

JSON Reader

This node reads the .json file and parses it as a JSON value. The user has to use the standard JSON format for JSON reader to parse.

{"employees":[
{"firstName":"John", "lastName":"Doe"},
{"firstName":"Anna", "lastName":"Smith"},
{"firstName":"Peter", "lastName":"Jones"}
]}

JSON Reader Node Configuration

In the configuration window, the user needs to specify the JSON file’s path. The node writes the complete JSON data in a single column, whose name can be changed in the configuration window. Also, to select only a particular key of the JSON, we can specify the key’s JSON Path in the panel. For the above JSON file, if we only want to access first names, then the corresponding JSON Path would be $['employees'][*]['firstName']

Database Access Nodes

Generic DB Access Workflow

In order to access data from a database, knime provides a node to connect to various JDBC – compliant databases. For example, MySQL Connector, Oracle Connector, SQLite Connector. If no dedicated connector node exists for the database, then the user can use the generic DB Connector node. It requires an appropriate driver and JDBC URL of the database to connect to the required database. Once the connection is established, the user can use a DB Table Selector node to select a table or view from the database. This node allows the user to enter a customized SELECT statement too. Other DB queries can be added as a node using the DB Query node. After the DB workflow is ready, knime’s DB Reader node executes the input query in the selected database. Then, it retrieves the result into a Knime data table.

Data from REST Services

Generic REST GET Request Workflow

Data from the web is easily available via REST services. Knime’s GET Request node sends a GET request to a REST service from within the knime workflow. The node allows you to either send a request to a fixed URL or to a list of URLs provided by an input table. The user can also send additional request parameters to the service using the configuration panel. By default, the GET request stores its output in a single column in JSON Format. Thus, the user can use JSON to Table or JSON Path node to parse the JSON result into table columns.

GET Request Node Configuration

Conclusion

Apart from these most commonly used nodes to access data, Knime provides a bunch of other reader nodes too. There are nodes for reading XML files, lines from a file or URL, .wav files, and images too. Also, these reader nodes have corresponding writer nodes to write data of an appropriate format to different files.

Knoldus-blog-footer-image