Author: Juoko Virtanen

Using Windows in Spark to Avoid Joins

Reading Time: 3 minutes When you think of windows in Spark, you might think of Spark Streaming, but windows can also be used on regular DataFrames. Window functions calculate an output value for every row of a DataFrame based on a group of rows. I have been working on optimizing some Spark code and have noticed a few places where the use of a window function eliminates the need for Continue Reading

Getting Oracle and H2 to work together

Reading Time: 4 minutes In a previous blog post I wrote about integration testing with H2, an in-memory database. I mentioned that H2 does not perfectly emulate other databases such as Oracle. This means that H2 cannot execute all queries meant to be executed by an Oracle database. This has been a problem for me as I have been writing integration tests for code that makes many calls to Continue Reading

Writing Independent Tests

Reading Time: 2 minutes What does it mean for a test to be independent? It means that a test should not depend upon the presence or absence of other tests, the order of the tests, or whether previous tests failed. It also means that the tests should not be dependent upon external things such as environment variables, an internet connection, or the local time. Why is Continue Reading

Integration Testing with H2

Reading Time: 3 minutes It goes without saying that testing code is essential if you don’t want to have buggy code in production, but how can you test code that queries a database? One solution is an in-memory database, and a common in-memory database is H2. Here is some code that queries a database. More specifically, the method getNumDistinctInColumn returns the number of distinct values in a specified column Continue Reading

Testing Code: A Case Study

Reading Time: 2 minutes Testing code is essential for finding bugs before they get into production. Without automated tests, it will take longer and longer to debug code as it grows in size and complexity, and development will grind to a halt. I am consulting for a company that has not made it a priority to test its code. Thus, one of our early priorities was to write unit Continue Reading

Unit Testing with ScalaCheck

Reading Time: 5 minutes Unit testing is essential to writing good code. Unit testing allows us to capture bugs as they are created, not long after they are deployed. A little time spent writing unit tests can save a lot of time debugging. If unit testing is not done, it can become increasingly difficult and time-consuming to fix bugs as code complexity and size increase. However, in Continue Reading

A Machine Learning Case Study

Reading Time: 5 minutes This is my third blog post on MR-REX, a software package used to help determine protein structures given experimental X-ray crystallographic data, which I created while a postdoctoral fellow at the University of Michigan. You can find the other two here and here. MR-REX uses several terms to assess how well a particular placement of the protein in the unit cell reproduces the experimental X-ray Continue Reading

Maximum Likelihood

Reading Time: 7 minutes Maximum likelihood is the procedure of finding the value of some parameters for a given statistic which makes the likelihood of the known likelihood distribution a maximum. Maximum likelihood is a method with many uses. A classic example is linear regression. If it is assumed that the errors on the x variable follow a Gaussian distribution we can compute the probability density of the Continue Reading

Storing and querying triples using Apache Rya

Reading Time: 3 minutes Apache Rya is a tool for storing and querying triples at scale. It is not widely used, and consequently it is poorly documented and difficult to get started with. This blog post is intended to give people the information they need to get started with Rya. In order to run Apache Rya you will need Accumulo, Hadoop, ZooKeeper, and of course Rya Continue Reading

Protein Structure determination aided by Stochastic Search (Replica Exchange Monte-Carlo Method)

Reading Time: 8 minutes Introduction Proteins are large molecules, which occur in abundance in every single living organism. They carry out vital functions such as transporting oxygen, converting the food you eat into energy your body can use, and many more. Proteins are long chains of linked units called amino acids. There are 20 types of amino acids. Proteins fold into different shapes depending upon their sequence of amino Continue Reading