1. Tools

Let’s begin by defining the working tools: the directory for storing generated files, a collection of useful links and the list of packages used.

1.1 Setup Working Directory

‘/Users/olgabelitskaya/projects/nd002/Data_Analyst_ND_Project4’

https://github.com/OlgaBelitskaya/data-analyst-nd002/tree/master/Data_Analyst_ND_Project4

1.2 Useful links

1.3 Libraries

List of the libraries:

devtools, knitr, markdown, ggplot2, ggthemes, RColorBrewer, gridExtra, scales, reshape2, plyr, GGally, dplyr, tidyr, xlsx, lubridate, plotly, etc.

2. Data

2.1 Introduction

I have chosen an open database of quotations of currencies and precious metals located on the site of the Bank of Russia.

http://www.cbr.ru/Eng/hd_base/

The variables and the dependencies between them are easy to visualize in this case. I generated a file in xlsx format from the publicly available data and downloaded it.

2.2 Load the Data and Review

Using the functions read.xlsx() and data.frame(), I read the file into a dataframe called “centrobank”. The function head() shows the first few rows as an example.

        date dual_currency_basket EUR_978 USD_840 k_JPY JPY_392_100
1 2012-01-11              35.8717 40.7591 31.8729   100     41.4931
2 2012-01-12              35.6115 40.4061 31.6886   100     41.1968
3 2012-01-13              35.5527 40.2852 31.6807   100     41.1999
   JPY_392 k_CNY CNY_156_k CNY_156 BRL_986 k_INR INR_356_k  INR_356
1 0.414931    10   50.4805 5.04805 17.3884   100   60.9978 0.609978
2 0.411968    10   50.1576 5.01576 17.6077   100   61.3318 0.613318
3 0.411999    10   50.1467 5.01467 17.5711   100   61.1597 0.611597
     gold silver platinum palladium foreign_exchange_reserves
1 1667.25  29.56  1495.09    648.66                   453 952
2 1671.87  30.25  1512.93    653.06                   453 952
3 1683.17  30.36  1527.84    652.90                   453 952
  monetary_gold
1        44 697
2        44 697
3        44 697

First, there are no “NA” values in this dataframe. I checked this with a specially created indicator of rows that contain “NA” values, row.with.na.

The code displays the number of such rows. As we can see, it is zero.

sum(row.with.na) =  0
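A minimal sketch of how such a check might look. The original code is not shown, so the construction of row.with.na below is an assumption, and centrobank_demo is a hypothetical three-row stand-in built from the rows printed above.

```r
# Hypothetical stand-in for the real dataframe (values from the head() output)
centrobank_demo <- data.frame(
  EUR_978 = c(40.7591, 40.4061, 40.2852),
  USD_840 = c(31.8729, 31.6886, 31.6807)
)

# Assumed construction: TRUE for every row with at least one NA
row.with.na <- apply(centrobank_demo, 1, function(x) any(is.na(x)))
n_na_rows <- sum(row.with.na)

# Equivalent base R shortcut
n_incomplete <- sum(!complete.cases(centrobank_demo))

print(n_na_rows)
```

Both counts should agree; complete.cases() is usually the shorter way to write the same check.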

Second, let us look at the components of the dataframe (1128 observations of 20 variables, with their types and values). Most of them are numerical.

The function str() is very helpful and informative in this case.

'data.frame':   1128 obs. of  20 variables:
 $ date                     : Date, format: "2012-01-11" "2012-01-12" ...
 $ dual_currency_basket     : num  35.9 35.6 35.6 35.6 35.7 ...
 $ EUR_978                  : num  40.8 40.4 40.3 40.6 40.4 ...
 $ USD_840                  : num  31.9 31.7 31.7 31.6 31.9 ...
 $ k_JPY                    : num  100 100 100 100 100 100 100 100 100 100 ...
 $ JPY_392_100              : num  41.5 41.2 41.2 41.2 41.6 ...
 $ JPY_392                  : num  0.415 0.412 0.412 0.412 0.416 ...
 $ k_CNY                    : num  10 10 10 10 10 10 10 10 10 10 ...
 $ CNY_156_k                : num  50.5 50.2 50.1 50.1 50.6 ...
 $ CNY_156                  : num  5.05 5.02 5.01 5.01 5.06 ...
 $ BRL_986                  : num  17.4 17.6 17.6 17.8 17.9 ...
 $ k_INR                    : num  100 100 100 100 100 100 100 100 100 100 ...
 $ INR_356_k                : num  61 61.3 61.2 61.5 62 ...
 $ INR_356                  : num  0.61 0.613 0.612 0.615 0.62 ...
 $ gold                     : num  1667 1672 1683 1667 1687 ...
 $ silver                   : num  29.6 30.2 30.4 31.1 30.4 ...
 $ platinum                 : num  1495 1513 1528 1506 1529 ...
 $ palladium                : num  649 653 653 642 657 ...
 $ foreign_exchange_reserves: Factor w/ 55 levels "307 718","308 895",..: 30 30 30 30 30 30 30 30 30 30 ...
 $ monetary_gold            : Factor w/ 54 levels "38 547","39 990",..: 12 12 12 12 12 12 12 12 12 12 ...

Third, note the units: exchange rates are measured in rubles, the prices of precious metals in rubles per gram, and foreign exchange reserves and monetary gold reserves in millions of US dollars, reported for every month.

The special variable “dual currency basket” is calculated according to the formula: 0.55 USD + 0.45 EUR.
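The formula can be checked against the three rows printed by head() above. This is only an arithmetic sketch; the vector names are chosen for illustration.

```r
# Values copied from the head() output above
usd      <- c(31.8729, 31.6886, 31.6807)   # USD_840
eur      <- c(40.7591, 40.4061, 40.2852)   # EUR_978
reported <- c(35.8717, 35.6115, 35.5527)   # dual_currency_basket

# Dual currency basket: 0.55 USD + 0.45 EUR
basket <- 0.55 * usd + 0.45 * eur

# Largest difference from the reported column (rounding only)
max_abs_diff <- max(abs(basket - reported))
print(round(basket, 4))
```

For the first row, 0.55 * 31.8729 + 0.45 * 40.7591 = 35.87169, which agrees with the reported 35.8717 up to rounding.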

The foreign exchange reserves and monetary gold reserves are official monthly figures for the state reserves of Russia.

It is natural to use the function summary() to review all variables. It gives an overview of the whole dataset.

      date            dual_currency_basket    EUR_978         USD_840     
 Min.   :2012-01-11   Min.   :33.30        Min.   :38.41   Min.   :28.95  
 1st Qu.:2013-02-25   1st Qu.:35.73        1st Qu.:40.63   1st Qu.:31.80  
 Median :2014-04-16   Median :40.59        Median :47.54   Median :35.02  
 Mean   :2014-04-18   Mean   :48.46        Mean   :53.63   Mean   :44.24  
 3rd Qu.:2015-06-11   3rd Qu.:65.05        3rd Qu.:69.02   3rd Qu.:61.51  
 Max.   :2016-07-30   Max.   :87.01        Max.   :91.18   Max.   :83.59  
                                                                          
     k_JPY      JPY_392_100       JPY_392           k_CNY       
 Min.   :100   Min.   :30.38   Min.   :0.3038   Min.   : 1.000  
 1st Qu.:100   1st Qu.:33.96   1st Qu.:0.3396   1st Qu.:10.000  
 Median :100   Median :39.09   Median :0.3909   Median :10.000  
 Mean   :100   Mean   :42.49   Mean   :0.4249   Mean   : 7.941  
 3rd Qu.:100   3rd Qu.:51.04   3rd Qu.:0.5104   3rd Qu.:10.000  
 Max.   :100   Max.   :71.39   Max.   :0.7139   Max.   :10.000  
                                                                
   CNY_156_k         CNY_156          BRL_986          k_INR       
 Min.   : 9.416   Min.   : 4.585   Min.   :13.54   Min.   : 10.00  
 1st Qu.:46.638   1st Qu.: 5.089   1st Qu.:15.33   1st Qu.:100.00  
 Median :51.125   Median : 5.690   Median :16.04   Median :100.00  
 Mean   :48.586   Mean   : 7.027   Mean   :16.71   Mean   : 87.55  
 3rd Qu.:57.364   3rd Qu.: 9.658   3rd Qu.:17.92   3rd Qu.:100.00  
 Max.   :99.987   Max.   :12.705   Max.   :26.68   Max.   :100.00  
                                                                   
   INR_356_k        INR_356            gold          silver     
 Min.   :10.00   Min.   :0.4880   Min.   :1255   Min.   :19.57  
 1st Qu.:54.60   1st Qu.:0.5669   1st Qu.:1493   1st Qu.:23.16  
 Median :57.78   Median :0.5913   Median :1667   Median :28.71  
 Mean   :58.93   Mean   :0.7215   Mean   :1862   Mean   :28.49  
 3rd Qu.:66.39   3rd Qu.:0.9504   3rd Qu.:2263   3rd Qu.:32.27  
 Max.   :99.99   Max.   :1.2305   Max.   :3168   Max.   :43.38  
                                                                
    platinum      palladium      foreign_exchange_reserves monetary_gold
 Min.   :1392   Min.   : 575.1   313 342: 23               57 269 : 37  
 1st Qu.:1547   1st Qu.: 707.4   317 028: 23               42 630 : 23  
 Median :1643   Median : 904.6   322 375: 23               43 129 : 23  
 Mean   :1764   Mean   : 969.0   409 224: 23               45 016 : 23  
 3rd Qu.:1975   3rd Qu.:1235.7   431 958: 23               46 292 : 23  
 Max.   :2747   Max.   :1745.7   461 865: 23               47 680 : 23  
                                 (Other):990               (Other):976

Finally, as we can see, some variables (foreign_exchange_reserves, monetary_gold) have the wrong type: factor instead of numeric. So the investigation proceeds to the next step, cleaning the data.

2.3 Cleaning and Managing Data

I converted foreign_exchange_reserves and monetary_gold to the correct (numerical) type.

The problem was a white space inside every value of these columns. I created two temporary variables (‘temp1’ and ‘temp2’) holding the values with the white spaces removed, then changed their type from “factor” to “numeric” and wrote them back into the columns foreign_exchange_reserves and monetary_gold.

After the transformation we can apply the function summary() to these columns too (foreign_exchange_reserves first, then monetary_gold).

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 307700  327100  442800  409800  473400  486600

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  38550   45020   47680   48120   50440   63500

Then I removed the temporary variables: I created a variable “drop” listing the temporary columns and excluded them from the dataset.
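The cleaning step could be sketched as follows. The exact functions used in the project are not shown, so gsub() and the pattern below are assumptions, and reserves_demo is a hypothetical stand-in; if the separator in the real file is a non-breaking space, the pattern may need adjusting.

```r
# Hypothetical stand-in: a factor column with spaces inside the numbers
reserves_raw <- factor(c("453 952", "453 952", "307 718"))

# Go through character first: as.numeric() on a factor returns level codes
temp1 <- gsub("[[:space:]]", "", as.character(reserves_raw))
reserves_num <- as.numeric(temp1)

reserves_demo <- data.frame(
  foreign_exchange_reserves = reserves_num,
  temp1 = temp1
)

# Exclude the temporary column by name
drop <- c("temp1")
reserves_demo <- reserves_demo[, !(names(reserves_demo) %in% drop), drop = FALSE]

str(reserves_demo)
```

Converting the factor to character before calling as.numeric() is the important detail: applied directly to a factor, as.numeric() would return the internal level numbers instead of the values.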

For the next steps it is useful to have separate variables for the date components (day, month and year), so I added them. For “day” and “year” I chose the numerical type, for “month” the character type (chr).
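One possible way to derive these three variables in base R is sketched below. The project may have used lubridate instead, so this is an assumption; month.name is used so that the month labels do not depend on the system locale.

```r
# Dates copied from the first rows of the dataset
dates <- as.Date(c("2012-01-11", "2012-01-12", "2012-01-13"))

day   <- as.numeric(format(dates, "%d"))              # numeric day of month
month <- month.name[as.integer(format(dates, "%m"))]  # character month name
year  <- as.numeric(format(dates, "%Y"))              # numeric year

str(data.frame(date = dates, day, month, year, stringsAsFactors = FALSE))
```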

Let’s check the characteristics of the variables again:

'data.frame':   1128 obs. of  23 variables:
 $ date                     : Date, format: "2012-01-11" "2012-01-12" ...
 $ dual_currency_basket     : num  35.9 35.6 35.6 35.6 35.7 ...
 $ EUR_978                  : num  40.8 40.4 40.3 40.6 40.4 ...
 $ USD_840                  : num  31.9 31.7 31.7 31.6 31.9 ...
 $ k_JPY                    : num  100 100 100 100 100 100 100 100 100 100 ...
 $ JPY_392_100              : num  41.5 41.2 41.2 41.2 41.6 ...
 $ JPY_392                  : num  0.415 0.412 0.412 0.412 0.416 ...
 $ k_CNY                    : num  10 10 10 10 10 10 10 10 10 10 ...
 $ CNY_156_k                : num  50.5 50.2 50.1 50.1 50.6 ...
 $ CNY_156                  : num  5.05 5.02 5.01 5.01 5.06 ...
 $ BRL_986                  : num  17.4 17.6 17.6 17.8 17.9 ...
 $ k_INR                    : num  100 100 100 100 100 100 100 100 100 100 ...
 $ INR_356_k                : num  61 61.3 61.2 61.5 62 ...
 $ INR_356                  : num  0.61 0.613 0.612 0.615 0.62 ...
 $ gold                     : num  1667 1672 1683 1667 1687 ...
 $ silver                   : num  29.6 30.2 30.4 31.1 30.4 ...
 $ platinum                 : num  1495 1513 1528 1506 1529 ...
 $ palladium                : num  649 653 653 642 657 ...
 $ foreign_exchange_reserves: num  453952 453952 453952 453952 453952 ...
 $ monetary_gold            : num  44697 44697 44697 44697 44697 ...
 $ day                      : num  11 12 13 14 17 18 19 20 21 24 ...
 $ month                    : chr  "January" "January" "January" "January" ...
 $ year                     : num  2012 2012 2012 2012 2012 ...

Now the dataset has 23 variables and is ready for analysis.

3. Univariate Plots Section

3.1 Univariate Plots

3.1.1 Histograms of the distribution for all the analyzed variables.

The package “ggplot2” helps us to create them with a wide set of functions.

I put the histogram of the dual currency basket first because I consider it the most important indicator of the situation in the currency market.

Next come the histograms of the prices of precious metals in rubles. They were built and grouped using the packages “ggplot2” and “gridExtra”.

We will also examine the histograms of the main currencies in the Russian market (the euro and the US dollar), as well as the histograms of foreign exchange reserves and monetary gold.

To understand and predict the situation in the currency market it is also necessary to examine the histograms of other currencies (Brazilian real, Japanese yen, Chinese yuan and Indian rupee).

3.1.2 Box plots for every significant variable.

I created them with the functions of “ggplot2” as well.

Box plots help to discover outliers in the indicators. I put all the graphs together to get a general picture of the outliers of the economic variables. I divided them into groups by color: orange for USD and EUR, brown for Asian currencies, purple for the Brazilian real, blue for precious metals, red for foreign exchange reserves, green for monetary gold.

I did not include the variable “dual_currency_basket”: it depends on EUR and USD, and so do its outliers.

All box plots show the important indicators: min, max, median and quartiles. Means are marked with points, which were added with the function stat_summary().

The x-axis carries no information; it is only a placeholder needed to draw the box plots.
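A minimal sketch of a box plot with the mean marked by stat_summary(). The layout differs from the original figure: here two metals share one panel, metals_demo is a hypothetical stand-in built from the first three rows of the dataset, and the fill color is arbitrary. The argument fun = mean assumes ggplot2 3.3.0 or later (older versions used fun.y).

```r
library(ggplot2)

# Hypothetical stand-in in long format (values from the head() output)
metals_demo <- data.frame(
  metal = rep(c("gold", "platinum"), each = 3),
  price = c(1667.25, 1671.87, 1683.17, 1495.09, 1512.93, 1527.84)
)

p <- ggplot(metals_demo, aes(x = metal, y = price)) +
  geom_boxplot(fill = "steelblue") +
  # Mark the mean of each group with a point
  stat_summary(fun = mean, geom = "point", shape = 4, size = 3) +
  labs(x = NULL, y = "price, rubles per gram")
```

Calling print(p) draws the plot; several such plots can be arranged together with gridExtra::grid.arrange().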

3.2 Univariate Analysis

My data is a set of the main financial indicators. Even at first glance, many of them show similar tendencies, so we can expect many dependencies between the variables.

These indicators also reflect the state of the country’s economy in general, which makes them an interesting subject for investigation.

Now let us discuss the visualizations obtained.

Most of the histograms are asymmetric, right-skewed and multimodal. For histograms that tend to be unimodal I used the heat color palette; for histograms that tend to be bimodal, the terrain color palette. In practice, a unimodal histogram means that the variable stayed within a certain interval for a sufficiently long period of time; a bimodal histogram means there were two shorter periods of stability. The dispersion of all indicators is also quite large.

The histograms for “foreign_exchange_reserves” and “monetary_gold” have a complicated form with many gaps. I marked them with a special palette.

I did not describe every histogram separately because many variables have very similar behaviour and distributions. We will see this much more clearly in the next steps.

Let’s discuss the outliers on the last graph in this section. Even without calculating exact deviations from the general trend, it is immediately clear that there are too many outliers here. This points to the influence of economic variables that are not included in this dataset. If we excluded the outliers, we could lose an important tendency and material for further discoveries. As long as the reasons for these frequent and sharp deviations remain unexplained, I want to keep these data points in the analysis.

4. Bivariate Plots Section

4.1 Bivariate Plots

4.1.1 Plots for all pairs of variables.

Let’s start the section “Bivariate Plots” with graphs of the correlation analysis. They help us to understand the strong relationships between variables in this dataset.

I chose the R packages “PerformanceAnalytics” and “corrplot”. I created a new dataset containing only the 13 important variables.

'data.frame':   1128 obs. of  13 variables:
 $ dual_currency_basket     : num  35.9 35.6 35.6 35.6 35.7 ...
 $ EUR_978                  : num  40.8 40.4 40.3 40.6 40.4 ...
 $ USD_840                  : num  31.9 31.7 31.7 31.6 31.9 ...
 $ JPY_392                  : num  0.415 0.412 0.412 0.412 0.416 ...
 $ CNY_156                  : num  5.05 5.02 5.01 5.01 5.06 ...
 $ BRL_986                  : num  17.4 17.6 17.6 17.8 17.9 ...
 $ INR_356                  : num  0.61 0.613 0.612 0.615 0.62 ...
 $ gold                     : num  1667 1672 1683 1667 1687 ...
 $ silver                   : num  29.6 30.2 30.4 31.1 30.4 ...
 $ platinum                 : num  1495 1513 1528 1506 1529 ...
 $ palladium                : num  649 653 653 642 657 ...
 $ foreign_exchange_reserves: num  453952 453952 453952 453952 453952 ...
 $ monetary_gold            : num  44697 44697 44697 44697 44697 ...

The function chart.Correlation() lets us display the correlations between all these variables.

I also added one more correlation chart, with a color scale indicating the strength of the relationships between variables.

The charts show strong correlations between a number of variables; we will examine only some of them in detail.
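The numbers behind such charts are an ordinary correlation matrix. As a sketch in base R (not the chart.Correlation() or corrplot calls themselves), with centrobank_demo as a hypothetical stand-in built from the first three rows of the dataset:

```r
# Hypothetical stand-in (values from the head() output)
centrobank_demo <- data.frame(
  date                 = as.Date(c("2012-01-11", "2012-01-12", "2012-01-13")),
  dual_currency_basket = c(35.8717, 35.6115, 35.5527),
  EUR_978              = c(40.7591, 40.4061, 40.2852),
  USD_840              = c(31.8729, 31.6886, 31.6807),
  gold                 = c(1667.25, 1671.87, 1683.17)
)

# Keep only the numeric variables of interest
keep <- c("dual_currency_basket", "EUR_978", "USD_840", "gold")
corr_matrix <- cor(centrobank_demo[, keep], method = "pearson")

print(round(corr_matrix, 3))
```

With only three rows the coefficients are not meaningful; the sketch only shows the mechanics of selecting columns and computing the matrix that the charting functions visualize.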

4.1.2 Plots for exchange rates (BRL 986, JPY 392, CNY 156, INR 356).

Let’s look at the behavior of the rates for the Brazilian real, Japanese yen, Chinese yuan and Indian rupee. We will explore how they change with time.

The packages “ggplot2” and “gridExtra” are very convenient for combining many plots together.

To create a common legend for several graphs, I applied the function get_legend() taken from this Stack Overflow question:

http://stackoverflow.com/questions/24954912/sharing-a-legend-between-two-combined-ggplots

The image was also saved as a file.

The similar trends in the rates of the Asian currencies are emphasized by a common color scheme.

4.1.3 Plots for foreign exchange and monetary gold reserves.

For the other two variables, “foreign_exchange_reserves” and “monetary_gold”, we can also see the changes step by step.

The range of fluctuations in the values is emphasized with a color gradient set by the function scale_colour_gradientn().

4.1.4 Plots for prices of precious metals and the dual currency basket.

The graph for the variables “silver” and “dual currency basket” has the most complex structure. The other graphs in this section are quite close to linear (especially for the pair “gold” and “dual currency basket”).

4.1.5 Plots for Asian exchange rates and the dual currency basket.

The function geom_smooth() helps to evaluate the dependence of one variable on the other.

It shows once again how similar the trends of the Asian currencies are.

4.1.6 Plots for foreign exchange reserves and monetary gold reserves.

The last graph shows a complex nonlinear dependency, so let’s discuss it in the next section.

4.2 Bivariate Analysis

Let’s start this section with some notes about visual observation.

The significant rise in the prices of precious metals relative to the national currency suggests that precious metals are a reasonable long-term reserve.

The bivariate graphs show different exchange rate trends for the Brazilian real and the Asian currencies. The Brazilian real has one rise relative to the Russian ruble, while the Asian currencies have many peaks. However, there are certain similarities too: a period of small oscillations at the beginning of the chart gives way to a period of sharp spikes.

The plots for foreign exchange and monetary gold reserves show diametrically opposed situations in the second half of the time interval: the foreign exchange reserves show a downward trend, the stocks of monetary gold a steady rise.

We can continue the analysis with correlation tests for these plotted pairs of variables.

The function cor.test() tests for association between paired samples, here using Pearson’s product-moment correlation coefficient.


    Pearson's product-moment correlation

data:  dual_currency_basket and gold
t = 86.302, df = 1126, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9239207 0.9392967
sample estimates:
      cor 
0.9320269


    Pearson's product-moment correlation

data:  dual_currency_basket and silver
t = 23.544, df = 1126, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5338973 0.6122110
sample estimates:
      cor 
0.5743669


    Pearson's product-moment correlation

data:  dual_currency_basket and platinum
t = 70.021, df = 1126, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8902876 0.9121507
sample estimates:
      cor 
0.9017946


    Pearson's product-moment correlation

data:  dual_currency_basket and palladium
t = 64.112, df = 1126, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8727475 0.8979171
sample estimates:
      cor 
0.8859831


    Pearson's product-moment correlation

data:  dual_currency_basket and JPY_392
t = 85.698, df = 1126, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9229565 0.9385212
sample estimates:
      cor 
0.9311618


    Pearson's product-moment correlation

data:  dual_currency_basket and CNY_156
t = 412.21, df = 1126, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9962950 0.9970661
sample estimates:
     cor 
0.996703


    Pearson's product-moment correlation

data:  dual_currency_basket and INR_356
t = 182.94, df = 1126, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9815756 0.9853877
sample estimates:
      cor 
0.9835911


    Pearson's product-moment correlation

data:  JPY_392 and CNY_156
t = 77.832, df = 1126, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9086244 0.9269746
sample estimates:
      cor 
0.9182913


    Pearson's product-moment correlation

data:  JPY_392 and INR_356
t = 96.989, df = 1126, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9384341 0.9509517
sample estimates:
      cor 
0.9450382


    Pearson's product-moment correlation

data:  INR_356 and CNY_156
t = 202.67, df = 1126, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9849159 0.9880411
sample estimates:
      cor 
0.9865685


    Pearson's product-moment correlation

data:  foreign_exchange_reserves and monetary_gold
t = -14.395, df = 1126, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.4424186 -0.3437725
sample estimates:
       cor 
-0.3942305
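As a sketch of how the reported numbers fit together: for Pearson’s test the t statistic follows from the estimate r and the degrees of freedom as t = r * sqrt(df) / sqrt(1 - r^2). The first part below recomputes t from the values reported for dual_currency_basket and gold. The second part only illustrates the call itself on five rounded values taken from the str() output, so its result is not comparable to the full-data tests.

```r
# Reported estimate and degrees of freedom (dual_currency_basket and gold)
r  <- 0.9320269
df <- 1126                      # n - 2 with n = 1128 observations

# t statistic implied by r and df
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
print(round(t_stat, 2))

# Illustration of the call on five rounded values from the str() output
basket <- c(35.9, 35.6, 35.6, 35.6, 35.7)
gold   <- c(1667, 1672, 1683, 1667, 1687)
res <- cor.test(basket, gold, method = "pearson")
print(res$estimate)
```

The recomputed statistic should agree with the reported t = 86.302 up to rounding of r.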

Most of the observed relationships show a strong positive correlation (dual_currency_basket & gold, dual_currency_basket & platinum, dual_currency_basket & palladium, dual_currency_basket & JPY_392, dual_currency_basket & INR_356, dual_currency_basket & CNY_156, JPY_392 & CNY_156, JPY_392 & INR_356, INR_356 & CNY_156). There is a moderate positive relationship between dual_currency_basket & silver and a weak negative relationship between foreign_exchange_reserves & monetary_gold.

The strongest relationship I found is between dual_currency_basket & CNY_156.

The revealed dependencies appear to rest on cause-and-effect relationships: a fall in the value of the national currency against a set of major currencies leads to its fall against other currencies and precious metals.

Even the weak relationship between foreign exchange reserves and monetary gold reserves does not look accidental: reduced confidence in the foreign exchange reserves of the state leads to the accumulation of resources in another equivalent.

5. Multivariate Plots Section

5.1 Multivariate Plots

5.1.1 A plot for dual currency basket and gold price per year.

5.1.2 A plot for exchange rates (JPY, CNY, INR) and gold price.

Here I applied one of the most effective packages, “plotly”, and used a 3D scatterplot.

I deliberately used a 3-dimensional image to show how limited the region of space containing the graph of the three exchange rates is; this means there is a strong connection between all these variables. At the same time, periods of rising exchange rates in many cases coincide with growth in the price of gold. At least twice we can also see the gold price return to its previous level, so we should look for the reason for these events.

I used color to highlight some other important tendencies: I wanted to show once again the range over which the variable “price of gold” changes and to identify tentative boundaries of these changes for myself. The fragments of the chart where significant changes occurred indicate the influence of factors that must be found and identified (the prices of oil, economic sanctions, etc.).

5.1.3 A plot for prices of precious metals (gold, silver, platinum) and the dual currency basket.

To organize the graph space (titles, axes, etc.) the function layout() was used.

After adding the variable “price of silver”, we see a different picture: the region of space occupied by the graph is much wider, and therefore the relationship between the three variables (gold, silver and platinum) is weaker.

The periods when the cost of the dual currency basket (and the exchange rates of USD and EUR) increases in general coincide with growth in the prices of precious metals.

5.1.4 A plot for the ratio of year means

Let me introduce artificial variables, showing the ratio of year means of the precious metals prices to year means of the cost of dual currency basket. For ease of combining on the same graph, the prices of gold, platinum and palladium are for 1 g, the price of silver - for 100 g.

For this purpose I created the dataset centrobank_y with means per year and the variable centrobank_y$silver100.
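A minimal sketch of how yearly means and ratios of this kind might be computed in base R. The input values are made up purely for illustration, the use of aggregate() is an assumption, and the column names gold_ratio and silver100_ratio are hypothetical; only centrobank_y and silver100 come from the text above.

```r
# Made-up illustrative data: two observations per year
demo <- data.frame(
  year                 = c(2012, 2012, 2013, 2013),
  dual_currency_basket = c(35, 37, 40, 42),
  gold                 = c(1600, 1700, 1500, 1400),
  silver               = c(30, 32, 24, 22)
)

# Means per year for each indicator
centrobank_y <- aggregate(
  cbind(dual_currency_basket, gold, silver) ~ year,
  data = demo, FUN = mean
)

# Price of 100 g of silver, to bring it to the scale of the other metals
centrobank_y$silver100 <- centrobank_y$silver * 100

# Ratio of yearly mean metal price to yearly mean basket cost
centrobank_y$gold_ratio      <- centrobank_y$gold / centrobank_y$dual_currency_basket
centrobank_y$silver100_ratio <- centrobank_y$silver100 / centrobank_y$dual_currency_basket

print(centrobank_y)
```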

To combine charts for four metals I used the function add_trace().

I combined all the indicators to watch the overall trend: how the ratio of the prices of precious metals to the value of the dual currency basket varies with time.

5.1.5 Plots for exchange rates and prices of precious metals.

To combine all the rates and prices on the same graphs, I created and added variables for the indicators multiplied by coefficients: 50 USD, 50 EUR, 300 CNY, 4000 JPY, 100 BRL, 3000 INR, 100 g of silver.
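The effect of these coefficients can be seen on the first row of the dataset. This is a sketch; the vector names row1, multipliers and scaled are chosen for illustration.

```r
# First row of the dataset (values from the head() output)
row1 <- c(USD_840 = 31.8729, EUR_978 = 40.7591, CNY_156 = 5.04805,
          JPY_392 = 0.414931, BRL_986 = 17.3884, INR_356 = 0.609978,
          silver = 29.56)

# Coefficients listed in the text
multipliers <- c(USD_840 = 50, EUR_978 = 50, CNY_156 = 300,
                 JPY_392 = 4000, BRL_986 = 100, INR_356 = 3000,
                 silver = 100)

# Match by name so the order of the two vectors does not matter
scaled <- row1 * multipliers[names(row1)]
print(round(scaled, 2))
```

After scaling, all values for this row lie between 1500 and 3000 rubles, the same order of magnitude as the gold price of 1667.25 rubles per gram, which is what makes a shared axis practical.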

To apply the package “dygraphs”, I created a new time series dataset from the source database. The function xts() was very helpful.

The function dygraph() lets us combine all the rates on the same chart.

This method is also useful for the prices of precious metals.

The apparent advantage of this method is the ability to see the exact values of exchange rates and prices for each specific date.

5.2 Multivariate Analysis

The first graph shows the danger of some statistical predictions and conclusions. According to the analysis above, the gold price depends quite strongly on the cost of the dual currency basket, on currency exchange rates and on the prices of other precious metals, but in 2013 this indicator shows diametrically opposite behavior. This database does not allow us to identify the factors behind this, so deeper research is needed.

Over the long term it is easy to see that the prices of precious metals (for example, gold) and currencies show almost the same trend relative to the national currency. Only silver prices differed from the rest: the price of this metal was mostly decreasing in the 2012-2014 period.

These conclusions are confirmed by the three-dimensional graphs. Adding the price of silver as a variable led to a considerable spread of the data.

The artificial variables combined on the bar chart show a very interesting tendency. Precious metals are traditionally considered one of the best investments, and their cost in monetary terms has been steadily increasing over the long term. But the ratio of the prices of precious metals to the value of the dual currency basket falls from 2012 to 2016. This is particularly noticeable when the four main metals are analyzed together.

5.3 OPTIONAL: Building the Linear Model.

We can construct a model of the dependence of the price of gold on the cost of the currency basket and the prices of other precious metals.

For fitting the linear model the functions lm() and mtable() were used.

## 
## Calls:
## m1: lm(formula = gold ~ dual_currency_basket, data = centrobank)
## m2: lm(formula = gold ~ dual_currency_basket + platinum, data = centrobank)
## m3: lm(formula = gold ~ dual_currency_basket + platinum + palladium, 
##     data = centrobank)
## m4: lm(formula = gold ~ dual_currency_basket + platinum + palladium + 
##     silver, data = centrobank)
## 
## ===========================================================================
##                             m1          m2           m3           m4       
## ---------------------------------------------------------------------------
##   (Intercept)           422.931***  -134.441***  -588.201***  -407.192***  
##                         (17.455)     (36.930)     (27.267)     (13.677)    
##   dual_currency_basket   29.692***    18.975***    25.786***    21.744***  
##                          (0.344)      (0.713)      (0.506)      (0.257)    
##   platinum                             0.611***     1.247***     0.215***  
##                                       (0.037)      (0.030)      (0.023)    
##   palladium                                        -1.031***    -0.155***  
##                                                    (0.027)      (0.020)    
##   silver                                                        34.608***  
##                                                                 (0.579)    
## ---------------------------------------------------------------------------
##   R-squared                    0.9         0.9          1.0          1.0   
##   adj. R-squared               0.9         0.9          1.0          1.0   
##   sigma                      173.5       155.4        103.1         50.4   
##   F                         7448.1      4777.6       7718.9      25087.7   
##   p                            0.0         0.0          0.0          0.0   
##   Log-likelihood           -7415.4     -7291.1      -6827.3      -6020.2   
##   Deviance              33877079.8  27172697.1   11941503.5    2854836.1   
##   AIC                      14836.9     14590.1      13664.7      12052.5   
##   BIC                      14851.9     14610.2      13689.8      12082.7   
##   N                         1128        1128         1128         1128     
## ===========================================================================
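A minimal sketch of fitting nested linear models of this kind with lm(). The data here are synthetic, generated with arbitrary coefficients only to make the example runnable, so the fitted numbers have nothing to do with the table above; mtable() (from the “memisc” package) is replaced by base R summaries.

```r
set.seed(1)
n <- 50

# Synthetic data; ranges loosely follow the summary() output, the rest is arbitrary
demo <- data.frame(
  dual_currency_basket = runif(n, 33, 87),
  platinum             = runif(n, 1392, 2747)
)
demo$gold <- 400 + 20 * demo$dual_currency_basket +
  0.2 * demo$platinum + rnorm(n, sd = 30)

# Nested models: each one adds a predictor to the previous one
m1 <- lm(gold ~ dual_currency_basket, data = demo)
m2 <- update(m1, . ~ . + platinum)

# Compare the fit of the two models
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
print(round(r2, 3))
print(AIC(m1, m2))
```

Because the models are nested, R-squared cannot decrease when a predictor is added, so criteria such as adjusted R-squared, AIC or BIC are more informative for comparing them.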

All derived statistical coefficients unambiguously confirm the existence of relationships between these indicators and the fairly good quality of the constructed model.

At the same time this model is very basic. To improve prediction, some important economic indicators need to be added: prices for basic resources, prices on other important world markets and a variable that reflects the impact of uncertainty in the economy (political reasons, force majeure, etc.). Introducing the last variable requires a more advanced mathematical apparatus, especially for defining it.

6. Final Plots and Summary

In this section I placed three graphs corresponding to the most interesting facts from my point of view.

6.1.1 Plot One

The first surprising finding for me is the decrease in the ratio of the yearly mean prices of precious metals to the yearly mean cost of the dual currency basket.

6.1.2 Description One

We can see the tendency: in general the ratio of the prices for precious metals to the value of the dual currency basket falls from 2012 to 2015. Only the beginning of 2016 shows a small increase in this ratio, but not for all metals.

In practice this means that precious metals are important reserves and are constantly increasing in price, but they cannot be considered an effective investment, especially in the short term.

6.2.1 Plot Two

The next interesting fact is the tight interconnection of the main economic indicators within a unified financial system.

Let’s look at the changes in exchange rates and the prices of precious metals in one place.

6.2.2 Description Two

And once again we can see a period of small oscillations in the beginning of the chart and a period of sharp spikes in the end.

This finding may lead to a very important conclusion: for example, to predict the economic situation it may be enough to fix the values of 2-3 key indicators. But this dataset does not allow us to determine which variables these should be.

6.3.1 Plot Three

The third surprising fact is how different the tendencies of the economic indicators are in the short and long term. Let us estimate the difference in the behavior of the prices of precious metals.

6.3.2 Description Three

For this step I introduced the arithmetic mean of the prices of precious metals. It helps to simplify the graphical visualization.
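Such a mean could be computed row by row, as in the sketch below. The variable name metals_mean is hypothetical, and it is an assumption that all four metals enter per gram with equal weight; the text does not say whether silver was rescaled first.

```r
# First three rows of the metal prices (values from the head() output)
metals_demo <- data.frame(
  gold      = c(1667.25, 1671.87, 1683.17),
  silver    = c(29.56, 30.25, 30.36),
  platinum  = c(1495.09, 1512.93, 1527.84),
  palladium = c(648.66, 653.06, 652.90)
)

# Arithmetic mean of the four prices for every date
metals_mean <- rowMeans(metals_demo)
print(round(metals_mean, 4))
```

For the first row this gives (1667.25 + 29.56 + 1495.09 + 648.66) / 4 = 960.14 rubles per gram.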

The third graph shows a great difference in the behavior of the economic indicators: the dependence is very close to linear in the long term, but in the short term linear relationships are very rare (actually only in 2014).

It shows us clearly the influence of some unknown variables which are not represented in this dataset. In practice it also means many difficulties in predictions for short periods.

7 Reflection

The data set “centrobank” contains information on about 1,100 days of measuring the economic indicators. The data description can be found in the file with this database on the second page.

In general this is a very difficult area for prediction because of the many external factors. In my opinion, a more reliable study would need data covering at least 10 years and some additional indicators (prices of basic resources and prices of the same assets on other important world markets).

The directions for the next interesting discoveries are easy to see:

1) to collect more data for better analysis,

2) to select more advanced mathematical tools that take into account the impact of non-economic factors,

3) to identify the basis (2-3 indicators) for prediction in a tightly connected economic system.

While working on this project, I set myself the goal of learning rather than research. I was studying and applying the R programming language for the first time, so I chose the simplest model to investigate, one where the economic relations between the variables are obvious.

To this end I used a variety of plotting techniques. R has a very wide range of tools, and I hope I was able to learn at least some of them.

In addition, I have been learning to use this language in RStudio and the Jupyter Notebook. In some cases I was unable to apply the same approach in both. For example, I would like to know how to use the plotly package in the Jupyter Notebook in Anaconda environments, but for now I use it more effectively in RStudio. Maybe I will improve on this a little later.

I also want to note that the color choices really helped me in the research. They allowed me to group or highlight not only variables within individual charts but also whole charts belonging to certain groups. As an author I want to retain the right to use the colors that help to reach these goals.

A broad field of research lies ahead, with an ever wider range of methods and indicators to add.

P4: Explore and Summarize Data. Data of the Bank of Russia

Olga Belitskaya

August 21, 2016

1. Tools

Let’s start the description of the project by defining the working tools: the directory for storing generated files, the list of downloaded packages and the collection of useful links.

1.1 Setup Working Directory

‘/Users/olgabelitskaya/projects/nd002/Data_Analyst_ND_Project4’

1.2 Useful links

1.3 Libraries

List of the libraries:

2. Data

2.1 Introduction

I have chosen an open database of quotations of currencies and precious metals located on the site of the Bank of Russia.

I think it is easy to visualize the analysis of variables and dependencies in this case. From publicly available data the file in xlsx format has been generated and downloaded.

2.2 Load the Data and Review

Using suitable functions (read.xlsx, data.frame) I read the file as a data frame and called it “centrobank”. The function head() shows us several rows as an example.

First, there are no “NA” values in this data frame. I checked this by collecting the rows that contain “NA” into a separate set, row.with.na.

The code displays the number of such rows. As we can see, it is zero.

Second, let us have a look at the data frame components (1128 observations of 20 variables; their types, values, etc.). Most of them are numeric.
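A minimal sketch of such a check, assuming row.with.na is built with complete.cases(). The original code is not reproduced in this text, so this is only one plausible way to do it. The small data frame reuses three rows printed by head().

```r
# A few columns from the three rows shown by head() earlier in the report
centrobank_head <- data.frame(
  date    = as.Date(c("2012-01-11", "2012-01-12", "2012-01-13")),
  EUR_978 = c(40.7591, 40.4061, 40.2852),
  USD_840 = c(31.8729, 31.6886, 31.6807)
)

# Keep only the rows that contain at least one NA
# (assumption: the original row.with.na was built in a similar way)
row.with.na <- centrobank_head[!complete.cases(centrobank_head), ]

# Number of rows with missing values
nrow(row.with.na)
```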

The function str() is very helpful and informative in this case.

Third, we should note the units: exchange rates are measured in rubles, the prices of precious metals in rubles per gram, and foreign exchange reserves and monetary gold reserves in millions of US dollars, reported for every month.

The special variable “dual currency basket” is calculated according to the formula: 0.55 USD + 0.45 EUR.
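The formula can be checked against the three rows printed by head() at the start of the report. The names basket_check and max_diff are introduced only for this sketch.

```r
# Values from the three rows shown by head() earlier in the report
rates <- data.frame(
  dual_currency_basket = c(35.8717, 35.6115, 35.5527),
  EUR_978              = c(40.7591, 40.4061, 40.2852),
  USD_840              = c(31.8729, 31.6886, 31.6807)
)

# Dual currency basket: 0.55 USD + 0.45 EUR
basket_check <- 0.55 * rates$USD_840 + 0.45 * rates$EUR_978

# Largest deviation from the published value (only rounding error is expected)
max_diff <- max(abs(basket_check - rates$dual_currency_basket))
print(round(basket_check, 4))
print(max_diff)
```

For these rows the computed values agree with the published ones to four decimal places.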

Foreign exchange reserves and monetary gold reserves are official monthly figures for the state reserves of Russia.

It is natural to use the function summary() to review all variables. It gives us a complete overview of the dataset.

Finally, as we can see, some variables (foreign_exchange_reserves, monetary_gold) have the wrong type, so the next step is to clean the data.

2.3 Cleaning and Managing Data

With some conversion functions I gave foreign_exchange_reserves and monetary_gold the right (numeric) type.

After the transformation process we can use the function summary() for these columns too.
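A minimal sketch of one way to do this conversion, assuming the values were read as text with a space as the thousands separator (as in “453 952” in the head() output). The helper to_number is hypothetical and not taken from the original code.

```r
# Raw values as they appear in the head() output (text with a space inside)
raw_reserves <- c("453 952", "453 952", "453 952")

# Hypothetical helper: strip everything except digits and the decimal point,
# then convert to numeric. as.character() also covers factor columns.
to_number <- function(x) {
  as.numeric(gsub("[^0-9.]", "", as.character(x)))
}

foreign_exchange_reserves <- to_number(raw_reserves)
summary(foreign_exchange_reserves)
```

Removing every non-digit character rather than only plain spaces also handles non-breaking spaces, which spreadsheets sometimes use as separators.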

The temporary variables were then removed: I created a variable “drop” listing the temporary columns and excluded them from the dataset.
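Dropping columns by name could look like the sketch below. The temporary column name reserves_raw is hypothetical, since the actual temporary columns are not listed in this text.

```r
# Toy data frame; reserves_raw stands in for a temporary column (hypothetical)
toy <- data.frame(
  gold                      = c(1667.25, 1671.87),
  reserves_raw              = c("453 952", "453 952"),
  foreign_exchange_reserves = c(453952, 453952)
)

# Names of the temporary columns to exclude
drop <- c("reserves_raw")

# Keep every column whose name is not listed in "drop"
toy <- toy[!(names(toy) %in% drop)]
names(toy)
```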

For the next steps it is useful to have separate variables for the date components (days, months and years), so I added them. For “days” and “years” I chose the numeric type, for “months” the character type (chr).

Let’s check the characteristics of the variables again:

Now the dataset has 23 variables, all of them ready for analysis.
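The date components can be extracted in base R as sketched below. The report also lists lubridate among its libraries, so the original may have used that instead. The column names days, months and years are assumptions.

```r
# Dates from the three rows shown by head() earlier in the report
parts <- data.frame(
  date = as.Date(c("2012-01-11", "2012-01-12", "2012-01-13"))
)

# Numeric day and year; month as a character string.
# month.name is used so the result does not depend on the system locale.
parts$days   <- as.numeric(format(parts$date, "%d"))
parts$months <- month.name[as.integer(format(parts$date, "%m"))]
parts$years  <- as.numeric(format(parts$date, "%Y"))

str(parts)
```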

3. Univariate Plots Section

3.1 Univariate Plots

3.1.1 Histograms of the distribution for all the analyzed variables.

The package “ggplot2” helps us to create them with a wide set of functions.

I put the histogram of the dual currency basket first because I think it is the most important indicator of the situation in the currency market.

Next come the histograms of the prices of precious metals in rubles. They were built and grouped using the packages “ggplot2” and “gridExtra”.
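A minimal sketch of building two histograms and placing them side by side, assuming ggplot2 and gridExtra are installed. The bin count and fill colors are arbitrary choices for this example, and the three data rows come from the head() output, so the picture itself is not meaningful.

```r
library(ggplot2)
library(gridExtra)

# Prices from the three rows shown by head() earlier in the report
toy <- data.frame(
  gold   = c(1667.25, 1671.87, 1683.17),
  silver = c(29.56, 30.25, 30.36)
)

# One histogram per metal (bins and colors are arbitrary here)
p_gold <- ggplot(toy, aes(x = gold)) +
  geom_histogram(bins = 10, fill = "gold3") +
  ggtitle("Gold, rubles per gram")

p_silver <- ggplot(toy, aes(x = silver)) +
  geom_histogram(bins = 10, fill = "grey60") +
  ggtitle("Silver, rubles per gram")

# Group the plots in one row
grid.arrange(p_gold, p_silver, ncol = 2)
```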

We will also examine the histograms for the distribution of prices for the main currencies in the Russian market (euro and dollars), as well as the histograms of foreign exchange reserves and monetary gold.

To understand and predict the situation in the currency market it is also necessary to examine the histograms of other currencies (Brazilian real, Japanese yen, Chinese yuan and Indian rupee).

3.1.2 Box plots for every significant variable.

I used the functions of “ggplot2” to create them as well.

I didn’t include the variable “dual_currency_basket” because it depends on EUR and USD, and so do its outliers.

All box plots show the key statistics: minimum, maximum, median and quartiles. Means are marked with points added by the function stat_summary().

The x-axis carries no meaning here; it is just a placeholder needed to draw the box plots.
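A box plot with the mean marked as a point could be sketched as follows, assuming ggplot2 version 3.3.0 or later (older versions use the argument fun.y instead of fun). The point shape and size are arbitrary choices.

```r
library(ggplot2)

# Gold prices from the three rows shown by head() earlier in the report
toy <- data.frame(gold = c(1667.25, 1671.87, 1683.17))

# A constant x value is a placeholder: only the y distribution matters
p_box <- ggplot(toy, aes(x = "", y = gold)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 4, size = 3) +
  xlab("") +
  ylab("Gold, rubles per gram")

print(p_box)

# The same statistics as numbers
summary(toy$gold)
```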

3.2. Univariate Analysis

My data is a set of the main financial indicators. Even at first glance we can see similar tendencies in many different variables, so we can expect many dependencies between them.

These indicators also reflect the state of the country’s economy in general, which makes them an interesting subject for investigation.

Now let us discuss the resulting visualizations.

The histograms for “foreign_exchange_reserves” and “monetary_gold” have a complicated shape with many gaps. I highlighted this with a special palette.

I didn’t describe every histogram separately because many variables have very similar behaviour and distribution. We will see this much more clearly in the next steps.

4 Bivariate Plots Section

4.1 Bivariate Plots

4.1.1 Plots for all pairs of variables.

Let’s start the section “Bivariate Plots” with graphs of the correlation analysis. It helps us understand the strong relationships between variables in this dataset.

I chose the R packages “PerformanceAnalytics” and “corrplot” and created a new dataset containing only 13 important variables.

The function chart.Correlation() lets us display correlations between all these variables.

I also added one more correlation chart with a color scale indicating the strength of the relationships between variables.

The charts show strong correlations between a number of variables; we will examine only some of them in detail.
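Both charts are drawn from pairwise correlations, which base R computes with cor(). The sketch below uses only three variables and the three rows from the head() output, so the coefficients are not meaningful. The commented calls show how the same inputs could be passed to the two packages if they are installed.

```r
# Three variables from the rows shown by head() earlier in the report
vars <- data.frame(
  dual_currency_basket = c(35.8717, 35.6115, 35.5527),
  gold                 = c(1667.25, 1671.87, 1683.17),
  silver               = c(29.56, 30.25, 30.36)
)

# Matrix of pairwise Pearson correlation coefficients
cor_matrix <- cor(vars)
print(round(cor_matrix, 2))

# With the packages installed, the charts could be drawn like this:
# PerformanceAnalytics::chart.Correlation(vars, histogram = TRUE)
# corrplot::corrplot(cor_matrix, method = "color")
```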

4.1.2 Plots for exchange rates (BRL 986, JPY 392, CNY 156, INR 356).

Let’s have a look at the behavior of the rates for the Brazilian real, Japanese yen, Chinese yuan and Indian rupee. We will explore how they change over time.

The packages “ggplot2” and “gridExtra” are very convenient for combining many plots together.

To create a common legend for several graphs, I applied the function get_legend() taken from an external website. The image was saved as a file as well.

The similarity of the tendencies in the rates of the Asian currencies is emphasized by a common color scheme for them.

4.1.3 Plots for foreign exchange and monetary gold reserves.

For the other two variables, “foreign_exchange_reserves” and “monetary_gold”, we can also follow the changes step by step.

The breadth of the fluctuations in values is emphasized by a color gradient set with the function scale_colour_gradientn().

4.1.4 Plots for prices of precious metals and the dual currency basket.

The graph for the variables “silver” and “dual currency basket” has the most complex structure. The rest of the graphs in this section are quite close to a linear tendency (especially for the pair “gold” and “dual currency basket”).

4.1.5 Plots for Asian exchange rates and the dual currency basket.

The function geom_smooth() helps to evaluate the dependence of one variable on the other.

It shows again how similar the trends of the Asian currencies are.

4.1.6 Plots for foreign exchange reserves and monetary gold reserves.

There is a complex nonlinear dependency in the last graph, so let’s discuss it in the next section.

4.2 Bivariate Analysis

Let’s start this section with some notes about visual observation.

The significant increase in the prices of precious metals relative to the national currency suggests that precious metals are reasonable reserves in the long term.

The plots for foreign exchange and monetary gold reserves show diametrically opposed trends in the second half of the time interval: foreign exchange reserves decline, while stocks of monetary gold rise steadily.

Then we can continue the analysis with correlation tests for these plotted pairs of variables.