merge_data

hydrostats.data.merge_data(sim_fpath=None, obs_fpath=None, sim_df=None, obs_df=None, interpolate=None, column_names=('Simulated', 'Observed'), simulated_tz=None, observed_tz=None, interp_type='pchip', return_tz='Etc/UTC', julian=False, julian_freq=None)

Merges two dataframes or csv files, depending on the input.

Parameters:

sim_fpath (str) – The filepath to the simulated csv of data. Can be a url if the page is formatted correctly. The csv must be formatted with the dates in the left column and the data in the right column.
obs_fpath (str) – The filepath to the observed csv. Can be a url if the page is formatted correctly. The csv must be formatted with the dates in the left column and the data in the right column.
sim_df (DataFrame) – A pandas DataFrame containing the simulated data. Must be formatted with a datetime index and the simulated data values in column 0.
obs_df (DataFrame) – A pandas DataFrame containing the observed data. Must be formatted with a datetime index and the observed data values in column 0.
interpolate (str) – Must be either ‘observed’ or ‘simulated’. Specifies which data set you would like to interpolate if interpolation is needed to properly merge the data.
column_names (tuple of str) – Tuple of length two containing the column names to set on the returned DataFrame. Note that the simulated data will be in the left column and the observed data will be in the right column.
simulated_tz (str) – The timezone of the simulated data. A full list of timezones can be found in the List of Timezones.
observed_tz (str) – The timezone of the observed data. A full list of timezones can be found in the List of Timezones.
interp_type (str) – Which interpolation method to use. Uses the default pandas interpolator. Available types are found at http://pandas.pydata.org/pandas-docs/version/0.16.2/generated/pandas.DataFrame.interpolate.html
return_tz (str) – What timezone the merged dataframe’s index should be returned as. Default is ‘Etc/UTC’, which is recommended for simplicity.
julian (bool) – If True, parses the first column of the file from a floating-point Julian time representation into a datetime index. This is only valid when supplying the sim_fpath and obs_fpath parameters. Users supplying two DataFrame objects must convert the index from Julian to Gregorian using the julian_to_gregorian function in this module.
julian_freq (str) – A string representing the frequency of the julian dates so that they can be rounded. See examples for usage.
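
To illustrate why a rounding frequency matters for Julian dates, here is a minimal sketch that uses pandas directly rather than hydrostats. The Julian values and the hourly spacing are made up for illustration, and the rounding step is an assumption about what julian_freq is meant to achieve, not a description of how merge_data implements it.

```python
import numpy as np
import pandas as pd

# Hypothetical Julian dates: four hourly steps starting at 2017-09-04 00:00.
julian_dates = 2458000.5 + np.arange(4) / 24.0

# pandas reads the floats as days counted from the Julian epoch.
raw_index = pd.to_datetime(julian_dates, unit="D", origin="julian")

# Fractions such as 1/24 are not exact in floating point, so the parsed
# timestamps can land slightly off the hour. Rounding to the known
# spacing of the data (hourly here) snaps them back.
index = raw_index.round("h")

# A DataFrame shaped the way sim_df / obs_df are described above:
# datetime index, data values in column 0.
sim_df = pd.DataFrame({"Simulated": [5.0, 6.0, 7.0, 8.0]}, index=index)
```

After rounding, the index falls exactly on the hour, which is what allows two series to be aligned on matching timestamps.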

Notes

The only acceptable time frequencies in the data are 15min, 30min, 45min, and any number of hours or days.
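
As a rough pre-check before merging, the spacing of a series can be inspected with pandas. The helper below, spacing_is_supported, is hypothetical and not part of hydrostats; it encodes one reading of the rule above (15, 30, or 45 minutes, or a whole number of hours, which also covers whole days) and may not match the library's own validation.

```python
import pandas as pd

def spacing_is_supported(index):
    """Return True if a regularly spaced DatetimeIndex uses a listed frequency.

    Hypothetical helper: an interpretation of the note above, not hydrostats code.
    """
    steps = index.to_series().diff().dropna().unique()
    if len(steps) != 1:
        return False  # irregular spacing, or fewer than two timestamps
    seconds = pd.Timedelta(steps[0]).total_seconds()
    sub_hourly = seconds in (15 * 60, 30 * 60, 45 * 60)
    whole_hours = seconds >= 3600 and seconds % 3600 == 0
    return sub_hourly or whole_hours

half_hourly = pd.date_range("2020-01-01", periods=4, freq="30min")
twenty_minute = pd.date_range("2020-01-01", periods=4, freq="20min")
daily = pd.date_range("2020-01-01", periods=4, freq="D")
```

Calling spacing_is_supported on half_hourly or daily returns True, while twenty_minute returns False.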

There are three scenarios to consider when merging your data:

The first scenario is that the timezones and the spacing of the time series match (e.g. 1 day). In this case, you will want to leave the simulated_tz, observed_tz, and interpolate arguments empty, and the function will simply join the two CSVs into a DataFrame.
The second scenario is that you have two time series with matching time zones but not matching spacing. In this case you will want to leave the simulated_tz and observed_tz empty, and use the interpolate argument to tell the function which time series you would like to interpolate to match the other time series.
The third scenario is that you have two time series with different time zones and possibly different spacings. In this case you will want to fill in the simulated_tz, observed_tz, and interpolate arguments. This will then take timezones into account when interpolating the selected time series.
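
The third scenario can be pictured with plain pandas. The sketch below is only a conceptual illustration and not the hydrostats implementation: the data values, the America/Bogota timezone, and the time-weighted linear interpolation (used here instead of the default 'pchip' to avoid a SciPy dependency) are all assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical daily simulated series, timestamps already in UTC.
sim = pd.DataFrame(
    {"Simulated": [10.0, 12.0, 14.0]},
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
)

# Hypothetical 12-hourly observed series recorded in Bogota local time (UTC-5).
obs = pd.DataFrame(
    {"Observed": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=pd.date_range("2019-12-31 19:00", periods=5, freq="12h"),
)

# Step 1: attach the source timezones and bring both series to a common one.
sim_utc = sim.tz_localize("Etc/UTC")
obs_utc = obs.tz_localize("America/Bogota").tz_convert("Etc/UTC")

# Step 2: interpolate the simulated series onto the observed timestamps
# (the analogue of interpolate='simulated').
all_times = sim_utc.index.union(obs_utc.index)
sim_interp = (
    sim_utc.reindex(all_times)
    .interpolate(method="time")
    .reindex(obs_utc.index)
)

# Step 3: join on the shared index, simulated on the left, observed on the right.
merged = sim_interp.join(obs_utc, how="inner").dropna()
```

Here the observed timestamps shift by five hours when converted to UTC, and the simulated values at the half-day points are filled in between the daily values before the two columns are joined.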

Examples

>>> import hydrostats.data as hd
>>> import pandas as pd
>>> pd.options.display.max_rows = 15

The data URLs contain streamflow data from two different models and are provided on the Hydrostats GitHub page.

>>> sfpt_url = r'https://github.com/waderoberts123/Hydrostats/raw/master/Sample_data/sfpt_data/magdalena-calamar_interim_data.csv'
>>> glofas_url = r'https://github.com/waderoberts123/Hydrostats/raw/master/Sample_data/GLOFAS_Data/magdalena-calamar_ECMWF_data.csv'
>>> merged_df = hd.merge_data(sfpt_url, glofas_url, column_names=('Streamflow Prediction Tool', 'GLOFAS'))
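
The sim_df and obs_df parameters allow the same merge to start from DataFrames instead of files. The following is a minimal sketch with made-up values; the wrapper merge_frames is a hypothetical helper, written so that hydrostats is only imported when the function is actually called.

```python
import pandas as pd

# Hypothetical daily values. Each frame has a datetime index and
# its data values in column 0, as the parameter descriptions require.
dates = pd.date_range("2020-01-01", periods=3, freq="D")
sim_df = pd.DataFrame({"flow": [10.0, 12.0, 14.0]}, index=dates)
obs_df = pd.DataFrame({"flow": [9.5, 12.5, 13.0]}, index=dates)

def merge_frames(sim_df, obs_df):
    """Hypothetical wrapper: merge two prepared DataFrames with merge_data."""
    import hydrostats.data as hd  # imported lazily, only when called

    return hd.merge_data(
        sim_df=sim_df,
        obs_df=obs_df,
        column_names=("Simulated", "Observed"),
    )
```

With hydrostats installed, merged_df = merge_frames(sim_df, obs_df) would correspond to the first scenario in the Notes, since both frames share the same spacing and no timezone or interpolate arguments are given.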