Repository for the Getting and Cleaning Data assignment
Instructions on how to run the scripts
- Download the Samsung dataset zip file to your working directory.
- Unzip the Samsung dataset in your working directory.
- Download the run_analysis.R file to your working directory.
- In the R console, run: source("run_analysis.R")
- The output file will be created as tidy_data.txt.
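The unzip and run steps above can also be done from within R. The following is a minimal sketch, not part of run_analysis.R: the helper name ensure_dataset and the zip file name "dataset.zip" are assumptions made for illustration, so substitute the name of the file you actually downloaded.

```r
# Hypothetical helper (not part of run_analysis.R): extract the dataset zip
# only if the "UCI HAR Dataset" directory is not already present under `exdir`.
ensure_dataset <- function(zip_path, exdir = ".") {
  dataset_dir <- file.path(exdir, "UCI HAR Dataset")
  if (!dir.exists(dataset_dir)) {
    if (!file.exists(zip_path)) {
      stop("zip file not found: ", zip_path)
    }
    unzip(zip_path, exdir = exdir)
  }
  # TRUE when the dataset directory exists after the call
  dir.exists(dataset_dir)
}

# Example usage ("dataset.zip" is an assumed file name):
# ensure_dataset("dataset.zip")
# source("run_analysis.R")
```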
Important Note before running the script
This script is meant to clean and tidy this particular dataset.
INPUT DATA SET: The contents of the zip file should be extracted to the working directory.
OUTPUT FILE: tidy_data.txt
The format of the output file is described in Codebook.md.
The script will quit midway with no results if any of the following files is missing, relative to the working directory.
- ./UCI HAR Dataset/features.txt
- ./UCI HAR Dataset/activity_labels.txt
- ./UCI HAR Dataset/test/subject_test.txt
- ./UCI HAR Dataset/test/X_test.txt
- ./UCI HAR Dataset/test/y_test.txt
- ./UCI HAR Dataset/train/subject_train.txt
- ./UCI HAR Dataset/train/X_train.txt
- ./UCI HAR Dataset/train/y_train.txt
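Because the script stops with no results when any of these files is missing, it can help to check for them before running it. The sketch below is one possible pre-flight check; the helper name missing_inputs is hypothetical and is not defined in run_analysis.R.

```r
# Paths copied from the list above, relative to the working directory.
required_files <- c(
  "UCI HAR Dataset/features.txt",
  "UCI HAR Dataset/activity_labels.txt",
  "UCI HAR Dataset/test/subject_test.txt",
  "UCI HAR Dataset/test/X_test.txt",
  "UCI HAR Dataset/test/y_test.txt",
  "UCI HAR Dataset/train/subject_train.txt",
  "UCI HAR Dataset/train/X_train.txt",
  "UCI HAR Dataset/train/y_train.txt"
)

# Hypothetical helper: return the required files that do not exist under `base`.
missing_inputs <- function(base = ".") {
  paths <- file.path(base, required_files)
  required_files[!file.exists(paths)]
}

# Report anything that is missing from the current working directory.
missing <- missing_inputs()
if (length(missing) > 0) {
  message("Missing input files:\n", paste(" -", missing, collapse = "\n"))
}
```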
This script contains the following four functions.
The main driver function returns a tidy data set but does not write it to a file. It calls all the other read functions to read the corresponding files.
The get_features function returns a tidy list of feature names from the features.txt file. It returns only the column names that contain mean or std as a substring. It also removes the "(", ")" and "-" characters from the column names.
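The selection and cleaning described here can be sketched as follows. This illustrates the technique and is not the actual body of get_features: the helper name select_features, the returned columns index and name, and the case-sensitive match are all assumptions.

```r
# Hypothetical helper: select and clean feature names.
# `feature_names` is a character vector, for example the names read from features.txt.
select_features <- function(feature_names) {
  # Indexes of the names containing "mean" or "std" (case-sensitive, an assumption)
  keep <- grep("mean|std", feature_names)
  # Strip the "(", ")" and "-" characters from the kept names
  clean <- gsub("[()-]", "", feature_names[keep])
  data.frame(index = keep, name = clean, stringsAsFactors = FALSE)
}

example <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-Y", "tBodyAcc-max()-Z")
select_features(example)
# Keeps indexes 1 and 2, named "tBodyAccmeanX" and "tBodyAccstdY"
```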
The get_activity_labels function reads and returns the activity labels from the activity_labels.txt file.
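A reader of this kind might look like the sketch below. It assumes that activity_labels.txt holds one whitespace-separated id and label per line; the function name read_activity_labels and the column names id and activity are hypothetical and may differ from what get_activity_labels actually returns.

```r
# Hypothetical reader; the column names "id" and "activity" are assumptions.
read_activity_labels <- function(path = "UCI HAR Dataset/activity_labels.txt") {
  read.table(path, col.names = c("id", "activity"), stringsAsFactors = FALSE)
}

# Example usage, once the dataset has been extracted:
# labels <- read_activity_labels()
# labels$activity[labels$id == 1]
```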
The get_data function performs the following tasks:
- First, it reads the subject, X and y files in the test and train directories.
- Second, it keeps only the column indexes specified by keepfeatures.
- Third, it names those columns with the proper feature names from keepfeatures.
- Fourth, it labels the activities with proper descriptive names.
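The four steps above can be sketched for a single directory as shown below. This is an illustration under stated assumptions rather than the actual get_data code: the function name read_set is hypothetical, keepfeatures is assumed to be a data frame with columns index and name, and activity_labels is assumed to have columns id and activity.

```r
# Hypothetical sketch of the four steps for one directory ("test" or "train").
read_set <- function(base, set, keepfeatures, activity_labels) {
  dir <- file.path(base, set)

  # 1. Read the subject, X and y files.
  subject <- read.table(file.path(dir, paste0("subject_", set, ".txt")))
  x <- read.table(file.path(dir, paste0("X_", set, ".txt")))
  y <- read.table(file.path(dir, paste0("y_", set, ".txt")))

  # 2. Keep only the requested column indexes.
  x <- x[, keepfeatures$index, drop = FALSE]

  # 3. Apply the cleaned feature names.
  names(x) <- keepfeatures$name

  # 4. Replace the activity ids with descriptive names.
  activity <- activity_labels$activity[match(y[[1]], activity_labels$id)]

  data.frame(subject = subject[[1]], activity = activity, x,
             stringsAsFactors = FALSE, check.names = FALSE)
}
```

Under the same assumptions, the test and train results could then be combined with rbind(read_set(base, "test", kf, al), read_set(base, "train", kf, al)), where base, kf and al are placeholders for the dataset directory, the kept features and the activity labels.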