Research Guides: Scholarly Communication Support: Planning, Conducting, Disseminating, Promoting, & Assessing Research: Collecting & Managing Your Research Data

Librarian

she/her/hers

Contact:

936-294-4567
eowens@shsu.edu
NGL 223D
ORCID: 0000-0001-9520-9314

I'm Creative Commons Certified! I can advise on CC licenses to reuse, remix, and publish open resources.

Subjects: OER: Open (or Free) Educational Materials, Scholarly Communication

Collecting & Managing Your Research Data

The Data Management Association (DAMA) defines data management as "the development of architectures, policies, practices, and procedures to manage the data lifecycle."

In other words: Data management is "the process of collecting, keeping, and using data in a cost-effective, secure, and efficient manner" (Simplilearn).

High-quality data management in your research will improve: efficiency, data quality, security, accessibility, compliance and governance, disaster recovery, decision-making, and other aspects.

Research Data Management (RDM) processes include:

Creating a data management plan before beginning data collection -> For more on data management plans, see section 1 of this guide, Planning Research -> Planning Your Data (DMP/DMSP).
Organizing your data logically, including directory structure and folder naming, file naming, file versioning, and file formats
Understanding copyright and licensing aspects of your data
Collecting documentation, throughout your research, about your files and contents; this should include metadata/README files and might also include lab notebooks
Ethically protecting sensitive data in terms of how it is collected, where it is stored, anonymization, and whether/how it is shared
Storing and backing up your data according to best practices, including considerations of physical security (e.g., locked lab, locked file cabinet) and digital security (e.g., anti-virus program, password protection, encryption)
Preserving and/or sharing data after research is complete

Research Data Management Guide
The Washington State University Libraries created this thorough guide on RDM in pre-research, research, and post-research stages.
Bare Necessities of Data Management
A proposed list of core practices that should be implemented early on, before data collection begins, in order for your project to be as efficient and successful as possible.
The Turing Way: Research Data Management (chapter)

More Resources

DataONE Data Management Best Practices Primer
For students and others new to data management, the Best Practices Primer provides an introduction to the DataONE Best Practices database and data management in general.
DataONE Data Management Best Practices Database
The DataONE Best Practices database provides individuals with recommendations on how to effectively work with their data through all stages of the data lifecycle. Users can access best practices within the database by either clicking on a stage of the lifecycle or selecting keywords under search.
UK Data Service: Research Data Management Learning Hub
Well-organized guide to the data lifecycle, storing, formatting, anonymizing, and documenting data, as well as rights, collaboration, data protection, ethical issues.
ICPSR Data Management and Curation Guide
Inter-University Consortium for Social and Political Research
File Organization Best Practices
From Simmons University LIbrary
Organising: Folder/File Structure and File Names
UK Data Service
Recommended File Format by Data Type
UK Data Service
Primer for Protecting Sensitive Data in Academic Research
"This primer helps individual researchers, managers of research data services, and organizational leaders understand how and why to integrate data protection activities into their practices through the collaborative endeavor of research data management."
Final NIH Policy for Data Management and Sharing

Simply creating data files is not enough for robust data management and preservation: You should also create documentation which explains your data clearly. Such documentation may take several forms, such as:

README metadata file
Data dictionary
Codebook

Resources are provided below for creating README files. Additional information about data dictionaries and codebooks will be added soon!

Guide to Writing "readme" Style Metadata
A readme file provides information about a data file and is intended to help ensure that the data can be correctly interpreted, by yourself at a later date or by others when sharing or publishing data. Template and best practices from the Research Data Management Service Group at Cornell University.

Although the video below is for the Texas Data Repository, its guidance on creating README metadata files is still broadly applicable.

Research projects often involve some form of data entry. Entering this information into a database is often preferable for many reasons.

Tips for Data Entry in Excel
If Excel is your easiest or only option for data entry, follow these tips for optimal efficiency and accuracy.

Tools

The following are a few tools that may be useful in organizing and managing your data.

Microsoft Access from SHSU
SHSU provides access to Microsoft Office 365, including downloads for use on home computers. Microsoft Access can be an accessible way to harness the data entry power of a relational database.
Qualtrics from SHSU
Qualtrics is a powerful web-based tool for creating surveys and reviewing results.
DataLad
"DataLad is a free and open source distributed data management system that keeps track of your data, creates structure, ensures reproducibility, supports collaboration, and integrates with widely used data infrastructure."
AirTable
At its heart, AirTable is a relational database. It is flexible enough to build a diverse array of data tracking, project tracking, and other tools, complete with app interfaces when desired. Free and paid service levels are available.
GitHub
GitHub supports collaboration and version control that will improve your data management compared to simple spreadsheets.

Are Numbers Really Neutral?
Numbers generated from statistics are powerful and can shape policy, but no, they’re not neutral. ...The numbers you use depend on the story you’re trying to tell.
What Does a Statistical Method Assume?
All statistical procedures have assumptions. Even simple descriptive statistics have assumptions as described below. But what does it mean that an assumption is required for using a statistical procedure?

Invisible Women: Data bias in a world designed for men by Caroline Criado Perez
Call Number: HQ1237 .C75 2019 (4th floor)

Publication Date: 2019

Data is fundamental to the modern world. From economic development, to healthcare, to education and public policy, we rely on numbers to allocate resources and make crucial decisions. But because so much data fails to take into account gender, because it treats men as the default and women as atypical, bias and discrimination are baked into our systems.
Data Feminism by Catherine D'Ignazio; Lauren F. Klein
Call Number: Read Online

Publication Date: 2020

Today, data science is a form of power. It has been used to expose injustice, improve health outcomes, and topple governments. But it has also been used to discriminate, police, and surveil. This potential for good, on the one hand, and harm, on the other, makes it essential to ask: Data science by whom? Data science for whom? Data science with whose interests in mind?
A Field Guide to Lies: Critical thinking in the information age by Daniel J. Levitin
Call Number: BC177 .L486 2016 (4th floor)

Publication Date: 2016

Daniel J. Levitin shows how to recognize misleading announcements, statistics, graphs, and written reports, revealing the ways lying weasels can use them. It's becoming harder to separate the wheat from the digital chaff. How do we distinguish misinformation, pseudo-facts, and distortions from reliable information?
Bad Data: Why we measure the wrong things and often miss the metrics that matter by Peter Schryvers
Call Number: QA76.9 .Q36 S37 2020 (3rd floor)

Publication Date: 2020

Highlights the pitfalls of data analysis and emphasizes the importance of using the appropriate metrics before making key decisions.
Bad Data Handbook by Q. Ethan McCallum
Publication Date: 2012

What is bad data? Some people consider it a technical phenomenon, like missing values or malformed records, but bad data includes a lot more. Bottom line? Bad data is data that gets in the way. This book explains effective ways to get around it. Among the many topics covered, you'll discover how to: Test drive your data to see if it's ready for analysis; Work spreadsheet data into a usable form; Handle encoding problems that lurk in text data; Develop a successful web-scraping effort; Use NLP tools to reveal the real sentiment of online reviews; Address cloud computing issues that can impact your analysis effort; Avoid policies that create data analysis roadblocks; Take a systematic approach to data quality analysis

Principles for Advancing Equitable Data Practice [Brief] by Marcus Gaddy, Kassie Scott
Publication Date: June 8, 2020

People in research, government, nonprofits, and other sectors create and use data in ways that may have serious consequences for others’ lives. This brief introduces the Belmont Report’s principles and provides selected principle-aligned practices and resources to help data experts at all levels integrate the principles into their work and move toward more equitable data practice.

The Do No Harm Project
Multiple reports and resources from the Urban Institute. "The Do No Harm Guide body of work consists of several guides for how researchers and analysts can approach their work through a lens of diversity, equity, and inclusion. These guides aim to encourage thoughtfulness in how analysts work with and present their data rather than prescribing what to do or not do."

Mindmap with the central bubble labeled 6 Ideas to Get Started with Data Equity; the six branching nodes are Build Data Values, Data Collection Assessment, Parternships for Data Equity, Team Trainings on Data Equity, Build Diverse Data Teams, and Community-Centric Data Collection; each of those nodes then has two to three branching nodes with details on that strategy

Plain-text of infographic above:

6 Strategies to Get Started with Data Equity

Build Data Values
- Org-wide understanding of data practices
- Include this step as part of strategic plan, so it lives for an extended time
Data Collection Assessment
- Bring all data collection sources
- Assess if the intention of data collection aligns with strategic and DEI plans
Partnerships for Data Equity
- Collaborate with other organizations
- Share best practices and resources
Team Trainings on Data Equity
- Build a data culture
- Include internal and external experts
- Build this as a continuous learning mechanism
Build Diverse Data Teams
- Include all forms of diversity
- Build cultural sensitivity in data teams around collection, analysis, and interpretation
Community-Centric Data Collection
- Include the community
- Accessible data collection tools
- Humanized design and language

Source: @namaste data

NamasteData From website: "Our vision is to empower nonprofits to do good with data by centering equity and inclusion in all aspects of their data practices. We believe that data can be a powerful tool for social change, but only if it is collected, analyzed, and applied in ways that promote fairness, justice, and inclusion."

Image source: Institute of Mathematical Statistics, 2019

In order to support open science and increase access to and reuse of data, proposed best practices emphasize that research data should be FAIR: Findable, Accessible, Interoperable, Reusable

The following resources will help you better understand what FAIR means and how to achieve it.

The FAIR Guiding Principles for scientific data management and stewardship (Scientific Data, 2016)
The FAIR guiding principles in detail, as formally published in 2016 in Scientific data, 3.
Open science is all very well but how do you make it FAIR in practice?
by Rachel Bruce and Bas Cordewener, 12 July 2018, on the Jisc blog.
Discussion of Jisc's recent "FAIR in Practice" report. "Open science is about increasing the re-use of research, and making sure that publicly funded research is accessible to all. It sounds straightforward, but there are some issues that we need to iron out first, and this is where FAIR comes in."

Graphic reading reads be fair and care; combines the elements of the FAIR and CARE acronyms for data ethics

The Global Indigenous Data Alliance (GIDA) observes that "the current movement toward open data and open science does not fully engage with Indigenous Peoples rights and interests," and they assert "the right to create value from Indigenous data in ways that are grounded in Indigenous worldviews and realise opportunities within the knowledge economy."

The CARE principles ask "researchers and those who manage or control research infrastructures to examine the data lifecycle from a people and purpose orientation. ...These principles complement the existing FAIR principles encouraging open and other data movements to consider both people and purpose in their advocacy and pursuits."

The full and summary CARE principles documents on the GIDA website provide significantly more detail, but in summary:

Collective Benefit: Data ecosystems shall be designed and function in ways that enable Indigenous Peoples to derive benefit from the data.
Authority to Control: Indigenous Peoples’ rights and interests in Indigenous data must be recognised and their authority to control such data be empowered. Indigenous data governance enables Indigenous Peoples and governing bodies to determine how Indigenous Peoples, as well as Indigenous lands, territories, resources, knowledges and geographical indicators, are represented and identified within data.
Responsibility: Those working with Indigenous data have a responsibility to share how those data are used to support Indigenous Peoples’ self-determination and collective benefit. Accountability requires meaningful and openly available evidence of these efforts and the benefits accruing to Indigenous Peoples.
Ethics: Indigenous Peoples’ rights and wellbeing should be the primary concern at all stages of the data life cycle and across the data ecosystem.

Data wrangling is the process that transforms raw data into usable data -- cleaning, merging, adapting, and otherwise preparing for analysis!

Some Data Wrangling Tools:

Tabula
Tabula allows you to extract data tables from PDF files into a CSV or Microsoft Excel spreadsheet using a simple, easy-to-use interface. Tabula works on Mac, Windows and Linux.
WebPlotDigitizer
Estimate and extract data points from a raster image of a graph. Browser-based.
Mr. Data Converter
This tool will convert Excel data into web-friendly formats, such as HTML, JSON and XML.
R (statistical computing environment)
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.
RStudio Desktop
(Choose the open source license for a free download.) Bundle of RStudio’s popular professional software to help you be more productive with R and Python for cleaning, transforming, and analyzing data.
Posit Cloud
Posit Cloud (formerly RStudio Cloud) lets you access Posit’s powerful set of data science tools right in your browser – no installation or complex configuration required.

R for the Rest of Us by David Keyes
Publication Date: 2024

"If you just want to automate repetitive tasks or visualize your data, without the need for complex math, R for the Rest of Us is for you. Inside you’ll find a crash course in R, a quick tour of the RStudio programming environment, and a collection of real-word applications that you can put to use right away."

Cleaning your data involves taking steps to ensure that the compiled data points are complete, consistent, and correct. Data should conform to all rules in your data dictionary. In many cases, "clean" will also indicate that the data has been de-identified.

In short, "clean" means you have critically examined all the data as it was entered by human or machine, and you have verified that it is ready to be analyzed and to produce valid results.

The link below, Part 1 of a 3-part tutorial, provides a more detailed discussion what "clean" data entails and what aspects to think about.

Tutorial Part 1: Data cleaning for data sharing

In order to clean our data effectively and efficiently, we should establish a basic workflow that we can follow, rather than approaching the problem haphazardly. Using reproducible methods as much as possible--for example, using code, and creating robust documentation and change logs.

The link below, Part 2 of a 3-part tutorial, suggests workflow steps and documentation to consider.

Tutorial Part 2: Creating a data cleaning workflow

Finally, the link below -- Part 3 of the 3-part tutorial -- walks through a real-world example to illustrate how to follow a data cleaning workflow.

Tutorial Part 3: Cleaning sample data in standardized way

Bad Data Handbook by Q. Ethan McCallum
Publication Date: 2012

What is bad data? Some people consider it a technical phenomenon, like missing values or malformed records, but bad data includes a lot more. Bottom line? Bad data is data that gets in the way. This book explains effective ways to get around it. Among the many topics covered, you'll discover how to: Test drive your data to see if it's ready for analysis; Work spreadsheet data into a usable form; Handle encoding problems that lurk in text data; Develop a successful web-scraping effort; Use NLP tools to reveal the real sentiment of online reviews; Address cloud computing issues that can impact your analysis effort; Avoid policies that create data analysis roadblocks; Take a systematic approach to data quality analysis

OpenRefine
OpenRefine (previously Google Refine) is a powerful tool for cleaning and manipulating messy data. Data remains on your computer, not in the cloud, so privacy is maintained.
Data Wrangler
Wrangler allows interactive transformation of messy, real-world data into the data tables analysis tools expect. Export data for use in Excel, R, Tableau, Protovis, ..
Trifacta Wrangler
This free cloud service helps clean and prepare messy data quickly and accurately. As soon as you import datasets to Wrangler, it begins to organize and structure your data automatically, then suggest common transformations and aggregations.

Primer for Protecting Sensitive Data in Academic Research
"This primer helps individual researchers, managers of research data services, and organizational leaders understand how and why to integrate data protection activities into their practices through the collaborative endeavor of research data management."