Technical Resources

Data Sharing

How will the students acquire the data?

One of the most important aspects of a data science project is, of course, the data! It is essential to understand how the students will obtain the data and plan for potential obstacles (account set up, data requests, transfer to RCAC data depot, etc.). It’s best if the students can access the data early in the project, but as we mentioned below it’s not always best to dive right in on the first day.

From a company’s perspective it’s important to consider what it will take to gather the data well before the project starts. Consider a few of the questions below:

  • Will the data need to be approved by legal? (Almost always yes.)

  • Do we need to anonymize the data before it’s provided to the students?

  • Is there any work from our side that will require effort to gather the data from different sources?

    • Often if data is required from multiple sources the lead time is helpful to get everyone involved.

  • Do we know the correct contacts to pull the data?

    • Sometimes you may have to track the data down in a new system or from a different team. This can take some time as well.

What if I don’t have data right away?

If you don’t have data at the beginning of the semester don’t worry! Believe it or not teams that don’t dive into the data on day 1 often perform better in the long run.

We included a list of our suggestions for the first few weeks below:

  • Have the teams think through the business problem or project design.

  • Dive into the research topic and become familiar with the techniques using example data.

  • Interview business stakeholders to understand the problem and business impact.

  • Create a project plan and guidelines for the academic year.

What type of data do students like?

This one is easy. In our experience students like all data! Even more than data though students like the problem solving and collaboration.

The Data Mine students work with all types of data (geospatial, financial, imagery, and more), but the thing that they really enjoy is understanding the problem, solving it, and impacting the business in a positive way.

When defining your project, it’s helpful to give a high-level overview of the general topic areas that your project will cover. That way the students can easily identify the projects that are most applicable to their interests.

Is the data complex enough for the project?

A data project will not be intense enough if the data only contains a few observations and a handful of variables. Anytime you’re dealing with a small number of variables or samples data science can be challenging. However, students can often test to incorporate other types of data to see if it is beneficial to the core data set.

It’s important to think through the data that you have available, other data that the students could use, and the problem statement to see if it’s enough to fill an academic year. If it’s not and you’re still interested, you could always scope the project for a semester and start a new project the second semester.

If you have any concerns on need any help, please contact our team! We love brainstorming projects and are more than happy to think through possible solutions.

How do I upload data?

The first step in uploading data for a project is to setup your Purdue guest account. To set up an account please contact our team ([email protected]) and we’ll gather the information that is required.

Once your account is set up the most common tool to copy files is scp. This allows you to connect to Purdue’s HPC and copy files from your local computer.

If you have any issues, please let us know and we’d be happy to assist!

Can students incorporate other sources of data?

Yes! Students will often incorporate other sources of data in the projects. This can be open data sources, such as weather data from NOAA or soil data from the USGS. This can also take the form of web scraped data that the students gather using code from public websites.

What type of Purdue environments are available to students?

As we mention above, the primary environments that students in The Data Mine utilize are Purdue’s high performance computing clusters or HPCs. These allow students access to world class research systems and provide a single platform for collaboration.

In addition, Purdue students have access to student application licenses and other platforms, such as virtual machines, when the project requires it.

What’s the environment of Anvil like?

The HPC environments (like Anvil) allow us a lot of flexibility in how we build the environments for the students. From a student’s perspective we try to abstract a lot of the complicated build process and make it easy to use.

In their day-to-day Anvil looks like a website that the students navigate to in their computer. From there they choose the amount of resources and time required and launch a Jupyter Lab session. This gives them access to all the code and data that’s stored on the HPC for their specific group project.

If the project does require more technical work the students do have access to the underpinnings on the HPC and can dive in when required.

Can I learn more about Anvil’s technical specifications? Yes! The Purdue RCAC team has put together an Anvil information page that has information about the technical specifications as well as a helpful user guide.

Do students have to use the Purdue environments?

Not at all! We recognize that as many companies continue their analytics journey you may want the students to develop on your own internal analytics platform. Companies can send hardware (such as laptops) for students to use. This is an awesome experience for the students!

The Data Mine staff is also happy to help test the platform security and access prior to the students starting the project.

What are some tips to get a project started?

When a project is starting students are super excited and ready to dive right into the data. In our experience it’s often very beneficial to take some time at the start of the semester before diving in.

Think through the project design, have the students plan out their process, research the available data or the technical space, and do some team building to help with collaboration.

Also if you haven’t read through the Project Charter it’s an awesome resource.

What if I need a guest account at Purdue?

If you need a guest account at Purdue, please contact The Data Mine team ([email protected]). We’ll send you the instructions to create your account and get you added to our system.

Resources available to students

  1. Weekly seminar

  2. Most open-source tools

  3. Environment to deploy beta product

  4. Support from Data Science team

Will students have access to GPUs?

Yes! When required we can requisition GPUs for the students to utilize. This isn’t included in our standard environment though so be sure to let our team know if you’d like to make GPU resources available.

Do students have access to GitHub?

Also yes! With GitHub’s popularity as a tool, we encourage teams to utilize it. We can either host the repo in the secure DataMine GitHub or it can be hosted on a company’s GitHub.

Using GitHub helps the students collaborate, makes the code easier to handoff, and builds valuable real-world skills during the project.

What is the role of the Data Scientists within the project?

As with all The Data Mine staff, the data scientists are here to help. Due to the large number of different topics we cover and the number of student teams, our primary focus is technical guidance.

At the start of the year, we’ll meet with each team in lab to help get them off to a good start. After a few weeks we transition to a support role. This doesn’t mean that we stop interacting with the team. Our focus shifts to helping to empower the TAs, researching technical resources for student questions, developing new content for student learning, and assisting with technical support for the mentors. If the team’s need more one-on-one help at any point, we are happy to meet up in lab until the questions are resolved.

What if the company is working on learning a new topic?

We love it! As mentioned above, the awesome thing about The Data Mine is the number of different topics that the projects cover. Due to this we are never going to be an expert in everything that the students are researching.

Our goal is to leverage The Examples Book to provide a library of different links that we’ve found helpful. That way if we haven’t gotten around to something like a new NLP technique yet, we’ve provided all the links for students or mentors to research as well.

As with anything in the examples book, we also want feedback from you! If there’s a link that you’ve found helpful either send it our way or add it to the repository directly. Your input is crucial to both our support and the student teams.

Hardware/Software

What are some required software and hardware?

To work on the projects, the students need to know what they will be using to accomplish the tasks. We will discuss specifics tools you may have and what students have access to in The Data Mine.

Can we host code that students have developed?

Yes! As the students develop code, we are more than happy to share it with the sponsor company. However, it should be noted that we don’t have the resources to support hosted code within the Purdue systems.

The students can stand up examples here, but any long-term support would need to be internal to your company.

What if we want the students to use a different application, like Tableau?

Depending on the application we’ll work with you to see if we can make it available to the students. Many applications have student licenses that allow them to download the app and work with it throughout the year. We can also make resources, like Windows VMs, available to the teams to run the applications.

It is important to consider the use case. Many applications with student licenses have verbiage that prohibits the commercial use of the app. It’s always good to think through these use cases before the teams start their work.