Are you a data scientist looking for an informative read? Over the weekend, I ran a set of experiments with ChatGPT and challenged it to generate the solution to a data science problem automatically. In this post, I walk through how I created the prompts to get the desired outcome and how accurate the generated solutions were. Come, let’s find out how to use ChatGPT prompts as a data scientist.
I will run through two experiments. In the first experiment, I want to see whether ChatGPT can help me with the code for building a machine learning model on a specific dataset. I will also run that code in a Jupyter notebook to check whether it works. In the second experiment, we will take the learnings from experiment 1 and redesign the prompts to get the desired outcomes. Broadly, we will evaluate the following points:
Let’s start the first experiment now.
I will consider the Black Friday Sales dataset. You can download the dataset from here. The dataset contains the customer transactions of a retail store containing customer demographics, product details, and total purchase amount. The company wants to understand customer purchase behavior for personalization. So, the ask is to build a machine learning model to predict the purchase amount based on the customer demographics and past products purchased.
In the first prompt, I am going to tell ChatGPT about the dataset and what it is about.
You are provided with the dataset of the retail store containing customer transactions. Each row contains customer demographics, product details, and the total purchase amount from last month. The sample dataset is given below.
ChatGPT now responds by requesting the dataset. In the next prompt, I will provide a sample of the Black Friday sales dataset.
Note: You can neither upload the datasets directly to ChatGPT nor copy-paste the entire dataset.
So, we will copy and paste around 100-150 rows from the dataset.
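Rather than selecting rows by hand, the sample can be cut programmatically. The snippet below is a minimal sketch that uses only the standard library; the helper name `sample_rows_as_csv` and the file name in the comment are my own assumptions, and the in-memory data is a shortened stand-in for the real file.

```python
import csv
import io
from itertools import islice


def sample_rows_as_csv(csv_file, n_rows=100):
    """Return the header plus the first n_rows data rows as CSV text."""
    reader = csv.reader(csv_file)
    header = next(reader)
    rows = list(islice(reader, n_rows))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Tiny in-memory stand-in for the real file. In practice you would pass
# open("train.csv", newline="") instead (the file name is an assumption).
demo = io.StringIO(
    "User_ID,Product_ID,Gender,Purchase\n"
    "1005915,P00372445,M,371\n"
    "1005916,P00370853,M,24\n"
    "1005918,P00370853,M,12\n"
)
snippet = sample_rows_as_csv(demo, n_rows=2)
print(snippet)  # paste this text into the prompt
```

With the real dataset, setting `n_rows` to somewhere between 100 and 150 gives a sample of the size used in this post.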
User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
1005915,P00372445,M,18-25,4,C,0,0,20,,,371
1005916,P00370853,M,51-55,20,B,1,1,19,,,24
1005918,P00370853,M,26-35,12,A,3,1,19,,,12
1005919,P00370853,M,18-25,0,C,0,0,19,,,48
1005920,P00375436,F,26-35,1,C,2,0,20,,,244
1005922,P00370853,M,55+,3,C,3,0,19,,,12
1005923,P00371644,M,26-35,7,C,1,1,20,,,129
1005924,P00370293,M,36-45,0,B,0,1,19,,,49
1005925,P00371644,F,26-35,0,C,1,1,20,,,592
1005927,P00372445,M,36-45,14,B,4+,1,20,,,358
1005929,P00370853,F,36-45,0,C,2,0,19,,,50
1005931,P00372445,F,18-25,7,A,3,0,20,,,129
1005932,P00371644,M,18-25,14,C,3,0,20,,,131
1005933,P00375436,M,26-35,2,C,3,1,20,,,364
Now, let’s ask ChatGPT to write code for building a model to predict the target variable “Purchase”.
I want you to act as a data scientist and write code for me. Please build a machine learning model to predict the Purchase variable from the above dataset.
As you can see, ChatGPT provided us with the code for building the machine learning model. We will run the code in a Jupyter notebook and see whether it works.
The above code throws an error.
ChatGPT missed a couple of data preprocessing steps:
Now, in the next prompt, let me ask ChatGPT to update the data preprocessing steps in the code without explicitly mentioning the kind of steps to perform. Let’s find out if it can do it.
The above code is incomplete. Update the above code with the necessary data preprocessing steps depending on the provided dataset.
The above code throws an error.
As expected, it included the code for missing value imputation and for handling categorical variables. However, it missed encoding the product ID and user ID columns.
Let’s ask ChatGPT to encode the product ID and user ID columns in the next prompt.
The above code gives an error. You missed encoding the user id and product id columns.
The above code throws an error. It encoded the product ID and user ID into new columns but didn’t drop the original columns. As you can see, this is glitchy output from ChatGPT.
Let’s prompt ChatGPT to revise the code.
You are wrong. The above code still throws an error.
ChatGPT responds by asking for the error. Let’s copy and paste the error raised when running the code. This will be our next prompt.
ValueError: could not convert string to float: 'P00233842'.
Is anything wrong with the code? Now you can see that ChatGPT missed encoding the rest of the categorical columns. This is glitchy, flawed output: it had encoded those columns in an earlier response, so I expected it to keep that step. While fixing the encoding of the product ID and user ID, it dropped the encoding of the other columns.
Now, let’s ask ChatGPT to encode the rest of the categorical variables.
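The ValueError comes from a raw string value reaching a step that expects numbers. The snippet below is a minimal sketch of the failure and of one simple remedy, mapping each distinct string to an integer code; the helper `ordinal_encode` is a hypothetical name of mine, not something from ChatGPT's code.

```python
# A numeric model needs every feature value to be convertible to float,
# which fails for a raw product ID.
try:
    float("P00233842")
except ValueError as err:
    message = str(err)
    print(message)  # could not convert string to float: 'P00233842'


def ordinal_encode(values):
    """Map each distinct string to an integer code, in order of first appearance."""
    codes = {}
    encoded = [codes.setdefault(value, len(codes)) for value in values]
    return encoded, codes


product_ids = ["P00372445", "P00370853", "P00370853", "P00375436"]
encoded, codes = ordinal_encode(product_ids)
print(encoded)  # [0, 1, 1, 2]
```

The same reasoning applies to every string-valued column in the dataset (Gender, Age, City_Category, and so on), which is why leaving any one of them unencoded reproduces the error.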
You missed encoding the rest of the categorical columns. Update the code.
This time, it provided all the required data preprocessing steps. Let’s run it in the notebook. It still throws an error. Let’s ask ChatGPT to fix it. Hopefully this is our last prompt.
Update the code. The code throws TypeError: Feature names are only supported if all input features have string names, but your input has ['int', 'str'] as feature name / column name types
Finally, we have error-free code.
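The TypeError above typically appears when a DataFrame mixes integer and string column names, for example after encoded columns are concatenated back onto the original features. I don't know exactly how ChatGPT's code ended up in that state, so the snippet below is only a minimal sketch with made-up values that shows one way to produce mixed names and the usual fix of casting all column names to strings.

```python
import pandas as pd

# A feature that kept its string name.
numeric = pd.DataFrame({"Occupation": [4, 20, 12]})

# Encoded columns built from a bare array get integer names 0, 1.
encoded = pd.DataFrame([[1, 0], [0, 1], [0, 1]])

features = pd.concat([numeric, encoded], axis=1)
print([type(name).__name__ for name in features.columns])  # mixed name types

# Cast every column name to str so the model sees uniform feature names.
features.columns = features.columns.astype(str)
print(list(features.columns))
```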
A couple of learnings from the first experiment:
Now, we will start experiment 2 with our learnings.
You are provided with the dataset of the retail store containing customer transactions. Each row contains customer demographics, product details, and the total purchase amount from last month. The sample dataset is given below.
User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
1005915,P00372445,M,18-25,4,C,0,0,20,,,371
1005916,P00370853,M,51-55,20,B,1,1,19,,,24
1005918,P00370853,M,26-35,12,A,3,1,19,,,12
1005919,P00370853,M,18-25,0,C,0,0,19,,,48
1005920,P00375436,F,26-35,1,C,2,0,20,,,244
1005922,P00370853,M,55+,3,C,3,0,19,,,12
1005923,P00371644,M,26-35,7,C,1,1,20,,,129
1005924,P00370293,M,36-45,0,B,0,1,19,,,49
1005925,P00371644,F,26-35,0,C,1,1,20,,,592
1005927,P00372445,M,36-45,14,B,4+,1,20,,,358
1005929,P00370853,F,36-45,0,C,2,0,19,,,50
1005931,P00372445,F,18-25,7,A,3,0,20,,,129
1005932,P00371644,M,18-25,14,C,3,0,20,,,131
1005933,P00375436,M,26-35,2,C,3,1,20,,,364
I want you to act as a data scientist and write code for me. Please build a machine learning model to predict the Purchase variable from the above dataset. Include data preprocessing steps like dropping unnecessary ID columns, encoding categorical variables, handling missing values, and so on.
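For reference, here is a minimal sketch of the kind of code this prompt is asking for. It is not ChatGPT's actual output. The choice of one-hot encoding, filling the missing product categories with 0, and a random forest with 50 trees are my assumptions for illustration, and only ten of the sample rows are used, so the predictions themselves are not meaningful.

```python
import io

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Ten rows taken from the sample shown in the prompt.
SAMPLE = """User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
1005915,P00372445,M,18-25,4,C,0,0,20,,,371
1005916,P00370853,M,51-55,20,B,1,1,19,,,24
1005918,P00370853,M,26-35,12,A,3,1,19,,,12
1005919,P00370853,M,18-25,0,C,0,0,19,,,48
1005920,P00375436,F,26-35,1,C,2,0,20,,,244
1005922,P00370853,M,55+,3,C,3,0,19,,,12
1005923,P00371644,M,26-35,7,C,1,1,20,,,129
1005924,P00370293,M,36-45,0,B,0,1,19,,,49
1005925,P00371644,F,26-35,0,C,1,1,20,,,592
1005927,P00372445,M,36-45,14,B,4+,1,20,,,358
"""

df = pd.read_csv(io.StringIO(SAMPLE))

# Drop the ID columns and separate the target.
X = df.drop(columns=["User_ID", "Product_ID", "Purchase"])
y = df["Purchase"]

# Handle missing values: treat an absent secondary category as 0 (assumption).
X = X.fillna({"Product_Category_2": 0, "Product_Category_3": 0})

# One-hot encode the string-valued columns; pass the numeric ones through.
categorical = ["Gender", "Age", "City_Category", "Stay_In_Current_City_Years"]
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)

model = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ]
)
model.fit(X, y)
predictions = model.predict(X)
print(predictions[:3])
```

Keeping the preprocessing inside a `Pipeline` means the same encoding is applied at training and prediction time, which avoids the kind of half-applied steps seen in experiment 1.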
Update the code that includes model evaluation.
Another glitchy response from ChatGPT! It generated code for a classification problem even though the dataset poses a regression problem.
The above code is incorrect. The given dataset is a regression problem.
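Because Purchase is a continuous amount, the evaluation should use regression metrics such as RMSE and R² rather than accuracy. The snippet below is a minimal standard-library sketch of both. The actual values are Purchase amounts from the sample rows, while the predicted values are made up purely for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: typical size of the prediction error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


actual = [371, 24, 12, 48]     # Purchase values from the sample rows
predicted = [350, 30, 20, 40]  # hypothetical model output

print(round(rmse(actual, predicted), 2))
print(round(r_squared(actual, predicted), 4))
```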
Update the code that includes feature engineering. Keep the rest of the steps the same.
Write a code to tune the hyperparameters of the random forest. Use the smartest hyper-tuning technique to achieve the best results in less time.
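One common way to tune a random forest in less time than an exhaustive grid search is a randomized search, which tries a fixed number of parameter combinations. The snippet below is a minimal sketch using scikit-learn's `RandomizedSearchCV`. It is not ChatGPT's output, the data is synthetic (generated with `make_regression`), and the parameter ranges are assumptions chosen only to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the preprocessed features.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Candidate values to sample from (illustrative ranges, not recommendations).
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,  # try 5 random combinations instead of all 27
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best combination
```

Raising `n_iter` trades more compute for a better chance of finding a good combination.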
Write a code to visualize the most important features.
I would like to explain the model results. Please write a code to interpret the model results.
Please write a code to interpret the model results using lime.
Incredible! Very little hand-written programming was required. Coding just got a whole lot easier with ChatGPT.
In conclusion, ChatGPT emerges as a valuable tool for data scientists and programmers, automating coding tasks specific to datasets. Despite occasional glitches, ChatGPT can self-correct and learn from errors. Crafting precise prompts is essential for optimal outcomes in data analytics. This collaborative approach enhances efficiency in data science jobs. As GPT-4 advances, it promises further refinement, solidifying ChatGPT’s role as a valuable asset in the dynamic landscape of data science.
Finally, we understood the importance of the right prompts for getting the desired outcomes from ChatGPT as a data scientist. We have also seen some of the most useful data science prompts.
A. No, ChatGPT is not designed for data analysis. It is more suitable for natural language processing tasks and generating human-like text.
A. While ChatGPT can provide information, learning Python fast is best achieved through hands-on practice, tutorials, and interactive coding exercises.
A. Use pandas for correlation analysis, and Matplotlib (or Seaborn) to create a heatmap. Example code, assuming Seaborn is imported as sns: correlation_matrix = df.corr(); sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm").
A. Use ChatGPT for generating text, ideas, or explanations. Verify information, and complement it with specialized data science tools for analysis.
A. ChatGPT doesn’t directly integrate with data visualization tools. It’s more suitable for generating textual content. Visualization tools like Tableau or Matplotlib are separate entities.
A. Yes, you can create a chatbot project using ChatGPT with a focus on prompt engineering for refining interactions. A common use case involves employing SQL queries for efficient data retrieval and Excel for initial data organization. Additionally, applying data cleaning and exploratory data analysis techniques can enhance input quality. Implementing generative AI and deep learning algorithms ensures contextually relevant responses. AI tools can aid in debugging and optimization processes. Python code is utilized for programming, and metrics are employed for evaluation. This project integrates OpenAI’s ChatGPT into a user-friendly chatbot, incorporating artificial intelligence and algorithms for an engaging experience.